Factors, Dates, and Strings of Text
forcats for categorical data manipulation
lubridate for temporal data analysis
stringr for text manipulation
My two cents
Working with Ordinal Categorical Data
Imagine you record the month in which some observation took place
Using a character string to record this has two problems:
months <- c("Dec", "Apr", "Jan", "Mar") # Example observations (same values as the printed output)
# Specify the levels (all possible values, and their order)
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
months_factor <- factor(x = months, levels = month_levels) # Build my factor
months_factor
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(months_factor) # Sorting follows the order of the levels, not the alphabet
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
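To see why a plain character vector is awkward for months, here is a minimal base R sketch. The values are made up, and "Jam" is a deliberate typo; month.abb is base R's built-in vector of abbreviated month names.

```r
# Hypothetical observations; "Jam" is a deliberate typo
months_chr <- c("Dec", "Apr", "Jam", "Mar")

# month.abb is the built-in vector "Jan", "Feb", ..., "Dec"
months_fct <- factor(x = months_chr, levels = month.abb)

sorted_chr <- sort(months_chr) # Alphabetical order; the typo goes unnoticed
sorted_fct <- sort(months_fct) # Calendar order; the typo became NA and sort() drops it

print(sorted_chr)
print(sorted_fct)
print(sum(is.na(months_fct))) # How many values matched no level
```

Note that factor() converts the unknown value to NA without any warning, so counting the NAs after building a factor is a cheap sanity check.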
Instead of factor() we use forcats::fct()
{forcats} Package
{forcats}: A suite of tools that solve common problems with factors
Key forcats functions:
fct_reorder() - Reorder factor levels based on data
fct_relevel() - Reorder factor levels by hand
fct_lump_*() - Group small categories
# You will need to reinstall the package: remotes::install_github("jcvdav/EVR628tools")
library(EVR628tools)
# Load the geartypes data
data("data_geartypes")
data_geartypes
# A tibble: 840 × 3
vessel_id geartype effort_hours
<chr> <chr> <dbl>
1 00319684b-b03f-3b96-7560-0750e4b828fa TRAWLERS 2.22
2 00618559b-b68c-f85c-df65-112808b97e68 OTHER_PURSE_SEINES 577.
3 0091ceee9-9421-e3bc-9c5a-6d854975545c TUNA_PURSE_SEINES 17.0
4 00e3bfcdd-de86-c933-dbd1-a6c354a40f2c TRAWLERS 3226.
5 00e410e76-6d15-b0ad-4b4e-3b086cb9eb81 TRAWLERS 1484.
6 0108b3937-772f-d55b-aeb7-1c6113ac1722 TRAWLERS 474.
7 01391f16b-b01b-3527-c87e-c252b6054037 TRAWLERS 503.
8 0144ae898-893a-5fca-3029-d6f4d9e1c6cf TRAWLERS 286.
9 01654f527-7d77-76e3-e085-21c60790c557 TRAWLERS 233.
10 01dbe4ace-ee89-a70e-3136-11b42bbebeb7 POLE_AND_LINE 3.31
# ℹ 830 more rows
# Example: Number of vessels by gear type
gear_summary <- data_geartypes |>
  group_by(geartype) |>
  summarize(n_vessels = n_distinct(vessel_id)) |>
  arrange(desc(n_vessels))
gear_summary
# A tibble: 11 × 2
geartype n_vessels
<chr> <int>
1 TRAWLERS 570
2 POLE_AND_LINE 100
3 FISHING 97
4 OTHER_PURSE_SEINES 33
5 DRIFTING_LONGLINES 12
6 SET_GILLNETS 9
7 TUNA_PURSE_SEINES 7
8 SET_LONGLINES 6
9 DREDGE_FISHING 3
10 PURSE_SEINES 2
11 TROLLERS 1
When our data are already assembled, we can modify the order in which factor levels appear
This doesn’t modify the values, just the order in which they are interpreted
Use fct_reorder()
Look at the documentation for two crucial arguments:
.f: What is your soon-to-be factor?
.x: What is the variable by which you want to order your factor?
gear_summary <- data_geartypes |>
  group_by(geartype) |>
  summarize(n_vessels = n_distinct(vessel_id)) |>
  arrange(desc(n_vessels)) |>
  mutate(geartype = fct_reorder(.f = geartype, .x = n_vessels))
gear_summary
# A tibble: 11 × 2
geartype n_vessels
<fct> <int>
1 TRAWLERS 570
2 POLE_AND_LINE 100
3 FISHING 97
4 OTHER_PURSE_SEINES 33
5 DRIFTING_LONGLINES 12
6 SET_GILLNETS 9
7 TUNA_PURSE_SEINES 7
8 SET_LONGLINES 6
9 DREDGE_FISHING 3
10 PURSE_SEINES 2
11 TROLLERS 1
My ggplot code now produces the expected figure
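The printed tibble looks the same before and after fct_reorder() because the function changes the order of the levels, not the rows. A minimal sketch, using three gear types and their counts borrowed from the table above, makes the change visible through levels():

```r
library(forcats)

# Three gear types and their vessel counts, copied from the summary table
gear <- c("TRAWLERS", "TROLLERS", "POLE_AND_LINE")
n_vessels <- c(570, 1, 100)

gear_default <- factor(gear)                           # Levels in alphabetical order
gear_ordered <- fct_reorder(.f = gear, .x = n_vessels) # Levels in increasing order of n_vessels

print(levels(gear_default))
print(levels(gear_ordered))
print(as.character(gear_ordered)) # The values themselves are untouched
```

ggplot2 draws discrete axes in level order, which is presumably why the figure changes even though the table does not.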
We can keep the most common gears and lump all other gears into a new category of “others”
Use the fct_lump_n() function.
It does modify values
gear_summary <- data_geartypes |>
  mutate(geartype = fct_lump_n(f = geartype, n = 3)) |> # Keep top 3, lump the rest
  group_by(geartype) |>
  summarize(n_vessels = n_distinct(vessel_id)) |>
  arrange(desc(n_vessels)) |>
  mutate(geartype = fct_reorder(.f = geartype, .x = n_vessels)) # Then reorder based on new n by groups
gear_summary
# A tibble: 4 × 2
geartype n_vessels
<fct> <int>
1 TRAWLERS 570
2 POLE_AND_LINE 100
3 FISHING 97
4 Other 73
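Unlike reordering, lumping rewrites the values themselves. A small sketch on a made-up vector (one entry per vessel, not the course data) shows the rare levels being replaced by "Other", which is the default label used by fct_lump_n():

```r
library(forcats)

# Made-up data: one entry per vessel
gear <- c(rep("TRAWLERS", 5), rep("POLE_AND_LINE", 3), rep("FISHING", 2),
          "TROLLERS", "DREDGE_FISHING")

# Keep the 2 most frequent levels and lump everything else together
gear_lumped <- fct_lump_n(f = gear, n = 2)

counts <- table(gear_lumped)
print(counts)
```

Because the original labels are gone after lumping, it is usually safer to lump in a new column or late in the pipeline.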
Use fct_relevel() to reorder factor levels by hand
gear_summary <- data_geartypes |>
  mutate(geartype = fct_lump_n(f = geartype, n = 3)) |> # Keep top 3, lump the rest
  group_by(geartype) |>
  summarize(n_vessels = n_distinct(vessel_id)) |>
  arrange(desc(n_vessels)) |>
  mutate(geartype = fct_reorder(.f = geartype, .x = n_vessels), # Then reorder based on new n by groups
         geartype = fct_relevel(.f = geartype, c("FISHING", "POLE_AND_LINE", "TRAWLERS", "Other")))
gear_summary
# A tibble: 4 × 2
geartype n_vessels
<fct> <int>
1 TRAWLERS 570
2 POLE_AND_LINE 100
3 FISHING 97
4 Other 73
Did anything change?
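One way to answer the question is to look at the levels instead of the rows. In this sketch (the four labels are taken from the table above) the values stay exactly as they were, and only the level order changes:

```r
library(forcats)

gear <- factor(c("TRAWLERS", "POLE_AND_LINE", "FISHING", "Other"))
print(levels(gear)) # Default level order, sorted alphabetically

# Put the levels in a hand-picked order
gear_manual <- fct_relevel(.f = gear, "FISHING", "POLE_AND_LINE", "TRAWLERS", "Other")

print(levels(gear_manual))       # New level order
print(as.character(gear_manual)) # Same values, same positions
```

The row order of a tibble is controlled by arrange(), so a releveled factor prints identically until something that relies on level order (a plot, a model, sort()) uses it.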
Use fct_collapse() to manually specify which values should be collapsed into new levels.
gear_summary <- data_geartypes |>
  mutate(geartype = fct_collapse(geartype,
                                 "BOTTOM" = c("DREDGE_FISHING", "SET_GILLNETS",
                                              "SET_LONGLINES", "TRAWLERS"),
                                 "SURFACE" = c("DRIFTING_LONGLINES", "OTHER_PURSE_SEINES",
                                               "POLE_AND_LINE", "PURSE_SEINES",
                                               "TROLLERS", "TUNA_PURSE_SEINES"))) |>
  group_by(geartype) |>
  summarize(n_vessels = n_distinct(vessel_id))
gear_summary # Unspecified factor levels are left unmodified
# A tibble: 3 × 2
geartype n_vessels
<fct> <int>
1 BOTTOM 588
2 SURFACE 155
3 FISHING 97
fct_* functions
Working with Temporal Data
In R, there are three types of date/time data that point at an instant in time:
<date>: a date
<time>: a time within a day
<dttm>: a date plus a time (date-time)
Note
POSIXct and POSIXlt
POSIX stands for “Portable Operating System Interface”, a Unix standard
ct stands for “calendar time” (seconds elapsed since Jan 1, 1970)
lt stands for “local time” (stores the human readable components)
Problems:

{lubridate}: An R package that makes it easier to work with dates and times
Key lubridate functions:
There are four approaches:
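While reading a file with readr
From a string
From individual components
From an existing date/time object (as_date() or as_datetime())
If your data use the ISO 8601 format, readr will automatically recognize it:
Date components separated by -
Time components separated by : (24 hr format, no am / pm)
Date and time separated by a space or T
If your csv file looks like this: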
Then you can simply read it in:
# A tibble: 2 × 4
class_date time_of_day combined event
<date> <time> <chr> <chr>
1 2025-10-21 09:00 2025-10-21 09:00:00 class_starts
2 2025-10-21 10:15 2025-10-21 10:15:00 class_ends
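If readr does not recognize the format, use the col_types argument of the read_*() functions, together with col_date() or col_datetime()
When the date components are ordered m, d, y, I can use the mdy() function
Let’s say your data look like this: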
So when you read them in, they look like this:
# A tibble: 10 × 2
date whales_observed
<chr> <dbl>
1 Jan-15-2024 12
2 Jan-22-2024 8
3 Feb-5-2024 7
4 Feb-18-2024 8
5 Mar-3-2024 7
6 Mar-20-2024 12
7 Apr-2-2024 14
8 Apr-25-2024 8
9 May-8-2024 6
10 May-30-2024 6
Notice that date is of class <chr>
Can I directly build a figure with date on the x-axis and # whales on the y-axis?
Nope…
Steps:
Use a lubridate function to convert the string to a date
Many other helper functions
[1] "2025-10-21"
[1] "2025-10-21"
[1] "2025-10-21"
[1] "2025-10-21 10:00:00 UTC"
[1] "2025-10-21 12:00:00 UTC"
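The helper functions are named after the order of the components in the string. A minimal sketch follows; the input strings are assumptions chosen for illustration (the original calls are not shown), and parsing month names assumes an English locale.

```r
library(lubridate)

# Dates: the function name spells out the order of year, month, and day
d1 <- ymd("2025-10-21")
d2 <- mdy("October 21, 2025")
d3 <- dmy("21-Oct-2025")

# Date-times: add _h, _hm, or _hms for the time components (UTC by default)
dt1 <- ymd_hms("2025-10-21 10:00:00")
dt2 <- mdy_hm("10/21/2025 12:00")

# The whale-count strings are ordered month-day-year, so mdy() applies
whale_date <- mdy("Jan-15-2024")

print(c(d1, d2, d3))
print(dt1)
print(dt2)
print(whale_date)
```

Inside a pipeline the same call would sit in mutate(), for example mutate(date = mdy(date)).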
Sometimes you will not have a date column, and your data might look like this:
Use the make_date() or make_datetime() functions
They need numeric values (e.g. “Oct” won’t work, but 10 will)
whale_counts_dates <- whale_counts |>
  mutate(date = make_date(year, month, day)) |>
  select(date, whales_observed)
whale_counts_dates
# A tibble: 10 × 2
date whales_observed
<date> <dbl>
1 2025-01-15 12
2 2025-01-22 8
3 2025-02-05 7
4 2025-02-18 8
5 2025-03-03 7
6 2025-03-20 12
7 2025-04-02 14
8 2025-04-25 8
9 2025-05-08 6
10 2025-05-30 6
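As a standalone sketch of the same idea, with made-up component vectors: make_date() and make_datetime() are vectorized, and a month abbreviation can be turned into a number by matching it against base R's month.abb.

```r
library(lubridate)

# Hypothetical components stored as separate numeric vectors
year  <- c(2025, 2025)
month <- c(1, 5)
day   <- c(15, 30)

dates <- make_date(year = year, month = month, day = day)

# make_datetime() also accepts hour, min, and sec (time zone defaults to UTC)
stamp <- make_datetime(year = 2025, month = 10, day = 21, hour = 9, min = 0)

# A month stored as text has to become a number first
month_number <- match("Oct", month.abb)

print(dates)
print(stamp)
print(month_number)
```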
# A tibble: 2 × 2
date whales_observed
<date> <dbl>
1 2025-01-15 12
2 2025-01-22 8
Use month() to extract the month of a date
Other useful functions
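A short sketch of month() next to some of its sibling accessors, applied to a single example date. The date is an assumption for illustration, and the labelled versions depend on the locale.

```r
library(lubridate)

d <- ymd("2025-10-21")

yr  <- year(d)   # 2025
mth <- month(d)  # 10
dy  <- day(d)    # 21
doy <- yday(d)   # Day of the year
dow <- wday(d, week_start = 1) # Day of the week, counting Monday as 1

print(c(yr, mth, dy, doy, dow))
print(month(d, label = TRUE)) # Month as an ordered factor with text labels
print(wday(d, label = TRUE))  # Weekday as an ordered factor with text labels
```

These accessors work well inside filter(), for example filter(month(date) == 1) to keep only January observations.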
If you only care about physical time, use a duration; if you need to add human times, use a period; if you need to figure out how long a span is in human units, use an interval.
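The difference between the three is easiest to see across a leap year. In this sketch (the start date is chosen for illustration), a duration counts seconds, a period moves along the calendar, and an interval measures a span between two fixed instants:

```r
library(lubridate)

start <- ymd("2024-01-01") # 2024 is a leap year

# Duration: a fixed number of seconds (365 days' worth)
plus_duration <- start + ddays(365)

# Period: "one calendar year later", however many days that takes
plus_period <- start + years(1)

# Interval: a span anchored at two dates, measured here in days
span <- interval(start, plus_period)
n_days <- span / days(1)

print(plus_duration)
print(plus_period)
print(n_days)
```

The duration falls one day short of the next January 1st because the leap year has 366 days, while the period lands on it exactly.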
Subtracting two dates will give you a difftime object:
[1] "2025-11-01 02:01:01 EDT"
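For the date subtraction mentioned above, a minimal sketch using the first and last survey dates from the whale-count table. The variable names are made up for illustration.

```r
library(lubridate)

first_survey <- ymd("2025-01-15")
last_survey  <- ymd("2025-05-30")

elapsed <- last_survey - first_survey # Base R returns a difftime object

print(class(elapsed))
print(as.numeric(elapsed, units = "days")) # Plain number of days
print(as.duration(elapsed))                # lubridate duration, always in seconds
```

A difftime picks its own units (seconds, minutes, hours, days), so converting to a number with explicit units or to a duration avoids surprises in later arithmetic.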
Working with Characters

{stringr}: a cohesive set of functions designed to make working with strings as easy as possible
Key functions allow you to:
# A tibble: 4 × 3
trip_id passengers notes
<dbl> <dbl> <chr>
1 1 25 Dolphins; whales
2 2 32 Whale; Sea lions
3 3 30 Sea lions; sea turtles
4 4 30 Sea lion; sea turtles
tidyr offers four useful functions:
separate_longer_delim() and separate_wider_delim()
separate_longer_position() and separate_wider_position()
separate_longer_delim() allows me to separate a column and make the data longer based on a delimiter:
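A sketch of the two delimiter-based functions, rebuilding the sightings table shown above by hand with tribble(). The column names species_1 and species_2 are made up for illustration.

```r
library(tibble)
library(tidyr)

# The sightings table, typed in by hand
sightings <- tribble(
  ~trip_id, ~passengers, ~notes,
  1, 25, "Dolphins; whales",
  2, 32, "Whale; Sea lions",
  3, 30, "Sea lions; sea turtles",
  4, 30, "Sea lion; sea turtles"
)

# Longer: one row per species, other columns are repeated
sightings_long <- sightings |>
  separate_longer_delim(cols = notes, delim = "; ")

# Wider: one new column per piece, so the new names must be supplied
sightings_wide <- sightings |>
  separate_wider_delim(cols = notes, delim = "; ",
                       names = c("species_1", "species_2"))

print(sightings_long)
print(sightings_wide)
```

The delimiter includes the space after the semicolon; with delim = ";" every second value would start with a blank.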
Anything weird?
The stringr package provides functions to modify, detect, and extract parts of strings
Let’s convert all letters to lowercase
sightings_tidy <- sightings |>
  separate_longer_delim(cols = notes, delim = "; ") |>
  rename(species = notes) |>
  mutate(species = str_to_lower(string = species))
sightings_tidy
# A tibble: 8 × 3
trip_id passengers species
<dbl> <dbl> <chr>
1 1 25 dolphins
2 1 25 whales
3 2 32 whale
4 2 32 sea lions
5 3 30 sea lions
6 3 30 sea turtles
7 4 30 sea lion
8 4 30 sea turtles
There are also str_to_upper(), str_to_title(), and str_to_sentence()
Anything weird with these data?
Can I just remove the “s” to make everything singular?
The str_remove() function might help
sightings_tidy <- sightings |>
  separate_longer_delim(cols = notes, delim = "; ") |>
  rename(species = notes) |>
  mutate(species = str_to_lower(string = species),
         species = str_remove(string = species, pattern = "s"))
sightings_tidy
# A tibble: 8 × 3
trip_id passengers species
<dbl> <dbl> <chr>
1 1 25 dolphin
2 1 25 whale
3 2 32 whale
4 2 32 ea lions
5 3 30 ea lions
6 3 30 ea turtles
7 4 30 ea lion
8 4 30 ea turtles
Oh no!
str_remove() will remove the first instance of the matched pattern
I only want to remove the “s” at the end of a word to make everything singular
This calls for regular expressions (regex)
sightings_tidy <- sightings |>
  separate_longer_delim(cols = notes, delim = "; ") |>
  rename(species = notes) |>
  mutate(species = str_to_lower(string = species),
         species = str_remove(string = species, pattern = "s$")) # "$" anchors the match to the end of the string
sightings_tidy
# A tibble: 8 × 3
trip_id passengers species
<dbl> <dbl> <chr>
1 1 25 dolphin
2 1 25 whale
3 2 32 whale
4 2 32 sea lion
5 3 30 sea lion
6 3 30 sea turtle
7 4 30 sea lion
8 4 30 sea turtle
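The three variants can be compared side by side on a small vector. This is a sketch; "octopus" is added as a made-up edge case to show the limits of the approach.

```r
library(stringr)

species <- c("sea lions", "whales", "sea turtles")

first_s  <- str_remove(string = species, pattern = "s")     # First "s" only
every_s  <- str_remove_all(string = species, pattern = "s") # Every "s"
trailing <- str_remove(string = species, pattern = "s$")    # Only an "s" at the end of the string

print(first_s)  # "ea lions" "whale" "ea turtles"
print(every_s)  # "ea lion" "whale" "ea turtle"
print(trailing) # "sea lion" "whale" "sea turtle"

# A trailing "s" is not always a plural
edge_case <- str_remove(string = "octopus", pattern = "s$")
print(edge_case) # "octopu"
```

The anchored pattern works for these data, but it is a blunt rule: it is worth checking the distinct values before and after applying it.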
stringr is incredibly useful
Some functions:
str_replace()
str_extract()
str_split()
str_sub()
Inside filter():
str_detect()
str_length()
Inside summarize():
str_flatten()
forcats for categorical data manipulation
lubridate for temporal data analysis
stringr for text manipulation