Data Management and Transformation
By the end of this week, you should be able to:
*.xlsx
, *.csv
and *.rds
){dplyr}
filter()
select()
, rename()
, and mutate()
group_by()
and summarize()
Important concepts
.Rproj
file.Two types of paths:
/Users/jcvd/GitHub/EVR_628/data/raw/dive_profile.csv
data/raw/dive_profile.csv
Think of a file path as an address.
USA/FL/Miami/Rickenbacker Causeway/Rosenstiel School/SLAB/Room 120
SLAB/Room 120
Relative paths are best because they work across computers
Important
Best Practice: Use RStudio projects instead of setwd()
Why projects are better:
Focus on reading plain-text rectangular files
CSV Files
.csv
formatExcel Files
.xlsx
formatRDS Files
CSV files can be human readable:
AnoYear,BanderaFlag,ArteGear,EspeciesSpecies,t
1918,UNK,LP,SKJ,1361
1918,UNK,LP,YFT,0
1919,UNK,LP,SKJ,3130
1919,UNK,LP,YFT,136
1920,UNK,LP,SKJ,3583
1920,UNK,LP,YFT,907
1921,UNK,LP,SKJ,499
1921,UNK,LP,YFT,590
1922,UNK,LP,SKJ,5398
,
” indicate the separation between columnsReading excel files is also straighforward, but requires additional packages
csv
(or similar)Reading RDS files is just as easy:
A Recommended Workflow
xls
or csv
.rds
We used read_*
to read data…
And yes, we use write_*
to write data
{dplyr}
(part of the tidyverse, no need to install it){janitor}
{dplyr}
The dplyr
package, like ggplot2
relies on data being in this format:
{dplyr}
Package{dplyr}
: A grammar of data manipulation
Core functions:
filter()
: Pick observations by their valuesselect()
: Pick variables by their namesmutate()
: Create new variables with functions of existing variablesgroup_by()
: Change the scope of each function from operating on the entire dataset to operating on it group-by-groupsummarize()
: Collapse many values down to a single summarydplyr
functionsFirst argument is always a data.frame or tibble
The arguments that follow specify which columns we are working on
dplyr
functions always return a new data.frame
This class(input) == class(output)
condition means we can easily build pipelines
dplyr
functions are also referred to as dplyr
verbs, because they all imply an action
dplyr
verbs are organized into four groups based on what they operate on: rows, columns, or groups
filter()
data_MPA
datafilter()
filter()
: Keep rows that match your conditions
Example: Keep data from protected sites
# A tibble: 50 × 5
time id protected after biomass
<int> <chr> <dbl> <dbl> <dbl>
1 -5 a 1 0 9.37
2 -5 b 1 0 10.2
3 -5 c 1 0 9.16
4 -5 d 1 0 11.6
5 -5 e 1 0 10.3
6 -4 a 1 0 11.5
7 -4 b 1 0 10.4
8 -4 c 1 0 9.38
9 -4 d 1 0 7.79
10 -4 e 1 0 11.1
# ℹ 40 more rows
filter()
Multiple conditions require using logical operators
Example: All data from sites “a” and “b”, after the MPA was established
data_MPA |>
filter((id == "a" | id == "b"), # Keep data where id is equal to a or b AND
after == 1) # where after is equals to 1
# A tibble: 10 × 5
time id protected after biomass
<int> <chr> <dbl> <dbl> <dbl>
1 0 a 1 1 12.4
2 0 b 1 1 11.4
3 1 a 1 1 14.4
4 1 b 1 1 12.0
5 2 a 1 1 12.5
6 2 b 1 1 11.3
7 3 a 1 1 11.4
8 3 b 1 1 11.9
9 4 a 1 1 11.5
10 4 b 1 1 13.2
filter()
You can leverage the “%in%
” operator to avoid typing too much
Example: All data from sites “a”, “b”, “c”, and “d”, on the first year
select()
, rename()
, and mutate()
data_lionfish
# A tibble: 109 × 9
id site lat lon total_length_mm total_weight_gr size_class depth_m
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 001-Po-… Para… 20.5 -87.2 213 113. large 38.1
2 002-Po-… Para… 20.5 -87.2 124 27.6 medium 27.9
3 003-Pd-… Pared 20.5 -87.2 166 52.3 medium 18.5
4 004-Cs-… Cano… 20.5 -87.2 203 123. large 15.5
5 005-Cs-… Cano… 20.5 -87.2 212 129 large 15
6 006-Pl-… Paam… 20.5 -87.2 210 139. large 22.7
7 007-Pl-… Paam… 20.5 -87.2 132 50.3 medium 13.4
8 008-Po-… Para… 20.5 -87.2 122 17.2 medium 18.5
9 009-Po-… Para… 20.5 -87.2 224 113. large 18.2
10 010-Pd-… Pared 20.5 -87.2 117 19.6 medium 12.5
# ℹ 99 more rows
# ℹ 1 more variable: temperature_C <dbl>
select()
select()
: Allows us to state which columns we want to keep
Select columns by name
# A tibble: 109 × 3
id total_length_mm total_weight_gr
<chr> <dbl> <dbl>
1 001-Po-16/05/10 213 113.
2 002-Po-29/05/10 124 27.6
3 003-Pd-29/05/10 166 52.3
4 004-Cs-12/06/10 203 123.
5 005-Cs-12/06/10 212 129
6 006-Pl-21/06/10 210 139.
7 007-Pl-21/06/10 132 50.3
8 008-Po-04/07/10 122 17.2
9 009-Po-04/07/10 224 113.
10 010-Pd-08/07/10 117 19.6
# ℹ 99 more rows
Select columns by position
# A tibble: 109 × 3
id total_length_mm total_weight_gr
<chr> <dbl> <dbl>
1 001-Po-16/05/10 213 113.
2 002-Po-29/05/10 124 27.6
3 003-Pd-29/05/10 166 52.3
4 004-Cs-12/06/10 203 123.
5 005-Cs-12/06/10 212 129
6 006-Pl-21/06/10 210 139.
7 007-Pl-21/06/10 132 50.3
8 008-Po-04/07/10 122 17.2
9 009-Po-04/07/10 224 113.
10 010-Pd-08/07/10 117 19.6
# ℹ 99 more rows
select()
Select columns by partial name match
Remove columns by position
# A tibble: 109 × 5
id site lat lon total_length_mm
<chr> <chr> <dbl> <dbl> <dbl>
1 001-Po-16/05/10 Paraiso 20.5 -87.2 213
2 002-Po-29/05/10 Paraiso 20.5 -87.2 124
3 003-Pd-29/05/10 Pared 20.5 -87.2 166
4 004-Cs-12/06/10 Canones 20.5 -87.2 203
5 005-Cs-12/06/10 Canones 20.5 -87.2 212
6 006-Pl-21/06/10 Paamul 20.5 -87.2 210
7 007-Pl-21/06/10 Paamul 20.5 -87.2 132
8 008-Po-04/07/10 Paraiso 20.5 -87.2 122
9 009-Po-04/07/10 Paraiso 20.5 -87.2 224
10 010-Pd-08/07/10 Pared 20.5 -87.2 117
# ℹ 99 more rows
select()
Selection helpers:
contains()
: Contains substringstarts_with()
: Starts with prefixends_with()
: Ends with suffixeverything()
: All other columns columnsrename()
rename()
: Change column names
Example: Rename id
to fish_id
, total_length_mm
to length_mm
, and total_weight_mm
to weight_gr
# Rename specific columns
rename(data_lionfish,
fish_id = id,
length = total_length_mm,
weight = total_weight_gr)
# A tibble: 109 × 9
fish_id site lat lon length weight size_class depth_m temperature_C
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 001-Po-16/0… Para… 20.5 -87.2 213 113. large 38.1 28
2 002-Po-29/0… Para… 20.5 -87.2 124 27.6 medium 27.9 28
3 003-Pd-29/0… Pared 20.5 -87.2 166 52.3 medium 18.5 28
4 004-Cs-12/0… Cano… 20.5 -87.2 203 123. large 15.5 28
5 005-Cs-12/0… Cano… 20.5 -87.2 212 129 large 15 28
6 006-Pl-21/0… Paam… 20.5 -87.2 210 139. large 22.7 29
7 007-Pl-21/0… Paam… 20.5 -87.2 132 50.3 medium 13.4 29
8 008-Po-04/0… Para… 20.5 -87.2 122 17.2 medium 18.5 29
9 009-Po-04/0… Para… 20.5 -87.2 224 113. large 18.2 29
10 010-Pd-08/0… Pared 20.5 -87.2 117 19.6 medium 12.5 29
# ℹ 99 more rows
Note that the syntax is new_name = old_name
mutate()
mutate()
: Add new columns or modify existing ones
# Create new columns
data_lionfish |>
mutate(length_weight_ratio = total_length_mm / total_weight_gr)
# A tibble: 109 × 10
id site lat lon total_length_mm total_weight_gr size_class depth_m
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 001-Po-… Para… 20.5 -87.2 213 113. large 38.1
2 002-Po-… Para… 20.5 -87.2 124 27.6 medium 27.9
3 003-Pd-… Pared 20.5 -87.2 166 52.3 medium 18.5
4 004-Cs-… Cano… 20.5 -87.2 203 123. large 15.5
5 005-Cs-… Cano… 20.5 -87.2 212 129 large 15
6 006-Pl-… Paam… 20.5 -87.2 210 139. large 22.7
7 007-Pl-… Paam… 20.5 -87.2 132 50.3 medium 13.4
8 008-Po-… Para… 20.5 -87.2 122 17.2 medium 18.5
9 009-Po-… Para… 20.5 -87.2 224 113. large 18.2
10 010-Pd-… Pared 20.5 -87.2 117 19.6 medium 12.5
# ℹ 99 more rows
# ℹ 2 more variables: temperature_C <dbl>, length_weight_ratio <dbl>
mutate()
You can overwrite an existing column
You can use conditional logic
# Using case_when()
data_lionfish |>
mutate(size_class = ifelse(total_length_mm < 150, "small", "large"))
# A tibble: 109 × 9
id site lat lon total_length_mm total_weight_gr size_class depth_m
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 001-Po-… Para… 20.5 -87.2 213 113. large 38.1
2 002-Po-… Para… 20.5 -87.2 124 27.6 small 27.9
3 003-Pd-… Pared 20.5 -87.2 166 52.3 large 18.5
4 004-Cs-… Cano… 20.5 -87.2 203 123. large 15.5
5 005-Cs-… Cano… 20.5 -87.2 212 129 large 15
6 006-Pl-… Paam… 20.5 -87.2 210 139. large 22.7
7 007-Pl-… Paam… 20.5 -87.2 132 50.3 small 13.4
8 008-Po-… Para… 20.5 -87.2 122 17.2 small 18.5
9 009-Po-… Para… 20.5 -87.2 224 113. large 18.2
10 010-Pd-… Pared 20.5 -87.2 117 19.6 small 12.5
# ℹ 99 more rows
# ℹ 1 more variable: temperature_C <dbl>
data_MPA
objectRecall that it looked like this:
I’d rather have it like this:
data_MPA <- data_MPA |>
mutate(protected = ifelse(protected == 1, "yes", "no"),
after = ifelse(after == 1, "after", "before"))
data_MPA
# A tibble: 100 × 5
time id protected after biomass
<int> <chr> <chr> <chr> <dbl>
1 -5 a yes before 9.37
2 -5 b yes before 10.2
3 -5 c yes before 9.16
4 -5 d yes before 11.6
5 -5 e yes before 10.3
6 -5 f no before 9.18
7 -5 g no before 10.5
8 -5 h no before 10.7
9 -5 i no before 10.6
10 -5 j no before 9.69
# ℹ 90 more rows
group_by()
group_by()
: Group data by one or more variables
# A tibble: 100 × 5
# Groups: protected, after [4]
time id protected after biomass
<int> <chr> <chr> <chr> <dbl>
1 -5 a yes before 9.37
2 -5 b yes before 10.2
3 -5 c yes before 9.16
4 -5 d yes before 11.6
5 -5 e yes before 10.3
6 -5 f no before 9.18
7 -5 g no before 10.5
8 -5 h no before 10.7
9 -5 i no before 10.6
10 -5 j no before 9.69
# ℹ 90 more rows
This might not seem like much, until you hear about summarize()
summarize()
summarize()
: Collapse groups to single values (and create a new column)
Example: Calculate mean biomass by protection site and study period
data_MPA |>
group_by(protected, after) |> # Define the groups
summarize(biomass_m = mean(biomass)) # Collapse the biomass values of each group into a single value
# A tibble: 4 × 3
# Groups: protected [2]
protected after biomass_m
<chr> <chr> <dbl>
1 no after 10.1
2 no before 10.1
3 yes after 12.2
4 yes before 10.1
Count functions:
n()
: Number of observationsn_distinct()
: Number of unique valuesSummary functions:
mean()
, median()
, sd()
, min()
, max()
first()
, last()
, nth()
summarize()
You can calculate multiple summaries at once
Example: Let’s calculate the mean biomass and sd of biomass for each treatment status
# Summary by group
data_MPA |>
group_by(protected, after) |>
summarize(biomass_m = mean(biomass),
biomass_sd = sd(biomass))
# A tibble: 4 × 4
# Groups: protected [2]
protected after biomass_m biomass_sd
<chr> <chr> <dbl> <dbl>
1 no after 10.1 0.953
2 no before 10.1 0.648
3 yes after 12.2 1.000
4 yes before 10.1 0.995
data_MPA
analysis, this time with dplyr
verbsBy the end of this week, you should be able to:
*.xlsx
, *.csv
and *.rds
)filter()
, arrange()
)select()
, rename()
)Go to jcvdav.github.io/EVR_628 and find the guide for live coding