EVR 628- Intro to Environmental Data Science

Coding principles

Juan Carlos Villaseñor-Derbez (JC)

So Far

  • You’ve trusted me and executed code as instructed
  • I didn’t give you much background on what you were doing
  • Clearly something clicked; you are still here
  • Let’s give you a bit more background

Learning Objectives

By the end of this week, you should be able to:

  • Explain the importance of code style and documentation
  • Know about “atomic” objects and how they relate to all things R
  • Understand and check object classes
  • Create, subset, and modify vectors and data frames
  • Use the pipe operator

Code style

Helping others (and future you) read your code

Comments

  • Text embedded within your script for humans to read

  • Helps other people (including future you) understand what’s going on

  • We use the # sign to prevent the computer from reading it

    • Works in R, Python, Make, and Julia (standard SQL uses -- instead, though MySQL also accepts #)

Comments: Some guidelines

  • What are you doing?
  • Why are you doing it that way?
    • Focus on explaining why you did something, not what you are doing (or how)
  • Try to keep your comments within the vertical line shown in RStudio
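
To illustrate the difference between a “what” comment and a “why” comment, here is a minimal sketch. The weight value, the object names, and the stated reason are all hypothetical and are not taken from the course data.

```r
# Hypothetical example: one fish weight in grams
weight_gr <- 113

# Less helpful: restates what the code already says
# Divide weight_gr by 1000
weight_kg <- weight_gr / 1000

# More helpful: explains why the step is needed
# Convert to kg because the (hypothetical) final report uses kg
weight_kg <- weight_gr / 1000
weight_kg
```

Both versions run the same code; only the second comment tells future you something the code does not.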

Is this comment helpful?

# Shark analysis
lengths <- c(6, 4.1, 2.8, 5.5, 3.9, 5.8)
sharks <- c("Great White Shark", "Lemon Shark", "Bull Shark", "Hammerhead Shark", "Mako Shark", "Great White Shark")
shark_data <- data.frame(
  lengths,
  sharks
)
shark_data$sharks[shark_data$lengths == max(shark_data$lengths)]
[1] "Great White Shark"

Comments: Some guidelines

How about this?

# Build data
## Vector of shark lengths
lengths <- c(6, 4.1, 2.8, 5.5, 3.9, 5.8)
## Vector of shark names
sharks <- c("Great White Shark", "Lemon Shark",
            "Bull Shark", "Hammerhead Shark",
            "Mako Shark", "Great White Shark")

# Combine vectors into a data.frame
shark_data <- data.frame(
  lengths,
  sharks
)

# Find the length for the largest great white
shark_data$lengths[shark_data$sharks == "Great White Shark"] |> max()
[1] 6

Sectioning with comments

  • A script contains more than one bit of code, often dozens of lines

  • RStudio can detect up to 6 levels, as indicated by the number of “#” signs

################################################################################
# Description goes here
################################################################################

# SET UP #######################################################################

## Load packages ---------------------------------------------------------------

## Load data -------------------------------------------------------------------


# PROCESSING ###################################################################

## Some step -------------------------------------------------------------------


# VISUALIZE ####################################################################

## Another step ----------------------------------------------------------------


# EXPORT #######################################################################

## The final step --------------------------------------------------------------

Naming conventions (super important)

  • Objects
  • Files
  • Directories
i_use_snake_case                   # This is my preference

otherPeopleUseCamelCase            # This sometimes works

some.people.use.periods            # This is dangerous, especially in Python

And_aFew.People_RENOUNCEconvention # You need help

ALL_CAPS                           # Reserved for super important stuff

Spaces

Use spaces:

  • around mathematical operators apart from ^ (i.e. + , - , == , < , …)
  • around the assignment operator ( <- )
  • around pipes ( |> and %>%)
# Strive for
y <- (m * x) + b

# Avoid
y<-(m*x)+b

# Avoid
 y <- ( m * x ) + b
  • No spaces around parentheses for regular function calls
  • Always put a space after a comma, just like in standard English.
# Strive for
mean(x, na.rm = TRUE)

# Avoid
mean (x ,na.rm=TRUE)

Indentation and line separation

Not great

library(EVR628tools)
library(tidyverse)
data("data_lionfish")
my_data<-data_lionfish[data_lionfish$site=="Paamul",]
ggplot(data=my_data,
mapping=aes(x=depth_m,y=total_length_mm))+
geom_point(shape=21,fill="steelblue",size=2)+
labs(x="Depth(m)",
y="Totallength(mm)",
title="Bodylengthanddepth",
subtitle="Largerfishtendtolivedeeper",
caption="SourceEVR628tools::data_lionfish")

Better

# Load packages
library(EVR628tools)
library(tidyverse)

# Load data
data("data_lionfish")

# Build my own data
my_data <- data_lionfish[data_lionfish$site == "Paamul", ]

# Build my plot
ggplot(data = my_data,
       mapping = aes(x = depth_m, y = total_length_mm)) +
  geom_point(shape = 21, fill = "steelblue", size = 2) +
  labs(x = "Depth (m)",
       y = "Total length (mm)",
       title = "Body length and depth",
       subtitle = "Larger fish tend to live deeper",
       caption = "Source EVR628tools::data_lionfish")

Use Cmd + I (Ctrl + I on Windows) to auto-indent

  • line by line
  • select code chunk and then indent

“Objects” and “Classes”

Things and types of things

Atomic objects

The most common atomic classes are:

  • character: "a" or 'b' (note the quotation marks)
  • numeric: 2 or 10e3 (note scientific notation)
  • logical: TRUE/FALSE or T/F, but never true/false (note no quotations)

You can check classes with function class()

class("a")
[1] "character"
class(2)
[1] "numeric"
class("2") # This is no longer a number
[1] "character"
class(TRUE)
[1] "logical"

Coercing “up” is safe

as.character(20)
[1] "20"
as.numeric(TRUE) # Logical to numeric
[1] 1

Coercing “down”… not always

as.numeric("a") # Character to numeric
[1] NA
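
As a small sketch of why coercing “down” needs care, the example below converts a made-up character vector to numeric. The vector contents are hypothetical; suppressWarnings() is used only to hide the “NAs introduced by coercion” warning that R would otherwise print.

```r
# Text that looks like a number converts cleanly
as.numeric("2")

# Mixed input: "two" cannot be converted, so it becomes NA
# suppressWarnings() hides the coercion warning for this demo
x <- suppressWarnings(as.numeric(c("1", "two", "3")))
x

# is.na() flags which elements failed to convert
is.na(x)
```

Checking is.na() after a conversion is one simple way to spot values that were silently lost.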

Object classes

Classes in R map to the types of data we discussed during Week 2:

  • numerical data in R are of class numeric
  • categorical data in R are of class character
  • ordinal data in R use a special class called factor, which stores text labels in a set order
    • Remember when we ordered the axis label in a ggplot using fct_infreq()?
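
A minimal sketch of building a factor by hand, assuming a made-up vector of size labels (the object names and values are hypothetical). The levels argument of factor() sets the order of the categories.

```r
# Hypothetical size labels stored as plain text
sizes <- c("small", "large", "medium", "small")

# Convert to a factor and state the order of the categories
size_fct <- factor(sizes, levels = c("small", "medium", "large"))

class(size_fct)      # The class is now "factor"
levels(size_fct)     # Categories, in the order we specified
as.integer(size_fct) # Position of each value within the levels
```

Without the levels argument, R would order the categories alphabetically, which is rarely the order you want for ordinal data.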

Creating “objects”

  • If you’ll need a value, dataset or plot for later, you should “assign it”
  • This will retain the object in your environment tab, leaving it available for later
  • You will use the assignment operator: <-
    • Mac shortcut: Opt + -
    • Windows shortcut: Alt + -
  • Read “<-” as: “gets a value of”

Save something (pretty much anything) by creating an object

# A number
my_num <- 2 # Read as: my_num gets a value of 2
# A piece of text
my_name <- "JC" # Read as: my_name gets a value of JC

Creating “objects”

They will appear in your environment pane

And you can “call them” later

my_num # Call your object in the console
[1] 2
my_num + 2 # Use your object in other operations
[1] 4
my_number # This line should fail... Why?
Error: object 'my_number' not found

You can use class() to check what each thing is

class(my_num)
[1] "numeric"
class(my_name)
[1] "character"

Operators

Binary Operators: Arithmetic

2 + 2 # Addition
[1] 4
10 - 1 # Subtraction
[1] 9
10 * 3 # Multiplication
[1] 30
10 / 3 # Division
[1] 3.333333
2^4 # Exponentiation
[1] 16
64^(1 / 2) # Roots as powers...
[1] 8

Arithmetic Operators:

  • Stand between two values
  • Perform arithmetic on objects of class numeric (or objects that can be coerced to them)
    • “coercion” here means “conversion”
    • TRUE or FALSE (logical values) can be “coerced” to ones and zeroes
  • Return an object of class numeric
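
The coercion of logical values to ones and zeroes can be sketched with a made-up logical vector (the name is_large and its values are hypothetical). Because TRUE counts as 1 and FALSE as 0, sum() counts the TRUE values and mean() gives their proportion.

```r
# TRUE is treated as 1 in arithmetic
TRUE + TRUE

# Hypothetical vector: is each fish "large"?
is_large <- c(TRUE, FALSE, TRUE, TRUE)

sum(is_large)  # How many are TRUE?
mean(is_large) # What proportion are TRUE?
```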

Binary Operators: Relational

2 < 3  # Less than?
[1] TRUE
2 > 3  # Greater than?
[1] FALSE
2 <= 3 # Less than or equal to?
[1] TRUE
2 >= 3 # Greater than or equal to?
[1] FALSE
2 == 3 # Equal to?
[1] FALSE
# Also works with characters
"hotdog" == "sandwich" # Equal?
[1] FALSE

Relational operators

  • Allow comparison of all atomic objects
  • Return an object of class logical

Binary Operators: Logical

(2 == 2) & ("a" == "b") # Are statements to the left AND right TRUE?
[1] FALSE
(2 == 2) | ("a" == "b") # Are statements to the left OR right TRUE?
[1] TRUE
!T # Negate the statement
[1] FALSE
  • Allow comparison of objects of class logical
  • Return an object of class logical

Combining objects

Combining Atomic Objects

  • You can combine atomic elements with function “c()”

  • c stands for “combine”

colors <- c("red", "blue", "green", "orange", "black")
numbers <- c(1, 40, 1, 5, 6)
  • The objects called colors and numbers are vectors of class character and numeric
class(colors)
[1] "character"
class(numbers)
[1] "numeric"
  • Vectors are like columns in a spreadsheet

  • Vectors have lengths: number of elements

length(colors)
[1] 5

Operating on Vectors

If the vector is numeric, arithmetic operations are applied to every element automatically

numbers <- 1:10
numbers
 [1]  1  2  3  4  5  6  7  8  9 10
numbers * 2 # Multiplies each element by 2
 [1]  2  4  6  8 10 12 14 16 18 20
numbers + 5 # Adds 5 to each element
 [1]  6  7  8  9 10 11 12 13 14 15
  • Arithmetic operations don’t work on character vectors
colors <- c("red", "blue", "green", "orange", "black")
colors * 2
Error in colors * 2: non-numeric argument to binary operator

Note

What does the error message above mean?

Indexing and subsetting vectors with []

Accessing elements within an object based on their position

Read “[]” as “extract elements”

colors <- c("red", "blue", "green", "orange", "black") # Create a character vector of colors

I can extract the first element with:

colors[1] # Extract first element
[1] "red"

Extract the first and third elements

colors[c(1, 3)] # Extract elements 1 and 3
[1] "red"   "green"

Extract elements based on logical matching

colors == "red"
[1]  TRUE FALSE FALSE FALSE FALSE
colors[!colors == "red"]
[1] "blue"   "green"  "orange" "black" 
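
Logical matching also works with relational and logical operators on numeric vectors. The sketch below uses a short, illustrative vector of depths (the object name and values are assumptions for this example).

```r
# Illustrative depths in meters
depths <- c(38.1, 27.9, 18.5, 15.5, 15)

# The comparison returns one TRUE/FALSE per element
depths > 18

# Keep only the elements where the comparison is TRUE
depths[depths > 18]

# Combine two conditions with & (both must be TRUE)
depths[depths > 18 & depths < 30]
```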

Index with [] and modify with <-

colors <- c("red", "blue", "green", "orange", "black") # Create a character vector of colors
colors
[1] "red"    "blue"   "green"  "orange" "black" 

Let’s modify red to white

colors[1] <- "white" # The first element of "colors" takes the value "white"

colors
[1] "white"  "blue"   "green"  "orange" "black" 

Number of elements indexed must match number of elements assigned

colors[1] <- c("white", "yellow")
Warning in colors[1] <- c("white", "yellow"): number of items to replace is not
a multiple of replacement length
colors
[1] "white"  "blue"   "green"  "orange" "black" 
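
For contrast, here is a minimal sketch where the number of elements indexed matches the number of elements assigned, so R raises no warning and both values are used.

```r
colors <- c("red", "blue", "green", "orange", "black")

# Two positions indexed, two values assigned
colors[c(1, 2)] <- c("white", "yellow")

colors
```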

Let’s code

Combining Different Classes

Combining Vectors of Different Classes

  • R’s basic class for tabular data is data.frame
  • Different columns may have different classes, but all values within a column must share one class
  • All vectors must be the same length
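
A quick sketch of what happens when classes are mixed within a single vector: R coerces every element to one common class. The object name mixed is hypothetical.

```r
# Try to combine a number, a piece of text, and a logical value
mixed <- c(1, "a", TRUE)

mixed        # Every element is now text
class(mixed) # The whole vector is of class character
```

This is why a column that should be numeric can end up as character if a single text value sneaks in.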

Build data.frame from vectors

# Build my vectors
colors <- c("red", "blue", "green", "orange", "black") # For five colors
numbers <- c(1, 40, 1, 5, 6)                           # For five numbers
# Column names are automatically assigned
data.frame(colors,
           numbers)
  colors numbers
1    red       1
2   blue      40
3  green       1
4 orange       5
5  black       6

Combining Vectors of Different Classes

Using vectors of different lengths fails

data.frame(colors = c("red", "green", "blue"),
           numbers = c(1, 2))
Error in data.frame(colors = c("red", "green", "blue"), numbers = c(1, : arguments imply differing number of rows: 3, 2

Unless they can be “recycled”

data.frame(colors = c("red", "green", "blue", "orange"),
           numbers = c(4, 2))
  colors numbers
1    red       4
2  green       2
3   blue       4
4 orange       2

Note how I built the data.frame directly from atomic elements

Combining Vectors of Different Classes

Build a data.frame and save it to an object

my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

Attributes of data.frames:

class(my_data) # Check class
[1] "data.frame"
dim(my_data) # Check dimensions (rows, columns)
[1] 2 3
nrow(my_data) # Check number of rows
[1] 2
ncol(my_data) # Check number of columns
[1] 3
colnames(my_data) # Check column names
[1] "colors"  "numbers" "letters"

data.frames are step 1 in tidy data

They allow us to adhere to the standards described in Week 2:

  • Each column is a variable
  • Each row is an observation
  • Cells contain values

Tidy data

         spp    sex carapace_length
1   C. mydas Female              23
2 C. caretta   Male              24

Not tidy data

         variable    Org 1      Org 2
1             spp C. mydas C. caretta
2             sex   Female       Male
3 carapace_length       23         24
  • Say you want a plot of turtle carapace length by species
  • Only one of these data formats allows you to plot it with ggplot

Tibbles vs data.frame

  • Tibbles are a special type of data.frame used in tidyverse and spatial libraries

  • They work in the same way

library(tidyverse)
tibble(colors,
       numbers)
# A tibble: 5 × 2
  colors numbers
  <chr>    <dbl>
1 red          1
2 blue        40
3 green        1
4 orange       5
5 black        6

Tibbles vs data.frame

Tibbles are also smart

  • They display their dimensions
  • And let you know data have been omitted
data_lionfish
# A tibble: 109 × 9
   id       site    lat   lon total_length_mm total_weight_gr size_class depth_m
   <chr>    <chr> <dbl> <dbl>           <dbl>           <dbl> <chr>        <dbl>
 1 001-Po-… Para…  20.5 -87.2             213           113.  large         38.1
 2 002-Po-… Para…  20.5 -87.2             124            27.6 medium        27.9
 3 003-Pd-… Pared  20.5 -87.2             166            52.3 medium        18.5
 4 004-Cs-… Cano…  20.5 -87.2             203           123.  large         15.5
 5 005-Cs-… Cano…  20.5 -87.2             212           129   large         15  
 6 006-Pl-… Paam…  20.5 -87.2             210           139.  large         22.7
 7 007-Pl-… Paam…  20.5 -87.2             132            50.3 medium        13.4
 8 008-Po-… Para…  20.5 -87.2             122            17.2 medium        18.5
 9 009-Po-… Para…  20.5 -87.2             224           113.  large         18.2
10 010-Pd-… Pared  20.5 -87.2             117            19.6 medium        12.5
# ℹ 99 more rows
# ℹ 1 more variable: temperature_C <dbl>

Subsetting dataframes with []

data.frames are two-dimensional, so we use two numbers: [rows, cols]

my_data
  colors numbers letters
1    red       1       A
2  green       2       B

Extract the value for the first observation and third variable

my_data[1, 3]
[1] "A"

Extract values for first observation across all variables, implicitly

my_data[1, ] # row one, all columns
  colors numbers letters
1    red       1       A

Extract values for first variable across all observations, implicitly

my_data[ , 1]
[1] "red"   "green"
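
Positions are not the only option inside []. As a sketch, the same small data.frame can be subset with column names and logical conditions; the data.frame is rebuilt here so the example runs on its own.

```r
my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

# Extract a column by name instead of position
my_data[, "colors"]

# Extract the rows where a condition is TRUE, all columns
my_data[my_data$numbers > 1, ]

# Combine both: a condition for rows, a name for the column
my_data[my_data$numbers > 1, "colors"]
```

Names and conditions keep working if the order of rows or columns changes, which positions do not.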

Extracting data.frame columns with $

  • Remember data.frames and tibbles are just collections of vectors
  • You can get the variable (column) out as a vector using $
  • This is why we like tidy data
ids <- data_lionfish$id
  • And then use [] on the resulting vector
  • For example, extract the first 10 ids
ids[1:10]
 [1] "001-Po-16/05/10" "002-Po-29/05/10" "003-Pd-29/05/10" "004-Cs-12/06/10"
 [5] "005-Cs-12/06/10" "006-Pl-21/06/10" "007-Pl-21/06/10" "008-Po-04/07/10"
 [9] "009-Po-04/07/10" "010-Pd-08/07/10"
  • This also works
data_lionfish$id[1:10]
 [1] "001-Po-16/05/10" "002-Po-29/05/10" "003-Pd-29/05/10" "004-Cs-12/06/10"
 [5] "005-Cs-12/06/10" "006-Pl-21/06/10" "007-Pl-21/06/10" "008-Po-04/07/10"
 [9] "009-Po-04/07/10" "010-Pd-08/07/10"

Index with [] and modify with <-

Works in the same way as with vectors

my_data
  colors numbers letters
1    red       1       A
2  green       2       B
my_data[1, 3] <- "turtle"

Which value will change?

my_data
  colors numbers letters
1    red       1  turtle
2  green       2       B
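
The same idea also works on a column extracted with $. The sketch below rebuilds the small data.frame so it runs on its own; the replacement value and the new column name doubled are hypothetical.

```r
my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

# Modify the number that belongs to the "green" row
my_data$numbers[my_data$colors == "green"] <- 20
my_data$numbers

# Assigning to a column that does not exist yet creates it
my_data$doubled <- my_data$numbers * 2
my_data
```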

Functions

Functions

A function:

  • Is a block of code which only runs when it is called
  • Takes arguments…
  • Does something to them
  • Returns an object computed from the arguments passed

Some functions we’ve used so far are:

  • length()
  • dim()
  • mean()
  • aes()
  • what else?
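
As a short sketch of how arguments change what a function returns, here is mean() called on a made-up vector containing a missing value (the object name values is hypothetical).

```r
# Hypothetical measurements, one of them missing
values <- c(1, 2, NA)

# With default arguments, a missing value makes the result NA
mean(x = values)

# The na.rm argument tells mean() to drop missing values first
mean(x = values, na.rm = TRUE)

# Arguments can also be matched by position instead of name
mean(values, na.rm = TRUE)
```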

Pipes and pipelines

Pipes

Imagine having the following numeric vector

# Area of square surveyed plots (in sq m)
areas <- c(1, 5, 2, 7, 9, 2, 3, 6, 3, 2, 1, 1, 6, 8, 3, 2, 1, 6, 1, 2, 8, 3,
           6, 8, 9, 0, 5, 3, 2, 1, 2, 6, 9, 2, 3, 6, 3, 2, 1, 1, 6, 8, 3, 2)

And wanting to know how many distinct sizes of square plot have a side greater than 2

Before 2013 you had two options:

1: Call one function at a time

# This creates too many objects
sides <- sqrt(areas) # Calculate square root to get side length
sides_larger_than_2 <- sides[sides > 2] # Extract sides > 2
types <- unique(sides_larger_than_2) # Get unique "types" of plot
how_many <- length(types) # Count how many "types" of plot
how_many
[1] 5

2: Use nested functions

# This is hard to read:
how_many <- length(unique(sqrt(areas)[sqrt(areas) > 2]))
how_many
[1] 5

Pipes After 2013

The magrittr package introduced the pipe operator “%>%”

“The Treachery of Images” by René Magritte

Pipe operator from magrittr package
  • The pipe was later popularized by {dplyr}
  • Read “%>%” as “goes into”
library(magrittr) # Load the magrittr package
# Build a pipeline
how_many <- areas[sqrt(areas) > 2] %>% # Extract values with a side > 2
  sqrt() %>%  # Take their sqrt (not needed)
  unique() %>% # Get unique values
  length() # Get length of values

how_many
[1] 5

Pipes Since 2021

The R community saw the usefulness of this and developed a “native” pipe “|>”

You no longer need to load a package to use a pipe

# Build a pipeline
how_many <- areas[sqrt(areas) > 2] |>  # Extract values with a side > 2
  sqrt() |>   # Take their sqrt (not needed)
  unique() |>  # Get unique values
  length() # Get length of values

how_many
[1] 5
  • Keyboard shortcut:
    • MacOS: Cmd + Shift + M
    • Windows: Ctrl + Shift + M
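
To make the mechanics concrete: the native pipe passes the value on its left as the first argument of the function on its right. The sketch below uses made-up numbers and hypothetical object names, and it assumes R 4.1 or later.

```r
# Hypothetical areas of three square plots
squares <- c(4, 9, 16)

# Nested version: read from the inside out
nested <- sum(sqrt(squares))

# Piped version: read from left to right
piped <- squares |> sqrt() |> sum()

identical(nested, piped) # Both give the same result

# Extra arguments go inside the parentheses as usual
rounded <- 3.14159 |> round(digits = 2)
rounded
```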

Let’s code

Functions

For example, calculate the mean \(\bar{x} = \frac{\sum_{i = 1}^N x_i}{N} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}\)

Sum of all values from 1 to N, divided by the number of values

# Example: calculate mean lionfish length "by hand"
sum(data_lionfish$total_length_mm) / length(data_lionfish$total_length_mm)
[1] 140.2202

If you have to do this for multiple variables, you might want to use a function

# Example: calculate mean lionfish length using a built-in function
mean(x = data_lionfish$total_length_mm)
[1] 140.2202

And if the function doesn’t exist, you can create one yourself (to be covered later)

my_mean <- function(var) {
  # Sum of all values divided by the number of values
  result <- sum(var) / length(var)
  return(result)
}
# Example: calculate mean lionfish length using a user-defined function
my_mean(var = data_lionfish$total_length_mm)
[1] 140.2202
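
One reason to write your own function is to add behavior you need. As a sketch only, the hypothetical function my_mean_na() below adds an na.rm argument to the hand-written mean; the function name and the test values are assumptions for illustration.

```r
# Hypothetical extension of my_mean() that can skip missing values
my_mean_na <- function(var, na.rm = FALSE) {
  # Optionally drop missing values before calculating
  if (na.rm) {
    var <- var[!is.na(var)]
  }
  # Sum of all values divided by the number of values
  sum(var) / length(var)
}

# Made-up lengths, one of them missing
lengths_mm <- c(213, 124, NA, 203)

my_mean_na(lengths_mm)               # NA, the missing value propagates
my_mean_na(lengths_mm, na.rm = TRUE) # Mean of the three known values
```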