Coding principles

EVR 628 - Intro to Environmental Data Science

Juan Carlos Villaseñor-Derbez (JC)

Rosenstiel School of Marine, Atmospheric, and Earth Science and Institute for Data Science and Computing

So Far

You’ve trusted me and executed code as instructed
I didn’t give you much background on what you were doing
Clearly something clicked; you are still here
Let’s give you a bit more background

Learning Objectives

By the end of this week, you should be able to:

Explain the importance of code style and documentation
Know about “atomic” objects and how they relate to all things R
Understand and check object classes
Create, subset, and modify vectors and data frames
Use the pipe operator

Code style

Helping others (and future you) read your code

Comments

Text embedded within your script for humans to read
Helps other people (including future you) understand what’s going on
We use the # sign to prevent the computer from reading it
- Works in R, Python, Make, and Julia (standard SQL uses -- instead)

Comments: Some guidelines

What are you doing?
Why are you doing it that way?
- Focus on explaining why you did something, not what you are doing (or how)
Try to keep your comments within the vertical line shown in RStudio

Is this comment helpful?

# Shark analysis
lengths <- c(6, 4.1, 2.8, 5.5, 3.9, 5.8)
sharks <- c("Great White Shark", "Lemon Shark", "Bull Shark", "Hammerhead Shark", "Mako Shark", "Great White Shark")
shark_data <- data.frame(
  lengths,
  sharks
)
shark_data$sharks[shark_data$lengths == max(shark_data$lengths)]

[1] "Great White Shark"

Comments: Some guidelines

How about this?

# Build data
## Vector of shark lengths
lengths <- c(6, 4.1, 2.8, 5.5, 3.9, 5.8)
## Vector of shark names
sharks <- c("Great White Shark", "Lemon Shark",
            "Bull Shark", "Hammerhead Shark",
            "Mako Shark", "Great White Shark")

# Combine vectors into a data.frame
shark_data <- data.frame(
  lengths,
  sharks
)

# Find the length for the largest great white
shark_data$lengths[shark_data$sharks == "Great White Shark"] |> max()

[1] 6

Sectioning with comments

A script contains more than one bit of code, often dozens of lines
RStudio can detect up to 6 levels, as indicated by the number of “#”

################################################################################
# Description goes here
################################################################################

# SET UP #######################################################################

## Load packages ---------------------------------------------------------------

## Load data -------------------------------------------------------------------


# PROCESSING ###################################################################

## Some step -------------------------------------------------------------------


# VISUALIZE ####################################################################

## Another step ----------------------------------------------------------------


# EXPORT #######################################################################

## The final step --------------------------------------------------------------

Naming conventions (super important)

Objects
Files
Directories

i_use_snake_case                   # This is my preference

otherPeopleUseCamelCase            # This sometimes works

some.people.use.periods            # This is dangerous, especially in Python

And_aFew.People_RENOUNCEconvention # You need help

ALL_CAPS                           # Reserved for super important stuff

Google Code Style Guide

Spaces

Use spaces:

around mathematical operators apart from ^ (e.g. + , - , == , < , …)
around the assignment operator ( <- )
around pipes ( |> and %>%)

# Strive for
y <- (m * x) + b

# Avoid
y<-(m*x)+b

# Avoid
 y <- ( m * x ) + b

No spaces around parentheses for regular function calls
Always put a space after a comma, just like in standard English.

# Strive for
mean(x, na.rm = TRUE)

# Avoid
mean (x ,na.rm=TRUE)

Indentation and line separation

Not great

library(EVR628tools)
library(tidyverse)
data("data_lionfish")
my_data<-data_lionfish[data_lionfish$site=="Paamul",]
ggplot(data=my_data,
mapping=aes(x=depth_m,y=total_length_mm))+
geom_point(shape=21,fill="steelblue",size=2)+
labs(x="Depth(m)",
y="Totallength(mm)",
title="Bodylengthanddepth",
subtitle="Largerfishtendtolivedeeper",
caption="SourceEVR628tools::data_lionfish")

Better

# Load packages
library(EVR628tools)
library(tidyverse)

# Load data
data("data_lionfish")

# Build my own data
my_data <- data_lionfish[data_lionfish$site == "Paamul", ]

# Build my plot
ggplot(data = my_data,
       mapping = aes(x = depth_m, y = total_length_mm)) +
  geom_point(shape = 21, fill = "steelblue", size = 2) +
  labs(x = "Depth (m)",
       y = "Total length (mm)",
       title = "Body length and depth",
       subtitle = "Larger fish tend to live deeper",
       caption = "Source EVR628tools::data_lionfish")

Use Cmd + I (Ctrl + I on Windows) to auto-indent

line by line
select code chunk and then indent

“Objects” and “Classes”

Things and types of things

Atomic objects

The most common atomic classes are:

character: "a" or 'b' (note the quotation marks)
numeric: 2 or 10e3 (note scientific notation)
logical: TRUE/FALSE or T/F, but never true/false (note no quotations)

You can check classes with function class()

class("a")

[1] "character"

class(2)

[1] "numeric"

class("2") # This is no longer a number

[1] "character"

class(TRUE)

[1] "logical"

Coercing “up” is safe

as.character(20)

[1] "20"

as.numeric(TRUE) # Logical to numeric

[1] 1

Coercing “down”… not always

as.numeric("a") # Character to numeric

[1] NA
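
A short sketch of what “not always” means in practice: text that looks like a number converts cleanly, while other text becomes NA. The object names below are made up for illustration.

```r
# Text that looks like a number converts without trouble
looks_numeric <- as.numeric("2")
looks_numeric

# Text that does not look like a number becomes NA
# (suppressWarnings() hides the warning R would otherwise print)
not_numeric <- suppressWarnings(as.numeric("a"))
is.na(not_numeric) # Check whether the coercion produced a missing value
```

Checking for NA after coercing “down” is a cheap way to catch values that did not convert.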

Object classes

Classes in R map to the types of data we covered during Week 2:

numerical data in R are of class numeric
categorical data in R are of class character
ordinal data in R use a special class called factor
- Remember when we ordered the axis label in a ggplot using fct_infreq()?
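
A minimal sketch of building an ordered factor with base R's factor(). The size classes below are hypothetical and only illustrate how the order is defined.

```r
# Hypothetical size classes, in no particular order
sizes <- c("small", "large", "medium", "small")

# Tell R the order of the categories with the levels argument
sizes_fct <- factor(sizes,
                    levels = c("small", "medium", "large"),
                    ordered = TRUE)

class(sizes_fct)    # An ordered factor
levels(sizes_fct)   # Categories, in the order we defined
sizes_fct < "large" # Relational operators now respect that order
```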

Creating “objects”

If you’ll need a value, dataset or plot for later, you should “assign it”
This will retain the object in your environment tab, leaving it available for later
You will use the assignment operator: <-
- Mac shortcut: Opt + -
- Windows shortcut: Alt + -
Read “<-” as: “gets a value of”

Save something (pretty much anything) by creating an object

# A number
my_num <- 2 # Read as: my_num gets a value of 2
# A piece of text
my_name <- "JC" # Read as: my_name gets a value of JC

Creating “objects”

They will appear in your environment pane

And you can “call them” later

my_num # Call your object in the console

[1] 2

my_num + 2 # Use your object in other operations

[1] 4

my_number # This line should fail... Why?

Error: object 'my_number' not found

You can use class() to check what each thing is

class(my_num)

[1] "numeric"

class(my_name)

[1] "character"

Operators

Binary Operators: Arithmetic

2 + 2 # Addition

[1] 4

10 - 1 # Subtraction

[1] 9

10 * 3 # Multiplication

[1] 30

10 / 3 # Division

[1] 3.333333

2^4 # Exponentiation

[1] 16

64 ^ (1/2) # Roots as powers...

[1] 8

Arithmetic Operators:

Stand between two values
Perform arithmetic on objects of class numeric (or objects that can be coerced to it)
- “coercion” here means “conversion”
- TRUE or FALSE (logical values) can be “coerced” to ones and zeroes
Return an object of class numeric
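
A quick sketch of logical values being coerced to ones and zeroes during arithmetic. The vector saw_shark is a made-up example.

```r
TRUE + TRUE # Each TRUE is coerced to 1, so the result is 2

# Hypothetical answers to "Did we see a shark on this dive?"
saw_shark <- c(TRUE, FALSE, TRUE, TRUE)
n_sightings <- sum(saw_shark) # Adds up the TRUEs: number of dives with a shark
n_sightings
```

This is why summing a logical vector is a common way to count how many elements meet a condition.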

Binary Operators: Relational

2 < 3  # Less than?

[1] TRUE

2 > 3  # Greater than?

[1] FALSE

2 <= 3 # Less than or equal to?

[1] TRUE

2 >= 3 # Greater than or equal to?

[1] FALSE

2 == 3 # Equal to?

[1] FALSE

# Also works with characters
"hotdog" == "sandwich" # Equal?

[1] FALSE

Relational operators

Allow comparison of objects of all atomic classes
Return an object of class logical

Binary Operators: Logical

(2 == 2) & ("a" == "b") # Are statements to the left AND right TRUE?

[1] FALSE

(2 == 2) | ("a" == "b") # Are statements to the left OR right TRUE?

[1] TRUE

!T # Negate the statement

[1] FALSE

Allow comparison of objects of class logical
Return an object of class logical
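
A small sketch of how logical operators combine relational tests on a stored object. The object depth_m and its value are made up for illustration.

```r
depth_m <- 18 # Hypothetical dive depth, in meters

in_range <- (depth_m > 10) & (depth_m < 30)  # Between 10 and 30 m?
out_range <- (depth_m < 10) | (depth_m > 30) # Outside that range?
not_deep <- !(depth_m > 10)                  # NOT deeper than 10 m?

in_range
out_range
not_deep
```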

Combining objects

Combining Atomic Objects

You can combine atomic elements with function “c()”
c stands for “combine”

colors <- c("red", "blue", "green", "orange", "black")
numbers <- c(1, 40, 1, 5, 6)

The objects called colors and numbers are vectors of class character and numeric, respectively

class(colors)

[1] "character"

class(numbers)

[1] "numeric"

Vectors are like columns in a spreadsheet
Vectors have lengths: number of elements

length(colors)

[1] 5
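
A vector can only hold one class. As a sketch of what happens when classes are mixed inside c(), using a made-up object called mixed:

```r
mixed <- c(1, "a", TRUE) # Numeric, character, and logical together
mixed
class(mixed) # Everything was coerced "up" to character
```

R does not warn about this, so it is worth checking class() when a vector behaves unexpectedly.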

Operating on Vectors

If the vector is numeric, arithmetic operations are applied to every element automatically

numbers <- 1:10
numbers

 [1]  1  2  3  4  5  6  7  8  9 10

numbers * 2 # Multiplies each element by 2

 [1]  2  4  6  8 10 12 14 16 18 20

numbers + 5 # Adds 5 to each element

 [1]  6  7  8  9 10 11 12 13 14 15

Arithmetic operations don’t work on character vectors

colors <- c("red", "blue", "green", "orange", "black")
colors * 2

Error in colors * 2: non-numeric argument to binary operator

Note

What does the error message above mean?

Indexing and subsetting vectors with `[]`

Accessing elements within an object based on their position

Read “[]” as “extract elements”

colors <- c("red", "blue", "green", "orange", "black") # Create a character vector of colors

I can extract the first element with:

colors[1] # Extract first element

[1] "red"

Extract the first and third elements

colors[c(1, 3)] # Extract elements 1 and 3

[1] "red"   "green"

Extract elements based on logical matching

colors == "red"

[1]  TRUE FALSE FALSE FALSE FALSE

colors[!colors == "red"]

[1] "blue"   "green"  "orange" "black"
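
The same logical matching works with relational operators on numeric vectors. A minimal sketch, where big_numbers is just an illustrative name:

```r
numbers <- c(1, 40, 1, 5, 6)

numbers > 5 # A logical vector: one TRUE/FALSE per element

big_numbers <- numbers[numbers > 5] # Keep only elements where the test is TRUE
big_numbers
```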

Indexing with `[]` and modify with `<-`

colors <- c("red", "blue", "green", "orange", "black") # Create a character vector of colors
colors

[1] "red"    "blue"   "green"  "orange" "black"

Let’s modify red to white

colors[1] <- "white" # The first element of "colors" takes the value "white"

colors

[1] "white"  "blue"   "green"  "orange" "black"

Number of elements indexed must match number of elements assigned

colors[1] <- c("white", "yellow")

Warning in colors[1] <- c("white", "yellow"): number of items to replace is not
a multiple of replacement length

colors

[1] "white"  "blue"   "green"  "orange" "black"
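
For contrast, a sketch where the number of positions indexed matches the number of values assigned:

```r
colors <- c("red", "blue", "green", "orange", "black")

colors[c(1, 2)] <- c("white", "yellow") # Two positions, two new values
colors
```

Here both replacement values are used, and R raises no warning.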

Let’s code

Combining Different Classes

Combining Vectors of Different Classes

R’s basic class for tabular data is data.frame
Different columns can have different classes, but all values within a column must share one class
All vectors must be the same length

Build data.frame from vectors

# Build my vectors
colors <- c("red", "blue", "green", "orange", "black") # For five colors
numbers <- c(1, 40, 1, 5, 6)                           # For five numbers
# Column names are automatically assigned
data.frame(colors,
           numbers)

colors	numbers
red	1
blue	40
green	1
orange	5
black	6

Combining Vectors of Different Classes

Using vectors of different lengths fails

data.frame(colors = c("red", "green", "blue"),
           numbers = c(1, 2))

Error in data.frame(colors = c("red", "green", "blue"), numbers = c(1, : arguments imply differing number of rows: 3, 2

Unless they can be “recycled”

data.frame(colors = c("red", "green", "blue", "orange"),
           numbers = c(4, 2))

colors	numbers
red	4
green	2
blue	4
orange	2

Note how I built the data.frame directly from atomic elements
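
Recycling is not specific to data.frames. As a sketch, the same rule applies when doing arithmetic with vectors of different lengths (object names are illustrative):

```r
long_vec <- c(1, 2, 3, 4)
short_vec <- c(10, 20)

recycled_sum <- long_vec + short_vec # short_vec is reused as 10, 20, 10, 20
recycled_sum
```

This works quietly because 4 is a multiple of 2, so it is worth checking lengths with length() when results look odd.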

Combining Vectors of Different Classes

Build a data.frame and save it to an object

my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

Attributes of data.frames:

class(my_data) # Check class

[1] "data.frame"

dim(my_data) # Check dimensions (rows, columns)

[1] 2 3

nrow(my_data) # Check number of rows

[1] 2

ncol(my_data) # Check number of columns

[1] 3

colnames(my_data) # Check column names

[1] "colors"  "numbers" "letters"

data.frames are step 1 in tidy data

They allow us to adhere to the standards described in Week 2:

Each column is a variable
Each row is an observation
Cells contain values

Tidy data

         spp    sex carapace_length
1   C. mydas Female              23
2 C. caretta   Male              24

Not tidy data

         variable    Org 1      Org 2
1             spp C. mydas C. caretta
2             sex   Female       Male
3 carapace_length       23         24

Say you want a plot of turtle carapace length by species
Only one of these data formats allows you to plot it with ggplot

Tibbles vs data.frame

Tibbles are a special type of data.frame used in tidyverse and spatial libraries
They work in the same way

library(tidyverse)
tibble(colors,
       numbers)

# A tibble: 5 × 2
  colors numbers
  <chr>    <dbl>
1 red          1
2 blue        40
3 green        1
4 orange       5
5 black        6

Tibbles vs data.frame

Tibbles are also smart

They display their dimensions
And let you know data have been omitted

data_lionfish

# A tibble: 109 × 9
   id       site    lat   lon total_length_mm total_weight_gr size_class depth_m
   <chr>    <chr> <dbl> <dbl>           <dbl>           <dbl> <chr>        <dbl>
 1 001-Po-… Para…  20.5 -87.2             213           113.  large         38.1
 2 002-Po-… Para…  20.5 -87.2             124            27.6 medium        27.9
 3 003-Pd-… Pared  20.5 -87.2             166            52.3 medium        18.5
 4 004-Cs-… Cano…  20.5 -87.2             203           123.  large         15.5
 5 005-Cs-… Cano…  20.5 -87.2             212           129   large         15  
 6 006-Pl-… Paam…  20.5 -87.2             210           139.  large         22.7
 7 007-Pl-… Paam…  20.5 -87.2             132            50.3 medium        13.4
 8 008-Po-… Para…  20.5 -87.2             122            17.2 medium        18.5
 9 009-Po-… Para…  20.5 -87.2             224           113.  large         18.2
10 010-Pd-… Pared  20.5 -87.2             117            19.6 medium        12.5
# ℹ 99 more rows
# ℹ 1 more variable: temperature_C <dbl>

Subsetting dataframes with `[]`

data.frames are two-dimensional, so we use two numbers: [rows, cols]

my_data

  colors numbers letters
1    red       1       A
2  green       2       B

Extract the value for the first observation and third variable

my_data[1, 3]

[1] "A"

Extract values for first observation across all variables, implicitly

my_data[1, ] # row one, all columns

  colors numbers letters
1    red       1       A

Extract values for first variable across all observations, implicitly

my_data[ , 1]

[1] "red"   "green"
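
Rows and columns can also be selected by condition or by name instead of by position. A minimal sketch, assuming the same my_data as above; big_rows and some_cols are illustrative names, and $ pulls one column out as a vector:

```r
my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

# Keep rows where numbers is greater than 1, and all columns
big_rows <- my_data[my_data$numbers > 1, ]
big_rows

# Keep all rows, but only the columns named "colors" and "letters"
some_cols <- my_data[, c("colors", "letters")]
some_cols
```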

Extracting data.frame columns with `$`

Remember data.frames and tibbles are just collections of vectors
You can get the variable (column) out as a vector using $
This is why we like tidy data

ids <- data_lionfish$id

And then use [] on the resulting vector
For example, extract the first 10 ids

ids[1:10]

 [1] "001-Po-16/05/10" "002-Po-29/05/10" "003-Pd-29/05/10" "004-Cs-12/06/10"
 [5] "005-Cs-12/06/10" "006-Pl-21/06/10" "007-Pl-21/06/10" "008-Po-04/07/10"
 [9] "009-Po-04/07/10" "010-Pd-08/07/10"

This also works

data_lionfish$id[1:10]

 [1] "001-Po-16/05/10" "002-Po-29/05/10" "003-Pd-29/05/10" "004-Cs-12/06/10"
 [5] "005-Cs-12/06/10" "006-Pl-21/06/10" "007-Pl-21/06/10" "008-Po-04/07/10"
 [9] "009-Po-04/07/10" "010-Pd-08/07/10"

Indexing with `[]` and modify with `<-`

Works in the same way as with vectors

my_data

  colors numbers letters
1    red       1       A
2  green       2       B

my_data[1, 3] <- "turtle"

Which value will change?

my_data

  colors numbers letters
1    red       1  turtle
2  green       2       B
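
The same idea works when a column is extracted with $ first. A sketch, assuming a fresh copy of my_data; the column name doubled is made up:

```r
my_data <- data.frame(colors = c("red", "green"),
                      numbers = c(1, 2),
                      letters = c("A", "B"))

my_data$numbers[2] <- 10 # Second element of the "numbers" column becomes 10
my_data$doubled <- my_data$numbers * 2 # Assigning to a new name adds a column
my_data
```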

Functions

A function:

Is a block of code that runs only when it is called
Takes arguments…
Does something to them
Returns an object built from the arguments passed

Some functions we’ve used so far are:

length()
dim()
mean()
aes()
what else?
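
As a sketch of how arguments change what a function returns, here is mean() with and without na.rm. The vector fish_lengths is hypothetical.

```r
# Hypothetical lengths with one missing value
fish_lengths <- c(10, NA, 30)

mean(x = fish_lengths) # Returns NA because of the missing value

mean_length <- mean(x = fish_lengths, na.rm = TRUE) # na.rm = TRUE drops NAs first
mean_length
```

Naming the arguments (x = , na.rm = ) makes it clear what each value is for.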

Pipes and pipelines

Pipes

Imagine having the following numeric vector

# Area of square surveyed plots (in sq m)
areas <- c(1, 5, 2, 7, 9, 2, 3, 6, 3, 2, 1, 1, 6, 8, 3, 2, 1, 6, 1, 2, 8, 3,
           6, 8, 9, 0, 5, 3, 2, 1, 2, 6, 9, 2, 3, 6, 3, 2, 1, 1, 6, 8, 3, 2)

And wanting to know how many distinct sizes of square plot have a side longer than 2 m

Before 2013 you had two options:

1: Call one function at a time

# This creates too many objects
sides <- sqrt(areas) # Calculate square root to get side length
sides_larger_than_2 <- sides[sides > 2] # Extract sides > 2
types <- unique(sides_larger_than_2) # Get unique "types" of plot
how_many <- length(types) # Count how many "types" of plot
how_many

[1] 5

2: Use nested functions

# This is hard to read:
how_many <- length(unique(sqrt(areas)[sqrt(areas) > 2]))
how_many

[1] 5

Pipes After 2013

The magrittr package introduced the pipe operator “%>%”

“The Treachery of Images” by René Magritte

The pipe was later popularized by {dplyr}
Read “%>%” as “goes into”

library(magrittr) # Load the magrittr package
# Build a pipeline
how_many <- areas[sqrt(areas) > 2] %>% # Extract values with a side > 2
  sqrt() %>%  # Take their sqrt (not needed)
  unique() %>% # Get unique values
  length() # Get length of values

how_many

[1] 5

Pipes Since 2021

The R community saw the usefulness of this and developed a “native” pipe “|>” (available since R 4.1.0)

You no longer need to load a package to use a pipe

# Build a pipeline
how_many <- areas[sqrt(areas) > 2] |>  # Extract values with a side > 2
  sqrt() |>   # Take their sqrt (not needed)
  unique() |>  # Get unique values
  length() # Get length of values

how_many

[1] 5

Keyboard shortcut:
- MacOS: Cmd + Shift + M
- Windows: Ctrl + Shift + M
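
A minimal sketch showing that a pipeline and nested calls give the same result, assuming R 4.1.0 or newer for the native pipe. The values in areas and the object names are made up for illustration.

```r
areas <- c(1, 4, 9, 16)

# Nested: read from the inside out
nested <- round(mean(sqrt(areas)), digits = 1)

# Piped: read from top to bottom
piped <- areas |>
  sqrt() |>         # Same as sqrt(areas)
  mean() |>         # Same as mean(sqrt(areas))
  round(digits = 1) # Extra arguments go inside the parentheses

identical(nested, piped) # Both approaches return the same value
```

The value on the left of the pipe becomes the first argument of the function on the right.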

Let’s code

Functions

For example, calculate the mean $\bar{x} = \frac{\sum_{i = 1}^{N} x_i}{N} = \frac{x_1 + x_2 + \dots + x_N}{N}$

“Sum of all values from 1 to N, divided by the number of values”

# Example: calculate mean lionfish length "by hand"
sum(data_lionfish$total_length_mm) / length(data_lionfish$total_length_mm)

[1] 140.2202

If you have to do this for multiple variables, you might want to use a function

# Example: calculate mean lionfish length using a built-in function
mean(x = data_lionfish$total_length_mm)

[1] 140.2202

And if the function doesn’t exist, you can create one yourself (to be covered later)

my_mean <- function(var) {
  result <- sum(var) / length(var) # Sum of values divided by number of values
  return(result)
}
# Example: calculate mean lionfish length using a user-defined function
my_mean(var = data_lionfish$total_length_mm)

[1] 140.2202
