4/27/2021

TOC

  • Motivation: the problem
  • Common solutions
  • GNU Make
  • Work together

Disclaimer

  • This is not the solution to all [coding] problems
  • It is just another tool in your data science toolbox (albeit a useful one!)

Motivation: the problem

  • Coding is rarely linear
  • We often have to go back and change data, assumptions, transformations, filters, colors, font types
  • Projects can get really big really fast… and re-running a huge Rmd file is not an option
  • We rarely work alone (new data, new approaches, new questions…)

Motivation: an example

We’ll simulate everyone’s nightmare:

  • You have built your “coding pipeline” and are ready to share results
  • Then, you (or a colleague) find an error in the raw data (crisis!!!)
  • You fix the error, and now have to rerun the ENTIRE analysis. Chaos:
  • Which script did what, again?
  • What depended on what?
  • Yes, organization solves a lot of this (but who’s got time for that?)

[Self-]Motivation:

The solution lies in the problem

The problem lies in the solution to the original problem

Consider a case

data
    |_raw_data.csv
    |_clean_data.csv
scripts
    |_clean_data.R
    |_plot_data.R
results
    |_figure1.png

A potential solution: lapply

# Identify all scripts (full.names = TRUE keeps the "scripts/" prefix so source() can find them)
scripts <- list.files(path = "scripts",
                      pattern = "\\.R$",
                      full.names = TRUE)

# Run them all
lapply(scripts, source)

A potential solution: purrr + furrr

# Load the parallel mapping tools
library(future)
library(furrr)

# Identify all scripts (full paths, so source() can find them)
scripts <- list.files(path = "scripts",
                      pattern = "\\.R$",
                      full.names = TRUE)

# Run them all... in parallel
plan(multisession, workers = 4) # Use four cores
future_walk(scripts, source)    # Walk through files

A potential solution: run_all.R

Have a script:

# Run all scripts
# This script runs all the code in my project, from scratch
source("scripts/clean_data.R")      # To clean the data
source("scripts/plot_data.R")       # To plot the data

And either call source("run_all.R") or manually source only the scripts we think we need to run.

Problems with these

  • Do I even need to re-run everything?

  • What if variables / values are left in my environment?

  • It worked when I wrote it, but not anymore

  • What if the timing is off and scripts run in the wrong order?

Existing solutions

Many R packages

  • Many, many, many other great solutions…

Existing solutions

But some shortcomings

  • R-specific (hinders collaboration)
  • Things keep changing (for better, but still)

  • They really are just leveraging an existing infrastructure

Enter make

Overview of make and Makefile

From GNU’s website:

“GNU Make is a tool which controls the generation of executables and other non-source files of a program from the program’s source files.”

How does it work?

  • make “looks” for a file called Makefile
  • You write that Makefile, listing all the good stuff
  • What’s “the good stuff”?

The good stuff:

  • Targets: What needs to be created
  • Prerequisites: The dependencies
  • Commands: How do we go from dependency to target?
  • Together, these make a rule
  • We specify them as:
target: prerequisite
  command
  • Note on notation: the command line must be indented with a tab character (spaces will not work)

A lame example

taco: recipe fridge/tortilla fridge/meat fridge/salsa
  follow recipe
  
happiness: taco           #A target can be a prerequisite
  eat taco

Short example

Remember our previous case?

data
    |_raw_data.csv
    |_clean_data.csv
scripts
    |_clean_data.R
    |_plot_data.R
results
    |_figure1.png

What’s the Makefile for this project?

Recall:

target: prerequisites
  command

So, then:

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  Rscript scripts/plot_data.R

data/clean_data.csv: scripts/clean_data.R data/raw_data.csv
  Rscript scripts/clean_data.R

And one command does it all: make
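
A common convention (a sketch, not required for the example above) is to add a phony all target at the top, so the default goal is explicit and easy to extend later. Recipe lines below are indented with a tab:

.PHONY: all                  # "all" is not a file, just a convenient entry point
all: results/figure1.png     # plain `make` now builds the final figure (and its dependencies)

results/figure1.png: scripts/plot_data.R data/clean_data.csv
	Rscript scripts/plot_data.R

data/clean_data.csv: scripts/clean_data.R data/raw_data.csv
	Rscript scripts/clean_data.R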

Cool things

Making it into a graph

For this, we need graphviz and makefile2graph

make -Bnd | make2graph | dot -Tpng -o makefile-dag.png
  • -Bnd tells make to “B: Unconditionally make all targets, n: just print, d: print debug info”
  • | is just the OG pipe
  • -Tpng tells dot to use png as the output format
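
If the tools are missing, one way to get them (a sketch; the package manager and the repository location are assumptions about your setup):

# Graphviz provides `dot`; makefile2graph provides `make2graph`
sudo apt-get install graphviz          # or: brew install graphviz
git clone https://github.com/lindenb/makefile2graph.git
cd makefile2graph && make              # builds the make2graph binary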

Making it into a graph

[Figure: makefile-dag.png, the dependency graph produced by the command above]

Automatic variables

  • $@: the file name of the target
  • $<: the name of the first prerequisite
  • $^: the names of all prerequisites
  • $(@D): the directory part of the target
  • $(@F): the file part of the target
  • $(<D): the directory part of the first prerequisite
  • $(<F): the file part of the first prerequisite
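
For example, the figure rule from the short example could be written with $< (a quick sketch; the behavior is identical, only the notation changes):

results/figure1.png: scripts/plot_data.R data/clean_data.csv
	Rscript $<    # $< expands to scripts/plot_data.R; $@ would expand to results/figure1.png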

Automatic variables

Let’s use two: $(<D) and $(<F)

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  Rscript scripts/plot_data.R

Can be written as

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  cd $(<D);Rscript $(<F)

Pattern rules

Imagine you have something like this:

results/figure1.png: scripts/figure1.R
  Rscript scripts/figure1.R

results/figure2.png: scripts/figure2.R
  Rscript scripts/figure2.R

results/figure3.png: scripts/figure3.R
  Rscript scripts/figure3.R

results/figure4.png: scripts/figure4.R
  Rscript scripts/figure4.R

results/figure5.png: scripts/figure5.R
  Rscript scripts/figure5.R
...

Pattern rules

Writing out this chunk n times presents an opportunity for αn errors:

results/figure1.png: scripts/figure1.R
  cd $(<D);Rscript $(<F)

Instead, we use pattern rules with %, which acts as a wildcard (think of it as make’s version of *)

results/%.png: scripts/%.R
    cd $(<D);Rscript $(<F)

(I think it’s cool that the workflow induces standardization)
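
Pattern rules also pair nicely with wildcards. Here is a sketch of an aggregate target that builds every figure script found on disk (the FIGS and figures names are made up here):

# Map scripts/figureN.R -> results/figureN.png for every figure script present
FIGS := $(patsubst scripts/%.R,results/%.png,$(wildcard scripts/figure*.R))

.PHONY: figures
figures: $(FIGS)             # `make figures` builds every figure

results/%.png: scripts/%.R
	cd $(<D);Rscript $(<F)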

Variables

You can define variables in the Makefile and reuse them across rules:

R_OPTS=--no-save --no-restore --no-init-file --no-site-file

results/%.png: scripts/%.R
    cd $(<D);Rscript $(R_OPTS) $(<F)
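
A handy side effect (a standard make feature, sketched here): a variable defined this way can be overridden from the command line for a single run, without editing the Makefile. The --vanilla flag below is just an illustration:

make R_OPTS="--vanilla"    # overrides the R_OPTS defined in the Makefile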

Hands-on

Today’s project

draft.Rmd
draft.html
scripts
  |_00_clean_data.R
  |_01_figure_1.R
  |_02_figure_2.R
  |_03_regression.R
results
  |_img
  |  |_ first_year.png
  |  |_ time_sereies.png
  |_tab
     |_ reg_table.html

Today’s project

  • make the project (a sketch of a possible Makefile follows below)
  • Simulate getting a “fixed” version of the data and re-making the project (end-to-end update)
  • Simulate making one change (partial update)
  • Simulate getting NEW (or more?) data and re-making the project once again (end-to-end update)
  • Finally, create your own code and add a target / prerequisite pair (partial update)
  • If we have time, we’ll generalize some of the code (less typing!)
  • If we have time, we’ll automate the generation of the workflow image
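
As a starting point, a possible Makefile for this project is sketched below. The data file names (data/raw_data.csv, data/clean_data.csv), which script produces which output, and the rendering command are all assumptions; adjust them to match the real project.

# Hypothetical sketch; prerequisites and data paths are assumptions
draft.html: draft.Rmd results/img/first_year.png results/img/time_sereies.png results/tab/reg_table.html
	Rscript -e "rmarkdown::render('draft.Rmd')"

results/img/first_year.png: scripts/01_figure_1.R data/clean_data.csv
	Rscript $<

results/img/time_sereies.png: scripts/02_figure_2.R data/clean_data.csv
	Rscript $<

results/tab/reg_table.html: scripts/03_regression.R data/clean_data.csv
	Rscript $<

data/clean_data.csv: scripts/00_clean_data.R data/raw_data.csv
	Rscript $<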

Other resources