4/27/2021

TOC

  • Motivation: the problem
  • Common solutions
  • GNU Make
  • Work together

Disclaimer

  • This is not the solution to all [coding] problems
  • It is just another tool in your data science toolbox (albeit a useful one!)

Motivation: the problem

  • Coding is rarely linear
  • We often have to go back and change data, assumptions, transformations, filters, colors, font types
  • Projects can get really big really fast… and re-running a huge Rmd file is not an option
  • We rarely work alone (new data, new approaches, new questions…)

Motivation: an example

We’ll simulate everyone’s nightmare:

  • You have built your “coding pipeline” and are ready to share results
  • Then, you (or a colleague) find an error in the raw data (crisis!!!)
  • You fix the error, and now have to rerun the ENTIRE analysis. Chaos:
  • Which script did what, again?
  • What depended on what?
  • Yes, organization solves a lot of this (but who’s got time for that?)

[Self-]Motivation:

The solution lies in the problem

The problem lies in the solution to the original problem

Consider a case

data
    |_raw_data.csv
    |_clean_data.csv
scripts
    |_clean_data.R
    |_plot_data.R
results
    |_figure1.png

A potential solution: lapply

# Identify all scripts (full.names = TRUE keeps the "scripts/" prefix so source() can find them)
scripts <- list.files(path = "scripts",
                      pattern = "\\.R$",
                      full.names = TRUE)

# Run them all
lapply(scripts, source)

A potential solution: purrr + furrr

# Load the parallel mapping tools
library(future)
library(furrr)

# Identify all scripts (full paths, so source() can find them)
scripts <- list.files(path = "scripts",
                      pattern = "\\.R$",
                      full.names = TRUE)

# Run them all... in parallel
plan(multisession, workers = 4) # Use four cores
future_walk(scripts, source)    # Walk through files

A potential solution: run_all.R

Have a script:

# Run all scripts
# This script runs all the code in my project, from scratch
source("scripts/clean_data.R")      # To clean the data
source("scripts/plot_data.R")       # To plot the data

And either call source("run_all.R") or manually source only the scripts we think we need to run.

Problems with these

  • Do I even need to re-run everything?

  • What if variables / values are left in my environment?

  • It worked when I wrote it, but not anymore

  • What if the timing is off and scripts run in the wrong order?

Existing solutions

Many R packages

  • Many, many, many other great solutions…

Existing solutions

But some shortcomings

  • R-specific (hinders collaboration)
  • Things keep changing (for better, but still)

  • They really are just leveraging an existing infrastructure

Enter make

Overview of make and Makefile

From GNU’s website:

“GNU Make is a tool which controls the generation of executables and other non-source files of a program from the program’s source files.”

How does it work?

  • make “looks” for a file called Makefile
  • You write that Makefile, listing all the good stuff
  • What’s “the good stuff”?

The good stuff:

  • Targets: What needs to be created
  • Prerequisites: The dependencies
  • Commands: How do we go from dependency to target?
  • Together, these make a rule
  • We specify them as:
target: prerequisite
  command
  • Note on notation: the command line must be indented with a tab character (spaces will not work)

A lame example

taco: recipe fridge/tortilla fridge/meat fridge/salsa
  follow recipe
  
happiness: taco           #A target can be a prerequisite
  eat taco

Short example

Remember our previous case?

data
    |_raw_data.csv
    |_clean_data.csv
scripts
    |_clean_data.R
    |_plot_data.R
results
    |_figure1.png

What’s the Makefile for this project?

Recall:

target: prerequisites
  command

So, then:

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  Rscript scripts/plot_data.R

data/clean_data.csv: scripts/clean_data.R data/raw_data.csv
  Rscript scripts/clean_data.R

And one command does it all: make
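
A common convention (a sketch, not required for the example above) is to add a phony all target at the top, so the default goal is explicit and easy to extend later. Recipe lines below are indented with a tab:

.PHONY: all                  # "all" is not a file, just a convenient entry point
all: results/figure1.png     # plain `make` now builds the final figure (and its dependencies)

results/figure1.png: scripts/plot_data.R data/clean_data.csv
	Rscript scripts/plot_data.R

data/clean_data.csv: scripts/clean_data.R data/raw_data.csv
	Rscript scripts/clean_data.R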

Cool things

Making it into a graph

For this, we need graphviz and makefile2graph

make -Bnd | make2graph | dot -Tpng -o makefile-dag.png
  • -Bnd tells make to “B: Unconditionally make all targets, n: just print, d: print debug info”
  • | is just the OG pipe
  • -Tpng tells dot to use png as the output format
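
If the tools are missing, one way to get them (a sketch; the package manager and the repository location are assumptions about your setup):

# Graphviz provides `dot`; makefile2graph provides `make2graph`
sudo apt-get install graphviz          # or: brew install graphviz
git clone https://github.com/lindenb/makefile2graph.git
cd makefile2graph && make              # builds the make2graph binary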

Making it into a graph

[Figure: makefile-dag.png, the dependency graph produced by the command above]

Automatic variables

  • $@: the file name of the target
  • $<: the name of the first prerequisite
  • $^: the names of all prerequisites
  • $(@D): the directory part of the target
  • $(@F): the file part of the target
  • $(<D): the directory part of the first prerequisite
  • $(<F): the file part of the first prerequisite
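
For example, the figure rule from the short example could be written with $< (a quick sketch; the behavior is identical, only the notation changes):

results/figure1.png: scripts/plot_data.R data/clean_data.csv
	Rscript $<    # $< expands to scripts/plot_data.R; $@ would expand to results/figure1.png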

Automatic variables

Let’s use two: $(<D) and $(<F)

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  Rscript scripts/plot_data.R

Can be written as

results/figure1.png: scripts/plot_data.R data/clean_data.csv
  cd $(<D);Rscript $(<F)

Pattern rules

Imagine you have something like this:

results/figure1.png: scripts/figure1.R
  Rscript scripts/figure1.R

results/figure2.png: scripts/figure2.R
  Rscript scripts/figure2.R

results/figure3.png: scripts/figure3.R
  Rscript scripts/figure3.R

results/figure4.png: scripts/figure4.R
  Rscript scripts/figure4.R

results/figure5.png: scripts/figure5.R
  Rscript scripts/figure5.R
...

Pattern rules

Writing out this chunk n times presents an opportunity for αn errors:

results/figure1.png: scripts/figure1.R
  cd $(<D);Rscript $(<F)

Instead, we use pattern rules with %, which acts as a wildcard (think of it as make’s version of *)

results/%.png: scripts/%.R
    cd $(<D);Rscript $(<F)

(I think it’s cool that the workflow induces standardization)
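
Pattern rules also pair nicely with wildcards. Here is a sketch of an aggregate target that builds every figure script found on disk (the FIGS and figures names are made up here):

# Map scripts/figureN.R -> results/figureN.png for every figure script present
FIGS := $(patsubst scripts/%.R,results/%.png,$(wildcard scripts/figure*.R))

.PHONY: figures
figures: $(FIGS)             # `make figures` builds every figure

results/%.png: scripts/%.R
	cd $(<D);Rscript $(<F)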

Variables

You can define variables in the Makefile and reuse them across rules:

R_OPTS=--no-save --no-restore --no-init-file --no-site-file

results/%.png: scripts/%.R
    cd $(<D);Rscript $(R_OPTS) $(<F)
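
A handy side effect (a standard make feature, sketched here): a variable defined this way can be overridden from the command line for a single run, without editing the Makefile. The --vanilla flag below is just an illustration:

make R_OPTS="--vanilla"    # overrides the R_OPTS defined in the Makefile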

Hands-on

Today’s project

draft.Rmd
draft.html
scripts
  |_00_clean_data.R
  |_01_figure_1.R
  |_02_figure_2.R
  |_03_regression.R
results
  |_img
  |  |_ first_year.png
  |  |_ time_sereies.png
  |_tab
     |_ reg_table.html

Today’s project

  • make the project (a sketch of a possible Makefile follows below)
  • Simulate getting a “fixed” version of the data and re-making the project (end-to-end update)
  • Simulate making one change (partial update)
  • Simulate getting NEW (or more?) data and re-making the project once again (end-to-end update)
  • Finally, create your own code and add a target / prerequisite pair (partial update)
  • If we have time, we’ll generalize some of the code (less typing!)
  • If we have time, we’ll automate the generation of the workflow image
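
As a starting point, a possible Makefile for this project is sketched below. The data file names (data/raw_data.csv, data/clean_data.csv), which script produces which output, and the rendering command are all assumptions; adjust them to match the real project.

# Hypothetical sketch; prerequisites and data paths are assumptions
draft.html: draft.Rmd results/img/first_year.png results/img/time_sereies.png results/tab/reg_table.html
	Rscript -e "rmarkdown::render('draft.Rmd')"

results/img/first_year.png: scripts/01_figure_1.R data/clean_data.csv
	Rscript $<

results/img/time_sereies.png: scripts/02_figure_2.R data/clean_data.csv
	Rscript $<

results/tab/reg_table.html: scripts/03_regression.R data/clean_data.csv
	Rscript $<

data/clean_data.csv: scripts/00_clean_data.R data/raw_data.csv
	Rscript $<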

Other resources