EVR 628 - Intro to Environmental Data Science

Keeping track of your code with Git and GitHub

Juan Carlos Villaseñor-Derbez (JC)

Learning Objectives

By the end of this week, you should be able to:

  • Understand the importance of reproducibility
  • Create the basic structure for an organized project
  • Explain the difference between Git and GitHub
  • Create and “clone” a “repo”
  • “commit” and “push”

Learning Git and GitHub may (will) be painful

  • Usually taught later in courses
  • Limited use if you learn it at the end
  • I would rather push other materials to later in the course and have you master this early

Today’s Agenda

  • Reproducible Research
  • File Structure & Organization
  • Version Control with Git
  • GitHub & Collaboration
  • Hands-on: Building Your Repository

Version control and reproducibility

Why Should You Care?

  • Career advancement: Version control is a required skill in data science
  • Collaboration: Work with others without breaking their code
  • Reproducibility: Essential for scientific integrity / managing the real world
  • Maintain versions through time
  • Backup & recovery: Never lose your work again
  • Portfolio building: Showcase your projects to employers

The Problem: Research can be chaotic

  • File naming: analysis_final_v2_really_final_USE_THIS_ONE.R
    • (And the filenames loaded therein might look even worse)
  • Lost work: “Where did I put that working version?”
  • Collaboration conflicts: “Who changed what and when?”
  • Reproducibility failures: “Why do we get different results?”
  • Publication delays: “I can’t reproduce my own results”

Reproducible Research

Research that can be repeated by others using the same data and methods

Key Components:

  • Data: Raw, processed, and metadata
  • Code: Scripts, functions, and workflows
  • Documentation: Methods, assumptions, and decisions
  • Environment: Software versions and dependencies
  • Results: Outputs, tables, figures

The Reproducibility Crisis

  • More than 70% of researchers have tried and failed to reproduce another scientist’s experiments (Baker 2016)
  • More than half of researchers have failed to reproduce their own experiments (Baker 2016)
  • About $28 billion is spent annually in the United States on irreproducible pre-clinical research (Freedman, Cockburn, and Simcoe 2016)
  • Environmental science is not immune to these issues

Real-World Example: Climate Data Analysis

Scenario: You’re analyzing sea surface temperature trends

Without version control:

  • Multiple versions of scripts with unclear differences
  • No record of data processing steps
  • Can’t track which parameters produced which results
  • Collaboration becomes chaotic

With version control:

  • Clear history of all changes
  • Reproducible workflow from raw data to final results
  • Easy collaboration and peer review
  • Confidence in your results

My drawer folder of shame

  • Who did what?
  • When?
  • Why?

Organizing your projects

File vs path

Folder ≠ file

  • Folder is an address (where in your computer does this file live?)
  • File name is the actual file
  • When reading / writing data from R, we must specify where the file lives
  • Folders are represented as my_folder/
  • Files usually have an .extension, which tells us the type of file
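
A minimal shell sketch of the folder / file distinction. The path data/raw/sst_2024.csv is hypothetical and used only for illustration; dirname and basename are standard command-line utilities.

```shell
# Hypothetical path, used only for illustration
path="data/raw/sst_2024.csv"

folder=$(dirname "$path")    # the address: where the file lives
file=$(basename "$path")     # the file itself
extension="${file##*.}"      # everything after the last dot: the file type

echo "folder:    $folder/"       # folder:    data/raw/
echo "file:      $file"          # file:      sst_2024.csv
echo "extension: .$extension"    # extension: .csv
```

The same split applies in R: a reading function needs the folder and the file name together as one path.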

Project Structure Fundamentals

Important

Good file organization is the foundation of reproducible research

Principles:

  • Clarity: Names should be self-explanatory
  • Separation: Keep raw data, processed data, and outputs separate
  • Consistency: Respect folder structure
  • Documentation: Include README files and metadata
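
One way to set up a structure that follows these principles, sketched in shell. The project name (sst_trends) and the folder names are assumptions: a common convention, not a course requirement. The project is created in a temporary folder so nothing on your computer is overwritten.

```shell
# Hypothetical project, created inside a temporary folder
project="$(mktemp -d)/sst_trends"

# Separation: raw data, processed data, code, and outputs each get a folder
mkdir -p "$project/data/raw" "$project/data/processed" \
         "$project/scripts" "$project/results/figures"

# Documentation: a README says what the project is and how it is organized
printf '# sst_trends\n\nRaw data lives in data/raw and is never edited by hand.\n' \
  > "$project/README.md"

# List everything that was created
find "$project" | sort
```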

Example

What’s missing?

File Naming Best Practices

DO:

  • Use descriptive names: sea_surface_temp_analysis.R
    • I prefer snake_case, but CamelCase is appropriate too
  • Use consistent separators: underscores or hyphens

DON’T:

  • Use spaces: my file.R
  • Use special characters: analysis@final.R
  • Use vague names: stuff.R
  • Use different naming conventions within the same project
  • Use version numbers in code file names: analysis_v1.2.R is no better than analysis_final_v2_really_final_USE_THIS_ONE.R

Sometimes: include dates (e.g. data_2024_01_15.csv)
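
A rough sketch of how the first two DON’Ts could be checked automatically. The function name is_clean_name is hypothetical. It only catches spaces and special characters; it cannot tell whether a name is vague or inconsistent with the rest of the project.

```shell
# Succeeds when a name contains only letters, digits, underscores,
# hyphens, and dots; fails on spaces and any other character.
is_clean_name() {
  case "$1" in
    "" | *[!A-Za-z0-9_.-]*) return 1 ;;
    *) return 0 ;;
  esac
}

for name in "sea_surface_temp_analysis.R" "my file.R" "analysis@final.R"; do
  if is_clean_name "$name"; then
    echo "ok:     $name"
  else
    echo "rename: $name"
  fi
done
```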

Naming conventions

i_use_snake_case                   # This is my preference

otherPeopleUseCamelCase            # This sometimes works

some.people.use.periods            # This is dangerous, especially in Python

And_aFew.People_RENOUNCEconvention # You need help

ALL_CAPS                           # Reserved for super important stuff

Version control 101

What is Version Control?

A system that tracks changes to files over time

Think of it as:

  • A time machine for your code
  • A detailed history book of your project
  • A safety net for your work
  • A collaboration tool for teams

Git vs GitHub

Git

  • Software
  • Something you install on your computer
  • A command line tool (but we can bypass that)

GitHub

  • A service (offered by Microsoft)
  • Lives “on the cloud”
  • Provides a graphical interface
  • How you connect your computer to your colleague’s computer
  • A backup of your code

Git vs GitHub

Source: https://www.edureka.co/blog/git-vs-github/

Git

Git was created by Linus Torvalds (creator of Linux) in 2005

Key Features:

  • Free and open source: No licensing costs
  • Ample documentation
  • The most popular version control system

Four Git concepts

Repository

Clone

Commit

Push / Pull

Git Concepts: Repository

Repository (repo): A directory containing your project and its complete history

Think of it as:

  • A project folder with superpowers
  • A database of all your changes
  • A collaboration space for your team

Types:

  • Local repository: On your computer
  • Remote repository: On a server (like GitHub)

Example

  • The folder on my computer called mex_ports is the local repo
  • What you saw online was the remote repo
  • 1 folder = 1 RStudio project = 1 repo

Git Concepts: Cloning

Cloning: The process of bringing a repo from the remote to local for the first time

When you clone a repo:

  • You get all files available in the remote, at that point in time
  • You only do this once per project (unless you have more than one computer)
  • RStudio has a nice way of doing it without having to use the command line
  • Cloning is just Git’s term for “download and connect remote to local”
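
A minimal sketch of what cloning does, assuming Git is installed. So that it runs offline, a local folder (remote_repo, a hypothetical name) stands in for GitHub; with a real remote you would give git clone the repository’s URL instead. The placeholder identity is there only so the example commit works on a machine where Git has not been set up.

```shell
# Placeholder identity (assumption) so the commit works without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)

# Stand-in for the remote: a repo with one committed file
git init -q "$work/remote_repo"
echo "# mex_ports" > "$work/remote_repo/README.md"
git -C "$work/remote_repo" add README.md
git -C "$work/remote_repo" commit -q -m "Add README"

# Clone: download the files AND remember where they came from
git clone -q "$work/remote_repo" "$work/local_repo"

ls "$work/local_repo"                  # the files as they were in the remote at clone time
git -C "$work/local_repo" remote -v    # the connection back to the remote, named "origin"
```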

Git Concepts: Commit

Commit: A snapshot of your project at a specific point in time

Each commit contains:

  • Changes: What files were modified
  • Message: Description of what was changed and why
  • Author: Who made the changes
  • Timestamp: When the changes were made
  • Unique ID: A hash that identifies this commit

Think of commits as:

  • Checkpoints in your project’s history
  • Save points in a video game
  • Pages in a project’s storybook
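
To see these pieces in an actual commit, here is a small sketch, assuming Git is installed. The file figure_2.R and the author identity are placeholders.

```shell
# Placeholder identity (assumption) so the commit works without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"

# Make one commit to inspect
echo "plot(sst)" > "$repo/figure_2.R"
git -C "$repo" add figure_2.R
git -C "$repo" commit -q -m "Add script for figure 2"

# Unique ID, author, timestamp, and message of the latest commit
git -C "$repo" log -1 \
  --format='Unique ID: %H%nAuthor:    %an <%ae>%nTimestamp: %ad%nMessage:   %s'

# Changes: the files this commit touched
git -C "$repo" show --name-only --pretty=format: HEAD
```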

Git Concepts: Push / Pull

Pushing and pulling refer to uploading changes to the remote and downloading changes from it

  • With a push, you decide when you send a bundle of commits
  • With a pull, you decide when you bring down your colleague’s version of the code
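
A sketch of push and pull between two copies of the same repo, assuming Git is installed. A local “bare” repository (remote.git) stands in for GitHub so the example runs offline; the folder names mine and colleague, and the R file, are hypothetical.

```shell
# Placeholder identity (assumption) so commits work without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)

# Stand-in for GitHub: a bare repository (history only, no working files)
git init -q --bare "$work/remote.git"

# I clone, commit, and push (Git warns that the remote is empty; that is expected)
git clone -q "$work/remote.git" "$work/mine"
echo "ports <- read.csv('data/raw/ports.csv')" > "$work/mine/read_ports.R"
git -C "$work/mine" add read_ports.R
git -C "$work/mine" commit -q -m "Add script to read port data"
git -C "$work/mine" push -q -u origin HEAD

# My colleague clones and gets everything pushed so far
git clone -q "$work/remote.git" "$work/colleague"

# I commit again and decide when to push
echo "summary(ports)" >> "$work/mine/read_ports.R"
git -C "$work/mine" commit -q -am "Summarize port data"
git -C "$work/mine" push -q

# My colleague decides when to pull
git -C "$work/colleague" pull -q
git -C "$work/colleague" log --oneline    # both commits are now here
```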

Git Workflow: The Basic Cycle

  1. Modify files in your working directory
  2. Stage changes you want to commit
  3. Commit staged changes with a message
  4. Repeat as you work on your project

  5. Push the changes up to GitHub

This cycle creates a timeline of your project’s development
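
The cycle above, sketched on the command line (RStudio’s Git pane does the same steps with buttons). This assumes Git is installed; analysis.R and the author identity are placeholders.

```shell
# Placeholder identity (assumption) so commits work without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# 1. Modify files in the working directory
echo "sst <- read.csv('data/raw/sst.csv')" > analysis.R
git status --short    # "?? analysis.R": Git sees a new, untracked file

# 2. Stage the changes you want in the next commit
git add analysis.R
git status --short    # "A  analysis.R": staged and ready

# 3. Commit the staged changes with a message
git commit -q -m "Add script to read sea surface temperature data"

# 4. Repeat as you work
echo "summary(sst)" >> analysis.R
git add analysis.R
git commit -q -m "Summarize sea surface temperature data"

git log --oneline     # the timeline so far, newest first
# git push            # once a remote is connected, this uploads the commits
```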

When / What to commit

  • No specific rule
  • Whatever a “change” might look like
  • Sometimes “Added color to points in figure 2” is enough (one line in one file)
  • Sometimes “Refactored code to work with new package” is appropriate (many lines in many files)
  • Small commits and frequent commits are best

Writing Good Commit Messages

Good commit messages:

  • Are descriptive and specific
  • Explain WHAT and/or WHY and/or HOW things changed
  • Are concise but informative

Examples:

  • Good: Add sea surface temperature data processing function
  • Good: Fix bug in monthly trend calculation
  • Good: Update README with installation instructions
  • Bad: stuff
  • Bad: fixed things

GitHub: The Social Platform for Code

GitHub is a web-based platform that hosts Git repositories (including this course)

Key Features:

  • Free hosting for public repositories
  • Collaboration tools: Issues, pull requests, discussions, contact authors
  • Project management: Wikis, project boards, releases
  • Integration: Connects with other tools and services
  • Portfolio: Showcase your work to potential employers

GitHub vs DropBox/Box/GoogleDrive…

  • Designed for this task: tracking and sharing code
  • You have granular control over what is updated and when
  • Allows multiple versions of the same project to evolve in parallel
    • e.g. Your pipeline is done, mine isn’t yet

An example

  • mex_ports repository
    • I created a new R script on August 1st
    • Emma (student intern) worked on other files in the repo Aug 5 - Aug 11
    • Emma was having trouble reading data I had created
    • I had made a minor mistake, which I fixed
  • Note how I simply overwrote the file
  • No need to have V1 (didn’t work) and V2 (works) in the same repo
  • If I ever want to go back to an earlier version, I can!
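
A sketch of the same idea on a toy repo, assuming Git is installed. The file export_ports.R and the mistake in it (a misspelled file extension) are invented for illustration; they are not the actual mex_ports files.

```shell
# Placeholder identity (assumption) so commits work without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"

# First version, with a (hypothetical) mistake: "cvs" instead of "csv"
echo "write.csv(ports, 'ports.cvs')" > "$repo/export_ports.R"
git -C "$repo" add export_ports.R
git -C "$repo" commit -q -m "Add script to export port data"

# The fix simply overwrites the same file
echo "write.csv(ports, 'ports.csv')" > "$repo/export_ports.R"
git -C "$repo" commit -q -am "Fix file extension of exported port data"

ls "$repo"                                        # still one file, no V1 / V2 copies
git -C "$repo" log --oneline -- export_ports.R    # but two versions in the history
git -C "$repo" show HEAD~1:export_ports.R         # print the earlier version
```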

Resources for Learning More

Git and GitHub:

Next class

Hands-on: We’ll play with Git and GitHub

Install git and create a GitHub account

Instructions are available under Week 3 on the course website

Learning Objectives - Revisited

By the end of this week, you should be able to:

  • Understand the importance of reproducibility
  • Create the basic structure for an organized project
  • Explain the difference between Git and GitHub
  • Create and “clone” a “repo”
  • “commit” and “push”

Assignment

Assignment: Build Your Portfolio

Due Sunday, September 14, 2025 11:59 PM via Canvas:

  1. Create a GitHub repository for your portfolio (I suggest you call it portfolio)
  2. The repository should be public, or shared with my GitHub username: jcvdav
  3. Add information to your README.md file
  4. Update your .gitignore file
  5. Make at least 2 meaningful commits
  6. Share your repository link via Canvas
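
If you have not met a .gitignore file before: it lists files Git should leave out of the repo. A small sketch, assuming Git is installed; the four patterns are common choices for R projects, not a required list.

```shell
repo=$(mktemp -d)
git init -q "$repo"

# Patterns for files that RStudio and the operating system create on their own
printf '%s\n' '.Rproj.user/' '.Rhistory' '.RData' '.DS_Store' > "$repo/.gitignore"

# One file we want to track, one we do not
touch "$repo/analysis.R" "$repo/.Rhistory"

git -C "$repo" status --short               # lists .gitignore and analysis.R, not .Rhistory
git -C "$repo" check-ignore -v .Rhistory    # shows which pattern matched
```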

Grading criteria:

Repository:

  • exists
  • has a README.md file with content (name, about you, about the project)
  • has a .gitignore file
  • has at least two commits, in addition to the default one

Remember: Version control is a skill that gets better with practice. Start using Git for everything!

Extra slides

Essential Git Commands

We can do most things in RStudio, but you should know:

  • Getting started
    • git init: Create a new repository
    • git clone: Copy a repository from remote to local
  • Daily workflow
    • git status: See what’s changed
    • git add: Stage changes for commit
    • git commit: Save changes with a message
  • Connecting with GitHub
    • git push: Upload changes to remote repository
    • git pull: Download changes from remote repository

GitHub Workflows

By yourself

  • Clone: Download the repository to your computer
  • Commit: Make and save your changes
  • Push: Upload your changes to GitHub

In a group

  • Clone: Download the repository to your computer
  • Branch: Create a separate line of development
  • Commit: Make and save your changes
  • Push: Upload your changes to GitHub
  • Pull Request: Propose changes to the original repository
  • Review: Get feedback and approval from collaborators
  • Merge: Integrate approved changes into the main project
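
The branch, commit, and merge steps can be tried locally, as in the sketch below (assuming Git is installed). Pull requests and reviews happen on GitHub, so here a plain git merge stands in for an approved pull request. The branch and file names are hypothetical.

```shell
# Placeholder identity (assumption) so commits work without prior Git setup
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "# shared project" > README.md
git add README.md
git commit -q -m "Add README"
main_branch=$(git rev-parse --abbrev-ref HEAD)    # "main" or "master", depending on setup

# Branch: a separate line of development
git checkout -q -b add-trend-plot
echo "plot(trend)" > trend_plot.R
git add trend_plot.R
git commit -q -m "Add monthly trend plot"

# The main line is untouched while the branch evolves
git checkout -q "$main_branch"
ls    # only README.md

# Merge: integrate the branch into the main line
git merge -q add-trend-plot
ls    # README.md and trend_plot.R
```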

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (May): 452–54. https://doi.org/10.1038/533452a.
Braga, Pedro Henrique Pereira, Katherine Hébert, Emma J Hudgins, Eric R Scott, Brandon P M Edwards, Luna L Sánchez Reyes, Matthew J Grainger, et al. 2023. “Not Just for Programmers: How GitHub Can Accelerate Collaborative and Reproducible Research in Ecology and Evolution.” Methods in Ecology and Evolution 14 (June): 1364–80.
Freedman, Leonard P, Iain M Cockburn, and Timothy S Simcoe. 2016. “The Economics of Reproducibility in Preclinical Research.” PLOS Biology 13 (June): e1002165. https://doi.org/10.1371/journal.pbio.1002165.