Geographic data models

Vector data

Juan Carlos Villaseñor-Derbez (JC)

Introduction

GIS vs Geocomputation

GIS:

  • Powerful
  • Emphasize graphical user interface (GUI)
  • You can only do what the developers allow you to do
  • Difficult to reproduce

Geocomputation:

  • Just as powerful (same underlying code)
  • Command line interface (CLI)
  • Flexible
  • Reproducible

Academic research, software development and practical applications that use geographic data to solve problems, with a focus on reproducibility, flexibility and tool development. - Lovlace, Nowosad, Muenchow (GeocompR, 2024)

How do we get the world into a computer?

  • Imagine an attribute that occurs somewhere in the World:
    • SST
    • site identity
    • hurricane path
    • city boundaries
  • How do we represent it?
  • How do we work on it?
  • How do we store the data?

We need a model

Data can be represented in a vector model, or a raster model

In today’s class:

  • Introduce the why behind each model
  • Discuss where and when they are used
  • Focus on vector:
    • It’s main attributes
    • It’s implementation in R

Today’s goals

By the end of the lecture, you should be able to:

  • list the two main data models, and articulate their main characteristics
  • understand the principles of vector data
  • list the three main components of a spatial vector object
  • list the tree main types of a simple feature
  • given an sf object, plot it and inspect it

Two models

vector

  • World is represented using geometries: points, lines, and polygons
  • Discrete and well-defined borders = high precision
  • Computationally intensive

raster

  • Space is divided into equal-sized grid cells (or pixels)
  • Aggregate spatial features to a given resolution
  • You can lose some precision, but much faster to work with

Note

Spatial data can be represented in a vector or raster model

Example: Rosenstiel

Code
# Build the "spatial" part
rsmaes_pt_coords <- st_point(x = c(-80.1632879, 25.7328129),
                             dim = "XY") |> 
  st_sfc(crs = 4326)
# Combine with the "attributes" part
rsmaes_pt <- st_sf(id = "Rosenstiel (point)",
                   geometry = rsmaes_pt_coords)
# Visualize
mapview(rsmaes_pt)
Code
rsmaes_line_coords <- st_linestring(
  x = matrix(data = c(-80.163017, 25.733950,
                      -80.164236, 25.732816,
                      -80.163772, 25.732353,
                      -80.163924, 25.732148,
                      -80.163455, 25.731597,
                      -80.162187, 25.731605,
                      -80.160968, 25.732172,
                      -80.163017, 25.733950),
             ncol = 2,
             byrow = T),
  dim = "XY") |> 
  list() |> 
  st_sfc(crs = 4326)

rsmaes_line <- st_sf(id = "Rosenstiel (line)",
                     geometry = rsmaes_line_coords)

mapview(rsmaes_line)
Code
rsmaes_poly_coords <- st_polygon(
  x = list(
    matrix(
      data = c(
        -80.163017, 25.733950,
        -80.164236, 25.732816,
        -80.163772, 25.732353,
        -80.163924, 25.732148,
        -80.163455, 25.731597,
        -80.162187, 25.731605,
        -80.160968, 25.732172,
        -80.163017, 25.733950),
      ncol = 2,
      byrow = T)),
  dim = "XY") |> 
  st_sfc(crs = 4326)

rsmaes_poly <- st_sf(id = "Rosenstiel (polygon)",
                     geometry = rsmaes_poly_coords)

mapview(rsmaes_poly)
Code
rsmaes_rast <- rasterize(
  x = vect(rsmaes_poly),
  y = rast(resolution = 0.0001,
           crs = "EPSG:4326",
           val = 0,
           xmin = -80.168, xmax = -80.155,
           ymin = 25.730, ymax = 25.735),
  field = 1)

names(rsmaes_rast) <- "Rosenstiel"
rsmaes_rast[is.na(rsmaes_rast)] <- 0

mapview(rsmaes_rast,
        legend = F)

Common uses

vector

  • Common in social sciences:
    • Census tract boundaries
    • Political boundaries
    • Roads

raster

  • Common in natural sciences:
    • Sea surface temperature
    • Satellite imagery
    • Species distribution models
  • As environmental scientists and practitioners we will typically use both

  • The “correct” type of data varies by use case

  • You’ll be able to make this choice in a few weeks

raster is faster, but vector is correctr”

Vector data model

World is represented using geometries: points, lines, and polygons

Q: What is a line? What is a polygon?

A: A sequence of points

Q: How do we represent a point?

A: Coordinates

Q: What are coordinates?

A: Two numbers representing distance from an origin: a coordinate reference system

Note

The vector data model is based on points located within a coordinate reference system (CRS)

What is where? And what does “where” mean?

Coordinates, Coordinate Reference Systems, and Attributes

What does “where” mean?

Consider the following two coordinates:

  • (-80.1632879, 25.7328129)

  • (1186764, 2863941)

Where are they?

The importance of attributes

  • Coordinates and a CRS are enough to place the point on a map
  • Also enough to perform spatial analysis
  • But not enough to convey information about the location
  • Often not enough to derive insights
  • We also need attributes

Note

The three main components of a spatial vector object are attributes (what), geometry, and CRS (where)

Attributes

  • Are not inherently spatial
  • Can be a discrete or categorical variable representing a characteristic of interest:
    • Identity
    • Sea Surface Temperature
    • Time
  • Note that you can have more than one attribute tied to a geometry
    • 2 attributes: vessel-tracking data or hurricane path (id and time)
    • 2+ attributes: Buoy weather data (id, time, and all variables measured)

Handling spatial data in a computer

How are these compotents handled in a computer?

We rely on four key low-level libraries for geocomputation:

  • GDAL: read, write, and manipulating geographic data
  • PROJ: coordinate system transformation
  • GEOS: planar geometry engine only to be used with data with a projected CRS)
  • S2: a spherical geometry engine, written in C++ (fast) and developed by Google (reliable)

All are incredibly powerful and open source, but not exactly user-friendly

Ways of interacting with these libraries

The sf package

  • From R, our main point of access will be the sf package
  • When we load sf, it automatically links to these low level libraries
library(sf)
Linking to GEOS 3.13.0, GDAL 3.10.0, PROJ 9.5.1; sf_use_s2() is TRUE
  • sf provides:
    • A consistent command-line interface to communicate with the libraries
    • R classes and objects to handle vector data using the Simple Features standard

Introduction to Simple Features

Simple Feature

  • Simple features is an official, open standard (i.e. registered at the ISO)
  • Hierarchical data model to represent types of geometries (i.e. points, lines, polygons, multi-*, and collections)

The standard says:

“A simple feature is defined by the OpenGIS Abstract specification to have both spatial and non-spatial attributes. Spatial attributes are geometry valued, and simple features are based on 2D geometry with linear interpolation between vertices.”

—Pebesma, 2024

Quick example 1: working with sf objects

# Read the data
milton <- read_sf(here("data", "milton.gpkg"), quiet = T)
# Inspect it
class(milton)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"
# See its names
names(milton)
[1] "NAME"     "ISO_TIME" "USA_WIND" "USA_PRES" "USA_SSHS" "geom"    
# Plot it (note you get one plot per column, except for geometry)
plot(milton)

Quick example 2: sf object is just a data.frame

You can quickly calculate summary statistics for any attribute variable

Code
summary(milton$USA_WIND)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  25.00   34.00   75.00   81.87  134.00  150.00 

Keep only some columns

Code
milton_light <- milton[,c(2, 3)]
milton_light
Simple feature collection with 55 features and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -95.5 ymin: 20.6 xmax: -77.5 ymax: 29.5
Geodetic CRS:  WGS 84
# A tibble: 55 × 3
   ISO_TIME            USA_WIND         geom
   <dttm>                 <dbl>  <POINT [°]>
 1 2024-10-03 20:00:00       25 (-94.7 20.6)
 2 2024-10-03 23:00:00       25 (-94.8 20.7)
 3 2024-10-04 02:00:00       25 (-94.9 20.7)
 4 2024-10-04 05:00:00       25   (-95 20.7)
 5 2024-10-04 08:00:00       25   (-95 20.8)
 6 2024-10-04 11:00:00       25 (-95.1 20.8)
 7 2024-10-04 14:00:00       25 (-95.1 20.9)
 8 2024-10-04 17:00:00       28   (-95.1 21)
 9 2024-10-04 20:00:00       30 (-95.2 21.1)
10 2024-10-04 23:00:00       30 (-95.3 21.3)
# ℹ 45 more rows

Or plot the attribute variables

Code
ggplot(data = milton_light,
       mapping = aes(x = ISO_TIME,
                     y = USA_WIND)) +
  geom_line() +
  geom_point() +
  labs(x = "Date",
       y = "Maximum sustained wind speed (knots)")

Why simple features?

Why use simple features

  • STANDARD = transferability and reproducibility (important in science and policy making!)
  • Widely supported data model (e.g. QGIS and PostGIS)
  • Relatively quick I/O
  • Plotting performance (easy to use attributes to specify visualization characteristics)
  • Full R support: sf objects can be treated as data.frames (wrangling and visualization, pipe-friendly)

Visualizing simple features

(Just an overview for now)

Basic maps

You can quickly produce rough-and-ready maps using the base R function plot

plot(USA)

This will give you a multi-panel figure with one panel per attribute (default max of 9)

Other parameters to plot()

You can specify you only want one panel

plot(USA, max.plot = 1)

Other parameters to plot()

Or that you specificlaly want a column (either by name or by position)

plot(USA["woe_id"])

plot(USA[,18])

Adding layers

Use reset = FALSE and add = TRUE

plot(USA[,1], reset = F)
plot(FL, add = T, col = "orange")
plot(milton, add = T, col = "red")
plot(MEX, add = T, col = "gray")

Adding layers

reset = FALSE needed when plotting the first layer produces a key OR the extent needs to be updated as new layers come in

plot(USA$geometry)
plot(FL, add = T, col = "orange")
plot(milton, add = T, col = "red")
plot(MEX, add = T)

See ?sf::plot_sf() for more details

Basic = limitations

Base plot() is great for:

  • Rapidly visualizing your data (just a few lines of code)
  • Checking correspondence between layers

Base plot() is terrible for:

  • Building complex visualizations (including formal maps)

We will cover visualization in more detail in a few weeks. In the meantime, two quick ways

Visualizing sf objects with ggplot

p <- ggplot(data = USA) +
  geom_sf(fill = "gray90")

p

p +
  geom_sf(data = FL, fill = "orange") +
  geom_sf(data = milton, color = "red", size = 1) +
  geom_sf(data = MEX)

This is what I use to build 99% of my maps, including those for publications and proposals

Visualizing sf objects with mapview

  • Just like plot(), mapview produces quick visualizations
  • Customization is painful, but worth it (although leaflet may be better)
mapview(list(USA, FL, milton, MEX))

This is what I use to interactively explore my data

Note

For now, recall that you can quickly visualize sf objects using plot(), ggplot(), or mapview()

Geometry types

The main types of simple features

  • POINT zero-dimensional geometry containing a single point
  • LINESTRING sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry
  • POLYGON geometry with a positive area (two-dimensional); sequence of points form a closed, non-self intersecting ring; the first ring denotes the exterior ring, zero or more subsequent rings denote holes in this exterior ring
  • MULTIPOINT set of points
  • MULTILINESTRING set of linestrings
  • MULTIPOLYGON set of polygons
  • GEOMETRYCOLLECTION set of geometries of any type except GEOMETRYCOLLECTION

GeoCompR, 2024

sf objects

Remember the objects I built on the slides with tabs?

Point

class(rsmaes_pt)
[1] "sf"         "data.frame"
rsmaes_pt
Simple feature collection with 1 feature and 1 field
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.16329 ymin: 25.73281 xmax: -80.16329 ymax: 25.73281
Geodetic CRS:  WGS 84
                  id                   geometry
1 Rosenstiel (point) POINT (-80.16329 25.73281)

Polygon

class(rsmaes_poly)
[1] "sf"         "data.frame"
rsmaes_poly
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -80.16424 ymin: 25.7316 xmax: -80.16097 ymax: 25.73395
Geodetic CRS:  WGS 84
                    id                       geometry
1 Rosenstiel (polygon) POLYGON ((-80.16302 25.7339...

Note

Both are sf objects, which are data.frames /tibbles with a special column called geometry

Let’s dive deeper into how an object of class sf is made

The sf class

An sf object

Two main parts: geometries (spatial) and attributes (non-geographic)

GeoCompR, 2024 - Rhomboids are functions, rectangles are objects

rsmaes_pt <- c(-80.1632879, 25.7328129) |>      # Coordinates
  st_point() |>                                 # Build sfg
  st_sfc(crs = "ESRI:4326") |>                  # Build sfc (an specify CRS)
  st_sf(                                        # Build sf
    data.frame(id = "Rosenstiel (point)",       # Add the data.frame portion
               foundation = 1943))        
rsmaes_pt
Simple feature collection with 1 feature and 2 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -80.16329 ymin: 25.73281 xmax: -80.16329 ymax: 25.73281
CRS:           NA
                  id foundation
1 Rosenstiel (point)       1943
  st_sfc.st_point.c..80.1632879..25.7328129....crs....ESRI.4326..
1                                      POINT (-80.16329 25.73281)

Simple feature geometries (sfg)

  • These are the essential building blocks
  • They represent the type of simple features that can be handled in R
  • There is one function to create each type
  • You will rarely use them
  • They take numbers, matrices, or lists as inputs
  • Do not hold information on CRS

Simple feature columns (sfc)

  • A collection of 1 + sfg objects
  • Its contents may be more than one type of geometry (e.g. points and polygons)
  • Can contain a CRS, but must be the same across all geometries

Spherical geometries

What are spherical geometries?

  • Most software used to perform spatial operations assumes planar (e.g. projected) geometries
  • But our world is not flat
  • Therefore, when working with planar geometries, your choice of projection can influence your calculations
  • Using spherical geometries can bypass this
  • Spherical geometries are turned on by default (sf_use_s2(TRUE))

What do spherical geometries do?

S2 geometry assumes that straight lines between points on the globe are not formed by straight lines in the equirectangular projection, but by great circles: the shortest path over the sphere.

mex <- st_point(c(-99.165852, 19.376395)) |>
  st_sfc(crs = 4326)
ams <- st_point(c(4.646219, 52.38739)) |>
  st_sfc(crs = 4326)

# What is the distance between these points?
st_distance(mex, ams)
Units: [m]
        [,1]
[1,] 9206771
sf_use_s2(F) # Turn off spherical geometries
st_distance(mex, ams)
Units: [m]
        [,1]
[1,] 9220537

Why are they different?

One tries to calculate the distance along a straight (red) line. The other does so alogn the “curved” (black) line. On a spahere, the black line is shorter.

Which one is right?

In reality

Wrap up

Main points

  1. There are two models with which we represent spatial data (vector and raster)
  2. The vector model relies on simple features: points, lines, and polygons (coordinates)
  3. sf is the main R package for working with vector data
  4. an sf object should have three things:
    • attributes (what: data frame)
    • features (where: geometry column)
    • CRS (in relation to what?: in the header)
  5. sf objects can be treated like tibbles / data.frames
  6. Quickly visualize sf objects using plot(), ggplot(), or mapview()

Next class

  • Working with simple feature objects in R

Working with