Vector Operations in R

Lead Poisoning in Syracuse

Lesson 1 with Elizabeth Green

Lesson Objectives

Dive into scripting vector data operations
Learn to handle feature collections like tables
Address autocorrelation in point data
Glimpse a spatial regression framework

Specific Achievements

Read CSVs and shapefiles
Intersect and dissolve geometries
Smooth point data with Kriging
Include autocorrelation in a regression

Simple Features

A standardized set of geometric shapes are the essense of vector data, but these are next to useless outside a tabular structure.

library(sf)

lead <- read.csv('data/SYR_soil_PB.csv')
lead[['geometry']] <- st_sfc(
  st_point(),
  crs = 32618)

The lead table now has the “simple feature column”, which st_sfc creates from a CRS and a geometry type, but each point is “EMPTY”. The empty gemometry is equivalent to a NA value.

> head(lead)

         x       y ID      ppm    geometry
408164.3 4762321  0 3.890648 POINT EMPTY
405914.9 4767394  1 4.899391 POINT EMPTY
405724.0 4767706  2 4.434912 POINT EMPTY
406702.8 4769201  3 5.285548 POINT EMPTY
405392.3 4765598  4 5.295919 POINT EMPTY
405644.1 4762037  5 4.681277 POINT EMPTY

geometry	description
`st_point`	a single point
`st_linestring`	sequence of points connected by straight, non-self intersecting line pieces
`st_polygon`	one [or more] sequences of points in a closed, non-self intersecting ring [with holes]
`st_multi*`	sequence of `*`, either `point`, `linestring`, or `polygon`

Each element of the simple feature column is a simple feature geometry, here created from the “x” and “y” elements of a given feature.

lead[[1, 'geometry']] <- st_point(
  c(x = lead[[1, 'x']],
    y = lead[[1, 'y']]),
  dim = 'XY')

> head(lead)

         x       y ID      ppm                 geometry
408164.3 4762321  0 3.890648 POINT (408164.3 4762321)
405914.9 4767394  1 4.899391              POINT EMPTY
405724.0 4767706  2 4.434912              POINT EMPTY
406702.8 4769201  3 5.285548              POINT EMPTY
405392.3 4765598  4 5.295919              POINT EMPTY
405644.1 4762037  5 4.681277              POINT EMPTY

The whole data frame must be cast to a simple feature object, which causes functions like print and plot to use methods introduced by the sf library.

lead <- st_sf(lead)

For example, the print method automatically shows the CRS and truncates the number of records displayed.

> lead

Simple feature collection with 3149 features and 4 fields (with 3149 geometries empty)
geometry type:  POINT
dimension:      XY
bbox:           xmin: 408164.3 ymin: 4762321 xmax: 408164.3 ymax: 4762321
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
          x       y ID      ppm                 geometry
1  408164.3 4762321  0 3.890648 POINT (408164.3 4762321)
2  405914.9 4767394  1 4.899391              POINT EMPTY
3  405724.0 4767706  2 4.434912              POINT EMPTY
4  406702.8 4769201  3 5.285548              POINT EMPTY
5  405392.3 4765598  4 5.295919              POINT EMPTY
6  405644.1 4762037  5 4.681277              POINT EMPTY
7  409183.1 4763057  6 3.364148              POINT EMPTY
8  407945.4 4770014  7 4.096946              POINT EMPTY
9  406341.4 4762603  8 4.689880              POINT EMPTY
10 404638.1 4767636  9 5.422257              POINT EMPTY

Naturally, there is a shortcut to creating an sf object from a data frame with point coordinates. The CRS must be specified via EPSG integer or proj4 string.

lead <- read.csv('data/SYR_soil_PB.csv')
lead <- st_as_sf(lead,
  coords = c('x', 'y'),
  crs = 32618)

> lead

Simple feature collection with 3149 features and 2 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm                 geometry
1   0 3.890648 POINT (408164.3 4762321)
2   1 4.899391 POINT (405914.9 4767394)
3   2 4.434912   POINT (405724 4767706)
4   3 5.285548 POINT (406702.8 4769201)
5   4 5.295919 POINT (405392.3 4765598)
6   5 4.681277 POINT (405644.1 4762037)
7   6 3.364148 POINT (409183.1 4763057)
8   7 4.096946 POINT (407945.4 4770014)
9   8 4.689880 POINT (406341.4 4762603)
10  9 5.422257 POINT (404638.1 4767636)

Now that table is an sf object, the data are easilly displayed as a map.

plot(lead['ppm'])

For ggplot2 figures, use geom_sf to draw maps. In the aes mapping for feature collections, the “x” and “y” variables are automatically assigned to the sticky “geometry” column, while other attributes can be assigned to visual elements like fill or color.

library(ggplot2)

ggplot(data = lead,
       mapping = aes(color = ppm)) +
  geom_sf()

Feature Collections

More complicated geometries are usually not stored in CSV files, but they are usually still read as tabular data. We will see that the similarity of feature collections to non-spatial tabular data goes quite far; the usual data manipulations done on tabular data work just as well on sf objects.

blockgroups <- read_sf('data/bg_00')

Confirm that the coordinates in the geometry column are the correct UTMs.

> blockgroups

Simple feature collection with 147 features and 3 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 147 x 4
   BKG_KEY   Shape_Leng Shape_Area                                 geometry
   <chr>          <dbl>      <dbl>                            <POLYGON [m]>
 1 36067000…     13520.   6135184. ((403476.4 4767682, 403356.7 4767804, 4…
 2 36067000…      2547.    301840. ((406271.7 4770188, 406186.1 4770270, 4…
 3 36067000…      2678.    250998. ((406730.3 4770235, 406687.8 4770205, 4…
 4 36067000…      3392.    656276. ((406436.1 4770029, 406340 4769973, 406…
 5 36067000…      2224.    301086. ((407469 4770033, 407363.9 4770035, 407…
 6 36067000…      3263.    606495. ((408398.6 4769564, 408283.1 4769556, 4…
 7 36067000…      2878.    274532. ((407477.4 4769773, 407401 4769767, 407…
 8 36067000…      3606.    331035. ((407486 4769507, 407443.5 4769504, 407…
 9 36067001…      2951.    376786. ((410704.4 4769103, 410625.2 4769100, 4…
10 36067001…      2868.    265836. ((409318.3 4769203, 409299.6 4769535, 4…
# … with 137 more rows

Also note the table dimensions reveal 147 features in the collection.

> dim(blockgroups)

[1] 147   4

Simple feature collections are data frames.

> class(blockgroups)

[1] "sf"         "tbl_df"     "tbl"        "data.frame"

The blockgroups object is a data.frame, but it also has the class attribute of sf. This additional class extends the data.frame class in ways useful for feature collection. For instance, the geometry column becomes “sticky” in most table operations, like subsetting.

> blockgroups[1:5, 'BKG_KEY']

Simple feature collection with 5 features and 1 field
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 5 x 2
  BKG_KEY                                                          geometry
  <chr>                                                       <POLYGON [m]>
1 3606700010… ((403476.4 4767682, 403356.7 4767804, 403117.2 4768027, 4028…
2 3606700030… ((406271.7 4770188, 406186.1 4770270, 406107.9 4770345, 4060…
3 3606700030… ((406730.3 4770235, 406687.8 4770205, 406650.9 4770179, 4066…
4 3606700020… ((406436.1 4770029, 406340 4769973, 406307.2 4769954, 406206…
5 3606700040… ((407469 4770033, 407363.9 4770035, 407233.4 4770036, 407235…

Table Operations

plot
merge or join
“split-apply-combine” or group-by and summarize

> ggplot(blockgroups,
+        aes(fill = Shape_Area)) +
+   geom_sf()

Merging with a regular data frame is done by normal merging on non-spatial columns found in both tables.

census <- read.csv('data/SYR_census.csv')
census <- within(census, {
     BKG_KEY <- as.character(BKG_KEY)
})

As usual, there’s the difficulty that CSV files do not include metadata on data types, which have to be set manually.

Merge tables on a unique identifier (our primary key is “BKG_KEY”), but let the “sf” object come first or its special attributes get lost.

library(dplyr)

census_blockgroups <- inner_join(
  blockgroups, census,
  by = c('BKG_KEY'))

> class(census_blockgroups)

[1] "sf"         "tbl_df"     "tbl"        "data.frame"

The census data is now easily vizualized as a map.

ggplot(census_blockgroups,
       aes(fill = POP2000)) +
  geom_sf()

Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.

split – group the features by some factor
apply – apply a function that summarizes each subset, including their geometries
combine – return the results as columns of a new feature collection

census_tracts <- census_blockgroups %>%
  group_by(TRACT) %>%
  summarise(
    POP2000 = sum(POP2000),
    perc_hispa = sum(HISPANIC) / POP2000)

Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.

tracts <- read_sf('data/ct_00')
ggplot(census_tracts,
       aes(fill = POP2000)) +
  geom_sf() +
  geom_sf(data = tracts,
          color = 'red', fill = NA)

By default, the sticky geometries are summarized with st_union. The alternative st_combine does not dissolve internal boundaries. Check ?summarise.sf for more details.

Top of Section

Spatial Join

The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.

ggplot(census_tracts,
       aes(fill = POP2000)) +
  geom_sf() +
  geom_sf(data = lead, color = 'red',
          fill = NA, size = 0.1)

In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality based matches. Several different kinds of “intersections” can be specified that denote a successful match.

Image by Kraus / CC BY

Before doing any spatial join, it is essential that both tables share a common CRS.

> st_crs(lead)

Coordinate Reference System:
  EPSG: 32618 
  proj4string: "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"

> st_crs(census_tracts)

Coordinate Reference System:
  EPSG: 32618 
  proj4string: "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"

The st_join function performs a left join using the geometries of the two simple feature collections.

> st_join(lead, census_tracts)

Simple feature collection with 3149 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648  6102    2154 0.01578459 POINT (408164.3 4762321)
2   1 4.899391  3200    2444 0.10515548 POINT (405914.9 4767394)
3   2 4.434912   100     393 0.03053435   POINT (405724 4767706)
4   3 5.285548   700    1630 0.03190184 POINT (406702.8 4769201)
5   4 5.295919  3900    4405 0.23541430 POINT (405392.3 4765598)
6   5 4.681277  6000    3774 0.04027557 POINT (405644.1 4762037)
7   6 3.364148  5602    2212 0.08318264 POINT (409183.1 4763057)
8   7 4.096946   400    3630 0.01019284 POINT (407945.4 4770014)
9   8 4.689880  6000    3774 0.04027557 POINT (406341.4 4762603)
10  9 5.422257  2100    1734 0.09169550 POINT (404638.1 4767636)

Only the “left” geometry is preserved in the output
Matching defaults to st_intersects,but permits any predicate function

> st_join(lead, census_tracts,
+   join = st_contains)

Simple feature collection with 3149 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648    NA      NA         NA POINT (408164.3 4762321)
2   1 4.899391    NA      NA         NA POINT (405914.9 4767394)
3   2 4.434912    NA      NA         NA   POINT (405724 4767706)
4   3 5.285548    NA      NA         NA POINT (406702.8 4769201)
5   4 5.295919    NA      NA         NA POINT (405392.3 4765598)
6   5 4.681277    NA      NA         NA POINT (405644.1 4762037)
7   6 3.364148    NA      NA         NA POINT (409183.1 4763057)
8   7 4.096946    NA      NA         NA POINT (407945.4 4770014)
9   8 4.689880    NA      NA         NA POINT (406341.4 4762603)
10  9 5.422257    NA      NA         NA POINT (404638.1 4767636)

This command would match on polygons within (contained by) each point, which is the wrong way around. Stick with the default.

The population data is at the coarser scale, so the lead ought to be averaged within a census tract. Once each lead measurement is joined to “TRACT”, the spatial data can by dropped.

lead_tracts <- lead %>%
    st_join(census_tracts) %>%
    st_drop_geometry()

Two more steps to group the data by “TRACT” and average the lead concentrations within each “TRACT”.

lead_tracts <- lead %>%
    st_join(census_tracts) %>%
    st_drop_geometry() %>%
    group_by(TRACT) %>%
    summarise(avg_ppm = mean(ppm))

To visualize the average lead concentration from soil samples within each cencus tract, merge the data frame to the sf object on “TRACT”.

census_lead_tracts <- census_tracts %>%
  inner_join(lead_tracts)

Make a plot of avg_ppm for each census tract.

ggplot(census_lead_tracts,
       aes(fill = avg_ppm)) +
  geom_sf() +
  scale_fill_gradientn(
    colors = heat.colors(7))

A problem with this approach to aggregation is that it ignores autocorrelation. Points close to eachother tend to have similar values and shouldn’t be given equal weight in averages within a polygon.

Take a closer look at tract 5800 (around 43.025N, 76.152W), and notice that several low values are nearly stacked on top of eachother.

library(mapview)

mapview(census_tracts,
        legend = FALSE,
        viewer.suppress = TRUE) +
  mapview(lead['ppm'])

Top of Section

The Semivariogram

A semivariogram quantifies the effect of distance on the correlation between values from different locations. On average, measurements of the same variable at two nearby locations are more similar (lower variance) than when those locations are distant.

The gstat library, a geospatial analogue to the stats library provides variogram estimation among several additional tools.

library(gstat)
lead_xy <- read.csv('data/SYR_soil_PB.csv')

The empirical semivariogram shown below is a windowed average of the squared difference in lead concentrations beween sample points.

v_ppm <- variogram(
  ppm ~ 1,
  locations = ~ x + y,
  data = lead_xy)
plot(v_ppm)

Fitting a model semivariogram tidies up the information about autocorrelation in the data, so we can use it for interpolation.

v_ppm_fit <- fit.variogram(
  v_ppm,
  model = vgm(1, "Sph", 900, 1))
plot(v_ppm, v_ppm_fit)

Kriging

The modeled semivariogram acts as a parameter when performing Guassian process regression, commonly known as kriging. The steps to perform Kriging with the gstat library are:

Generate a modeled semivariogram
Generate new locations for “predicted” values
Call krige with the data, locations for prediction, and semivariogram

Generate a low resolution (for demonstration) grid of points overlaying the bounding box for the lead data and trim it down to the polygons of interest. Remember, that goal is aggregating lead concentrations within each census tract.

pred_ppm <- st_make_grid(
  lead, cellsize = 400,
  what = 'centers')
idx <- unlist(
  st_intersects(census_tracts, pred_ppm))
pred_ppm <- pred_ppm[idx]

Map the result to verify generated points.

ggplot(census_tracts,
       aes(fill = POP2000)) +
  geom_sf() +
  geom_sf(data = pred_ppm,
          color = 'red', fill = NA)

Almost done …

Generate a modeled semivariogram
Generate new locations for “predicted” values
Call krige with the data, locations, and semivariogram

The first argument specifies the model for the means, which is constant according to our formula for lead concentrations. The observed ppm values are in the locations data.frame along with the point geometry. The result is a data frame with predictions for lead concentrations at the points in newdata.

pred_ppm <- krige(
  formula = ppm ~ 1,
  locations = lead,
  newdata = pred_ppm,
  model = v_ppm_fit)

And the same commands that joined lead concentrations to census tracts apply to joining the predicted concentrations in too.

pred_ppm_tracts <-
  pred_ppm %>%
  st_join(census_tracts) %>%
  st_drop_geometry() %>%
  group_by(TRACT) %>%
  summarise(pred_ppm = mean(var1.pred))
census_lead_tracts <- 
  census_lead_tracts %>%
  inner_join(pred_ppm_tracts)

The predictions should be, and are, close to the original averages with deviations that correct for autocorrelation.

ggplot(census_lead_tracts,
       aes(x = pred_ppm, y = avg_ppm)) +
  geom_point()

The effect of paying attention to autocorrelation is subtle, but it is noticable and had the expected effect in tract 5800. The pred_ppm value is a little higher than the average.

> census_lead_tracts[52,]

Simple feature collection with 1 feature and 5 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 405633.1 ymin: 4762867 xmax: 406445.9 ymax: 4764711
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 1 x 6
  TRACT POP2000 perc_hispa avg_ppm pred_ppm                        geometry
  <int>   <int>      <dbl>   <dbl>    <dbl>                   <POLYGON [m]>
1  5800    2715     0.0302    5.44     5.53 ((406445.9 4762893, 406017.5 4…

Top of Section

Regression

Continuing the theme that vector data is tabular data, the natural progression in statistical analysis is toward regression. Building a regression model requires making good assumptions about relationships in your data:

between columns as independent and dependent variables
between rows as more-or-less independent observations

The following model assumes an association (in the linear least-squares sense), between the hispanic population and lead concentrations and assumes independence of every census tract (i.e. row).

ppm.lm <- lm(pred_ppm ~ perc_hispa,
  census_lead_tracts)

Is that model any good?

census_lead_tracts <- census_lead_tracts %>%
  mutate(lm.resid = resid(ppm.lm))
plot(census_lead_tracts['lm.resid'])

Polygons close to eachother tend to have similar residuals: there is autocorrelation. It is tempting to ask for a semivariogram plot of the residuals, but that requires a precise definition of the distance between polygons. A favored alternative for quantifying autoregression in non-point feature collections is Moran’s I. This analog to Pearson’s correlation coeficient quantifies autocorrelation rather than cross-correlation.

Moran’s I is the correlation between all pairs of features, weighted to emphasize features that are close together. It does not dodge the problem of distance weighting, but provides default assumptions for how to do it.

library(sp)
library(spdep)

tracts <- as(
  st_geometry(census_tracts), 'Spatial')
tracts_nb <- poly2nb(tracts)

The neighbors variable is the network of features sharing a boundary point.

plot(census_lead_tracts['lm.resid'],
     reset = FALSE)
plot.nb(tracts_nb, coordinates(tracts),
        add = TRUE)

Reshape the adjacency matrix into a list of neighbors with associated weights.

tracts_weight <- nb2listw(tracts_nb)

Visualize correlation between the residuals and the weighted average of their neighbors with moran.plot from the spdep. The positive trend line is consistent with the earlier observation that features in close proximity have similar residuals.

moran.plot(
  census_lead_tracts[['lm.resid']],
  tracts_weight,
  labels = census_lead_tracts[['TRACT']],
  pch = 19)

There are many ways to use geospatial information about tracts to impose assumptions about non-independence between observations in the regression. One approach is a Spatial Auto-Regressive (SAR) model, which regresses each value against the weighted average of neighbors.

ppm.sarlm <- lagsarlm(
  pred_ppm ~ perc_hispa,
  data = census_lead_tracts,
  tracts_weight,
  tol.solve = 1.0e-30)

The Moran’s I plot of residuals now shows less correlation; which means the SAR model’s assumption about spatial autocorrelation (i.e. between table rows) makes the rest of the model more plausible.

moran.plot(
  resid(ppm.sarlm),
  tracts_weight,
  labels = census_lead_tracts[['TRACT']],
  pch = 19)

Feeling more confident in the model, we can now take a look at the regression coefficients and overall model fit.

> summary(ppm.sarlm)

Call:lagsarlm(formula = pred_ppm ~ perc_hispa, data = census_lead_tracts, 
    listw = tracts_weight, tol.solve = 1e-30)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.992669 -0.210942  0.037845  0.246989  0.888440 

Type: lag 
Coefficients: (asymptotic standard errors) 
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.16047    0.46978  2.4702   0.0135
perc_hispa   1.45285    0.95717  1.5179   0.1291

Rho: 0.7499, LR test value: 22.434, p-value: 2.175e-06
Asymptotic standard error: 0.095079
    z-value: 7.8871, p-value: 3.1086e-15
Wald statistic: 62.207, p-value: 3.1086e-15

Log likelihood: -29.16485 for lag model
ML residual variance (sigma squared): 0.14018, (sigma: 0.37441)
Number of observations: 57 
Number of parameters estimated: 4 
AIC: 66.33, (AIC for lm: 86.764)
LM test for residual autocorrelation
test value: 4.5564, p-value: 0.032797

Top of Section

Summary

Rapid tour of packages providing tools for manipulation and analysis of vector data in R.

sf
- st_sf: make a simple features data frame
- read_sf: read a file into a simple features data frame
- st_join: match tables on their geometry columns
gstat
- variogram: distance lagged correlation for points
- fit.variogram: autocorrelation model
- krige: smooth interpolation over points

sp: a low level package with useful Spatial* class objects
spdep
- poly2nb, nb2listw: calculate spatial neighbors and their weights
- moran.plot: infer autocorrelation from residual scatter plots
- lagsarlm: SAR model estimation

Top of Section

If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!