Vector Operations in R

Lead Poisoning in Syracuse

Lesson 1 with Ian Carroll

Contents


Objectives for this lesson

Specific achievements

Import Clarification

Packages or libraries expand the vocabulary of the R interpreter, sometimes too quickly for a user’s comfort. The modules library provides an import command, used throughout this lesson, to help keep tabs the many new commands we import.

library('modules')
import('magrittr', '%>%')

Top of Section


Simple Features

A standardized set of geometric shapes are the essense of vector data, but these are next to useless outside a tabular structure.

sf <- import('sf')

lead <- read.csv('data/SYR_soil_PB.csv')
lead[['geometry']] <- sf$st_sfc(
  sf$st_point(),
  crs = 32618)
head(lead)
         x       y ID      ppm    geometry
1 408164.3 4762321  0 3.890648 POINT EMPTY
2 405914.9 4767394  1 4.899391 POINT EMPTY
3 405724.0 4767706  2 4.434912 POINT EMPTY
4 406702.8 4769201  3 5.285548 POINT EMPTY
5 405392.3 4765598  4 5.295919 POINT EMPTY
6 405644.1 4762037  5 4.681277 POINT EMPTY
geometry description
st_point a single point
st_linestring sequence of points connected by straight, non-self intersecting line pieces
st_polygon one [or more] sequences of points in a closed, non-self intersecting ring [with holes]
st_multi* sequence of *, either point, linestring, or polygon

Each element of the simple feature column is a simple feature geometry, here created from the “x” and “y” elements of a given feature.

lead[[1, 'geometry']] <- sf$st_point(
  c(x = lead[[1, 'x']], y = lead[[1, 'y']]),
  dim = 'XY')
head(lead)
         x       y ID      ppm                 geometry
1 408164.3 4762321  0 3.890648 POINT (408164.3 4762321)
2 405914.9 4767394  1 4.899391              POINT EMPTY
3 405724.0 4767706  2 4.434912              POINT EMPTY
4 406702.8 4769201  3 5.285548              POINT EMPTY
5 405392.3 4765598  4 5.295919              POINT EMPTY
6 405644.1 4762037  5 4.681277              POINT EMPTY

The whole data frame must be cast to a simple feature object, which causes functions like print and plot to use methods introduced by the sf library.

lead <- sf$st_sf(lead)

For example, the print method automatically shows the CRS and truncates the number of records displayed.

lead
Simple feature collection with 3149 features and 4 fields (with 3149 geometries empty)
geometry type:  POINT
dimension:      XY
bbox:           xmin: 408164.3 ymin: 4762321 xmax: 408164.3 ymax: 4762321
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
          x       y ID      ppm                 geometry
1  408164.3 4762321  0 3.890648 POINT (408164.3 4762321)
2  405914.9 4767394  1 4.899391              POINT EMPTY
3  405724.0 4767706  2 4.434912              POINT EMPTY
4  406702.8 4769201  3 5.285548              POINT EMPTY
5  405392.3 4765598  4 5.295919              POINT EMPTY
6  405644.1 4762037  5 4.681277              POINT EMPTY
7  409183.1 4763057  6 3.364148              POINT EMPTY
8  407945.4 4770014  7 4.096946              POINT EMPTY
9  406341.4 4762603  8 4.689880              POINT EMPTY
10 404638.1 4767636  9 5.422257              POINT EMPTY

Naturally, there is a shortcut to creating an sf object from a data frame with point coordinates. The CRS must be specified via EPSG integer or proj4 string.

lead <- read.csv('data/SYR_soil_PB.csv')
lead <- sf$st_as_sf(lead,
  coords = c('x', 'y'),
  crs = 32618)
lead
Simple feature collection with 3149 features and 2 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm                 geometry
1   0 3.890648 POINT (408164.3 4762321)
2   1 4.899391 POINT (405914.9 4767394)
3   2 4.434912   POINT (405724 4767706)
4   3 5.285548 POINT (406702.8 4769201)
5   4 5.295919 POINT (405392.3 4765598)
6   5 4.681277 POINT (405644.1 4762037)
7   6 3.364148 POINT (409183.1 4763057)
8   7 4.096946 POINT (407945.4 4770014)
9   8 4.689880 POINT (406341.4 4762603)
10  9 5.422257 POINT (404638.1 4767636)

Now that table is an sf object, the data are easilly displayed as a map.

plot(lead['ppm'])

plot of chunk unnamed-chunk-9

Feature Collections

More complicated geometries are usually not stored in CSV files, but they are usually still read as tabular data. We will see that the similarity of feature collections to non-spatial tabular data goes quite far; the usual data manipulations done on tabular data work just as well on sf objects.

blockgroups <- sf$read_sf('data/bg_00')

Confirm that the coordinates in the geometry column are the correct UTMs.

blockgroups
Simple feature collection with 147 features and 6 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 147 x 7
   BKG_KEY      Shape_Leng Shape_Area BKG_KEY_1    Shape_Le_1 Shape_Ar_1
   <chr>             <dbl>      <dbl> <chr>             <dbl>      <dbl>
 1 360670001001     13520.   6135184. 360670001001     13520.   6135184.
 2 360670003002      2547.    301840. 360670003002      2547.    301840.
 3 360670003001      2678.    250998. 360670003001      2678.    250998.
 4 360670002001      3392.    656276. 360670002001      3392.    656276.
 5 360670004004      2224.    301086. 360670004004      2224.    301086.
 6 360670004001      3263.    606495. 360670004001      3263.    606495.
 7 360670004003      2878.    274532. 360670004003      2878.    274532.
 8 360670004002      3606.    331035. 360670004002      3606.    331035.
 9 360670010001      2951.    376786. 360670010001      2951.    376786.
10 360670010003      2868.    265836. 360670010003      2868.    265836.
# ... with 137 more rows, and 1 more variable: geometry <sf_geometry [m]>

Note the table dimensions show 147 features in the collection.

dim(blockgroups)
[1] 147   7

Simple feature collections are data frames.

class(blockgroups)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"

The blockgroups object is a data.frame, but it also has the class attribute of sf. This additional class extends the data.frame class in ways useful for feature collection. For instance, the geometry column becomes “sticky” in most table operations, like subsetting.

blockgroups[1:5, 'BKG_KEY']
Simple feature collection with 5 features and 1 field
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 5 x 2
  BKG_KEY                            geometry
  <chr>                     <sf_geometry [m]>
1 360670001001 POLYGON ((403476.4 4767682,...
2 360670003002 POLYGON ((406271.7 4770188,...
3 360670003001 POLYGON ((406730.3 4770235,...
4 360670002001 POLYGON ((406436.1 4770029,...
5 360670004004 POLYGON ((407469 4770033, 4...

Table Operations

In the plot method for feature collections, the “x” in the usual plot(x, y, ...) automatically takes the sticky “geometry” column and the type of plot is assumed to be a map. Only the “y” needs to be specified.

plot(blockgroups['Shape_Area'])

plot of chunk unnamed-chunk-15

Merging with a regular data frame is done by normal merging non-spatial columns found in both tables.

census <- read.csv('data/SYR_census.csv')
census$BKG_KEY <- factor(census$BKG_KEY)

As usual, there’s the difficulty that CSV files do not include metadata on data types, which have to be set manually.

Merge tables on a unique identifier (our primary key is “BKG_KEY”), but let the “sf” object come first or is special attributes get lost.

import('dplyr', 
  'inner_join', 'group_by', 'summarise')

census_blockgroups <- inner_join(
  blockgroups, census,
  by = c('BKG_KEY'))
class(census_blockgroups)
[1] "sf"         "tbl_df"     "tbl"        "data.frame"

The census data is now easily vizualized as a map.

plot(census_blockgroups['POP2000'])

plot of chunk unnamed-chunk-19

Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.

census_tracts <- census_blockgroups %>%
  group_by(TRACT) %>%
  summarise(
    POP2000 = sum(POP2000),
    perc_hispa = sum(HISPANIC) / sum(POP2000))

Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.

tracts <- sf$read_sf('data/ct_00')
plot(census_tracts['POP2000'])
plot(sf$st_geometry(tracts), border = 'red', add = TRUE)

plot of chunk unnamed-chunk-21

By default, the sticky geometries are summarized with st_union. The alternative st_combine does not dissolve internal boundaries. Check ?summarise.sf for more details.

Top of Section


Spatial Join

The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.

plot(census_tracts['POP2000'],
  pal = cm.colors)
plot(sf$st_geometry(lead),
  pch = '.', add = TRUE)

plot of chunk unnamed-chunk-1

In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality based matches. Several different kinds of “intersections” can be specified that denote a successful match.


Image by Kraus / CC BY

Before doing any spatial join, it is essential that both tables share a common CRS.

sf$st_crs(lead)
sf$st_crs(census_tracts)

The st_join function performs a left join using the geometries of the two simple feature collections.

sf$st_join(lead, census_tracts)
Simple feature collection with 3149 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648  6102    2154 0.01578459 POINT (408164.3 4762321)
2   1 4.899391  3200    2444 0.10515548 POINT (405914.9 4767394)
3   2 4.434912   100     393 0.03053435   POINT (405724 4767706)
4   3 5.285548   700    1630 0.03190184 POINT (406702.8 4769201)
5   4 5.295919  3900    4405 0.23541430 POINT (405392.3 4765598)
6   5 4.681277  6000    3774 0.04027557 POINT (405644.1 4762037)
7   6 3.364148  5602    2212 0.08318264 POINT (409183.1 4763057)
8   7 4.096946   400    3630 0.01019284 POINT (407945.4 4770014)
9   8 4.689880  6000    3774 0.04027557 POINT (406341.4 4762603)
10  9 5.422257  2100    1734 0.09169550 POINT (404638.1 4767636)
sf$st_join(lead, census_tracts,
  join = st_contains)
Simple feature collection with 3149 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
   ID      ppm TRACT POP2000 perc_hispa                 geometry
1   0 3.890648    NA      NA         NA POINT (408164.3 4762321)
2   1 4.899391    NA      NA         NA POINT (405914.9 4767394)
3   2 4.434912    NA      NA         NA   POINT (405724 4767706)
4   3 5.285548    NA      NA         NA POINT (406702.8 4769201)
5   4 5.295919    NA      NA         NA POINT (405392.3 4765598)
6   5 4.681277    NA      NA         NA POINT (405644.1 4762037)
7   6 3.364148    NA      NA         NA POINT (409183.1 4763057)
8   7 4.096946    NA      NA         NA POINT (407945.4 4770014)
9   8 4.689880    NA      NA         NA POINT (406341.4 4762603)
10  9 5.422257    NA      NA         NA POINT (404638.1 4767636)

The population data is at the coarser scale, so the lead ought to be averaged within a census tract. Once each lead measurement is joined to “TRACT”, the spatial data can by dropped.

lead_tracts <- lead %>%
    sf$st_join(census_tracts) %>%
    sf$st_set_geometry(NULL)

Two more steps to group the data by “TRACT” and average the lead concentrations within each “TRACT”.

lead_tracts <- lead %>%
    sf$st_join(census_tracts) %>%
    sf$st_set_geometry(NULL) %>%
    group_by(TRACT) %>%
    summarise(avg_ppm = mean(ppm))

To visualize the average lead concentration from soil samples within each cencus tract, merge the data frame to the sf object on “TRACT”.

census_lead_tracts <- census_tracts %>%
  inner_join(lead_tracts)
plot(census_lead_tracts['avg_ppm'],
  pal = heat.colors)

plot of chunk unnamed-chunk-7

A problem with this approach to aggregation is that it ignores autocorrelation. Points close to eachother tend to have similar values and shouldn’t be given equal weight in averages within a polygon.

Take a closer look at tract 5800 (in row 52), and notice that several low values are nearly stacked on top of eachother.

m <- import('mapview')
m$mapview(sf$st_geometry(census_tracts)) +
  m$mapview(lead['ppm'])

Top of Section


The Semivariogram

A semivariogram quantifies the effect of distance on the correlation between values from different locations. On average, measurements of the same variable at two nearby locations are more similar (lower variance) than when those locations are distant.

The gstat library, a geospatial analogue to the stats library provides variogram estimation among several additional tools.

g <- import('gstat')

lead_xy <- read.csv('data/SYR_soil_PB.csv')

The empirical semivariogram shown below is a windowed average of the squared difference in lead concentrations beween sample points.

v_ppm <- g$variogram(
  ppm ~ 1,
  locations = ~ x + y,
  data = lead_xy)
plot(v_ppm)

plot of chunk unnamed-chunk-2

Fitting a model semivariogram tidies up the information about autocorrelation in the data, so we can use it for interpolation.

v_ppm_fit <- g$fit.variogram(
  v_ppm,
  model = g$vgm(1, "Sph", 900, 1))
plot(v_ppm, v_ppm_fit)

plot of chunk unnamed-chunk-3

Kriging

The modeled semivariogram acts as a parameter for fitting a Guassian process, or Kriging. The steps to perform Kriging with the gstat library are

  1. Generate a modeled semivariogram
  2. Generate new locations for “predicted” values
  3. Call krige with the data, locations, and semivariogram

Generate a low resolution (for demonstration) grid of points overlaying the bounding box for the lead data and trim it down to the polygons of interest. Remember, that goal is aggregating lead concentrations within each census tract.

pred_ppm <- sf$st_make_grid(
  lead, cellsize = 400,
  what = 'centers')
idx <- unlist(
  sf$st_intersects(census_tracts, pred_ppm))
pred_ppm <- pred_ppm[idx]

Map the result to verify generated points.

plot(census_tracts['POP2000'],
  pal = cm.colors)
plot(pred_ppm,
  pch = '.', add = TRUE)

plot of chunk unnamed-chunk-5

Like the gstat function, the krige function doesn’t work seemlessly with sf objects, so again create a data frame with coordinates.

pred_ppm_xy <- data.frame(
  do.call(rbind, pred_ppm))
names(pred_ppm_xy) <- c('x', 'y')

Almost done …

  1. Generate a modeled semivariogram
  2. Generate new locations for “predicted” values
  3. Call krige with the data, locations, and semivariogram

The first argument specifies the model for the means, which is constant according to our formula for lead concentrations. The result, like the input for locations, is a data frame with coordinates and a prediction for lead concentrations.

pred_ppm_xy <- g$krige(
  ppm ~ 1,
  locations = ~ x + y,
  data = lead_xy,
  newdata = pred_ppm_xy,
  model = v_ppm_fit)

The same commands that created the sf table for lead apply to creating a sf table for the predicted concentrations.

pred_ppm <- sf$st_as_sf(
  pred_ppm_xy,
  coords = c('x', 'y'),
  crs = sf$st_crs(lead))

And the same commands that joined lead concentrations to census tracts apply to joining the predicted concentrations in too.

pred_ppm_tracts <-
  pred_ppm %>%
  sf$st_join(census_tracts) %>%
  sf$st_set_geometry(NULL) %>%
  group_by(TRACT) %>%
  summarise(pred_ppm = mean(var1.pred))
census_lead_tracts <- 
  census_lead_tracts %>%
  inner_join(pred_ppm_tracts)
plot(census_lead_tracts[
  c('pred_ppm', 'avg_ppm')],
  pal = heat.colors)

plot of chunk unnamed-chunk-10

The effect of paying attention to autocorrelation is subtle, but it is noticable and had the expected effect in tract 5800.

census_lead_tracts[52,]
Simple feature collection with 1 feature and 5 fields
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 405633.1 ymin: 4762867 xmax: 406445.9 ymax: 4764711
epsg (SRID):    32618
proj4string:    +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 1 x 6
  TRACT POP2000 perc_hispa avg_ppm pred_ppm                       geometry
  <int>   <int>      <dbl>   <dbl>    <dbl>              <sf_geometry [m]>
1  5800    2715     0.0302    5.44     5.65 POLYGON ((406445.9 4762893,...

Top of Section


Regression

Continuing the theme that vector data is tabular data, the natural progression in statistical analysis is toward regression. Building a regression model requires making good assumptions about relationships in your data:

The following model assumes an association (in the linear least-squares sense), between the hispanic population and lead concentrations and assumes independence of every census tract (i.e. row).

ppm.lm <- lm(pred_ppm ~ perc_hispa,
  census_lead_tracts)

Is that model any good?

census_lead_tracts['lm.resid'] <- resid(ppm.lm)
plot(census_lead_tracts['lm.resid'])

plot of chunk unnamed-chunk-2

Polygons close to eachother tend to have similar residuals: there is autocorrelation. It is tempting to ask for a semivariogram plot of the residuals, but that requires a precise definition of the distance between polygons. A favored alternative for quantifying autoregression in non-point feature collections is Moran’s I. This analog to Pearson’s correlation coeficient quantifies autocorrelation rather than cross-correlation.

Moran’s I is the correlation between all pairs of features, weighted to emphasize features that are close together. It does not dodge the problem of distance weighting, it actually adds flexibility, along with some norms.

sp <- import('sp')
sd <- import('spdep')
tracts <- as(
  sf$st_geometry(census_tracts), 'Spatial')
tracts_nb <- sd$poly2nb(tracts)

The neighbors variable is the network of features sharing a boundary point.

plot(census_lead_tracts['lm.resid'])
sd$plot.nb(tracts_nb,
  sp$coordinates(tracts), add = TRUE)

plot of chunk unnamed-chunk-4

Reshape the adjacency matrix into a list of neighbors with associated weights.

tracts_weight <- sd$nb2listw(tracts_nb)

Visualize correlation between the residuals and the weighted average of their neighbors with moran.plot from the spdep

sd$moran.plot(
  census_lead_tracts[['lm.resid']],
  tracts_weight,
  labels = census_lead_tracts[['TRACT']],
  pch = 19)

plot of chunk unnamed-chunk-6

There are many ways to use geospatial information about tracts to impose assumptions about non-independence between observations in the regression. One approach is a Spatial Auto-Regressive (SAR) model, which regresses each value against the weighted average of neighbors.

ppm.sarlm <- sd$lagsarlm(
  pred_ppm ~ perc_hispa,
  data = census_lead_tracts,
  tracts_weight,
  tol.solve = 1.0e-30)

The Moran’s I plot of residuals shows less correlation; which means the SAR model’s assumption about spatial autocorrelation (i.e. between table rows) makes the rest of the model more plausible.

sd$moran.plot(
  resid(ppm.sarlm),
  tracts_weight,
  labels = census_lead_tracts[['TRACT']],
  pch = 19)

plot of chunk unnamed-chunk-9

Feeling more confident in the model, we can now take a look at the regression coefficients.

summary(ppm.sarlm)

Call:
sd$lagsarlm(formula = pred_ppm ~ perc_hispa, data = census_lead_tracts, 
    listw = tracts_weight, tol.solve = 1e-30)

Residuals:
      Min        1Q    Median        3Q       Max 
-1.002799 -0.212714  0.074072  0.235047  0.870624 

Type: lag 
Coefficients: (asymptotic standard errors) 
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.10395    0.45580  2.4220  0.01544
perc_hispa   1.27550    0.99076  1.2874  0.19795

Rho: 0.76278, LR test value: 23.183, p-value: 1.473e-06
Asymptotic standard error: 0.092139
    z-value: 8.2786, p-value: 2.2204e-16
Wald statistic: 68.535, p-value: < 2.22e-16

Log likelihood: -31.49178 for lag model
ML residual variance (sigma squared): 0.15101, (sigma: 0.3886)
Number of observations: 57 
Number of parameters estimated: 4 
AIC: 70.984, (AIC for lm: 92.166)
LM test for residual autocorrelation
test value: 5.5923, p-value: 0.01804

Top of Section


Summary

Rapid tour of packages providing tools for manipulation and analysis of vector data in R.

Top of Section