Vector Operations in R
Lead Poisoning in Syracuse
Lesson 1 with Ian Carroll
Contents
Objectives for this lesson
- Dive into scripting vector data operations
- Learn to handle feature collections like tables
- Address autocorrelation in point data
- Glimpse a spatial regression framework
Specific achievements
- Read CSVs and shapefiles
- Intersect and dissolve geometries
- Smooth point data with Kriging
- Include autocorrelation in a regression
Import Clarification
Packages or libraries expand the vocabulary of the R interpreter, sometimes too
quickly for a user’s comfort. The modules library provides an import
command, used throughout this lesson, to help keep tabs the many new commands we
import.
library('modules')
import('magrittr', '%>%')
Simple Features
A standardized set of geometric shapes are the essense of vector data, but these are next to useless outside a tabular structure.
sf <- import('sf')
lead <- read.csv('data/SYR_soil_PB.csv')
lead[['geometry']] <- sf$st_sfc(
sf$st_point(),
crs = 32618)
head(lead)
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT EMPTY
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
geometry | description |
---|---|
st_point |
a single point |
st_linestring |
sequence of points connected by straight, non-self intersecting line pieces |
st_polygon |
one [or more] sequences of points in a closed, non-self intersecting ring [with holes] |
st_multi* |
sequence of * , either point , linestring , or polygon |
Each element of the simple feature column is a simple feature geometry, here created from the “x” and “y” elements of a given feature.
lead[[1, 'geometry']] <- sf$st_point(
c(x = lead[[1, 'x']], y = lead[[1, 'y']]),
dim = 'XY')
head(lead)
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT (408164.3 4762321)
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
The whole data frame must be cast to a simple feature object, which causes
functions like print
and plot
to use methods introduced by the sf
library.
lead <- sf$st_sf(lead)
For example, the print
method automatically shows the CRS and truncates the
number of records displayed.
lead
Simple feature collection with 3149 features and 4 fields (with 3149 geometries empty)
geometry type: POINT
dimension: XY
bbox: xmin: 408164.3 ymin: 4762321 xmax: 408164.3 ymax: 4762321
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT (408164.3 4762321)
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
7 409183.1 4763057 6 3.364148 POINT EMPTY
8 407945.4 4770014 7 4.096946 POINT EMPTY
9 406341.4 4762603 8 4.689880 POINT EMPTY
10 404638.1 4767636 9 5.422257 POINT EMPTY
Naturally, there is a shortcut to creating an sf
object from a data frame with
point coordinates. The CRS must be specified via EPSG integer or proj4 string.
lead <- read.csv('data/SYR_soil_PB.csv')
lead <- sf$st_as_sf(lead,
coords = c('x', 'y'),
crs = 32618)
lead
Simple feature collection with 3149 features and 2 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm geometry
1 0 3.890648 POINT (408164.3 4762321)
2 1 4.899391 POINT (405914.9 4767394)
3 2 4.434912 POINT (405724 4767706)
4 3 5.285548 POINT (406702.8 4769201)
5 4 5.295919 POINT (405392.3 4765598)
6 5 4.681277 POINT (405644.1 4762037)
7 6 3.364148 POINT (409183.1 4763057)
8 7 4.096946 POINT (407945.4 4770014)
9 8 4.689880 POINT (406341.4 4762603)
10 9 5.422257 POINT (404638.1 4767636)
Now that table is an sf
object, the data are easilly displayed as a map.
plot(lead['ppm'])
Feature Collections
More complicated geometries are usually not stored in CSV files, but they are
usually still read as tabular data. We will see that the similarity of feature
collections to non-spatial tabular data goes quite far; the usual data
manipulations done on tabular data work just as well on sf
objects.
blockgroups <- sf$read_sf('data/bg_00')
Confirm that the coordinates in the geometry column are the correct UTMs.
blockgroups
Simple feature collection with 147 features and 6 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 147 x 7
BKG_KEY Shape_Leng Shape_Area BKG_KEY_1 Shape_Le_1 Shape_Ar_1
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 360670001001 13520. 6135184. 360670001001 13520. 6135184.
2 360670003002 2547. 301840. 360670003002 2547. 301840.
3 360670003001 2678. 250998. 360670003001 2678. 250998.
4 360670002001 3392. 656276. 360670002001 3392. 656276.
5 360670004004 2224. 301086. 360670004004 2224. 301086.
6 360670004001 3263. 606495. 360670004001 3263. 606495.
7 360670004003 2878. 274532. 360670004003 2878. 274532.
8 360670004002 3606. 331035. 360670004002 3606. 331035.
9 360670010001 2951. 376786. 360670010001 2951. 376786.
10 360670010003 2868. 265836. 360670010003 2868. 265836.
# ... with 137 more rows, and 1 more variable: geometry <sf_geometry [m]>
Note the table dimensions show 147 features in the collection.
dim(blockgroups)
[1] 147 7
Simple feature collections are data frames.
class(blockgroups)
[1] "sf" "tbl_df" "tbl" "data.frame"
The blockgroups
object is a data.frame
, but it also has the class attribute
of sf
. This additional class extends the data.frame
class in ways useful for
feature collection. For instance, the geometry column becomes “sticky” in most
table operations, like subsetting.
blockgroups[1:5, 'BKG_KEY']
Simple feature collection with 5 features and 1 field
geometry type: POLYGON
dimension: XY
bbox: xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 5 x 2
BKG_KEY geometry
<chr> <sf_geometry [m]>
1 360670001001 POLYGON ((403476.4 4767682,...
2 360670003002 POLYGON ((406271.7 4770188,...
3 360670003001 POLYGON ((406730.3 4770235,...
4 360670002001 POLYGON ((406436.1 4770029,...
5 360670004004 POLYGON ((407469 4770033, 4...
Table Operations
- plot
- merge or join
- “split-apply-combine” or group-by and summarize
In the plot
method for feature collections, the “x” in the usual plot(x,
y, ...)
automatically takes the sticky “geometry” column and the type of plot
is assumed to be a map. Only the “y” needs to be specified.
plot(blockgroups['Shape_Area'])
Merging with a regular data frame is done by normal merging non-spatial columns found in both tables.
census <- read.csv('data/SYR_census.csv')
census$BKG_KEY <- factor(census$BKG_KEY)
As usual, there’s the difficulty that CSV files do not include metadata on data types, which have to be set manually.
Merge tables on a unique identifier (our primary key is “BKG_KEY”), but let the “sf” object come first or is special attributes get lost.
import('dplyr',
'inner_join', 'group_by', 'summarise')
census_blockgroups <- inner_join(
blockgroups, census,
by = c('BKG_KEY'))
class(census_blockgroups)
[1] "sf" "tbl_df" "tbl" "data.frame"
The census data is now easily vizualized as a map.
plot(census_blockgroups['POP2000'])
Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.
- split – group the features by some factor
- apply – apply a function that summarizes each subset, including their geometries
- combine – return the results as columns of a new feature collection
census_tracts <- census_blockgroups %>%
group_by(TRACT) %>%
summarise(
POP2000 = sum(POP2000),
perc_hispa = sum(HISPANIC) / sum(POP2000))
Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.
tracts <- sf$read_sf('data/ct_00')
plot(census_tracts['POP2000'])
plot(sf$st_geometry(tracts), border = 'red', add = TRUE)
By default, the sticky geometries are summarized with st_union
. The
alternative st_combine
does not dissolve internal boundaries. Check
?summarise.sf
for more details.
Spatial Join
The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.
plot(census_tracts['POP2000'],
pal = cm.colors)
plot(sf$st_geometry(lead),
pch = '.', add = TRUE)
In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality based matches. Several different kinds of “intersections” can be specified that denote a successful match.
Before doing any spatial join, it is essential that both tables share a common CRS.
sf$st_crs(lead)
sf$st_crs(census_tracts)
The st_join
function performs a left join using the geometries of the two
simple feature collections.
sf$st_join(lead, census_tracts)
Simple feature collection with 3149 features and 5 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 6102 2154 0.01578459 POINT (408164.3 4762321)
2 1 4.899391 3200 2444 0.10515548 POINT (405914.9 4767394)
3 2 4.434912 100 393 0.03053435 POINT (405724 4767706)
4 3 5.285548 700 1630 0.03190184 POINT (406702.8 4769201)
5 4 5.295919 3900 4405 0.23541430 POINT (405392.3 4765598)
6 5 4.681277 6000 3774 0.04027557 POINT (405644.1 4762037)
7 6 3.364148 5602 2212 0.08318264 POINT (409183.1 4763057)
8 7 4.096946 400 3630 0.01019284 POINT (407945.4 4770014)
9 8 4.689880 6000 3774 0.04027557 POINT (406341.4 4762603)
10 9 5.422257 2100 1734 0.09169550 POINT (404638.1 4767636)
- Only the “left” geometry is preserved in the output
- Matching defaults to
st_intersects
but permits any predicate function
sf$st_join(lead, census_tracts,
join = st_contains)
Simple feature collection with 3149 features and 5 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 NA NA NA POINT (408164.3 4762321)
2 1 4.899391 NA NA NA POINT (405914.9 4767394)
3 2 4.434912 NA NA NA POINT (405724 4767706)
4 3 5.285548 NA NA NA POINT (406702.8 4769201)
5 4 5.295919 NA NA NA POINT (405392.3 4765598)
6 5 4.681277 NA NA NA POINT (405644.1 4762037)
7 6 3.364148 NA NA NA POINT (409183.1 4763057)
8 7 4.096946 NA NA NA POINT (407945.4 4770014)
9 8 4.689880 NA NA NA POINT (406341.4 4762603)
10 9 5.422257 NA NA NA POINT (404638.1 4767636)
The population data is at the coarser scale, so the lead ought to be averaged within a census tract. Once each lead measurement is joined to “TRACT”, the spatial data can by dropped.
lead_tracts <- lead %>%
sf$st_join(census_tracts) %>%
sf$st_set_geometry(NULL)
Two more steps to group the data by “TRACT” and average the lead concentrations within each “TRACT”.
lead_tracts <- lead %>%
sf$st_join(census_tracts) %>%
sf$st_set_geometry(NULL) %>%
group_by(TRACT) %>%
summarise(avg_ppm = mean(ppm))
To visualize the average lead concentration from soil samples within each cencus tract, merge the data frame to the sf
object on “TRACT”.
census_lead_tracts <- census_tracts %>%
inner_join(lead_tracts)
plot(census_lead_tracts['avg_ppm'],
pal = heat.colors)
A problem with this approach to aggregation is that it ignores autocorrelation. Points close to eachother tend to have similar values and shouldn’t be given equal weight in averages within a polygon.
Take a closer look at tract 5800 (in row 52), and notice that several low values are nearly stacked on top of eachother.
m <- import('mapview')
m$mapview(sf$st_geometry(census_tracts)) +
m$mapview(lead['ppm'])
The Semivariogram
A semivariogram quantifies the effect of distance on the correlation between values from different locations. On average, measurements of the same variable at two nearby locations are more similar (lower variance) than when those locations are distant.
The gstat library, a geospatial analogue to the stats library provides variogram estimation among several additional tools.
g <- import('gstat')
lead_xy <- read.csv('data/SYR_soil_PB.csv')
The empirical semivariogram shown below is a windowed average of the squared difference in lead concentrations beween sample points.
v_ppm <- g$variogram(
ppm ~ 1,
locations = ~ x + y,
data = lead_xy)
plot(v_ppm)
Fitting a model semivariogram tidies up the information about autocorrelation in the data, so we can use it for interpolation.
v_ppm_fit <- g$fit.variogram(
v_ppm,
model = g$vgm(1, "Sph", 900, 1))
plot(v_ppm, v_ppm_fit)
Kriging
The modeled semivariogram acts as a parameter for fitting a Guassian process, or Kriging. The steps to perform Kriging with the gstat library are
- Generate a modeled semivariogram
- Generate new locations for “predicted” values
- Call
krige
with the data, locations, and semivariogram
Generate a low resolution (for demonstration) grid of points overlaying the bounding box for the lead data and trim it down to the polygons of interest. Remember, that goal is aggregating lead concentrations within each census tract.
pred_ppm <- sf$st_make_grid(
lead, cellsize = 400,
what = 'centers')
idx <- unlist(
sf$st_intersects(census_tracts, pred_ppm))
pred_ppm <- pred_ppm[idx]
Map the result to verify generated points.
plot(census_tracts['POP2000'],
pal = cm.colors)
plot(pred_ppm,
pch = '.', add = TRUE)
Like the gstat
function, the krige
function doesn’t work seemlessly with
sf
objects, so again create a data frame with coordinates.
pred_ppm_xy <- data.frame(
do.call(rbind, pred_ppm))
names(pred_ppm_xy) <- c('x', 'y')
Almost done …
- Generate a modeled semivariogram
- Generate new locations for “predicted” values
- Call
krige
with the data, locations, and semivariogram
The first argument specifies the model for the means, which is constant according to our formula for lead concentrations. The result, like the input for locations, is a data frame with coordinates and a prediction for lead concentrations.
pred_ppm_xy <- g$krige(
ppm ~ 1,
locations = ~ x + y,
data = lead_xy,
newdata = pred_ppm_xy,
model = v_ppm_fit)
The same commands that created the sf
table for lead apply to creating a sf
table for the predicted concentrations.
pred_ppm <- sf$st_as_sf(
pred_ppm_xy,
coords = c('x', 'y'),
crs = sf$st_crs(lead))
And the same commands that joined lead concentrations to census tracts apply to joining the predicted concentrations in too.
pred_ppm_tracts <-
pred_ppm %>%
sf$st_join(census_tracts) %>%
sf$st_set_geometry(NULL) %>%
group_by(TRACT) %>%
summarise(pred_ppm = mean(var1.pred))
census_lead_tracts <-
census_lead_tracts %>%
inner_join(pred_ppm_tracts)
plot(census_lead_tracts[
c('pred_ppm', 'avg_ppm')],
pal = heat.colors)
The effect of paying attention to autocorrelation is subtle, but it is noticable and had the expected effect in tract 5800.
census_lead_tracts[52,]
Simple feature collection with 1 feature and 5 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 405633.1 ymin: 4762867 xmax: 406445.9 ymax: 4764711
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 1 x 6
TRACT POP2000 perc_hispa avg_ppm pred_ppm geometry
<int> <int> <dbl> <dbl> <dbl> <sf_geometry [m]>
1 5800 2715 0.0302 5.44 5.65 POLYGON ((406445.9 4762893,...
Regression
Continuing the theme that vector data is tabular data, the natural progression in statistical analysis is toward regression. Building a regression model requires making good assumptions about relationships in your data:
- between columns as independent and dependent variables
- between rows as more-or-less independent observations
The following model assumes an association (in the linear least-squares sense), between the hispanic population and lead concentrations and assumes independence of every census tract (i.e. row).
ppm.lm <- lm(pred_ppm ~ perc_hispa,
census_lead_tracts)
Is that model any good?
census_lead_tracts['lm.resid'] <- resid(ppm.lm)
plot(census_lead_tracts['lm.resid'])
Polygons close to eachother tend to have similar residuals: there is autocorrelation. It is tempting to ask for a semivariogram plot of the residuals, but that requires a precise definition of the distance between polygons. A favored alternative for quantifying autoregression in non-point feature collections is Moran’s I. This analog to Pearson’s correlation coeficient quantifies autocorrelation rather than cross-correlation.
Moran’s I is the correlation between all pairs of features, weighted to emphasize features that are close together. It does not dodge the problem of distance weighting, it actually adds flexibility, along with some norms.
sp <- import('sp')
sd <- import('spdep')
tracts <- as(
sf$st_geometry(census_tracts), 'Spatial')
tracts_nb <- sd$poly2nb(tracts)
The neighbors
variable is the network of features sharing a boundary point.
plot(census_lead_tracts['lm.resid'])
sd$plot.nb(tracts_nb,
sp$coordinates(tracts), add = TRUE)
Reshape the adjacency matrix into a list of neighbors with associated weights.
tracts_weight <- sd$nb2listw(tracts_nb)
Visualize correlation between the residuals and the weighted average of their
neighbors with moran.plot
from the spdep
sd$moran.plot(
census_lead_tracts[['lm.resid']],
tracts_weight,
labels = census_lead_tracts[['TRACT']],
pch = 19)
There are many ways to use geospatial information about tracts to impose assumptions about non-independence between observations in the regression. One approach is a Spatial Auto-Regressive (SAR) model, which regresses each value against the weighted average of neighbors.
ppm.sarlm <- sd$lagsarlm(
pred_ppm ~ perc_hispa,
data = census_lead_tracts,
tracts_weight,
tol.solve = 1.0e-30)
The Moran’s I plot of residuals shows less correlation; which means the SAR model’s assumption about spatial autocorrelation (i.e. between table rows) makes the rest of the model more plausible.
sd$moran.plot(
resid(ppm.sarlm),
tracts_weight,
labels = census_lead_tracts[['TRACT']],
pch = 19)
Feeling more confident in the model, we can now take a look at the regression coefficients.
summary(ppm.sarlm)
Call:
sd$lagsarlm(formula = pred_ppm ~ perc_hispa, data = census_lead_tracts,
listw = tracts_weight, tol.solve = 1e-30)
Residuals:
Min 1Q Median 3Q Max
-1.002799 -0.212714 0.074072 0.235047 0.870624
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.10395 0.45580 2.4220 0.01544
perc_hispa 1.27550 0.99076 1.2874 0.19795
Rho: 0.76278, LR test value: 23.183, p-value: 1.473e-06
Asymptotic standard error: 0.092139
z-value: 8.2786, p-value: 2.2204e-16
Wald statistic: 68.535, p-value: < 2.22e-16
Log likelihood: -31.49178 for lag model
ML residual variance (sigma squared): 0.15101, (sigma: 0.3886)
Number of observations: 57
Number of parameters estimated: 4
AIC: 70.984, (AIC for lm: 92.166)
LM test for residual autocorrelation
test value: 5.5923, p-value: 0.01804
Summary
Rapid tour of packages providing tools for manipulation and analysis of vector data in R.
- sf
st_sf
: make a simple features data frameread_sf
: read a file into a simple features data framest_join
: match tables on their geometry columns
- gstat
variogram
: distance lagged correlation for pointsfit.variogram
: autocorrelation modelkrige
: smooth interpolation over points
- sp: a low level package with useful
Spatial*
class objects - spdep
poly2nb
,nb2listw
: calculate spatial neighbors and their weightsmoran.plot
: infer autocorrelation from residual scatter plotslagsarlm
: SAR model estimation