Vector Operations in R
Lead Poisoning in Syracuse
Lesson 1 with Elizabeth Green
Lesson Objectives
- Dive into scripting vector data operations
- Learn to handle feature collections like tables
- Address autocorrelation in point data
- Glimpse a spatial regression framework
Specific Achievements
- Read CSVs and shapefiles
- Intersect and dissolve geometries
- Smooth point data with Kriging
- Include autocorrelation in a regression
Simple Features
A standardized set of geometric shapes are the essense of vector data, but these are next to useless outside a tabular structure.
library(sf)
lead <- read.csv('data/SYR_soil_PB.csv')
lead[['geometry']] <- st_sfc(
st_point(),
crs = 32618)
The lead
table now has the “simple feature column”, which st_sfc
creates from a CRS and a geometry type, but each point is “EMPTY”. The empty gemometry is equivalent to a NA value.
> head(lead)
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT EMPTY
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
geometry | description |
---|---|
st_point |
a single point |
st_linestring |
sequence of points connected by straight, non-self intersecting line pieces |
st_polygon |
one [or more] sequences of points in a closed, non-self intersecting ring [with holes] |
st_multi* |
sequence of * , either point , linestring , or polygon |
Each element of the simple feature column is a simple feature geometry, here created from the “x” and “y” elements of a given feature.
lead[[1, 'geometry']] <- st_point(
c(x = lead[[1, 'x']],
y = lead[[1, 'y']]),
dim = 'XY')
> head(lead)
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT (408164.3 4762321)
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
The whole data frame must be cast to a simple feature object, which causes
functions like print
and plot
to use methods introduced by the sf
library.
lead <- st_sf(lead)
For example, the print
method automatically shows the CRS and truncates the
number of records displayed.
> lead
Simple feature collection with 3149 features and 4 fields (with 3149 geometries empty)
geometry type: POINT
dimension: XY
bbox: xmin: 408164.3 ymin: 4762321 xmax: 408164.3 ymax: 4762321
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
x y ID ppm geometry
1 408164.3 4762321 0 3.890648 POINT (408164.3 4762321)
2 405914.9 4767394 1 4.899391 POINT EMPTY
3 405724.0 4767706 2 4.434912 POINT EMPTY
4 406702.8 4769201 3 5.285548 POINT EMPTY
5 405392.3 4765598 4 5.295919 POINT EMPTY
6 405644.1 4762037 5 4.681277 POINT EMPTY
7 409183.1 4763057 6 3.364148 POINT EMPTY
8 407945.4 4770014 7 4.096946 POINT EMPTY
9 406341.4 4762603 8 4.689880 POINT EMPTY
10 404638.1 4767636 9 5.422257 POINT EMPTY
Naturally, there is a shortcut to creating an sf
object from a data frame with
point coordinates. The CRS must be specified via EPSG integer or proj4 string.
lead <- read.csv('data/SYR_soil_PB.csv')
lead <- st_as_sf(lead,
coords = c('x', 'y'),
crs = 32618)
> lead
Simple feature collection with 3149 features and 2 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm geometry
1 0 3.890648 POINT (408164.3 4762321)
2 1 4.899391 POINT (405914.9 4767394)
3 2 4.434912 POINT (405724 4767706)
4 3 5.285548 POINT (406702.8 4769201)
5 4 5.295919 POINT (405392.3 4765598)
6 5 4.681277 POINT (405644.1 4762037)
7 6 3.364148 POINT (409183.1 4763057)
8 7 4.096946 POINT (407945.4 4770014)
9 8 4.689880 POINT (406341.4 4762603)
10 9 5.422257 POINT (404638.1 4767636)
Now that table is an sf
object, the data are easilly displayed as a map.
plot(lead['ppm'])
For ggplot2 figures, use geom_sf
to draw maps. In the aes
mapping for feature collections, the “x” and “y” variables are automatically assigned to the sticky “geometry” column, while other attributes can be assigned to visual elements like fill
or color
.
library(ggplot2)
ggplot(data = lead,
mapping = aes(color = ppm)) +
geom_sf()
Feature Collections
More complicated geometries are usually not stored in CSV files, but they are
usually still read as tabular data. We will see that the similarity of feature
collections to non-spatial tabular data goes quite far; the usual data
manipulations done on tabular data work just as well on sf
objects.
blockgroups <- read_sf('data/bg_00')
Confirm that the coordinates in the geometry column are the correct UTMs.
> blockgroups
Simple feature collection with 147 features and 3 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 147 x 4
BKG_KEY Shape_Leng Shape_Area geometry
<chr> <dbl> <dbl> <POLYGON [m]>
1 36067000… 13520. 6135184. ((403476.4 4767682, 403356.7 4767804, 4…
2 36067000… 2547. 301840. ((406271.7 4770188, 406186.1 4770270, 4…
3 36067000… 2678. 250998. ((406730.3 4770235, 406687.8 4770205, 4…
4 36067000… 3392. 656276. ((406436.1 4770029, 406340 4769973, 406…
5 36067000… 2224. 301086. ((407469 4770033, 407363.9 4770035, 407…
6 36067000… 3263. 606495. ((408398.6 4769564, 408283.1 4769556, 4…
7 36067000… 2878. 274532. ((407477.4 4769773, 407401 4769767, 407…
8 36067000… 3606. 331035. ((407486 4769507, 407443.5 4769504, 407…
9 36067001… 2951. 376786. ((410704.4 4769103, 410625.2 4769100, 4…
10 36067001… 2868. 265836. ((409318.3 4769203, 409299.6 4769535, 4…
# … with 137 more rows
Also note the table dimensions reveal 147 features in the collection.
> dim(blockgroups)
[1] 147 4
Simple feature collections are data frames.
> class(blockgroups)
[1] "sf" "tbl_df" "tbl" "data.frame"
The blockgroups
object is a data.frame
, but it also has the class attribute
of sf
. This additional class extends the data.frame
class in ways useful for
feature collection. For instance, the geometry column becomes “sticky” in most
table operations, like subsetting.
> blockgroups[1:5, 'BKG_KEY']
Simple feature collection with 5 features and 1 field
geometry type: POLYGON
dimension: XY
bbox: xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 5 x 2
BKG_KEY geometry
<chr> <POLYGON [m]>
1 3606700010… ((403476.4 4767682, 403356.7 4767804, 403117.2 4768027, 4028…
2 3606700030… ((406271.7 4770188, 406186.1 4770270, 406107.9 4770345, 4060…
3 3606700030… ((406730.3 4770235, 406687.8 4770205, 406650.9 4770179, 4066…
4 3606700020… ((406436.1 4770029, 406340 4769973, 406307.2 4769954, 406206…
5 3606700040… ((407469 4770033, 407363.9 4770035, 407233.4 4770036, 407235…
Table Operations
- plot
- merge or join
- “split-apply-combine” or group-by and summarize
> ggplot(blockgroups,
+ aes(fill = Shape_Area)) +
+ geom_sf()
Merging with a regular data frame is done by normal merging on non-spatial columns found in both tables.
census <- read.csv('data/SYR_census.csv')
census <- within(census, {
BKG_KEY <- as.character(BKG_KEY)
})
As usual, there’s the difficulty that CSV files do not include metadata on data types, which have to be set manually.
Merge tables on a unique identifier (our primary key is “BKG_KEY”), but let the “sf” object come first or its special attributes get lost.
library(dplyr)
census_blockgroups <- inner_join(
blockgroups, census,
by = c('BKG_KEY'))
> class(census_blockgroups)
[1] "sf" "tbl_df" "tbl" "data.frame"
The census data is now easily vizualized as a map.
ggplot(census_blockgroups,
aes(fill = POP2000)) +
geom_sf()
Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.
- split – group the features by some factor
- apply – apply a function that summarizes each subset, including their geometries
- combine – return the results as columns of a new feature collection
census_tracts <- census_blockgroups %>%
group_by(TRACT) %>%
summarise(
POP2000 = sum(POP2000),
perc_hispa = sum(HISPANIC) / POP2000)
Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.
tracts <- read_sf('data/ct_00')
ggplot(census_tracts,
aes(fill = POP2000)) +
geom_sf() +
geom_sf(data = tracts,
color = 'red', fill = NA)
By default, the sticky geometries are summarized with st_union
. The
alternative st_combine
does not dissolve internal boundaries. Check
?summarise.sf
for more details.
Spatial Join
The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.
ggplot(census_tracts,
aes(fill = POP2000)) +
geom_sf() +
geom_sf(data = lead, color = 'red',
fill = NA, size = 0.1)
In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality based matches. Several different kinds of “intersections” can be specified that denote a successful match.
Before doing any spatial join, it is essential that both tables share a common CRS.
> st_crs(lead)
Coordinate Reference System:
EPSG: 32618
proj4string: "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"
> st_crs(census_tracts)
Coordinate Reference System:
EPSG: 32618
proj4string: "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"
The st_join
function performs a left join using the geometries of the two
simple feature collections.
> st_join(lead, census_tracts)
Simple feature collection with 3149 features and 5 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 6102 2154 0.01578459 POINT (408164.3 4762321)
2 1 4.899391 3200 2444 0.10515548 POINT (405914.9 4767394)
3 2 4.434912 100 393 0.03053435 POINT (405724 4767706)
4 3 5.285548 700 1630 0.03190184 POINT (406702.8 4769201)
5 4 5.295919 3900 4405 0.23541430 POINT (405392.3 4765598)
6 5 4.681277 6000 3774 0.04027557 POINT (405644.1 4762037)
7 6 3.364148 5602 2212 0.08318264 POINT (409183.1 4763057)
8 7 4.096946 400 3630 0.01019284 POINT (407945.4 4770014)
9 8 4.689880 6000 3774 0.04027557 POINT (406341.4 4762603)
10 9 5.422257 2100 1734 0.09169550 POINT (404638.1 4767636)
- Only the “left” geometry is preserved in the output
- Matching defaults to
st_intersects
,but permits any predicate function
> st_join(lead, census_tracts,
+ join = st_contains)
Simple feature collection with 3149 features and 5 fields
geometry type: POINT
dimension: XY
bbox: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 NA NA NA POINT (408164.3 4762321)
2 1 4.899391 NA NA NA POINT (405914.9 4767394)
3 2 4.434912 NA NA NA POINT (405724 4767706)
4 3 5.285548 NA NA NA POINT (406702.8 4769201)
5 4 5.295919 NA NA NA POINT (405392.3 4765598)
6 5 4.681277 NA NA NA POINT (405644.1 4762037)
7 6 3.364148 NA NA NA POINT (409183.1 4763057)
8 7 4.096946 NA NA NA POINT (407945.4 4770014)
9 8 4.689880 NA NA NA POINT (406341.4 4762603)
10 9 5.422257 NA NA NA POINT (404638.1 4767636)
This command would match on polygons within (contained by) each point, which is the wrong way around. Stick with the default.
The population data is at the coarser scale, so the lead ought to be averaged within a census tract. Once each lead measurement is joined to “TRACT”, the spatial data can by dropped.
lead_tracts <- lead %>%
st_join(census_tracts) %>%
st_drop_geometry()
Two more steps to group the data by “TRACT” and average the lead concentrations within each “TRACT”.
lead_tracts <- lead %>%
st_join(census_tracts) %>%
st_drop_geometry() %>%
group_by(TRACT) %>%
summarise(avg_ppm = mean(ppm))
To visualize the average lead concentration from soil samples within each cencus tract, merge the data frame to the sf
object on “TRACT”.
census_lead_tracts <- census_tracts %>%
inner_join(lead_tracts)
Make a plot of avg_ppm
for each census tract.
ggplot(census_lead_tracts,
aes(fill = avg_ppm)) +
geom_sf() +
scale_fill_gradientn(
colors = heat.colors(7))
A problem with this approach to aggregation is that it ignores autocorrelation. Points close to eachother tend to have similar values and shouldn’t be given equal weight in averages within a polygon.
Take a closer look at tract 5800 (around 43.025N, 76.152W), and notice that several low values are nearly stacked on top of eachother.
library(mapview)
mapview(census_tracts,
legend = FALSE,
viewer.suppress = TRUE) +
mapview(lead['ppm'])
The Semivariogram
A semivariogram quantifies the effect of distance on the correlation between values from different locations. On average, measurements of the same variable at two nearby locations are more similar (lower variance) than when those locations are distant.
The gstat library, a geospatial analogue to the stats library provides variogram estimation among several additional tools.
library(gstat)
lead_xy <- read.csv('data/SYR_soil_PB.csv')
The empirical semivariogram shown below is a windowed average of the squared difference in lead concentrations beween sample points.
v_ppm <- variogram(
ppm ~ 1,
locations = ~ x + y,
data = lead_xy)
plot(v_ppm)
Fitting a model semivariogram tidies up the information about autocorrelation in the data, so we can use it for interpolation.
v_ppm_fit <- fit.variogram(
v_ppm,
model = vgm(1, "Sph", 900, 1))
plot(v_ppm, v_ppm_fit)
Kriging
The modeled semivariogram acts as a parameter when performing Guassian process regression, commonly known as kriging. The steps to perform Kriging with the gstat library are:
- Generate a modeled semivariogram
- Generate new locations for “predicted” values
- Call
krige
with the data, locations for prediction, and semivariogram
Generate a low resolution (for demonstration) grid of points overlaying the bounding box for the lead data and trim it down to the polygons of interest. Remember, that goal is aggregating lead concentrations within each census tract.
pred_ppm <- st_make_grid(
lead, cellsize = 400,
what = 'centers')
idx <- unlist(
st_intersects(census_tracts, pred_ppm))
pred_ppm <- pred_ppm[idx]
Map the result to verify generated points.
ggplot(census_tracts,
aes(fill = POP2000)) +
geom_sf() +
geom_sf(data = pred_ppm,
color = 'red', fill = NA)
Almost done …
- Generate a modeled semivariogram
- Generate new locations for “predicted” values
- Call
krige
with the data, locations, and semivariogram
The first argument specifies the model for the means, which is constant according to our
formula for lead concentrations. The observed ppm values are in the locations
data.frame along with the point geometry. The result is a data frame with predictions for lead concentrations at the points in newdata
.
pred_ppm <- krige(
formula = ppm ~ 1,
locations = lead,
newdata = pred_ppm,
model = v_ppm_fit)
And the same commands that joined lead concentrations to census tracts apply to joining the predicted concentrations in too.
pred_ppm_tracts <-
pred_ppm %>%
st_join(census_tracts) %>%
st_drop_geometry() %>%
group_by(TRACT) %>%
summarise(pred_ppm = mean(var1.pred))
census_lead_tracts <-
census_lead_tracts %>%
inner_join(pred_ppm_tracts)
The predictions should be, and are, close to the original averages with deviations that correct for autocorrelation.
ggplot(census_lead_tracts,
aes(x = pred_ppm, y = avg_ppm)) +
geom_point()
The effect of paying attention to autocorrelation is subtle, but it is noticable and had the expected effect in tract 5800. The pred_ppm value is a little higher than the average.
> census_lead_tracts[52,]
Simple feature collection with 1 feature and 5 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 405633.1 ymin: 4762867 xmax: 406445.9 ymax: 4764711
epsg (SRID): 32618
proj4string: +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs
# A tibble: 1 x 6
TRACT POP2000 perc_hispa avg_ppm pred_ppm geometry
<int> <int> <dbl> <dbl> <dbl> <POLYGON [m]>
1 5800 2715 0.0302 5.44 5.53 ((406445.9 4762893, 406017.5 4…
Regression
Continuing the theme that vector data is tabular data, the natural progression in statistical analysis is toward regression. Building a regression model requires making good assumptions about relationships in your data:
- between columns as independent and dependent variables
- between rows as more-or-less independent observations
The following model assumes an association (in the linear least-squares sense), between the hispanic population and lead concentrations and assumes independence of every census tract (i.e. row).
ppm.lm <- lm(pred_ppm ~ perc_hispa,
census_lead_tracts)
Is that model any good?
census_lead_tracts <- census_lead_tracts %>%
mutate(lm.resid = resid(ppm.lm))
plot(census_lead_tracts['lm.resid'])
Polygons close to eachother tend to have similar residuals: there is autocorrelation. It is tempting to ask for a semivariogram plot of the residuals, but that requires a precise definition of the distance between polygons. A favored alternative for quantifying autoregression in non-point feature collections is Moran’s I. This analog to Pearson’s correlation coeficient quantifies autocorrelation rather than cross-correlation.
Moran’s I is the correlation between all pairs of features, weighted to emphasize features that are close together. It does not dodge the problem of distance weighting, but provides default assumptions for how to do it.
library(sp)
library(spdep)
tracts <- as(
st_geometry(census_tracts), 'Spatial')
tracts_nb <- poly2nb(tracts)
The neighbors
variable is the network of features sharing a boundary point.
plot(census_lead_tracts['lm.resid'],
reset = FALSE)
plot.nb(tracts_nb, coordinates(tracts),
add = TRUE)
Reshape the adjacency matrix into a list of neighbors with associated weights.
tracts_weight <- nb2listw(tracts_nb)
Visualize correlation between the residuals and the weighted average of their
neighbors with moran.plot
from the spdep. The positive trend line is consistent with the earlier observation that features in close proximity have similar residuals.
moran.plot(
census_lead_tracts[['lm.resid']],
tracts_weight,
labels = census_lead_tracts[['TRACT']],
pch = 19)
There are many ways to use geospatial information about tracts to impose assumptions about non-independence between observations in the regression. One approach is a Spatial Auto-Regressive (SAR) model, which regresses each value against the weighted average of neighbors.
ppm.sarlm <- lagsarlm(
pred_ppm ~ perc_hispa,
data = census_lead_tracts,
tracts_weight,
tol.solve = 1.0e-30)
The Moran’s I plot of residuals now shows less correlation; which means the SAR model’s assumption about spatial autocorrelation (i.e. between table rows) makes the rest of the model more plausible.
moran.plot(
resid(ppm.sarlm),
tracts_weight,
labels = census_lead_tracts[['TRACT']],
pch = 19)
Feeling more confident in the model, we can now take a look at the regression coefficients and overall model fit.
> summary(ppm.sarlm)
Call:lagsarlm(formula = pred_ppm ~ perc_hispa, data = census_lead_tracts,
listw = tracts_weight, tol.solve = 1e-30)
Residuals:
Min 1Q Median 3Q Max
-0.992669 -0.210942 0.037845 0.246989 0.888440
Type: lag
Coefficients: (asymptotic standard errors)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.16047 0.46978 2.4702 0.0135
perc_hispa 1.45285 0.95717 1.5179 0.1291
Rho: 0.7499, LR test value: 22.434, p-value: 2.175e-06
Asymptotic standard error: 0.095079
z-value: 7.8871, p-value: 3.1086e-15
Wald statistic: 62.207, p-value: 3.1086e-15
Log likelihood: -29.16485 for lag model
ML residual variance (sigma squared): 0.14018, (sigma: 0.37441)
Number of observations: 57
Number of parameters estimated: 4
AIC: 66.33, (AIC for lm: 86.764)
LM test for residual autocorrelation
test value: 4.5564, p-value: 0.032797
Summary
Rapid tour of packages providing tools for manipulation and analysis of vector data in R.
- sf
st_sf
: make a simple features data frameread_sf
: read a file into a simple features data framest_join
: match tables on their geometry columns
- gstat
variogram
: distance lagged correlation for pointsfit.variogram
: autocorrelation modelkrige
: smooth interpolation over points
- sp: a low level package with useful
Spatial*
class objects - spdep
poly2nb
,nb2listw
: calculate spatial neighbors and their weightsmoran.plot
: infer autocorrelation from residual scatter plotslagsarlm
: SAR model estimation
If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.
# Nothing here yet!