Vector Operations in R
Lead Poisoning in Syracuse
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
Introduction
Vector data is a way to represent features on Earth’s surface, which can be points (e.g. the locations of ice cream shops), polygons (e.g. the boundaries of countries), or lines (e.g. streets or rivers). This lesson is a whirlwind tour through a set of tools in R that we can use to manipulate vector data and do data analysis on them to uncover spatial patterns.
The example dataset in this lesson is a set of measurements of soil lead levels in the city of Syracuse, New York, collected as part of a study investigating the relationship between background lead levels in the environment and lead levels in children’s blood. As a geospatial analyst and EPA consultant for the city of Syracuse, your task is to investigate the relationship between metal concentration (in particular lead) and population. In particular, research suggests higher concentration of metals in minorities.
Lesson Objectives
- Dive into scripting vector data operations
- Manipulate spatial feature collections like data frames
- Address autocorrelation in point data
- Understand the basics of spatial regression
Specific Achievements
- Read CSVs and shapefiles
- Intersect and dissolve geometries
- Smooth point data with Kriging
- Include autocorrelation in a regression on polygon data
Simple Features
In this section, we will perform multiple operations to explore the datasets,
including soil lead concentration data (points) and census data (polygons
representing block groups and census tracts).
We will manipulate the datasets provided to join information using common table keys,
spatially aggregate the block group data to the census tract level,
and visualize vector spatial data with plot
and ggplot
commands for sf objects.
A standardized set of geometric shapes are the essence of vector data. The sf package puts sets of shapes in a tabular structure so that we can manipulate and analyze them.
library(sf)
lead <- read.csv('data/SYR_soil_PB.csv')
We read in a data frame with point coordinates called lead
from an external CSV file,
then create an sf
object from the data frame.
We use st_as_sf()
with the coords
argument to specify
which columns represent the x
and y
coordinates of each point.
The CRS must be specified with the argument crs
via EPSG integer or proj4 string.
lead <- st_as_sf(lead,
coords = c('x', 'y'),
crs = 32618)
The lead
data frame now has a “simple feature column” called geometry
,
which st_as_sf
creates from a CRS and a geometry type.
Each element of the simple feature column is a simple feature geometry, here
created from the "x"
and "y"
elements of a given feature.
In this case, st_as_sf
creates a point geometry because we supplied a
single x
and y
coordinate for each feature, but vector features
can be points, lines, or polygons.
> lead
Simple feature collection with 3149 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS: EPSG:32618
First 10 features:
ID ppm geometry
1 0 3.890648 POINT (408164.3 4762321)
2 1 4.899391 POINT (405914.9 4767394)
3 2 4.434912 POINT (405724 4767706)
4 3 5.285548 POINT (406702.8 4769201)
5 4 5.295919 POINT (405392.3 4765598)
6 5 4.681277 POINT (405644.1 4762037)
7 6 3.364148 POINT (409183.1 4763057)
8 7 4.096946 POINT (407945.4 4770014)
9 8 4.689880 POINT (406341.4 4762603)
10 9 5.422257 POINT (404638.1 4767636)
geometry | description |
---|---|
st_point |
a single point |
st_linestring |
sequence of points connected by straight, non-self-intersecting line pieces |
st_polygon |
one [or more] sequences of points in a closed, non-self-intersecting ring [with holes] |
st_multi* |
sequence of * , either point , linestring , or polygon |
Now that our data frame is a simple feature object, calling
functions like print
and plot
will use methods introduced by the sf
package.
For example, the print
method we just used automatically shows the CRS and truncates the
number of records displayed.
Using the plot
method, the data are easily displayed as a map.
Here, the points are colored by lead concentration (ppm
).
plot(lead['ppm'])
For ggplot2 figures, use geom_sf
to draw maps. In the aes
mapping for feature collections,
the x
and y
variables are automatically assigned to the geometry
column,
while other attributes can be assigned to visual elements like fill
or color
.
library(ggplot2)
ggplot(data = lead,
mapping = aes(color = ppm)) +
geom_sf()
Feature Collections
More complicated geometries are usually not stored in CSV files, but they are
usually still read as tabular data. We will see that the similarity of feature
collections to non-spatial tabular data goes quite far; the usual data
manipulations done on tabular data work just as well on sf
objects.
Here, we read the boundaries of the Census block groups
in the city of Syracuse from a shapefile.
blockgroups <- st_read('data/bg_00')
Reading layer `bg_00' from data source `/nfs/public-data/training/bg_00' using driver `ESRI Shapefile'
Simple feature collection with 147 features and 3 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS: 32618
The geometry
column contains projected UTM coordinates of the polygon vertices.
Also note the table dimensions show that there are 147 features in the collection.
> blockgroups
Simple feature collection with 147 features and 3 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS: 32618
First 10 features:
BKG_KEY Shape_Leng Shape_Area geometry
1 360670001001 13520.233 6135183.6 POLYGON ((403476.4 4767682,...
2 360670003002 2547.130 301840.0 POLYGON ((406271.7 4770188,...
3 360670003001 2678.046 250998.4 POLYGON ((406730.3 4770235,...
4 360670002001 3391.920 656275.6 POLYGON ((406436.1 4770029,...
5 360670004004 2224.179 301085.7 POLYGON ((407469 4770033, 4...
6 360670004001 3263.257 606494.9 POLYGON ((408398.6 4769564,...
7 360670004003 2878.404 274532.3 POLYGON ((407477.4 4769773,...
8 360670004002 3605.653 331034.9 POLYGON ((407486 4769507, 4...
9 360670010001 2950.688 376786.4 POLYGON ((410704.4 4769103,...
10 360670010003 2868.260 265835.7 POLYGON ((409318.3 4769203,...
Simple feature collections are data frames.
> class(blockgroups)
[1] "sf" "data.frame"
The blockgroups
object is a data.frame
, but it also has the class attribute
of sf
. This additional class extends the data.frame
class in ways useful for
feature collection. For instance, the geometry
column becomes “sticky” in most
table operations, like subsetting. This means that the geometry
column is retained
even if it is not explicitly named.
> blockgroups[1:5, 'BKG_KEY']
Simple feature collection with 5 features and 1 field
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 402304.2 ymin: 4767421 xmax: 407469 ymax: 4771049
CRS: 32618
BKG_KEY geometry
1 360670001001 POLYGON ((403476.4 4767682,...
2 360670003002 POLYGON ((406271.7 4770188,...
3 360670003001 POLYGON ((406730.3 4770235,...
4 360670002001 POLYGON ((406436.1 4770029,...
5 360670004004 POLYGON ((407469 4770033, 4...
Table Operations
We can plot the polygon features using ggplot2 or do common data wrangling operations:
- plot
- merge or join
- “split-apply-combine” or group-by and summarize
> ggplot(blockgroups,
+ aes(fill = Shape_Area)) +
+ geom_sf()
Merging with a regular data frame is done by normal merging on non-spatial columns found in both tables.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
census <- read.csv('data/SYR_census.csv')
census <- mutate(census,
BKG_KEY = as.character(BKG_KEY)
)
Here, we read in a data frame called census
with demographic characteristics
of the Syracuse block groups.
As usual, there’s the difficulty that CSV files do not include metadata on data
types, which have to be set manually. We do this here by changing the BKG_KEY
column to character type using as.character()
.
Merge tables on a unique identifier (our primary key is "BKG_KEY"
), but let the
sf
object come first or its special attributes get lost.
The inner_join()
function from dplyr joins two data frames on a
common key specified with the by
argument, keeping only the rows from each data
frame with matching keys in the other data frame and discarding the rest.
census_blockgroups <- inner_join(
blockgroups, census,
by = c('BKG_KEY'))
> class(census_blockgroups)
[1] "sf" "data.frame"
The census data is now easily visualized as a map.
ggplot(census_blockgroups,
aes(fill = POP2000)) +
geom_sf()
Feature collections also cooperate with the common “split-apply-combine” sequence of steps in data manipulation.
- split – group the features by some factor
- apply – apply a function that summarizes each subset, including their geometries
- combine – return the results as columns of a new feature collection
census_tracts <- census_blockgroups %>%
group_by(TRACT) %>%
summarise(
POP2000 = sum(POP2000),
perc_hispa = sum(HISPANIC) / POP2000)
Here, we use the dplyr function group_by
to specify that we are doing an operation
on all block groups within a Census tract.
We can use summarise()
to calculate as many summary statistics as we want, each one separated by a comma.
Here POP2000
is the sum of all block group populations in each tract and perc_hispa
is the sum of the Hispanic
population divided by the total population (because now we have redefined POP2000
to be the population total).
The summarise()
operation automatically combines the block group polygons from each tract!
Read in the census tracts from a separate shapefile to confirm that the boundaries were dissolved as expected.
tracts <- st_read('data/ct_00')
Reading layer `ct_00' from data source `/nfs/public-data/training/ct_00' using driver `ESRI Shapefile'
Simple feature collection with 57 features and 7 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 401938.3 ymin: 4759734 xmax: 412486.4 ymax: 4771049
CRS: 32618
ggplot(census_tracts,
aes(fill = POP2000)) +
geom_sf() +
geom_sf(data = tracts,
color = 'red', fill = NA)
By default, the sticky geometries are summarized with st_union
. The
alternative st_combine
does not dissolve internal boundaries. Check
?summarise.sf
for more details.
Spatial Join
To summarize average lead concentration by census tracts, we need to spatially join the lead data with the census tract boundaries.
The data about lead contamination in soils is at points; the census information is for polygons. Combine the information from these tables by determining which polygon contains each point.
ggplot(census_tracts,
aes(fill = POP2000)) +
geom_sf() +
geom_sf(data = lead, color = 'red',
fill = NA, size = 0.1)
In the previous section, we performed a table join using a non-spatial data type. More specifically, we performed an equality join: records were merged wherever the join variable in the “left table” equalled the variable in the “right table”. Spatial joins operate on the “geometry” columns and require expanding beyond equality-based matches. Several different kinds of “intersections” can be specified that denote a successful match.
Before doing any spatial join, it is essential that both tables share a common CRS.
> st_crs(lead)
Coordinate Reference System:
User input: EPSG:32618
wkt:
PROJCS["WGS 84 / UTM zone 18N",
GEOGCS["WGS 84",
DATUM["WGS_1984",
SPHEROID["WGS 84",6378137,298.257223563,
AUTHORITY["EPSG","7030"]],
AUTHORITY["EPSG","6326"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,
AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4326"]],
PROJECTION["Transverse_Mercator"],
PARAMETER["latitude_of_origin",0],
PARAMETER["central_meridian",-75],
PARAMETER["scale_factor",0.9996],
PARAMETER["false_easting",500000],
PARAMETER["false_northing",0],
UNIT["metre",1,
AUTHORITY["EPSG","9001"]],
AXIS["Easting",EAST],
AXIS["Northing",NORTH],
AUTHORITY["EPSG","32618"]]
> st_crs(census_tracts)
Coordinate Reference System:
User input: 32618
wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
GEOGCS["GCS_WGS_1984",
DATUM["WGS_1984",
SPHEROID["WGS_84",6378137.0,298.257223563]],
PRIMEM["Greenwich",0.0],
UNIT["Degree",0.0174532925199433],
AUTHORITY["EPSG","4326"]],
PROJECTION["Transverse_Mercator"],
PARAMETER["False_Easting",500000.0],
PARAMETER["False_Northing",0.0],
PARAMETER["Central_Meridian",-75.0],
PARAMETER["Scale_Factor",0.9996],
PARAMETER["Latitude_Of_Origin",0.0],
UNIT["Meter",1.0],
AUTHORITY["EPSG","32618"]]
The st_join
function performs a left join using the geometries of the two
simple feature collections.
> st_join(lead, census_tracts)
Simple feature collection with 3149 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS: EPSG:32618
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 6102 2154 0.01578459 POINT (408164.3 4762321)
2 1 4.899391 3200 2444 0.10515548 POINT (405914.9 4767394)
3 2 4.434912 100 393 0.03053435 POINT (405724 4767706)
4 3 5.285548 700 1630 0.03190184 POINT (406702.8 4769201)
5 4 5.295919 3900 4405 0.23541430 POINT (405392.3 4765598)
6 5 4.681277 6000 3774 0.04027557 POINT (405644.1 4762037)
7 6 3.364148 5602 2212 0.08318264 POINT (409183.1 4763057)
8 7 4.096946 400 3630 0.01019284 POINT (407945.4 4770014)
9 8 4.689880 6000 3774 0.04027557 POINT (406341.4 4762603)
10 9 5.422257 2100 1734 0.09169550 POINT (404638.1 4767636)
- Only the “left” geometry is preserved in the output
- Matching defaults to
st_intersects
, but permits any predicate function (usejoin
argument to specify)
> st_join(lead, census_tracts,
+ join = st_contains)
Simple feature collection with 3149 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 401993 ymin: 4759765 xmax: 412469 ymax: 4770955
CRS: EPSG:32618
First 10 features:
ID ppm TRACT POP2000 perc_hispa geometry
1 0 3.890648 NA NA NA POINT (408164.3 4762321)
2 1 4.899391 NA NA NA POINT (405914.9 4767394)
3 2 4.434912 NA NA NA POINT (405724 4767706)
4 3 5.285548 NA NA NA POINT (406702.8 4769201)
5 4 5.295919 NA NA NA POINT (405392.3 4765598)
6 5 4.681277 NA NA NA POINT (405644.1 4762037)
7 6 3.364148 NA NA NA POINT (409183.1 4763057)
8 7 4.096946 NA NA NA POINT (407945.4 4770014)
9 8 4.689880 NA NA NA POINT (406341.4 4762603)
10 9 5.422257 NA NA NA POINT (404638.1 4767636)
This command would match on polygons within (contained by) each point, which is the wrong way around. Stick with the default.
The population data is at the coarser scale, so the lead concentration ought to be averaged
within a census tract. Once each lead measurement is joined to TRACT
, the
spatial data can be dropped using st_drop_geometry()
.
lead_tracts <- lead %>%
st_join(census_tracts) %>%
st_drop_geometry()
Two more steps are needed: group_by()
the data by TRACT
and summarise()
to average the lead concentrations within each TRACT
.
lead_tracts <- lead %>%
st_join(census_tracts) %>%
st_drop_geometry() %>%
group_by(TRACT) %>%
summarise(avg_ppm = mean(ppm))
To visualize the average lead concentration from soil samples within each census tract,
merge the data frame to the sf
object on the TRACT
column.
Again, we use inner_join
to keep all rows with matching keys between census_tracts
and lead_tracts
.
census_lead_tracts <- census_tracts %>%
inner_join(lead_tracts)
Make a plot of avg_ppm
for each census tract.
ggplot(census_lead_tracts,
aes(fill = avg_ppm)) +
geom_sf() +
scale_fill_gradientn(
colors = heat.colors(7))
A problem with this approach to aggregation is that it ignores autocorrelation. Points close to each other tend to have similar values and shouldn’t be given equal weight in averages within a polygon.
Take a closer look at tract 5800 (around 43.025N, 76.152W), and notice that several low values are nearly stacked on top of each other.
library(mapview)
mapview(lead['ppm'],
map.types = 'OpenStreetMap',
viewer.suppress = TRUE) +
mapview(census_tracts,
label = census_tracts$TRACT,
legend = FALSE)