Geospatial Packages in R

Lesson 9 with Benoît Parmentier

Objectives for this lesson
A note on software
Working with raster data
Mixing raster and vectors: prelude
Links & References
Solutions

Objectives for this lesson

Meet the key R packages for geospatial analysis
Learn basic wrangling of vector and raster data
Distinguish “scriptable” tasks from desktop GIS needs

Specific achievements

Load and plot vector layers (points, lines and polygons)
Filter vector layers based on associated data or another layer (overlay)
Read projections and transform coordinates
Perform geometric operations (union, intersection and buffering) on polygon layers
Load, subset and plot raster layers (grids of pixels)
Filter (mask) and aggregate raster pixels
Extract raster values based on a vectors

Top of Section

A note on software

The key R packages for this lesson are

sf
rgdal

In addition to the common case of depending on other R packages, these have dependencies on system libraries. System libraries cannot be installed by R’s install.packages() and must be aquired separately. The good people at the Open Source Geospatial Foundation make it free and easy to add the critical system libraries:

GDAL for read/write in geospatial data formats
GEOS for geometry operations
PROJ.4 for cartographic projections

Importing vector data

The US Census website distributes county polygons (and much more) that are provided with the handouts. The sf package reads shapefiles (“.shp”) and most other vector data:

library(sf)
library(rgdal)

shp <- 'data/cb_2016_us_county_5m'
counties <- st_read(shp, stringsAsFactors = FALSE)

The counties object is a data.frame that includes a sfc, which stands for “simple feature column”. This special column is usually called “geometry” or “geom”.

head(counties)

Simple feature collection with 6 features and 9 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -114.7556 ymin: 29.26116 xmax: -81.10192 ymax: 38.77443
epsg (SRID):    4269
proj4string:    +proj=longlat +datum=NAD83 +no_defs
  STATEFP COUNTYFP COUNTYNS       AFFGEOID GEOID      NAME LSAD
1      04      015 00025445 0500000US04015 04015    Mohave   06
2      12      035 00308547 0500000US12035 12035   Flagler   06
3      20      129 00485135 0500000US20129 20129    Morton   06
4      28      093 00695770 0500000US28093 28093  Marshall   06
5      29      510 00767557 0500000US29510 29510 St. Louis   25
6      35      031 00929107 0500000US35031 35031  McKinley   06
        ALAND    AWATER                       geometry
1 34475567011 387344307 MULTIPOLYGON(((-114.755618 ...
2  1257365642 221047161 MULTIPOLYGON(((-81.52366 29...
3  1889993251    507796 MULTIPOLYGON(((-102.041952 ...
4  1828989833   9195190 MULTIPOLYGON(((-89.72432442...
5   160458044  10670040 MULTIPOLYGON(((-90.318212 3...
6 14116799068  14078537 MULTIPOLYGON(((-109.046481 ...

Geometry types

Like any data.frame column, the geometry column is comprised of a single data type. The “MULTIPOLYGON” is just one of several standard geometric data types.

Common Types	Description
POINT	zero-dimensional geometry containing a single point
LINESTRING	sequence of points connected by straight, non-self intersecting line pieces; one-dimensional geometry
POLYGON	sequence of points in closed, non-intersecting rings; the first denotes the exterior ring, any subsequent rings denote holes
MULTI*	set of * (POINT, LINESTRING, or POLYGON)

The spatial data types are built upon eachother in a logical way: lines are built from points, polygons are built from lines, and so on.

Coordinate Reference System (CRS)

A key feature of a geospatial data type is its associated CRS, stored as both an EPSG ID and a PROJ.4 string.

st_crs(counties)

$epsg
[1] 4269

$proj4string
[1] "+proj=longlat +datum=NAD83 +no_defs"

attr(,"class")
[1] "crs"

Bounding Box

A bounding box for all or any subset of records is generated by st_bbox()

st_bbox(counties)

      xmin       ymin       xmax       ymax 
-179.14734  -14.55255  179.77847   71.35256

library(dplyr)
counties_md <- filter(counties, STATEFP == '24')

st_bbox(counties_md)

     xmin      ymin      xmax      ymax 
-79.48765  37.91172 -75.04894  39.72312

Grid

A bounding box summarizes the limits, but is not itself a geometry (not a POINT or POLYGON), even though it has a CRS attribute.

st_crs(st_bbox(counties_md))

$epsg
[1] 4269

$proj4string
[1] "+proj=longlat +datum=NAD83 +no_defs"

attr(,"class")
[1] "crs"

A rectangular grid made over a sf object is a geometry—by default, a POLYGON.

grid_md <- st_make_grid(counties_md, n = 4)

grid_md

Geometry set for 16 features 
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: -79.48765 ymin: 37.91172 xmax: -75.04894 ymax: 39.72312
epsg (SRID):    4269
proj4string:    +proj=longlat +datum=NAD83 +no_defs
First 5 geometries:

POLYGON((-79.487651 37.911717, -78.377973 37.91...

POLYGON((-78.377973 37.911717, -77.268295 37.91...

POLYGON((-77.268295 37.911717, -76.158617 37.91...

POLYGON((-76.158617 37.911717, -75.048939 37.91...

POLYGON((-79.487651 38.36456825, -78.377973 38....

Plot Layers

Spatial objects defined by sf are compatible with the plot function. Setting the plot parameter add = TRUE allows an existing plot to serve as a layer underneath the new one, so long as the CRS lines up.

plot(grid_md)
plot(counties_md, add = TRUE)

plot of chunk plot_counties

Create geometry

Instead of reading a shapefile, we can build spatial objects from coordinates. Here’s a sfc object with a single “POINT”, corresponding to SESYNC’s postition in WGS84 degrees lat, lon.

sesync <- st_sfc(
    st_point(c(-76.503394, 38.976546)),
    crs = 4326)

The CRS for SESYNC is not identical to the CRS for counties, but they are imperceptably different at the scale shown. Note that plot won’t prevent you from layering up geometries with different coordinate systems: you must safegaurd your own plots from this mistake.

counties_md <- st_transform(counties_md, crs = st_crs(sesync))
plot(counties_md$geometry)
plot(sesync, col = "green", pch = 20, add = TRUE)

plot of chunk plot_point

The arguments col and pch are graphical parameters used in base R, see ?par.

Subsetting vector data

An object created with st_read is a data.frame, which is why the dplyr function filter used above on the non-geospatial column named “STATEFP” worked normally. The equivalent of a “filter” operation on the “geometry” column is called a spatial “overlay”. It can be seen as a type of subsetting based on spatial (rather than numeric or string) matching. Matching is implemented with functions like st_within(x, y)

st_within(sesync, counties_md)

[[1]]
[1] 5

The output implies that the 1st (and only) point in sesync is within the 5th element of counties_md.

Question: What was the message issued by the last command all about?
Answer: It is a reminder that all geometric calculations are performed as if the coordinates (in this case longitutde and latitude) are Cartesian x,y coordinates.

The overlay functions in the sf package follow the pattern st_predicate(x, y) and perform the test “x [is] predicate y”. Some key examples are:

st_intersects	boundary or interior of x intersects boundary or interior of y
st_within	interior and boundary of x do not intersect exterior of y
st_contains	y is within x
st_overlaps	interior of x intersects interior of y
st_equals	x has the same interior and boundary as y

Exercise 1

Produce a map of Maryland counties with the county that contains SESYNC colored in red.

View solution

Coordinate transformations

For the next part of this lesson, we import a new polygon layer corresponding to the 1:250k map of US hydrological units (HUC) downloaded from the United States Geological Survey.

shp <- 'data/huc250k'
huc <- st_read(shp, stringsAsFactors = FALSE)

Compare the coordinate reference systems of counties and huc, as given by their Proj4 strings.

st_crs(counties_md)$proj4string

[1] "+proj=longlat +datum=WGS84 +no_defs"

st_crs(huc)$proj4string

[1] "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD27 +units=m +no_defs"

The Census data uses unprojected (longitude, latitude) coordinates, but huc is in an Albers equal-area projection (indicated as “+proj=aea”).

Other parameters differ between the two coordinate systems, such as the “datum”, which indicates the standard by which the irregular surface of the Earth is approximated by an ellipsoid. The function st_transform() converts a sfc between coordinate reference systems, specified with the parameter crs = x. A numeric x must be a valid EPSG code; a character x is interpretted as a PROJ.4 string.

Here is a slightly different Albers equal-area projection:

prj <- '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs'

Use st_transform() to assign the two layers to a common projection string (prj). This is a lot of work: it has to re-calculate new coordinates for every polygon in the sfc.

counties_md <- st_transform(counties_md, crs = prj)
huc <- st_transform(huc, crs = prj)
sesync <- st_transform(sesync, crs = prj)
plot(counties_md$geometry)
plot(huc, border = 'blue', add = TRUE)
plot(sesync, col = 'green', pch = 20, add = TRUE)

plot of chunk plot_over

Geometric operations on vector layers

The data for a map of waterhsed boundaries within the state of MD is all here; in the country-wide huc and in the state boundary “surrounding” all of counties_md. To get just the huc in a MD outline:

remove the internal county boundaries within the state
clip the hydrological areas to their intersection with the state

The first step is a spatial union operation: we want the resulting object to combine the area covered by all the multipolygons in counties_md. To perform a union of all sub-geometries in a single sfc, we use the st_union() function with a single argument.

state_md <- st_union(counties_md)
plot(state_md)

plot of chunk st_union

The output, state_md, is a new sfc that is no longer a column of a data frame. Tabular data can’t safely survive a spatial union and is discarded.

The second step is a spatial intersection, since we want to limit the polygons to areas covered by both huc and state_md. The st_intersection() function intersects its first argument with the second.

huc_md <- st_intersection(huc, state_md)
plot(huc_md, border = 'blue', col = NA, add = TRUE)

plot of chunk st_ntersect

The individual hydrological units are preserved but any part of them (or any whole polygon) lying outside the state_md polygon is cut from the output. The attribute data remains in the corresponding records of the data.frame, but (as warned) has not been updated. For example, the “AREA” attribute of any clipped HUC does not reflect the new polygon.

The GEOS library provides many functions dealing with distances and areas. Many of these are accessible through the sf package, including:

st_buffer: to create a buffer of specific width around a geometry
st_distance: to calculate the shortest distance between geometries
st_area: to calculate the area of polygons

Keep in mind that all these functions use planar geometry equations and thus become less precise over larger distances, where the Earth’s curvature is noticeable. To calculate geodesic distances that account for that curvature, checkout the geosphere package.

Exercise 2

Use st_buffer to create a 5km buffer around the state_md border and plot it as a dotted line (plot(..., lty = 'dotted')) over the true state border. Hint: check the layer’s units with st_crs() and express any distance in those units.

View solution

Top of Section

Working with raster data

Raster data is a matrix or cube with additional spatial metadata (e.g. extent, resolution, and projection) that allow its values to be mapped onto geographical space. The raster package provides the eponymous raster() function for reading the many formats of such data.

The National Land Cover Database is ‘.GRD’ format data, a lot of it. The file provided in this lesson is cropped and reduced to a lower resolution in order to speed processing.

library(raster)
nlcd <- raster("data/nlcd_agg.grd")

By default, raster data is not loaded into working memory, as you can confirm by checking the R object size with object.size(nlcd). This means that unlike most analyses in R, you can actually process raster datasets larger than the RAM available on your computer; the raster package automatically loads pieces of the data and computes on each of them in sequence.

The default print method for a raster object is a summary of metadata contained in the raster file.

nlcd

class       : RasterLayer 
dimensions  : 2514, 3004, 7552056  (nrow, ncol, ncell)
resolution  : 150, 150  (x, y)
extent      : 1394535, 1845135, 1724415, 2101515  (xmin, xmax, ymin, ymax)
coord. ref. : +proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=23 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
data source : /home/Ian/handouts/build/geospatial-packages-in-R-lesson/data/nlcd_agg.grd 
names       : nlcd_2011_landcover_2011_edition_2014_03_31 
values      : 0, 95  (min, max)
attributes  :
        ID      COUNT Red Green Blue Land.Cover.Class Opacity
 from:   0 7854240512   0     0    0     Unclassified       1
 to  : 255          0   0     0    0                        0

The plot method interprets the pixel values of the raster matrix according to a pre-defined color scheme.

plot(nlcd)

plot of chunk show_raster

The crop() function trims a raster object to a given spatial “extent” (or range of x and y values). The extent can be extracted from sp package objects with extent, but must be created “from scratch” for an sfc. Here, we crop the nlcd raster to the extent of the huc_md polygon, then display both layers on the same map.

extent <- matrix(st_bbox(huc_md), nrow=2)
nlcd <- crop(nlcd, extent)
plot(nlcd)
plot(huc_md, col = NA, add = TRUE)

plot of chunk crop_raster

Note that the transformed raster is now loaded in R memory, as indicated by the size of nlcd. We could have also saved the output to disk by specifying an optional filename argument to crop; the same is true for othe raster transformation functions.

A raster is fundamentally a data matrix, and individual pixel values can be extracted by regular matrix subscripting. For example, this returns the value of the bottom-left corner pixel:

nlcd[1, 1]

The meaning of this number is not immediately clear. For this particular dataset, the mapping of values to land cover classes is described in the data attributes:

head(nlcd@data@attributes[[1]])

  ID      COUNT Red     Green Blue Land.Cover.Class Opacity
1  0 7854240512   0 0.0000000    0     Unclassified       1
2  1          0   0 0.9764706    0                        1
3  2          0   0 0.0000000    0                        1
4  3          0   0 0.0000000    0                        1
5  4          0   0 0.0000000    0                        1
6  5          0   0 0.0000000    0                        1

The Land.Cover.Class vector gives string names for the land cover type corresponding to the matrix values. Note that we need to add 1 to the raster value, since these go from 0 to 255, whereas the indexing in R begins at 1.

lc_types <- nlcd@data@attributes[[1]]$Land.Cover.Class

levels(lc_types)

 [1] ""                              "Barren Land"                  
 [3] "Cultivated Crops"              "Deciduous Forest"             
 [5] "Developed, High Intensity"     "Developed, Low Intensity"     
 [7] "Developed, Medium Intensity"   "Developed, Open Space"        
 [9] "Emergent Herbaceuous Wetlands" "Evergreen Forest"             
[11] "Hay/Pasture"                   "Herbaceuous"                  
[13] "Mixed Forest"                  "Open Water"                   
[15] "Perennial Snow/Ice"            "Shrub/Scrub"                  
[17] "Unclassified"                  "Woody Wetlands"

Raster math

A mathematical function called on a raster gets applied to each pixel. For a raster r1, the function log(r1) returns a new raster where each pixel’s value is the log of the corresponding pixel in r1. The addition of r1 + r2 creates a raster where each pixel is the sum of the values from r1 and r2, provided their spatial attributes (e.g. extent, resolution, and projection) are the same.

Logical operations work too: r1 > 5 returns a raster with pixel values TRUE or FALSE and is often used in combination with the mask() function. A pasture raster results from unsetting pixel values where the mask (nlcd == 81) is false (maskvalue = FALSE).

pasture <- mask(nlcd, nlcd == 81, maskvalue = FALSE)
plot(pasture)

plot of chunk mask

To further reduce the resolution of the nlcd raster, the aggregate() function combines values in a block of a given size using a given function.

nlcd_agg <- aggregate(nlcd, fact = 25, fun = modal)
nlcd_agg@legend <- nlcd@legend
plot(nlcd_agg)

plot of chunk agg_raster

Here, fact = 25 means that we are aggregating blocks 25 x 25 pixels and fun = modal indicates that the aggregate value is the mode of the original pixels (averaging would not work since land cover is a categorical variable).

Exercise 3

The function cellStats aggregates accross an entire raster. Use it to figure out the proportion of nlcd pixels that are covered by deciduous forest (value = 41).

View solution

Top of Section

Mixing raster and vectors: prelude

The creation of geospatial tools in R has been a community effort, but not necessarilly a well-organized one. One current stumbling block is that the raster package, which is tightly integrated with the sp package, has not caught up to the sf package.

Presently, to mix raster and vectors, we must convert the desired sfc objects to their counterpart Spatial* objects:

sesync <- as(sesync, "Spatial")
huc_md <- as(huc_md, "Spatial")
counties_md <- as(counties_md, "Spatial")

Mixing rasters and vectors

The extract function allows subsetting and aggregation of raster values based on a vector spatial object.

plot(nlcd)
plot(sesync, col = 'green', pch = 16, cex = 2, add = TRUE)

plot of chunk extract_pt

When extracting by point locations (i.e. a SpatialPoints object), the result is a vector of values corresponding to each point.

sesync_lc <- extract(nlcd, sesync)

lc_types[sesync_lc + 1]

[1] Developed, Medium Intensity
18 Levels:  Barren Land Cultivated Crops ... Woody Wetlands

When extracting with a polygon, the output is a vector of all raster values for pixels falling within that polygon.

county_nlcd <- extract(nlcd_agg, counties_md[1,])

table(county_nlcd)

county_nlcd
11 21 22 23 24 41 
 3  1  4  5  2  1

To get a summary of raster values for each polygon in a SpatialPolygons object, add an aggregation function to extract via the fun argument. For example, fun = modal gives the most common land cover type for each polygon in huc_md.

modal_lc <- extract(nlcd_agg, huc_md, fun = modal)
huc_md$modal_lc <- lc_types[modal_lc + 1]

head(huc_md)

          AREA PERIMETER HUC250K_ HUC250K_ID HUC_CODE
903 6413577966  454290.2      904        916 02050306
915 1982478663  292729.7      916        927 02040205
937 5910074657  503796.5      938        948 02070004
956 3159193443  506765.4      957        968 02060002
966 4580816836  433034.1      967        978 05020006
975 2502118608  252945.8      976        987 02070009
                 HUC_NAME REG  SUB    ACC      CAT         modal_lc
903     Lower Susquehanna  02 0205 020503 02050306 Deciduous Forest
915  Brandywine-Christina  02 0204 020402 02040205      Hay/Pasture
937 Conococheague-Opequon  02 0207 020700 02070004 Deciduous Forest
956     Chester-Sassafras  02 0206 020600 02060002 Cultivated Crops
966          Youghiogheny  05 0502 050200 05020006 Deciduous Forest
975              Monocacy  02 0207 020700 02070009 Cultivated Crops

Top of Section

Links & References

For a more detailed introduction to the raster package, you can consult this vignette in CRAN.

(Book) R.S. Bivand, E.J. Pebesma and V. Gómez-Rubio (2013) Applied Spatial Data Analysis with R. UseR! Series, Springer.

R. Lovelace, J. Cheshire et al., Introduction to visualising spatial data in R. https://cran.r-project.org/doc/contrib/intro-spatial-rl.pdf

F. Rodriguez-Sanchez. Spatial data in R: Using R as a GIS. http://pakillo.github.io/R-GIS-tutorial/

CRAN Task View: Analysis of Spatial Data. https://cran.r-project.org/web/views/Spatial.html

Top of Section

Solutions

Solution 1

plot(counties_md$geometry)
overlay	<- st_within(sesync, counties_md)
counties_sesync <- counties_md[overlay[[1]], 'geometry']
plot(counties_sesync, col = "red", add = TRUE)
plot(sesync, col = 'green', pch = 20, add = TRUE)

Return

Solution 2

bubble_md <- st_buffer(state_md, 5000)
plot(state_md)
plot(bubble_md, lty = 'dotted', add = TRUE)

Return

Solution 3

cellStats(nlcd == 41, "mean")

Return

Top of Section

Geospatial Packages in R

Contents

Objectives for this lesson

Specific achievements

A note on software

Importing vector data

Geometry types

Coordinate Reference System (CRS)

Bounding Box

Grid

Plot Layers

Create geometry

Subsetting vector data

Exercise 1

Coordinate transformations

Geometric operations on vector layers

Exercise 2

Working with raster data

Raster math

Exercise 3

Mixing raster and vectors: prelude

Mixing rasters and vectors

Links & References

Solutions

Solution 1

Solution 2

Solution 3