Manipulating Tabular Data
Lesson 4 with Kelly Hondula
Lesson Objectives
- Review what makes a dataset tidy.
- Meet a complete set of functions for most table manipulations.
- Learn to transform datasets with split-apply-combine procedures.
- Understand the basic join operation.
Specific Achievements
- Reshape data frames with tidyr
- Summarize data by groups with dplyr
- Combine multiple data frame operations with pipes
- Combine multiple data frames with “joins”
Data frames occupy a central place in R analysis pipelines. While the base R functions provide most necessary tools to subset, reformat and transform data frames, the specialized packages in this lesson offer friendlier and often computationally faster ways to perform common data frame processing steps. The uniform syntax of the tidyr and dplyr packages also makes scripts more readable and easier to debug. The key functions in both packages have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between R and relational databases.
Tidy Concept
R developer Hadley Wickham (author of the tidyr, dplyr and ggplot2 packages, among others) defines tidy datasets (Wickham 2014) as those where:
- each variable forms a column
- each observation forms a row
- each type of observational unit forms a table
These guidelines may be familiar to some of you—they closely map to best practices for “normalization” in database design.
Consider a data set where the outcome of an experiment has been recorded in a perfectly appropriate way:
block | drug | control | placebo |
---|---|---|---|
1 | 0.22 | 0.58 | 0.31 |
2 | 0.12 | 0.98 | 0.47 |
3 | 0.42 | 0.19 | 0.40 |
The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:
In a tidy format, each row is a complete observation: it includes the response value and all the predictor values. In this data, some of those predictor values are column headers, so the table needs to be reshaped. The tidyr package provides functions to help re-organize tables.
The third principle of tidy data, one table per category of observed entities, becomes especially important in synthesis research. Following this principle requires holding tidy data in multiple tables, with associations between them formalized in metadata, as in a relational database.
Datasets split across multiple tables are unavoidable in synthesis research, and commonly used in the following two ways (often in combination):
- two tables are “un-tidied” by joins, or merging them into one table
- statistical models conform to the data model through a hierarchical structure or employing “random effects”
The dplyr package includes several functions that all perform variations on table joins needed to “un-tidy” your tables, but there are only two basic types of table relationships to recognize:
- One-to-one relationships allow tables to be combined based on the same unique identifier (or “primary key”) in both tables.
- Many-to-one relationships require non-unique “foreign keys” in the first table to match the primary key of the second.
Gather
The tidyr package’s gather
function reshapes “wide” data frames
into “long” ones.
library(tidyr)
tidy_trial <- gather(trial,
key = "treatment",
value = "response",
-block)
All columns, accept for “block”, are stacked in two columns: a “key” and a
“value”. The key column gets the name treatment
and the value column reveives
the name response
. For each row in the result, the key is taken from the name
of the column and the value from the data in the column.
> tidy_trial
block treatment response
1 1 drug 0.22
2 2 drug 0.12
3 3 drug 0.42
4 1 control 0.58
5 2 control 0.98
6 3 control 0.19
7 1 placebo 0.31
8 2 placebo 0.47
9 3 placebo 0.40
Some notes on the syntax: a big advantage of tidyr and
dplyr is that each function takes a data frame as its first argument
and returns a new data frame. As we will see later, it makes it very easy to
chain these functions in a pipeline. All functions also use column names as
variables without subsetting them from a data frame (i.e. block
instead of
trial$block
).
Spread
Data can also fail to be tidy when a table is too long. The Entity-Attribute-Value (EAV) structure common in large databases distributes multible attributes of a single entity/observation into separate rows.
Remember that the exact state of “tidy” may depend on the analysis: the key is knowing what counts as a complete observation. For example, the community ecology package vegan requires a matrix of species counts, where rows correspond to species and columns to sites. This may seem like too “wide” a format, but in the packages several multi-variate analyses, the abundance of a species across multiple sites is considered a complete observation.
Consider survey data on participant’s age and income stored in a EAV structure.
participant | attr | val |
---|---|---|
1 | age | 24 |
2 | age | 57 |
3 | age | 13 |
1 | income | 30 |
2 | income | 60 |
Transform the data with the spread
function, which “reverses” a gather
.
tidy_survey <- spread(survey,
key = attr,
value = val)
> tidy_survey
participant age income
1 1 24 30
2 2 57 60
3 3 13 NA
One difficulty with EAV tables is the nature of missing data; an entire row
rather than a single cell is missing. Think about what “missing data” could mean
here—perhaps you can supply a value instead of the NA
in the previous
result.
tidy_survey <- spread(survey,
key = attr,
value = val,
fill = 0)
> tidy_survey
participant age income
1 1 24 30
2 2 57 60
3 3 13 0
Sample Data
To learn about data transformation with dplyr, we need more data. The Census Bureau collects subnational economic data for the U.S., releasing annual County Business Patterns (CBP) datasets including the number of establishments, employment, and payroll by industry. They also conduct the American Community Survey (ACS) and publish, among other demographic and economic variables, estimates of median income for individuals working in different industries.
library(data.table)
cbp <- fread('data/cbp15co.csv')
> str(cbp)
Classes 'data.table' and 'data.frame': 2126601 obs. of 26 variables:
$ FIPSTATE: int 1 1 1 1 1 1 1 1 1 1 ...
$ FIPSCTY : int 1 1 1 1 1 1 1 1 1 1 ...
$ NAICS : chr "------" "11----" "113///" "1133//" ...
$ EMPFLAG : chr "" "" "" "" ...
$ EMP_NF : chr "G" "H" "H" "H" ...
$ EMP : int 10454 70 70 70 70 70 0 0 0 0 ...
$ QP1_NF : chr "G" "H" "H" "H" ...
$ QP1 : int 76437 790 790 790 790 790 0 0 0 0 ...
$ AP_NF : chr "G" "H" "H" "H" ...
$ AP : int 321433 3566 3551 3551 3551 3551 0 0 0 0 ...
$ EST : int 844 7 6 6 6 6 1 1 1 1 ...
$ N1_4 : int 430 5 4 4 4 4 1 1 1 1 ...
$ N5_9 : int 171 1 1 1 1 1 0 0 0 0 ...
$ N10_19 : int 118 0 0 0 0 0 0 0 0 0 ...
$ N20_49 : int 81 0 0 0 0 0 0 0 0 0 ...
$ N50_99 : int 35 1 1 1 1 1 0 0 0 0 ...
$ N100_249: int 6 0 0 0 0 0 0 0 0 0 ...
$ N250_499: int 2 0 0 0 0 0 0 0 0 0 ...
$ N500_999: int 1 0 0 0 0 0 0 0 0 0 ...
$ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ CENSTATE: int 63 63 63 63 63 63 63 63 63 63 ...
$ CENCTY : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
See the CBP dataset documentation for an explanation of the variables we don’t discuss in this lesson.
Modify the import to clean up this read: consider the data type for FIPS codes along with what string in this CSV file represents NAs, a.k.a. data that is not-available or missing.
cbp <- fread(
'data/cbp15co.csv',
na.strings = NULL,
colClasses = c(
FIPSTATE='character',
FIPSCTY='character'))
- Question
- What changed?
- Answer
- Using
str()
shows that the character string""
in the CSV file is no longer read into R as missing data (anNA
) but as an empty string. The two named “FIPS” columns are now correctly read as strings.
acs <- fread(
'data/ACS/sector_ACS_15_5YR_S2413.csv',
colClasses = c(FIPS='character'))
> str(acs)
Classes 'data.table' and 'data.frame': 59698 obs. of 4 variables:
$ FIPS : chr "01001" "01003" "01005" "01007" ...
$ County : chr "Autauga County, Alabama" "Baldwin County, Alabama" "Barbour County, Alabama" "Bibb County, Alabama" ...
$ Sector : chr "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" ...
$ median_income: int 27235 40017 32260 22240 21260 30469 33300 39784 40417 20370 ...
- attr(*, ".internal.selfref")=<externalptr>
The two datasets both contain economic variables for each U.S. county and specified by different categories of industry. The data could potentially be manipulated into a single table reflecting the follow statistical model.
dplyr Functions
Function | Returns |
---|---|
filter |
keep rows that satisfy conditions |
mutate |
apply a transformation to existing [split] columns |
select |
keep columns with matching names |
inner_join |
merge columns from separate tables into one table |
group_by |
split data into groups by an existing factor |
summarize |
summarize across rows [and combine split groups] |
The table above summarizes the most commonly used functions in dplyr, which we will demonstrate in turn on data from the U.S. Census Bureau.
Filter
The cbp
table includes character NAICS
column. Of the 2 million
observations, lets see how many observations are left when we keep only the
2-digit NAICS codes, representing high-level sectors of the economy.
library(dplyr)
cbp2 <- filter(cbp,
grepl('----', NAICS),
!grepl('------', NAICS))
> str(cbp2)
'data.frame': 58901 obs. of 26 variables:
$ FIPSTATE: chr "01" "01" "01" "01" ...
$ FIPSCTY : chr "001" "001" "001" "001" ...
$ NAICS : chr "11----" "21----" "22----" "23----" ...
$ EMPFLAG : chr "" "" "" "" ...
$ EMP_NF : chr "H" "H" "H" "G" ...
$ EMP : int 70 82 196 372 971 211 2631 124 73 375 ...
$ QP1_NF : chr "H" "H" "H" "G" ...
$ QP1 : int 790 713 4793 2891 15386 2034 14905 1229 924 4201 ...
$ AP_NF : chr "H" "H" "H" "G" ...
$ AP : int 3566 3294 18611 13801 64263 11071 61502 5128 3407 16328 ...
$ EST : int 7 3 9 75 24 29 169 16 9 67 ...
$ N1_4 : int 5 0 2 51 9 18 68 9 5 41 ...
$ N5_9 : int 1 1 1 13 4 6 41 4 1 18 ...
$ N10_19 : int 0 1 2 7 4 2 34 1 1 6 ...
$ N20_49 : int 0 0 3 4 3 3 11 2 2 2 ...
$ N50_99 : int 1 1 1 0 3 0 11 0 0 0 ...
$ N100_249: int 0 0 0 0 0 0 3 0 0 0 ...
$ N250_499: int 0 0 0 0 0 0 1 0 0 0 ...
$ N500_999: int 0 0 0 0 1 0 0 0 0 0 ...
$ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ CENSTATE: int 63 63 63 63 63 63 63 63 63 63 ...
$ CENCTY : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Note that a logical “and” is implied when conditions are separated by commas.
(This is perhaps the main way in which filter
differs from the base R subset
function.) Therefore, the example above is equivalent to filter(grepl('----',
NAICS), !grepl('------', NAICS)
. A logical “or”, on the other hand, must be
specified explicitly with the |
operator.
The stringr package makes the use of pattern matching by regular expressions a bit more maneageble, and streamlines this step.
library(stringr)
cbp2 <- filter(cbp,
str_detect(NAICS, '[0-9]{2}----'))
Mutate
The mutate
function is the dplyr answer to updating or altering
your columns. It performs arbitrary operations on existing columns and appends
the result as a new column of the same length.
Here’s one you’ve probably needed before:
cbp3 <- mutate(cbp2,
FIPS = str_c(FIPSTATE, FIPSCTY))
Multiple arguments to mutate
produce multiple transformations.
cbp3 <- mutate(cbp2,
FIPS = str_c(FIPSTATE, FIPSCTY),
NAICS = str_remove(NAICS, '-+'))
Chaining Functions
All the functions from the dplyr package take a data frame as their first argument, and they return a data frame. This consistent syntax is on purpose. It is designed for easily chaining data transformations together: creating a data pipeline that is easy to read and modify.
The “pipe” operator (%>%
) takes the expression on its left-hand side and
inserts it, as the first argument, into the function on its right-hand side.
Equivalent to sum(c(1,3,5))
, for example, we have:
> c(1, 3, 5) %>% sum()
[1] 9
Additional arguments are accepted—the pipe only handles the first.
> c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9
The pipe operator’s main utility is to condense a chain of operations applied to
the same piece of data, when you don’t want any intermediate results. We
can do the filter
and mutate
operations from above with one assignment.
cbp <- cbp %>%
filter(
str_detect(NAICS, '[0-9]{2}----')
) %>%
mutate(
FIPS = str_c(FIPSTATE, FIPSCTY),
NAICS = str_remove(NAICS, '-+')
)
Select
To keep particular columns of a data frame (rather than filtering rows), use
the select
function with arguments that match column names.
> names(cbp)
[1] "FIPSTATE" "FIPSCTY" "NAICS" "EMPFLAG" "EMP_NF" "EMP"
[7] "QP1_NF" "QP1" "AP_NF" "AP" "EST" "N1_4"
[13] "N5_9" "N10_19" "N20_49" "N50_99" "N100_249" "N250_499"
[19] "N500_999" "N1000" "N1000_1" "N1000_2" "N1000_3" "N1000_4"
[25] "CENSTATE" "CENCTY" "FIPS"
One way to “match” is by including complete names, each one you want to keep:
> cbp %>%
+ select(
+ FIPS,
+ NAICS,
+ N1_4, N5_9, N10_19 # a better way?
+ )
Alternatively, we can use a “select helper” to match patterns.
cbp <- cbp %>%
select(
FIPS,
NAICS,
starts_with('N')
)
Join
The CBP dataset uses FIPS to identify U.S. counties and NAICS codes to identify types of industry. The ACS dataset also uses FIPS but their data may aggregate across multiple NAICS codes representing a single industry sector.
sector <- fread(
'data/ACS/sector_naics.csv',
colClasses = c(NAICS='character'))
> View(sector)
Probably the primary challenge in combining secondary datasets for synthesis research is dealing with their different sampling frames. A very common issue is that data are collected at different “scales”, with one dataset being at higher spatial or temporal resolution than another. The differences between the CBP and ACS categories of industry present a similar problem, and require the same solution of re-aggregating data at the “lower resolution”.
Many-to-One
cbp <- cbp %>%
inner_join(sector)
Joining, by = "NAICS"
> View(cbp)
The NAICS field in the cbp
table can have the same value multiple times, it is
not a primary key in this table. In the sector
table, the NAICS field is the
primary key uniquely identifying each record. The type of relationship between
these tables is therefore “many-to-one”.
- Question
- Note that we lost a couple thousand rows through this join. How could
cbp
have fewer rows after a join on NAICS codes? - Answer
- The CBP data contains an NAICS code not mapped to a sector—the
“error code” 99 is not present in
sector
. The use of “error codes” that could easilly be mistaken for data is frowned upon.
Group By
A very common data manipulation procedure know as “split-apply-combine” tackles the problem of applying the same transformation to subsets of data while keeping the result all together. We need the total number of establishments in each size class aggregated within each county and industry sector.
The dplyr function group_by
begins the process by indicating how the data
frame should be split into subsets.
cbp_grouped <- cbp %>%
group_by(FIPS, Sector)
At this point, nothing has really changed:
> str(cbp_grouped)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 56704 obs. of 16 variables:
$ FIPS : chr "01001" "01001" "01001" "01001" ...
$ NAICS : chr "11" "21" "22" "23" ...
$ N1_4 : int 5 0 2 51 9 18 68 9 5 41 ...
$ N5_9 : int 1 1 1 13 4 6 41 4 1 18 ...
$ N10_19 : int 0 1 2 7 4 2 34 1 1 6 ...
$ N20_49 : int 0 0 3 4 3 3 11 2 2 2 ...
$ N50_99 : int 1 1 1 0 3 0 11 0 0 0 ...
$ N100_249: int 0 0 0 0 0 0 3 0 0 0 ...
$ N250_499: int 0 0 0 0 0 0 1 0 0 0 ...
$ N500_999: int 0 0 0 0 1 0 0 0 0 0 ...
$ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Sector : chr "agriculture forestry fishing and hunting" "mining quarrying and oil and gas extraction" "utilities" "construction" ...
- attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame': 56704 obs. of 3 variables:
..$ FIPS : chr "01001" "01001" "01001" "01001" ...
..$ Sector: chr "accommodation and food services" "administrative and support and waste management and remediation services" "agriculture forestry fishing and hunting" "arts entertainment and recreation" ...
..$ .rows :List of 56704
.. ..$ : int 18
.. ..$ : int 14
.. ..$ : int 1
.. ..$ : int 17
.. ..$ : int 4
.. ..$ : int 15
.. ..$ : int 10
.. ..$ : int 16
.. ..$ : int 9
.. ..$ : int 13
.. ..$ : int 5
.. ..$ : int 2
.. ..$ : int 19
.. ..$ : int 12
.. ..$ : int 11
.. ..$ : int 7
.. ..$ : int 8
.. ..$ : int 3
.. ..$ : int 6
.. ..$ : int 37
.. ..$ : int 33
.. ..$ : int 20
.. ..$ : int 36
.. ..$ : int 23
.. ..$ : int 34
.. ..$ : int 29
.. ..$ : int 35
.. ..$ : int 28
.. ..$ : int 32
.. ..$ : int 24
.. ..$ : int 21
.. ..$ : int 38
.. ..$ : int 31
.. ..$ : int 30
.. ..$ : int 26
.. ..$ : int 27
.. ..$ : int 22
.. ..$ : int 25
.. ..$ : int 55
.. ..$ : int 51
.. ..$ : int 39
.. ..$ : int 54
.. ..$ : int 42
.. ..$ : int 52
.. ..$ : int 48
.. ..$ : int 53
.. ..$ : int 47
.. ..$ : int 43
.. ..$ : int 40
.. ..$ : int 56
.. ..$ : int 50
.. ..$ : int 49
.. ..$ : int 45
.. ..$ : int 46
.. ..$ : int 41
.. ..$ : int 44
.. ..$ : int 74
.. ..$ : int 70
.. ..$ : int 57
.. ..$ : int 73
.. ..$ : int 60
.. ..$ : int 71
.. ..$ : int 66
.. ..$ : int 72
.. ..$ : int 65
.. ..$ : int 69
.. ..$ : int 61
.. ..$ : int 58
.. ..$ : int 75
.. ..$ : int 68
.. ..$ : int 67
.. ..$ : int 63
.. ..$ : int 64
.. ..$ : int 59
.. ..$ : int 62
.. ..$ : int 93
.. ..$ : int 89
.. ..$ : int 76
.. ..$ : int 92
.. ..$ : int 79
.. ..$ : int 90
.. ..$ : int 85
.. ..$ : int 91
.. ..$ : int 84
.. ..$ : int 88
.. ..$ : int 80
.. ..$ : int 77
.. ..$ : int 94
.. ..$ : int 87
.. ..$ : int 86
.. ..$ : int 82
.. ..$ : int 83
.. ..$ : int 78
.. ..$ : int 81
.. ..$ : int 111
.. ..$ : int 107
.. ..$ : int 95
.. ..$ : int 110
.. ..$ : int 97
.. .. [list output truncated]
..- attr(*, ".drop")= logi TRUE
The group_by
statement does not change any values in the data frame; it only
adds attributes to the the original data frame. You can add multiple variables
(separated by commas) in group_by
; each distinct combination of values across
these columns defines a different group.
Summarize
The operation to perform on each group is summing: we need to sum the number of establishments in each group. Using dplyr functions, the summaries are automically combined into a data frame.
cbp <- cbp %>%
group_by(FIPS, Sector) %>%
select(starts_with('N'), -NAICS) %>%
summarize_all(sum)
Adding missing grouping variables: `FIPS`, `Sector`
The “combine” part of “split-apply-combine” occurs automatically, when the
attributes introduced by group_by
are dropped. You can see attributes either
by running the str()
function on the data frame or by inspecting it in the
RStudio Environment pane.
There is now a one-to-one relationship between cbp
and acs
, based on
the combination of FIPS and Sector as the primary key for both tables.
acs_cbp <- cbp %>%
inner_join(acs)
Joining, by = c("FIPS", "Sector")
Again, however, the one-to-one relationship does not mean all rows are preserved
by the join. The specific nature of the inner_join
is to keep all rows, even
duplicating rows if the relationship is many-to-one, where there are matching
values in both tables, and discarding the rest.
The acs_cbp
table now includes the median_income
variable from the ACS and
appropriatey aggregated establishment size information (the number of
establishments by employee bins) from the CBP table.
> View(acs_cbp)
Additional Resources
The following cheat sheets and tutorials repeat much of this lesson, but also provide information on additional functions for “data wrangling”.
The first is one of several cheat sheets created by RStudio, and provides a handy, visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for summarize, “moving window” functions for mutate, etc.
Exercises
Exercise 1
Now that we have a tidy form of survey
, convert it to a long_survey
data
frame using gather
. The only difference between survey
and long_survey
should be an additional row for zero income.
Exercise 2
Use filter
and select
to return just the annual payroll data for the top
level construction sector (“23—-“).
Exercise 3
Write code to create a data frame giving, for each state, the number of counties
in the CBP survey with establishements in mining or oil and gas extraction
(‘21—-‘) along with their total employment (“EMP”). Group the data using
both FIPSTATE
and FIPSCTY
and use the fact that one call to summarize
only combines across the lowest level of grouping. The dplyr
function n
counts rows in a group.
Exercise 4
A “pivot table” is a transformation of tidy data into a wide summary table.
First, data are summarized by two grouping factors, then one of these is
“pivoted” into columns. Starting from a filtered CBP data file, chain a
split-apply-combine procedure into the tidyr function spread
to
get the total number of employees (“EMP”) in each state (as rows) by 2-digit
NAICS code (as columns).
Solutions
Solution 1
gather(tidy_survey, key = "attr",
value = "val", -participant)
participant attr val
1 1 age 24
2 2 age 57
3 3 age 13
4 1 income 30
5 2 income 60
6 3 income 0
Solution 2
cbp_23 <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(NAICS == '23----') %>%
select(starts_with('FIPS'), starts_with('AP'))
Solution 3
cbp_21 <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(NAICS == '21----') %>%
group_by(FIPSTATE, FIPSCTY) %>%
summarize(EMP = sum(EMP)) %>%
summarize(EMP = sum(EMP), counties = n())
Solution 4
pivot <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(str_detect(NAICS, '[0-9]{2}----')) %>%
group_by(FIPSTATE, NAICS) %>%
summarize(EMP = sum(EMP)) %>%
spread(key = NAICS, value = EMP)
If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.
# Nothing here yet!