Manipulating Tabular Data

Lesson 4 with Kelly Hondula


Lesson Objectives

Specific Achievements

Data frames occupy a central place in R analysis pipelines. While the base R functions provide most necessary tools to subset, reformat and transform data frames, the specialized packages in this lesson offer friendlier and often computationally faster ways to perform common data frame processing steps. The uniform syntax of the tidyr and dplyr packages also makes scripts more readable and easier to debug. The key functions in both packages have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between R and relational databases.

Top of Section


Tidy Concept

R developer Hadley Wickham (author of the tidyr, dplyr and ggplot2 packages, among others) defines tidy datasets (Wickham 2014) as those where:

These guidelines may be familiar to some of you—they closely map to best practices for “normalization” in database design.

Consider a data set where the outcome of an experiment has been recorded in a perfectly appropriate way:

block drug control placebo
1 0.22 0.58 0.31
2 0.12 0.98 0.47
3 0.42 0.19 0.40

The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:

In a tidy format, each row is a complete observation: it includes the response value and all the predictor values. In this data, some of those predictor values are column headers, so the table needs to be reshaped. The tidyr package provides functions to help re-organize tables.

The third principle of tidy data, one table per category of observed entities, becomes especially important in synthesis research. Following this principle requires holding tidy data in multiple tables, with associations between them formalized in metadata, as in a relational database.

Datasets split across multiple tables are unavoidable in synthesis research, and commonly used in the following two ways (often in combination):

The dplyr package includes several functions that all perform variations on table joins needed to “un-tidy” your tables, but there are only two basic types of table relationships to recognize:

Top of Section


Gather

The tidyr package’s gather function reshapes “wide” data frames into “long” ones.

library(tidyr)
tidy_trial <- gather(trial,
  key = "treatment",
  value = "response",
  -block)

All columns, accept for “block”, are stacked in two columns: a “key” and a “value”. The key column gets the name treatment and the value column reveives the name response. For each row in the result, the key is taken from the name of the column and the value from the data in the column.

> tidy_trial
  block treatment response
1     1      drug     0.22
2     2      drug     0.12
3     3      drug     0.42
4     1   control     0.58
5     2   control     0.98
6     3   control     0.19
7     1   placebo     0.31
8     2   placebo     0.47
9     3   placebo     0.40

Some notes on the syntax: a big advantage of tidyr and dplyr is that each function takes a data frame as its first argument and returns a new data frame. As we will see later, it makes it very easy to chain these functions in a pipeline. All functions also use column names as variables without subsetting them from a data frame (i.e. block instead of trial$block).

Spread

Data can also fail to be tidy when a table is too long. The Entity-Attribute-Value (EAV) structure common in large databases distributes multible attributes of a single entity/observation into separate rows.

Remember that the exact state of “tidy” may depend on the analysis: the key is knowing what counts as a complete observation. For example, the community ecology package vegan requires a matrix of species counts, where rows correspond to species and columns to sites. This may seem like too “wide” a format, but in the packages several multi-variate analyses, the abundance of a species across multiple sites is considered a complete observation.

Consider survey data on participant’s age and income stored in a EAV structure.

participant attr val
1 age 24
2 age 57
3 age 13
1 income 30
2 income 60

Transform the data with the spread function, which “reverses” a gather.

tidy_survey <- spread(survey,
  key = attr,
  value = val)
> tidy_survey
  participant     age  income
1           1      24      30
2           2      57      60
3           3      13      NA

One difficulty with EAV tables is the nature of missing data; an entire row rather than a single cell is missing. Think about what “missing data” could mean here—perhaps you can supply a value instead of the NA in the previous result.

tidy_survey <- spread(survey,
  key = attr,
  value = val,
  fill = 0)
> tidy_survey
  participant     age  income
1           1      24      30
2           2      57      60
3           3      13       0

Top of Section


Sample Data


Credit: US Census Bureau

To learn about data transformation with dplyr, we need more data. The Census Bureau collects subnational economic data for the U.S., releasing annual County Business Patterns (CBP) datasets including the number of establishments, employment, and payroll by industry. They also conduct the American Community Survey (ACS) and publish, among other demographic and economic variables, estimates of median income for individuals working in different industries.

library(data.table)
cbp <- fread('data/cbp15co.csv')
> str(cbp)
Classes 'data.table' and 'data.frame':	2126601 obs. of  26 variables:
 $ FIPSTATE: int  1 1 1 1 1 1 1 1 1 1 ...
 $ FIPSCTY : int  1 1 1 1 1 1 1 1 1 1 ...
 $ NAICS   : chr  "------" "11----" "113///" "1133//" ...
 $ EMPFLAG : chr  "" "" "" "" ...
 $ EMP_NF  : chr  "G" "H" "H" "H" ...
 $ EMP     : int  10454 70 70 70 70 70 0 0 0 0 ...
 $ QP1_NF  : chr  "G" "H" "H" "H" ...
 $ QP1     : int  76437 790 790 790 790 790 0 0 0 0 ...
 $ AP_NF   : chr  "G" "H" "H" "H" ...
 $ AP      : int  321433 3566 3551 3551 3551 3551 0 0 0 0 ...
 $ EST     : int  844 7 6 6 6 6 1 1 1 1 ...
 $ N1_4    : int  430 5 4 4 4 4 1 1 1 1 ...
 $ N5_9    : int  171 1 1 1 1 1 0 0 0 0 ...
 $ N10_19  : int  118 0 0 0 0 0 0 0 0 0 ...
 $ N20_49  : int  81 0 0 0 0 0 0 0 0 0 ...
 $ N50_99  : int  35 1 1 1 1 1 0 0 0 0 ...
 $ N100_249: int  6 0 0 0 0 0 0 0 0 0 ...
 $ N250_499: int  2 0 0 0 0 0 0 0 0 0 ...
 $ N500_999: int  1 0 0 0 0 0 0 0 0 0 ...
 $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CENSTATE: int  63 63 63 63 63 63 63 63 63 63 ...
 $ CENCTY  : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

See the CBP dataset documentation for an explanation of the variables we don’t discuss in this lesson.

Modify the import to clean up this read: consider the data type for FIPS codes along with what string in this CSV file represents NAs, a.k.a. data that is not-available or missing.

cbp <- fread(
  'data/cbp15co.csv',
  na.strings = NULL,
  colClasses = c(
    FIPSTATE='character',
    FIPSCTY='character'))
Question
What changed?
Answer
Using str() shows that the character string "" in the CSV file is no longer read into R as missing data (an NA) but as an empty string. The two named “FIPS” columns are now correctly read as strings.
acs <- fread(
  'data/ACS/sector_ACS_15_5YR_S2413.csv',
  colClasses = c(FIPS='character'))
> str(acs)
Classes 'data.table' and 'data.frame':	59698 obs. of  4 variables:
 $ FIPS         : chr  "01001" "01003" "01005" "01007" ...
 $ County       : chr  "Autauga County, Alabama" "Baldwin County, Alabama" "Barbour County, Alabama" "Bibb County, Alabama" ...
 $ Sector       : chr  "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" ...
 $ median_income: int  27235 40017 32260 22240 21260 30469 33300 39784 40417 20370 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The two datasets both contain economic variables for each U.S. county and specified by different categories of industry. The data could potentially be manipulated into a single table reflecting the follow statistical model.

Top of Section


dplyr Functions

Function Returns
filter keep rows that satisfy conditions
mutate apply a transformation to existing [split] columns
select keep columns with matching names
inner_join merge columns from separate tables into one table
group_by split data into groups by an existing factor
summarize summarize across rows [and combine split groups]

The table above summarizes the most commonly used functions in dplyr, which we will demonstrate in turn on data from the U.S. Census Bureau.

Filter

The cbp table includes character NAICS column. Of the 2 million observations, lets see how many observations are left when we keep only the 2-digit NAICS codes, representing high-level sectors of the economy.

library(dplyr)
cbp2 <- filter(cbp,
  grepl('----', NAICS),
  !grepl('------', NAICS))
> str(cbp2)
'data.frame':	58901 obs. of  26 variables:
 $ FIPSTATE: chr  "01" "01" "01" "01" ...
 $ FIPSCTY : chr  "001" "001" "001" "001" ...
 $ NAICS   : chr  "11----" "21----" "22----" "23----" ...
 $ EMPFLAG : chr  "" "" "" "" ...
 $ EMP_NF  : chr  "H" "H" "H" "G" ...
 $ EMP     : int  70 82 196 372 971 211 2631 124 73 375 ...
 $ QP1_NF  : chr  "H" "H" "H" "G" ...
 $ QP1     : int  790 713 4793 2891 15386 2034 14905 1229 924 4201 ...
 $ AP_NF   : chr  "H" "H" "H" "G" ...
 $ AP      : int  3566 3294 18611 13801 64263 11071 61502 5128 3407 16328 ...
 $ EST     : int  7 3 9 75 24 29 169 16 9 67 ...
 $ N1_4    : int  5 0 2 51 9 18 68 9 5 41 ...
 $ N5_9    : int  1 1 1 13 4 6 41 4 1 18 ...
 $ N10_19  : int  0 1 2 7 4 2 34 1 1 6 ...
 $ N20_49  : int  0 0 3 4 3 3 11 2 2 2 ...
 $ N50_99  : int  1 1 1 0 3 0 11 0 0 0 ...
 $ N100_249: int  0 0 0 0 0 0 3 0 0 0 ...
 $ N250_499: int  0 0 0 0 0 0 1 0 0 0 ...
 $ N500_999: int  0 0 0 0 1 0 0 0 0 0 ...
 $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CENSTATE: int  63 63 63 63 63 63 63 63 63 63 ...
 $ CENCTY  : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Note that a logical “and” is implied when conditions are separated by commas. (This is perhaps the main way in which filter differs from the base R subset function.) Therefore, the example above is equivalent to filter(grepl('----', NAICS), !grepl('------', NAICS). A logical “or”, on the other hand, must be specified explicitly with the | operator.

The stringr package makes the use of pattern matching by regular expressions a bit more maneageble, and streamlines this step.

library(stringr)
cbp2 <- filter(cbp,
  str_detect(NAICS, '[0-9]{2}----'))

Mutate

The mutate function is the dplyr answer to updating or altering your columns. It performs arbitrary operations on existing columns and appends the result as a new column of the same length.

Here’s one you’ve probably needed before:

cbp3 <- mutate(cbp2,
  FIPS = str_c(FIPSTATE, FIPSCTY))

Multiple arguments to mutate produce multiple transformations.

cbp3 <- mutate(cbp2,
  FIPS = str_c(FIPSTATE, FIPSCTY),
  NAICS = str_remove(NAICS, '-+'))

Chaining Functions

All the functions from the dplyr package take a data frame as their first argument, and they return a data frame. This consistent syntax is on purpose. It is designed for easily chaining data transformations together: creating a data pipeline that is easy to read and modify.

The “pipe” operator (%>%) takes the expression on its left-hand side and inserts it, as the first argument, into the function on its right-hand side. Equivalent to sum(c(1,3,5)), for example, we have:

> c(1, 3, 5) %>% sum()
[1] 9

Additional arguments are accepted—the pipe only handles the first.

> c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9

The pipe operator’s main utility is to condense a chain of operations applied to the same piece of data, when you don’t want any intermediate results. We can do the filter and mutate operations from above with one assignment.

cbp <- cbp %>%
  filter(
    str_detect(NAICS, '[0-9]{2}----')
  ) %>%
  mutate(
    FIPS = str_c(FIPSTATE, FIPSCTY),
    NAICS = str_remove(NAICS, '-+')
  )

Select

To keep particular columns of a data frame (rather than filtering rows), use the select function with arguments that match column names.

> names(cbp)
 [1] "FIPSTATE" "FIPSCTY"  "NAICS"    "EMPFLAG"  "EMP_NF"   "EMP"     
 [7] "QP1_NF"   "QP1"      "AP_NF"    "AP"       "EST"      "N1_4"    
[13] "N5_9"     "N10_19"   "N20_49"   "N50_99"   "N100_249" "N250_499"
[19] "N500_999" "N1000"    "N1000_1"  "N1000_2"  "N1000_3"  "N1000_4" 
[25] "CENSTATE" "CENCTY"   "FIPS"    

One way to “match” is by including complete names, each one you want to keep:

> cbp %>%
+   select(
+     FIPS,
+     NAICS,
+     N1_4, N5_9, N10_19 # a better way?
+   )

Alternatively, we can use a “select helper” to match patterns.

cbp <- cbp %>%
  select(
    FIPS,
    NAICS,
    starts_with('N')
  )

Top of Section


Join

The CBP dataset uses FIPS to identify U.S. counties and NAICS codes to identify types of industry. The ACS dataset also uses FIPS but their data may aggregate across multiple NAICS codes representing a single industry sector.

sector <- fread(
  'data/ACS/sector_naics.csv',
  colClasses = c(NAICS='character'))
> View(sector)

Probably the primary challenge in combining secondary datasets for synthesis research is dealing with their different sampling frames. A very common issue is that data are collected at different “scales”, with one dataset being at higher spatial or temporal resolution than another. The differences between the CBP and ACS categories of industry present a similar problem, and require the same solution of re-aggregating data at the “lower resolution”.

Many-to-One

cbp <- cbp %>%
  inner_join(sector)
Joining, by = "NAICS"
> View(cbp)

The NAICS field in the cbp table can have the same value multiple times, it is not a primary key in this table. In the sector table, the NAICS field is the primary key uniquely identifying each record. The type of relationship between these tables is therefore “many-to-one”.

Question
Note that we lost a couple thousand rows through this join. How could cbp have fewer rows after a join on NAICS codes?
Answer
The CBP data contains an NAICS code not mapped to a sector—the “error code” 99 is not present in sector. The use of “error codes” that could easilly be mistaken for data is frowned upon.

Group By

A very common data manipulation procedure know as “split-apply-combine” tackles the problem of applying the same transformation to subsets of data while keeping the result all together. We need the total number of establishments in each size class aggregated within each county and industry sector.

The dplyr function group_by begins the process by indicating how the data frame should be split into subsets.

cbp_grouped <- cbp %>%
  group_by(FIPS, Sector)

At this point, nothing has really changed:

> str(cbp_grouped)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	56704 obs. of  16 variables:
 $ FIPS    : chr  "01001" "01001" "01001" "01001" ...
 $ NAICS   : chr  "11" "21" "22" "23" ...
 $ N1_4    : int  5 0 2 51 9 18 68 9 5 41 ...
 $ N5_9    : int  1 1 1 13 4 6 41 4 1 18 ...
 $ N10_19  : int  0 1 2 7 4 2 34 1 1 6 ...
 $ N20_49  : int  0 0 3 4 3 3 11 2 2 2 ...
 $ N50_99  : int  1 1 1 0 3 0 11 0 0 0 ...
 $ N100_249: int  0 0 0 0 0 0 3 0 0 0 ...
 $ N250_499: int  0 0 0 0 0 0 1 0 0 0 ...
 $ N500_999: int  0 0 0 0 1 0 0 0 0 0 ...
 $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Sector  : chr  "agriculture forestry fishing and hunting" "mining quarrying and oil and gas extraction" "utilities" "construction" ...
 - attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame':	56704 obs. of  3 variables:
  ..$ FIPS  : chr  "01001" "01001" "01001" "01001" ...
  ..$ Sector: chr  "accommodation and food services" "administrative and support and waste management and remediation services" "agriculture forestry fishing and hunting" "arts entertainment and recreation" ...
  ..$ .rows :List of 56704
  .. ..$ : int 18
  .. ..$ : int 14
  .. ..$ : int 1
  .. ..$ : int 17
  .. ..$ : int 4
  .. ..$ : int 15
  .. ..$ : int 10
  .. ..$ : int 16
  .. ..$ : int 9
  .. ..$ : int 13
  .. ..$ : int 5
  .. ..$ : int 2
  .. ..$ : int 19
  .. ..$ : int 12
  .. ..$ : int 11
  .. ..$ : int 7
  .. ..$ : int 8
  .. ..$ : int 3
  .. ..$ : int 6
  .. ..$ : int 37
  .. ..$ : int 33
  .. ..$ : int 20
  .. ..$ : int 36
  .. ..$ : int 23
  .. ..$ : int 34
  .. ..$ : int 29
  .. ..$ : int 35
  .. ..$ : int 28
  .. ..$ : int 32
  .. ..$ : int 24
  .. ..$ : int 21
  .. ..$ : int 38
  .. ..$ : int 31
  .. ..$ : int 30
  .. ..$ : int 26
  .. ..$ : int 27
  .. ..$ : int 22
  .. ..$ : int 25
  .. ..$ : int 55
  .. ..$ : int 51
  .. ..$ : int 39
  .. ..$ : int 54
  .. ..$ : int 42
  .. ..$ : int 52
  .. ..$ : int 48
  .. ..$ : int 53
  .. ..$ : int 47
  .. ..$ : int 43
  .. ..$ : int 40
  .. ..$ : int 56
  .. ..$ : int 50
  .. ..$ : int 49
  .. ..$ : int 45
  .. ..$ : int 46
  .. ..$ : int 41
  .. ..$ : int 44
  .. ..$ : int 74
  .. ..$ : int 70
  .. ..$ : int 57
  .. ..$ : int 73
  .. ..$ : int 60
  .. ..$ : int 71
  .. ..$ : int 66
  .. ..$ : int 72
  .. ..$ : int 65
  .. ..$ : int 69
  .. ..$ : int 61
  .. ..$ : int 58
  .. ..$ : int 75
  .. ..$ : int 68
  .. ..$ : int 67
  .. ..$ : int 63
  .. ..$ : int 64
  .. ..$ : int 59
  .. ..$ : int 62
  .. ..$ : int 93
  .. ..$ : int 89
  .. ..$ : int 76
  .. ..$ : int 92
  .. ..$ : int 79
  .. ..$ : int 90
  .. ..$ : int 85
  .. ..$ : int 91
  .. ..$ : int 84
  .. ..$ : int 88
  .. ..$ : int 80
  .. ..$ : int 77
  .. ..$ : int 94
  .. ..$ : int 87
  .. ..$ : int 86
  .. ..$ : int 82
  .. ..$ : int 83
  .. ..$ : int 78
  .. ..$ : int 81
  .. ..$ : int 111
  .. ..$ : int 107
  .. ..$ : int 95
  .. ..$ : int 110
  .. ..$ : int 97
  .. .. [list output truncated]
  ..- attr(*, ".drop")= logi TRUE

The group_by statement does not change any values in the data frame; it only adds attributes to the the original data frame. You can add multiple variables (separated by commas) in group_by; each distinct combination of values across these columns defines a different group.

Summarize

The operation to perform on each group is summing: we need to sum the number of establishments in each group. Using dplyr functions, the summaries are automically combined into a data frame.

cbp <- cbp %>%
  group_by(FIPS, Sector) %>%
  select(starts_with('N'), -NAICS) %>%
  summarize_all(sum)
Adding missing grouping variables: `FIPS`, `Sector`

The “combine” part of “split-apply-combine” occurs automatically, when the attributes introduced by group_by are dropped. You can see attributes either by running the str() function on the data frame or by inspecting it in the RStudio Environment pane.

There is now a one-to-one relationship between cbp and acs, based on the combination of FIPS and Sector as the primary key for both tables.

acs_cbp <- cbp %>%
  inner_join(acs)
Joining, by = c("FIPS", "Sector")

Again, however, the one-to-one relationship does not mean all rows are preserved by the join. The specific nature of the inner_join is to keep all rows, even duplicating rows if the relationship is many-to-one, where there are matching values in both tables, and discarding the rest.

The acs_cbp table now includes the median_income variable from the ACS and appropriatey aggregated establishment size information (the number of establishments by employee bins) from the CBP table.

> View(acs_cbp)

Top of Section


Additional Resources

The following cheat sheets and tutorials repeat much of this lesson, but also provide information on additional functions for “data wrangling”.

The first is one of several cheat sheets created by RStudio, and provides a handy, visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for summarize, “moving window” functions for mutate, etc.

Top of Section


Exercises

Exercise 1

Now that we have a tidy form of survey, convert it to a long_survey data frame using gather. The only difference between survey and long_survey should be an additional row for zero income.

View solution

Exercise 2

Use filter and select to return just the annual payroll data for the top level construction sector (“23—-“).

View solution

Exercise 3

Write code to create a data frame giving, for each state, the number of counties in the CBP survey with establishements in mining or oil and gas extraction (‘21—-‘) along with their total employment (“EMP”). Group the data using both FIPSTATE and FIPSCTY and use the fact that one call to summarize only combines across the lowest level of grouping. The dplyr function n counts rows in a group.

View solution

Exercise 4

A “pivot table” is a transformation of tidy data into a wide summary table. First, data are summarized by two grouping factors, then one of these is “pivoted” into columns. Starting from a filtered CBP data file, chain a split-apply-combine procedure into the tidyr function spread to get the total number of employees (“EMP”) in each state (as rows) by 2-digit NAICS code (as columns).

Solutions

Solution 1

gather(tidy_survey, key = "attr",
  value = "val", -participant)
  participant    attr val
1           1     age  24
2           2     age  57
3           3     age  13
4           1  income  30
5           2  income  60
6           3  income   0

Return

Solution 2

cbp_23 <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(NAICS == '23----') %>%
  select(starts_with('FIPS'), starts_with('AP'))

Return

Solution 3

cbp_21 <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(NAICS == '21----') %>%
  group_by(FIPSTATE, FIPSCTY) %>%
  summarize(EMP = sum(EMP)) %>%
  summarize(EMP = sum(EMP), counties = n())

Return

Solution 4

pivot <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(str_detect(NAICS, '[0-9]{2}----')) %>%
  group_by(FIPSTATE, NAICS) %>%
  summarize(EMP = sum(EMP)) %>%
  spread(key = NAICS, value = EMP)

Return

Top of Section


If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!