Effective at SESYNC's closure in December 2022, this page is no longer maintained. The information may be out of date or inaccurate.

Manipulating Tabular Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.


Lesson Objectives

  • Review what makes a dataset tidy.
  • Meet a complete set of functions for most table manipulations.
  • Learn to transform datasets with split-apply-combine procedures.
  • Understand the basic join operation.

Specific Achievements

  • Reshape data frames with tidyr
  • Summarize data by groups with dplyr
  • Combine multiple data frame operations with pipes
  • Combine multiple data frames with “joins”

Data frames occupy a central place in R analysis pipelines. While the base R functions provide most necessary tools to subset, reformat and transform data frames, the specialized packages in this lesson offer friendlier and often computationally faster ways to perform common data frame processing steps. The uniform syntax of the tidyr and dplyr packages also makes scripts more readable and easier to debug. The key functions in both packages have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between R and relational databases.
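As a rough guide, the correspondence with SQL looks like this (approximate; exact syntax varies by dialect):

# Approximate dplyr-to-SQL correspondence (syntax varies by dialect)
# filter()     ~  WHERE
# select()     ~  SELECT
# mutate()     ~  computed columns in a SELECT clause
# group_by()   ~  GROUP BY
# summarize()  ~  aggregate functions (SUM, AVG, COUNT, ...)
# inner_join() ~  INNER JOIN ... ON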



Tidy Concept

R developer Hadley Wickham (author of the tidyr, dplyr and ggplot2 packages, among others) defines tidy datasets (Wickham 2014) as those where:

  • each variable forms a column
  • each observation forms a row
  • each type of observational unit forms a table

These guidelines may be familiar to some of you—they closely map to best practices for “normalization” in database design.

Consider a data set where the outcome of an experiment has been recorded in a perfectly appropriate way:

block  drug  control  placebo
    1  0.22     0.58     0.31
    2  0.12     0.98     0.47
    3  0.42     0.19     0.40

The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:

\[response \sim block + treatment\]

In a tidy format, each row is a complete observation: it includes the response value and all the predictor values. In this data, some of those predictor values are column headers, so the table needs to be reshaped. The tidyr package provides functions to help re-organize tables.

The third principle of tidy data, one table per category of observed entities, becomes especially important in synthesis research. Following this principle requires holding tidy data in multiple tables, with associations between them formalized in metadata, as in a relational database.

Datasets split across multiple tables are unavoidable in synthesis research, and are commonly handled in the following two ways (often in combination):

  • two or more tables are “un-tidied” by joins, i.e. merged into one table
  • statistical models conform to the data model, through a hierarchical structure or “random effects”

The dplyr package includes several functions that all perform variations on table joins needed to “un-tidy” your tables, but there are only two basic types of table relationships to recognize:

  • One-to-one relationships allow tables to be combined based on the same unique identifier (or “primary key”) in both tables.
  • Many-to-one relationships require non-unique “foreign keys” in the first table to match the primary key of the second.



Pivot_longer

The tidyr package can be used to reshape “wide” data frames into “long”, tidy data frames. By making each observation a single row, it “lengthens” the data.

We can use the tidyr pivot_longer function to reshape the drug trial table from above (named trial) into a tidy data frame.
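If you are following along without the handout data, here is a minimal sketch recreating the trial table printed above:

# Recreate the trial table shown above
trial <- data.frame(
  block = 1:3,
  drug = c(0.22, 0.12, 0.42),
  control = c(0.58, 0.98, 0.19),
  placebo = c(0.31, 0.47, 0.40))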

library(tidyr)
tidy_trial <- pivot_longer(trial,
  cols = c(drug, control, placebo),
  names_to = 'treatment',
  values_to = 'response')

All columns, except for “block”, are stacked in two columns: a “name” and a “value”, as specified by the cols argument. The name column gets the name treatment from the names_to argument, and the value column receives the name response from the values_to argument. For each row in the result, the “name” (treatment) is taken from the name of the original column and the “value” (response) from the data in that column. The resulting table is saved as tidy_trial.

> tidy_trial
# A tibble: 9 x 3
  block treatment response
  <int> <chr>        <dbl>
1     1 drug         0.22 
2     1 control      0.580
3     1 placebo      0.31 
4     2 drug         0.12 
5     2 control      0.98 
6     2 placebo      0.47 
7     3 drug         0.42 
8     3 control      0.19 
9     3 placebo      0.4  

Some notes on the syntax: a big advantage of tidyr and dplyr is that each function takes a data frame as its first argument and returns a new data frame. As we will see later, this makes it very easy to chain the functions into a pipeline. The functions also use column names as variables without subsetting them from a data frame (i.e. block instead of trial$block).

Pivot_wider

Data can also fail to be tidy when a table is too long. The Entity-Attribute-Value (EAV) structure common in large databases distributes multiple attributes of a single entity/observation into separate rows.

Remember that the exact state of “tidy” may depend on the analysis: the key is knowing what counts as a complete observation. For example, the community ecology package vegan requires a matrix of species counts, where rows correspond to sites and columns to species. This may seem like too “wide” a format, but in several multivariate analyses, the species composition observed at a single site is considered a complete observation.
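As an illustration, a minimal sketch (with made-up counts) using pivot_wider, introduced below, to build that site-by-species format:

# Hypothetical tidy species counts
counts <- data.frame(
  site = c('A', 'A', 'B', 'B'),
  species = c('oak', 'pine', 'oak', 'pine'),
  n = c(3, 1, 0, 5))
# One row per site, one column per species
community <- pivot_wider(counts,
  names_from = species,
  values_from = n)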

Consider survey data on participants’ age and income stored in an EAV structure.

participant  attr    val
          1  age      24
          2  age      57
          3  age      13
          1  income   30
          2  income   60
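If you are not working from the handout files, a minimal sketch recreating the survey table above:

# Recreate the survey table shown above
survey <- data.frame(
  participant = c(1L, 2L, 3L, 1L, 2L),
  attr = c('age', 'age', 'age', 'income', 'income'),
  val = c(24L, 57L, 13L, 30L, 60L))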

Transform the data with the pivot_wider function, which “reverses” a pivot_longer.

tidy_survey <- pivot_wider(survey,
  names_from = attr,
  values_from = val)

The column names for the resulting data frame tidy_survey are taken from the attr column and the data from the val column.

> tidy_survey
# A tibble: 3 x 3
  participant   age income
        <int> <int>  <int>
1           1    24     30
2           2    57     60
3           3    13     NA

One difficulty with EAV tables is the nature of missing data: an entire row, rather than a single cell, is missing. It is important to evaluate how to handle this “missing data.” By default, missing data is coded as NA.

The values_fill argument allows you to specify values for this missing data.

tidy_survey <- pivot_wider(survey,
  names_from = attr,
  values_from = val,
  values_fill = 0)
> tidy_survey
# A tibble: 3 x 3
  participant   age income
        <int> <int>  <int>
1           1    24     30
2           2    57     60
3           3    13      0



Sample Data


Credit: US Census Bureau

To learn about data transformation with dplyr, we need more data. The Census Bureau collects subnational economic data for the U.S., releasing annual County Business Patterns (CBP) datasets that include the number of establishments, employment, and payroll by industry. The Bureau also conducts the American Community Survey (ACS) and publishes, among other demographic and economic variables, estimates of median income for individuals working in different industries.

These two datasets both contain economic variables for each U.S. county specified by different categories of industry. The data could potentially be manipulated into a single table reflecting the following statistical model.

\[median\_income \sim industry + establishment\_size\]

First, load the CBP data. fread from data.table is faster at reading large data sets than base R read.csv.

library(data.table)
cbp <- fread('data/cbp15co.csv')
> str(cbp)
Classes 'data.table' and 'data.frame':	2126601 obs. of  26 variables:
 $ FIPSTATE: int  1 1 1 1 1 1 1 1 1 1 ...
 $ FIPSCTY : int  1 1 1 1 1 1 1 1 1 1 ...
 $ NAICS   : chr  "------" "11----" "113///" "1133//" ...
 $ EMPFLAG : chr  "" "" "" "" ...
 $ EMP_NF  : chr  "G" "H" "H" "H" ...
 $ EMP     : int  10454 70 70 70 70 70 0 0 0 0 ...
 $ QP1_NF  : chr  "G" "H" "H" "H" ...
 $ QP1     : int  76437 790 790 790 790 790 0 0 0 0 ...
 $ AP_NF   : chr  "G" "H" "H" "H" ...
 $ AP      : int  321433 3566 3551 3551 3551 3551 0 0 0 0 ...
 $ EST     : int  844 7 6 6 6 6 1 1 1 1 ...
 $ N1_4    : int  430 5 4 4 4 4 1 1 1 1 ...
 $ N5_9    : int  171 1 1 1 1 1 0 0 0 0 ...
 $ N10_19  : int  118 0 0 0 0 0 0 0 0 0 ...
 $ N20_49  : int  81 0 0 0 0 0 0 0 0 0 ...
 $ N50_99  : int  35 1 1 1 1 1 0 0 0 0 ...
 $ N100_249: int  6 0 0 0 0 0 0 0 0 0 ...
 $ N250_499: int  2 0 0 0 0 0 0 0 0 0 ...
 $ N500_999: int  1 0 0 0 0 0 0 0 0 0 ...
 $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CENSTATE: int  63 63 63 63 63 63 63 63 63 63 ...
 $ CENCTY  : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

The CBP dataset includes NAICS (North American Industry Classification System) industry codes and the number of businesses or establishments of different employee sizes in each US county. States and counties are identified by Federal Information Processing System (FIPS) codes. See the CBP dataset documentation for an explanation of the variables we don’t discuss in this lesson.

We need to modify the import to clean up this read. The state and county codes should be read in as character values, using the colClasses argument.

cbp <- fread(
  'data/cbp15co.csv',
  colClasses = c(
    FIPSTATE='character',
    FIPSCTY='character'))

Next, load the ACS data. The ACS data we are using in this example includes the median income by industry sector for each US county.

acs <- fread(
  'data/ACS/sector_ACS_15_5YR_S2413.csv',
  colClasses = c(FIPS='character'))
> str(acs)
Classes 'data.table' and 'data.frame':	59698 obs. of  4 variables:
 $ FIPS         : chr  "01001" "01003" "01005" "01007" ...
 $ County       : chr  "Autauga County, Alabama" "Baldwin County, Alabama" "Barbour County, Alabama" "Bibb County, Alabama" ...
 $ Sector       : chr  "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" ...
 $ median_income: int  27235 40017 32260 22240 21260 30469 33300 39784 40417 20370 ...
 - attr(*, ".internal.selfref")=<externalptr> 



dplyr Functions

Function     Returns
filter       keep rows that satisfy conditions
mutate       apply a transformation to existing [split] columns
select       keep columns with matching names
inner_join   merge columns from separate tables into one table
group_by     split data into groups by an existing factor
summarize    summarize across rows [and combine split groups]

The table above summarizes the most commonly used functions in dplyr. We will use dplyr to manipulate the U.S. Census Bureau data frames in order to prepare for data analysis.

Filter

The cbp table includes a character NAICS column for industry codes. NAICS codes can have up to 6 digits; as digits are added, the industry code becomes more specific. Of the roughly 2 million observations, let’s see how many remain when we keep only the 2-digit NAICS codes, which represent high-level, broad sectors of the economy.

We will use the filter command to only include rows where the NAICS code is 2 digits long.

Empty digits are coded as “-”: a 2-digit code is followed by four dashes, while county totals are coded entirely with dashes. We use the grepl command to find rows containing four dashes but not six. The filtered data is saved as cbp2.

library(dplyr)
cbp2 <- filter(cbp,
  grepl('----', NAICS),
  !grepl('------', NAICS))
> str(cbp2)
Classes 'data.table' and 'data.frame':	58901 obs. of  26 variables:
 $ FIPSTATE: chr  "01" "01" "01" "01" ...
 $ FIPSCTY : chr  "001" "001" "001" "001" ...
 $ NAICS   : chr  "11----" "21----" "22----" "23----" ...
 $ EMPFLAG : chr  "" "" "" "" ...
 $ EMP_NF  : chr  "H" "H" "H" "G" ...
 $ EMP     : int  70 82 196 372 971 211 2631 124 73 375 ...
 $ QP1_NF  : chr  "H" "H" "H" "G" ...
 $ QP1     : int  790 713 4793 2891 15386 2034 14905 1229 924 4201 ...
 $ AP_NF   : chr  "H" "H" "H" "G" ...
 $ AP      : int  3566 3294 18611 13801 64263 11071 61502 5128 3407 16328 ...
 $ EST     : int  7 3 9 75 24 29 169 16 9 67 ...
 $ N1_4    : int  5 0 2 51 9 18 68 9 5 41 ...
 $ N5_9    : int  1 1 1 13 4 6 41 4 1 18 ...
 $ N10_19  : int  0 1 2 7 4 2 34 1 1 6 ...
 $ N20_49  : int  0 0 3 4 3 3 11 2 2 2 ...
 $ N50_99  : int  1 1 1 0 3 0 11 0 0 0 ...
 $ N100_249: int  0 0 0 0 0 0 3 0 0 0 ...
 $ N250_499: int  0 0 0 0 0 0 1 0 0 0 ...
 $ N500_999: int  0 0 0 0 1 0 0 0 0 0 ...
 $ N1000   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CENSTATE: int  63 63 63 63 63 63 63 63 63 63 ...
 $ CENCTY  : int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Note that a logical “and” is implied when conditions are separated by commas. (This is perhaps the main way in which filter differs from the base R subset function.) Therefore, the example above is equivalent to filter(cbp, grepl('----', NAICS) & !grepl('------', NAICS)). A logical “or”, on the other hand, must be specified explicitly with the | operator.
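For example, a sketch keeping rows that match either of two top-level sector codes:

# Logical "or" must be written explicitly with |
filter(cbp, NAICS == '11----' | NAICS == '21----')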

Alternatively, the stringr package makes pattern matching by regular expressions a bit more manageable, and streamlines this step.

library(stringr)
cbp2 <- filter(cbp,
  str_detect(NAICS, '[0-9]{2}----'))

This code only includes NAICS codes with exactly two digits followed by four dashes.

Mutate

The mutate function is the dplyr answer to updating or altering your columns. It performs operations on existing columns and appends the result as a new column of the same length.

In the CBP data, FIPS codes are split between state and county columns; the convention, however, is to combine them into one code by concatenating the 2-digit state code and the 3-digit county code.

The mutate command will add a new column FIPS to the cbp2 data frame. Values for the FIPS column are created with str_c from the stringr package, which concatenates the FIPS state and county codes.

cbp3 <- mutate(cbp2,
  FIPS = str_c(FIPSTATE, FIPSCTY))

Multiple arguments to mutate produce multiple transformations.

cbp3 <- mutate(cbp2,
  FIPS = str_c(FIPSTATE, FIPSCTY),
  NAICS = str_remove(NAICS, '-+'))

FIPS is a new column. But mutate can also transform values in place, overwriting an existing column, as done here with NAICS to remove the dashes from the codes in the NAICS column.

Chaining Functions

All the functions from the dplyr package take a data frame as their first argument, and they return a data frame. This consistent syntax is on purpose. It is designed for easily chaining data transformations together: creating a data pipeline that is easy to read and modify.

The “pipe” operator (%>%) takes the expression on its left-hand side and inserts it, as the first argument, into the function on its right-hand side. x %>% function() is equivalent to function(x).

For example, instead of sum(c(1,3,5)), we have:

> c(1, 3, 5) %>% sum()
[1] 9

Additional arguments can be added to the function—the pipe only handles the first argument.

> c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9

The pipe operator’s main utility is to condense a chain of operations applied to the same piece of data, when you don’t want any intermediate results. So instead of function_A(function_B(function_C(x))), pipes allow you to write x %>% function_C() %>% function_B() %>% function_A().
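A tiny concrete illustration of the same equivalence:

# Nested call: sum first, then square root
sqrt(sum(c(1, 3, 5)))
# The same chain written with pipes, reading left to right
c(1, 3, 5) %>% sum() %>% sqrt()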

We can do the filter and mutate operations from above with one assignment.

cbp <- cbp %>%
  filter(
    str_detect(NAICS, '[0-9]{2}----')
  ) %>%
  mutate(
    FIPS = str_c(FIPSTATE, FIPSCTY),
    NAICS = str_remove(NAICS, '-+')
  )

Select

To keep particular columns of a data frame (rather than filtering rows), use the select function with arguments that match column names.

> names(cbp)
 [1] "FIPSTATE" "FIPSCTY"  "NAICS"    "EMPFLAG"  "EMP_NF"   "EMP"     
 [7] "QP1_NF"   "QP1"      "AP_NF"    "AP"       "EST"      "N1_4"    
[13] "N5_9"     "N10_19"   "N20_49"   "N50_99"   "N100_249" "N250_499"
[19] "N500_999" "N1000"    "N1000_1"  "N1000_2"  "N1000_3"  "N1000_4" 
[25] "CENSTATE" "CENCTY"   "FIPS"    

One way to “match” is to list complete names, one for each column you want to keep:

> cbp %>%
+   select(
+     FIPS,
+     NAICS,
+     N1_4, N5_9, N10_19 
+   )

Alternatively, we can use a “select helper” to match patterns.

cbp <- cbp %>%
  select(
    FIPS,
    NAICS,
    starts_with('N')
  )

The cbp data frame now only includes the columns we are interested in for our analysis: the full FIPS county code, the NAICS industry code, and the number of establishments in different employee size classes.



Join

The CBP dataset uses FIPS codes to identify U.S. counties and NAICS codes to identify types of industry. The ACS dataset also uses FIPS codes for counties, but identifies industries by sector rather than by NAICS code. The sectors are broader than the NAICS codes, with multiple NAICS codes mapping to a single industry sector.

To combine the ACS and CBP datasets, we need to relate the NAICS codes to the industry sector.

The data frame sector relates the NAICS codes used in the CBP dataset to the industry sectors used in the ACS dataset.

sector <- fread(
  'data/ACS/sector_naics.csv',
  colClasses = c(NAICS='character'))
> View(sector)

Probably the primary challenge in combining secondary datasets for synthesis research is dealing with their different sampling frames. A very common issue is that data are collected at different “scales”, with one dataset being at higher spatial or temporal resolution than another. The differences between the CBP and ACS categories of industry present a similar problem, and require the same solution of re-aggregating data at the “lower resolution”.

Many-to-One

A dplyr join function can be used to “join” the cbp and sector data frames.

Join commands combine two data frames based on the columns specified by the by = argument. If no columns are given, the default is to join on all columns that have the same names. In our example, cbp and sector will join based on the NAICS column present in both data frames.

There are multiple types of join commands, which differ in how they handle rows that lack a match in the other data frame. To combine cbp and sector we will use inner_join, which only includes rows that have the same NAICS code in both cbp and sector. Information on the other types of join commands can be found in the dplyr join documentation (https://dplyr.tidyverse.org/reference/join.html).
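As an aside, here is a minimal sketch (with hypothetical values) contrasting the two most common join types:

# Toy tables: 'a' has a code with no match in 'b'
a <- data.frame(NAICS = c('11', '99'), EST = c(7, 2))
b <- data.frame(NAICS = '11', Sector = 'agriculture')
inner_join(a, b, by = 'NAICS')  # the unmatched '99' row is dropped
left_join(a, b, by = 'NAICS')   # '99' is kept; Sector is filled with NA

Now join the actual tables: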

cbp <- cbp %>%
  inner_join(sector)
Joining, by = "NAICS"
> View(cbp)

The NAICS field in the cbp table can have the same value multiple times; it is not a primary key in this table. In the sector table, the NAICS field is the primary key, uniquely identifying each record. The type of relationship between these tables is therefore “many-to-one”.

Question
Note that we lost a couple thousand rows through this join. How could cbp have fewer rows after a join on NAICS codes?
Answer
The CBP data contains a NAICS code not mapped to a sector: the “error code” 99 is not present in sector. The use of “error codes” that could easily be mistaken for data is frowned upon.
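One way to check which rows would be lost, a sketch using anti_join, which returns rows of its first table with no match in the second (run this on cbp before the join above, since the inner_join already discarded those rows):

# NAICS codes in the pre-join cbp with no match in sector
cbp %>%
  anti_join(sector, by = 'NAICS') %>%
  distinct(NAICS)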

Group By

A very common data manipulation procedure known as “split-apply-combine” tackles the problem of applying the same transformation to subsets of data while keeping the results together.

Because there were multiple NAICS codes associated with a Sector, there are now multiple rows for a single sector within a county. We need the total number of establishments in each size class aggregated within each county and industry sector.

The dplyr function group_by begins the process by indicating how the data frame should be split into subsets.

cbp_grouped <- cbp %>%
  group_by(FIPS, Sector)

At this point, nothing has really changed:

> str(cbp_grouped)
tibble [56,704 × 16] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ FIPS    : chr [1:56704] "01001" "01001" "01001" "01001" ...
 $ NAICS   : chr [1:56704] "11" "21" "22" "23" ...
 $ N1_4    : int [1:56704] 5 0 2 51 9 18 68 9 5 41 ...
 $ N5_9    : int [1:56704] 1 1 1 13 4 6 41 4 1 18 ...
 $ N10_19  : int [1:56704] 0 1 2 7 4 2 34 1 1 6 ...
 $ N20_49  : int [1:56704] 0 0 3 4 3 3 11 2 2 2 ...
 $ N50_99  : int [1:56704] 1 1 1 0 3 0 11 0 0 0 ...
 $ N100_249: int [1:56704] 0 0 0 0 0 0 3 0 0 0 ...
 $ N250_499: int [1:56704] 0 0 0 0 0 0 1 0 0 0 ...
 $ N500_999: int [1:56704] 0 0 0 0 1 0 0 0 0 0 ...
 $ N1000   : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_1 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_2 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_3 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
 $ N1000_4 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
 $ Sector  : chr [1:56704] "agriculture forestry fishing and hunting" "mining quarrying and oil and gas extraction" "utilities" "construction" ...
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, "groups")= tibble [56,704 × 3] (S3: tbl_df/tbl/data.frame)
  ..$ FIPS  : chr [1:56704] "01001" "01001" "01001" "01001" ...
  ..$ Sector: chr [1:56704] "accommodation and food services" "administrative and support and waste management and remediation services" "agriculture forestry fishing and hunting" "arts entertainment and recreation" ...
  ..$ .rows : list<int> [1:56704] 
  .. ..$ : int 18
  .. ..$ : int 14
  .. ..$ : int 1
  .. .. [list output truncated]
  .. ..@ ptype: int(0) 
  ..- attr(*, ".drop")= logi TRUE

The group_by statement does not change any values in the data frame; it only adds attributes to the original data frame. You can add multiple variables (separated by commas) in group_by; each distinct combination of values across these columns defines a different group.

Summarize

Different operations can be applied to each group. Here, we need to sum the number of establishments in each group.

Using dplyr functions, we group the data by county and sector (group_by), select the columns for number of establishments (select), and sum the number of establishments within each group (summarize_all).

cbp <- cbp %>%
  group_by(FIPS, Sector) %>%
  select(starts_with('N'), -NAICS) %>%
  summarize_all(sum)
Adding missing grouping variables: `FIPS`, `Sector`

The “combine” part of “split-apply-combine” occurs automatically, when the attributes introduced by group_by are dropped. You can see attributes either by running the str() function on the data frame or by inspecting it in the RStudio Environment pane.
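Note that summarize_all has since been superseded in dplyr; a sketch of the equivalent using across(), available in dplyr 1.0 and later:

# Same aggregation with the newer across() idiom
cbp %>%
  group_by(FIPS, Sector) %>%
  select(starts_with('N'), -NAICS) %>%
  summarize(across(everything(), sum))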

There is now a one-to-one relationship between cbp and acs, based on the combination of FIPS and Sector as the primary key for both tables.

To finish creating the dataset for the analysis of interest, we now join the acs data frame. The two data frames are joined by FIPS and Sector.

acs_cbp <- cbp %>%
  inner_join(acs)
Joining, by = c("FIPS", "Sector")

Again, however, the one-to-one relationship does not mean all rows are preserved by the join. An inner_join keeps only rows with matching key values in both tables (duplicating rows when the relationship is many-to-one) and discards the rest.

The acs_cbp table now includes the median_income variable from the ACS and appropriately aggregated establishment size information (the number of establishments by employee bins) from the CBP table.

> View(acs_cbp)



Additional Resources

The following cheat sheets and tutorials repeat much of this lesson, but also provide information on additional functions for “data wrangling”.

The first is one of several cheat sheets created by RStudio, and provides a handy, visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for summarize, “moving window” functions for mutate, etc.



Exercises

Exercise 1

Now that we have a tidy form of survey, convert it to a long_survey data frame using pivot_longer. The only difference between survey and long_survey should be an additional row for zero income.


Exercise 2

Use filter and select to return just the annual payroll data for the top-level construction sector (“23----”), using data from CBP. Annual payroll information is in the columns AP and AP_NF.


Exercise 3

Write code to create a data frame giving, for each state, the number of counties in the CBP survey with establishments in mining or oil and gas extraction (“21----”), along with their total employment (“EMP”).


Exercise 4

A “pivot table” is a transformation of tidy data into a wide summary table. First, data are summarized by two grouping factors, then one of these is “pivoted” into columns. Using only data with a 2-digit NAICS code, chain a split-apply-combine procedure into a “wide” table to get the total number of employees (“EMP”) in each state (as rows) by NAICS code (as columns).

Solutions

Solution 1

long_survey <- pivot_longer(tidy_survey,
  cols = c(age, income),
  names_to = 'attr',
  values_to = 'val')


Solution 2

cbp_23 <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(NAICS == '23----') %>%
  select(starts_with('FIPS'), starts_with('AP'))


Solution 3

cbp_21 <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(NAICS == '21----') %>%
  group_by(FIPSTATE) %>%
  summarize(EMP = sum(EMP), counties = n())
`summarise()` ungrouping output (override with `.groups` argument)


Solution 4

pivot <- fread('data/cbp15co.csv', na.strings = '') %>%
  filter(str_detect(NAICS, '[0-9]{2}----')) %>%
  group_by(FIPSTATE, NAICS) %>%
  summarize(EMP = sum(EMP)) %>%
  pivot_wider(names_from = NAICS, values_from = EMP)
`summarise()` regrouping output by 'FIPSTATE' (override with `.groups` argument)



