Manipulating Tabular Data
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
Lesson Objectives
- Review what makes a dataset tidy.
- Meet a complete set of functions for most table manipulations.
- Learn to transform datasets with split-apply-combine procedures.
- Understand the basic join operation.
Specific Achievements
- Reshape data frames with tidyr
- Summarize data by groups with dplyr
- Combine multiple data frame operations with pipes
- Combine multiple data frames with “joins”
Data frames occupy a central place in R analysis pipelines. While base R functions provide most of the necessary tools to subset, reformat, and transform data frames, the specialized packages in this lesson offer friendlier and often computationally faster ways to perform common data frame processing steps. The uniform syntax of the tidyr and dplyr packages also makes scripts more readable and easier to debug. The key functions in both packages have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between R and relational databases.
Tidy Concept
R developer Hadley Wickham (author of the tidyr, dplyr and ggplot2 packages, among others) defines tidy datasets (Wickham 2014) as those where:
- each variable forms a column
- each observation forms a row
- each type of observational unit forms a table
These guidelines may be familiar to some of you—they closely map to best practices for “normalization” in database design.
Consider a data set where the outcome of an experiment has been recorded in a perfectly appropriate way:
block | drug | control | placebo |
---|---|---|---|
1 | 0.22 | 0.58 | 0.31 |
2 | 0.12 | 0.98 | 0.47 |
3 | 0.42 | 0.19 | 0.40 |
The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:

\[response \sim block + treatment\]

In a tidy format, each row is a complete observation: it includes the response value and all the predictor values. In this data, some of those predictor values are column headers, so the table needs to be reshaped. The tidyr package provides functions to help re-organize tables.
The third principle of tidy data, one table per category of observed entities, becomes especially important in synthesis research. Following this principle requires holding tidy data in multiple tables, with associations between them formalized in metadata, as in a relational database.
Datasets split across multiple tables are unavoidable in synthesis research, and they are commonly used in the following two ways (often in combination):
- two tables are “un-tidied” by joins, i.e. merged into one table
- statistical models conform to the data model through a hierarchical structure or by employing “random effects”
The dplyr package includes several functions that all perform variations on table joins needed to “un-tidy” your tables, but there are only two basic types of table relationships to recognize:
- One-to-one relationships allow tables to be combined based on the same unique identifier (or “primary key”) in both tables.
- Many-to-one relationships require non-unique “foreign keys” in the first table to match the primary key of the second.
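To make these two relationship types concrete, here is a toy sketch using hypothetical `sites` and `zones` tables (not part of the lesson data), in which several sites share one climate-zone record:

library(dplyr)
# 'zone' is a non-unique foreign key in sites, and the primary key in zones
sites <- data.frame(site = c('A', 'B', 'C'), zone = c(1, 1, 2))
zones <- data.frame(zone = c(1, 2), climate = c('wet', 'dry'))
# many-to-one join: each sites row matches exactly one zones row
inner_join(sites, zones, by = 'zone')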
Pivot_longer
The tidyr package can be used to reshape “wide” data frames into “long”, tidy data frames. By making each observation a single row, it “lengthens” the data.
We can use the tidyr `pivot_longer` function to reshape the table of drug trial data from above (named `trial`) into a tidy data frame.
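If the handout data are not already loaded, here is a minimal sketch that recreates the `trial` table by hand (values copied from the table above):

# Recreate the example table by hand, only needed if the handout data
# are unavailable; values match the table shown above
trial <- data.frame(
  block = c(1L, 2L, 3L),
  drug = c(0.22, 0.12, 0.42),
  control = c(0.58, 0.98, 0.19),
  placebo = c(0.31, 0.47, 0.40)
)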
library(tidyr)
tidy_trial <- pivot_longer(trial,
cols = c(drug, control, placebo),
names_to = 'treatment',
values_to = 'response')
All columns, except for “block”, are stacked in two columns: a “name” and a “value”, as specified by the `cols` argument. The name column gets the name `treatment` from the `names_to` argument, and the value column receives the name `response` from the `values_to` argument. For each row in the result, the “name” (treatment) is taken from the name of the column and the “value” (response) from the data in the column. The resulting table is saved as `tidy_trial`.
> tidy_trial
# A tibble: 9 x 3
block treatment response
<int> <chr> <dbl>
1 1 drug 0.22
2 1 control 0.580
3 1 placebo 0.31
4 2 drug 0.12
5 2 control 0.98
6 2 placebo 0.47
7 3 drug 0.42
8 3 control 0.19
9 3 placebo 0.4
Some notes on the syntax: a big advantage of tidyr and dplyr is that each function takes a data frame as its first argument and returns a new data frame. As we will see later, this makes it very easy to chain these functions in a pipeline. All functions also use column names as variables without subsetting them from a data frame (i.e. `block` instead of `trial$block`).
Pivot_wider
Data can also fail to be tidy when a table is too long. The Entity-Attribute-Value (EAV) structure common in large databases distributes multiple attributes of a single entity/observation into separate rows.
Remember that the exact state of “tidy” may depend on the analysis: the key is knowing what counts as a complete observation. For example, the community ecology package vegan requires a matrix of species counts, where rows correspond to sites and columns to species. This may seem like too “wide” a format, but in several multivariate analyses, the abundance of every species at a single site is considered a complete observation.
Consider survey data on participants’ age and income stored in an EAV structure.
participant | attr | val |
---|---|---|
1 | age | 24 |
2 | age | 57 |
3 | age | 13 |
1 | income | 30 |
2 | income | 60 |
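If the `survey` data frame is not already in your session, a minimal sketch to build it by hand (integer columns match the output shown below):

# Build the EAV survey table by hand; values copied from the table above
survey <- data.frame(
  participant = c(1L, 2L, 3L, 1L, 2L),
  attr = c('age', 'age', 'age', 'income', 'income'),
  val = c(24L, 57L, 13L, 30L, 60L)
)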
Transform the data with the `pivot_wider` function, which “reverses” a `pivot_longer`.
tidy_survey <- pivot_wider(survey,
names_from = attr,
values_from = val)
The column names for the resulting data frame `tidy_survey` are taken from the `attr` column and the data from the `val` column.
> tidy_survey
# A tibble: 3 x 3
participant age income
<int> <int> <int>
1 1 24 30
2 2 57 60
3 3 13 NA
One difficulty with EAV tables is the nature of missing data: an entire row, rather than a single cell, is missing. It is important to evaluate how to handle the “missing data.” By default, “missing data” is coded as `NA`. The `values_fill` argument allows you to specify values for this missing data.
tidy_survey <- pivot_wider(survey,
names_from = attr,
values_from = val,
values_fill = 0)
> tidy_survey
# A tibble: 3 x 3
participant age income
<int> <int> <int>
1 1 24 30
2 2 57 60
3 3 13 0
Sample Data
To learn about data transformation with dplyr, we need more data. The Census Bureau collects subnational economic data for the U.S., releasing annual County Business Patterns (CBP) datasets that include the number of establishments, employment, and payroll by industry. The Bureau also conducts the American Community Survey (ACS) and publishes, among other demographic and economic variables, estimates of median income for individuals working in different industries.
These two datasets both contain economic variables for each U.S. county specified by different categories of industry. The data could potentially be manipulated into a single table reflecting the following statistical model.
\[median\_income \sim industry + establishment\_size\]

First, load the CBP data. `fread` from data.table is faster at reading large data sets than base R `read.csv`.
library(data.table)
cbp <- fread('data/cbp15co.csv')
> str(cbp)
Classes 'data.table' and 'data.frame': 2126601 obs. of 26 variables:
$ FIPSTATE: int 1 1 1 1 1 1 1 1 1 1 ...
$ FIPSCTY : int 1 1 1 1 1 1 1 1 1 1 ...
$ NAICS : chr "------" "11----" "113///" "1133//" ...
$ EMPFLAG : chr "" "" "" "" ...
$ EMP_NF : chr "G" "H" "H" "H" ...
$ EMP : int 10454 70 70 70 70 70 0 0 0 0 ...
$ QP1_NF : chr "G" "H" "H" "H" ...
$ QP1 : int 76437 790 790 790 790 790 0 0 0 0 ...
$ AP_NF : chr "G" "H" "H" "H" ...
$ AP : int 321433 3566 3551 3551 3551 3551 0 0 0 0 ...
$ EST : int 844 7 6 6 6 6 1 1 1 1 ...
$ N1_4 : int 430 5 4 4 4 4 1 1 1 1 ...
$ N5_9 : int 171 1 1 1 1 1 0 0 0 0 ...
$ N10_19 : int 118 0 0 0 0 0 0 0 0 0 ...
$ N20_49 : int 81 0 0 0 0 0 0 0 0 0 ...
$ N50_99 : int 35 1 1 1 1 1 0 0 0 0 ...
$ N100_249: int 6 0 0 0 0 0 0 0 0 0 ...
$ N250_499: int 2 0 0 0 0 0 0 0 0 0 ...
$ N500_999: int 1 0 0 0 0 0 0 0 0 0 ...
$ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ CENSTATE: int 63 63 63 63 63 63 63 63 63 63 ...
$ CENCTY : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
The CBP dataset includes NAICS (North American Industry Classification System) industry codes and the number of businesses or establishments of different employee sizes in each US county. States and counties are identified by Federal Information Processing System (FIPS) codes. See the CBP dataset documentation for an explanation of the variables we don’t discuss in this lesson.
We need to modify the import to clean up this read. The state and county codes should be read in as character type, using `colClasses`.
cbp <- fread(
'data/cbp15co.csv',
colClasses = c(
FIPSTATE='character',
FIPSCTY='character'))
Next, load the ACS data. The ACS data we are using in this example includes the median income by industry sector for each US county.
acs <- fread(
'data/ACS/sector_ACS_15_5YR_S2413.csv',
colClasses = c(FIPS='character'))
> str(acs)
Classes 'data.table' and 'data.frame': 59698 obs. of 4 variables:
$ FIPS : chr "01001" "01003" "01005" "01007" ...
$ County : chr "Autauga County, Alabama" "Baldwin County, Alabama" "Barbour County, Alabama" "Bibb County, Alabama" ...
$ Sector : chr "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" "agriculture forestry fishing and hunting" ...
$ median_income: int 27235 40017 32260 22240 21260 30469 33300 39784 40417 20370 ...
- attr(*, ".internal.selfref")=<externalptr>
dplyr Functions
Function | Returns |
---|---|
`filter` | keep rows that satisfy conditions |
`mutate` | apply a transformation to existing [split] columns |
`select` | keep columns with matching names |
`inner_join` | merge columns from separate tables into one table |
`group_by` | split data into groups by an existing factor |
`summarize` | summarize across rows [and combine split groups] |
The table above summarizes the most commonly used functions in dplyr. We will use dplyr to manipulate data frames with the U.S. Census Bureau data in order to prepare for data analysis.
Filter
The `cbp` table includes a character `NAICS` column for industry codes. NAICS codes can have up to 6 digits; as digits increase, the industry code becomes more specific. Of the 2 million observations, let’s see how many are left when we keep only the 2-digit NAICS codes, representing high-level, broad sectors of the economy.
We will use the `filter` command to include only rows where the NAICS code is 2 digits long. Empty digits are coded as “-”; we include only NAICS codes with 4 dashes, using the `grepl` command to find these rows. The filtered data is saved as `cbp2`.
library(dplyr)
cbp2 <- filter(cbp,
grepl('----', NAICS),
!grepl('------', NAICS))
> str(cbp2)
Classes 'data.table' and 'data.frame': 58901 obs. of 26 variables:
$ FIPSTATE: chr "01" "01" "01" "01" ...
$ FIPSCTY : chr "001" "001" "001" "001" ...
$ NAICS : chr "11----" "21----" "22----" "23----" ...
$ EMPFLAG : chr "" "" "" "" ...
$ EMP_NF : chr "H" "H" "H" "G" ...
$ EMP : int 70 82 196 372 971 211 2631 124 73 375 ...
$ QP1_NF : chr "H" "H" "H" "G" ...
$ QP1 : int 790 713 4793 2891 15386 2034 14905 1229 924 4201 ...
$ AP_NF : chr "H" "H" "H" "G" ...
$ AP : int 3566 3294 18611 13801 64263 11071 61502 5128 3407 16328 ...
$ EST : int 7 3 9 75 24 29 169 16 9 67 ...
$ N1_4 : int 5 0 2 51 9 18 68 9 5 41 ...
$ N5_9 : int 1 1 1 13 4 6 41 4 1 18 ...
$ N10_19 : int 0 1 2 7 4 2 34 1 1 6 ...
$ N20_49 : int 0 0 3 4 3 3 11 2 2 2 ...
$ N50_99 : int 1 1 1 0 3 0 11 0 0 0 ...
$ N100_249: int 0 0 0 0 0 0 3 0 0 0 ...
$ N250_499: int 0 0 0 0 0 0 1 0 0 0 ...
$ N500_999: int 0 0 0 0 1 0 0 0 0 0 ...
$ N1000 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ CENSTATE: int 63 63 63 63 63 63 63 63 63 63 ...
$ CENCTY : int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Note that a logical “and” is implied when conditions are separated by commas. (This is perhaps the main way in which `filter` differs from the base R `subset` function.) Therefore, the example above is equivalent to `filter(cbp, grepl('----', NAICS) & !grepl('------', NAICS))`. A logical “or”, on the other hand, must be specified explicitly with the `|` operator.
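For example, a sketch of an explicit “or” (the object name `cbp_select` is just for illustration) that keeps rows from either of two top-level sectors:

# Keep rows whose NAICS code is agriculture ('11----') or mining ('21----')
cbp_select <- filter(cbp, NAICS == '11----' | NAICS == '21----')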
Alternatively, the stringr package makes pattern matching by regular expressions a bit more manageable, and streamlines this step.
library(stringr)
cbp2 <- filter(cbp,
str_detect(NAICS, '[0-9]{2}----'))
This code only includes NAICS codes with any two digits followed by four dashes.
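A quick way to convince yourself the pattern works is to test `str_detect` on a few sample codes:

# str_detect returns a logical vector: TRUE only where the pattern matches
str_detect(c('11----', '113///', '------'), '[0-9]{2}----')
# [1]  TRUE FALSE FALSE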
Mutate
The `mutate` function is the dplyr answer to updating or altering your columns. It performs operations on existing columns and appends the result as a new column of the same length.
In the CBP data, FIPS codes are split into separate state and county columns; however, the convention is to combine them into one code, concatenating the 2-digit state and 3-digit county codes.
The `mutate` command will add a new column `FIPS` to the `cbp2` data frame. Values for the FIPS column are created with the `str_c` function from the stringr package, which concatenates the FIPS state and county codes.
cbp3 <- mutate(cbp2,
FIPS = str_c(FIPSTATE, FIPSCTY))
Multiple arguments to `mutate` produce multiple transformations.
cbp3 <- mutate(cbp2,
FIPS = str_c(FIPSTATE, FIPSCTY),
NAICS = str_remove(NAICS, '-+'))
`FIPS` is a new column. But `mutate` can also transform data and overwrite an existing column, as done here with `NAICS` to remove the dashes from the NAICS codes.
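An optional quick check that both transformations behaved as expected:

# Inspect the first few FIPS and NAICS values after the transformation
head(select(cbp3, FIPS, NAICS))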
Chaining Functions
All the functions from the dplyr package take a data frame as their first argument, and they return a data frame. This consistent syntax is on purpose. It is designed for easily chaining data transformations together: creating a data pipeline that is easy to read and modify.
The “pipe” operator (`%>%`) takes the expression on its left-hand side and inserts it, as the first argument, into the function on its right-hand side: `x %>% function()` is equivalent to `function(x)`.
For example, instead of `sum(c(1, 3, 5))`, we have:
> c(1, 3, 5) %>% sum()
[1] 9
Additional arguments can be added to the function—the pipe only handles the first argument.
> c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9
The pipe operator’s main utility is to condense a chain of operations applied to the same piece of data, when you don’t want any intermediate results. So instead of:
function_A(function_B(function_C(x)))
pipes allow you to do the following:
x %>% function_A() %>% function_B() %>% function_C()
We can do the `filter` and `mutate` operations from above with one assignment.
cbp <- cbp %>%
filter(
str_detect(NAICS, '[0-9]{2}----')
) %>%
mutate(
FIPS = str_c(FIPSTATE, FIPSCTY),
NAICS = str_remove(NAICS, '-+')
)
Select
To keep particular columns of a data frame (rather than filtering rows), use
the select
function with arguments that match column names.
> names(cbp)
[1] "FIPSTATE" "FIPSCTY" "NAICS" "EMPFLAG" "EMP_NF" "EMP"
[7] "QP1_NF" "QP1" "AP_NF" "AP" "EST" "N1_4"
[13] "N5_9" "N10_19" "N20_49" "N50_99" "N100_249" "N250_499"
[19] "N500_999" "N1000" "N1000_1" "N1000_2" "N1000_3" "N1000_4"
[25] "CENSTATE" "CENCTY" "FIPS"
One way to “match” is to include the complete name of each column you want to keep:
> cbp %>%
+ select(
+ FIPS,
+ NAICS,
+ N1_4, N5_9, N10_19
+ )
Alternatively, we can use a “select helper” to match patterns.
cbp <- cbp %>%
select(
FIPS,
NAICS,
starts_with('N')
)
The `cbp` data frame now only includes the columns we are interested in for our analysis: the full FIPS county code, the NAICS industry code, and the number of establishments in different employee size classes.
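`starts_with` is one of several “select helpers”; a brief sketch of two others (not needed for this lesson):

# Other select helpers match column names by different patterns
select(cbp, contains('FIPS'))    # names containing "FIPS"
select(cbp, matches('^N[0-9]'))  # names matching a regular expression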
Join
The CBP dataset uses FIPS to identify U.S. counties and NAICS codes to identify types of industry. The ACS dataset also uses FIPS for counties but uses industry sectors instead of NAICS codes. The sectors are broader than the NAICS codes, with multiple NAICS codes representing a single industry sector.
To combine the ACS and CBP datasets, we need to relate the NAICS codes to the industry sector.
A data frame `sector` relates the NAICS codes to the industry sectors as used in the ACS dataset.
sector <- fread(
'data/ACS/sector_naics.csv',
colClasses = c(NAICS='character'))
> View(sector)
Probably the primary challenge in combining secondary datasets for synthesis research is dealing with their different sampling frames. A very common issue is that data are collected at different “scales”, with one dataset being at higher spatial or temporal resolution than another. The differences between the CBP and ACS categories of industry present a similar problem, and require the same solution of re-aggregating data at the “lower resolution”.
Many-to-One
A dplyr join function can be used to “join” the `cbp` and `sector` data frames.
Join commands combine two data frames based on the columns specified by the `by =` argument. If no columns are given, the default is to combine using all columns with the same names. In our example, `cbp` and `sector` will join based on the `NAICS` column present in both data frames.
There are multiple types of join commands, which differ in how they handle data missing from one of the data frames. To combine `cbp` and `sector` we will use `inner_join`, which only includes rows that have the same NAICS code in both `cbp` and `sector`. Information on the other types of join commands can be found in the dplyr [join documentation](https://dplyr.tidyverse.org/reference/join.html).
cbp <- cbp %>%
inner_join(sector)
Joining, by = "NAICS"
> View(cbp)
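Naming the key explicitly is equivalent and silences the message; a minimal variant of the chunk above (an alternative, not to be run in addition):

# Same join, with the join key given explicitly
cbp %>% inner_join(sector, by = 'NAICS')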
The NAICS field in the `cbp` table can have the same value multiple times; it is not a primary key in this table. In the `sector` table, the NAICS field is the primary key, uniquely identifying each record. The type of relationship between these tables is therefore “many-to-one”.
- Question
- Note that we lost a couple thousand rows through this join. How could `cbp` have fewer rows after a join on NAICS codes?
- Answer
- The CBP data contains a NAICS code not mapped to a sector: the “error code” 99 is not present in `sector`. The use of “error codes” that could easily be mistaken for data is frowned upon.
Group By
A very common data manipulation procedure known as “split-apply-combine” tackles the problem of applying the same transformation to subsets of data while keeping the results all together.
Because there were multiple NAICS codes associated with a single `Sector`, there are now multiple rows for each sector within a county. We need the total number of establishments in each size class aggregated within each county and industry sector.
The dplyr function `group_by` begins the process by indicating how the data frame should be split into subsets.
cbp_grouped <- cbp %>%
group_by(FIPS, Sector)
At this point, nothing has really changed:
> str(cbp_grouped)
tibble [56,704 × 16] (S3: grouped_df/tbl_df/tbl/data.frame)
$ FIPS : chr [1:56704] "01001" "01001" "01001" "01001" ...
$ NAICS : chr [1:56704] "11" "21" "22" "23" ...
$ N1_4 : int [1:56704] 5 0 2 51 9 18 68 9 5 41 ...
$ N5_9 : int [1:56704] 1 1 1 13 4 6 41 4 1 18 ...
$ N10_19 : int [1:56704] 0 1 2 7 4 2 34 1 1 6 ...
$ N20_49 : int [1:56704] 0 0 3 4 3 3 11 2 2 2 ...
$ N50_99 : int [1:56704] 1 1 1 0 3 0 11 0 0 0 ...
$ N100_249: int [1:56704] 0 0 0 0 0 0 3 0 0 0 ...
$ N250_499: int [1:56704] 0 0 0 0 0 0 1 0 0 0 ...
$ N500_999: int [1:56704] 0 0 0 0 1 0 0 0 0 0 ...
$ N1000 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_1 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_2 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_3 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
$ N1000_4 : int [1:56704] 0 0 0 0 0 0 0 0 0 0 ...
$ Sector : chr [1:56704] "agriculture forestry fishing and hunting" "mining quarrying and oil and gas extraction" "utilities" "construction" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "groups")= tibble [56,704 × 3] (S3: tbl_df/tbl/data.frame)
..$ FIPS : chr [1:56704] "01001" "01001" "01001" "01001" ...
..$ Sector: chr [1:56704] "accommodation and food services" "administrative and support and waste management and remediation services" "agriculture forestry fishing and hunting" "arts entertainment and recreation" ...
..$ .rows : list<int> [1:56704]
.. ..$ : int 18
.. ..$ : int 14
.. ..$ : int 1
.. .. [list output truncated]
.. ..@ ptype: int(0)
..- attr(*, ".drop")= logi TRUE
The `group_by` statement does not change any values in the data frame; it only adds attributes to the original data frame. You can add multiple variables (separated by commas) in `group_by`; each distinct combination of values across these columns defines a different group.
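You can inspect the grouping metadata directly with dplyr’s `group_vars` and `n_groups`; a quick sketch:

# The grouping attributes are visible without changing the data
group_vars(cbp_grouped)  # "FIPS" "Sector"
n_groups(cbp_grouped)    # number of distinct FIPS x Sector combinations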
Summarize
Different operations can be applied to each group. Here, we need to sum the number of establishments in each group.
Using dplyr functions, we group the data by county and sector (`group_by`), select the columns for number of establishments (`select`), and sum the number of establishments within each group (`summarize_all`).
cbp <- cbp %>%
group_by(FIPS, Sector) %>%
select(starts_with('N'), -NAICS) %>%
summarize_all(sum)
Adding missing grouping variables: `FIPS`, `Sector`
The “combine” part of “split-apply-combine” occurs automatically, when the attributes introduced by `group_by` are dropped. You can see the attributes either by running the `str()` function on the data frame or by inspecting it in the RStudio Environment pane.
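Note that `summarize` (and `summarize_all`) drops only the last grouping variable by default, so the result may still be grouped by FIPS; a minimal sketch to strip any remaining grouping:

# ungroup() removes all remaining grouping attributes
cbp <- ungroup(cbp)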
There is now a one-to-one relationship between `cbp` and `acs`, based on the combination of FIPS and Sector as the primary key for both tables.
To finish creating the dataset for the analysis of interest, we now join the `acs` data frame. The two data frames are joined by `FIPS` and `Sector`.
acs_cbp <- cbp %>%
inner_join(acs)
Joining, by = c("FIPS", "Sector")
Again, however, the one-to-one relationship does not mean all rows are preserved by the join. The specific nature of `inner_join` is to keep only rows with matching values in both tables, duplicating rows where the relationship is many-to-one and discarding the rest.
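For comparison, a sketch of a `left_join` (the object name `acs_cbp_all` is illustrative), which would instead keep every row of `cbp` and fill unmatched ACS columns with `NA`:

# Keep all cbp rows; median_income is NA where acs has no matching row
acs_cbp_all <- left_join(cbp, acs, by = c('FIPS', 'Sector'))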
The `acs_cbp` table now includes the `median_income` variable from the ACS and appropriately aggregated establishment size information (the number of establishments by employee bins) from the CBP table.
> View(acs_cbp)
Additional Resources
The following cheat sheets and tutorials repeat much of this lesson, but also provide information on additional functions for “data wrangling”.
The first is one of several cheat sheets created by RStudio, and provides a handy, visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for summarize, “moving window” functions for mutate, etc.
Exercises
Exercise 1
Now that we have a tidy form of `survey`, convert it to a `long_survey` data frame using `pivot_longer`. The only difference between `survey` and `long_survey` should be an additional row for zero income.
Exercise 2
Use `filter` and `select` to return just the annual payroll data for the top-level construction sector (“23----”), using data from CBP. Annual payroll information is in columns AP and AP_NF.
Exercise 3
Write code to create a data frame giving, for each state, the number of counties in the CBP survey with establishments in mining or oil and gas extraction (“21----”), along with their total employment (“EMP”).
Exercise 4
A “pivot table” is a transformation of tidy data into a wide summary table. First, data are summarized by two grouping factors, then one of these is “pivoted” into columns. Using only data with a 2-digit NAICS code, chain a split-apply-combine procedure into a “wide” table to get the total number of employees (“EMP”) in each state (as rows) by NAICS code (as columns).
Solutions
Solution 1
long_survey <- pivot_longer(tidy_survey,
  cols = c(age, income),
  values_to = 'val',
  names_to = 'attr')
Solution 2
cbp_23 <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(NAICS == '23----') %>%
select(starts_with('FIPS'), starts_with('AP'))
Solution 3
cbp_21 <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(NAICS == '21----') %>%
group_by(FIPSTATE) %>%
summarize(EMP = sum(EMP), counties = n())
`summarise()` ungrouping output (override with `.groups` argument)
Solution 4
pivot <- fread('data/cbp15co.csv', na.strings = '') %>%
filter(str_detect(NAICS, '[0-9]{2}----')) %>%
group_by(FIPSTATE, NAICS) %>%
summarize(EMP = sum(EMP)) %>%
pivot_wider(names_from = NAICS, values_from = EMP)
`summarise()` regrouping output by 'FIPSTATE' (override with `.groups` argument)