Manipulating Tabular Data
Note: This lesson is in beta status! It may have issues that have not been addressed.
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
Lesson Objectives
- Review what makes a dataset tidy.
- Meet a complete set of functions for most table manipulations.
- Learn to transform datasets with split-apply-combine procedures.
- Understand the basic join operation.
Specific Achievements
- Reshape data frames with pandas
- Summarize data by groups with pandas
- Combine multiple data frame operations with method chaining (pandas' analogue of piping, using the "." operator)
- Combine multiple data frames with “joins” (merge)
Data frames occupy a central place in Python data analysis pipelines. The pandas package provides the objects and most of the tools necessary to subset, reformat, and transform data frames. The key functions in the package have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between Python and relational databases.
Tidy Concept
Most time in data analysis is spent cleaning and wrangling data rather than on the analysis itself. In 2014, Hadley Wickham (R developer at RStudio) published a paper defining the concepts underlying tidy datasets, which he characterized as those where:
- each variable forms a column (also called field)
- each observation forms a row
- each type of observational unit forms a table
These guidelines may be familiar to some of you: they closely map to best practices for “normalization” in database design. They correspond to the third normal form described by Codd (1990), but use the language of statistical analysis rather than relational databases.
Consider a data set where the outcome of an experiment has been recorded in a perfectly appropriate way:
block | drug | control | placebo |
---|---|---|---|
1 | 0.22 | 0.58 | 0.31 |
2 | 0.12 | 0.98 | 0.47 |
3 | 0.42 | 0.19 | 0.40 |
The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:
\[response \sim block + treatment\]

In a tidy format, each row is a complete observation: it includes the response value and all the predictor values. In this data, some of those predictor values are column headers, so the table needs to be reshaped. The pandas package provides functions to help re-organize tables.
The third principle of tidy data, one table per category of observed entities, becomes especially important in synthesis research. Following this principle requires holding tidy data in multiple tables, with associations between them formalized in metadata, as in a relational database.
Datasets split across multiple tables are unavoidable in synthesis research, and are commonly handled in the following two ways (often in combination):
- two tables are “un-tidied” by joins, i.e. by merging them into one table
- statistical models conform to the data model through a hierarchical structure or by employing “random effects”
The pandas package includes several functions that all perform variations on table joins needed to “un-tidy” your tables, but there are only two basic types of table relationships to recognize:
- One-to-one relationships allow tables to be combined based on the same unique identifier (or “primary key”) in both tables.
- Many-to-one relationships require non-unique “foreign keys” in the first table to match the primary key of the second.
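As a preview of the join operations covered below, here is a sketch using two small made-up tables (plots and surveys, both hypothetical) to illustrate each relationship; validate is a standard argument of pandas merge that asserts the expected relationship:

import pandas as pd

# One-to-one: plot_id is a unique primary key in both tables.
plots = pd.DataFrame({"plot_id": [1, 2], "area_ha": [0.5, 1.2]})
soils = pd.DataFrame({"plot_id": [1, 2], "soil": ["loam", "clay"]})
one_to_one = plots.merge(soils, on="plot_id", validate="one_to_one")

# Many-to-one: plot_id repeats in surveys (a foreign key) but is
# unique in plots, so each survey row picks up its plot's attributes.
surveys = pd.DataFrame({"plot_id": [1, 1, 2], "count": [4, 7, 2]})
many_to_one = surveys.merge(plots, on="plot_id", validate="many_to_one")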
Wide to Long
The pandas package’s melt function reshapes “wide” data frames into “long” ones.
import pandas as pd
import numpy as np
trial_df = pd.DataFrame({"block": [1,2,3],
"drug": [0.22,0.12,0.42],
"control": [0.58,0.98,0.19],
"placebo": [0.31,0.47,0.40]})
trial_df.head()
block control drug placebo
0 1 0.58 0.22 0.31
1 2 0.98 0.12 0.47
2 3 0.19 0.42 0.40
tidy_trial_df = pd.melt(trial_df,
id_vars=['block'],
var_name='treatment',
value_name='response')
tidy_trial_df.head()
block treatment response
0 1 control 0.58
1 2 control 0.98
2 3 control 0.19
3 1 drug 0.22
4 2 drug 0.12
All columns, except for “block”, are stacked in two columns: a “key” and a “value”. The key column gets the name treatment and the value column receives the name response. For each row in the result, the key is taken from the name of the original column and the value from the data in that column.
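By default, melt stacks every column not named in id_vars. If only some columns should be stacked, the value_vars argument (a standard melt parameter) restricts the reshaping; a minimal sketch:

# Stack only the drug and placebo columns, leaving control out entirely.
partial_df = pd.melt(trial_df,
                     id_vars=['block'],
                     value_vars=['drug', 'placebo'],
                     var_name='treatment',
                     value_name='response')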
Long to Wide
Data can also fail to be tidy when a table is too long. The Entity-Attribute-Value (EAV) structure common in large databases distributes multiple attributes of a single entity/observation into separate rows.
Remember that the exact state of “tidy” may depend on the analysis: the key is knowing what counts as a complete observation. For example, the community ecology package vegan requires a matrix of species counts, where rows correspond to sites and columns to species. This may seem like too “wide” a format, but in the package’s several multivariate analyses, the species composition of a whole site is considered a complete observation.
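For illustration, a hypothetical long table of counts (the species_df data below are made up) could be reshaped into such a matrix with pivot_table:

# One row per site-species combination in the long format.
species_df = pd.DataFrame({"site": ["A", "A", "B"],
                           "species": ["oak", "pine", "oak"],
                           "count": [3, 1, 5]})

# A site-by-species matrix: "wide" by tidy standards, but each row is
# a complete community observation for these analyses.
community = species_df.pivot_table(index="site",
                                   columns="species",
                                   values="count",
                                   fill_value=0)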
First, the pivot function can reverse the earlier melt, restoring the trial data to its original wide format.
df2 = tidy_trial_df.pivot(index='block',
columns='treatment',
values='response')
df2 = df2.reset_index()
df2.columns
Index(['block', 'control', 'drug', 'placebo'], dtype='object', name='treatment')
df2.reset_index()
treatment index block control drug placebo
0 0 1 0.58 0.22 0.31
1 1 2 0.98 0.12 0.47
2 2 3 0.19 0.42 0.40
df2
treatment block control drug placebo
0 1 0.58 0.22 0.31
1 2 0.98 0.12 0.47
2 3 0.19 0.42 0.40
Consider survey data on participant’s age and income stored in a EAV structure.
from io import StringIO
text_string = StringIO("""
participant,attr,val
1,age,24
2,age,57
3,age,13
1,income,30
2,income,60
""")
survey_df = pd.read_csv(text_string, sep=",")
survey_df
participant attr val
0 1 age 24
1 2 age 57
2 3 age 13
3 1 income 30
4 2 income 60
Transform the data with the pivot function, which “reverses” a melt. These are equivalent to spread and gather in the tidyr R package.
tidy_survey = survey_df.pivot(index='participant',
columns='attr',
values='val')
print(tidy_survey.head())
attr age income
participant
1 24.0 30.0
2 57.0 60.0
3 13.0 NaN
tidy_survey = tidy_survey.reset_index()
tidy_survey.columns
Index(['participant', 'age', 'income'], dtype='object', name='attr')
tidy_survey.reset_index()
attr index participant age income
0 0 1 24.0 30.0
1 1 2 57.0 60.0
2 2 3 13.0 NaN
tidy_survey
attr participant age income
0 1 24.0 30.0
1 2 57.0 60.0
2 3 13.0 NaN
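Also note that age and income were integers on input but are floats after pivoting: the missing income for participant 3 forces the whole value column to a floating-point type. If integer types matter, one option (a sketch) is to convert to pandas’ nullable integer dtype:

# The nullable 'Int64' dtype (capital I) keeps the missing value while
# restoring integer storage for the rest of the column.
tidy_survey = tidy_survey.astype({'age': 'Int64', 'income': 'Int64'})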
Sample Data
To learn about data transformation with pandas, we need more data. The Census Bureau collects subnational economic data for the U.S., releasing annual County Business Patterns (CBP) datasets including the number of establishments, employment, and payroll by industry. They also conduct the American Community Survey (ACS) and publish, among other demographic and economic variables, estimates of median income for individuals working in different industries.
import pandas as pd
cbp = pd.read_csv('data/cbp15co.csv')
cbp.describe()
print(cbp.dtypes)
FIPSTATE int64
FIPSCTY int64
NAICS object
EMPFLAG object
EMP_NF object
EMP int64
QP1_NF object
QP1 int64
AP_NF object
AP int64
EST int64
N1_4 int64
N5_9 int64
N10_19 int64
N20_49 int64
N50_99 int64
N100_249 int64
N250_499 int64
N500_999 int64
N1000 int64
N1000_1 int64
N1000_2 int64
N1000_3 int64
N1000_4 int64
CENSTATE int64
CENCTY int64
dtype: object
See the CBP dataset documentation for an explanation of the variables we don’t discuss in this lesson.
Modify the import to clean up this read: consider the data type for the FIPS codes, along with what string in this CSV file represents NAs, i.e. data that is not available or missing.
import numpy as np
import pandas as pd

# FIPS codes are identifiers, not numbers: read them as strings, and
# treat only "NULL" (not blanks) as missing.
cbp = pd.read_csv(
    'data/cbp15co.csv',
    na_values="NULL",
    keep_default_na=False,
    dtype={"FIPSTATE": str,
           "FIPSCTY": str}
)
- Question
- What changed?
- Answer
- Inspecting dtypes shows that the character string "" in the CSV file is no longer read in as missing data (an NA) but as an empty string. The two named “FIPS” columns are now correctly read as strings.
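A quick check (a sketch; treating EMPFLAG as one of the columns where blanks now come through as empty strings is an assumption about this particular file):

print(cbp[['FIPSTATE', 'FIPSCTY']].dtypes)  # both now object (string) columns
print((cbp['EMPFLAG'] == '').sum())         # blanks read as '', not NaN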
import pandas as pd
import numpy as np
acs = pd.read_csv(
    'data/ACS/sector_ACS_15_5YR_S2413.csv',
    dtype={"FIPS": str}
)
Now let’s display the data types
print(acs.dtypes)
FIPS object
County object
Sector object
median_income float64
dtype: object
The two datasets both contain economic variables for each U.S. county, specified by different categories of industry. The data could potentially be manipulated into a single table reflecting the following statistical model.
\[median\_income \sim industry + establishment\_size\]

Key Functions
Function | Returns |
---|---|
query | keep rows that satisfy conditions |
assign | apply a transformation to existing [split] columns |
['col1', 'col2'] | select and keep columns with matching names |
merge | merge columns from separate tables into one table |
groupby | split data into groups by an existing factor |
agg | summarize across rows to use after groupby [and combine split groups] |
The table above summarizes the most commonly used functions in pandas, which we will demonstrate in turn on data from the U.S. Census Bureau.
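These functions compose naturally through method chaining. As a quick preview (a sketch only; it assumes the cleaned-up cbp table read above, with string FIPS columns), the pipeline below sums employment by state for records with over 1,000 employees:

# Filter rows, derive a column, select, split, and summarize in one chain.
state_emp = (cbp
    .query('EMP > 1000')                                   # keep rows
    .assign(FIPS=lambda d: d['FIPSTATE'] + d['FIPSCTY'])   # new column
    [['FIPS', 'FIPSTATE', 'EMP']]                          # select columns
    .groupby('FIPSTATE')                                   # split into groups
    .agg({'EMP': 'sum'})                                   # apply and combine
)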
Filter and pattern matching
The cbp table includes a character NAICS column. Of the 2 million observations, let’s see how many are left when we keep only the 2-digit NAICS codes, representing high-level sectors of the economy.
cbp2 = cbp[cbp['NAICS'].str.contains("----")]
cbp2 = cbp2[~cbp2.NAICS.str.contains("-----")]
cbp2.head()
FIPSTATE FIPSCTY NAICS EMPFLAG ... N1000_3 N1000_4 CENSTATE CENCTY
1 01 001 11---- ... 0 0 63 1
10 01 001 21---- ... 0 0 63 1
17 01 001 22---- ... 0 0 63 1
27 01 001 23---- ... 0 0 63 1
93 01 001 31---- ... 0 0 63 1
[5 rows x 26 columns]
Note that we used the pandas string method contains to filter the dataset in two steps. contains allows pattern matching on any characters within strings, and the ~ negates the boolean index, removing the rows that contain the given pattern.

Filtering strings often uses pattern matching by regular expressions, which may be a bit more manageable and streamlines the operation into a single step.
cbp3 = cbp[cbp['NAICS'].str.contains('[0-9]{2}----')]
cbp3.head()
FIPSTATE FIPSCTY NAICS EMPFLAG ... N1000_3 N1000_4 CENSTATE CENCTY
1 01 001 11---- ... 0 0 63 1
10 01 001 21---- ... 0 0 63 1
17 01 001 22---- ... 0 0 63 1
27 01 001 23---- ... 0 0 63 1
93 01 001 31---- ... 0 0 63 1
[5 rows x 26 columns]
Altering, updating and transforming columns
The assign function is the pandas answer to updating or altering your columns. It performs arbitrary operations on existing columns and appends the result as a new column of the same length.

Here are two ways to create a new column, using assign and the [ ] operator.
cbp3["FIPS"] = cbp3["FIPSTATE"]+cbp3["FIPSCTY"]
/usr/bin/python3:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
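The warning arises because cbp3 was created by filtering cbp, so it may be a view of the original data rather than an independent copy. A common remedy (a sketch) is to take an explicit copy before adding columns:

# An explicit copy makes cbp3 independent of cbp, so the new column
# assignment no longer triggers the SettingWithCopyWarning.
cbp3 = cbp3.copy()
cbp3["FIPS"] = cbp3["FIPSTATE"] + cbp3["FIPSCTY"]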
cbp3.assign(FIPS2=lambda x: x['FIPSTATE']+x['FIPSCTY'])
FIPSTATE FIPSCTY NAICS EMPFLAG ... CENSTATE CENCTY FIPS FIPS2
1 01 001 11---- ... 63 1 01001 01001
10 01 001 21---- ... 63 1 01001 01001
17 01 001 22---- ... 63 1 01001 01001
27 01 001 23---- ... 63 1 01001 01001
93 01 001 31---- ... 63 1 01001 01001
163 01 001 42---- ... 63 1 01001 01001
218 01 001 44---- ... 63 1 01001 01001
351 01 001 48---- ... 63 1 01001 01001
381 01 001 51---- ... 63 1 01001 01001
401 01 001 52---- ... 63 1 01001 01001
429 01 001 53---- ... 63 1 01001 01001
465 01 001 54---- ... 63 1 01001 01001
509 01 001 55---- A ... 63 1 01001 01001
514 01 001 56---- ... 63 1 01001 01001
540 01 001 61---- ... 63 1 01001 01001
554 01 001 62---- ... 63 1 01001 01001
615 01 001 71---- ... 63 1 01001 01001
632 01 001 72---- ... 63 1 01001 01001
652 01 001 81---- ... 63 1 01001 01001
707 01 001 99---- ... 63 1 01001 01001
709 01 003 11---- ... 63 3 01003 01003
731 01 003 21---- ... 63 3 01003 01003
747 01 003 22---- ... 63 3 01003 01003
762 01 003 23---- ... 63 3 01003 01003
828 01 003 31---- ... 63 3 01003 01003
1052 01 003 42---- ... 63 3 01003 01003
1191 01 003 44---- ... 63 3 01003 01003
1353 01 003 48---- ... 63 3 01003 01003
1429 01 003 51---- ... 63 3 01003 01003
1471 01 003 52---- ... 63 3 01003 01003
... ... ... ... ... ... ... ... ... ...
2126060 56 045 31---- C ... 83 45 56045 56045
2126077 56 045 42---- ... 83 45 56045 56045
2126098 56 045 44---- ... 83 45 56045 56045
2126163 56 045 48---- ... 83 45 56045 56045
2126184 56 045 51---- ... 83 45 56045 56045
2126197 56 045 52---- ... 83 45 56045 56045
2126215 56 045 53---- ... 83 45 56045 56045
2126229 56 045 54---- ... 83 45 56045 56045
2126256 56 045 55---- A ... 83 45 56045 56045
2126261 56 045 56---- ... 83 45 56045 56045
2126278 56 045 61---- A ... 83 45 56045 56045
2126283 56 045 62---- ... 83 45 56045 56045
2126326 56 045 71---- A ... 83 45 56045 56045
2126331 56 045 72---- ... 83 45 56045 56045
2126345 56 045 81---- ... 83 45 56045 56045
2126376 56 045 99---- A ... 83 45 56045 56045
2126378 56 999 21---- ... 83 999 56999 56999
2126390 56 999 22---- A ... 83 999 56999 56999
2126395 56 999 23---- ... 83 999 56999 56999
2126407 56 999 42---- ... 83 999 56999 56999
2126452 56 999 44---- A ... 83 999 56999 56999
2126457 56 999 48---- E ... 83 999 56999 56999
2126466 56 999 51---- ... 83 999 56999 56999
2126489 56 999 52---- ... 83 999 56999 56999
2126519 56 999 54---- ... 83 999 56999 56999
2126542 56 999 55---- A ... 83 999 56999 56999
2126547 56 999 56---- ... 83 999 56999 56999
2126570 56 999 61---- A ... 83 999 56999 56999
2126578 56 999 62---- ... 83 999 56999 56999
2126592 56 999 81---- A ... 83 999 56999 56999
[58901 rows x 28 columns]
cbp3.shape
(58901, 27)
cbp3.head()
FIPSTATE FIPSCTY NAICS EMPFLAG ... N1000_4 CENSTATE CENCTY FIPS
1 01 001 11---- ... 0 63 1 01001
10 01 001 21---- ... 0 63 1 01001
17 01 001 22---- ... 0 63 1 01001
27 01 001 23---- ... 0 63 1 01001
93 01 001 31---- ... 0 63 1 01001
[5 rows x 27 columns]
Select
To keep particular columns of a data frame (rather than filtering rows), use the filter or [ ] functions with arguments that match column names.
cbp2.columns
Index(['FIPSTATE', 'FIPSCTY', 'NAICS', 'EMPFLAG', 'EMP_NF', 'EMP', 'QP1_NF',
'QP1', 'AP_NF', 'AP', 'EST', 'N1_4', 'N5_9', 'N10_19', 'N20_49',
'N50_99', 'N100_249', 'N250_499', 'N500_999', 'N1000', 'N1000_1',
'N1000_2', 'N1000_3', 'N1000_4', 'CENSTATE', 'CENCTY'],
dtype='object')
One way to “match” is by including complete names, each one you want to keep:
cbp3 = cbp3[['FIPS','NAICS','N1_4', 'N5_9', 'N10_19']]
cbp3.head()
FIPS NAICS N1_4 N5_9 N10_19
1 01001 11---- 5 1 0
10 01001 21---- 0 1 1
17 01001 22---- 2 1 2
27 01001 23---- 51 13 7
93 01001 31---- 9 4 4
Alternatively, we can use the filter function to select all columns starting with N or matching the ‘FIPS’ or ‘NAICS’ patterns. The filter command is useful when chaining methods (or piping operations).
cbp4 = cbp.filter(regex='^N|FIPS|NAICS', axis=1)
cbp4.head()
FIPSTATE FIPSCTY NAICS N1_4 ... N1000_1 N1000_2 N1000_3 N1000_4
0 01 001 ------ 430 ... 0 0 0 0
1 01 001 11---- 5 ... 0 0 0 0
2 01 001 113/// 4 ... 0 0 0 0
3 01 001 1133// 4 ... 0 0 0 0
4 01 001 11331/ 4 ... 0 0 0 0
[5 rows x 16 columns]
Join
The CBP dataset uses FIPS to identify U.S. counties and NAICS codes to identify types of industry. The ACS dataset also uses FIPS, but its data may aggregate across multiple NAICS codes representing a single industry sector.
sector = pd.read_csv(
'data/ACS/sector_naics.csv',
dtype = {"NAICS": np.int64})
print(sector.dtypes)
Sector object
NAICS int64
dtype: object
print(cbp.dtypes)
FIPSTATE object
FIPSCTY object
NAICS object
EMPFLAG object
EMP_NF object
EMP int64
QP1_NF object
QP1 int64
AP_NF object
AP int64
EST int64
N1_4 int64
N5_9 int64
N10_19 int64
N20_49 int64
N50_99 int64
N100_249 int64
N250_499 int64
N500_999 int64
N1000 int64
N1000_1 int64
N1000_2 int64
N1000_3 int64
N1000_4 int64
CENSTATE int64
CENCTY int64
dtype: object
cbp.head()
FIPSTATE FIPSCTY NAICS EMPFLAG ... N1000_3 N1000_4 CENSTATE CENCTY
0 01 001 ------ ... 0 0 63 1
1 01 001 11---- ... 0 0 63 1
2 01 001 113/// ... 0 0 63 1
3 01 001 1133// ... 0 0 63 1
4 01 001 11331/ ... 0 0 63 1
[5 rows x 26 columns]
print(sector.shape) #24 economic sectors
(24, 2)
sector.head()
Sector NAICS
0 agriculture forestry fishing and hunting 11
1 mining quarrying and oil and gas extraction 21
2 utilities 22
3 construction 23
4 manufacturing 31
Probably the primary challenge in combining secondary datasets for synthesis research is dealing with their different sampling frames. A very common issue is that data are collected at different “scales”, with one dataset being at higher spatial or temporal resolution than another. The differences between the CBP and ACS categories of industry present a similar problem, and require the same solution of re-aggregating data at the “lower resolution”.
Many-to-One
Before performing the join operation, some preprocessing is necessary to extract from the NAICS column the first two digits, which match the sector identifiers.
logical_idx = cbp['NAICS'].str.match('[0-9]{2}----') #boolean index
cbp = cbp.loc[logical_idx]
cbp.head()
FIPSTATE FIPSCTY NAICS EMPFLAG ... N1000_3 N1000_4 CENSTATE CENCTY
1 01 001 11---- ... 0 0 63 1
10 01 001 21---- ... 0 0 63 1
17 01 001 22---- ... 0 0 63 1
27 01 001 23---- ... 0 0 63 1
93 01 001 31---- ... 0 0 63 1
[5 rows x 26 columns]
cbp.shape
(58901, 26)
cbp['NAICS'] = cbp.NAICS.apply(lambda x: np.int64(x[0:2]))  # keep the first two digits
# Many-to-one merge to attach the economic sector to each CBP record
cbp_test = cbp.merge(sector, on="NAICS", how='inner')
cbp_test.head()
FIPSTATE FIPSCTY ... CENCTY Sector
0 01 001 ... 1 agriculture forestry fishing and hunting
1 01 003 ... 3 agriculture forestry fishing and hunting
2 01 005 ... 5 agriculture forestry fishing and hunting
3 01 007 ... 7 agriculture forestry fishing and hunting
4 01 009 ... 9 agriculture forestry fishing and hunting
[5 rows x 27 columns]
print(cbp_test.shape)
(56704, 27)
The NAICS field in the cbp table can have the same value multiple times; it is not a primary key in this table. In the sector table, the NAICS field is the primary key, uniquely identifying each record. The type of relationship between these tables is therefore “many-to-one”.
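pandas can enforce this expectation: validate is a standard merge argument, and 'many_to_one' makes the join fail loudly if NAICS turned out not to be unique in sector. A sketch of the same join with the check added:

# Identical inner join, but asserting the many-to-one relationship.
cbp_test = cbp.merge(sector, on="NAICS", how='inner',
                     validate="many_to_one")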
- Question
- Note that we lost a couple thousand rows through this join. How could cbp have fewer rows after a join on NAICS codes?
- Answer
- The CBP data contains an NAICS code not mapped to a sector: the “error code” 99 is not present in sector. The use of “error codes” that could easily be mistaken for data is frowned upon.
Group By
A very common data manipulation procedure known as “split-apply-combine” tackles the problem of applying the same transformation to subsets of data while keeping the results together. We need the total number of establishments in each size class aggregated within each county and industry sector.
The pandas function groupby begins the process by indicating how the data frame should be split into subsets.
cbp["FIPS"] = cbp["FIPSTATE"]+cbp["FIPSCTY"]
cbp = cbp.merge(sector, on = "NAICS")
cbp_grouped = cbp.groupby(['FIPS','Sector'])
cbp_grouped
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f6908abe668>
At this point, nothing has really changed:
cbp_grouped.dtypes
FIPSTATE ... CENCTY
FIPS Sector ...
01001 accommodation and food services object ... int64
administrative and support and waste management... object ... int64
agriculture forestry fishing and hunting object ... int64
arts entertainment and recreation object ... int64
construction object ... int64
educational services object ... int64
finance and insurance object ... int64
health care and social assistance object ... int64
information object ... int64
management of companies and enterprises object ... int64
manufacturing object ... int64
mining quarrying and oil and gas extraction object ... int64
other services except public administration object ... int64
professional scientific and technical services object ... int64
real estate and rental and leasing object ... int64
retail trade object ... int64
transportation and warehousing object ... int64
utilities object ... int64
wholesale trade object ... int64
01003 accommodation and food services object ... int64
administrative and support and waste management... object ... int64
agriculture forestry fishing and hunting object ... int64
arts entertainment and recreation object ... int64
construction object ... int64
educational services object ... int64
finance and insurance object ... int64
health care and social assistance object ... int64
information object ... int64
management of companies and enterprises object ... int64
manufacturing object ... int64
... ... ... ...
56045 arts entertainment and recreation object ... int64
construction object ... int64
educational services object ... int64
finance and insurance object ... int64
health care and social assistance object ... int64
information object ... int64
management of companies and enterprises object ... int64
manufacturing object ... int64
mining quarrying and oil and gas extraction object ... int64
other services except public administration object ... int64
professional scientific and technical services object ... int64
real estate and rental and leasing object ... int64
retail trade object ... int64
transportation and warehousing object ... int64
utilities object ... int64
wholesale trade object ... int64
56999 administrative and support and waste management... object ... int64
construction object ... int64
educational services object ... int64
finance and insurance object ... int64
health care and social assistance object ... int64
information object ... int64
management of companies and enterprises object ... int64
mining quarrying and oil and gas extraction object ... int64
other services except public administration object ... int64
professional scientific and technical services object ... int64
retail trade object ... int64
transportation and warehousing object ... int64
utilities object ... int64
wholesale trade object ... int64
[56704 rows x 26 columns]
The groupby statement generates a grouped data frame. You can add multiple variables (separated by commas) in groupby; each distinct combination of values across these columns defines a different group.
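Before aggregating, it can help to inspect the grouping itself; ngroups and size() are standard members of grouped data frames:

# How many (FIPS, Sector) groups are there, and how many rows in each?
print(cbp_grouped.ngroups)        # count of distinct groups
print(cbp_grouped.size().head())  # rows per group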
Summarize
The operation to perform on each group is summing: we need to sum the number of establishments in each group. Using pandas functions, the summaries are automatically combined into a data frame.
grouped_df = (cbp
.groupby(['FIPS', 'Sector'])
.agg('sum')
.filter(regex='^N')
.drop(columns=['NAICS'])
)
grouped_df.head(5)
N1_4 ... N1000_4
FIPS Sector ...
01001 accommodation and food services 23 ... 0
administrative and support and waste management... 18 ... 0
agriculture forestry fishing and hunting 5 ... 0
arts entertainment and recreation 5 ... 0
construction 51 ... 0
[5 rows x 13 columns]
The “combine” part of “split-apply-combine” occurs automatically when the grouping attributes introduced by groupby are dropped. You can see these attributes by inspecting dtypes on the data frame.
There is now a one-to-one relationship between cbp and acs, based on the combination of FIPS and Sector as the primary key for both tables.
print(grouped_df.shape)
(56704, 13)
print(acs.shape)
(59698, 4)
acs_cbp = grouped_df.merge(acs, on='FIPS')
print(acs_cbp.shape)
(1061416, 17)
Again, however, the one-to-one relationship does not mean all rows are preserved by the join. An inner join keeps only the rows with matching key values in both tables, duplicates rows where one key value matches several rows in the other table, and discards the rest. Because the merge above used only FIPS, each CBP record in a county was matched to every ACS record for the same county, producing far more rows than either input; a sketch joining on the full composite key follows below.
The acs_cbp table now includes the median_income variable from the ACS and appropriately aggregated establishment size information (the number of establishments by employee bins) from the CBP table.
acs_cbp.head()
FIPS N1_4 ... Sector median_income
0 01001 23 ... agriculture forestry fishing and hunting 27235.0
1 01001 23 ... mining quarrying and oil and gas extraction 102722.0
2 01001 23 ... construction 31632.0
3 01001 23 ... manufacturing 40233.0
4 01001 23 ... wholesale trade 41656.0
[5 rows x 17 columns]
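To realize the one-to-one relationship described above, join on the composite key of FIPS and Sector instead. A sketch, assuming each county-sector combination appears exactly once in both tables (grouped_df keeps FIPS and Sector in its index, so reset it first):

# Join on the full composite key; validate='one_to_one' raises an error
# if either table has duplicate (FIPS, Sector) combinations.
acs_cbp = (grouped_df
    .reset_index()
    .merge(acs, on=['FIPS', 'Sector'], validate='one_to_one')
)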
Additional Resources
The following cheat sheets and tutorials repeat much of this lesson, but also provide information on additional functions for “data wrangling”.

The first is a set of cheat sheets created by pydata.org that provides a handy visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for use with agg, “moving window” functions for use with assign, etc. For those familiar with the tidyverse universe, please consult the second link.