Manipulating Tabular Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.

Contents


Objectives for this lesson

Specific achievements

Data frames generally occupy a central place in R analysis workflows. While the base R functions provide most necessary tools to subset, reformat and transform data frames, the specialized packages in this lesson offer a more succinct and often computationally faster way to perform the common data frame processing steps. Their straightforward syntax also makes scripts more readable and easier to debug. The key functions from the tidyr and dplyr packages all have close counterparts in SQL (Structured Query Language), which provides the added bonus of facilitating translation between R and relational databases.

Top of Section


Tidy data concept

R developer Hadley Wickham (author of the tidyr, dplyr and ggplot packages, among others) defines tidy datasets (Wickham 2014) as those where:

These guidelines may be familiar to some of you—they closely map to best practices in database design.

Conser a data frame where the outcome of an experiment has been recorded in a perfectly appropriate way:

trial drug_A drug_B placebo
1 0.22 0.58 0.31
2 0.12 0.98 0.47
3 0.42 0.19 0.4
response
  trial drug_A drug_B placebo
1     1   0.22   0.58    0.31
2     2   0.12   0.98    0.47
3     3   0.42   0.19    0.40

The response data are present in a compact matrix, as you might record it on a spreadsheet. The form does not match how we think about a statistical model, such as:

response ~ treatment + trial

In a tidy format, each row is a complete observation: it includes one response value and all the predictor values. In this data, some of those values are column headers, so we’ve got to tidy up!

Top of Section


Gather

The tidyr package’s gather function reshapes “wide” data frames into “long” ones.

library(tidyr)
tidy_response <- gather(response, key = "treatment",
  value = "response", -trial)

All columns, accept for “trial”, are stacked in two columns: a “key” and a “value”. the key column gets the name treatment and the value column reveives the name response. For each row in the result, the key is taken from the name of the column and the value from the data in the column.

tidy_response
  trial treatment response
1     1    drug_A     0.22
2     2    drug_A     0.12
3     3    drug_A     0.42
4     1    drug_B     0.58
5     2    drug_B     0.98
6     3    drug_B     0.19
7     1   placebo     0.31
8     2   placebo     0.47
9     3   placebo     0.40

Some notes on the syntax: a big advantage of tidyr and dplyr is that each function takes a data frame as its first argument and returns a new data frame. As we will see later, it makes it very easy to apply these functions in a chain. All functions also use column names as variables without subsetting them from a data frame (i.e. trial instead of response$trial).

Some analyses require a “wide” data format rather than the long format produced by gather. The community ecology package vegan uses a matrix of species counts, where rows correspond to species and columns to sites.

Suppose data are provided in a “entity-attribute-value” format.

site species n
1 lynx 2
1 hare 341
2 lynx 7
2 hare 42
3 hare 289

Transform the data with the spread function, which “reverses” a gather.

wide_counts <- spread(counts,
  key = species,
  value = n)
wide_counts
  site  hare      lynx    
1    1       341         2
2    2        42         7
3    3       289        NA
Question
Why were species and n not quoted in the call to spread?
Answer
They refer to existing column names. In gather, quotes are used to create new column names.

Think about what “missing data” means in this table. Perhaps you can safely do:

wide_counts <- spread(counts,
  key = species,
  value = n,
  fill = 0)
wide_counts
  site  hare      lynx    
1    1       341         2
2    2        42         7
3    3       289         0

Exercise 1

Now that we have a wide form of counts, convert it to a tidy_counts data frame using gather. The only difference between counts and tidy_counts should be the additional row for zero lynx at site 2. Remember, a tidy dataset has a row for every observation, even if the value is “implied”.

View solution

Top of Section


Sample data

To learn about data transformation with dplyr, we need more data. The Portal teaching database is a simplified dataset derived from a long-term study of animal populations in the Chihuahuan Desert.


Credit: The Portal Project

The teaching dataset includes three tables: two contain summary information on the study plots and observed species, respectively, while the third and largest one (animals) lists all individual observations. We only need the animals table for this lesson.

animals <- read.csv('data/animals.csv')
str(animals)
'data.frame':	35549 obs. of  9 variables:
 $ id             : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month          : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day            : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year           : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot_id        : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species_id     : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
 $ sex            : Factor w/ 3 levels "","F","M": 3 3 2 3 3 3 2 3 2 2 ...
 $ hindfoot_length: int  32 33 37 36 35 14 NA 37 34 20 ...
 $ weight         : int  NA NA NA NA NA NA NA NA NA NA ...

Modify the function to specify what string in the CSV file represents NAs, a.k.a. data that is not-available or missing.

animals <- read.csv(
  'data/animals.csv',
  na.strings = '')
Question
What changed?
Answer
Using str() shows that the factors have one less level, and the empty string is no longer included.

Top of Section


Key dplyr functions

Function Returns
filter keep rows that staisfy conditions
select keep columns with matching names
group_by split data into groups by an existing factor
mutate apply a transformation to existing [split] columns
summarize summarize across rows [and combine split groups]

The table above presents the most commonly used functions in dplyr, which we will demonstrate in turn, starting from the animals data frame.

Subsetting

The animals table includes numeric year and month columns. Of the 35,549 observations, lets see how many observations are left when we keep only observations from the first three months of 1990.

library(dplyr)
animals_1990_winter <- filter(
  animals,
  year == 1990,
  month %in% 1:3)
str(animals_1990_winter)
'data.frame':	491 obs. of  9 variables:
 $ id             : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ year           : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...

Note that a logical “and” is implied when conditions are separated by commas. (This is perhaps the main way in which filter differs from the base R subset function.) Therefore, the example above is equivalent to filter(animals, year == 1990 & month %in% 1:3). A logical “or” must be specified explicitly with the | operator.

To keep particular columns of a data frame (rather than choosing rows) , use the select with arguments that match the column names.

colnames(animals)
[1] "id"              "month"           "day"             "year"           
[5] "plot_id"         "species_id"      "sex"             "hindfoot_length"
[9] "weight"         

One way to “match” is by including complete names, each one you want to keep:

select(animals_1990_winter,
  id, month, day, plot_id,
  species_id, sex, hindfoot_length, weight)

Alternatively, we can use a negative “match”: keep columns that do not match the name preceded by minus sing.

animals_1990_winter <- select(
  animals_1990_winter,
  -year)

Use this option to remove a single column from a data frame.

str(animals_1990_winter)
'data.frame':	491 obs. of  8 variables:
 $ id             : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...


Credit: The Portal Project

Exercise 2

Write code that returns the id, sex and weight of all surveyed individuals of Reithrodontomys montanus (RO).

View solution

Top of Section


Chaining functions

All those functions from the dplyr package take a data frame as their first argument, and they return a data frame. This consistent syntax is on purpose. It is designed for easily chaining data transformations together: processing data frames in easy-to-read steps.

The “pipe” operator (%>%) from the magrittr package is loaded by dplyr. The pipe takes the expression on its left-hand side and hands it over as the first argument to the function on its right-hand side.

Equivalent to sum(c(1,3,5)), for example, we have:

c(1, 3, 5) %>% sum()
[1] 9

Additional arguments are accepted, a pipe only handles the first.

c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9

The pipe operator’s main utility is to condense a chain of operations applied to the same piece of data, when you don’t need to save the intermediate results. We can do both the filter and select operations from above with one assignment.

animals_1990_winter <- animals %>%
    filter(year == 1990, month %in% 1:3) %>%
    select(-year)
str(animals_1990_winter)
'data.frame':	491 obs. of  8 variables:
 $ id             : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...

Top of Section


Split-apply-combine

A very common data manipulation procedure is doing some “group-wise” operations on a dataset and combing the results for each group into a single table. For example, say you need to count the number of individuals of each species observed in the winter of 1990.

Grouping

The dplyr function group_by begins the process by indicating how the data frame should be split into subsets.

counts_1990_winter <- animals_1990_winter %>%
    group_by(species_id)

At this point, nothing has really changed:

str(counts_1990_winter)
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':	491 obs. of  8 variables:
 $ id             : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...
 - attr(*, "vars")= chr "species_id"
 - attr(*, "drop")= logi TRUE
 - attr(*, "indices")=List of 20
  ..$ : int  13 46 88 102 103 108 113 114 121 127 ...
  ..$ : int  22 141 147 325
  ..$ : int  72 295 311
  ..$ : int  0 33 35 51 56 60 63 64 66 67 ...
  ..$ : int  9 12 20 27 41 42 44 49 53 54 ...
  ..$ : int  79 99 228 352 409 474
  ..$ : int  37 116 254 258 267 285 351 396 402 438
  ..$ : int  1 24 157 164 273 278 420
  ..$ : int  11 14 15 48 62 82 137 152 185 194 ...
  ..$ : int  243 283 296 302 320 332 340
  ..$ : int  32 47 58 76 109 110 112 118 142 165 ...
  ..$ : int  2 36 45 105 120 123 144 174 237 246 ...
  ..$ : int  136 140 146 148
  ..$ : int  145 276 393
  ..$ : int 80
  ..$ : int  52 59 124 135 225 233 312 373 425 433
  ..$ : int  3 4 5 17 19 21 23 25 26 28 ...
  ..$ : int  6 8 10 16 18 111 119 138 248 252 ...
  ..$ : int  7 50 291 293 421 448 480
  ..$ : int 104
 - attr(*, "group_sizes")= int  25 4 3 132 65 6 10 7 22 7 ...
 - attr(*, "biggest_group_size")= int 132
 - attr(*, "labels")='data.frame':	20 obs. of  1 variable:
  ..$ species_id: Factor w/ 48 levels "AB","AH","AS",..: 1 2 4 12 13 14 16 17 18 21 ...
  ..- attr(*, "vars")= chr "species_id"
  ..- attr(*, "drop")= logi TRUE

The group_by statement does not change any values in the data frame; it only adds attributes to the the original data frame. You can add multiple variables (separated by commas) in group_by; each distinct combination of values across these columns defines a different group.

Summarize

The operation to perform on each species is counting: we need to count how many records are in each group.

counts_1990_winter <- animals_1990_winter %>%
    group_by(species_id) %>%
    summarize(count = n())
str(counts_1990_winter)
Classes 'tbl_df', 'tbl' and 'data.frame':	20 obs. of  2 variables:
 $ species_id: Factor w/ 48 levels "AB","AH","AS",..: 1 2 4 12 13 14 16 17 18 21 ...
 $ count     : int  25 4 3 132 65 6 10 7 22 7 ...

The “combine” part of “split-apply-combine” occurs automatically, when the attributes introduced by group_by are dropped. You can see attributes either by running the str() function on the data frame or by inspecting it in the RStudio Environment pane.

The function n() takes no arguments and returns the number of records in a group. Any function that collapses a vector input to a single output is a suitable function to use within summarize.

weight_1990_winter <- animals_1990_winter %>%
    group_by(species_id) %>%
    summarize(avg_weight = mean(weight, na.rm = TRUE))
head(weight_1990_winter)
# A tibble: 6 x 2
  species_id avg_weight
      <fctr>      <dbl>
1         AB        NaN
2         AH        NaN
3         BA   7.666667
4         DM  43.372093
5         DO  48.222222
6         DS 130.000000

Exercise 3

Write code that returns the average hindfoot length of Dipodomys merriami (DM) individuals observed in each month (irrespective of the year). Make sure to exclude NA values.

View solution

Transformation of variables

The mutate function creates new columns from existing ones. The data frame returned has an additional column for each argument to mutate, unless a name is reused. Overwriting an existing column does not generate a warning.

The count variable just defined, for example, can be used to calculate the proportion of individuals represented by each species.

prop_1990_winter <- counts_1990_winter %>%
    mutate(prop = count / sum(count))
head(prop_1990_winter)
# A tibble: 6 x 3
  species_id count       prop
      <fctr> <int>      <dbl>
1         AB    25 0.05091650
2         AH     4 0.00814664
3         BA     3 0.00610998
4         DM   132 0.26883910
5         DO    65 0.13238289
6         DS     6 0.01221996

For a concise way to apply the same transformation to multiple columns, check the mutate_each function. There is also a summarize_each function to perform the same aggregation operation on multiple columns.

Both mutate and summarize can be used in the “apply” part of a “split-apply-combine” procedure. The difference is that the results are combine into data frames with differing numbers of rows.

Question
How many rows do you expect in the result of a mutate operation?
Answer
The same number you started with.

Exercise 4

A “pivot table” is a transformation of tidy data into a wide summary table. First, data are summarized by two grouping factors, then one of these is “pivoted” into columns. Starting from the animals data frame, chain a group_by and summarize transformation into a tidyr spread function to get the number of individuals counted in each month (as three columns) by species (as rows).

Top of Section


Additional information

Data wrangling with dplyr and tidyr RStudio cheat sheet.

One of several cheat sheets available on the RStudio website, it provides a brief, visual summary of all the key functions discussed in this lesson. It also lists some of the auxiliary functions that can be used within each type of expression, e.g. aggregation functions for summarize, “moving window” functions for mutate, etc.

Top of Section


Exercise solutions

Solution 1

gather(wide_counts, key = "species", value = "n", -site)
  site   species   n
1    1  hare     341
2    2  hare      42
3    3  hare     289
4    1  lynx       2
5    2  lynx       7
6    3  lynx       0

Return

Solution 2

animals_RO <- filter(animals, species_id == "RO")
select(animals_RO, id, sex, weight)
     id sex weight
1 18871   F     11
2 33397   M      8
3 33556   M      9
4 33565   F      8
5 34517   M     11
6 35402   F     12
7 35420   M     10
8 35487   F     13

Return

Solution 3

filter(animals, species_id == "DM") %>%
  group_by(month) %>%
  summarize(
    avg_wgt = mean(weight, na.rm = TRUE),
    avg_hfl = mean(hindfoot_length, na.rm = TRUE))
# A tibble: 12 x 3
   month  avg_wgt  avg_hfl
   <int>    <dbl>    <dbl>
 1     1 42.93697 36.09476
 2     2 43.95270 36.18777
 3     3 45.19864 36.11765
 4     4 44.75411 36.18793
 5     5 43.16449 35.82848
 6     6 41.52889 35.97699
 7     7 41.93692 35.71283
 8     8 41.84119 35.79850
 9     9 43.32794 35.83817
10    10 42.50980 35.95254
11    11 42.35932 35.94831
12    12 42.98561 36.04545

Return

Solution 4

group_by(animals, species_id, month) %>%
  summarize(count = n()) %>%
  spread(key = month, value = count, fill = 0)
# A tibble: 49 x 13
# Groups:   species_id [49]
   species_id   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`   `9`  `10`
 *     <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1         AB    75    52    38    18    12     5    12    11    12     9
 2         AH    27    38    24    29    58    54    33    29    44    38
 3         AS     1     0     0     0     0     0     0     1     0     0
 4         BA     5     4     3     4     3     3     2     2     1     4
 5         CB     1     1     0     1     2     5     8     6    16     5
 6         CM     0     0     0     0     0     0     0     3     9     1
 7         CQ     3     0     0     0     0     1     5     6     0     1
 8         CS     0     0     0     1     0     0     0     0     0     0
 9         CT     0     0     0     0     0     0     0     0     0     0
10         CU     0     0     0     0     0     0     0     0     0     0
# ... with 39 more rows, and 2 more variables: `11` <dbl>, `12` <dbl>

Return

Top of Section