Manipulating Tabular Data

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.

Contents


Objectives

We will first discuss what is a tidy dataset and how to convert data to this standard form with tidyr. Next, we will explore the data processing functions in dplyr, which work particularly well with the tidy data format.

Top of Section


Tidy data concept

R developer Hadley Wickham (author of the tidyr, dplyr and ggplot packages, among others) defines tidy datasets as those where:

These guidelines may be familiar to some of you, as they closely map to best practices in database design.

Build a data.frame where the counts of three species are recorded for each day in a week:

counts_df <- data.frame(
  day = c("Monday", "Tuesday", "Wednesday"),
  wolf = c(2, 1, 3),
  hare = c(20, 25, 30),
  fox = c(4, 4, 4)
)
counts_df
        day wolf hare fox
1    Monday    2   20   4
2   Tuesday    1   25   4
3 Wednesday    3   30   4
Question
How would you structure this data in a tidy format as defined above?
Answer
Currently, counts_df has three columns (wolf, hare and fox) representing the same variable (a count). Since each reported observation is the count of indiviuuals from a given species on a given day, the tidy format should have three columns: day, species and count.

Top of Section


Reshaping multiple columns into category/value pairs

Let’s load the tidyr package and use its gather function to reshape counts_df into a tidy format:

library(tidyr)
counts_gather <- gather(counts_df,
			key = "species",
			value = "count",
			wolf:fox)
counts_gather
        day species count
1    Monday    wolf     2
2   Tuesday    wolf     1
3 Wednesday    wolf     3
4    Monday    hare    20
5   Tuesday    hare    25
6 Wednesday    hare    30
7    Monday     fox     4
8   Tuesday     fox     4
9 Wednesday     fox     4

Here, gather takes all columns between wolf and fox and reshapes them into two columns, the names of which are specified as the key and value. For each row, the key column in the new dataset indicates the column that contained that value in the original dataset.

If your analysis requires a “wide” data format rather than the tall format produced by gather, you can use the opposite operation, named spread.

counts_spread <- spread(counts_gather,
			key = species,
			value = count)
counts_spread
        day fox hare wolf
1    Monday   4   20    2
2   Tuesday   4   25    1
3 Wednesday   4   30    3
Question
Why are species and count not quoted here?
Answer
They refer to existing column names.

Exercise 1

Try removing a row from counts_gather (e.g. counts_gather <- counts_gather[-8, ]). How does that affect the outcome of spread? Let’s say the missing row means that no individual of that species was recorded on that day. How can you reflect that assumption in the outcome of spread?

Hint: View the help file for that function by entering ?gather on the console.

Top of Section


Sample data

We will use the Portal teaching database, a simplified dataset derived from a long-term study of animal populations in the Chihuahuan Desert.

surveys <- read.csv("data/surveys.csv")
str(surveys)
'data.frame':	35549 obs. of  9 variables:
 $ record_id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month          : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day            : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year           : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot_id        : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species_id     : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
 $ sex            : Factor w/ 3 levels "","F","M": 3 3 2 3 3 3 2 3 2 2 ...
 $ hindfoot_length: int  32 33 37 36 35 14 NA 37 34 20 ...
 $ weight         : int  NA NA NA NA NA NA NA NA NA NA ...

Modify the function to specify what string in the CSV file represents NAs, a.k.a. data that is not-available or missing.

surveys <- read.csv("data/surveys.csv", na.strings = "")
Question
What has changed?
Answer
The str shows that the factors have one less level, and the empty string is not included.

Top of Section


Key functions in dplyr

Function Returns
filter(data, conditions) rows from data where conditions hold
select(data, variables) a subset of the columns in data, as specified in variables
arrange(data, variables) data sorted by variables
group_by(data, variables) a copy of data, with groups defined by variables
summarize(data, newvar = function) a data frame with newvar columns that summarize data (or each group in data) based on an aggregation function
mutate(data, newvar = function) a data frame with newvar columns defined by a function of existing columns

Subsetting and sorting

After loading dplyr, we begin our analysis by extracting the survey observations for the first three months of 1990 with filter:

library(dplyr)
surveys_1990_winter <- filter(surveys,
			      year == 1990,
			      month %in% 1:3)
str(surveys_1990_winter)
'data.frame':	491 obs. of  9 variables:
 $ record_id      : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ year           : int  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...

To choose particular columns (rather than the rows) of a data frame, we would call select with the name of the variables to retain.

select(surveys_1990_winter,
       record_id, month, day, plot_id,
       species_id, sex, hindfoot_length, weight)

Alternatively, we can exclude a column by preceding its name with a minus sign. We use this option here to remove the redundant year column from surveys_1990_winter:

surveys_1990_winter <- select(surveys_1990_winter, -year)
str(surveys_1990_winter)
'data.frame':	491 obs. of  8 variables:
 $ record_id      : int  16879 16880 16881 16882 16883 16884 16885 16886 16887 16888 ...
 $ month          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ day            : int  6 6 6 6 6 6 6 6 6 6 ...
 $ plot_id        : int  1 1 6 23 12 24 12 24 12 17 ...
 $ species_id     : Factor w/ 48 levels "AB","AH","AS",..: 12 17 23 33 33 33 38 39 38 13 ...
 $ sex            : Factor w/ 2 levels "F","M": 1 2 2 1 2 2 2 1 2 2 ...
 $ hindfoot_length: int  37 21 16 17 17 17 25 30 28 36 ...
 $ weight         : int  35 28 7 9 10 9 35 73 44 55 ...

To complete this section, we sort the 1990 winter surveys data by descending order of species name, then by ascending order of weight. Note that arrange assumes ascending order unless the variable name is enclosed by desc().

sorted <- arrange(surveys_1990_winter,
                  desc(species_id), weight)
head(sorted)
  record_id month day plot_id species_id sex hindfoot_length weight
1     16929     1   7       3         SH   M              31     61
2     17172     2  25       3         SH   F              29     67
3     17327     3  30       2         SH   M              30     69
4     16886     1   6      24         SH   F              30     73
5     17359     3  30       3         SH   F              31     77
6     17170     2  25       3         SH   M              30     80

Exercise 2

Write code that returns the record_id, sex and weight of all surveyed individuals of Reithrodontomys montanus (RO).

Grouping and aggregation

Another common type of operation on tabular data involves the aggregation of records according to specific grouping variables. In particular, let’s say we want to count the number of individuals by species observed in the winter of 1990.

surveys_1990_winter_gb <- group_by(surveys_1990_winter, species_id)
counts_1990_winter <- summarize(surveys_1990_winter_gb, count = n())
head(counts_1990_winter)
# A tibble: 6 × 2
  species_id count
      <fctr> <int>
1         AB    25
2         AH     4
3         BA     3
4         DM   132
5         DO    65
6         DS     6

A few notes on these functions:

Exercise 3

Write code that returns the average weight and hindfoot length of Dipodomys merriami (DM) individuals observed in each month (irrespective of the year). Make sure to exclude NA values.

Pivot tables through aggregate and spread

A pivot table takes tidy data into an untidy format, summarizing data by two factors (one as a row, the other a column). Its equivalent to a particular way of grouping and aggregation with dplyr combined with the spread() function.

surveys_1990_winter_gb <- group_by(surveys_1990_winter, species_id, month)
counts_by_month <- summarize(surveys_1990_winter_gb, count = n())
pivot <- spread(counts_by_month, value = count, key = month, fill = 0)
head(pivot)
Source: local data frame [6 x 4]
Groups: species_id [6]

  species_id   `1`   `2`   `3`
      <fctr> <dbl> <dbl> <dbl>
1         AB    24     0     1
2         AH     3     1     0
3         BA     1     2     0
4         DM    60    35    37
5         DO    31    17    17
6         DS     3     1     2

Transformation of variables

The mutate function creates new columns by performing the same operation on each row. Here, we use the previously obtained count variable to derive the proportion of individuals represented by each species, and assign the result to a new prop column.

prop_1990_winter <- mutate(counts_1990_winter,
                           prop = count / sum(count))
head(prop_1990_winter)
# A tibble: 6 × 3
  species_id count       prop
      <fctr> <int>      <dbl>
1         AB    25 0.05091650
2         AH     4 0.00814664
3         BA     3 0.00610998
4         DM   132 0.26883910
5         DO    65 0.13238289
6         DS     6 0.01221996

A few notes about transformations:

Exercise 4

We often use group_by along with summarize, but you can also apply filter and mutate operations on groups.

Top of Section


Chaining operations with pipes (%>%)

What a pipe, or %>%, does is to take the expression on its left-hand side and pass it as the first argument to the function on its right-hand side. Here is a simple example:

c(1, 3, 5) %>% sum()
[1] 9

It’s identical to sum(c(1,3,5)).

Additional arguments are accepted, a pipe only handles the first.

c(1, 3, 5, NA) %>% sum(na.rm = TRUE)
[1] 9

The pipe operator’s main utility is to condense a chain of operations applied to the same piece of data, when you don’t need to save the intermediate results. We can do all the dplyr operations from above with a chain of pipes:

prop_1990_winter_piped <- surveys %>%
    filter(year == 1990, month %in% 1:3) %>% 
    select(-year) %>%
    group_by(species_id) %>%
    summarize(count = n()) %>%
    mutate(prop = count / sum(count))
identical(prop_1990_winter_piped, prop_1990_winter)
[1] TRUE

Top of Section


Additional information

Data wrangling with dplyr and tidyr RStudio cheat sheet.

Top of Section


Exercise solutions

Solution 1

If any species/day combination is missing, the corresponding cell after spread is filled with NA. To interpret missing values as zero counts, use the optional fill argument:

sol1 <- spread(counts_gather, key = species, value = count, fill = 0)
str(sol1)
'data.frame':	3 obs. of  4 variables:
 $ day : Factor w/ 3 levels "Monday","Tuesday",..: 1 2 3
 $ fox : num  4 4 4
 $ hare: num  20 25 30
 $ wolf: num  2 1 3

Solution 2

surveys_RO <- filter(surveys, species_id == "RO")
surveys_R0 <- select(surveys_RO, record_id, sex, weight)

Solution 3

surveys_dm <- filter(surveys, species_id == "DM")
surveys_dm <- group_by(surveys_dm, month)
summarize(surveys_dm, avg_wgt = mean(weight, na.rm = TRUE),
          avg_hfl = mean(hindfoot_length, na.rm = TRUE))
# A tibble: 12 × 3
   month  avg_wgt  avg_hfl
   <int>    <dbl>    <dbl>
1      1 42.93697 36.09476
2      2 43.95270 36.18777
3      3 45.19864 36.11765
4      4 44.75411 36.18793
5      5 43.16449 35.82848
6      6 41.52889 35.97699
7      7 41.93692 35.71283
8      8 41.84119 35.79850
9      9 43.32794 35.83817
10    10 42.50980 35.95254
11    11 42.35932 35.94831
12    12 42.98561 36.04545

Solution 4

Part 1

filter(surveys_1990_winter_gb, weight == min(weight))
Source: local data frame [52 x 8]
Groups: species_id, month [36]

   record_id month   day plot_id species_id    sex hindfoot_length weight
       <int> <int> <int>   <int>     <fctr> <fctr>           <int>  <int>
1      16885     1     6      12         SF      M              25     35
2      16894     1     6       1         OT      F              21     20
3      16910     1     6      20         RM      M              16      7
4      16915     1     6      19         PF      F              16      6
5      16916     1     6      17         NL      F              32    165
6      16929     1     7       3         SH      M              31     61
7      16937     1     7      15         PE      F              20     19
8      16951     1     7       5         BA      F              13      9
9      16959     1     7      13         PP      M              21     14
10     17003     1    29      24         RF      F              19     11
# ... with 42 more rows

Solution 4

Part 2

mutate(surveys_1990_winter_gb,
       ranked_hf_length = row_number(hindfoot_length))
Source: local data frame [491 x 9]
Groups: species_id, month [49]

   record_id month   day plot_id species_id    sex hindfoot_length weight
       <int> <int> <int>   <int>     <fctr> <fctr>           <int>  <int>
1      16879     1     6       1         DM      F              37     35
2      16880     1     6       1         OL      M              21     28
3      16881     1     6       6         PF      M              16      7
4      16882     1     6      23         RM      F              17      9
5      16883     1     6      12         RM      M              17     10
6      16884     1     6      24         RM      M              17      9
7      16885     1     6      12         SF      M              25     35
8      16886     1     6      24         SH      F              30     73
9      16887     1     6      12         SF      M              28     44
10     16888     1     6      17         DO      M              36     55
# ... with 481 more rows, and 1 more variables: ranked_hf_length <int>

Top of Section