March 08, 2021 by Quentin Read
This is a little story about how I learned to stop worrying and love data.table, a great (and in my opinion underrated) package for doing data science in R.
I have used R since 2011, and initially learned to work with data using mostly base R code. In about 2013 I started using the plyr package, which later morphed into dplyr and, along with some other packages, was dubbed the “tidyverse.” I have long admired the tidyverse’s design philosophy, and it has really helped me speed up writing data analysis and processing scripts interactively. But the tidyverse has some downsides. It’s not very stable: code from a few years ago often breaks if you keep your packages and R version up to date. More importantly, it doesn’t prioritize code being fast or lean in terms of memory use. Enter data.table.
data.table and tidyverse are both really well designed but have different goals. The main goal of data.table, a package developed by Matt Dowle and Arun Srinivasan, is to quickly and efficiently manipulate huge data frames with millions of rows. Functions in data.table are optimized to use as little memory as possible by avoiding unnecessary temporary copies of data frames. This results in code that is more efficient in both time and memory use. Another advantage is that data.table is much more stable than tidyverse, even though it has actually been around for longer (since 2008) and is constantly being updated: old data.table code still works. Another nice feature is that it has no external package dependencies; contrast that with the large number of back-end packages now required to run the tidyverse.
For my food waste research I had a “big data” problem. My data includes pairwise flows of 10 different goods between all pairs of the ~3100 counties in the United States, replicated over 20 different scenarios. That’s 20 * 3100^2 * 10, or roughly 1.9 billion flows, which ends up being a huge data frame. Just doing basic operations in tidyverse such as mutate() was taking a day, even when I split the code across a lot of cores in parallel on the Slurm cluster. Because I was still trying to work out some kinks in the analysis, it was annoying to have to run code for a day every time I wanted to check whether the full analysis worked. That’s why I thought it would be nice to learn data.table. It had been on the back burner for a long time, but this time I actually decided to sit down and learn it.
If you’ve used tidyverse you have certainly seen the %>% (pipe) operator. The pipe allows you to chain many data-manipulation commands into one statement without having(to(nest(functions(like(this))))). data.table also has a signature operator, :=. This is a special assignment operator that lets you create new data frame columns in place, without having to reassign the entire data.table to a new object. This saves time and memory.
For example, in tidyverse you could write
mydata <- mydata %>% mutate(new_column = column1^2 + column2)
The equivalent in data.table would be
mydata[, new_column := column1^2 + column2]
You don’t need to assign anything: mydata is modified in place.
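To make "modified in place" concrete, here is a minimal sketch on made-up toy data (the values and the names alias and snapshot are hypothetical, not from my analysis). It relies on the fact that plain assignment of a data.table does not copy it, while copy() does.

```r
library(data.table)

# Hypothetical toy data; column names follow the example above
mydata <- data.table(column1 = 1:3, column2 = c(10, 20, 30))

# `<-` does not copy a data.table: both names point to the same object
alias <- mydata

# copy() makes an independent copy that later := calls will not touch
snapshot <- copy(mydata)

# Add a column by reference; no reassignment needed
mydata[, new_column := column1^2 + column2]

names(mydata)    # includes new_column
names(alias)     # also includes new_column, since it is the same object
names(snapshot)  # still only column1 and column2
```

The flip side of this speed is that := changes every reference to the table, so reach for copy() whenever you want to keep the original untouched.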
You can also chain commands in data.table using brackets, so you can write statements in a similar way to tidyverse if that’s your style.
For example, this in tidyverse:
mydata %>% mutate(log_population = log(population)) %>% filter(year > 1990) %>% group_by(country, city) %>% summarize(log_pop_density = log_population/area)
is this in data.table:
mydata[, log_population := log(population)][year > 1990][, .(log_pop_density = log_population/area), by = .(country, city)]
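Here is a small runnable sketch of that bracket chain, using hypothetical toy values for the columns named above. Note the leading comma in the last bracket: it leaves the row filter empty so that the expression is evaluated per group.

```r
library(data.table)

# Hypothetical toy data with the columns used in the example above
mydata <- data.table(
  country    = c("A", "A", "B", "B"),
  city       = c("x", "x", "y", "y"),
  year       = c(1985, 1995, 1995, 2000),
  population = c(1000, 2000, 3000, 4000),
  area       = c(10, 10, 20, 20)
)

# Each bracket works on the result of the previous one:
# 1) add a column by reference, 2) filter rows, 3) compute by group
result <- mydata[, log_population := log(population)][
  year > 1990][
  , .(log_pop_density = log_population / area), by = .(country, city)]

print(result)
```

One difference from the tidyverse version is worth knowing: because the first step uses :=, mydata itself keeps the new log_population column after the chain has run.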
I went through the vignettes on data.table’s homepage, which are a very nice tutorial that I would recommend to beginners who are otherwise familiar with R.
In the end it was helpful that I first learned data manipulation in base R, before tidyverse had taken over the world. Because data.table’s syntax is a little closer to base R syntax than tidyverse’s, some things were more familiar to me than if I had only learned tidyverse ways of doing things. That’s a little like a native English speaker learning German, where you occasionally recognize a cognate or bit of grammar that resembles “old-fashioned” English.
In the end, I rewrote my food waste data processing code using mostly data.table syntax with some of the map() family of functions from the purrr package (the tidyverse’s excellent list manipulation package) mixed in. I also found a few custom functions through the art of Googling that allowed me to mimic some tidyverse behavior, in particular grouping a data.table and making a list column in each group. data.table supports list-columns in data frames; in fact, it has had that useful feature since the beginning, and it was only later adopted by tidyverse packages. Maybe my solution isn’t the optimal data.table solution, but it helped me translate my code from tidyverse to data.table without having to completely rethink how it’s done.
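As an illustration of the list-column idea, here is a minimal sketch and not the custom functions I actually used. It groups a made-up table (flows, scenario and flow are hypothetical names) and stores each group's rows as a nested table via .SD, data.table's name for the current group's subset. It assumes a reasonably recent data.table version, and uses base vapply() where purrr's map functions would also work.

```r
library(data.table)

# Hypothetical toy data: a few flows in two scenarios
flows <- data.table(
  scenario = c("s1", "s1", "s2", "s2", "s2"),
  flow     = c(1.5, 2.5, 3, 4, 5)
)

# One row per scenario; `data` is a list column holding each group's rows
nested <- flows[, .(data = list(.SD)), by = scenario]

# Work on each nested table, one result per group
nested[, n_rows := vapply(data, nrow, integer(1))]
nested[, total_flow := vapply(data, function(d) sum(d$flow), numeric(1))]

print(nested[, .(scenario, n_rows, total_flow)])
```

This mimics the nest-then-map pattern from tidyverse code, which is what made translating my scripts manageable.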
There are other options that allow hybrid solutions where code written in tidyverse language can run with data.table’s performance, or even packages like dbplyr that let you use SQL databases while writing your code in tidyverse language (data.table’s syntax is also inspired by SQL). This would be nice if you really like the %>% better than data.table’s brackets. But data.table also has some appealing syntax, and I am enjoying writing code in it.
I am happy with the performance benefit I got from learning data.table (roughly a 3x speedup). It was also a fun learning experience, and it’s always nice to learn a new language, or new syntax within a language. Just like learning a new foreign language, it helps your mind get out of set ways of thinking and frees you to come up with more creative solutions even in the language you already know. Of course it also fits with my quest to make my code more efficient and thereby get the same outcome while consuming less energy!