Visualizations the ggplot Way
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
- Objectives for this lesson
- Getting started
- Smooth lines
- Storing and re-plotting
- Axes, labels and themes
- Additional information
- Exercise solutions
Objectives for this lesson
- Meet the “grammar of graphics” for ggplot2
- Trust that this is better than base R’s plotting functions
- Learn to layer visual elements on top of tidy data
- Glimpse the vast collection of ggplot2 options
- Create several “aesthetics” or mappings for different plots
- Build boxplots, scatterplots, smoothed lines and histograms
- Style plots with colors
- Repeat plots for different subsets of data
Getting started
Let’s start by loading a few packages along with a sample dataset, which is the animals table from the Portal Project Teaching Database.
library(dplyr)
animals <- read.csv('data/animals.csv', na.strings = '') %>%
  select(year, species_id, sex, weight) %>%
  na.omit()
Omitting rows that have missing values in the selected columns (such as weight) is not strictly necessary, but it prevents ggplot from warning about missing values.
As a first example, this code plots each individual’s weight by species:
library(ggplot2)
ggplot(animals, aes(x = species_id, y = weight)) +
  geom_point()
The ggplot command expects a data frame and an aesthetic mapping. The aes function creates the aesthetic, a mapping between variables in the data frame and visual elements in the plot. Here, the aesthetic maps species_id to the x-axis and weight to the y-axis.
The ggplot function by itself does not display anything until we add a geom_* layer, in this example a geom_point. Layers are literally added, with +, to the object created by the ggplot function.
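The layering described above can be checked without drawing anything. The following is a minimal sketch, assuming a small made-up data frame (toy) in place of the animals table; it counts the layers stored in the plot object before and after a geom is added.

```r
library(ggplot2)

# Toy data standing in for the animals table (values are made up).
toy <- data.frame(
  species_id = c("DM", "DM", "PP", "PP"),
  weight = c(40, 44, 15, 17)
)

# The base object holds the data and the aesthetic mapping, but no layers yet.
base_plot <- ggplot(toy, aes(x = species_id, y = weight))
n_layers_before <- length(base_plot$layers)

# Adding a geom with + returns a new object that carries one layer.
point_plot <- base_plot + geom_point()
n_layers_after <- length(point_plot$layers)
```

Printing point_plot at the console draws the scatterplot; base_plot itself is left unchanged by the addition.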
Individual points are hard to distinguish in this plot. Might a boxplot be a better visualization? The only change needed is in the geom_* layer:
ggplot(animals, aes(x = species_id, y = weight)) +
  geom_boxplot()
Add multiple geom_* layers together to create a multi-layered plot:
ggplot(animals, aes(x = species_id, y = weight)) +
  geom_boxplot() +
  geom_point()
Each geom_* object accepts some general and some specialized arguments, either for styling with shapes and colors or for performing some dplyr-like data transformations.
ggplot(animals, aes(x = species_id, y = weight)) +
  geom_boxplot() +
  geom_point(
    color = 'red',
    stat = 'summary',
    fun.y = 'mean')
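The geom_point layer definition illustrates these features:
- With stat = 'summary', the plot replaces the raw data with the result of a summary function, defined by fun.y (renamed to fun in ggplot2 3.3.0 and later, where fun.y still works but is deprecated).
- color = 'red' applies one color to the whole layer.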
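To see what stat = 'summary' does to the data a layer draws, the sketch below inspects the layer with layer_data(). It assumes a made-up data frame (toy) and uses the newer fun argument, so it needs ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Toy data standing in for the animals table (values are made up).
toy <- data.frame(
  species_id = c("DM", "DM", "DM", "PP", "PP"),
  weight = c(40, 44, 48, 15, 17)
)

# fun = "mean" is the ggplot2 >= 3.3.0 spelling of fun.y = "mean".
mean_plot <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(color = "red", stat = "summary", fun = "mean")

# layer_data() returns the data frame that the layer actually draws.
drawn <- layer_data(mean_plot, 1)

# The same means computed directly, for comparison.
group_means <- tapply(toy$weight, toy$species_id, mean)
```

The drawn data has one row per species rather than one row per animal, and its y values match group_means.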
Associating color (or any attribute, like the shape of points) to a variable is another kind of aesthetic mapping. Passing the color argument to the aes function works quite differently than assigning color to a geom_* layer.
ggplot(animals, aes(x = species_id, y = weight, color = species_id)) +
  geom_boxplot() +
  geom_point(stat = 'summary', fun.y = 'mean')
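The difference between a fixed color and a mapped color shows up in the data each layer draws. This is a small sketch under the assumption of a made-up data frame (toy) with three species; it compares the colors that end up on the points in each case.

```r
library(ggplot2)

# Toy data standing in for the animals table (values are made up).
toy <- data.frame(
  species_id = c("DM", "DM", "PP", "PP", "DO", "DO"),
  weight = c(40, 44, 15, 17, 48, 52)
)

# Color given as a layer argument: one constant for every point.
fixed <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(color = "red")

# Color mapped inside aes(): a scale assigns one color per species.
mapped <- ggplot(toy, aes(x = species_id, y = weight, color = species_id)) +
  geom_point()

fixed_colors <- unique(layer_data(fixed, 1)$colour)
mapped_colors <- unique(layer_data(mapped, 1)$colour)
```

Here fixed_colors holds a single value, while mapped_colors holds one color per species, chosen by the default color scale. Only the mapped version produces a legend.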
Use dplyr to filter down to the animals with
species_id equal to DM. Use
ggplot to show how the mean weight of this species changes each year, showing males and females in different colors. (Hint: Baby steps! Start with a scatterplot of weight by year. Then expand your code to plot only the means. Then try to distinguish sexes.)
Smooth lines
The geom_smooth layer adds a regression line with confidence intervals (95% CI by default). The method = 'lm' parameter specifies that a linear model is used for smoothing.
First, prepare the data with dplyr, treating sex as a categorical predictor as you would for a linear model.
# read.csv() returns character columns in R >= 4.0, so convert sex to a factor before relabeling its levels
animals$sex <- factor(animals$sex)
levels(animals$sex) <- c('Female', 'Male')
animals_dm <- filter(animals, species_id == 'DM')
ggplot(animals_dm, aes(x = year, y = weight, shape = sex)) +
  geom_point(size = 3, stat = 'summary', fun.y = 'mean') +
  geom_smooth(method = 'lm')
Even better would be to distinguish everything (points and lines) by color:
ggplot(animals_dm, aes(x = year, y = weight, shape = sex, color = sex)) +
  geom_point(size = 3, stat = 'summary', fun.y = 'mean') +
  geom_smooth(method = 'lm')
Notice that aesthetic mappings added to the base aesthetic (in the ggplot command) are applied to every layer that recognizes them.
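One way to confirm that a base mapping reaches every layer is to look at the groups each layer draws. The sketch below assumes a made-up data frame (toy) of yearly weights for two sexes; the formula argument is spelled out only to keep geom_smooth from printing a message.

```r
library(ggplot2)

# Made-up yearly weights for two sexes.
toy <- data.frame(
  year = rep(2000:2004, times = 2),
  sex = rep(c("Female", "Male"), each = 5),
  weight = c(40, 41, 43, 44, 46, 44, 44, 45, 47, 48)
)

# color = sex sits in the base aesthetic, so both layers inherit it.
trend <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 1 is the points, layer 2 is the fitted lines.
point_groups <- unique(layer_data(trend, 1)$group)
line_groups <- unique(layer_data(trend, 2)$group)
```

Both layers report two groups, which means geom_smooth fits a separate line for each sex rather than a single line through all points.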
Storing and re-plotting
The output of
ggplot can be assigned to a variable (here, it’s
year_wgt). It is then possible to add new elements to it with the
+ operator. We will use this method to try different color scales for the previous plot.
year_wgt <- ggplot(animals_dm, aes(x = year, y = weight, color = sex, shape = sex)) +
  geom_point(size = 3, stat = "summary", fun.y = "mean") +
  geom_smooth(method = "lm")
The plot information stored in
year_wgt can be used on its own, or with additional layers.
By overwriting the
year_wgt variable, the stored plot gets updated with the black and red color scale.
year_wgt <- year_wgt + scale_color_manual(values = c("black", "red"))
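Adding a scale to a stored plot produces a new object; the stored one only changes when the result is assigned back to the same name. A minimal sketch of that behavior, assuming a made-up data frame (toy) and hypothetical variable names stored and recolored:

```r
library(ggplot2)

# Toy data standing in for animals_dm (values are made up).
toy <- data.frame(
  year = c(2000, 2001, 2000, 2001),
  weight = c(40, 42, 44, 45),
  sex = c("Female", "Female", "Male", "Male")
)

stored <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point()

# The result is kept under a new name, so `stored` keeps its default colors.
recolored <- stored + scale_color_manual(values = c("black", "red"))

default_colors <- unique(layer_data(stored, 1)$colour)
manual_colors <- unique(layer_data(recolored, 1)$colour)
```

Keeping both objects around makes it easy to compare several color scales against the same base plot before settling on one.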
Create a histogram, using a geom_histogram layer, of the weights of individuals of species DM and divide the data by sex. Note that instead of using color in the aesthetic, you’ll use fill to distinguish the sexes. By default, geom_histogram reports that it picked 30 bins and suggests choosing a better value. To silence that message, open the help with ?geom_histogram and determine how to explicitly set the bin width.
Axes, labels and themes
Let’s start from a histogram like the one generated in the exercise.
histo <- ggplot(animals_dm, aes(x = weight, fill = sex)) +
  geom_histogram(binwidth = 3, color = 'white')
We change the title and axis labels with the
labs function. We have various functions related to the scale of each axis, i.e. the range, breaks and any transformations of the values on the axis.
histo <- histo +
  labs(
    title = 'Dipodomys merriami weight distribution',
    x = 'Weight (g)',
    y = 'Count')
For information on how to add special symbols and formatting to plot labels, see
Here, we use
scale_x_continuous to modify the continuous (as opposed to discrete) x-axis.
histo <- histo +
  scale_x_continuous(
    limits = c(20, 60),
    breaks = c(20, 30, 40, 50, 60))
Many plot-level options in ggplot, from background color to font sizes, are defined as part of themes. The next modification to histo changes the base theme of the plot to theme_bw (replacing the default theme_grey) and sets a few options manually with the theme function. Try ?theme for a list of available theme options.
histo <- histo +
  theme_bw() +
  theme(
    legend.position = c(0.2, 0.5),
    plot.title = element_text(face = 'bold', vjust = 2),
    axis.title.y = element_text(size = 13, vjust = 1),
    axis.title.x = element_text(size = 13, vjust = 0))
Note that the legend position is relative to the plot size (i.e. between 0 and 1).
To conclude this overview of ggplot2, we’ll apply the same plotting instructions to different subsets of the data in panels called “facets”.
The facet_wrap function takes a formula argument that specifies the grouping on either side of a ‘~’ using a factor in the data.
animals_common <- filter(animals, species_id %in% c('DM', 'PP', 'DO'))
faceted <- ggplot(animals_common, aes(x = weight)) +
  geom_histogram() +
  facet_wrap( ~ species_id) +
  labs(title = "Weight of most common species",
       x = "Weight (g)",
       y = "Count")
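The effect of facet_wrap on the plotted data can be inspected directly. In this sketch, which assumes a made-up data frame (toy) with four animals per species, each row of the drawn data carries a PANEL identifier that says which facet it belongs to.

```r
library(ggplot2)

# Toy data standing in for animals_common (values are made up).
toy <- data.frame(
  species_id = rep(c("DM", "DO", "PP"), each = 4),
  weight = c(40, 42, 44, 46, 48, 50, 52, 54, 14, 15, 16, 17)
)

paneled <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ species_id)

# Each histogram bar is tagged with the panel it is drawn in.
drawn <- layer_data(paneled, 1)
n_panels <- length(unique(drawn$PANEL))

# Total count of animals shown in each panel.
counts_by_panel <- tapply(drawn$count, drawn$PANEL, sum)
```

There is one panel per species, and the bar counts within each panel add up to that species’ four animals, so every panel shows only its own subset of the data.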
The un-grouped data may be added as a layer on each panel, but you have to drop the grouping variable (i.e. species_id) from the data given to that layer.
faceted_all <- faceted +
  geom_histogram(
    data = select(animals_common, -species_id),
    alpha = 0.2)
Finally, let’s show off some additional styling with fill and the very unusual ..density.. notation. The .. notation is shared by several ggplot functions that perform a calculation. Using ..density.. as the y-axis variable allows a geometry to display the probability density of the variable assigned to the x-axis.
faceted_density <- ggplot(animals_common, aes(x = weight, fill = species_id)) +
  geom_histogram(aes(y = ..density..)) +
  facet_wrap( ~ species_id) +
  labs(title = "Weight of most common species",
       x = "Weight (g)",
       y = "Density")
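A density histogram rescales the bars so that their total area is 1. The sketch below checks this on a made-up data frame (toy). It uses after_stat(density), which is the spelling that ggplot2 3.3.0 and later offer for the same computed variable as ..density.., so it assumes one of those versions is installed.

```r
library(ggplot2)

# Made-up weights for a single species.
toy <- data.frame(weight = c(38, 40, 41, 43, 44, 44, 46, 47, 49, 52))

# after_stat(density) refers to the density column computed by the histogram stat.
density_plot <- ggplot(toy, aes(x = weight)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 3)

# Area of each bar is its height (density) times its width.
bars <- layer_data(density_plot, 1)
total_area <- sum(bars$density * (bars$xmax - bars$xmin))
```

Because the area is normalized within each group, density histograms make distributions comparable across panels even when the groups have very different numbers of animals.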
The formula notation for
facet_grid (different from
facet_wrap) interprets left-side variables as one axis and right-side variables as another. For these three common animals, create facets in the weight histogram along two categorical variables, with a row for each sex and a column for each species.
Additional information
- Data visualization with ggplot2 (RStudio cheat sheet)
- Cookbook for R - Graphs: a useful reference on how to customize different graph elements in ggplot2.
- Introduction to cowplot: vignette for an add-on package for customizing ggplot figures.
Exercise solutions
animals_dm <- filter(animals, species_id == 'DM')
ggplot(animals_dm, aes(x = year, y = weight, color = sex)) +
  geom_line(stat = 'summary', fun.y = 'mean')
filter(animals, species_id == 'DM') %>%
  ggplot(aes(x = weight, fill = sex)) +
  geom_histogram(binwidth = 1)
ggplot(animals_common, aes(x = weight)) +
  geom_histogram() +
  facet_grid(sex ~ species_id) +
  labs(title = 'Weight of common species by sex',
       x = 'Weight (g)',
       y = 'Count')