Visualizations the ggplot Way

Lesson 5 with Mary Shelly


Objective

This lesson is a brief overview of the ggplot2 package, which is an R implementation of the “grammar of graphics”. In base R, there are different functions for different types of graphics (plot, boxplot, hist, etc.), and each may have its own specific parameters in addition to general plot options. In contrast, ggplot2 constructs plots one layer at a time; for example, the output of a linear regression could be plotted by defining the axes, then adding individual points, tracing the line of best fit, and finally specifying overall layout parameters such as font sizes and background color.

This layered approach allows for highly customizable graphics. Even when a plot requires several lines of code, that code is broken down into simple components that are easy to interpret.


Getting started

Let’s start by loading a few packages along with a sample dataset, which is the animals table from the Portal Project Teaching Database.

We filter the data to remove rows that have missing values for the species_id, sex, or weight columns. (This is not strictly necessary, but it will prevent ggplot from returning missing values warnings.)

library(dplyr)
library(ggplot2)
animals <- read.csv("data/animals.csv", na.strings = "") %>%
    filter(!is.na(species_id), !is.na(sex), !is.na(weight))
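
To see what this filter does, here is a minimal sketch on a few made-up rows. The inline text is a hypothetical stand-in for data/animals.csv, not real Portal data, and the object names are chosen for illustration only. With na.strings = "", empty fields are read as missing, while a literal NA in a text column is kept as an ordinary string.

```r
library(dplyr)

# Hypothetical rows standing in for data/animals.csv
csv_text <- "species_id,sex,weight
DM,M,40
DM,,42
PP,F,
NA,M,180
DO,F,50"

demo <- read.csv(text = csv_text, na.strings = "")

# Missing values per column before filtering
na_counts <- colSums(is.na(demo))

# Keep only the rows where all three columns are present
demo_complete <- demo %>%
  filter(!is.na(species_id), !is.na(sex), !is.na(weight))
```

In this sketch the rows with an empty sex or weight field are dropped, and the row whose species code is the text NA is kept.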

Constructing layered graphics in ggplot

As a first example, this code plots the individuals’ weights by species:

ggplot(data = animals,
       aes(x = species_id, y = weight)) +
  geom_point()

In ggplot, we specify a data frame (animals) and an aesthetic mapping (aes). The aes function associates variables from that data frame to visual elements in the plot: here, species_id on the x-axis and weight on the y-axis.

The ggplot function by itself does not plot anything until we add a geom_* layer such as geom_point. In this particular case, individual points are hard to distinguish; a boxplot might be a better visualization.

The only change here is the geom_*.

ggplot(data = animals,
       aes(x = species_id, y = weight)) +
  geom_boxplot()
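
The statement that ggplot draws nothing until a geom is added can be checked with a small sketch. The toy data frame and object names below are assumptions for illustration. Peeking at the layers element of the plot object is only a convenient way to look inside; the lesson does not depend on it.

```r
library(ggplot2)

# Hypothetical toy data, not the Portal animals table
toy <- data.frame(species_id = c("DM", "DM", "PP", "PP"),
                  weight = c(40, 44, 15, 17))

# Data and aesthetic mapping only: axes are set up, nothing is drawn yet
base <- ggplot(data = toy, aes(x = species_id, y = weight))
n_layers_base <- length(base$layers)

# The same base object can take different geoms
pts <- base + geom_point()
box <- base + geom_boxplot()
n_layers_pts <- length(pts$layers)
```

Here base has no layers, while pts and box each have exactly one, which is why swapping the geom is the only change needed between the two plots.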

Multiple geom layers can be combined in a single plot:

ggplot(data = animals,
       aes(x = species_id, y = weight)) +
  geom_boxplot() +
  geom_point(stat = "summary",
             fun = "mean",   # this argument was named fun.y before ggplot2 3.3.0
             color = "red")

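The geom_point layer definition illustrates a couple of new features. With stat = "summary", the layer draws a summary statistic (here, the mean weight of each species) instead of one point per observation. The color = "red" argument sets a constant attribute for the whole layer, which is why it is given outside of aes.

To associate color (or some other attribute, like point shape) with a variable instead, it needs to be specified within an aes function.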

ggplot(data = animals,
       aes(x = species_id, y = weight, color = species_id)) +
  geom_boxplot() +
  geom_point(stat = "summary",
             fun = "mean")
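
The difference between a constant attribute and a mapped one can be sketched on made-up data (the toy data frame and object names are assumptions for illustration). The layer_data function from ggplot2 returns the values computed for a layer, which makes the difference visible without looking at the plot.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(species_id = c("DM", "DM", "PP", "PP"),
                  weight = c(40, 44, 15, 17))

# Constant: color set outside aes(), so every point is red
p_const <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(color = "red")

# Mapped: color set inside aes(), so each species gets its own color
p_mapped <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(aes(color = species_id))

# Colors actually assigned to the points in each plot
const_colors <- unique(layer_data(p_const)$colour)
mapped_colors <- unique(layer_data(p_mapped)$colour)
```

Only the mapped version produces a legend, because only there does color carry information about the data.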

Exercise 1

Using dplyr and ggplot, show how the mean weight of individuals of the species DM changes over time, with males and females shown in different colors.


Adding a regression line

The code below shows one graph answering the question in the exercise. Adding a geom_smooth layer displays a regression line with confidence intervals (95% CI by default). The method = 'lm' parameter specifies that a linear model is used for smoothing.

animals_dm <- filter(animals, species_id == 'DM')
ggplot(data = animals_dm,
       aes(x = year, y = weight)) + 
  geom_point(aes(shape = sex),
             size = 3,
             stat = 'summary',
             fun = 'mean') +
  geom_smooth(method = 'lm')
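
As a rough check of what geom_smooth draws, the sketch below compares the plotted line with a model fitted directly by lm. The data are made up, and the object names are assumptions for illustration. The level argument sets the width of the confidence band, and formula = y ~ x states the default model explicitly.

```r
library(ggplot2)

# Hypothetical yearly weights with a noisy linear trend
toy <- data.frame(year = 2000:2009,
                  weight = 40 + 0.5 * (0:9) + rep(c(-1, 1), 5))

p <- ggplot(toy, aes(x = year, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

# The smooth layer holds the fitted values (y) and the band (ymin, ymax)
fit_line <- layer_data(p, 2)

# Slope of the drawn line versus the slope estimated by lm()
slope_plot <- (tail(fit_line$y, 1) - fit_line$y[1]) /
  (tail(fit_line$x, 1) - fit_line$x[1])
slope_lm <- coef(lm(weight ~ year, data = toy))[["year"]]
```

The two slopes agree, and setting se = FALSE in geom_smooth would drop the band while keeping the line.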

To get separate regression lines for females and males, we could add a group aesthetic mapping to geom_smooth:

ggplot(data = animals_dm,
       aes(x = year, y = weight)) + 
  geom_point(aes(shape = sex),
             size = 3,
             stat = 'summary',
             fun = 'mean') +
  geom_smooth(aes(group = sex), method = 'lm')

Even better would be to distinguish everything (points and lines) by color:

ggplot(data = animals_dm,
       aes(x = year,
           y = weight,
           color = sex)) + 
  geom_point(aes(shape = sex),
             size = 3,
             stat = 'summary',
             fun = 'mean') +
  geom_smooth(method = 'lm')

Notice that by adding the aesthetic mapping in the ggplot command, it is applied to all layers that recognize that aesthetic (color).
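
A layer can still opt out of an inherited mapping. The sketch below, on made-up data with assumed object names, keeps colored points but draws one pooled regression line: aes(group = 1) puts all rows in a single group for that layer, and the constant color = "black" replaces the inherited color mapping there.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(year = rep(2000:2004, times = 2),
                  sex = rep(c("F", "M"), each = 5),
                  weight = c(40, 41, 41, 42, 43, 44, 44, 46, 46, 47))

p <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point() +                    # inherits color = sex
  geom_smooth(aes(group = 1),       # one group: both sexes pooled
              color = "black",      # constant replaces the inherited mapping
              method = "lm", formula = y ~ x)

point_colors <- unique(layer_data(p, 1)$colour)
line_groups <- unique(layer_data(p, 2)$group)
```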


Storing and re-plotting

The output of ggplot can be assigned to a variable (here, it’s year_wgt). It is then possible to add new elements to it with the + operator. We will use this method to try different color scales for the previous plot.

year_wgt <- ggplot(data = animals_dm,
                   aes(x = year,
                       y = weight,
                       color = sex)) +
  geom_point(aes(shape = sex),
             size = 3,
             stat = "summary",
             fun = "mean") +
  geom_smooth(method = "lm")

year_wgt +
  scale_color_manual(values = c("darkblue", "orange"))
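
An unnamed values vector is matched to the levels of the variable in order. A named vector makes the pairing explicit, as in this sketch on made-up data (the toy data frame and object names are assumptions for illustration). It also shows that adding to a stored plot returns a new object and leaves the stored one unchanged.

```r
library(ggplot2)

# Hypothetical toy data
toy <- data.frame(year = rep(2000:2002, times = 2),
                  sex = rep(c("F", "M"), each = 3),
                  weight = c(40, 41, 42, 44, 45, 46))

base <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point()

# Names tie each color to a value of sex, whatever the order given
named <- base +
  scale_color_manual(values = c(M = "orange", F = "darkblue"))

# Colors used by the new plot and by the untouched stored plot
named_cols <- layer_data(named)
base_cols <- unique(layer_data(base)$colour)
```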

By overwriting the year_wgt variable, the stored plot gets updated with the black and red color scale.

year_wgt <- year_wgt +
  scale_color_manual(values = c("black", "red"))
year_wgt

Exercise 2

Create a histogram, using a geom_histogram() layer, of the weights of individuals of species DM and divide the data by sex. Note that instead of using color in the aesthetic, you’ll use fill to distinguish the sexes. Also open the help with ?geom_histogram and determine how to explicitly set the bin width.


Axes, labels and themes

Let’s start from a histogram like the one generated in the exercise.

histo <- ggplot(data = animals_dm,
                aes(x = weight, fill = sex)) +
    geom_histogram(binwidth = 3, color = "white")
histo

We change the title and axis labels with the labs function. We have various functions related to the scale of each axis, i.e. the range, breaks and any transformations of the values on the axis. Here, we use scale_x_continuous to modify a continuous (as opposed to discrete) x-axis.

histo <- histo + 
  labs(title = "Dipodomys merriami weight distribution",
       x = "Weight (g)",
       y = "Count") +
  scale_x_continuous(limits = c(20, 60),
                     breaks = c(20, 30, 40, 50, 60))
histo
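
One detail worth knowing about scale limits: values outside the limits are treated as missing before the histogram is computed, usually with a warning. To zoom without discarding data, coord_cartesian can be used instead. The sketch below contrasts the two on made-up weights (the data and object names are assumptions for illustration).

```r
library(ggplot2)

# Hypothetical weights, two of them outside the 20 to 60 range
toy <- data.frame(weight = c(10, 25, 30, 35, 45, 55, 70))

# Scale limits: values outside 20-60 become NA before binning
by_scale <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 3) +
  scale_x_continuous(limits = c(20, 60))

# Coordinate limits: all values are binned, then the view is cropped
by_coord <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 3) +
  coord_cartesian(xlim = c(20, 60))

# Number of observations that reached the bins in each case
n_scale <- sum(suppressWarnings(layer_data(by_scale))$count)
n_coord <- sum(layer_data(by_coord)$count)
```

For a histogram the visible bars look the same either way, but the choice matters for layers that summarize the data, such as a regression line or a boxplot.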

For information on how to add special symbols and formatting to plot labels, see ?plotmath.

Many plot-level options in ggplot, from background color to font sizes, are defined as part of themes. The next modification to histo changes the base theme of the plot to theme_bw (replacing the default theme_grey) and sets a few options manually with the theme function. Try ?theme for a list of available theme options.

histo <- histo +
  theme_bw() +
  theme(legend.position = c(0.2, 0.5),
        plot.title = element_text(face = "bold", vjust = 2),
        axis.title.y = element_text(size = 13, vjust = 1), 
        axis.title.x = element_text(size = 13, vjust = 0))
histo

Note that the legend position is given relative to the plot size (i.e. each coordinate is between 0 and 1).


Facets

To conclude this overview of ggplot2, here are a few examples that show different subsets of the data in panels called facets. The facet_wrap function takes a formula argument that specifies the grouping on either side of a ‘~’ using a factor in the data.

animals_common <- filter(animals, species_id %in% c('DM', 'PP', 'DO'))
ggplot(data = animals_common,
       aes(x = weight)) +
  geom_histogram() +
  facet_wrap( ~ species_id) +
  labs(title = "Weight of most common species",
       x = "Weight (g)",
       y = "Count")
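
The panel layout can be adjusted through further facet_wrap arguments. The sketch below uses made-up data with assumed object names; vars() is an alternative to the formula notation in ggplot2 3.0.0 and later, ncol sets the number of columns, and scales = "free_y" lets each panel use its own count axis.

```r
library(ggplot2)

# Hypothetical toy data with three species
toy <- data.frame(
  species_id = rep(c("DM", "DO", "PP"), times = c(6, 4, 2)),
  weight = c(38, 40, 41, 43, 44, 46, 47, 49, 50, 52, 15, 17))

p <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(vars(species_id),   # same grouping as ~ species_id
             ncol = 1,           # stack the panels in one column
             scales = "free_y")  # separate count axis per panel

# One panel per species
n_panels <- length(unique(layer_data(p)$PANEL))
```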

The un-grouped data may be added as a layer on each panel, but you have to drop the grouping variable (i.e. species_id) from the data given to that layer.

ggplot(data = animals_common,
       aes(x = weight)) +
  geom_histogram(data = select(animals_common, -species_id),
                 alpha = 0.2) +
  geom_histogram() +
  facet_wrap( ~ species_id) +
  labs(title = "Weight of most common species",
       x = "Weight (g)",
       y = "Count")

Finally, let’s show off some additional styling with fill and the very unusual ..density.. argument in the aesthetic. The .. notation is shared by several ggplot functions that perform calculations; in this case, it requests the probability density rather than the frequency used before.

ggplot(data = animals_common,
       aes(x = weight, fill = species_id)) +
  geom_histogram(aes(y = ..density..)) +
  facet_wrap( ~ species_id) +
  labs(title = "Weight of most common species",
       x = "Weight (g)",
       y = "Density") +
  guides(fill = "none")
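
In ggplot2 3.3.0 and later, the same computed variable can also be written as after_stat(density), which is the form recommended by current documentation. The sketch below uses made-up data with assumed object names and checks the defining property of a density histogram: within each panel, the bar areas sum to one.

```r
library(ggplot2)

# Hypothetical toy data with three species
toy <- data.frame(
  species_id = rep(c("DM", "DO", "PP"), times = c(6, 4, 2)),
  weight = c(38, 40, 41, 43, 44, 46, 47, 49, 50, 52, 15, 17))

p <- ggplot(toy, aes(x = weight, fill = species_id)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2) +
  facet_wrap(~ species_id) +
  guides(fill = "none")

# Bar area = density * bin width, summed within each panel
bars <- layer_data(p)
area_by_panel <- tapply(bars$density * (bars$xmax - bars$xmin),
                        bars$PANEL, sum)
```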

Exercise 3

The formula notation for facet_grid (different from facet_wrap) lays out the levels of left-side variables as rows and the levels of right-side variables as columns. For these three common animals, create facets in the weight histogram along two categorical variables, with a row for each sex and a column for each species.


Additional information


Exercise solutions

Solution 1

animals_dm <- filter(animals, species_id == 'DM')
ggplot(data = animals_dm,
       aes(x = year, y = weight, color = sex)) +
  geom_line(stat = 'summary',
            fun = 'mean')

Solution 2

filter(animals, species_id == 'DM') %>%
  ggplot(aes(x = weight, fill = sex)) +
  geom_histogram(binwidth = 1)

Solution 3

ggplot(data = animals_common,
       aes(x = weight)) +
  geom_histogram() +
  facet_grid(sex ~ species_id) +
  labs(title = 'Weight of common species by sex',
       x = 'Weight (g)',
       y = 'Count')
