Visualizations the ggplot Way

Lesson 4 with Ian Carroll


Objectives for this lesson

Specific achievements


Getting started

Let’s start by loading a few packages along with a sample dataset, which is the animals table from the Portal Project Teaching Database.

library(dplyr)
animals <- read.csv('data/animals.csv',
  na.strings = '') %>%
  select(year, species_id, sex, weight) %>%
  na.omit()

Omitting rows that have missing values in any of the selected columns (year, species_id, sex, and weight) is not strictly necessary, but it prevents ggplot from warning about missing values.
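To see what na.strings = '' and na.omit() do together, here is a minimal sketch on a made-up four-row table. The values are invented for illustration and are not taken from animals.csv.

```r
library(dplyr)

# Invented miniature of the animals table; not real Portal data.
csv_text <- "year,species_id,sex,weight
1977,DM,M,40
1977,DM,,
1978,PP,F,17
1978,,F,"

# na.strings = '' makes read.csv treat empty fields as NA
toy <- read.csv(text = csv_text, na.strings = '') %>%
  select(year, species_id, sex, weight)

colSums(is.na(toy))           # number of missing values in each column
toy_complete <- na.omit(toy)  # keep only rows with no NA in any column
nrow(toy_complete)
```

Only the two fully recorded rows survive, which is the same filtering applied to animals above.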

Layered graphics

As a first example, this code plots each individual’s weight by species:

library(ggplot2)
ggplot(animals,
       aes(x = species_id, y = weight)) +
  geom_point()

The ggplot command expects a data frame and an aesthetic mapping. The aes function creates the aesthetic, a mapping between variables in the data frame and visual elements in the plot. Here, the aesthetic maps species_id to the x-axis and weight to the y-axis.

The ggplot function by itself does not draw any data until we add a geom_* layer, in this example a geom_point. Layers are literally added, with +, to the object created by the ggplot function.
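The idea that layers are added to a plot object can be checked directly. The sketch below uses an invented toy data frame and peeks at the layers element of the plot object, which is an internal detail used here only for illustration.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  species_id = c('DM', 'DM', 'PP', 'PP'),
  weight = c(40, 44, 16, 18))

base <- ggplot(toy, aes(x = species_id, y = weight))
n_before <- length(base$layers)        # no geom_* layer yet

with_points <- base + geom_point()
n_after <- length(with_points$layers)  # one layer after adding geom_point

c(before = n_before, after = n_after)
```

Note that base itself is unchanged: the + operator returns a new plot object rather than modifying the old one.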

Individual points are hard to distinguish in this plot. Might a boxplot be a better visualization? The only change needed is in the geom_* layer.

ggplot(animals,
       aes(x = species_id, y = weight)) +
  geom_boxplot()

Add geom_* layers together to create a multi-layered plot:

ggplot(animals,
       aes(x = species_id, y = weight)) +
  geom_boxplot() +
  geom_point()

Each geom_* object accepts some general and some specialized arguments, either for styling with shapes and colors or for performing some dplyr-like data transformations.

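ggplot(animals,
       aes(x = species_id, y = weight)) +
  geom_boxplot() +
  # overlay the mean weight of each species as a red point
  # (fun replaces the deprecated fun.y in ggplot2 >= 3.3.0)
  geom_point(
    color = 'red',
    stat = 'summary',
    fun = 'mean')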

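The geom_point layer definition illustrates both kinds of argument: color = 'red' styles the points, while stat = 'summary' together with the summary function 'mean' replaces the individual points with the mean weight of each species.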
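To make the "dplyr-like" part concrete, the following sketch compares the summary computed inside the layer with an explicit group_by and summarise on invented toy data. It assumes ggplot2 3.3.0 or later, where the argument is named fun, and uses layer_data() to read back the values a layer would draw.

```r
library(dplyr)
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  species_id = c('DM', 'DM', 'PP', 'PP', 'PP'),
  weight = c(40, 44, 15, 17, 19))

# The dplyr route: compute one mean per species
toy_means <- toy %>%
  group_by(species_id) %>%
  summarise(weight = mean(weight))

# The ggplot route: let the layer summarise the data
p_summary <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(color = 'red', stat = 'summary', fun = 'mean')

# layer_data() returns the data a layer actually draws
summary_y <- layer_data(p_summary)$y
all.equal(as.numeric(summary_y), toy_means$weight)
```

Either route gives the same points; the stat argument simply saves the intermediate data frame.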

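Associating color (or any attribute, like the shape of points) with a variable is another kind of aesthetic mapping. Passing the color argument to the aes function works quite differently than assigning color to a geom_*: inside aes, color varies with the data and gets a legend, whereas a color given directly to a geom_* is a fixed value for the whole layer.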

ggplot(animals,
       aes(x = species_id, y = weight,
           color = species_id)) +
  geom_boxplot() +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3.0
  geom_point(stat = 'summary',
             fun = 'mean')
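The difference between setting a color and mapping a color can be seen in the data each layer ends up drawing. This is a small sketch on invented toy data, using layer_data() to inspect the result.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  species_id = c('DM', 'DM', 'PP', 'PP', 'DO', 'DO'),
  weight = c(40, 44, 15, 17, 48, 50))

# Setting: one fixed color, given to the geom
fixed <- ggplot(toy, aes(x = species_id, y = weight)) +
  geom_point(color = 'red')

# Mapping: color depends on a variable, given inside aes()
mapped <- ggplot(toy, aes(x = species_id, y = weight,
                          color = species_id)) +
  geom_point()

fixed_colours <- unique(layer_data(fixed)$colour)    # a single value
mapped_colours <- unique(layer_data(mapped)$colour)  # one per species
length(mapped_colours)
```

The mapped version also produces a legend, because the color now carries information about the data.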

Exercise 1

Use dplyr to filter down to the animals with species_id equal to DM. Use ggplot to show how the mean weight of this species changes each year, showing males and females in different colors. (Hint: Baby steps! Start with a scatterplot of weight by year. Then expand your code to plot only the means. Then try to distinguish sexes.)

View solution


Smooth lines

The geom_smooth layer adds a regression line with confidence intervals (95% CI by default). The method = 'lm' parameter specifies that a linear model is used for smoothing.
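As a rough check of what method = 'lm' means, the sketch below fits the same line with lm() on invented toy data and compares it with the values the smooth layer draws. The formula = y ~ x argument only makes the default explicit.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  year = 2000:2009,
  weight = c(40, 42, 41, 44, 43, 46, 45, 47, 49, 48))

p <- ggplot(toy, aes(x = year, y = weight)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)

# Data drawn by the smooth layer (layer 2): x, y, ymin, ymax, ...
smooth <- layer_data(p, 2)

# The same line, fitted directly with lm()
fit <- lm(weight ~ year, data = toy)
direct <- predict(fit, newdata = data.frame(year = smooth$x))

all.equal(as.numeric(smooth$y), as.numeric(direct))
```

The ymin and ymax columns of the smooth layer hold the lower and upper bounds of the confidence band around that line.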

Prepare the data with dplyr as you would for a linear model with a categorical predictor: label the levels of sex and keep only species DM.

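# convert sex to a factor before relabeling its levels
# (read.csv returns character columns in R >= 4.0)
animals$sex <- factor(animals$sex)
levels(animals$sex) <- c('Female', 'Male')
animals_dm <- filter(animals,
  species_id == 'DM')
ggplot(animals_dm,
  aes(x = year, y = weight, shape = sex)) +
  geom_point(size = 3,
    stat = 'summary', fun = 'mean') +
  geom_smooth(method = 'lm')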

Even better would be to distinguish everything (points and lines) by color:

ggplot(animals_dm,
  aes(x = year, y = weight,
    shape = sex, color = sex)) +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3.0
  geom_point(size = 3,
    stat = 'summary', fun = 'mean') +
  geom_smooth(method = 'lm')

Notice that aesthetic mappings added to the base aesthetic (in the ggplot command) are applied to every layer that recognizes them.
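A short sketch of that inheritance, on invented toy data: color = sex is declared once in the base aesthetic, and both the point layer and the smooth layer end up with the same two colors.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  year = rep(2000:2004, times = 2),
  sex = rep(c('Female', 'Male'), each = 5),
  weight = c(40, 41, 43, 42, 44, 45, 47, 46, 48, 49))

# color = sex is set once, in the base aesthetic
p <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x)

# Both layers inherit the mapping
point_colours <- unique(layer_data(p, 1)$colour)
smooth_colours <- unique(layer_data(p, 2)$colour)
setequal(point_colours, smooth_colours)
```

A layer can still override the inherited mapping by passing its own aes() call.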


Storing and re-plotting

The output of ggplot can be assigned to a variable (here, it’s year_wgt). It is then possible to add new elements to it with the + operator. We will use this method to try different color scales for the previous plot.

year_wgt <- ggplot(animals_dm,
  aes(x = year, y = weight,
    color = sex, shape = sex)) +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3.0
  geom_point(size = 3,
    stat = "summary",
    fun = "mean") +
  geom_smooth(method = "lm")

The plot information stored in year_wgt can be used on its own, or with additional layers.

year_wgt

By overwriting the year_wgt variable, the stored plot gets updated with the black and red color scale.

year_wgt <- year_wgt +
  scale_color_manual(
    values = c("black", "red"))
year_wgt
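The stored plot only changes when the variable is overwritten. The sketch below, on invented toy data, keeps both versions side by side to show this. Naming the values by factor level is an optional choice made here, not something the lesson code does; it pins each color to a level regardless of order.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  year = rep(2000:2004, times = 2),
  sex = rep(c('Female', 'Male'), each = 5),
  weight = c(40, 41, 43, 42, 44, 45, 47, 46, 48, 49))

p_default <- ggplot(toy, aes(x = year, y = weight, color = sex)) +
  geom_point()

# Adding a scale returns a new plot object; p_default is unchanged
p_manual <- p_default +
  scale_color_manual(
    values = c(Female = 'black', Male = 'red'))

default_colours <- unique(layer_data(p_default)$colour)
manual_colours <- unique(layer_data(p_manual)$colour)
manual_colours
```

Keeping a base plot in one variable and variants in others makes it easy to compare scales without rebuilding the plot each time.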

Exercise 2

Create a histogram, using a geom_histogram layer, of the weights of individuals of species DM and divide the data by sex. Note that instead of using color in the aesthetic, you’ll use fill to distinguish the sexes. With default settings, geom_histogram prints a message asking you to pick a bin width. To silence it, open the help with ?geom_histogram and determine how to explicitly set the bin width.

View solution


Axes, labels and themes

Let’s start from a histogram like the one generated in Exercise 2.

histo <- ggplot(animals_dm,
  aes(x = weight, fill = sex)) +
  geom_histogram(binwidth = 3, color = 'white')
histo

We change the title and axis labels with the labs function. Other functions control the scale of each axis, i.e. its range, breaks and any transformation of its values.

histo <- histo + labs(title =
  'Dipodomys merriami weight distribution',
  x = 'Weight (g)',
  y = 'Count')
histo

For information on how to add special symbols and formatting to plot labels, see ?plotmath.

Here, we use scale_x_continuous to modify the continuous (as opposed to discrete) x-axis.

histo <- histo + scale_x_continuous(
  limits = c(20, 60),
  breaks = c(20, 30, 40, 50, 60))
histo
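One point worth knowing about limits set through a scale: values outside the limits are treated as missing rather than merely hidden. The sketch below illustrates this with points instead of a histogram, to keep the counting simple, and contrasts it with coord_cartesian, which only zooms. The toy data are invented.

```r
library(ggplot2)

# Invented toy data: two weights fall outside the 20-60 range
toy <- data.frame(
  weight = c(10, 25, 30, 35, 45, 70),
  id = 1:6)

base <- ggplot(toy, aes(x = weight, y = id)) +
  geom_point()

# Scale limits: out-of-range values become NA
limited <- base + scale_x_continuous(
  limits = c(20, 60),
  breaks = c(20, 30, 40, 50, 60))

# Coordinate limits: all values are kept, the view is cropped
zoomed <- base + coord_cartesian(xlim = c(20, 60))

kept_limited <- sum(!is.na(layer_data(limited)$x))
kept_zoomed <- sum(!is.na(layer_data(zoomed)$x))
c(limited = kept_limited, zoomed = kept_zoomed)
```

For a histogram this matters because dropped values no longer contribute to the bin counts.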

Many plot-level options in ggplot, from background color to font sizes, are defined as part of themes. The next modification to histo changes the base theme of the plot to theme_bw (replacing the default theme_grey) and sets a few options manually with the theme function. Try ?theme for a list of available theme options.

histo <- histo + theme_bw() + theme(
  legend.position = c(0.2, 0.5),
  plot.title = element_text(
    face = 'bold', vjust = 2),
  axis.title.y = element_text(
    size = 13, vjust = 1), 
  axis.title.x = element_text(
    size = 13, vjust = 0))
histo

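Note that the legend position is given relative to the plot size (i.e. both coordinates lie between 0 and 1). In ggplot2 3.5.0 and later, the preferred spelling is legend.position = 'inside' together with legend.position.inside = c(0.2, 0.5).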


Facets

To conclude this overview of ggplot2, we’ll apply the same plotting instructions to different subsets of the data in panels called “facets”. The facet_wrap function takes a formula argument that specifies the grouping on either side of a ‘~’ using a factor in the data.

animals_common <- filter(animals,
  species_id %in% c('DM', 'PP', 'DO'))
faceted <- ggplot(
  animals_common, aes(x = weight)) +
  geom_histogram() +
  facet_wrap( ~ species_id) +
  # weight is on the x-axis, counts are on the y-axis
  labs(title =
       "Weight of most common species",
       x = "Weight (g)",
       y = "Count")
faceted
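To see the grouping at work, this sketch builds a faceted histogram from invented toy data and counts the resulting panels. It also shows vars(), an alternative to the formula notation that is assumed to be available (ggplot2 3.0.0 or later).

```r
library(ggplot2)

# Invented toy data: three species, four weights each
toy <- data.frame(
  species_id = rep(c('DM', 'DO', 'PP'), each = 4),
  weight = c(40, 42, 44, 46, 48, 50, 52, 54, 14, 16, 18, 20))

# Formula notation, as used in this lesson
by_formula <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 2) +
  facet_wrap( ~ species_id)

# Equivalent notation with vars()
by_vars <- ggplot(toy, aes(x = weight)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(vars(species_id))

# The PANEL column records which facet each bar belongs to
panels_formula <- length(unique(layer_data(by_formula)$PANEL))
panels_vars <- length(unique(layer_data(by_vars)$PANEL))
c(formula = panels_formula, vars = panels_vars)
```

Each distinct value of species_id gets its own panel, drawn with the same plotting instructions.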

The un-grouped data may be added as a layer on each panel, but you have to drop the grouping variable (i.e. species_id) from the data given to that layer.

faceted_all <- faceted +
  geom_histogram(data =
    select(animals_common, -species_id),
    alpha = 0.2)
faceted_all
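Why dropping the grouping variable works: a layer whose data lack the faceting variable is repeated in every panel. A minimal sketch with points on invented toy data, counting the rows each layer contributes per panel:

```r
library(dplyr)
library(ggplot2)

# Invented toy data: three species, four records each
toy <- data.frame(
  species_id = rep(c('DM', 'DO', 'PP'), each = 4),
  year = rep(2000:2003, times = 3),
  weight = c(40, 42, 44, 46, 48, 50, 52, 54, 14, 16, 18, 20))

p <- ggplot(toy, aes(x = year, y = weight)) +
  geom_point() +
  facet_wrap( ~ species_id) +
  # this layer's data have no species_id column
  geom_point(data = select(toy, -species_id),
             alpha = 0.2)

rows_by_species <- table(layer_data(p, 1)$PANEL)  # own species only
rows_all <- table(layer_data(p, 2)$PANEL)         # all records, every panel
rows_all
```

The first layer splits its rows across the panels, while the second shows the full data set as a faded backdrop in each one.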

Finally, let’s show off some additional styling with fill and the very unusual ..density.. aesthetic. The .. notation is shared by several ggplot functions that perform a calculation. Using ..density.. as the y-axis variable allows a geometry to display the probability density of the variable assigned to the x-axis. (Since ggplot2 3.4.0 the recommended spelling is after_stat(density); the .. notation still works but triggers a deprecation warning.)

faceted_density <- ggplot(
  animals_common,
  aes(x = weight, fill = species_id)) +
  geom_histogram(aes(y = ..density..)) +
  facet_wrap( ~ species_id) +
  # weight is on the x-axis, density is on the y-axis
  labs(title =
    "Weight of most common species",
    x = "Weight (g)",
    y = "Density")
faceted_density
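What "probability density" means for a histogram is that the bar areas, not the bar heights, sum to one. The sketch below checks this on invented toy data. It uses the newer after_stat(density) spelling, which assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Invented toy data for illustration
toy <- data.frame(
  weight = c(40, 41, 41, 42, 44, 44, 45, 47, 48, 50))

binwidth <- 2
p <- ggplot(toy, aes(x = weight)) +
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = binwidth)

bars <- layer_data(p)

# Each bar has height = density and width = binwidth
area <- sum(bars$y) * binwidth
area
```

Because each facet is scaled this way separately, density histograms make species with very different sample sizes easier to compare.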

Exercise 3

The formula notation for facet_grid (different from facet_wrap) interprets left-side variables as the rows of the grid and right-side variables as the columns. For these three common animals, create facets in the weight histogram along two categorical variables, with a row for each sex and a column for each species.

View solution


Additional information


Exercise solutions

Solution 1

animals_dm <- filter(animals, species_id == 'DM')
ggplot(animals_dm,
       aes(x = year, y = weight, color = sex)) +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3.0
  geom_line(stat = 'summary',
            fun = 'mean')

Solution 2

filter(animals, species_id == 'DM') %>%
  ggplot(aes(x = weight, fill = sex)) +         
  geom_histogram(binwidth = 1)

Solution 3

ggplot(animals_common,
       aes(x = weight)) +
  geom_histogram() +
  facet_grid(sex ~ species_id) +
  # weight is on the x-axis, counts are on the y-axis
  labs(title = 'Weight of common species by sex',
       x = 'Weight (g)',
       y = 'Count')
