Smart and Interactive Documents

Lesson 9 with Kelly Hondula

How Smart?

The reproducible pipeline under construction begins with open data, uses scripts to perform analyses and create visualizations, and ideally ends in a published write-up.

rmarkdown merges code and documentation, allowing you to create automatic reports that include the results of computations and visualizations created on-the-fly.

How Interactive?

Rather than rendering to a static document, RStudio lets you easily inject shiny input and output widgets into documents constructed with RMarkdown. These widgets accept user input through forms, menus, and sliders, and cause the corresponding tables, plots, and other graphical output to update dynamically.

Interactive documents require a connection to a live R process, which any user running RStudio can provide, but so can hosting services like http://www.shinyapps.io/.

Lesson Objectives

  1. Start with “dumb” documents and the basics of Markdown.
  2. Envision an efficient, one-click pipeline with RMarkdown.
  3. Create an interactive document with Shiny.

Markdown

Markdown exists outside of the R environment. Like R, it is both a language and an interpreter.

  1. It is a language with special characters and a syntax that convey formatting instructions inside text files.

  2. The accompanying interpreter reads text files and outputs one of several types of formatted documents (e.g. Word, PDF, and HTML).

RMarkdown

The rmarkdown package bundles the formatting ability of Markdown with the ability to send embedded code to an R interpreter and capture the result.

Seeing is Believing

The handout for this lesson is this lesson. The lesson’s .Rmd worksheet is the RMarkdown source for this document (with a few omissions for you to fill in). Open it and find this line of code:

data.frame(counts = c(4, 5, 6))
##   counts
## 1      4
## 2      5
## 3      6

The output is not in the source—it was “knit” into the rendered output. Press the “Knit” button in RStudio to generate the single-page view of this lesson. As we proceed, fill in the ... areas of your worksheet, and press the “Knit” button to verify the output.

Markdown Syntax

Before getting to the good stuff, a quick introduction to “dumb” Markdown formatting.

Preformatted Text

Text fenced by “```” is left untouched by the Markdown interpreter, usually for the purpose of displaying code. Everything else is formatted according to a light-weight syntax.

The *emphasis* indicated by asterisks here does not become
italicized, as it would outside a "code fence".

That’s three backtick characters, found next to the “1” on QWERTY keyboards, above and below the text.

Bulleted Lists (preformatted)

Sequential lines beginning with “-” are grouped into a bulleted list. The following preformatted text shows the syntax.

- SQL
- Python
- R

Without a code fence (which is not present in your worksheet), the chunk of text above will be rendered as a bulleted list. All the sections with “(preformatted)” in the online lesson are paired with a section in your worksheet for you to complete and then knit.

Numbered Lists (preformatted)

Sequential lines beginning with a number are grouped into a numbered list. The actual number you use is irrelevant.

6. SQL
1. Python
5. R

Tables (preformatted)

Separate text with vertical bars (|) to indicate columns of a table and hyphens (-) to mark the beginning of a table or to separate the header row.

id | treatment
---|----------
1  | control
2  | exclosure

Configuration

Text at the top of a Markdown file and fenced by --- stores configuration. Variables set here can, for example, change the type of document produced.

---
output: html_document
---

Change the output variable to ioslides_presentation and knit again to generate output formatted as a slideshow.

Headers (preformatted)

Default formatting for an html_document differs in some cases from an ioslides_presentation. The use of # to indicate a hierarchy of heading sizes serves an additional purpose in a slideshow.

# The Biggest Heading
## The Second Biggest Heading
### The Third Biggest Heading

R + Markdown (preformatted)

The rmarkdown package evaluates the R expressions within a code fence and inserts the result into the output document. To send these “code chunks” to the R interpreter, append {r} to the upper fence.

```{r}
seq(1, 10)
```

Chunk Options (preformatted)

Each code chunk runs under options specified globally or within the {r ...} expression. The option echo = FALSE would cause the output to render without the input. The option eval = FALSE would prevent evaluation and output rendering.

```{r echo = FALSE}
seq(1, 10)
```

Chunk Labels (preformatted)

The first entry after {r will be interpreted as a chunk label if it is not setting an option. Chunk options are specified after the optional label and separated by commas. Labels do not show up in the document; we’ll have other uses for them.

```{r does_not_run, eval = FALSE}
seq(1, 10)
```

Reproducible Pipeline

A pipeline might rely on several scripts that separately acquire data, tidy it, fit or run models, and validate results. Embedding calls to those external scripts is one way to create a one-click pipeline.

Sourced Pipeline (preformatted)

The source function in R includes the contents of a separate file in the current code chunk. The entire script is evaluated in the current environment.

```{r load_data, echo = FALSE}
source('worksheet-9.R')
```

The lesson’s .R worksheet is an R script creating a rodents data frame, which “sourcing” makes available to following lines as well as subsequent code chunks.
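To make the effect of sourcing concrete, here is a minimal sketch that does not use the lesson's worksheet-9.R. It writes a one-line stand-in script to a temporary file (the object name toy_rodents and its contents are hypothetical), sources it, and then checks that the object the script created is available afterwards.

```r
# Write a tiny stand-in script to a temporary file. The contents are a
# hypothetical example, not the lesson's worksheet-9.R.
script <- tempfile(fileext = '.R')
writeLines(
  "toy_rodents <- data.frame(species_id = c('DM', 'PP'), count = c(3, 5))",
  script
)

# source() runs every line of the file in the current R session, so the
# objects it creates remain available to the code that follows.
source(script)

print(exists('toy_rodents'))
print(nrow(toy_rodents))
```

In a knitted document the same thing happens across chunks: an object created by a sourced script in one chunk can be used by any later chunk.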

Sourced Pipeline (preformatted)

```{r bar_plot}
library(ggplot2)
ggplot(rodents, aes(x = species_id, y = count)) +
  geom_bar(stat = 'identity')
```

If your entire pipeline can be scripted in R, you could embed the entire analysis in code chunks within your write-up. The better practice demonstrated here is “modularizing” your analysis by splitting it into isolated scripts, and then using an rmarkdown document to execute the pipeline.

Non-sourced pipelines (preformatted)

The code interpreter is not limited to R. Several interpreters, including python and sql, can be used for code written directly into a code chunk.

```{python}
greeting = 'Hello, {}!'
print(greeting.format('world'))
```

Non-sourced pipelines (preformatted)

Access to your operating system shell, for example the Linux bash interpreter, provides a way to run any scripted pipeline step.

```{bash}
python -c 'import os; print("Hello, {}!".format(os.environ["USER"]))'
```

An important distinction between sourced and non-sourced pipelines is the inability of interpreters other than R to return R objects. By using source, an R script is run in the current R session, which provides an easy way to pass data between scripts. Typically, file-based input and output is necessary for multi-lingual pipelines. For Python, however, the reticulate package provides a bi-directional interface.
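The file-based hand-off mentioned above can be sketched from the R side as follows. This is only an illustration under assumptions: the data frame toy and the temporary CSV file are hypothetical, and the step in another language that would read or rewrite the file is left out.

```r
# A hypothetical result produced by an R step of the pipeline.
toy <- data.frame(id = 1:3, weight = c(40.5, 42, 39))

# Write it to an intermediate file that a non-R step (e.g. a python or
# bash chunk) could read and, if needed, rewrite.
handoff <- tempfile(fileext = '.csv')
write.csv(toy, handoff, row.names = FALSE)

# A later R chunk picks the data back up from the file.
round_trip <- read.csv(handoff)
print(round_trip)
```

A plain-text format such as CSV is a common choice for this kind of exchange because nearly every language can read and write it.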

Efficient Pipelines (preformatted)

There is no reason to run every step of a pipeline after making changes “downstream”. Like more comprehensive software for automating pipelines, rmarkdown includes the notion of tracking dependencies and caching results. Cached code chunks are not re-evaluated unless the content of the code changes.

Enable cache in the setup chunk to turn off re-evaluation of any code chunk that has not been modified since the last knit.

```{r setup, include = FALSE}
library(knitr)
opts_chunk$set(message = FALSE, warning = FALSE, out.width = '75%', cache = TRUE)
```
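The caching idea described above can be illustrated with a toy example in plain R. This is not how knitr is implemented: knitr keys its cache on a hash covering the chunk's code and options and stores results on disk, whereas this sketch simply uses the code text as the key and keeps results in memory. The function run_chunk and the counter evaluations are hypothetical names.

```r
# Toy in-memory cache, keyed by the text of the "chunk".
cache <- new.env()
evaluations <- 0

run_chunk <- function(code) {
  # Only evaluate when this exact code text has not been seen before.
  if (!exists(code, envir = cache, inherits = FALSE)) {
    evaluations <<- evaluations + 1
    assign(code, eval(parse(text = code)), envir = cache)
  }
  get(code, envir = cache, inherits = FALSE)
}

first  <- run_chunk('sum(1:10)')   # new code: evaluated and stored
second <- run_chunk('sum(1:10)')   # unchanged code: served from the cache
third  <- run_chunk('sum(1:100)')  # modified code: evaluated again

print(evaluations)
```

Only two of the three calls trigger an evaluation, which mirrors how an unmodified chunk is skipped on the next knit.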

Cache (preformatted)

Render the worksheet again to create a cache for each code chunk, and then modify your bar_plot chunk to show species’ weights and render again. The “slow” load_data chunk zips right by, using its cache, but the plot will change.

```{r bar_plot}
library(ggplot2)
ggplot(rodents, aes(x = species_id, y = weight)) +
  geom_bar(stat = 'identity')
```

Cache Dependencies (preformatted)

With the dependson option, even an unmodified chunk will be re-evaluated if a dependency runs.

```{r clean_bar_plot, dependson = 'load_data'}
ggplot(rodents, aes(x = species_id, y = weight)) +
  geom_bar(stat = 'identity')
```

Add the above new chunk with dependson = 'load_data' so it updates if and only if the load_data chunk is executed. Knit the document and compare the bar_plot and clean_bar_plot outputs; at this point they should be identical. Now make load_data clean the data, then knit again and compare the plots.

```{r load_data, echo = FALSE}
source('worksheet-9.R')
rodents <- subset(rodents, !is.na(weight))
```

The updated result of clean_bar_plot now reflects the cleaning operation on the rodents data frame. But the bar_plot chunk simply loaded results from its cache, because the dependency was not explicit.

Note that clean_bar_plot will re-execute when the code in the load_data chunk changes, but that chunk contains a call to source. The rodents variable could change if the code in the sourced file is updated, yet the text of the load_data chunk would stay the same, so this would not trigger re-generation of the plot!

External Dependencies (preformatted)

By adding the option cache.extra, any trigger can be given to force re-evaluation of an unmodified chunk. In combination with the md5sum function from the tools package, this permits external file dependencies.

```{r setup}
library(knitr)
library(tools)
opts_chunk$set(message = FALSE, warning = FALSE, out.width = '75%', cache = TRUE)
```
```{r load_data, echo = FALSE, cache.extra = md5sum('worksheet-9.R')}
source('worksheet-9.R')
rodents <- subset(rodents, !is.na(weight))
```

A change to the rodents data frame, for example by dropping NAs at a more appropriate data cleaning step in the sourced .R script, will now be reflected in the clean_bar_plot result, which has dependson = 'load_data', but not in the bar_plot result.
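To see why an md5sum value works as a cache.extra trigger, the following minimal sketch shows that the checksum of a file changes whenever its contents change. It uses a throwaway temporary file with hypothetical contents rather than the lesson's worksheet-9.R.

```r
library(tools)

# A throwaway script standing in for an external dependency.
script <- tempfile(fileext = '.R')

writeLines('x <- 1', script)
before <- md5sum(script)   # checksum of the original contents

writeLines('x <- 2', script)
after <- md5sum(script)    # checksum after editing the file

# The two checksums differ, so a chunk option built from md5sum()
# takes a new value whenever the file is edited.
print(unname(before) != unname(after))
```

Because the value passed to cache.extra becomes part of what the cache is keyed on, an edited file invalidates the chunk's cache even though the chunk's own code is unchanged.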

  summarize(count = n(), weight = mean(weight, na.rm = TRUE))

Interact with Shiny

Enough about “smart” documents, what about “interactive”?

What is Shiny?

Shiny is a web application framework for R that allows you to create interactive apps for exploratory data analysis and visualization, to facilitate remote collaboration, share results, and much more.

---
output: ioslides_presentation
runtime: shiny_prerendered
---

Input and Output

The shiny package provides functions that generate two key types of content in the output document: input and output “widgets”. When the user changes an input, the corresponding output (e.g. a table or plot) responds dynamically.

Writing an interactive document requires careful attention to how your input and output objects relate to each other, i.e. knowing what actions will initiate what sections of code to run at what time.

Input Objects (preformatted)

Input objects collect information from the user and save it into a list variable called input. The value for any given named entity in the list updates when the user changes the input widget with the corresponding name.

```{r echo = FALSE}
selectInput('pick_species',
  label = 'Pick a Species',
  choices = unique(species[['id']]))
```

RStudio has a nice gallery of input objects and accompanying sample code.

Contexts (preformatted)

As shown in the figure above, an interactive document runs R code in multiple “contexts”; for example, while rendering the document and in the connected R process running on the server. The “data” context is a special context needed for cached chunk output that we want available to the server.

```{r load_data, echo = FALSE, cache.extra = md5sum('worksheet-9.R'), context = 'data'}
source('worksheet-9.R')
rodents <- subset(rodents, !is.na(weight))
```

You might have to clear (i.e. delete) the cache since we added the runtime.

Output objects (preformatted)

Output objects are created in the “server” context by several functions in the shiny package that produce a range of widgets.

```{r context = 'server'}
library(dplyr)
library(ggplot2)
output[['ts_plot']] <- renderPlot({
  animals %>%
    filter(species_id == input[['pick_species']]) %>%
    ggplot(aes(year)) + 
      geom_bar()
})
```
```{r echo = FALSE}
plotOutput('ts_plot')
```

Render Functions

Key functions for creating output objects:

  • renderPrint()
  • renderText()
  • renderPlot()
  • renderTable()
  • renderDataTable()

Reactivity (preformatted)

The output objects in an interactive document have to be understood in terms of reactivity: each one “knows” its content should react to certain changes in the environment, including to the input list.

Create additional environment-aware objects with reactive() from the shiny package. A useful type of reactive object for an efficient pipeline is the result of data manipulations, which can be calculated once and used multiple times.

```{r context = 'server'}
plot_data <- reactive({
  filter(animals,
         species_id == input[['pick_species']])
})
output[['react_ts_plot']] <- renderPlot({
  plot_data() %>%
    ggplot(aes(year)) +
      geom_bar()
})
```

Reactivity (preformatted)

Don’t forget to include your plot in the document!

```{r echo = FALSE}
plotOutput('react_ts_plot')
```

In the worked example, the step of filtering the animals data frame still only occurs once. In a scenario where the subset of animals were used for multiple computations or visualizations, creating the reactive plot_data() object makes a more efficient pipeline.
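The "calculated once and used multiple times" behaviour can be tried at the console without building a document. This is a sketch under assumptions: toy_animals is a hypothetical stand-in for the animals data frame, a reactiveVal() plays the role of input[['pick_species']], and the counter filter_runs exists only to show how often the filter runs. It relies on isolate() from the shiny package to read reactive values outside a running app.

```r
library(shiny)

# Hypothetical stand-in for the lesson's `animals` data frame.
toy_animals <- data.frame(species_id = c('DM', 'DM', 'PP'),
                          year = c(1990, 1991, 1990))

# Plays the role of input[['pick_species']].
pick_species <- reactiveVal('DM')

filter_runs <- 0
plot_data <- reactive({
  filter_runs <<- filter_runs + 1   # count how often the filter runs
  toy_animals[toy_animals$species_id == pick_species(), ]
})

# Two consumers read the same reactive; the filter runs only once.
n_first  <- isolate(nrow(plot_data()))
n_second <- isolate(nrow(plot_data()))
runs_before_change <- filter_runs

# Simulate the user choosing a different species: the reactive is
# invalidated and recomputed on its next use.
pick_species('PP')
n_after <- isolate(nrow(plot_data()))

print(c(runs_before_change, filter_runs))
```

In the interactive document, renderPlot() and any other render function that calls plot_data() would share the filtered result in the same way.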

Exercises

Exercise 1

Create a table with two columns, starting with a header row with fields “Character” and “Example”. Fill in the table with rows for the special Markdown characters *, **, ^, and ~~, providing an example of each.

Exercise 2

Display your presentation on GitHub. Your repository on GitHub includes a free web hosting service known as GitHub Pages. Publish your worksheet there with the following steps.

  • Remove any bits of the shiny runtime (GitHub only serves static pages).
  • Copy the HTML output file to docs/index.html.
  • Stage, commit & push the docs/index.html file to GitHub.
  • On GitHub, under Settings > GitHub Pages, set docs/ as the source.

Solutions

Solution 1

character | format
----------|--------------
*         | italics
**        | bold
^         | superscript
~~        | strikethrough

Solution 2

Just follow the instructions!