Why learn R?
Basic math
Data types and variables
Data structures
Load data into R
Base plotting
Writing your own functions
Distributions and Statistics
Additional resources and references

Why learn R?

High-level programming language designed for interactive statistical analysis and graphics
Easy to handle multiple types of data in the same object (data frames) including missing values
Large user community especially within ecology

Basic math

R had a lot of built-in mathematical functionality. Use common operators such as +,-,*,/,( ),^ and names like pi,max,range,sin,log and sqrt to use R for calculations.

1 + 2

## [1] 3

5/3

## [1] 1.666667

16/(sqrt(4))

## [1] 8

Data types and variables

Some of the common types of data R can handle are:

logical (true/false, 1 or 0)
integer
numeric
character
dates

Vectors

Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function c(). Use the operator : for a sequence of numbers (forwards or backwards), otherwise separate elements with commas.

c(1:10, 4:-4)

##  [1]  1  2  3  4  5  6  7  8  9 10  4  3  2  1  0 -1 -2 -3 -4

All elements of an vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.

c(1, 2, "three", 4)

## [1] "1"     "2"     "three" "4"

Assign a value to an object with the assignment operator using the syntax objectName <- value. Object names in R cannot start with a number or underscore.

x <- c(1+2,4:10)
x

## [1]  3  4  5  6  7  8  9 10

Operations in R are “vectorized” meaning that functions can be performed on all elements of a vector simultaneously.

Each item in a vector can be given the attribute of a names. The names can be given when you are creating a vector or afterwards using the names() function.

x <- c(dogs = 4, cats = 5, fish = 2)
x

## dogs cats fish 
##    4    5    2

names(x) <- c("plants", "animals", "rocks")
x

##  plants animals   rocks 
##       4       5       2

Subsetting

R’s subsetting capabilities can be accessed very concisely using square brackets [ ]. Identify elements of a vector using their numeric index or names inside of square brackets. Note that in R the first element of a vector has an index of 1.

Use in brackets	Subset instructions
positive integers	elements at the specified positions
negative integers	omit elements at the specified positions
logical vectors	select elements where the corresponding value is TRUE
nothing	return the original vector (all)

mynumbers <- c(1:10) # store vector with assignment operator
mynumbers

##  [1]  1  2  3  4  5  6  7  8  9 10

mynumbers[3]

## [1] 3

Lists

Lists like vectors but their elements can be of any data type or structure, including another list! You construct lists by using list() instead of c().

c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them.

Compare the results of list() and c()

x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)

## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4

str(y)

## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4

Subset lists using double brackets [[ ]] and either the name or index of an element of a list.

With lists, you can use subsetting + assignment + NULL to remove components from a list. To add a literal NULL to a list, use [ and list(NULL). Notice the difference in the structure of x and y:

x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)

## List of 1
##  $ a: num 1

y <- list(a = 1)
y["b"] <- list(NULL)
str(y)

## List of 2
##  $ a: num 1
##  $ b: NULL

Note that NULL removes items whereas NA is used to represent a missing value. A NULL item does not exist whereas an NA exists but does not have a value.

Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), âfactorâ, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.

Use factor() to create a vector with factors, or as.factor() to convert an existing vector to factors.

education <- factor(x = c("middle", "highschool", "college"), ordered = TRUE)
education

## [1] middle     highschool college   
## Levels: college < highschool < middle

Note that by default, character data is read in as factors when you load data into R. Later, we will use the argument stringsAsfactors = FALSE to suppress this behavior because it can cause confusion.

Data structures

Data can be stored in several types of data structures depending on its complexity.

Dimensions	Homogeneous	Heterogeneous
1d*	Vector	List
2d	Matrix	Data frame
nd	Array

*Note that vectors in R are not mathematical vectors and therefore there is no difference between row and column orientation.

Data frames

Data frames are 2-dimensional and can contain heterogenous data like numbers in one column and categories in another. It is the data structure most similar to a spreadsheet in Excel.

Data frames are a collection of equal-length vectors. This means that each column can contain a different type of data. Each row of a data frame should represent an observation.

Read more about well-structured (“tidy”) data frames here.

Combine x and animals into a data frame with the aptly named function data.frame(). Note the period between the words. Store your data frame as an object called my_df.

my_df <- data.frame(animals, x)
my_df

Some functions to get to know your data frame are:

function	returns
`dim()`	dimensions
`nrow()`	number of rows
`ncol()`	number of columns
`names()`	(column) names
`str()`	structure
`summary()`	summary info
`head()`	shows beginning rows

Subsetting for data frames

Just like vectors, data frames can be subset and manipulated with square brackets. The square brackets work as follows: anything before the comma refers to the rows that will be selected, anything after the comma refers to the number of columns that should be returned.

Review: Important symbols

Symbol	Meaning
`?`	get help
`c()`	combine
`#`	comment
`:`	sequence
`<-`	assignment
`[ ]`	selection

Load data into R

We will use the function read.table() that reads in a file by passing it the location of the file. The general syntax for the functions to read in data are to give the path to the file name, and then supply optinal additional arguments as necessary like specifying the type of data in each column. Specific file types can be read in using functions like read.csv() which are wrappers for the read.table() function that have different default settings.

Type a comma after read.table( and then press tab to see what arguments that this function takes. Hovering over each item in the list will show a description of that argument from the help documentation about that function. Specify the values to use for an argument using the syntax name = value.

read.table(file="%sandbox%/data/plots.csv", header = TRUE, sep = ",")

Use the assignment operator “<-” to store that data in memory and work with it

plots <- read.table(file=""%sandbox%/data/plots.csv", sep = ",", header = TRUE)
surveys <- read.csv(file=""%sandbox%/data/surveys.csv", sep = ",", header = TRUE)

You can specify what indicates missing data in the read.csv function using either na.strings = "NA" or na = "NA". You can also specify multiple things to be interpreted as missing values, such as na.strings = c("missing", "no data", "< 0.05 mg/L", "XX").

After reading in the Surveys and Plots csv files, let’s explore what types of data are in each column and what kind of structure your data has.

str(plots)
summary(plots)

str(surveys)
summary(surveys)

Each column in a data frame can be referred to using the $ operator and the data frame name and the column name. surveys$record_id refers to the record_id column in the surveys data frame.

Exercise: Fix each of the following common data frame subsetting errors:

plots[plots$plot_id = 4, ]
plots[-1:4, ]
plots[plots$plot_id <= 5]
plots[plots$plot_id == 4 | 6, ]

Base plotting

R has excellent plotting capabilities for many types of graphics. The plot() function is the most basic plotting function. It is polymorphic, ie. it uses the information you give it to determine what kind of plot to make.

For more advanced plotting such as multi-faceted plots, the libraries lattice and ggplot2 are excellent options.

Scatterplots

basic syntax is plot(x, y) or uses the formula notation plot(y ~ x)

plot(surveys$month, surveys$weight)
plot(surveys$year, surveys$weight)
plot(surveys$year, log(surveys$weight))

Histograms

hist(surveys$weight)
hist(log(surveys$weight))

Boxplots

Use a boxplot to compare the number of species seen each year.

par(mfrow=c(1,1))
boxplot(surveys$weight ~ surveys$year)
boxplot(surveys$weight ~ surveys$month)
boxplot(log(surveys$weight) ~ surveys$year)

Graphical parameters

par()

adjust line types, plotting characters, cex, x and y labels
plot different factors/types in different colors

Multi-panel plots can be made by changing the graphical parameters with the par() function.

surveys1990 <- subset(surveys, year == 1990)
surveys1996 <- subset(surveys, year == 1996)

par(mfrow=c(1,2))
hist(log(surveys1990$weight))
hist(log(surveys1996$weight))

Writing your own functions

Functions enable easy reuse within a project, helping you not to repeat yourself. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially the case for which a particular input always gives a particular output.

Three components of functions

name: descriptive name that does not start with a number
body: the code inside the function
arguments: control how you can call the function

A function needs to have a name, probably at least one argument (although it doesnât have to), and a body of code that does something. At the end it usually should (although doesnât have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function â they donât exist outside of the function. But you can âreturnâ the value of the object from the function, meaning pass the value of it into the global environment.

myfunction <- function(x) {
  # do something to x here
  # return a value of interest
}

The base R language does not have a function to calculate the standard error of the mean. Since this is a common statistical value of interest, let’s write a function to calculate the standard error. Recall that the standard error is calculated as the square root of the variance over the sample size.

The 3 functions needed for the standard error calculation are sqrt for square root, var for variance, and length for sample size. Calculate the standard error of the wgt column using these three functions.

sqrt(var(surveys$weight)/length(surveys$weight))

We can generalize the calculation that we made by storing it as a function called stderr. The calculation that we made above goes into the body of the function:

stderr <- function(x){
  # this function returns the standard error of the mean
  sqrt(var(x)/length(x))
  }

Let’s practice!

say_hello <- function(){
    print("Hello, world!")
}

Run the function using the name of the function followed by an opening and closing parenthesis ().

Now let’s modify the say_hello function to take an argument.

Note that the sprintf() function is a convenient way to replace parts of a string with variables.

say_hello <- function(name){
    print(sprintf("Hello, %s!", name))
}

Distributions and Statistics

Since it is designed for statistics, R can easily draw random numbers from statistical distributions and calculate distribution values.

To generate random numbers from a normal distribution, use the function rnorm()

ten_random_values <- rnorm(n = 10)

Function	Returns	Notes
rnorm	Draw random numbers from normal distribution	Specify `n`, `mean`, `sd`
pnorm	Estimate probability of a specific number occuring
qnorm	Cumulative probability that a given number or smaller occurs	left-tailed by default
dnorm	Returns quantile given a cumulative probability	opposite of pnorm

Statistical distributions and their functions See Table 14.1 in R for Everyone by Jared Lander for a full table

Distribution	Random Number	Density	Distribution	Quantile
Normal	rnorm	dnorm	pnorm	qnorm
Binomial	rbinom	dbinom	pbinom	qbinom
Poisson	rpois	dpois	ppois	qpois
Gamma	rgamma	dgamma	pgamma	qgamma
Exponential	rexp	dexp	pexp	qexp
Uniform	runif	dunif	punif	qunif
Logistic	rlogis	dlogis	plogis	qlogis

R has built in functions for handling many statistical tests.

x <- rnorm(n = 100, mean = 25, sd = 7)
y <- rbinom(n = 100, size = 50, prob = .85)

t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -23.164, df = 122.32, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.48857 -15.57740
## sample estimates:
## mean of x mean of y 
##  25.33702  42.37000

Perform a linear regression using the lm() function and the formula notation y ~ x. Save the results of the model to view more details than the default output of the model.

my_model <- lm(y ~ x)
summary(my_model)

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.008 -1.024  0.498  1.535  4.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.03128    0.90628  45.274   <2e-16 ***
## x            0.05284    0.03451   1.531    0.129    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.386 on 98 degrees of freedom
## Multiple R-squared:  0.02336,    Adjusted R-squared:  0.0134 
## F-statistic: 2.345 on 1 and 98 DF,  p-value: 0.1289

Challenge: Plot a linear regression line over a scatterplot and include the p-value of the regression in the plot’s title. Hint: View the structure of the model output to determine how to access the p-value.

Additional resources and references

Advanced libraries for

Purpose	Package(s)
Graphics	ggplot2
Dates and times	lubridate, chron
Data manipulation	tidyr, dplyr
String manipulation	stringr
Reading in data	readr, readxl

Jenny Bryan’s Stat 545 class materials index of topics from which some of this material was adopted
keyboard shortcuts in RStudio
R Graph catalog
Intro R Shiny app
Learn R within R using the swirl package

Bas(e)ic R

Kelly Hondula