Why learn R?

Basic math

R had a lot of built-in mathematical functionality. Use common operators such as +,-,*,/,( ),^ and names like pi,max,range,sin,log and sqrt to use R for calculations.

1 + 2
## [1] 3
5/3
## [1] 1.666667
16/(sqrt(4))
## [1] 8

Data types and variables

Some of the common types of data R can handle are:

Vectors

Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function c(). Use the operator : for a sequence of numbers (forwards or backwards), otherwise separate elements with commas.

c(1:10, 4:-4)
##  [1]  1  2  3  4  5  6  7  8  9 10  4  3  2  1  0 -1 -2 -3 -4

All elements of an vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.

c(1, 2, "three", 4)
## [1] "1"     "2"     "three" "4"

Assign a value to an object with the assignment operator using the syntax objectName <- value. Object names in R cannot start with a number or underscore.

x <- c(1+2,4:10)
x
## [1]  3  4  5  6  7  8  9 10

Operations in R are “vectorized” meaning that functions can be performed on all elements of a vector simultaneously.

Each item in a vector can be given the attribute of a names. The names can be given when you are creating a vector or afterwards using the names() function.

x <- c(dogs = 4, cats = 5, fish = 2)
x
## dogs cats fish 
##    4    5    2
names(x) <- c("plants", "animals", "rocks")
x
##  plants animals   rocks 
##       4       5       2

Subsetting

R’s subsetting capabilities can be accessed very concisely using square brackets [ ]. Identify elements of a vector using their numeric index or names inside of square brackets. Note that in R the first element of a vector has an index of 1.

Use in brackets Subset instructions
positive integers elements at the specified positions
negative integers omit elements at the specified positions
logical vectors select elements where the corresponding value is TRUE
nothing return the original vector (all)
mynumbers <- c(1:10) # store vector with assignment operator
mynumbers
##  [1]  1  2  3  4  5  6  7  8  9 10
mynumbers[3]
## [1] 3

Lists

Lists like vectors but their elements can be of any data type or structure, including another list! You construct lists by using list() instead of c().

c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them.

Compare the results of list() and c()

x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
## List of 2
##  $ :List of 2
##   ..$ : num 1
##   ..$ : num 2
##  $ : num [1:2] 3 4
str(y)
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ : num 4

Subset lists using double brackets [[ ]] and either the name or index of an element of a list.

With lists, you can use subsetting + assignment + NULL to remove components from a list. To add a literal NULL to a list, use [ and list(NULL). Notice the difference in the structure of x and y:

x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)
## List of 1
##  $ a: num 1
y <- list(a = 1)
y["b"] <- list(NULL)
str(y)
## List of 2
##  $ a: num 1
##  $ b: NULL

Note that NULL removes items whereas NA is used to represent a missing value. A NULL item does not exist whereas an NA exists but does not have a value.

Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.

Use factor() to create a vector with factors, or as.factor() to convert an existing vector to factors.

education <- factor(x = c("middle", "highschool", "college"), ordered = TRUE)
education
## [1] middle     highschool college   
## Levels: college < highschool < middle

Note that by default, character data is read in as factors when you load data into R. Later, we will use the argument stringsAsfactors = FALSE to suppress this behavior because it can cause confusion.

Data structures

Data can be stored in several types of data structures depending on its complexity.

Dimensions Homogeneous Heterogeneous
1d* Vector List
2d Matrix Data frame
nd Array

*Note that vectors in R are not mathematical vectors and therefore there is no difference between row and column orientation.

Data frames

Data frames are 2-dimensional and can contain heterogenous data like numbers in one column and categories in another. It is the data structure most similar to a spreadsheet in Excel.

Data frames are a collection of equal-length vectors. This means that each column can contain a different type of data. Each row of a data frame should represent an observation.

Read more about well-structured (“tidy”) data frames here.

Combine x and animals into a data frame with the aptly named function data.frame(). Note the period between the words. Store your data frame as an object called my_df.

my_df <- data.frame(animals, x)
my_df

Some functions to get to know your data frame are:

function returns
dim() dimensions
nrow() number of rows
ncol() number of columns
names() (column) names
str() structure
summary() summary info
head() shows beginning rows

Subsetting for data frames

Just like vectors, data frames can be subset and manipulated with square brackets. The square brackets work as follows: anything before the comma refers to the rows that will be selected, anything after the comma refers to the number of columns that should be returned.

Review: Important symbols

Symbol Meaning
? get help
c() combine
# comment
: sequence
<- assignment
[ ] selection

Load data into R

We will use the function read.table() that reads in a file by passing it the location of the file. The general syntax for the functions to read in data are to give the path to the file name, and then supply optinal additional arguments as necessary like specifying the type of data in each column. Specific file types can be read in using functions like read.csv() which are wrappers for the read.table() function that have different default settings.

Type a comma after read.table( and then press tab to see what arguments that this function takes. Hovering over each item in the list will show a description of that argument from the help documentation about that function. Specify the values to use for an argument using the syntax name = value.

read.table(file="%sandbox%/data/plots.csv", header = TRUE, sep = ",")

Use the assignment operator “<-” to store that data in memory and work with it

plots <- read.table(file=""%sandbox%/data/plots.csv", sep = ",", header = TRUE)
surveys <- read.csv(file=""%sandbox%/data/surveys.csv", sep = ",", header = TRUE)

You can specify what indicates missing data in the read.csv function using either na.strings = "NA" or na = "NA". You can also specify multiple things to be interpreted as missing values, such as na.strings = c("missing", "no data", "< 0.05 mg/L", "XX").

After reading in the Surveys and Plots csv files, let’s explore what types of data are in each column and what kind of structure your data has.

str(plots)
summary(plots)

str(surveys)
summary(surveys)

Each column in a data frame can be referred to using the $ operator and the data frame name and the column name. surveys$record_id refers to the record_id column in the surveys data frame.

Exercise: Fix each of the following common data frame subsetting errors:

plots[plots$plot_id = 4, ]
plots[-1:4, ]
plots[plots$plot_id <= 5]
plots[plots$plot_id == 4 | 6, ]

Base plotting

R has excellent plotting capabilities for many types of graphics. The plot() function is the most basic plotting function. It is polymorphic, ie. it uses the information you give it to determine what kind of plot to make.

For more advanced plotting such as multi-faceted plots, the libraries lattice and ggplot2 are excellent options.

Scatterplots

basic syntax is plot(x, y) or uses the formula notation plot(y ~ x)

plot(surveys$month, surveys$weight)
plot(surveys$year, surveys$weight)
plot(surveys$year, log(surveys$weight))

Histograms

hist(surveys$weight)
hist(log(surveys$weight))

Boxplots

Use a boxplot to compare the number of species seen each year.

par(mfrow=c(1,1))
boxplot(surveys$weight ~ surveys$year)
boxplot(surveys$weight ~ surveys$month)
boxplot(log(surveys$weight) ~ surveys$year)

Graphical parameters

par()

  • adjust line types, plotting characters, cex, x and y labels
  • plot different factors/types in different colors

Multi-panel plots can be made by changing the graphical parameters with the par() function.

surveys1990 <- subset(surveys, year == 1990)
surveys1996 <- subset(surveys, year == 1996)

par(mfrow=c(1,2))
hist(log(surveys1990$weight))
hist(log(surveys1996$weight))

Writing your own functions

Functions enable easy reuse within a project, helping you not to repeat yourself. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially the case for which a particular input always gives a particular output.

Three components of functions

A function needs to have a name, probably at least one argument (although it doesn’t have to), and a body of code that does something. At the end it usually should (although doesn’t have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function – they don’t exist outside of the function. But you can “return” the value of the object from the function, meaning pass the value of it into the global environment.

myfunction <- function(x) {
  # do something to x here
  # return a value of interest
}

The base R language does not have a function to calculate the standard error of the mean. Since this is a common statistical value of interest, let’s write a function to calculate the standard error. Recall that the standard error is calculated as the square root of the variance over the sample size.

The 3 functions needed for the standard error calculation are sqrt for square root, var for variance, and length for sample size. Calculate the standard error of the wgt column using these three functions.

sqrt(var(surveys$weight)/length(surveys$weight))

We can generalize the calculation that we made by storing it as a function called stderr. The calculation that we made above goes into the body of the function:

stderr <- function(x){
  # this function returns the standard error of the mean
  sqrt(var(x)/length(x))
  }

Let’s practice!

say_hello <- function(){
    print("Hello, world!")
}

Run the function using the name of the function followed by an opening and closing parenthesis ().

Now let’s modify the say_hello function to take an argument.

Note that the sprintf() function is a convenient way to replace parts of a string with variables.

say_hello <- function(name){
    print(sprintf("Hello, %s!", name))
}

Distributions and Statistics

Since it is designed for statistics, R can easily draw random numbers from statistical distributions and calculate distribution values.

To generate random numbers from a normal distribution, use the function rnorm()

ten_random_values <- rnorm(n = 10)
Function Returns Notes
rnorm Draw random numbers from normal distribution Specify n, mean, sd
pnorm Estimate probability of a specific number occuring
qnorm Cumulative probability that a given number or smaller occurs left-tailed by default
dnorm Returns quantile given a cumulative probability opposite of pnorm

Statistical distributions and their functions See Table 14.1 in R for Everyone by Jared Lander for a full table

Distribution Random Number Density Distribution Quantile
Normal rnorm dnorm pnorm qnorm
Binomial rbinom dbinom pbinom qbinom
Poisson rpois dpois ppois qpois
Gamma rgamma dgamma pgamma qgamma
Exponential rexp dexp pexp qexp
Uniform runif dunif punif qunif
Logistic rlogis dlogis plogis qlogis

R has built in functions for handling many statistical tests.

x <- rnorm(n = 100, mean = 25, sd = 7)
y <- rbinom(n = 100, size = 50, prob = .85)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -23.164, df = 122.32, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -18.48857 -15.57740
## sample estimates:
## mean of x mean of y 
##  25.33702  42.37000

Perform a linear regression using the lm() function and the formula notation y ~ x. Save the results of the model to view more details than the default output of the model.

my_model <- lm(y ~ x)
summary(my_model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.008 -1.024  0.498  1.535  4.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 41.03128    0.90628  45.274   <2e-16 ***
## x            0.05284    0.03451   1.531    0.129    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.386 on 98 degrees of freedom
## Multiple R-squared:  0.02336,    Adjusted R-squared:  0.0134 
## F-statistic: 2.345 on 1 and 98 DF,  p-value: 0.1289

Challenge: Plot a linear regression line over a scatterplot and include the p-value of the regression in the plot’s title. Hint: View the structure of the model output to determine how to access the p-value.

Additional resources and references

Advanced libraries for

Purpose Package(s)
Graphics ggplot2
Dates and times lubridate, chron
Data manipulation tidyr, dplyr
String manipulation stringr
Reading in data readr, readxl