R has a lot of built-in mathematical functionality. Use common operators such as +, -, *, /, ( ), ^ and names like pi, max, range, sin, log and sqrt to use R for calculations.
1 + 2
## [1] 3
5/3
## [1] 1.666667
16/(sqrt(4))
## [1] 8
Some of the common types of data R can handle are:
Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function c(). Use the operator : for a sequence of numbers (forwards or backwards), otherwise separate elements with commas.
c(1:10, 4:-4)
## [1] 1 2 3 4 5 6 7 8 9 10 4 3 2 1 0 -1 -2 -3 -4
All elements of a vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.
c(1, 2, "three", 4)
## [1] "1" "2" "three" "4"
Assign a value to an object with the assignment operator using the syntax objectName <- value. Object names in R cannot start with a number or underscore.
x <- c(1+2,4:10)
x
## [1] 3 4 5 6 7 8 9 10
Operations in R are “vectorized” meaning that functions can be performed on all elements of a vector simultaneously.
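As a minimal sketch of what vectorization looks like in practice (the object name v and its values are made up for illustration), arithmetic and math functions apply to every element without writing a loop:

```r
# Hypothetical example vector (not part of the workshop data)
v <- c(3, 4, 5)

doubled <- v * 2               # multiplies every element by 2
summed  <- v + c(10, 20, 30)   # adds the two vectors element by element
roots   <- sqrt(c(4, 9, 16))   # sqrt() is applied to each element

doubled   # 6 8 10
summed    # 13 24 35
roots     # 2 3 4
```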
Each item in a vector can be given a name. The names can be given when you are creating a vector or afterwards using the names() function.
x <- c(dogs = 4, cats = 5, fish = 2)
x
## dogs cats fish
## 4 5 2
names(x) <- c("plants", "animals", "rocks")
x
## plants animals rocks
## 4 5 2
R’s subsetting capabilities can be accessed very concisely using square brackets [ ]. Identify elements of a vector using their numeric index or names inside of square brackets. Note that in R the first element of a vector has an index of 1.
Use in brackets | Subset instructions |
---|---|
positive integers | elements at the specified positions |
negative integers | omit elements at the specified positions |
logical vectors | select elements where the corresponding value is TRUE |
nothing | return the original vector (all) |
mynumbers <- c(1:10) # store vector with assignment operator
mynumbers
## [1] 1 2 3 4 5 6 7 8 9 10
mynumbers[3]
## [1] 3
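The other rows of the subsetting table can be tried the same way. The sketch below reuses mynumbers; the pets vector and the names of the result objects are hypothetical and only there for illustration:

```r
mynumbers <- c(1:10)

first_and_fifth <- mynumbers[c(1, 5)]        # positive integers: keep positions 1 and 5
without_first3  <- mynumbers[-(1:3)]         # negative integers: drop positions 1 to 3
greater_than_7  <- mynumbers[mynumbers > 7]  # logical vector: keep elements where TRUE
everything      <- mynumbers[]               # nothing: return the whole vector

# Named elements can also be selected by name (hypothetical vector)
pets <- c(dogs = 4, cats = 5, fish = 2)
cat_count <- pets["cats"]

first_and_fifth   # 1 5
without_first3    # 4 5 6 7 8 9 10
greater_than_7    # 8 9 10
cat_count         # named element: cats 5
```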
Lists are like vectors, but their elements can be of any data type or structure, including another list! You construct lists by using list() instead of c(). c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them.
Compare the results of list() and c():
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 2
## $ : num [1:2] 3 4
str(y)
## List of 4
## $ : num 1
## $ : num 2
## $ : num 3
## $ : num 4
Subset lists using double brackets [[ ]] and either the name or index of an element of a list.
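A short sketch of list subsetting, using a made-up list called pet. It also shows single brackets for contrast: they return a smaller list rather than the element itself.

```r
# Hypothetical list for illustration
pet <- list(name = "Rex", ages = c(3, 7))

by_name  <- pet[["ages"]]   # the element itself: numeric vector 3 7
by_index <- pet[[1]]        # the first element: "Rex"
as_list  <- pet["ages"]     # single brackets: a list of length 1

str(by_name)
str(by_index)
str(as_list)
```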
With lists, you can use subsetting + assignment + NULL to remove components from a list. To add a literal NULL to a list, use [ and list(NULL). Notice the difference in the structure of x and y:
x <- list(a = 1, b = 2)
x[["b"]] <- NULL
str(x)
## List of 1
## $ a: num 1
y <- list(a = 1)
y["b"] <- list(NULL)
str(y)
## List of 2
## $ a: num 1
## $ b: NULL
Note that NULL removes items whereas NA is used to represent a missing value. A NULL item does not exist whereas an NA exists but does not have a value.
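The difference shows up clearly with length() and when combining values. This is a small sketch with made-up object names:

```r
length(NULL)   # 0: NULL holds nothing
length(NA)     # 1: NA occupies a position

with_null <- c(1, NULL, 3)   # NULL is dropped when combining
with_na   <- c(1, NA, 3)     # NA keeps its place

length(with_null)            # 2
length(with_na)              # 3
is.na(with_na)               # FALSE TRUE FALSE
mean(with_na)                # NA, because one value is unknown
mean(with_na, na.rm = TRUE)  # 2, after ignoring the missing value
```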
A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.
Use factor() to create a factor, or as.factor() to convert an existing vector to a factor.
education <- factor(x = c("middle", "highschool", "college"), ordered = TRUE)
education
## [1] middle highschool college
## Levels: college < highschool < middle
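In the output above the levels are sorted alphabetically, which is probably not the intended ordering for education. A minimal sketch of setting the order explicitly with the levels argument (the lowest-to-highest ordering chosen here is an assumption about what is meant):

```r
education <- factor(
  x = c("middle", "highschool", "college"),
  levels = c("middle", "highschool", "college"),  # assumed order, lowest to highest
  ordered = TRUE
)

levels(education)       # "middle" "highschool" "college"
as.integer(education)   # 1 2 3, the underlying integer codes

# Ordered factors support comparisons against a level
below_college <- education < "college"
below_college           # TRUE TRUE FALSE
```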
Note that in versions of R before 4.0.0, character data is read in as factors by default when you load data into R. Later, we will use the argument stringsAsFactors = FALSE to suppress this behavior because it can cause confusion.
Data can be stored in several types of data structures depending on its complexity.
Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d* | Vector | List |
2d | Matrix | Data frame |
nd | Array | |
*Note that vectors in R are not mathematical vectors and therefore there is no difference between row and column orientation.
Data frames are 2-dimensional and can contain heterogeneous data like numbers in one column and categories in another. A data frame is the data structure most similar to a spreadsheet in Excel.
Data frames are a collection of equal-length vectors. This means that each column can contain a different type of data. Each row of a data frame should represent an observation.
Read more about well-structured (“tidy”) data frames here.
Combine x and animals into a data frame with the aptly named function data.frame(). Note the period between the words. Store your data frame as an object called my_df.
my_df <- data.frame(animals, x)
my_df
Some functions to get to know your data frame are:
function | returns |
---|---|
dim() | dimensions |
nrow() | number of rows |
ncol() | number of columns |
names() | (column) names |
str() | structure |
summary() | summary info |
head() | shows beginning rows |
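To see these functions in action without the workshop files, here is a small sketch on a made-up data frame. The vectors animals and counts and the name example_df are hypothetical stand-ins.

```r
# Hypothetical vectors standing in for the workshop's objects
animals <- c("dog", "cat", "fish")
counts  <- c(4, 5, 2)
example_df <- data.frame(animals, counts)

dim(example_df)           # 3 2 (rows, then columns)
nrow(example_df)          # 3
ncol(example_df)          # 2
names(example_df)         # "animals" "counts"
str(example_df)           # type and first values of each column
summary(example_df)       # per-column summary
head(example_df, n = 2)   # first two rows
```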
Just like vectors, data frames can be subset and manipulated with square brackets. The square brackets work as follows: anything before the comma refers to the rows that will be selected, anything after the comma refers to the columns that should be returned.
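The row-comma-column pattern might look like this on a small made-up data frame (sites and its column names are hypothetical, not the workshop data):

```r
# Hypothetical data frame; column names are illustrative only
sites <- data.frame(
  site_id   = 1:4,
  habitat   = c("grass", "shrub", "grass", "forest"),
  elevation = c(120, 340, 95, 610)
)

second_row  <- sites[2, ]                           # row 2, all columns
habitat_col <- sites[, "habitat"]                   # all rows, one column
high_sites  <- sites[sites$elevation > 300, ]       # rows that meet a condition
first_two   <- sites[1:2, c("site_id", "habitat")]  # rows 1 to 2, two columns

second_row
high_sites
first_two
```

Leaving the space before or after the comma empty means "all rows" or "all columns" respectively.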
Symbol | Meaning |
---|---|
? | get help |
c() | combine |
# | comment |
: | sequence |
<- | assignment |
[ ] | selection |
We will use the function read.table(), which reads in a file when passed the location of the file. The general syntax for the functions that read in data is to give the path to the file, and then supply optional additional arguments as necessary, like specifying the type of data in each column. Specific file types can be read in using functions like read.csv(), which are wrappers for the read.table() function that have different default settings.
Type a comma after read.table( and then press tab to see what arguments this function takes. Hovering over each item in the list will show a description of that argument from the help documentation about that function. Specify the values to use for an argument using the syntax name = value.
read.table(file="%sandbox%/data/plots.csv", header = TRUE, sep = ",")
Use the assignment operator “<-” to store that data in memory and work with it.
plots <- read.table(file="%sandbox%/data/plots.csv", sep = ",", header = TRUE)
surveys <- read.csv(file="%sandbox%/data/surveys.csv", sep = ",", header = TRUE)
You can specify what indicates missing data in the read.csv function using either na.strings = "NA" or na = "NA". You can also specify multiple things to be interpreted as missing values, such as na.strings = c("missing", "no data", "< 0.05 mg/L", "XX").
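A self-contained sketch of na.strings at work. Instead of a file it reads a few made-up lines through the text argument, so the data, column names, and object names here are all hypothetical:

```r
# Made-up CSV content with two different markers for missing values
csv_text <- "site,nitrate
A,0.12
B,missing
C,XX
D,0.30"

water <- read.csv(text = csv_text, na.strings = c("missing", "XX"))

water
is.na(water$nitrate)   # FALSE TRUE TRUE FALSE
class(water$nitrate)   # "numeric", since the markers became NA
```

Without na.strings, the nitrate column would be read as text because "missing" and "XX" are not numbers.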
After reading in the Surveys and Plots csv files, let’s explore what types of data are in each column and what kind of structure your data has.
str(plots)
summary(plots)
str(surveys)
summary(surveys)
Each column in a data frame can be referred to using the $ operator and the data frame name and the column name. surveys$record_id refers to the record_id column in the surveys data frame.
Exercise: Fix each of the following common data frame subsetting errors:
plots[plots$plot_id = 4, ]
plots[-1:4, ]
plots[plots$plot_id <= 5]
plots[plots$plot_id == 4 | 6, ]
R has excellent plotting capabilities for many types of graphics. The plot() function is the most basic plotting function. It is polymorphic, i.e., it uses the information you give it to determine what kind of plot to make.
For more advanced plotting such as multi-faceted plots, the libraries lattice and ggplot2 are excellent options.
The basic syntax is plot(x, y), or use the formula notation plot(y ~ x).
plot(surveys$month, surveys$weight)
plot(surveys$year, surveys$weight)
plot(surveys$year, log(surveys$weight))
hist(surveys$weight)
hist(log(surveys$weight))
Use a boxplot to compare the distribution of weights across years or months.
par(mfrow=c(1,1))
boxplot(surveys$weight ~ surveys$year)
boxplot(surveys$weight ~ surveys$month)
boxplot(log(surveys$weight) ~ surveys$year)
par()
Multi-panel plots can be made by changing the graphical parameters with the par() function.
surveys1990 <- subset(surveys, year == 1990)
surveys1996 <- subset(surveys, year == 1996)
par(mfrow=c(1,2))
hist(log(surveys1990$weight))
hist(log(surveys1996$weight))
Functions enable easy reuse within a project, helping you not to repeat yourself. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.
If your calculations are performed through a series of functions, then the project becomes more modular and easier to change. This is especially true for functions where a particular input always gives a particular output.
Three components of functions
A function needs to have a name, probably at least one argument (although it doesn’t have to), and a body of code that does something. At the end it usually should (although doesn’t have to) return an object out of the function. The important idea behind functions is that objects that are created within the function are local to the environment of the function; they don’t exist outside of the function. But you can “return” the value of the object from the function, meaning pass the value of it into the global environment.
myfunction <- function(x) {
# do something to x here
# return a value of interest
}
The base R language does not have a function to calculate the standard error of the mean. Since this is a common statistical value of interest, let’s write a function to calculate the standard error. Recall that the standard error is calculated as the square root of the variance over the sample size.
The three functions needed for the standard error calculation are sqrt for square root, var for variance, and length for sample size. Calculate the standard error of the weight column using these three functions.
sqrt(var(surveys$weight)/length(surveys$weight))
We can generalize the calculation that we made by storing it as a function called stderr. The calculation that we made above goes into the body of the function:
stderr <- function(x){
# this function returns the standard error of the mean
sqrt(var(x)/length(x))
}
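Two practical wrinkles are worth a sketch. First, var() returns NA when the input contains missing values, so a column with NAs gives NA as its standard error. Second, stderr is also the name of an existing base R function (the standard error output connection), so defining your own stderr masks it. The variant below is one possible way around both. The name std_error, the na.rm argument, and the example weights are assumptions made for illustration, not part of the original lesson.

```r
# Hypothetical variant: different name, optional removal of missing values
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]   # drop NAs so length() counts only observed values
  }
  sqrt(var(x) / length(x))
}

weights <- c(40, 48, NA, 29, 46)    # made-up values with one missing entry

std_error(weights)                  # NA, because var() propagates the NA
std_error(weights, na.rm = TRUE)    # standard error of the four observed values
```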
Let’s practice!
say_hello <- function(){
print("Hello, world!")
}
Run the function using the name of the function followed by an opening and closing parenthesis ().
Now let’s modify the say_hello function to take an argument.
Note that the sprintf() function is a convenient way to replace parts of a string with variables.
say_hello <- function(name){
print(sprintf("Hello, %s!", name))
}
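As a small variation, a function can return the string instead of printing it, and an argument can have a default value. The function name make_greeting and the default "world" are hypothetical choices for this sketch:

```r
make_greeting <- function(name = "world") {
  # sprintf() fills the %s placeholder with the value of name;
  # the last expression evaluated is the function's return value
  sprintf("Hello, %s!", name)
}

make_greeting()        # "Hello, world!"
make_greeting("Ada")   # "Hello, Ada!"
```

Returning the value makes it possible to store or reuse the greeting, for example greeting <- make_greeting("Ada").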
Since it is designed for statistics, R can easily draw random numbers from statistical distributions and calculate distribution values.
To generate random numbers from a normal distribution, use the function rnorm().
ten_random_values <- rnorm(n = 10)
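Because the draws are random, the values change on every run. If you need the same numbers again, for example to reproduce an analysis, set.seed() fixes the starting point of the random number generator. A minimal sketch, where the seed value 42 and the object names are arbitrary:

```r
set.seed(42)                                    # any integer works as a seed
first_draw <- rnorm(n = 5, mean = 10, sd = 2)

set.seed(42)                                    # same seed, same sequence
second_draw <- rnorm(n = 5, mean = 10, sd = 2)

identical(first_draw, second_draw)              # TRUE
```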
Function | Returns | Notes |
---|---|---|
rnorm | Draw random numbers from a normal distribution | Specify n, mean, sd |
dnorm | Probability density at a specific value | |
pnorm | Cumulative probability that a given number or smaller occurs | left-tailed by default |
qnorm | Returns quantile given a cumulative probability | opposite of pnorm |
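A quick way to keep the density, cumulative probability, and quantile functions straight is to run them on the standard normal distribution (mean 0, sd 1, the defaults). The object names below are made up for this sketch:

```r
density_at_0 <- dnorm(0)                         # height of the density curve at 0
prob_below_0 <- pnorm(0)                         # P(X <= 0), which is 0.5
median_value <- qnorm(0.5)                       # value with 50% of the mass below it: 0
round_trip   <- qnorm(pnorm(1.5))                # qnorm undoes pnorm, giving back 1.5
upper_tail   <- pnorm(1.5, lower.tail = FALSE)   # P(X > 1.5) instead of P(X <= 1.5)

density_at_0
prob_below_0
median_value
round_trip
upper_tail
```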
Statistical distributions and their functions. See Table 14.1 in R for Everyone by Jared Lander for a full table.
Distribution | Random Number | Density | Distribution | Quantile |
---|---|---|---|---|
Normal | rnorm | dnorm | pnorm | qnorm |
Binomial | rbinom | dbinom | pbinom | qbinom |
Poisson | rpois | dpois | ppois | qpois |
Gamma | rgamma | dgamma | pgamma | qgamma |
Exponential | rexp | dexp | pexp | qexp |
Uniform | runif | dunif | punif | qunif |
Logistic | rlogis | dlogis | plogis | qlogis |
R has built-in functions for handling many statistical tests.
x <- rnorm(n = 100, mean = 25, sd = 7)
y <- rbinom(n = 100, size = 50, prob = .85)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = -23.164, df = 122.32, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -18.48857 -15.57740
## sample estimates:
## mean of x mean of y
## 25.33702 42.37000
Perform a linear regression using the lm() function and the formula notation y ~ x. Save the results of the model to view more details than the default output of the model.
my_model <- lm(y ~ x)
summary(my_model)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.008 -1.024 0.498 1.535 4.930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.03128 0.90628 45.274 <2e-16 ***
## x 0.05284 0.03451 1.531 0.129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.386 on 98 degrees of freedom
## Multiple R-squared: 0.02336, Adjusted R-squared: 0.0134
## F-statistic: 2.345 on 1 and 98 DF, p-value: 0.1289
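Saving the model also lets you pull individual pieces out of it rather than reading them off the printed summary. The sketch below uses a small made-up data set (x_demo, y_demo, and the other object names are hypothetical) so that it runs on its own:

```r
# Made-up data: y is roughly 3 + 2 * x with a little fixed noise
x_demo <- 1:10
y_demo <- 3 + 2 * x_demo + c(0.2, -0.1, 0.1, -0.2, 0, 0.1, -0.1, 0.2, -0.2, 0)

demo_model    <- lm(y_demo ~ x_demo)
model_summary <- summary(demo_model)

coef(demo_model)                    # intercept and slope estimates
model_summary$r.squared             # proportion of variance explained
model_summary$coefficients          # matrix of estimates, std. errors, t and p values
str(model_summary, max.level = 1)   # overview of everything stored in the summary
```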
Challenge: Plot a linear regression line over a scatterplot and include the p-value of the regression in the plot’s title. Hint: View the structure of the model output to determine how to access the p-value.
Advanced libraries for
Purpose | Package(s) |
---|---|
Graphics | ggplot2 |
Dates and times | lubridate, chron |
Data manipulation | tidyr, dplyr |
String manipulation | stringr |
Reading in data | readr, readxl |