# Bas(e)ic R

Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.

## Contents

- Why learn R?
- The Console
- The Editor
- Data types
- Multi-dimensional data structures
- Load data into R
- Parts of an Object
- Base plotting
- Creating functions
- Distributions and Statistics
- Flow control
- Reminder on important symbols
- Exercise solutions

## Why learn R?

- High-level programming language good for interactive statistical analysis
- General purpose programming language for scripting entire data-processing pipelines
- Large selection of add-on packages that extend the capabilities of “base R”
- Large user community especially within statistics and ecology
- Open source

## The Console

The interpreter accepts R commands interactively through the console. Basic math, as you would type it on a calculator, is usually a valid command in the R language:

```
1 + 2
```

```
[1] 3
```

```
5/3
```

```
[1] 1.666667
```

```
4^2
```

```
[1] 16
```

- Question
- Why is the output prefixed by `[1]`?
- Answer
- That’s the index, or position in a vector, of the first result.

A command giving a vector of results shows this clearly:

```
seq(1, 20)
```

```
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
```

The interpreter understands more than arithmetic operations!
The last command was to use (or “call”) the **function** `seq()`.
Most of “learning R” involves getting to know a whole lot of functions, the effect of each function’s arguments (e.g. the input values 1 and 20), and what each function returns (e.g. the output vector).

We can expand the vocabulary known to the R interpreter by creating a **variable**.
Using the symbol `<-` is referred to as assignment: we assign the output of any command to the right of `<-` to any **variable** written to its left.

```
x <- seq(1, 20)
```

You’ll notice that nothing prints to the console, because we assigned the output to a variable.
We can print the value of `x` by evaluating it without assignment.

```
x
```

```
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
```

Assigning values to new variables is the only time you can reference something previously unknown to the interpreter, and only to the left of `<-`!
All other commands must reference things already in the interpreter’s vocabulary.

When you start a new session, the R interpreter already knows many things, including

- any number
- any string of characters
- operators that are universal (e.g. `+` or `/`) and specific to R (e.g. `$` or `%*%`)
- functions in `base R`

To reference a number or function you just type it in as above, but to reference a string of characters you must surround it with quotation marks.

```
'ab.cd'
```

```
[1] "ab.cd"
```

- Question
- Is it better to use `'` or `"`?
- Answer
- Neither one is better. You will often encounter stylistic choices like this, so if you don’t have a personal preference try to mimic existing styles.

Without quotation marks, the interpreter checks for things named `ab.cd` and doesn’t find anything:

```
ab.cd
```

```
Error in eval(expr, envir, enclos): object 'ab.cd' not found
```

Anything you assign to a variable becomes known to R, so you can refer to it later.

```
y <- 'ab.cd'
typeof(y)
```

```
[1] "character"
```

## Basic math

The R language includes a lot of built-in mathematical functionality:

- binary operators `+`, `-`, `*`, `/`, and `^` (for raising to a power)
- “smooth” functions like `sin`, `log` and `sqrt`
- additional functions like `max` and `range`
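A few of these in action, as a minimal sketch. The input values and variable names are arbitrary and chosen only for illustration:

```r
# Binary operators follow the usual order of operations
a <- 2 + 3 * 4^2        # 2 + (3 * 16) = 50

# "Smooth" functions take a number and return a number
b <- sqrt(16)           # 4
undone <- log(exp(2))   # the natural log undoes exp(), giving 2

# max() and range() summarize all the values they are given
biggest <- max(3, 9, 4)      # 9
extremes <- range(3, 9, 4)   # smallest and largest value: 3 9
```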

## Exercise 1

add an exercise

## The Editor

The **console** is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around.

The **editor** is where you compose scripts that will process data, perform analyses, code up visualizations, and even write reports.

These work together in RStudio, which has multiple ways to send parts of the script you are editing to the console for immediate evaluation. Alternatively you can “source” the entire script.

Open up “worksheet.R” in the editor, and follow along by replacing the `...` placeholders with the code here. Then evaluate just this line (Ctrl R on Windows, ⌘ R on Mac OS).

```
vals <- seq(1, 100)
```

The elements of this statement, from right to left, are:

- `)` is the closing paren of a function call
- `1` and `100` are both arguments, or parameters, to the function
- `(` is the opening paren of the function call
- `seq` is the name of the function
- `<-` is an operator that assigns what’s named on the left to equal the result of the expression on the right
- `vals` is the name of a variable

- Question
- Why call `vals` a “variable” and `seq` a “function”?
- Answer
- It is true they are both names of objects known to R, and could be called variables. But `seq` has the important distinguishing feature of being **callable**, which is indicated in documentation by writing the function name with empty parens, as in `seq()`.

Our call to the function `seq` could have been much more explicit. We could give the arguments by the names that `seq` is expecting.

```
vals <- seq(from = 1,
to = 100)
```

Run this code either line-by-line, or highlight the section to run (optionally with keyboard shortcut Ctrl-Return or ⌘ Return).

- Question
- What’s an advantage of naming arguments?
- Answer
- One advantage is that you can put them in any order. A related advantage is that you can then skip some arguments, which is fine to do if each skipped argument has a default value.
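Both advantages can be tried out with `seq()` itself. This is a small sketch; the variable names `a`, `b` and `odds` are made up for the example:

```r
# Named arguments can be given in any order
a <- seq(to = 10, from = 1)

# The same call with positional arguments, for comparison
b <- seq(1, 10)
all(a == b)   # TRUE: both calls produce 1 through 10

# Naming `by` sets the step size; arguments we leave out keep their defaults
odds <- seq(from = 1, to = 9, by = 2)   # 1 3 5 7 9
```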

How would you get to know the names of a function’s arguments?

```
?seq
```

How would you even know what function to call?

```
??sequence
```

The `<-` symbol used above is an operator, a shorthand for calling a function without placing arguments within parentheses.
The `seq()` function also has an operator form when only the `from` and `to` arguments are used.

```
1:100
```

```
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
[86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
```

The `:` operator is most commonly used while accessing parts of other objects, as we’ll see below.
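Two details about `:` are worth trying in the console. The sketch below uses made-up variable names; the point is that `:` can count backwards and that it is evaluated before arithmetic operators like `+`:

```r
down <- 5:1            # counts backwards: 5 4 3 2 1

# `:` binds more tightly than `+`, so these two lines differ
shifted <- 1:3 + 1     # (1:3) + 1 gives 2 3 4
longer <- 1:(3 + 1)    # 1:4 gives 1 2 3 4
```

When in doubt, add parentheses to make the intended range explicit.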

## Data types

Type | Example |
---|---|
integer | -4, 0, 999 |
double | 3.1, -4, Inf, NaN |
character | `'a'`, `"4"`, `"👏"` |
logical | TRUE, FALSE |
missing | NA |
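You can ask R which type a value has with `typeof()`, the function used earlier on `y`. The sketch below also shows one detail the table glosses over: a number typed without a suffix is stored as a double, and the `L` suffix requests an integer.

```r
typeof(4)       # "double": a bare number is stored as a double
typeof(4L)      # "integer": the L suffix requests an integer
typeof(Inf)     # "double"
typeof("4")     # "character": the quotation marks make it a string
typeof(TRUE)    # "logical"
typeof(NA)      # "logical": a bare NA is a logical missing value
is.na(NA)       # TRUE: the way to test for a missing value
```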

## Data structures

Data structures are compound objects, built from one or more of these data types, or even from other objects.

Common one-dimensional, array-like data structures:

- Vectors
- Lists
- Factors

## Vectors

Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function `c()`. Use the operator `:` for a sequence of numbers (forwards or backwards), otherwise separate elements with commas.

```
counts <- c(4, 3, 7, 5)
```

All elements of a vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.

```
c(1, 2, "c")
```

```
[1] "1" "2" "c"
```
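The same rule applies to other combinations of types. As a rough sketch (variable names invented for the example), logical is the least flexible type, followed by integer, double, and character:

```r
mixed_lgl <- c(1.5, TRUE, FALSE)   # logicals become doubles: 1.5 1.0 0.0
mixed_int <- c(1L, 2.5)            # the integer becomes a double
mixed_chr <- c(TRUE, "a")          # the logical becomes a string: "TRUE" "a"

typeof(mixed_lgl)   # "double"
typeof(mixed_int)   # "double"
typeof(mixed_chr)   # "character"

# Going the other way requires an explicit conversion
back <- as.numeric(c("1", "2"))    # the numbers 1 and 2
```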

## Lists

Lists are like vectors but their elements can be of any data type or structure, including another list! You construct lists by using `list()` instead of `c()`.

Compare the results of `list()` and `c()`:
```
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
```

- Question
- What’s different about the structure of the variables `x` and `y`? Use the function `str()` to investigate.
- Answer
- The list `x` contains two elements, a list and a vector. For `y`, `c()` flattened its inputs into a single list of four elements, because a list is the most flexible type among them.
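A few quick checks make the difference concrete. This sketch repeats the two assignments so it can be run on its own:

```r
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))

length(x)    # 2: a list and a numeric vector
length(y)    # 4: every value became its own list element
is.list(y)   # TRUE: combining a list with a vector gives a list

str(x)       # shows the nesting inside x
str(y)       # shows a flat list of four numbers
```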

## Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the `class()`, “factor”, which makes them behave differently from regular integer vectors, and their `levels()`, or the set of allowed values.

Use `factor()` to create a vector with predefined values, which are often characters or “strings”.

```
education <- factor(
c("college", "highschool", "college", "middle"),
levels = c("middle", "highschool", "college"))
```

```
str(education)
```

```
Factor w/ 3 levels "middle","highschool",..: 3 2 3 1
```

A factor can be unordered, as above, or ordered with each level somehow “less than” the next.

```
education <- factor(
c("college", "highschool", "college", "middle"),
levels = c("middle", "highschool", "college"),
ordered = TRUE)
```

```
str(education)
```

```
Ord.factor w/ 3 levels "middle"<"highschool"<..: 3 2 3 1
```
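To see what “built on top of integer vectors” means in practice, you can look at the pieces of a factor directly. This is a minimal sketch; it uses a separate, made-up variable `edu` so the lesson's `education` variable is left untouched:

```r
edu <- factor(
  c("college", "highschool", "college", "middle"),
  levels = c("middle", "highschool", "college"),
  ordered = TRUE)

levels(edu)                   # the allowed values, in order
codes <- as.integer(edu)      # the underlying integers: 3 2 3 1

# Comparisons are meaningful only because the factor is ordered
below <- edu < "college"      # FALSE TRUE FALSE TRUE

tally <- table(edu)           # how many times each level occurs
```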

## Multi-dimensional data structures

Data can be stored in several types of data structures depending on its complexity.

Dimensions | Homogeneous | Heterogeneous |
---|---|---|
1d | c() | list() |
2d | matrix() | data.frame() |
nd | array() | |

Of these, the data frame is far and away the most used.

## Data frames

Data frames are 2-dimensional and can contain heterogeneous data like numbers in one column and a factor in another.

It is the data structure most similar to a spreadsheet, with two key differences:

- Data frame columns are *equal-length* vectors.
- As vectors, the columns are homogeneous and cannot hold values of the *wrong* type.

Creating a data frame from scratch can be done by combining vectors with the `data.frame()` function.

```
df <- data.frame(education, counts)
```

```
df
```

```
education counts
1 college 4
2 highschool 3
3 college 7
4 middle 5
```

Some functions to get to know your data frame are:

Function | Output |
---|---|
`dim()` | dimensions |
`nrow()` | number of rows |
`ncol()` | number of columns |
`names()` | (column) names |
`str()` | structure |
`summary()` | summary info |
`head()` | shows beginning rows |

```
names(df)
```

```
[1] "education" "counts"
```
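The other functions in the table work the same way. The sketch below runs them on a small, made-up data frame called `toy` (its columns `site` and `depth` are invented for the example):

```r
toy <- data.frame(
  site = c("A", "B", "C"),
  depth = c(1.2, 3.4, 0.8))

dim(toy)           # 3 2: rows first, then columns
nrow(toy)          # 3
ncol(toy)          # 2
head(toy, n = 2)   # only the first two rows
str(toy)           # the type and first values of each column
summary(toy)       # summary info for each column
```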

## Exercise 2

Create a data frame with two columns, one called “species” and another called “count”. Store your data frame as a variable called `data`. You can do this with or without populating the data frame with values.

Read a CSV file into a data frame using the `read.csv()` function.

```
surveys <- read.csv('data/surveys.csv')
```

```
head(surveys)
```

```
record_id month day year plot_id species_id sex hindfoot_length weight
1 1 7 16 1977 2 NL M 32 NA
2 2 7 16 1977 3 NL M 33 NA
3 3 7 16 1977 2 DM F 37 NA
4 4 7 16 1977 7 DM M 36 NA
5 5 7 16 1977 3 DM M 35 NA
6 6 7 16 1977 1 PF M 14 NA
```

## Load data into R

We will use the function `read.table()`, which reads in a file when you pass it the location of the file. The general syntax for the functions that read in data is to give the path to the file, and then supply optional additional arguments as necessary, like specifying the type of data in each column. Specific file types can be read in using functions like `read.csv()`, which are wrappers for the `read.table()` function that have different default settings.

Type a comma after `read.table(` and then press **tab** to see what arguments this function takes. Hovering over each item in the list will show a description of that argument from the help documentation about that function. Specify the values to use for an argument using the syntax `name = value`.

```
read.table(file="data/plots.csv", header = TRUE, sep = ",")
```

```
plot_id plot_type
1 1 Spectab exclosure
2 2 Control
3 3 Long-term Krat Exclosure
4 4 Control
5 5 Rodent Exclosure
6 6 Short-term Krat Exclosure
7 7 Rodent Exclosure
8 8 Control
9 9 Spectab exclosure
10 10 Rodent Exclosure
11 11 Control
12 12 Control
13 13 Short-term Krat Exclosure
14 14 Control
15 15 Long-term Krat Exclosure
16 16 Rodent Exclosure
17 17 Control
18 18 Short-term Krat Exclosure
19 19 Long-term Krat Exclosure
20 20 Short-term Krat Exclosure
21 21 Long-term Krat Exclosure
22 22 Control
23 23 Rodent Exclosure
24 24 Rodent Exclosure
```

Use the assignment operator `<-` to store that data in memory and work with it.

```
plots <- read.table(file="data/plots.csv", sep = ",", header = TRUE)
surveys <- read.csv(file="data/surveys.csv", sep = ",", header = TRUE)
```

You can specify what indicates missing data in the `read.csv()` function using either `na.strings = "NA"` or `na = "NA"`. You can also specify multiple things to be interpreted as missing values, such as `na.strings = c("missing", "no data", "< 0.05 mg/L", "XX")`.
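Here is a small sketch of `na.strings` in action. To keep it self-contained, it reads from a string through the `text` argument instead of a file; the data, the column names, and the variable names are all made up for illustration:

```r
# A tiny CSV held in a string, with two different markers for missing data
csv_text <- "site,nitrate
A,1.5
B,missing
C,XX
D,2.0"

readings <- read.csv(text = csv_text, na.strings = c("missing", "XX"))

readings$nitrate          # the two markers were read in as NA
is.na(readings$nitrate)   # FALSE TRUE TRUE FALSE
class(readings$nitrate)   # "numeric": the column is no longer treated as text
```

Without `na.strings`, the words in the `nitrate` column would force the whole column to be read as text.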

After reading in the Surveys and Plots csv files, let’s explore what types of data are in each column and what kind of structure your data has.

```
str(plots)
```

```
'data.frame': 24 obs. of 2 variables:
$ plot_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ plot_type: Factor w/ 5 levels "Control","Long-term Krat Exclosure",..: 5 1 2 1 3 4 3 1 5 3 ...
```

```
summary(plots)
```

```
plot_id plot_type
Min. : 1.00 Control :8
1st Qu.: 6.75 Long-term Krat Exclosure :4
Median :12.50 Rodent Exclosure :6
Mean :12.50 Short-term Krat Exclosure:4
3rd Qu.:18.25 Spectab exclosure :2
Max. :24.00
```

```
str(surveys)
```

```
'data.frame': 35549 obs. of 9 variables:
$ record_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 16 16 16 16 16 16 16 16 16 16 ...
$ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
$ plot_id : int 2 3 2 7 3 1 2 1 1 6 ...
$ species_id : Factor w/ 49 levels "","AB","AH","AS",..: 17 17 13 13 13 24 23 13 13 24 ...
$ sex : Factor w/ 3 levels "","F","M": 3 3 2 3 3 3 2 3 2 2 ...
$ hindfoot_length: int 32 33 37 36 35 14 NA 37 34 20 ...
$ weight : int NA NA NA NA NA NA NA NA NA NA ...
```

```
summary(surveys)
```

```
record_id month day year
Min. : 1 Min. : 1.000 Min. : 1.00 Min. :1977
1st Qu.: 8888 1st Qu.: 4.000 1st Qu.: 9.00 1st Qu.:1984
Median :17775 Median : 6.000 Median :16.00 Median :1990
Mean :17775 Mean : 6.474 Mean :16.11 Mean :1990
3rd Qu.:26662 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:1997
Max. :35549 Max. :12.000 Max. :31.00 Max. :2002
plot_id species_id sex hindfoot_length weight
Min. : 1.0 DM :10596 : 2511 Min. : 2.00 Min. : 4.00
1st Qu.: 5.0 PP : 3123 F:15690 1st Qu.:21.00 1st Qu.: 20.00
Median :11.0 DO : 3027 M:17348 Median :32.00 Median : 37.00
Mean :11.4 PB : 2891 Mean :29.29 Mean : 42.67
3rd Qu.:17.0 RM : 2609 3rd Qu.:36.00 3rd Qu.: 48.00
Max. :24.0 DS : 2504 Max. :70.00 Max. :280.00
(Other):10799 NA's :4111 NA's :3266
```

Each column in a data frame can be referred to using the `$` operator, with the data frame name on its left and the column name on its right. `surveys$record_id` refers to the `record_id` column in the `surveys` data frame.

Note that in R versions before 4.0.0, character data is read in as factors by default when you load data into R, as in the output above. Later, we will use the argument `stringsAsFactors = FALSE` to suppress this behavior because it can cause confusion.
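Setting the argument explicitly gives the same result in any version of R. The sketch below reads a two-row, made-up excerpt from a string (via the `text` argument) so that it runs without the data files:

```r
csv_text <- "plot_id,plot_type
1,Control
2,Rodent Exclosure"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$plot_type)   # "character"
class(as_fct$plot_type)   # "factor"
```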

## Exercise 3

Fix each of the following common data frame subsetting errors:

```
plots[plots$plot_id = 4, ]
plots[-1:4, ]
plots[plots$plot_id <= 5]
plots[plots$plot_id == 4 | 6, ]
```

## Parts of an Object

Parts of objects are always accessible, either by their name or by their position, using square brackets: `[` and `]`.

## Position

```
counts[1]
```

```
[1] 4
```

```
counts[3]
```

```
[1] 7
```

## Names

Parts of an object can usually also have a name. The names can be given when you are creating a vector or afterwards using the `names()` function.

```
df['education']
```

```
education
1 college
2 highschool
3 college
4 middle
```

```
names(df) <- c("ed", "ct")
```

```
df['ed']
```

```
ed
1 college
2 highschool
3 college
4 middle
```

- Question
- This use of `<-` with `names(x)` on the left is a little odd. What’s going on?
- Answer
- We are overwriting an existing variable, but one that is accessed through the output of the function on the left rather than the global environment.

In a multi-dimensional array, you separate the dimension along which a part is requested with a comma.

```
df[3, "ed"]
```

```
[1] college
Levels: middle < highschool < college
```

It’s fine to mix names and indices when selecting parts of an object.
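Names work on plain vectors and matrices too. The following sketch uses invented values and names to show both ways of naming a vector, and a mix of a name and an index on a matrix:

```r
# Names given while creating the vector
temps <- c(mon = 12, tue = 15, wed = 11)
temps["tue"]                 # the element named "tue"

# Names added afterwards with names()
scores <- c(90, 72)
names(scores) <- c("ana", "ben")
scores[c("ben", "ana")]      # several names at once, in any order

# A row name combined with a column position
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m["r2", 3]                   # 6
```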

## Subsetting ranges

There are multiple ways to simultaneously extract multiple parts of an object.

Use in brackets | Subset instructions |
---|---|
positive integers | elements at the specified positions |
negative integers | omit elements at the specified positions |
logical vectors | select elements where the corresponding value is TRUE |
nothing | return the original vector (all) |

```
days <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weekdays <- days[2:6]
weekend <- days[c(1, 7)]
```

```
weekdays
```

```
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
```

```
weekend
```

```
[1] "Sunday" "Saturday"
```
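The remaining rows of the table can be tried on a small numeric vector. In this sketch the vector `sizes` and the other variable names are made up for the example:

```r
sizes <- c(4, 3, 7, 5)

dropped <- sizes[-1]                           # omit the first element: 3 7 5
picked <- sizes[c(TRUE, FALSE, TRUE, FALSE)]   # keep where TRUE: 4 7
everything <- sizes[]                          # nothing in brackets: all elements

# A logical vector is usually computed rather than typed out
big <- sizes[sizes > 4]                        # 7 5
```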

## Exercise 4

- Get weekdays using negative integers.
- Get M-W-F using a call to `seq()` to specify the positions (don’t forget to `?seq`).

The `$` sign is an operator that makes for quick access to a single, named part of an object.
It’s most useful when used interactively with “tab completion” on the columns of a data frame.

```
df$ed
```

```
[1] college highschool college middle
Levels: middle < highschool < college
```

## Base plotting

R has excellent plotting capabilities for many types of graphics. The `plot()` function is the most basic plotting function. It is polymorphic, i.e. it uses the information you give it to determine what kind of plot to make.

For more advanced plotting such as multi-faceted plots, the libraries lattice and ggplot2 are excellent options.

## Scatterplots

The basic syntax is `plot(x, y)`, or use the formula notation `plot(y ~ x)`.

```
plot(surveys$month, surveys$weight)
```

## Histograms

```
hist(log(surveys$weight))
```

## Boxplots

Use a boxplot to compare the distribution of (log-transformed) weights across years.

```
boxplot(log(surveys$weight) ~ surveys$year)
```

## Creating functions

Writing functions to use multiple times within a project can prevent you from duplicating code. If you see blocks of similar lines of code throughout your project, those are usually candidates for being moved into functions.

## Anatomy of a function

Writing functions is also a great way to understand the terminology and workings of R. Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is `function`.

```
function(...) {
...
return(...)
}
```

Three components:

- **arguments**: control how you can call the function
- **body**: the code inside the function
- **return value**: controls what output the function gives

We’ll make a function to extract the element in the first row and first column of its argument, for which we can choose an arbitrary name:

```
function(x) {
  result <- x[1, 1]
  return(result)
}
```

Note that `x` doesn’t exist until we call the function. The function definition only gives the recipe for how `x` will be handled.

Finally, we need to give the function a name so we can use it like we used `c()` and `seq()` above.

```
first <- function(x) {
  result <- x[1, 1]
  return(result)
}
```

```
first(df)
```

```
[1] college
Levels: middle < highschool < college
```

- Question
- Can you explain the result of entering `first(counts)` into the console?
- Answer
- The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message.
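If you want to look at an error message without stopping a script, one option is to capture it. The sketch below goes slightly beyond this lesson by using `tryCatch()` and `conditionMessage()` from base R; it repeats the definitions of `first` and `counts` so it can run on its own:

```r
first <- function(x) {
  result <- x[1, 1]
  return(result)
}

counts <- c(4, 3, 7, 5)

# counts has one dimension, but x[1, 1] asks for a row and a column
msg <- tryCatch(first(counts), error = function(e) conditionMessage(e))
msg    # the text of the error message, stored as a string

# The same function works on anything with rows and columns
m <- matrix(1:6, nrow = 2)
first(m)    # 1
```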

## Exercise 5

Subset the data frame by column name and row position to obtain the following output.

```
[1] highschool college
Levels: middle < highschool < college
```

## Distributions and Statistics

Since it is designed for statistics, R can easily draw random numbers from statistical distributions and calculate distribution values.

To generate random numbers from a normal distribution, use the function `rnorm()`

```
ten_random_values <- rnorm(n = 10)
```

Function | Returns | Notes |
---|---|---|
`rnorm()` | Draw random numbers from normal distribution | Specify `n`, `mean`, `sd` |
`dnorm()` | Probability density at a given number | |
`pnorm()` | Cumulative probability up to a given number | left-tailed by default |
`qnorm()` | The quantile given a cumulative probability | inverse of `pnorm()` |
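A short sketch of the four functions for the standard normal distribution (mean 0, standard deviation 1). The call to `set.seed()` is an addition that is not covered in this lesson; it only makes the random draws repeatable. The variable names are made up:

```r
set.seed(1)                                # make the random draws repeatable
draws <- rnorm(n = 5, mean = 10, sd = 2)   # five random numbers

dens <- dnorm(0)    # height of the bell curve at 0, about 0.399
p <- pnorm(0)       # half of the probability lies below the mean: 0.5
q <- qnorm(0.5)     # the value with half the probability below it: 0

# qnorm() undoes pnorm()
round_trip <- qnorm(pnorm(1.2))            # 1.2, up to rounding error

# Ask for the right tail instead of the default left tail
upper <- pnorm(1.96, lower.tail = FALSE)   # about 0.025
```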

Statistical distributions and their functions.
See *Table 14.1* in **R for Everyone** by Jared Lander for a full table.

Distribution | Random Number |
---|---|
Normal | rnorm |
Binomial | rbinom |
Poisson | rpois |
Gamma | rgamma |
Exponential | rexp |
Uniform | runif |
Logistic | rlogis |

R has built-in functions for handling many statistical tests.

```
x <- rnorm(n = 100, mean = 25, sd = 7)
y <- rbinom(n = 100, size = 50, prob = .85)
```

```
t.test(x, y)
```

```
Welch Two Sample t-test
data: x and y
t = -21.97, df = 121.54, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-19.19564 -16.02228
sample estimates:
mean of x mean of y
24.78104 42.39000
```

Linear regression with the `lm()` function uses a formula notation to specify relationships between variables (e.g. `y ~ x`).

```
fit <- lm(y ~ x)
```

```
summary(fit)
```

```
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-6.9807 -1.6051 0.2341 1.4698 5.0862
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 43.59879 0.87963 49.565 <2e-16 ***
x -0.04878 0.03395 -1.437 0.154
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.564 on 98 degrees of freedom
Multiple R-squared: 0.02062, Adjusted R-squared: 0.01063
F-statistic: 2.064 on 1 and 98 DF, p-value: 0.154
```
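Individual numbers can be pulled out of a fitted model instead of being read off the printed summary. The sketch below uses simulated data with a known slope of 2; the variable names (`x_demo`, `y_demo`, `fit_demo`) are made up, and `set.seed()` is used only to make the simulation repeatable:

```r
set.seed(42)
x_demo <- 1:20
y_demo <- 1 + 2 * x_demo + rnorm(20, sd = 0.5)   # a line plus some noise

fit_demo <- lm(y_demo ~ x_demo)

est <- coef(fit_demo)        # named vector: "(Intercept)" and "x_demo"
slope <- est[["x_demo"]]     # should be close to the true slope of 2

r2 <- summary(fit_demo)$r.squared   # the "Multiple R-squared" value
```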

## Exercise 6

Create a data frame from scratch that has three columns and 5 rows. In column “size” place a sequence from 1 to 5. For column “year”, create a factor with three levels representing the past three years. In column “prop”, place 5 random samples from a uniform distribution. Show the summary of a linear model following the formula “prop ~ size + year”.

## Flow control

Because R is a general-purpose programming language, you can write R scripts to take care of non-computational tasks.

“Flow control” is the generic term for letting variables, whose values are determined at run time, dictate how the code evaluates. It covers things like “for loops” and “if/else” statements.
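As a minimal sketch of both ideas together, the loop below visits the numbers 1 to 4 and uses `if`/`else` to choose a label for each one. The variable names are invented for the example:

```r
labels <- character(0)   # start with an empty character vector

for (n in 1:4) {
  # %% gives the remainder after division, so n %% 2 is 0 for even numbers
  if (n %% 2 == 0) {
    label <- "even"
  } else {
    label <- "odd"
  }
  labels <- c(labels, paste(n, "is", label))
}

labels   # "1 is odd" "2 is even" "3 is odd" "4 is even"
```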

## Install missing packages

The last thing we’ll do before taking a break is let R check for any packages you’ll need today that aren’t installed. We’ll learn how to use flow control along the way.

First, acquire the list of any missing packages.

```
required <- c(
'sp',
'rgdal',
'rgeos',
'raster',
'shiny',
'leaflet',
'tm')
installed <- rownames(installed.packages())
missing <- setdiff(required, installed)
```
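The `setdiff()` call keeps the elements of its first argument that do not appear in its second. A small sketch with made-up package lists (no packages are checked or installed here):

```r
wanted <- c("sp", "raster", "shiny")
have <- c("shiny", "stats", "sp")

to_get <- setdiff(wanted, have)        # "raster": wanted but not yet present
none_missing <- length(to_get) == 0    # FALSE, because one name is left over
```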

Check, from the console, whether any packages are missing:

```
length(missing) == 0
```

```
[1] FALSE
```

Your result will be `TRUE` or `FALSE`, depending on whether you installed all the packages already. We can let the script decide what to do with this information.

The keyword `if` is part of the R language’s syntax for flow control. The statement in the body (between `{` and `}`) only evaluates if the argument (between `(` and `)`) evaluates to TRUE.

```
if (length(missing) != 0) {
  # spell out the argument name rather than relying on partial matching
  install.packages(missing, dependencies = TRUE)
}
```

## Reminder on important symbols

Symbol | Meaning |
---|---|
`?` | get help |
`c()` | combine |
`#` | comment |
`:` | sequence |
`<-` | assignment |
`[ ]` | selection |

## Exercise solutions

## Solution 1 (Exercise 2)

```
species <- c()
count <- c()
data <- data.frame(species, count)
```

```
str(data)
```

```
'data.frame': 0 obs. of 0 variables
```

## Solution 2 (Exercise 4)

```
sol2a <- days[c(-1, -7)]
sol2b <- days[seq(2, 7, 2)]
```

```
sol2a
```

```
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
```

```
sol2b
```

```
[1] "Monday" "Wednesday" "Friday"
```

## Solution 3 (Exercise 5)

```
sol3 <- df[2:3, 'ed']
```

```
sol3
```

```
[1] highschool college
Levels: middle < highschool < college
```

## Solution 4 (Exercise 6)

```
df <- data.frame(
size = 1:5,
year = factor(
c(2014, 2014, 2013, 2015, 2015),
levels = c(2013, 2014, 2015),
ordered = TRUE),
prop = runif(n = 5))
fit <- lm(prop ~ size + year, data = df)
```

```
summary(fit)
```

```
Call:
lm(formula = prop ~ size + year, data = df)
Residuals:
1 2 3 4 5
-0.02184 0.02184 0.00000 0.02184 -0.02184
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.08759 0.13266 -0.660 0.6285
size 0.17991 0.04368 4.119 0.1516
year.L -0.73225 0.05982 -12.242 0.0519 .
year.Q -0.03270 0.08691 -0.376 0.7709
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.04368 on 1 degrees of freedom
Multiple R-squared: 0.9961, Adjusted R-squared: 0.9845
F-statistic: 85.72 on 3 and 1 DF, p-value: 0.07919
```