Basic R
Lesson 1 with Kelly Hondula
Contents
 Why learn R?
 The Console
 The Environment
 The Editor
 Data types
 Multidimensional data structures
 Parts of an Object
 Linear models
 Review
 Exercise solutions
Why learn R?
 Originally designed for interactive statistical analysis
 General-purpose scripting for your whole pipeline
 Bleeding-edge packages expand on “base R”
 Vast community within statistics and ecology
 Open source
The Console
The interpreter accepts R commands interactively through the console.
Basic math, as you would type it on a calculator, is usually a valid command in the R language:
1 + 2
[1] 3
4^2
[1] 16
 Question
 Why is the output prefixed by [1]?
 Answer
 That’s the index, or position in a vector, of the first result.
A command giving a vector of results shows this clearly:
seq(1, 50)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
The interpreter understands more than arithmetic operations!
That last command told it to use (or “call”) the function seq().
Most of “learning R” involves getting to know a whole lot of functions, the effect of each function’s arguments (e.g. the input values 1 and 50), and what each function returns (e.g. the output vector).
Basic math
A good place to begin learning R functions is with its built-in mathematical functionality:
Arithmetic operators
Try +, -, *, /, and ^ (for raising to a power).
5/3
[1] 1.666667
Logical tests
Test equality with “==” and inequality with “<=”, “<”, “!=”, “>”, or “>=”.
1/2 == 0.5
[1] TRUE
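As a small sketch of these tests (the numbers are arbitrary, chosen only for illustration), each comparison returns a single logical value. The last two lines add a caveat that goes slightly beyond the text above: most decimal fractions are stored approximately, so `==` can give a surprising answer, and the base R function `all.equal()` compares within a small tolerance instead.

```r
3 <= 4    # TRUE: less than or equal to
3 >= 4    # FALSE: greater than or equal to
3 != 4    # TRUE: not equal to

# 1/2 is stored exactly, but 0.1 + 0.2 is not exactly 0.3
0.1 + 0.2 == 0.3                     # FALSE
# all.equal() allows for a tiny numerical difference
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE
```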
More Maths
Common mathematical functions like sin, log, and sqrt, and constants such as pi.
sin(2 * pi)
[1] -2.449294e-16
Programming idioms
Common computer programming functions like rep, sort, and range.
rep(2, 5)
[1] 2 2 2 2 2
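The other two functions mentioned here work the same way. As a quick sketch, using only seq() from earlier to build the input (the particular numbers are arbitrary):

```r
rep(seq(1, 3), times = 2)   # repeat the whole sequence twice: 1 2 3 1 2 3
sort(seq(5, 1))             # put 5 4 3 2 1 in ascending order: 1 2 3 4 5
range(seq(5, 1))            # smallest and largest value: 1 5
```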
Parentheses
Sandwiching something with “(” and “)” has two possible meanings.
Group sub-expressions by parentheses on an as-needed basis.
(1 + 2) / 3
[1] 1
Call functions by typing their name and comma-separated arguments between parentheses.
logb(2, 2)
[1] 1
Exercise 1
The quadratic formula for a value of $x$ that satisfies the equation $a x^2 + b x + c = 0$ is
$$x = \frac{-b + \sqrt{b^2 - 4 a c}}{2 a}.$$
Use this formula to write an expression that computes $x$ when $a$ is 1.5, $b$ is 0.3, and $c$ is -2.9.
Assignment
When you start a new session, the R interpreter already recognizes many things, including
 any number
 any string of characters
 nearly universal operators (e.g. + and /)
 operators specific to R (e.g. $ and %*%)
 functions in base R
To reference a number or function you just type it in as above. To reference a string of characters you must surround it in quotation marks.
'ab.cd'
[1] "ab.cd"
Without quotation marks, the interpreter checks for things named ab.cd and doesn’t find anything:
ab.cd
Error in eval(expr, envir, enclos): object 'ab.cd' not found
 Question
 Is it better to use ' or "?
 Answer
 Neither one is better. You will often encounter stylistic choices like this, so if you don’t have a personal preference try to mimic existing styles.
We can expand the vocabulary known to the R interpreter by creating a new variable.
Using the symbol <- is referred to as assignment: we assign the output of any command to the right of <- to any variable written to its left.
x <- seq(1, 50)
You’ll notice that nothing prints to the console, because we assigned the output to a variable.
We can print the value of x by evaluating it without assignment.
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
Assigning values to new variables (to the left of a <-) is the only time you can reference something previously unknown to the interpreter.
All other commands must reference things already in the interpreter’s vocabulary.
Once assigned to a variable, a value becomes known to R and you can refer to it in other commands.
y <- 'ab.cd'
typeof(y)
[1] "character"
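To illustrate referring to a variable in later commands, here is a minimal sketch. The variable name `total` is made up for this example, and `sum()` and `length()` are base R functions not introduced above.

```r
x <- seq(1, 50)      # the same assignment as above
total <- sum(x)      # store the result of one command ...
total                # 1275
total / length(x)    # ... and reuse it in another: 25.5
```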
The Environment
In the RStudio IDE, the environment tab displays the variables added to R’s vocabulary in the current session.
 Variables do not persist between sessions (although RStudio defaults to automatic saving and loading of variables from a .RData file).
 Variables only change their value on reassignment.
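The second point is easy to check at the console. In this sketch (the variable `n` is made up for illustration), using a variable in a calculation leaves it untouched; only a new assignment replaces its value.

```r
n <- 10
n + 1        # prints 11, but n itself is unchanged
n            # still 10
n <- n + 1   # reassignment replaces the stored value
n            # now 11
```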
The Editor
The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even write reports.
These work together in RStudio, which has multiple ways to send parts of the script you are editing to the console for immediate evaluation. Alternatively you can “source” the entire script.
Open up “worksheet1.R” in the editor, and follow along by replacing the ... placeholders with the code here. Then evaluate just this line (Ctrl+Enter on Windows, ⌘+Enter on Mac OS).
vals <- seq(1, 100)
Let’s review the elements of this statement, from left to right:
 vals is the name of a (new) variable
 <- assigns to vals the result of what comes after
 seq is the name of a function
 ( is the opening paren of the function call
 1 and 100 are separate arguments to the function
 ) is the closing paren of the function call
 Question
 Why call vals a “variable” and seq a “function”?
 Answer
 It is true they are both names of things known to R, and could be called variables. But seq has the distinguishing property of being callable (i.e. we can use “(” and “)” to provide arguments), which makes it a variable that behaves something like a mathematical function by taking input and returning output.
Our call to the function seq could have been much more explicit. We could give the arguments by the names that seq is expecting.
vals <- seq(from = 1,
            to = 100)
Run that code by moving your cursor anywhere within those two lines and clicking “Run”, or by using the keyboard shortcut Ctrl+Return or ⌘+Return.
 Question
 What’s an advantage of naming arguments?
 Answer
 One advantage is that you can put them in any order. A related advantage is that you can then skip some arguments, which is fine to do if each skipped argument has a default value. A third advantage is code readability, which you should always be conscious of while writing in the editor.
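The first two advantages can be sketched with seq(), relying only on its documented from, to, and by arguments (the values are arbitrary):

```r
# Named arguments can be given in any order
seq(to = 10, from = 1)            # 1 2 3 4 5 6 7 8 9 10

# Naming an argument makes its role obvious to the reader
seq(from = 1, to = 10, by = 3)    # 1 4 7 10

# Unnamed arguments are matched by position: from, to, by
seq(1, 10, 3)                     # same result, but harder to read
```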
Readability
Code readability in the editor cuts both ways: sometimes verbosity is useful, sometimes it is cumbersome.
The seq() function has an alternative form available when only the from and to arguments are needed.
1:100
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
[18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
[52] 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
[69] 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85
[86] 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
The : operator should be used whenever possible because it replaces a common, cumbersome function call with a brief, intuitive syntax.
Likewise, the assign function duplicates the functionality of the <- symbol, but is never used when the simpler operator will suffice.
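A short sketch comparing the verbose and the brief forms side by side. The variable names `a` and `b` are made up for this example, and `identical()` and `all()` are base R functions not introduced above.

```r
# The colon operator and seq() build the same sequence
all(1:5 == seq(1, 5))    # TRUE

# assign() takes the variable name as a string
assign("a", 1:3)
# the operator form does the same job with less typing
b <- 1:3
identical(a, b)          # TRUE
```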
Function documentation
How would you get to know these properties and the names of a function’s arguments?
?seq
How would you even know what function to call?
??sequence
Data types
| Type | Example |
|---|---|
| double | 3.1, 4, Inf, NaN |
| integer | 4L, 0L, 999L |
| character | 'a', '4', '👏' |
| logical | TRUE, FALSE |
| missing | NA |
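You can check any of these at the console with typeof(), the same function used earlier on a character string. A brief sketch follows; `is.na()` is a base R function not introduced above.

```r
typeof(3.1)     # "double"
typeof(4)       # "double": numbers are doubles unless marked with L
typeof(4L)      # "integer"
typeof('a')     # "character"
typeof(TRUE)    # "logical"
typeof(NA)      # "logical": a plain NA is a missing logical value
is.na(NA)       # TRUE: is.na() tests whether a value is missing
```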
Data structures
Data structures are compound objects built from one or more of these data types, or even from other objects.
Common one-dimensional, array data structures:
 Vectors
 Lists
 Factors
Vectors
Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function c().
counts <- c(4, 3, 7, 5, 2)
All elements of a vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.
c(1, 2, "c")
[1] "1" "2" "c"
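To see what “most flexible” means in practice, here is a small sketch that mixes types pairwise (the values are arbitrary). Logical gives way to integer, integer to double, and everything to character.

```r
typeof(c(TRUE, 2L))      # "integer": TRUE becomes 1L
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "a"))      # "character"
c(TRUE, 2L, 3.5, "a")    # "TRUE" "2" "3.5" "a"
```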
Lists
Lists are like vectors but their elements can be of any data type or structure.
Construct lists with list() instead of c():
list(1, 2, "c")
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] "c"
Lists can even include another list!
list(1, list(2, 3))
[[1]]
[1] 1
[[2]]
[[2]][[1]]
[1] 2
[[2]][[2]]
[1] 3
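The double square brackets in that printed output are also how you reach inside a list. A minimal sketch, assuming the same nested list stored in a variable (the name `nested` is made up here):

```r
nested <- list(1, list(2, 3))

nested[[1]]            # the first element itself: 1
nested[[2]][[1]]       # first element of the inner list: 2
typeof(nested[1])      # "list": single brackets return a smaller list
typeof(nested[[1]])    # "double": double brackets return the element
```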
Exercise 2
Use the typeof function to inspect the data type of counts, and do the same for another variable to which you assign a list of numbers. Why are they different? What is the data type of the result when you use c to combine counts with the new variable you just created? Does c always create vectors?
Factors
A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are like integer vectors, but possess a levels attribute that assigns names to however many discrete categories are specified.
Use factor() to create a vector with predefined values, which are often characters or “strings”.
education <- factor(
  c("college", "highschool", "college", "middle", "middle"),
  levels = c("middle", "highschool", "college"))
The str function notes the labels, but prints the integers assigned in their stead.
str(education)
Factor w/ 3 levels "middle","highschool",..: 3 2 3 1 1
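The two halves of a factor, the labels and the integer codes, can be pulled apart as a quick check. This sketch repeats the definition of education so that it runs on its own. `levels()`, `as.integer()`, and `table()` are base R functions not introduced above.

```r
education <- factor(
  c("college", "highschool", "college", "middle", "middle"),
  levels = c("middle", "highschool", "college"))

levels(education)        # "middle" "highschool" "college"
as.integer(education)    # 3 2 3 1 1: position of each value in the levels
table(education)         # how often each level occurs: 2, 1, 2
```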
Multidimensional data structures
Data can be stored in several additional data structures depending on its complexity.
| Dimensions | Homogeneous | Heterogeneous |
|---|---|---|
| 1-d | c() | list() |
| 2-d | matrix() | data.frame() |
| n-d | array() | |
Of these, the data frame is far and away the most used.
Data frames
Data frames are two-dimensional and can contain heterogeneous data like numbers in one column and a factor in another.
It is the data structure most similar to a spreadsheet, with two key differences:
 The columns are equal-length vectors.
 As vectors, the columns are homogeneous and cannot hold values of the “wrong” type.
Creating a data frame from scratch can be done by combining vectors with the data.frame() function.
df <- data.frame(education, counts)
There are several functions to get to know a data frame:
| Function | Description |
|---|---|
| dim() | dimensions |
| nrow(), ncol() | number of rows, columns |
| names() | (column) names |
| str() | structure |
| summary() | summary info |
| head() | shows beginning rows |
names(df)
[1] "education" "counts"
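The remaining functions can be tried the same way. This sketch rebuilds df from the two vectors defined earlier so that it runs on its own.

```r
education <- factor(
  c("college", "highschool", "college", "middle", "middle"),
  levels = c("middle", "highschool", "college"))
counts <- c(4, 3, 7, 5, 2)
df <- data.frame(education, counts)

dim(df)        # 5 2: rows, then columns
nrow(df)       # 5
ncol(df)       # 2
head(df, 2)    # only the first two rows
str(df)        # type and first values of each column
summary(df)    # counts per level for the factor, quartiles for the numbers
```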
Exercise 3
Create a data frame with two columns, one called “species” with four strings and another called “abund” with four numbers. Store your data frame as a variable called data.
Parts of an Object
Parts of a data structure are always accessible, either by their name or by their position, using square brackets: [ and ].
Position
counts[1]
[1] 4
counts[3]
[1] 7
Names
Parts of an object may also have a name. The names can be given when you are creating a vector or afterwards using the names() function.
df['education']
education
1 college
2 highschool
3 college
4 middle
5 middle
names(df) <- c('ed', 'ct')
df['ed']
ed
1 college
2 highschool
3 college
4 middle
5 middle
 Question
 This use of <- with names(x) on the left is a little odd. What’s going on?
 Answer
 We are overwriting an existing variable, but one that is accessed through the output of the function on the left rather than the global environment.
For a multidimensional object such as a data frame, give the part requested along each dimension, separated by a comma.
df[3, 'ed']
[1] college
Levels: middle highschool college
It’s fine to mix names and indices when selecting parts of an object.
There are multiple ways to access several parts of an object together.
| Part | Result |
|---|---|
| positives | elements at given positions |
| negatives | given positions omitted |
| logicals | elements where the corresponding position is TRUE |
| nothing | all the elements |
days <- c(
  "Sunday", "Monday", "Tuesday", "Wednesday",
  "Thursday", "Friday", "Saturday")
weekdays <- days[2:6]
weekend <- days[c(1, 7)]
weekdays
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
weekend
[1] "Sunday" "Saturday"
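The example above only uses positive positions. As a sketch of all four rows of the table, here they are applied to the counts vector from earlier (redefined so the block runs on its own):

```r
counts <- c(4, 3, 7, 5, 2)

counts[c(1, 3)]       # positives: elements 1 and 3, giving 4 7
counts[-1]            # negatives: everything but element 1, giving 3 7 5 2
counts[counts > 3]    # logicals: elements where the test is TRUE, giving 4 7 5
counts[]              # nothing: all the elements, giving 4 3 7 5 2
```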
Exercise 4
 Get weekdays using negative integers.
 Get M-W-F using a vector of positions generated by seq() that uses the by argument (don’t forget to ?seq for help).
Subsetting data frames
The $ sign is an operator that makes for quick access to a single, named part of an object.
It’s most useful when used interactively with “tab completion” on the columns of a data frame.
df$ed
[1] college highschool college middle middle
Levels: middle highschool college
A logical test applied to a single column produces a vector of TRUE and FALSE values that’s the right length for subsetting the data.
df[df$ed == 'college', ]
ed ct
1 college 4
3 college 7
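Logical tests on different columns can be combined before subsetting. In this sketch the data frame is rebuilt with the renamed columns so the block runs on its own. The variable name `picked` is made up, and the operator `&` (“and”) is part of base R but not introduced above.

```r
df <- data.frame(
  ed = factor(c("college", "highschool", "college", "middle", "middle"),
              levels = c("middle", "highschool", "college")),
  ct = c(4, 3, 7, 5, 2))

# Keep rows that are not 'college' AND have a count of at least 3
picked <- df[df$ed != 'college' & df$ct >= 3, ]
picked                 # rows 2 and 4

# Select rows with a test and a column by name at the same time
df[df$ct > 4, 'ct']    # 7 5
```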
Linear models
Regression of a “response” variable against discrete and continuous “predictors” is fundamental to statistical data analysis. The lm function, which is an abbreviation for “linear model”, provides the simplest kind of regression in R.
Fitting a regression requires two inputs:
 data: a data.frame with independent observations
 model: a type of R expression called a formula
Specify a formula by naming a response variable left of a “~” and any number of predictors to its right.
y ~ a
Formula mini-language
Writing formulas requires understanding a very simple syntax for including predictors and specifying which ones interact.
Constant and one predictor:
y ~ a
y ~ 1 + a
No constant with one predictor:
y ~ -1 + a
y ~ 0 + a
Constant and two predictors:
y ~ a + b
y ~ 1 + a + b
Constant and one predictor as the “interaction” of two variables:
y ~ a:b
Constant and the full complement of two variables with their interaction:
y ~ a*b
y ~ 1 + a + b + a:b
Constant and the full complement of k variables with up to $n^{th}$-order interactions:
y ~ (a_1 + a_2 + ... + a_k)^n
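One way to check how R reads a formula, without fitting anything, is to ask for its expanded terms. This is a sketch using `terms()` from the stats package (loaded by default) and its "term.labels" and "intercept" attributes. Neither is introduced above, and the variable names a, b, and c are placeholders.

```r
# term.labels lists the predictors after R expands the formula
attr(terms(y ~ a * b), "term.labels")            # "a" "b" "a:b"
attr(terms(y ~ (a + b + c)^2), "term.labels")    # "a" "b" "c" "a:b" "a:c" "b:c"

# intercept is 1 when the constant is included and 0 when it is removed
attr(terms(y ~ a), "intercept")                  # 1
attr(terms(y ~ 0 + a), "intercept")              # 0
```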
Fitting models
Match your formula variables to the column names of a data frame, and pass the formula and data.frame as the first two arguments to lm.
animals <- read.csv('data/animals.csv')
fit <- lm(weight ~ hindfoot_length, animals)
summary(fit)
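If you don’t have the animals data at hand, the same steps can be tried on made-up data. This is only a sketch: the data frame `sim`, its columns, and the “true” intercept and slope are invented for illustration, and `set.seed()`, `rnorm()`, and `coef()` are base R functions not introduced above.

```r
set.seed(1)                            # make the random noise reproducible
sim <- data.frame(x = seq(1, 100))
sim$y <- 2 + 3 * sim$x + rnorm(100)    # true intercept 2, true slope 3, plus noise

sim_fit <- lm(y ~ x, sim)              # formula first, data frame second
coef(sim_fit)                          # estimates should land close to 2 and 3
```

Because we chose the intercept and slope ourselves, we can judge how well lm recovers them, which is not possible with real observations.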
Factors in linear models
Data structures matter in statistical modelling. For the predictors in a linear model, the most important distinction is whether a variable is a factor.
fit <- lm(weight ~ species_id, animals)
summary(fit)
The difference between 1 and 24 degrees of freedom in the last two models, each with one predictor, is due to species_id being a factor.
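The same effect shows up on a tiny, made-up data set. In this sketch the data frame `toy` and its values are invented for illustration: the same column is used once as a number and once as a factor, and the number of estimated coefficients changes accordingly.

```r
toy <- data.frame(
  y     = c(1.0, 1.2, 2.1, 1.9, 3.2, 2.8),
  group = c(1, 1, 2, 2, 3, 3))

fit_numeric <- lm(y ~ group, toy)            # group treated as a number
fit_factor  <- lm(y ~ factor(group), toy)    # group treated as categories

length(coef(fit_numeric))    # 2: the constant and one slope
length(coef(fit_factor))     # 3: the constant and one offset per extra level
```

A numeric predictor always costs one degree of freedom, while a factor costs one fewer than its number of levels.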
Exercise 5
Regress hindfoot_length against weight and species_id. Does it appear that the Chihuahuan Desert’s common kangaroo rats (DM) have inordinately large feet for their weight?
Review
In this introduction to R, we touched on several key parts of scripting for data analysis.
 RStudio panes
 Variable assignment
 Data structures
 Subsetting data
 Linear models
Special characters in R
Perhaps more than most languages, an R script can appear like a jumble of archaic symbols. Here is a little table of characters to recognize as having special meaning:
| Symbol | Meaning |
|---|---|
| ? | get help |
| # | comment |
| : | sequence |
| ::, ::: | access namespaces (advanced) |
| <- | assignment |
| $, [ ], [[ ]] | subsetting |
| % % | infix operators, e.g. %*% |
| { } | statements |
| . | |
| @ | slot (advanced) |
Yes, the . in R has no fixed meaning and is often used as _ might be used to separate words in a variable name.
Exercise solutions
Solution 1
(-0.3 + sqrt(0.3 ^ 2 - 4 * 1.5 * -2.9)) / (2 * 1.5)
[1] 1.294035
Solution 2
x <- list(3, 4, 5, 7)
typeof(counts)
[1] "double"
typeof(x)
[1] "list"
typeof(c(counts, x))
[1] "list"
The variable x has a data type of list, so R does not restrict its elements to a particular type as it does for vectors. The result of combining a list and a vector is a list, because the list is the more flexible data structure.
Solution 3
species <- c('ape', 'bat', 'cat', 'dog')
abund <- 1:4
data <- data.frame(species, abund)
str(data)
'data.frame': 4 obs. of 2 variables:
$ species: Factor w/ 4 levels "ape","bat","cat",..: 1 2 3 4
$ abund : int 1 2 3 4
Solution 4
sol1 <- days[c(-1, -7)]
sol2 <- days[seq(2, 7, 2)]
sol1
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
sol2
[1] "Monday" "Wednesday" "Friday"
Solution 5
fit <- lm(hindfoot_length ~ weight * species_id, animals)
summary(fit)
Call:
lm(formula = hindfoot_length ~ weight * species_id, data = animals)
Residuals:
Min 1Q Median 3Q Max
-24.0134 -0.6449 0.0519 0.7144 29.7848
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.819706 0.881185 14.548 < 2e-16 ***
weight 0.020964 0.099647 0.210 0.83337
species_idDM 19.878067 0.885692 22.444 < 2e-16 ***
species_idDO 20.320892 0.896080 22.678 < 2e-16 ***
species_idDS 32.082231 0.895996 35.806 < 2e-16 ***
species_idNL 16.814112 0.895635 18.773 < 2e-16 ***
species_idOL 6.469932 0.906738 7.135 9.86e-13 ***
species_idOT 6.436566 0.896637 7.179 7.20e-13 ***
species_idOX 5.949524 6.487854 0.917 0.35914
species_idPB 11.555911 0.889542 12.991 < 2e-16 ***
species_idPE 5.930293 0.905705 6.548 5.93e-11 ***
species_idPF 1.476920 0.899021 1.643 0.10043
species_idPH 9.421561 1.575405 5.980 2.25e-09 ***
species_idPI 8.373842 6.804910 1.231 0.21850
species_idPL 6.383570 1.408950 4.531 5.90e-06 ***
species_idPM 5.937775 0.907084 6.546 6.00e-11 ***
species_idPP 7.368245 0.889413 8.284 < 2e-16 ***
species_idPX 16.180294 18.538880 0.873 0.38279
species_idRF 4.028077 1.351396 2.981 0.00288 **
species_idRM 2.828061 0.891669 3.172 0.00152 **
species_idRO 6.371783 3.079506 2.069 0.03855 *
species_idRX 0.513627 3.600336 0.143 0.88656
species_idSF 12.199667 1.019254 11.969 < 2e-16 ***
species_idSH 12.083991 0.963573 12.541 < 2e-16 ***
species_idSO 10.806853 1.060554 10.190 < 2e-16 ***
weight:species_idDM 0.055372 0.099668 0.556 0.57852
weight:species_idDO 0.029121 0.099701 0.292 0.77022
weight:species_idDS 0.021381 0.099656 0.215 0.83012
weight:species_idNL 0.004479 0.099652 0.045 0.96415
weight:species_idOL 0.018680 0.099870 0.187 0.85163
weight:species_idOT 0.020726 0.099874 0.208 0.83560
weight:species_idOX 0.055959 0.317826 0.176 0.86024
weight:species_idPB 0.033628 0.099717 0.337 0.73594
weight:species_idPE 0.046181 0.100101 0.461 0.64455
weight:species_idPF 0.140474 0.102035 1.377 0.16861
weight:species_idPH 0.092764 0.107858 0.860 0.38976
weight:species_idPI 0.027423 0.363536 0.075 0.93987
weight:species_idPL 0.022116 0.114392 0.193 0.84670
weight:species_idPM 0.057366 0.100132 0.573 0.56671
weight:species_idPP 0.070007 0.099883 0.701 0.48338
weight:species_idPX 0.520964 0.978368 0.532 0.59440
weight:species_idRF 0.028946 0.124757 0.232 0.81653
weight:species_idRM 0.054117 0.100441 0.539 0.59003
weight:species_idRO 0.393305 0.300913 1.307 0.19121
weight:species_idRX 0.312369 0.238136 1.312 0.18962
weight:species_idSF 0.007461 0.099955 0.075 0.94050
weight:species_idSH 0.029670 0.099776 0.297 0.76619
weight:species_idSO 0.014673 0.100138 0.147 0.88350

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.376 on 30690 degrees of freedom
(4811 observations deleted due to missingness)
Multiple R-squared: 0.9792, Adjusted R-squared: 0.9792
F-statistic: 3.078e+04 on 47 and 30690 DF, p-value: < 2.2e-16