Basic R

Lesson 1 with Mary Shelley

Why learn R?

What is R?


The Console

The interpreter accepts commands interactively through the console.

Basic math, as you would type it on a calculator, is usually a valid command in the R language:

1 + 2
[1] 3
4^2
[1] 16
Question
Why is the output prefixed by [1]?
Answer
That’s the index, or position in a vector, of the first result.

A command giving a vector of results shows this clearly:

seq(1, 50)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50

The interpreter understands more than arithmetic operations. That last command told it to use (or “call”) the function seq().

Most of “learning R” involves getting to know a whole lot of functions, the effect of each function’s arguments (e.g. the input values 1 and 50), and what each function returns (e.g. the output vector).

Basic math

A good place to begin learning R functions is with its built-in mathematical functionality.

Arithmetic operators

Try +, -, *, /, and ^ (for raising to a power).

5/3
[1] 1.666667

Logical tests

Test equality with == and inequality with <=, <, !=, >, or >=.

1/2 == 0.5
[1] TRUE
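
As a small illustration of the other comparison operators, the sketch below uses made-up values; the variable name big is only for this example. Comparisons also work element by element on a vector.

```r
3 <= 4                    # less than or equal to: TRUE
3 >= 4                    # greater than or equal to: FALSE
3 != 4                    # not equal to: TRUE

# A comparison against a vector is applied to each element in turn
big <- c(1, 5, 10) > 4
big                       # FALSE TRUE TRUE
```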

More Math

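R also provides common mathematical functions like sin, log, and sqrt, and constants like pi. Results are subject to floating-point rounding, so the value below is very close to, but not exactly, zero.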

sin(2 * pi)
[1] -2.449294e-16
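
Because of this rounding, an exact test with == can be misleading for decimal results. A minimal sketch of two common workarounds follows; the variable name s and the tolerance 1e-12 are arbitrary choices for illustration.

```r
s <- sin(2 * pi)

s == 0                     # FALSE: s is tiny but not exactly zero
isTRUE(all.equal(s, 0))    # TRUE: all.equal() compares within a small tolerance
abs(s) < 1e-12             # TRUE: an explicit tolerance check
```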

Programming idioms

R also provides common computer programming functions like rep, sort, and range.

rep(2, 5)
[1] 2 2 2 2 2
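
The other two functions can be tried the same way. In this sketch the vector v is just example data.

```r
v <- c(3, 1, 2)

sort(v)                       # 1 2 3
sort(v, decreasing = TRUE)    # 3 2 1
range(v)                      # 1 3 (the minimum and the maximum)
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"
```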

Parentheses

Sandwiching something with ( and ) has two possible meanings.

Group sub-expressions by parentheses on an as-needed basis.

(1 + 2) / 3
[1] 1

Call functions by typing their name and comma-separated arguments between parentheses.

logb(2, 2)
[1] 1
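
With two equal arguments it is hard to tell which one is the base. The sketch below names the base argument and then combines both uses of parentheses in one command; the numbers are arbitrary.

```r
logb(8, base = 2)               # 3, because 2^3 is 8
logb(100, base = 10)            # 2, because 10^2 is 100

# Grouping and a function call in the same expression
(logb(8, base = 2) + 1) / 2     # 2
```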

Exercise 1

Use the quadratic formula to find $x$ that satisfies the equation $1.5x^2 + 0.3x - 2.9 = 0$.

Assignment

When you start a new session, the R interpreter already recognizes many things, including numbers, character strings, and built-in functions.

To reference a number or function just type it in as above. To reference a string of characters, surround them in quotation marks.

'ab.cd'
[1] "ab.cd"

Without quotation marks, the interpreter checks for things named ab.cd and doesn’t find anything:

ab.cd
Error in eval(expr, envir, enclos): object 'ab.cd' not found
Question
Is it better to use ' or "?
Answer
Neither one is better. You will often encounter stylistic choices like this, so if you don’t have a personal preference try to mimic existing styles.

You can expand the vocabulary known to the R interpreter by creating a new variable. Using the symbol <- is referred to as assignment: the output of any command to the right of <- gets the name given on its left.

x <- seq(1, 50)

You’ll notice that nothing prints to the console, because we assigned the output to a variable. We can print the value of x by evaluating it without assignment.

x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50

Assigning values to new variables (to the left of a <-) is the only time you can reference something previously unknown to the interpreter. All other commands must reference things already in the interpreter’s vocabulary.

Once assigned to a variable, a value becomes known to R and you can refer to it in other commands.

y <- 'ab.cd'
typeof(y)
[1] "character"
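
As a further sketch of how assigned variables feed into later commands, the example below reuses x from above; the name total is only illustrative. Assigning to an existing name replaces its previous value.

```r
x <- seq(1, 50)

total <- sum(x)   # use x in another command and store the result
total             # 1275

x <- x * 2        # assigning to an existing name overwrites its value
max(x)            # 100
```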


The Environment

In the RStudio IDE, the environment tab displays the variables added to R’s vocabulary in the current session.


The Editor

The console is for evaluating commands you don’t intend to keep or reuse. It’s useful for testing commands and poking around. The editor is where you compose scripts that will process data, perform analyses, code up visualizations, and even write reports.

These work together in RStudio, which has multiple ways to send parts of the script you are editing to the console for immediate evaluation. Alternatively you can “source” the entire script.

Open up “worksheet-1.R” in the editor, and follow along by replacing the ... placeholders with the code here. Then evaluate just this line (Ctrl+Enter on Windows, ⌘+Enter on macOS).

vals <- seq(1, 100)

Our call to the function seq could have been much more explicit. We could give the arguments by the names that seq is expecting.

vals <- seq(from = 1,
            to = 100)

Run that code by moving your cursor anywhere within those two lines and clicking “Run”, or by using the keyboard shortcut Ctrl-Return or ⌘ Return.

Question
What’s an advantage of naming arguments?
Answer
One advantage is that you can put them in any order. A related advantage is that you can then skip some arguments, which is fine to do if each skipped argument has a default value. A third advantage is code readability, which you should always be conscious of while writing in the editor.
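
The first two advantages can be seen in a short sketch using seq and its by and length.out arguments; the values and the names reordered and skipped are arbitrary.

```r
# Named arguments can be given in any order
reordered <- seq(to = 10, from = 2, by = 4)
reordered     # 2 6 10

# Naming also lets you skip over arguments: here by is left out
# and length.out is given instead
skipped <- seq(from = 0, to = 1, length.out = 5)
skipped       # 0.00 0.25 0.50 0.75 1.00
```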

Readability

Code readability in the editor cuts both ways: sometimes verbosity is useful, sometimes it is cumbersome.

The seq() function has an alternative form available when only the from and to arguments are needed.

1:100
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
 [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
 [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
 [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
 [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
 [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100

The : operator should be used whenever possible because it replaces a common, cumbersome function call with a brief, intuitive syntax. Likewise, the assign function duplicates the functionality of the <- symbol, but is never used when the simpler operator will suffice.
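
A quick check of both equivalences, as a sketch; the variable names a and b are placeholders.

```r
# The operator and the function call give the same vector
identical(1:100, seq(1, 100))   # TRUE

# assign() takes the variable name as a string; <- does the same job
assign("a", 5)
b <- 5
identical(a, b)                 # TRUE
```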

Function documentation

How would you get to know these properties and the names of a function’s arguments?

?seq

How would you even know what function to call?

??sequence
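
Argument names can also be inspected from code rather than from the help page. The sketch below assumes you want the arguments of seq.default, which is the method seq uses for ordinary numbers; arg_names is a name chosen for this example.

```r
# formals() returns a function's arguments; names() extracts their names
arg_names <- names(formals(seq.default))
arg_names

# args() prints the same information in the form of a function signature
args(seq.default)
```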


Data types

Type       Example
double     3.1, -4, Inf, NaN
integer    -4L, 0L, 999L
character  'a', '4', '👏'
logical    TRUE, FALSE
missing    NA
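
The typeof function reports which of these types a value has. A brief sketch with values taken from the table:

```r
typeof(3.1)      # "double"
typeof(-4)       # "double": numbers are double unless marked with L
typeof(-4L)      # "integer"
typeof('a')      # "character"
typeof(TRUE)     # "logical"
typeof(NA)       # "logical": a plain NA is stored as a logical value
```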

Data structures

Compound objects, built from one or more of these data types, or even other objects.

Common one-dimensional, array data structures:

Vectors

Vectors are the basic data structure in R. They are a collection of data that are all of the same type. Create a vector by combining elements together using the function c().

counts <- c(4, 3, 7, 5, 2)

All elements of a vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type.

c(1, 2, "c")
[1] "1" "2" "c"
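
Coercion follows an order of flexibility: logical, then integer, then double, then character. A short sketch with arbitrary values:

```r
c(TRUE, 2L)            # 1 2        (logical becomes integer)
c(1L, 2.5)             # 1.0 2.5    (integer becomes double)
c(2.5, "a")            # "2.5" "a"  (double becomes character)

typeof(c(TRUE, 2L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"
```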

Lists

Lists are like vectors but their elements can be of any data type or structure.

Construct lists with list() instead of c():

list(1, 2, "c")
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] "c"

Lists can even include another list!

list(1, list(2, 3))
[[1]]
[1] 1

[[2]]
[[2]][[1]]
[1] 2

[[2]][[2]]
[1] 3
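
The double brackets in that printout are also how you reach into a list. A minimal sketch, where nested is a name chosen for this example:

```r
nested <- list(1, list(2, 3))

length(nested)         # 2: a number and an inner list
nested[[1]]            # 1
nested[[2]][[1]]       # 2: first element of the inner list

# Single brackets keep the list wrapper; double brackets extract the element
typeof(nested[1])      # "list"
typeof(nested[[1]])    # "double"
```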

Exercise 2

Use the typeof function to inspect the data type of counts, and do the same for another variable to which you assign a list of numbers. Why are they different? Use c to combine counts with the new variable you just created and inspect the result with typeof. Does c always create vectors?

Factors

A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are like integer vectors, but possess a levels attribute that assigns names to however many discrete categories are specified.

Use factor() to create a vector with predefined values, which are often characters or “strings”.

education <- factor(
    c("college", "highschool", "college", "middle", "middle"),
    levels = c("middle", "highschool", "college"))

The str function notes the labels, but prints the integers assigned in their stead.

str(education)
 Factor w/ 3 levels "middle","highschool",..: 3 2 3 1 1
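
The labels and the underlying integers can also be pulled apart directly. This sketch repeats the definition of education so that it runs on its own.

```r
education <- factor(
    c("college", "highschool", "college", "middle", "middle"),
    levels = c("middle", "highschool", "college"))

levels(education)        # "middle" "highschool" "college"
as.integer(education)    # 3 2 3 1 1: positions within levels
table(education)         # how many values fall in each level
```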


Multi-dimensional data structures

Data can be stored in several additional data structures depending on its complexity.

Dimensions  Homogeneous  Heterogeneous
1d          c()          list()
2d          matrix()     data.frame()
nd          array()

Of these, the data frame is far and away the most used.

Data frames

Data frames are 2-dimensional and can contain heterogeneous data like numbers in one column and a factor in another.

It is the data structure most similar to a spreadsheet, with two key differences:

Creating a data frame from scratch can be done by combining vectors with the data.frame() function.

df <- data.frame(education, counts)

There are several functions to get to know a data frame:

dim()           dimensions
nrow(), ncol()  number of rows, columns
names()         (column) names
str()           structure
summary()       summary info
head()          shows beginning rows

names(df)
[1] "education" "counts"   
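
A few of the other functions applied to the same data frame, as a sketch. The definitions of education, counts, and df are repeated so the block runs on its own.

```r
education <- factor(
    c("college", "highschool", "college", "middle", "middle"),
    levels = c("middle", "highschool", "college"))
counts <- c(4, 3, 7, 5, 2)
df <- data.frame(education, counts)

dim(df)               # 5 2: rows, then columns
nrow(df)              # 5
ncol(df)              # 2
head(df, 2)           # the first two rows
summary(df$counts)    # minimum, quartiles, mean, and maximum of one column
```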

Exercise 3

Create a data frame with two columns, one called “species” with four strings and another called “abund” with four numbers. Store your data frame as a variable called data.


Parts of an Object

Parts of a data structure are always accessible, either by their name or by their position, using square brackets: [ and ].

Position

counts[1]
[1] 4
counts[3]
[1] 7

Names

Parts of an object may also have a name. The names can be given when you are creating a vector or afterwards using the names() function.

df['education']
   education
1    college
2 highschool
3    college
4     middle
5     middle
names(df) <- c('ed', 'ct')
df['ed']
          ed
1    college
2 highschool
3    college
4     middle
5     middle
Question
This use of <- with names(x) on the left is a little odd. What’s going on?
Answer
We are overwriting an existing variable, but one that is accessed through the output of the function on the left rather than the global environment.
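
The same naming works on plain vectors. The sketch below shows names given at creation and names set afterwards; the labels and the variable name named are made up for illustration.

```r
# Names given when the vector is created
named <- c(first = 10, second = 20)
names(named)            # "first" "second"
named[["second"]]       # 20

# Names added afterwards with names() on the left of <-
counts <- c(4, 3, 7, 5, 2)
names(counts) <- c("a", "b", "c", "d", "e")
counts["c"]             # the element named c, which is 7
```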

For a multi-dimensional object, give one index per dimension, separated by commas: the row first, then the column.

df[3, 'ed']
[1] college
Levels: middle highschool college

It’s fine to mix names and indices when selecting parts of an object.

There are multiple ways to access several parts of an object together.

Part       Result
positives  elements at given positions
negatives  given positions omitted
logicals   elements where the corresponding position is TRUE
nothing    all the elements

days <- c(
  "Sunday", "Monday", "Tuesday", "Wednesday",
  "Thursday", "Friday", "Saturday")
weekdays <- days[2:6]
weekend <- days[c(1, 7)]
weekdays
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"   
weekend
[1] "Sunday"   "Saturday"
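
The remaining rows of the table can be sketched the same way. The test nchar(days) == 6 is just an arbitrary condition chosen to produce a logical vector.

```r
days <- c(
  "Sunday", "Monday", "Tuesday", "Wednesday",
  "Thursday", "Friday", "Saturday")

days[-1]                  # negatives: everything except the first element
days[nchar(days) == 6]    # logicals: "Sunday" "Monday" "Friday"
days[]                    # nothing: all seven elements
```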

Exercise 4

  1. Get weekdays using negative integers.
  2. Get M-W-F using a vector of positions generated by seq() that uses the by argument (don’t forget to ?seq for help).

Subsetting data frames

The $ sign is an operator that makes for quick access to a single, named part of an object. It’s most useful when used interactively with “tab completion” on the columns of a data frame.

df$ed
[1] college    highschool college    middle     middle    
Levels: middle highschool college

A logical test applied to a single column produces a vector of TRUE and FALSE values that’s the right length for subsetting the data.

df[df$ed == 'college', ]
       ed ct
1 college  4
3 college  7
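
Logical tests can be combined with & (and) or | (or) before subsetting. This sketch rebuilds the renamed data frame so it runs on its own; the threshold values and the name busy are arbitrary.

```r
df <- data.frame(
    ed = factor(
        c("college", "highschool", "college", "middle", "middle"),
        levels = c("middle", "highschool", "college")),
    ct = c(4, 3, 7, 5, 2))

df$ed == 'college'                              # TRUE FALSE TRUE FALSE FALSE

# Rows where both conditions hold (only row 3)
busy <- df[df$ed == 'college' & df$ct > 5, ]
busy

# A logical test for the rows, and a name for the column
df[df$ct < 4, 'ct']                             # 3 2
```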


Functions

Functions package up a batch of commands. There are several reasons to develop functions in R for data analysis:

Writing functions to use multiple times within a project prevents you from duplicating code, a real time-saver when you want to update what the function does. If you see blocks of similar lines of code through your project, those are usually candidates for being moved into functions.

Anatomy of a function

Like all programming languages, R has keywords that are reserved for important activities, like creating functions. Keywords are usually very intuitive; the one we need is function.

function(...) {
    ...
    return(...)
}

Three components:

We’ll make a function to extract the first row of its argument, which we give a name to use inside the function:

function(z) {
    result <- z[1, ]
    return(result)
}

Note that z doesn’t exist until we call the function, which merely contains the instructions for how any z will be handled.

Finally, we need to give the function a name so we can use it like we used c() and seq() above.

first <- function(z) {
    result <- z[1, ]
    return(result)
}
first(df)
       ed ct
1 college  4
Question
Can you explain the result of entering first(counts) into the console?
Answer
The function caused an error, which prompted the interpreter to print a helpful error message. Never ignore an error message. (It’s okay to ignore a “warning”.)
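
The error arises because counts is a vector with no rows or columns, so z[1, ] has one index too many. The sketch below captures the message with tryCatch and then shows one possible variant, hypothetically named first_any, that checks for dimensions first. It is an illustration under that assumption, not the lesson's own solution.

```r
first <- function(z) {
    result <- z[1, ]
    return(result)
}
counts <- c(4, 3, 7, 5, 2)

# Capture the error message instead of stopping the script
msg <- tryCatch(first(counts), error = function(e) conditionMessage(e))
msg

# Hypothetical variant: vectors have no dim(), so use a single index for them
first_any <- function(z) {
    if (is.null(dim(z))) {
        result <- z[1]
    } else {
        result <- z[1, ]
    }
    return(result)
}
first_any(counts)    # 4
```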


Review

In this introduction to R, we touched on several key parts of scripting for data analysis.

Special characters in R

Perhaps more than most languages, an R script can appear like a jumble of arcane symbols. Here is a little table of characters to recognize as having special meaning.

Symbol         Meaning
?              get help
#              comment
:              sequence
::, :::        access namespaces (advanced)
<-             assignment
$, [ ], [[ ]]  subsetting
% %            infix operators, e.g. %*%
{ }            statements
.              no fixed meaning
@              slot (advanced)

The . in R has no fixed meaning and is often used as _ might be used to separate words in a variable name.


Exercise solutions

Solution 1

(-0.3 + sqrt(0.3 ^ 2 - 4 * 1.5 * -2.9)) / (2 * 1.5)
[1] 1.294035

Solution 2

x <- list(3, 4, 5, 7)
typeof(counts)
[1] "double"
typeof(x)
[1] "list"
typeof(c(counts, x))
[1] "list"

The variable x has a data type of list, so R does not restrict its elements to a particular type as it does for vectors. The result of combining a list and vector is a list, because the list is the more flexible data structure.

Solution 3

species <- c('ape', 'bat', 'cat', 'dog')
abund <- 1:4
data <- data.frame(species, abund)
str(data)
'data.frame':	4 obs. of  2 variables:
 $ species: Factor w/ 4 levels "ape","bat","cat",..: 1 2 3 4
 $ abund  : int  1 2 3 4

This output comes from a version of R before 4.0.0, where data.frame() converted strings to factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so species appears as a chr column unless you pass stringsAsFactors = TRUE.

Solution 4

sol1 <- days[c(-1, -7)]
sol2 <- days[seq(2, 7, 2)]
sol1
[1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"   
sol2
[1] "Monday"    "Wednesday" "Friday"   
