Database Principles and Use
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
What good is a database?
For a team of researchers implementing a collaborative workflow, the top three reasons to use a database are:
- concurrency - Multiple users can safely read and edit database entries simultaneously.
- reliability - Relational databases formalize and can enforce concepts of “tidy” data.
- scalability - Very large tables can be read, searched, and merged quickly.
In this lesson, the term “database” more precisely means a relational database that allows data manipulation with Structured Query Language (SQL). Strictly speaking, the word “database” describes any collection of digitized data, a distinction that has mostly outlived its usefulness.
This lesson also assumes you have access to a database server populated with data by the lesson instructor. Check back here for forthcoming instructions to get set up with Docker.
Objectives
- Understand how a database differs from a data file
- Introduce the PostgreSQL database system
- Meet Structured Query Language (SQL)
- Recognize the value of typed data
Specific Achievements
- Access a database from RStudio
- Create a table and view table definitions
- Insert records one at a time into a table
- Check primary and foreign key constraints
The Portal Project
The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.
The research site consists of many plots – patches of the Arizona desert that are intensively manipulated and repeatedly surveyed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.
The plots have a lot of dynamic characteristics, and those changes are recorded in repeated surveys. In particular, the animals captured during each survey are identified to species, weighed, and measured.
Data from the Portal project is recorded in a relational database designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment.
This lesson uses real data, which has been analyzed in over 100 publications. The data have been simplified just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we learn today.
The three key tables in the relational database are:
- plots
- animals
- species
Database Characteristics: I
Database terminology builds on common ways of characterizing data files. The breakdown of a table into records (by row) or fields (by column) is familiar to anyone who’s worked in spreadsheets. The descriptions below formalize these terms, and provide an example referencing the Portal mammals database.
Term | Definition |
---|---|
Field | The smallest unit of information, each having a label and holding a value of the same type. e.g. The day on which a plot was surveyed. |
Record | A collection of related values, from different fields, that all describe the same entity. e.g. The species, sex, weight and size of a small mammal captured during a survey. |
Table | A collection of records, each one uniquely identified by the value of a key field. e.g. All records for small mammals observed during any survey. |
Database software provides different tools for working with tables than a spreadsheet program does. In general, a database is tooled for production environments, while data files are tooled for ease of use.
Collaboration
- Data files are stored in the cloud (sync issues), shared on a network (user collision), or copies are emailed among collaborators.
- A database accepts simultaneous users from different clients on a network. There are never multiple copies of the data (aside from backups!).
Size
- Reading an entire data file into memory isn’t scalable. Some file types (e.g. MS Excel files) have size limits.
- Database software only reads requested parts of the data into memory, and there are effectively no size limits.
Quality
- Data file formats do not typically provide any quality controls.
- Databases strictly enforce data types on each field. Letters, for example “N.A.”, cannot be entered into a field for integers.
Extension
- Specialized files are needed for complicated data types (e.g. ESRI Shapefiles).
- Databases provide many non-standard data types, and very specialized ones (e.g. geometries) are available through extension packages.
Programming
- There is no standard way to read, edit or create records in data files of different formats or from different languages.
- Packages native to all popular programming languages provide access to databases using SQL.
The Database Server
A relational database management system (RDBMS) that resides on a server is ideal for collaborative, data-driven projects. Working with such a database requires communication over a network using the client/server model, which necessitates some way of finding the database server (its address) and some form of authentication (a username and password).
The client must know the following information to initiate communication with the database server:
- Hostname - name or ip address of the database server
- Port - typically the default (thus safe to ignore)
- Database - name of one database hosted by that server
- Username - a user authorized by the server to access the given database
- Password - password for the given username
We are going to look at the Portal mammals data on a server running PostgreSQL, which is an open-source database management system. The client we will use to communicate with the server is the RStudio IDE for R scripting, which is just one of very many applications that are clients for a PostgreSQL server.
For convenience and security when accessing a PostgreSQL server, some of this information should be stored in a password file, which PostgreSQL clients read from ~/.pgpass in your home directory on Linux and macOS. Each line of the file has the format:
hostname:port:database:username:password
Connections
The first step from the RStudio client is creating a connection object that verifies the given information by opening up a channel to the database server.
library(DBI)
con <- dbConnect(RPostgres::Postgres(),
host = ...,
dbname = ...,
user = ...)
With the connection object available, you can begin exploring the database.
> dbListTables(con)
[1] "plots" "animals" "species"
> dbListFields(con, 'species')
[1] "id" "genus" "species" "taxa"
Read an entire database table into an R data frame with dbReadTable, or if you prefer “tidyverse” functions, use the dplyr tbl function.
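For example, dbReadTable copies the whole table into an ordinary data frame (a quick sketch; `species_df` is just an illustrative name):

```r
# Pull the entire "species" table from the server into local memory
species_df <- dbReadTable(con, 'species')
head(species_df)
```

The tbl approach, shown next, instead creates a lazy reference to the remote table, so rows are only transferred when you ask for them.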
library(dplyr)
species <- tbl(con, 'species')
> species
# Source: table<species> [?? x 4]
# Database: postgres [student@docker01.research.sesync.org:5432/portal]
id genus species taxa
<chr> <chr> <chr> <chr>
1 AB Amphispiza bilineata Bird
2 AH Ammospermophilus harrisi Rodent
3 AS Ammodramus savannarum Bird
4 BA Baiomys taylori Rodent
5 CB Campylorhynchus brunneicapillus Bird
6 CM Calamospiza melanocorys Bird
7 CQ Callipepla squamata Bird
8 CS Crotalus scutalatus Reptile
9 CT Cnemidophorus tigris Reptile
10 CU Cnemidophorus uniparens Reptile
# ... with more rows
The dbWriteTable function provides a mechanism for uploading data, as long as the user specified in the connection object has permission to create tables.
df <- data.frame(
id = c(1, 2),
name = c('Alice', 'Bob')
)
dbWriteTable(con, 'observers', df,
append = TRUE)
> tbl(con, 'observers')
# Source: table<observers> [?? x 2]
# Database: postgres [student@docker01.research.sesync.org:5432/portal]
id name
<dbl> <chr>
1 1 Alice
2 2 Bob
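Note that nothing yet prevents duplicate records: running the identical dbWriteTable call a second time appends Alice and Bob again. A minimal sketch of the problem:

```r
# The same append, repeated, silently duplicates the rows
dbWriteTable(con, 'observers', df, append = TRUE)
tbl(con, 'observers')
# The id column now holds 1, 2, 1, 2 -- nothing enforces uniqueness
```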
Database Characteristics: II
Returning to the bigger picture and our comparison of storing data in files as opposed to a database, there are some concepts that only apply to databases. We have seen that databases include multiple tables–so far, that’s not so different from keeping multiple spreadsheets in one MS Excel workbook or in multiple CSV files. The collection of tables in a relational database, however, can be structured by built-in relationships between records from different tables. Data is assembled in the correct arrangement for analysis “just in time” by scripting database queries that join tables on these relationships.
Term | Definition |
---|---|
Primary key | One or more fields (but usually one) that uniquely identify a record in a table. |
Foreign key | A primary key from one table used in a different table to establish a relationship. |
Query | Collect values from tables based on their relationships and other constraints. |
Primary Keys
In the plots table, id is the primary key. Any new record cannot duplicate an existing id.
id | treatment |
---|---|
1 | Control |
2 | Rodent Exclosure |
3 | Control |
Creating the observers table with id as a primary key will prevent the duplication observed from multiple identical dbWriteTable calls.
dbCreateTable(con, 'observers', list(
id = 'serial primary key',
name = 'text'
))
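If a plain observers table already exists from the earlier appends, dbCreateTable may fail because the table name is taken. A hedged sketch of the clean-up step (assuming your user is allowed to drop the table):

```r
# Drop any existing observers table before re-creating it with a primary key
if (dbExistsTable(con, 'observers')) {
  dbRemoveTable(con, 'observers')
}
```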
When appending a data frame to the table created with “serial primary key”, the id is automatically generated and unique.
df <- data.frame(
name = c('Alice', 'Bob')
)
dbWriteTable(con, 'observers', df,
append = TRUE)
> tbl(con, 'observers')
# Source: table<observers> [?? x 2]
# Database: postgres [student@docker01.research.sesync.org:5432/portal]
id name
<int> <chr>
1 1 Alice
2 2 Bob
Primary keys are checked before duplicates end up in the data, throwing an error if necessary.
df <- data.frame(
id = c(1),
name = c('J. Doe')
)
dbWriteTable(con, 'observers', df,
append = TRUE)
Error in connection_copy_data(conn@ptr, sql, value): COPY returned error: ERROR: duplicate key value violates unique constraint "observers_pkey"
DETAIL: Key (id)=(1) already exists.
CONTEXT: COPY observers, line 1
Foreign Keys
A field may also be designated as a foreign key, which establishes a relationship between tables. A foreign key points to some primary key from a different table.
In the animals table, id is the primary key and both plot_id and species_id are foreign keys.
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
1 | 7 | 16 | 1977 | 2 | ST | M | 32 | 0.45 |
2 | 7 | 16 | 1977 | 2 | PX | M | 33 | 0.23 |
3 | 7 | 16 | 1978 | 1 | RO | F | 14 | 1.23 |
Foreign keys are checked before nonsensical references end up in the data:
df <- data.frame(
month = 7,
day = 16,
year = 1977,
plot_id = -1
)
dbWriteTable(con, 'animals', df,
append = TRUE)
Error in connection_copy_data(conn@ptr, sql, value): COPY returned error: ERROR: insert or update on table "animals" violates foreign key constraint "animals_plot_id_fkey"
DETAIL: Key (plot_id)=(-1) is not present in table "plots".
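One way to avoid such errors is to check which plot ids exist before inserting, for example by peeking at the plots table with the tbl function introduced earlier:

```r
# The id column of plots lists the only values a foreign key plot_id may take
tbl(con, 'plots')
```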
Query
Structured Query Language (SQL) is a high-level language for interacting with relational databases. Commands use intuitive English words but can be strung together and nested in powerful ways. SQL is not the only way to query a database from R (cf. dbplyr), but sometimes it is the only way to perform a complicated query.
To write SQL statements in RStudio, use the sql engine for code chunks in an RMarkdown file:
```{sql, connection = con}
...
```
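As an aside, the same statements can also be sent from ordinary R code with the DBI function dbGetQuery, which returns the result as a data frame; a small sketch using the first query from the next section:

```r
# dbGetQuery() runs a SQL statement on the server and returns a data frame
dbGetQuery(con, "SELECT year FROM animals")
```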
Basic queries
Let’s write a SQL query that selects only the year column from the animals table.
SELECT year FROM animals;
year |
---|
1977 |
1977 |
1977 |
1977 |
1977 |
1977 |
1977 |
1977 |
1977 |
1977 |
A note on style: we have capitalized the words SELECT and FROM because they are SQL keywords. Unlike R, SQL is case insensitive, so capitalization only helps for readability and is a good style to adopt.
To select data from multiple fields, include multiple fields as a comma-separated list right after SELECT:
SELECT year, month, day
FROM animals;
year | month | day |
---|---|---|
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
1977 | 7 | 16 |
The line break before FROM is also good form, particularly as the length of the query grows.
Or select all of the columns in a table using a wildcard:
SELECT *
FROM animals;
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
2 | 7 | 16 | 1977 | 3 | NL | M | 33 | NA |
3 | 7 | 16 | 1977 | 2 | DM | F | 37 | NA |
4 | 7 | 16 | 1977 | 7 | DM | M | 36 | NA |
5 | 7 | 16 | 1977 | 3 | DM | M | 35 | NA |
6 | 7 | 16 | 1977 | 1 | PF | M | 14 | NA |
7 | 7 | 16 | 1977 | 2 | PE | F | NA | NA |
8 | 7 | 16 | 1977 | 1 | DM | M | 37 | NA |
9 | 7 | 16 | 1977 | 1 | DM | F | 34 | NA |
10 | 7 | 16 | 1977 | 6 | PF | F | 20 | NA |
11 | 7 | 16 | 1977 | 5 | DS | F | 53 | NA |
Limit
We can use the LIMIT statement to select only the first few rows. This is particularly helpful when getting a feel for very large tables.
SELECT year, species_id
FROM animals
LIMIT 4;
year | species_id |
---|---|
1977 | NL |
1977 | DM |
1977 | DM |
1977 | DM |
Unique values
If we want only the unique values, so that we can quickly see what species have been sampled, we use DISTINCT:
SELECT DISTINCT species_id
FROM animals;
species_id |
---|
NA |
CB |
RM |
PB |
PE |
AB |
AH |
SS |
US |
RX |
If we select more than one column, then the distinct pairs of values are returned
SELECT DISTINCT year, species_id
FROM animals;
year | species_id |
---|---|
1997 | RM |
2002 | NL |
1995 | RX |
1999 | SS |
1985 | DM |
1982 | PP |
1989 | CB |
1994 | OT |
1978 | NL |
1997 | NL |
Calculations
We can also do calculations with the values in a query. For example, if we wanted to look at the mass of each individual, by plot, species, and sex, but needed it in kg instead of g, we would use:
SELECT plot_id, species_id,
sex, weight / 1000.0
FROM animals;
plot_id | species_id | sex | ?column? |
---|---|---|---|
3 | NL | M | NA |
2 | DM | F | NA |
7 | DM | M | NA |
3 | DM | M | NA |
1 | PF | M | NA |
2 | PE | F | NA |
1 | DM | M | NA |
1 | DM | F | NA |
6 | PF | F | NA |
5 | DS | F | NA |
The expression weight / 1000.0 is evaluated for each row and appended to that row in a new column.
You can assign the new column a name by typing “AS weight_kg” after the expression.
SELECT plot_id, species_id, sex,
weight / 1000.0 AS weight_kg
FROM animals;
plot_id | species_id | sex | weight_kg |
---|---|---|---|
3 | NL | M | NA |
2 | DM | F | NA |
7 | DM | M | NA |
3 | DM | M | NA |
1 | PF | M | NA |
2 | PE | F | NA |
1 | DM | M | NA |
1 | DM | F | NA |
6 | PF | F | NA |
5 | DS | F | NA |
Expressions can use any fields, any arithmetic operators (+ - * /) and a variety of built-in functions. For example, we could round the values to make them easier to read.
SELECT plot_id, species_id, sex,
ROUND(weight / 1000.0, 2) AS weight_kg
FROM animals;
plot_id | species_id | sex | weight_kg |
---|---|---|---|
3 | NL | M | NA |
2 | DM | F | NA |
7 | DM | M | NA |
3 | DM | M | NA |
1 | PF | M | NA |
2 | PE | F | NA |
1 | DM | M | NA |
1 | DM | F | NA |
6 | PF | F | NA |
5 | DS | F | NA |
The underlying data in the weight column of the table does not change. The query, which exists separately from the data, simply displays the calculation we requested in the query result pane.
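If you do want to keep a computed column for later use in R, one option (a sketch, reusing dbGetQuery) is to capture the query result as a data frame on the client side:

```r
# The result lives in R; the animals table on the server is untouched
weights_kg <- dbGetQuery(con, "
  SELECT plot_id, species_id, sex,
    ROUND(weight / 1000.0, 2) AS weight_kg
  FROM animals
")
```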
Filtering
Databases can also filter data, selecting only those records meeting certain criteria. For example, let’s say we only want data for the species “Dipodomys merriami”, which has a species code of “DM”. We need to add a WHERE clause to our query.
SELECT *
FROM animals
WHERE species_id = 'DM';
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
3 | 7 | 16 | 1977 | 2 | DM | F | 37 | NA |
4 | 7 | 16 | 1977 | 7 | DM | M | 36 | NA |
5 | 7 | 16 | 1977 | 3 | DM | M | 35 | NA |
8 | 7 | 16 | 1977 | 1 | DM | M | 37 | NA |
9 | 7 | 16 | 1977 | 1 | DM | F | 34 | NA |
12 | 7 | 16 | 1977 | 7 | DM | M | 38 | NA |
13 | 7 | 16 | 1977 | 3 | DM | M | 35 | NA |
14 | 7 | 16 | 1977 | 8 | DM | NA | NA | NA |
15 | 7 | 16 | 1977 | 6 | DM | F | 36 | NA |
16 | 7 | 16 | 1977 | 4 | DM | F | 36 | NA |
Of course, we can do the same thing with numbers.
SELECT *
FROM animals
WHERE year >= 2000;
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
30159 | 1 | 8 | 2000 | 1 | PP | F | 22 | 17 |
30160 | 1 | 8 | 2000 | 1 | DO | M | 35 | 53 |
30161 | 1 | 8 | 2000 | 1 | PP | F | 21 | 17 |
30162 | 1 | 8 | 2000 | 1 | DM | M | 36 | 50 |
30163 | 1 | 8 | 2000 | 1 | PP | M | 20 | 16 |
30164 | 1 | 8 | 2000 | 1 | PB | M | 26 | 27 |
30165 | 1 | 8 | 2000 | 1 | PP | F | 22 | 15 |
30166 | 1 | 8 | 2000 | 1 | PP | M | 23 | 19 |
30167 | 1 | 8 | 2000 | 1 | DO | M | 35 | 41 |
30168 | 1 | 8 | 2000 | 1 | PB | M | 25 | 24 |
More sophisticated conditions arise from combining tests with AND and OR. For example, suppose we want the data on Dipodomys merriami starting in the year 2000.
SELECT *
FROM animals
WHERE year >= 2000 AND species_id = 'DM';
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
30162 | 1 | 8 | 2000 | 1 | DM | M | 36 | 50 |
30179 | 1 | 8 | 2000 | 12 | DM | M | 36 | 60 |
30196 | 1 | 8 | 2000 | 17 | DM | M | 37 | 52 |
30197 | 1 | 8 | 2000 | 17 | DM | F | 34 | 43 |
30210 | 1 | 8 | 2000 | 22 | DM | M | 38 | 56 |
30215 | 1 | 8 | 2000 | 22 | DM | F | 34 | 28 |
30227 | 1 | 10 | 2000 | 4 | DM | M | 34 | 45 |
30241 | 1 | 10 | 2000 | 11 | DM | M | 35 | 43 |
30242 | 1 | 10 | 2000 | 11 | DM | M | 35 | 44 |
30244 | 1 | 10 | 2000 | 11 | DM | M | 35 | 44 |
Parentheses can be used to help with readability and to ensure that AND and OR are combined in the way that we intend. If we wanted to get all the animals for “DM” since 2000 or up to 1990 we could combine the tests using OR:
SELECT *
FROM animals
WHERE (year >= 2000 OR year <= 1990)
AND species_id = 'DM';
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
3 | 7 | 16 | 1977 | 2 | DM | F | 37 | NA |
4 | 7 | 16 | 1977 | 7 | DM | M | 36 | NA |
5 | 7 | 16 | 1977 | 3 | DM | M | 35 | NA |
8 | 7 | 16 | 1977 | 1 | DM | M | 37 | NA |
9 | 7 | 16 | 1977 | 1 | DM | F | 34 | NA |
12 | 7 | 16 | 1977 | 7 | DM | M | 38 | NA |
13 | 7 | 16 | 1977 | 3 | DM | M | 35 | NA |
14 | 7 | 16 | 1977 | 8 | DM | NA | NA | NA |
15 | 7 | 16 | 1977 | 6 | DM | F | 36 | NA |
16 | 7 | 16 | 1977 | 4 | DM | F | 36 | NA |
Normalized Data is Tidy
Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to define logical “observational units” and minimize data redundancy.
For example, the genus and species names are not attributes of an individual animal: they are attributes of the species to which that animal belongs. Data about a species belongs in a different observational unit from data about the animal captured in a survey. With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.
- Question: Currently, plots is pretty sparse. What other kind of data might go into plots?
- Answer: Additional properties, such as location, that do not change between surveys.
Un-tidy data with JOINs
A good data management principle is to record and store data in the most normalized form possible, and un-tidy your tables as needed for particular analyses. The SQL “JOIN” clause lets you create records with fields from multiple tables.
Consider for example what you must do to carry out a regression of animal weight against plot treatment using the R command:
lm(weight ~ treatment,
data = portal)
You need a “data.frame” called portal with rows for each animal that also includes a “treatment” inferred from “plot_id”.
Additionally suppose you want to account for genus in this regression, expanding the previous R command to:
lm(weight ~ genus + treatment,
data = portal)
You need another column for genus in the portal data.frame, inferred from “species_id” for each animal and the species table.
Relations
There are two kinds of relations (schemas that use primary and foreign key references) that permit table joins:
- One-To-Many
- Many-To-Many
One-To-Many Relationship
The primary key in the first table may be referenced multiple times by the foreign key of the second table.
The SQL keyword “JOIN” matches up two tables in the way dictated by the constraint following “ON”, duplicating records as necessary.
SELECT weight, treatment
FROM animals
JOIN plots
ON animals.plot_id = plots.id;
The resulting table could be the basis for the portal data.frame needed in the R command lm(weight ~ treatment, data = portal).
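As a sketch, assuming your user can read both tables, the join result can be pulled into R with dbGetQuery and passed straight to lm:

```r
# Assemble the un-tidied table on the server, then fit the model in R
portal <- dbGetQuery(con, "
  SELECT weight, treatment
  FROM animals
  JOIN plots
    ON animals.plot_id = plots.id
")
lm(weight ~ treatment, data = portal)
```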
Many-To-Many Relationship
Each primary key from the first table may relate to any number of primary keys from the second table and vice versa. A many-to-many relationship is induced by the existence of an “association table” involved in two one-to-many relations.
The animals table is an “association table” because it includes two foreign keys.
SELECT weight, genus, treatment
FROM animals
JOIN plots
ON animals.plot_id = plots.id
JOIN species
ON animals.species_id = species.id;
The resulting table could be the basis for the portal data.frame needed in the R command lm(weight ~ genus + treatment, data = portal).
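The same un-tidying can be done without writing SQL, using the dbplyr backend mentioned above; a sketch that joins the lazy tables and collects the result into R:

```r
library(dplyr)

# Joins and select() are translated to SQL and run on the server;
# collect() transfers the resulting rows into an R data frame
portal <- tbl(con, 'animals') %>%
  inner_join(tbl(con, 'plots'), by = c('plot_id' = 'id')) %>%
  inner_join(tbl(con, 'species'), by = c('species_id' = 'id')) %>%
  select(weight, genus, treatment) %>%
  collect()

lm(weight ~ genus + treatment, data = portal)
```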
Summary
Databases are a core element of a centralized workflow, accommodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.
The ability to precisely define keys and data types is the primary database feature that guarantees reliability. As you develop scripts for analysis and visualization, certainty that you’ll never encounter a “NaN” when you expect an integer will prevent, or help you catch, bugs in your code.
The third major feature motivating database use, scalability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful relational database management system (RDBMS), such as PostgreSQL.