Database Principles and Use
Lesson 3 with Ian Carroll
For a team of researchers implementing a collaborative workflow, the top reasons to use a database are:
- concurrency
- reliability
- scalability
Objectives for this Lesson
- Understand how a database differs from a data file
- Discover relational databases
- Meet Structured Query Language (SQL)
- Recognize the value of typed data
Specific Achievements
- Access a SQLite database from R
- Select data to read into data frame
- Test primary and foreign key constraints
The Portal Project
Acknowledgement: Photos from portalproject.wordpress.com.
The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.
The research site consists of many “plots” – patches of the Arizona desert that become intensively manipulated and observed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.
The plots have a lot of dynamic characteristics too, and are regularly surveyed with a particular focus on their changing species compostion. Every survey of each plot produces a distinct set of observations that get recorded in a well-structure database, designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment.
Characteristics of a Database: Part I
Let’s begin with defining some terminology that is familiar from any system for data storage. For concreteness, we will use the case study just introduced to provide examples.
- Field
- The smallest unit of information, each having a label and holding a value of the same type.
- e.g. The time of the survey in which a small mammal was captured.
- Record
- A collection of related values, from different fields, that all describe the same entity.
- e.g. The species, sex, size and location of a small mammal observed during a given survey.
- Table
- A collection of records, each one uniquely identified by the value of one or more fields.
- e.g. All records of small mammals that were observed in any survey.
Limitations of Data Files
- Collaboration
- Email copies among collaborators, store in the cloud (sync issues) or save to a network (user collision).
- Size
- Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
- Quality
- Enforcing restrictions on values is not the default behaviour of software that writes data files.
- Extensions
- Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
- API
- No standard approach for different application to read, edit or create records.
Database Solutions
- Collaboration
- A database accepts simultaneous users, which is most beneficial when the database is hosted on a network. There are never multiple copies of the data (aside from your backups!)
- Size
- The database management system only reads requested parts of the data into memory. There is no size limit!
- Quality
- Data types are enforced by default. No database lets you enter “9I1” in an integer field.
- Extension
- Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
- API
- There are packages native to every programming language that ease reading and writing to databases.
Database connections from R
library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="pg.sesync.org", user="icarroll")
dbListTables(con)
Database connections from R
library(RSQLite)
con <- dbConnect(SQLite(), "data/portal.sqlite")
dbListTables(con)
[1] "plots" "species" "surveys"
Two ways to access data
Read the data into an R data frame, and process it using R functions.
plots <- dbReadTable(con, "plots")
surveys <- dbReadTable(con, "surveys")
species <- dbReadTable(con, "species")
str(surveys)
'data.frame': 34786 obs. of 9 variables:
$ record_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ month : int 7 7 7 7 7 7 7 7 7 7 ...
$ day : int 16 16 16 16 16 16 16 16 16 16 ...
$ year : int 1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
$ plot_id : int 2 3 2 7 3 1 2 1 1 6 ...
$ species_id : chr "NL" "NL" "DM" "DM" ...
$ sex : chr "M" "M" "F" "M" ...
$ hindfoot_length: int 32 33 37 36 35 14 NA 37 34 20 ...
$ weight : int NA NA NA NA NA NA NA NA NA NA ...
Let the database system process data.
dbGetQuery(con, "select species_id, weight
from surveys
where plot_id = 1 limit 5")
species_id weight
1 PF NA
2 DM NA
3 DM NA
4 PF 9
5 DS NA
The string inside the brackets is an example of Structured Query Language (SQL). The SQL instructions tell the database system (SQLite, in this case) to sort out and return only the records requested.
Exercise 1
Use dbGetQuery()
to select the “species_id”, and two other fields from the “species” table. Hint: use dbListFields()
to check field names.
Characteristics of a Database: Part II
Returning to terminology, some database tools don’t apply to data files at all. Databases include a collection of tables, just as MS Excel includes a collection of spreadsheets in a workbook. The collection of tables in a relational database, however, is structured by relationships between records from different tables.
Relationships are specified through primary and foreign keys, but which is which depends on context.
Primary key
One or more fields (but usually one) that uniquely identify a record in a table.
- Question
- What would be the primary key in the surveys table. How do you know?
Primary key
In the plots
table, plot_id
is the primary key.
plot_id | plot_type |
---|---|
1 | Control |
2 | Rodent Exclosure |
3 | Control |
As a result, a new record cannot duplicate any plot_id.
dbGetQuery(con, "insert into plots
(plot_id, plot_type)
values (1, 'Control')")
Error in sqliteSendQuery(con, statement, bind.data) :
rsqlite_query_send: could not execute1: UNIQUE constraint failed: plots.plot_id
Foreign key
A primary key from table A used in table B to express their relationship.
- Question
- Identify a primary / foreign key pair in the portal workbook.
Foreign key
In the surveys
table, record_id
is the primary key and plot_id
is a foreign key.
record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
1 | 7 | 16 | 1977 | 2 | ST | M | 32 | 0.45 |
2 | 7 | 16 | 1977 | 2 | PX | M | 33 | 0.23 |
3 | 7 | 16 | 1978 | 1 | RO | F | 14 | 1.23 |
With a properly designed database, references to invalid foreign keys cannot be entered.
dbGetQuery(con, "insert into surveys
(record_id, plot_id, species_id, sex)
values (35549, 1, '00', 'M')")
Error in sqliteSendQuery(con, statement, bind.data) :
rsqlite_query_send: could not execute1: FOREIGN KEY constraint failed
Normalized data is Tidy data
Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to minimize data redundency.
For example, the genus and species names are not associated with every survey record – only with the species_id. Data about the species is a different “observational unit” than data about the individual caught in a survey.
With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.
- Question
- Currently,
plots
is pretty sparse. What other kind of data might go into plots? - Answer
- Additional properties, such as location, that do not change over time.
When untidy data okay
Record, store, and read data in the most normalized form possible. Only untidy your tables as needed for particular analyses. The most common operation for untidying is to “join” tables. For example, to carry out a regression with formula:
“weight ~ month + plot_type”
you’ll need a table with rows for each “record_id” that also includes a “plot_type”, which can be inferred from the “plot_id”. There are two kinds of table relationships – uses of primary and foreign key references – that permit database joins.
One-To-Many Relationship
The SQL keyword “join” lines up two tables, repeating records as necessary, to satisfy the constrain given after “on”.
df <- dbGetQuery(con, "select weight, month, plot_type
from surveys
join plots on surveys.plot_id = plots.plot_id
where weight is not null")
str(df)
'data.frame': 32283 obs. of 3 variables:
$ weight : int 40 48 29 46 36 52 8 22 35 7 ...
$ month : int 8 8 8 8 8 8 8 8 8 8 ...
$ plot_type: chr "Long-term Krat Exclosure" "Rodent Exclosure" "Control" "Control" ...
Many-To-Many Relationship
Exercise 2
Construct a data frame that you could use to fit the regression model “weight ~ month + plot_type + taxa”.
Data Types
A key tactic in data quality control is precise (and enforceable!) definitions of the values allowed for each field.
We just saw integer
, text
and the slightly more complicated serial
.
Some of the essential data types, specifically for a PostgreSQL database.
Type | Description |
---|---|
integer | whole numbers between -2,147,483,648 and 2,147,483,647 |
double | Inexact, variable-precision numeric value that can represent almost any number, and several non-numbers |
decimal | Exact numeric value with user-specified precision (decimal places) |
char | Fixed-length character string |
varchar | Variable-length (up to a limit) character string |
boolean | True or False |
date | day with no time specification |
Data Granularity
The degree to which information is broken up into diferent fields is called granularity, and it’s an important consideration in database design. On one hand, because it is easier to aggregate fields than disaggregate them down the road, initially erring on the side of too granular is best. On the other hand, some consideration of how each field will get used can lead to less granularity.
- Question
- In the surveys table, the date is broken into three integer fields. What are some advantages and disadvantages of such fine granularity over using a single field of type date?
- Answer
- Advantages include easy selection by month, day or year. Disadvantages include difficulty of selecting by date range.
Summary
Databases are a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.
The ability to precisely define keys and data types is the primary database feature that guaranties reliability. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs in your code.
The third major feature to motivate databae use, scaleability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful database system outside of R.
Exercise Solutions
Solution 1
df <- dbGetQuery(con, "select species_id, taxa, genus
from species")
Solution 2
df <- dbGetQuery(con, "select weight, month, plot_type, taxa
from surveys
join plots on surveys.plot_id = plots.plot_id
join species on surveys.species_id = species.species_id
where weight is not null")