Database Principles and Use

Lesson 3 with Ian Carroll

For a team of researchers implementing a collaborative workflow, the top reasons to use a database are:

  1. concurrency
  2. reliability
  3. scalability

Top of Section


Objectives for this Lesson

Specific Achievements

Top of Section


The Portal Project

Acknowledgement: Photos from portalproject.wordpress.com.

The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.

The research site consists of many “plots” – patches of the Arizona desert that become intensively manipulated and observed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.

The plots have a lot of dynamic characteristics too, and are regularly surveyed with a particular focus on their changing species compostion. Every survey of each plot produces a distinct set of observations that get recorded in a well-structure database, designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment.

Top of Section


Characteristics of a Database: Part I

Let’s begin with defining some terminology that is familiar from any system for data storage. For concreteness, we will use the case study just introduced to provide examples.

Field
The smallest unit of information, each having a label and holding a value of the same type.
e.g. The time of the survey in which a small mammal was captured.
Record
A collection of related values, from different fields, that all describe the same entity.
e.g. The species, sex, size and location of a small mammal observed during a given survey.
Table
A collection of records, each one uniquely identified by the value of one or more fields.
e.g. All records of small mammals that were observed in any survey.

Limitations of Data Files

Collaboration
Email copies among collaborators, store in the cloud (sync issues) or save to a network (user collision).
Size
Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
Quality
Enforcing restrictions on values is not the default behaviour of software that writes data files.
Extensions
Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
API
No standard approach for different application to read, edit or create records.

Database Solutions

Collaboration
A database accepts simultaneous users, which is most beneficial when the database is hosted on a network. There are never multiple copies of the data (aside from your backups!)
Size
The database management system only reads requested parts of the data into memory. There is no size limit!
Quality
Data types are enforced by default. No database lets you enter “9I1” in an integer field.
Extension
Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
API
There are packages native to every programming language that ease reading and writing to databases.

Database connections from R

library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="pg.sesync.org", user="icarroll")
dbListTables(con)

Database connections from R

library(RSQLite)
con <- dbConnect(SQLite(), "data/portal.sqlite")
dbListTables(con)
[1] "plots"   "species" "surveys"

Two ways to access data

Read the data into an R data frame, and process it using R functions.

plots <- dbReadTable(con, "plots")
surveys <- dbReadTable(con, "surveys")
species <- dbReadTable(con, "species")
str(surveys)
'data.frame':	34786 obs. of  9 variables:
 $ record_id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month          : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day            : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year           : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot_id        : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species_id     : chr  "NL" "NL" "DM" "DM" ...
 $ sex            : chr  "M" "M" "F" "M" ...
 $ hindfoot_length: int  32 33 37 36 35 14 NA 37 34 20 ...
 $ weight         : int  NA NA NA NA NA NA NA NA NA NA ...

Let the database system process data.

dbGetQuery(con, "select species_id, weight
                 from surveys
                 where plot_id = 1 limit 5")
  species_id weight
1         PF     NA
2         DM     NA
3         DM     NA
4         PF      9
5         DS     NA

The string inside the brackets is an example of Structured Query Language (SQL). The SQL instructions tell the database system (SQLite, in this case) to sort out and return only the records requested.

Exercise 1

Use dbGetQuery() to select the “species_id”, and two other fields from the “species” table. Hint: use dbListFields() to check field names.

Top of Section


Characteristics of a Database: Part II

Returning to terminology, some database tools don’t apply to data files at all. Databases include a collection of tables, just as MS Excel includes a collection of spreadsheets in a workbook. The collection of tables in a relational database, however, is structured by relationships between records from different tables.

Relationships are specified through primary and foreign keys, but which is which depends on context.

Primary key

One or more fields (but usually one) that uniquely identify a record in a table.

Question
What would be the primary key in the surveys table. How do you know?

Primary key

In the plots table, plot_id is the primary key.

plot_id plot_type
1 Control
2 Rodent Exclosure
3 Control

As a result, a new record cannot duplicate any plot_id.

dbGetQuery(con, "insert into plots
                 (plot_id, plot_type)
                 values (1, 'Control')")
Error in sqliteSendQuery(con, statement, bind.data) : 
  rsqlite_query_send: could not execute1: UNIQUE constraint failed: plots.plot_id

Foreign key

A primary key from table A used in table B to express their relationship.

Question
Identify a primary / foreign key pair in the portal workbook.

Foreign key

In the surveys table, record_id is the primary key and plot_id is a foreign key.

record_id month day year plot_id species_id sex hindfoot_length weight
1 7 16 1977 2 ST M 32 0.45
2 7 16 1977 2 PX M 33 0.23
3 7 16 1978 1 RO F 14 1.23

With a properly designed database, references to invalid foreign keys cannot be entered.

dbGetQuery(con, "insert into surveys
                 (record_id, plot_id, species_id, sex)
                 values (35549, 1, '00', 'M')")
Error in sqliteSendQuery(con, statement, bind.data) : 
  rsqlite_query_send: could not execute1: FOREIGN KEY constraint failed

Normalized data is Tidy data

Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to minimize data redundency.

For example, the genus and species names are not associated with every survey record – only with the species_id. Data about the species is a different “observational unit” than data about the individual caught in a survey.

With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.

Question
Currently, plots is pretty sparse. What other kind of data might go into plots?
Answer
Additional properties, such as location, that do not change over time.

Top of Section


When untidy data okay

Record, store, and read data in the most normalized form possible. Only untidy your tables as needed for particular analyses. The most common operation for untidying is to “join” tables. For example, to carry out a regression with formula:

“weight ~ month + plot_type”

you’ll need a table with rows for each “record_id” that also includes a “plot_type”, which can be inferred from the “plot_id”. There are two kinds of table relationships – uses of primary and foreign key references – that permit database joins.

One-To-Many Relationship

The SQL keyword “join” lines up two tables, repeating records as necessary, to satisfy the constrain given after “on”.

df <- dbGetQuery(con, "select weight, month, plot_type
                       from surveys
                       join plots on surveys.plot_id = plots.plot_id
                       where weight is not null")
str(df)
'data.frame':	32283 obs. of  3 variables:
 $ weight   : int  40 48 29 46 36 52 8 22 35 7 ...
 $ month    : int  8 8 8 8 8 8 8 8 8 8 ...
 $ plot_type: chr  "Long-term Krat Exclosure" "Rodent Exclosure" "Control" "Control" ...

Many-To-Many Relationship

Exercise 2

Construct a data frame that you could use to fit the regression model “weight ~ month + plot_type + taxa”.

Top of Section


Data Types

A key tactic in data quality control is precise (and enforceable!) definitions of the values allowed for each field. We just saw integer, text and the slightly more complicated serial.

Some of the essential data types, specifically for a PostgreSQL database.

Type Description
integer whole numbers between -2,147,483,648 and 2,147,483,647
double Inexact, variable-precision numeric value that can represent almost any number, and several non-numbers
decimal Exact numeric value with user-specified precision (decimal places)
char Fixed-length character string
varchar Variable-length (up to a limit) character string
boolean True or False
date day with no time specification

Data Granularity

The degree to which information is broken up into diferent fields is called granularity, and it’s an important consideration in database design. On one hand, because it is easier to aggregate fields than disaggregate them down the road, initially erring on the side of too granular is best. On the other hand, some consideration of how each field will get used can lead to less granularity.

Question
In the surveys table, the date is broken into three integer fields. What are some advantages and disadvantages of such fine granularity over using a single field of type date?
Answer
Advantages include easy selection by month, day or year. Disadvantages include difficulty of selecting by date range.

Top of Section


Summary

Databases are a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.

The ability to precisely define keys and data types is the primary database feature that guaranties reliability. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs in your code.

The third major feature to motivate databae use, scaleability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful database system outside of R.

Top of Section


Exercise Solutions

Solution 1

df <- dbGetQuery(con, "select species_id, taxa, genus
                       from species")

Solution 2

df <- dbGetQuery(con, "select weight, month, plot_type, taxa
                       from surveys
                       join plots on surveys.plot_id = plots.plot_id
                       join species on surveys.species_id = species.species_id
                       where weight is not null")

Top of Section