Database Principles and Use

Lesson 3 with Ian Carroll

For a team of researchers implementing a collaborative workflow, the top reasons to use a database are:

concurrency
reliability
scalability

Top of Section

Objectives for this Lesson

Understand how a database differs from a data file
Discover relational databases
Meet Structured Query Language (SQL)
Recognize the value of typed data

Specific Achievements

Access a SQLite database from R
Select data to read into data frame
Test primary and foreign key constraints

Top of Section

The Portal Project

Acknowledgement: Photos from portalproject.wordpress.com.

The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.

The research site consists of many “plots” – patches of the Arizona desert that become intensively manipulated and observed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.

The plots have a lot of dynamic characteristics too, and are regularly surveyed with a particular focus on their changing species compostion. Every survey of each plot produces a distinct set of observations that get recorded in a well-structure database, designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment.

Top of Section

Characteristics of a Database: Part I

Let’s begin with defining some terminology that is familiar from any system for data storage. For concreteness, we will use the case study just introduced to provide examples.

Field: The smallest unit of information, each having a label and holding a value of the same type.; e.g. The time of the survey in which a small mammal was captured.
Record: A collection of related values, from different fields, that all describe the same entity.; e.g. The species, sex, size and location of a small mammal observed during a given survey.
Table: A collection of records, each one uniquely identified by the value of one or more fields.; e.g. All records of small mammals that were observed in any survey.

Limitations of Data Files

Collaboration: Email copies among collaborators, store in the cloud (sync issues) or save to a network (user collision).
Size: Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
Quality: Enforcing restrictions on values is not the default behaviour of software that writes data files.
Extensions: Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
API: No standard approach for different application to read, edit or create records.

Database Solutions

Collaboration: A database accepts simultaneous users, which is most beneficial when the database is hosted on a network. There are never multiple copies of the data (aside from your backups!)
Size: The database management system only reads requested parts of the data into memory. There is no size limit!
Quality: Data types are enforced by default. No database lets you enter “9I1” in an integer field.
Extension: Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
API: There are packages native to every programming language that ease reading and writing to databases.

Database connections from R

library(RPostgreSQL)
con <- dbConnect(PostgreSQL(), host="pg.sesync.org", user="icarroll")
dbListTables(con)

Database connections from R

library(RSQLite)
con <- dbConnect(SQLite(), "data/portal.sqlite")
dbListTables(con)

[1] "plots"   "species" "surveys"

Two ways to access data

Read the data into an R data frame, and process it using R functions.

plots <- dbReadTable(con, "plots")
surveys <- dbReadTable(con, "surveys")
species <- dbReadTable(con, "species")

str(surveys)

'data.frame':	34786 obs. of  9 variables:
 $ record_id      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month          : int  7 7 7 7 7 7 7 7 7 7 ...
 $ day            : int  16 16 16 16 16 16 16 16 16 16 ...
 $ year           : int  1977 1977 1977 1977 1977 1977 1977 1977 1977 1977 ...
 $ plot_id        : int  2 3 2 7 3 1 2 1 1 6 ...
 $ species_id     : chr  "NL" "NL" "DM" "DM" ...
 $ sex            : chr  "M" "M" "F" "M" ...
 $ hindfoot_length: int  32 33 37 36 35 14 NA 37 34 20 ...
 $ weight         : int  NA NA NA NA NA NA NA NA NA NA ...

Let the database system process data.

dbGetQuery(con, "select species_id, weight
                 from surveys
                 where plot_id = 1 limit 5")

  species_id weight
1         PF     NA
2         DM     NA
3         DM     NA
4         PF      9
5         DS     NA

The string inside the brackets is an example of Structured Query Language (SQL). The SQL instructions tell the database system (SQLite, in this case) to sort out and return only the records requested.

Exercise 1

Use dbGetQuery() to select the “species_id”, and two other fields from the “species” table. Hint: use dbListFields() to check field names.

Top of Section

Characteristics of a Database: Part II

Returning to terminology, some database tools don’t apply to data files at all. Databases include a collection of tables, just as MS Excel includes a collection of spreadsheets in a workbook. The collection of tables in a relational database, however, is structured by relationships between records from different tables.

Relationships are specified through primary and foreign keys, but which is which depends on context.

Primary key

One or more fields (but usually one) that uniquely identify a record in a table.

Question: What would be the primary key in the surveys table. How do you know?

Primary key

In the plots table, plot_id is the primary key.

plot_id	plot_type
1	Control
2	Rodent Exclosure
3	Control

As a result, a new record cannot duplicate any plot_id.

dbGetQuery(con, "insert into plots
                 (plot_id, plot_type)
                 values (1, 'Control')")

Error in sqliteSendQuery(con, statement, bind.data) : 
  rsqlite_query_send: could not execute1: UNIQUE constraint failed: plots.plot_id

Foreign key

A primary key from table A used in table B to express their relationship.

Question: Identify a primary / foreign key pair in the portal workbook.

Foreign key

In the surveys table, record_id is the primary key and plot_id is a foreign key.

record_id	month	day	year	plot_id	species_id	sex	hindfoot_length	weight
1	7	16	1977	2	ST	M	32	0.45
2	7	16	1977	2	PX	M	33	0.23
3	7	16	1978	1	RO	F	14	1.23

With a properly designed database, references to invalid foreign keys cannot be entered.

dbGetQuery(con, "insert into surveys
                 (record_id, plot_id, species_id, sex)
                 values (35549, 1, '00', 'M')")

Error in sqliteSendQuery(con, statement, bind.data) : 
  rsqlite_query_send: could not execute1: FOREIGN KEY constraint failed

Normalized data is Tidy data

Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to minimize data redundency.

For example, the genus and species names are not associated with every survey record – only with the species_id. Data about the species is a different “observational unit” than data about the individual caught in a survey.

With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.

Question: Currently, plots is pretty sparse. What other kind of data might go into plots?
Answer: Additional properties, such as location, that do not change over time.

Top of Section

When untidy data okay

Record, store, and read data in the most normalized form possible. Only untidy your tables as needed for particular analyses. The most common operation for untidying is to “join” tables. For example, to carry out a regression with formula:

“weight ~ month + plot_type”

you’ll need a table with rows for each “record_id” that also includes a “plot_type”, which can be inferred from the “plot_id”. There are two kinds of table relationships – uses of primary and foreign key references – that permit database joins.

One-To-Many Relationship

The SQL keyword “join” lines up two tables, repeating records as necessary, to satisfy the constrain given after “on”.

df <- dbGetQuery(con, "select weight, month, plot_type
                       from surveys
                       join plots on surveys.plot_id = plots.plot_id
                       where weight is not null")
str(df)

'data.frame':	32283 obs. of  3 variables:
 $ weight   : int  40 48 29 46 36 52 8 22 35 7 ...
 $ month    : int  8 8 8 8 8 8 8 8 8 8 ...
 $ plot_type: chr  "Long-term Krat Exclosure" "Rodent Exclosure" "Control" "Control" ...

Many-To-Many Relationship

Exercise 2

Construct a data frame that you could use to fit the regression model “weight ~ month + plot_type + taxa”.

Top of Section

Data Types

A key tactic in data quality control is precise (and enforceable!) definitions of the values allowed for each field. We just saw integer, text and the slightly more complicated serial.

Some of the essential data types, specifically for a PostgreSQL database.

Type	Description
integer	whole numbers between -2,147,483,648 and 2,147,483,647
double	Inexact, variable-precision numeric value that can represent almost any number, and several non-numbers
decimal	Exact numeric value with user-specified precision (decimal places)
char	Fixed-length character string
varchar	Variable-length (up to a limit) character string
boolean	True or False
date	day with no time specification

Data Granularity

The degree to which information is broken up into diferent fields is called granularity, and it’s an important consideration in database design. On one hand, because it is easier to aggregate fields than disaggregate them down the road, initially erring on the side of too granular is best. On the other hand, some consideration of how each field will get used can lead to less granularity.

Question: In the surveys table, the date is broken into three integer fields. What are some advantages and disadvantages of such fine granularity over using a single field of type date?
Answer: Advantages include easy selection by month, day or year. Disadvantages include difficulty of selecting by date range.

Top of Section

Summary

Databases are a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.

The ability to precisely define keys and data types is the primary database feature that guaranties reliability. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs in your code.

The third major feature to motivate databae use, scaleability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful database system outside of R.

Top of Section

Exercise Solutions

Solution 1

df <- dbGetQuery(con, "select species_id, taxa, genus
                       from species")

Solution 2

df <- dbGetQuery(con, "select weight, month, plot_type, taxa
                       from surveys
                       join plots on surveys.plot_id = plots.plot_id
                       join species on surveys.species_id = species.species_id
                       where weight is not null")

Top of Section