Database Principles and Use

Lesson 3 with Benoît Parmentier

Contents


What good is a database?

For a team of researchers implementing a collaborative workflow, the top reasons to use a database are:

  1. concurrency - Multiple users can safely read and edit database entries simultaneously.
  2. reliability - Relational databases formalize and can enforce concepts of “tidy” data.
  3. scalability - Very large tables can be read, searched, and merged quickly.

Objectives

Specific Achievements

Top of Section


The Portal Project


Credit: The Portal Project

The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.


Credit: The Portal Project

The research site consists of many plots – patches of the Arizona desert that are intensively manipulated and repeatedly surveyed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.


Credit: The Portal Project

The plots have a lot of dynamic characteristics, and those changes are recorded in repeated surveys. In particular, the animals captured during each survey are identified to species, weighed, and measured.


Credit: The Portal Project

Data from the Portal project is recorded in a relational database designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment. This lesson uses the real data, that has been analyzed in over 100 publications.

The data has been simplified just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we learn today.

The three key tables in the relational database are:

Top of Section


Characteristics of a Database: Part I

Most database terminology is the same as that used for characterizing data files. The examples below refer to the Portal mammals database.

Field
The smallest unit of information, each having a label and holding a value of the same type.
e.g. The day on which a plot was surveyed.
Record
A collection of related values, from different fields, that all describe the same entity.
e.g. The species, sex, weight and size of a small mammal captured during a survey.
Table
A collection of records, each one uniquely identified by the value of a key field.
e.g. All records for small mammals observed during any survey.

These characteristics could equally be applied to a spreadsheet, one that can be curated in a text file (e.g. a CSV file) as well as specialized formats (e.g. a MS Excel file).

Limitations of Data Files

Collaboration
Email copies among collaborators, store in the cloud (sync issues) or share on a network (user collision).
Size
Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
Quality
Enforcing restrictions on values is not the default behaviour of software that writes data files.
Extensions
Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
API
No standard approach for different application to read, edit or create records.

Database Solutions

Collaboration
A database accepts simultaneous users, which is most beneficial when the database is hosted on a server. There are never multiple copies of the data (aside from backups!)
Size
The database management system only reads requested parts of the data into memory. There are no size limits.
Quality
Data types are enforced by default. No database lets you “accidentally” enter letters (e.g. “N.A.”) in a field for integers.
Extension
Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
API
There are packages native to every programming language that ease reading and writing to databases.

The Database Server

A relational database management system (RDBMS) that resides on a server is ideal for collaborative, data-driven projects. Communication is therefore accross a network: it involves a server application and a client application.

We are going to look at the Portal mammals data on a server running PostgreSQL, which is an open-source database management system. The client we will use to communicate with the server is a browser-based application called PostgreSQL Studio, which is just one of very many clients that connects to a PostgreSQL database.

Open PostgreSQL Studio and login with the connection parameters:

A more typical database host would be a URL or IP address. Using “localhost” is unusual in that the server is not typically on the same system as the client.

The “Data” tab provides a quick preview of some records. The “Columns” and “Scripts” tabs provide detailed information on how data types are set on the different fields.

Columns
Most are integer or text data, and “nullable” fields allow missing data (regardless of data type).
Scripts
The Structured Query Language (SQL) statement that defines each table.

Exercise 1

Study the “Columns” and “Script” tabs for the species table. Pick a field that is currently allowed to be null, and guess how you would write a script beginning with CREATE TABLE species to prohibit null values in the database.

Top of Section


Table definition

Table definition uses PostgreSQL “CREATE TABLE” statements, as shown in the “Scripts” tab of PGStudio.

Simplified a bit, the documentation for “CREATE TABLE” follows the pattern:

CREATE TABLE [%schema%.]%table% (
  %column_name% %data_type% [%option% ...],
    ...
);

Exercise 2

Write and execute a CREATE TABLE statement for a new “observers” table with fields for “id” and “name”. Check that the new table exists (you may have to refresh the list).

Insert

The “INSERT” statement provides one way to populate a table with records:

INSERT INTO observers
VALUES (1, 'Alice'), (2, 'Bob');

From the database server’s perspective, tables are populated with some form of “INSERT” statement. There are many mechanisms that database clients devise to facilitate importing large data files. PGStudio, for example, provides a CSV file upload link on the “Data” tab. A reproducible pipeline that moves data from flat files into a database, however, should include imporation in a script – and this depends on the data files.

Top of Section


Characteristics of a Database: Part II

Returning to the big picture, some database terminology doesn’t apply to data files at all. Databases include multiple tables, just as MS Excel includes a multiple spreadsheets in a workbook. The collection of tables in a relational database, however, can be structured by built-in relationships between records from different tables. Data is assembled in the correct arrangement and “just in time” for analysis using scripted queries.

Primary key
One or more fields (but usually one) that uniquely identify a record in a table.
Foreign key
A primary key from one table used in different table to establish a relationship.
Query
A temporary table that subsets and/or merges tables according to query parameters.

Primary key

In the plots table, id is the primary key. Any new record cannot duplicate an existing id.

id treatment
1 Control
2 Rodent Exclosure
3 Control
Question
What would be the primary key in the animals table. How do you know?
Answer
The primary key is id, as shown in the table definition script.

Primary keys are checked before duplicates end up in the data:

INSERT INTO plots (id, treatment)
VALUES (1, 'Control');

Foreign key

A field may also be designated as a foreign key, which establishes a relationship between tables. A foreign key points to some primary key from a different table.

The “Constraints” tab shows two foreign keys associated with the “animals” table.

In the animals table, id is the primary key and both plot_id and species_id are foreign keys.

id month day year plot_id species_id sex hindfoot_length weight
1 7 16 1977 2 ST M 32 0.45
2 7 16 1977 2 PX M 33 0.23
3 7 16 1978 1 RO F 14 1.23

Foreign keys are checked before nonsensical references end up in the data:

INSERT INTO animals (id, plot_id)
VALUES (35550, 789);

Query

The “Data” tab is an example of a query. It is a temporary table generated on-demand by the RDBMS that takes a subset of the first 25 records from a particular table and displays all columns.

Top of Section


Structured Query Language (SQL)

SQL is a high-level language for interacting with relational databases. Commands use intuitive English words but can be strung together and nested in powerful ways.

We have used the “CREATE TABLE” and “INSERT” statements, but most SQL is written to perform “SELECT” statements.

Basic queries

Let’s write a SQL query that selects only the year column from the animals table. Select the button, paste in and run the following script.

SELECT year FROM animals;

A note on style: we have capitalized the words SELECT and FROM because they are SQL keywords. Unlike R, SQL is case insensitive, but it helps for readability – good style.

To select data from multiple fields, include multiple fields as a comma-separated list right after SELECT:

SELECT year, month, day
FROM animals;

The line break before FROM is also good form, particularly as the length of the query grows.

Or select all of the columns in a table using a wildcard:

SELECT *
FROM animals;

Limit

We can use the LIMIT statement to select only the first few rows. This is particularly helpful when getting a feel for very large tables.

SELECT year, species_id
FROM animals
LIMIT 10;

Unique values

If we want only the unique values so that we can quickly see what species have been sampled we use DISTINCT

SELECT DISTINCT species_id
FROM animals;

If we select more than one column, then the distinct pairs of values are returned

SELECT DISTINCT year, species_id
FROM animals;

Calculated values

We can also do calculations with the values in a query. For example, if we wanted to look at the mass of each individual on different dates, but we needed it in kg instead of g we would use

SELECT year, month, day, weight / 1000.0
FROM animals;

The expression weight / 1000.0 is evaluated for each row and appended to that row, in a new column.

Expressions can use any fields, any arithmetic operators (+ - * /) and a variety of built-in functions. For example, we could round the values to make them easier to read.

SELECT plot_id, species_id, sex, weight, ROUND(weight / 1000.0, 2)
FROM animals;

The underlying data in the wgt column of the table does not change. The query, which exists separately from the data, simply displays the calculation we requested in the query result window pane.

You can assign the new column a name by typing “AS weight_kg” after the expression

SELECT plot_id, species_id, sex, weight, ROUND(weight / 1000.0, 2) AS weight_kg
FROM animals;

Exercise 3

Write a query that returns the year, month, day, species id, and weight in mg of each animal.

Filtering

Databases can also filter data – selecting only those records meeting certain criteria. For example, let’s say we only want data for the species “Dipodomys merriami”, which has a species code of “DM”. We need to add a WHERE clause to our query:

SELECT *
FROM animals
WHERE species_id = 'DM';

We can do the same thing with numbers. Here, we only want the data since 2000:

SELECT *
FROM animals
WHERE year >= 2000;

More sophisticated conditions arise from combining tests with AND and OR. For example, suppose we want the data on Dipodomys merriami starting in the year 2000:

SELECT *
FROM animals
WHERE year >= 2000 AND species_id = 'DM';

Parentheses can be used to help with readability and to ensure that AND and OR are combined in the way that we intend. If we wanted to get all the animals for “DM” since 2000 or up to 1990 we could combine the tests using OR:

SELECT *
FROM animals
WHERE (year >= 2000 OR year <= 1990) AND species_id = 'DM';

Top of Section


Normalized data is Tidy

Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to define logical “observational units” and minimize data redundency.

For example, the genus and species names are not attributes of an animal: they are attributes of the species attributed to an animal. Data about a species belongs in a different observational unit from data about the animal captured in a survey.

With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.

Question
Currently, plots is pretty sparse. What other kind of data might go into plots?
Answer
Additional properties, such as location, that do not change between surveys.

Un-tidy data with JOINs

Record, store, and read data in the most normalized form possible. Only un-tidy your tables as needed for particular analyses. The SQL “JOIN” clause lets you create records with fields from multiple tables.

For example, to carry out a regression of animal weight against plot treatment using the R command:

lm(weight ~ treatment, data = portal)

you’ll need a data.frame called portal with rows for each animal that also includes a “treatment” inferred from “plot_id”.

Now suppose you want to account for genus in this regression, using the R command:

lm(weight ~ genus + treatment, data = portal)

You’ll need an additional column for genus in the portal data.frame inferred from the “species_id” and species table.

Relations

There are two kinds of relations – schemas that use primary and foreign key references – that permit table joins:

One-To-Many Relationship

The primary key in the first table is referred to multiple times in the foreign key of the second table.

Exercise 4

Examine the primary and foreign keys on the three tables by checking the “Columns” and “Constraints” tabs. Identify all the one-to-many relationships.

The SQL keyword “JOIN” matches up two tables in the way dictated by the constraint following “ON”, duplicating records as necessary.

SELECT weight, treatment
FROM animals
JOIN plots ON animals.plot_id = plots.id;

The resulting table could be the basis for the portal data.frame needed in the R command lm(weight ~ treatment, data = portal).

Many-To-Many Relationship

Each primary key from the first table may relate to any number of primary keys from the second table, and vice versa. eachother any number of times. A many-to-many relationship is induced by the existance of an “association table” involved in two one-to-many relations.

Animals is an “association table” because it includes two foreign keys.

SELECT weight, genus, treatment
FROM animals
JOIN plots ON animals.plot_id = plots.id
JOIN species ON animals.species_id = species.id;

The resulting table could be the basis for the portal data.frame needed in the R command lm(weight ~ genus + treatment, data = portal).

Exercise 5

In a previous exercise, you created an observers table. Use a CREATE TABLE statement to make a new association table establishing a many-to-many relation between animals and observers. The resulting table should allow you to record information on which observer(s) caught and measured each animal. (The result will then also record which animals were handled by each observer.)

Hint: Examine the “Script” tab for the animals table to see the keyword “REFERENCES” in action.

Top of Section


Summary

Databases are a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.

The ability to precisely define keys and data types is the primary database feature that guaranties reliability. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs in your code.

The third major feature to motivate databae use, scaleability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful relational database management system (RDBMS), such as PostgreSQL.

Top of Section


Solutions

Solution 1

CREATE TABLE public.species (
   id   text   NOT NULL   PRIMARY KEY,
   genus   text NOT NULL,
   species   text,
   taxa   text
);

Return

Solution 2

CREATE TABLE observers (
    id integer PRIMARY KEY,
    name text
);

Return

Solution 3

SELECT year, month, day, species_id, ROUND(weight * 1000) AS weight_mg
FROM species;

Return

Solution 4

First Table Second Table
plots(id) animals(plot_id)
species(id) animals(species_id)

Return

Solution 5

CREATE TABLE captures (
  observer_id integer REFERENCES observers(id),
  animal_id integer REFERENCES animals(id)
);

Return

Top of Section