Database Principles and Use
Lesson 3 with Benoît Parmentier
Contents
- What good is a database?
- The Portal Project
- Characteristics of a Database: Part I
- Table definition
- Characteristics of a Database: Part II
- Structured Query Language (SQL)
- Normalized data is Tidy
- Summary
- Solutions
What good is a database?
For a team of researchers implementing a collaborative workflow, the top reasons to use a database are:
- concurrency - Multiple users can safely read and edit database entries simultaneously.
- reliability - Relational databases formalize and can enforce concepts of “tidy” data.
- scalability - Very large tables can be read, searched, and merged quickly.
Objectives
- Understand how a database differs from a data file
- Discover relational databases
- Meet Structured Query Language (SQL)
- Recognize the value of typed data
Specific Achievements
- Access a database on a PostgreSQL server
- Create a table and view table definitions
- Insert records one at a time into a table
- Check primary and foreign key constraints
The Portal Project
The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate.
The research site consists of many plots – patches of the Arizona desert that are intensively manipulated and repeatedly surveyed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.
The plots have a lot of dynamic characteristics, and those changes are recorded in repeated surveys. In particular, the animals captured during each survey are identified to species, weighed, and measured.
Data from the Portal project is recorded in a relational database designed for reliable storage & rapid access to the bounty of information produced by this long-term ecological experiment. This lesson uses the real data, that has been analyzed in over 100 publications.
The data has been simplified just a little bit for the workshop, but you can download the full dataset and work with it using exactly the same tools we learn today.
The three key tables in the relational database are:
- plots
- animals
- species
Characteristics of a Database: Part I
Most database terminology is the same as that used for characterizing data files. The examples below refer to the Portal mammals database.
- Field
- The smallest unit of information, each having a label and holding a value of the same type.
- e.g. The day on which a plot was surveyed.
- Record
- A collection of related values, from different fields, that all describe the same entity.
- e.g. The species, sex, weight and size of a small mammal captured during a survey.
- Table
- A collection of records, each one uniquely identified by the value of a key field.
- e.g. All records for small mammals observed during any survey.
These characteristics could equally be applied to a spreadsheet, one that can be curated in a text file (e.g. a CSV file) as well as specialized formats (e.g. a MS Excel file).
Limitations of Data Files
- Collaboration
- Email copies among collaborators, store in the cloud (sync issues) or share on a network (user collision).
- Size
- Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
- Quality
- Enforcing restrictions on values is not the default behaviour of software that writes data files.
- Extensions
- Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
- API
- No standard approach for different application to read, edit or create records.
Database Solutions
- Collaboration
- A database accepts simultaneous users, which is most beneficial when the database is hosted on a server. There are never multiple copies of the data (aside from backups!)
- Size
- The database management system only reads requested parts of the data into memory. There are no size limits.
- Quality
- Data types are enforced by default. No database lets you “accidentally” enter letters (e.g. “N.A.”) in a field for integers.
- Extension
- Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
- API
- There are packages native to every programming language that ease reading and writing to databases.
The Database Server
A relational database management system (RDBMS) that resides on a server is ideal for collaborative, data-driven projects. Communication is therefore accross a network: it involves a server application and a client application.
We are going to look at the Portal mammals data on a server running PostgreSQL, which is an open-source database management system. The client we will use to communicate with the server is a browser-based application called PostgreSQL Studio, which is just one of very many clients that connects to a PostgreSQL database.
Open PostgreSQL Studio and login with the connection parameters:
A more typical database host would be a URL or IP address. Using “localhost” is unusual in that the server is not typically on the same system as the client.
The “Data” tab provides a quick preview of some records. The “Columns” and “Scripts” tabs provide detailed information on how data types are set on the different fields.
- Columns
- Most are integer or text data, and “nullable” fields allow missing data (regardless of data type).
- Scripts
- The Structured Query Language (SQL) statement that defines each table.
Exercise 1
Study the “Columns” and “Script” tabs for the species table. Pick a field that is currently allowed to be null, and guess how you would write a script beginning with CREATE TABLE species
to prohibit null values in the database.
Table definition
Table definition uses PostgreSQL “CREATE TABLE” statements, as shown in the “Scripts” tab of PGStudio.
Simplified a bit, the documentation for “CREATE TABLE” follows the pattern:
CREATE TABLE [%schema%.]%table% (
%column_name% %data_type% [%option% ...],
...
);
Exercise 2
Write and execute a CREATE TABLE
statement for a new “observers” table with fields for “id” and “name”. Check that the new table exists (you may have to refresh the list).
Insert
The “INSERT” statement provides one way to populate a table with records:
INSERT INTO observers
VALUES (1, 'Alice'), (2, 'Bob');
From the database server’s perspective, tables are populated with some form of “INSERT” statement. There are many mechanisms that database clients devise to facilitate importing large data files. PGStudio, for example, provides a CSV file upload link on the “Data” tab. A reproducible pipeline that moves data from flat files into a database, however, should include imporation in a script – and this depends on the data files.
Characteristics of a Database: Part II
Returning to the big picture, some database terminology doesn’t apply to data files at all. Databases include multiple tables, just as MS Excel includes a multiple spreadsheets in a workbook. The collection of tables in a relational database, however, can be structured by built-in relationships between records from different tables. Data is assembled in the correct arrangement and “just in time” for analysis using scripted queries.
- Primary key
- One or more fields (but usually one) that uniquely identify a record in a table.
- Foreign key
- A primary key from one table used in different table to establish a relationship.
- Query
- A temporary table that subsets and/or merges tables according to query parameters.
Primary key
In the plots
table, id
is the primary key. Any new record cannot duplicate an existing id.
id | treatment |
---|---|
1 | Control |
2 | Rodent Exclosure |
3 | Control |
- Question
- What would be the primary key in the animals table. How do you know?
- Answer
- The primary key is
id
, as shown in the table definition script.
Primary keys are checked before duplicates end up in the data:
INSERT INTO plots (id, treatment)
VALUES (1, 'Control');
Foreign key
A field may also be designated as a foreign key, which establishes a relationship between tables. A foreign key points to some primary key from a different table.
The “Constraints” tab shows two foreign keys associated with the “animals” table.
- The field
animals(plot_id)
points to the fieldplots(id)
- The field
animals(species_id)
points to the fieldspecies(id)
In the animals
table, id
is the primary key and both plot_id
and species_id
are foreign keys.
id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
---|---|---|---|---|---|---|---|---|
1 | 7 | 16 | 1977 | 2 | ST | M | 32 | 0.45 |
2 | 7 | 16 | 1977 | 2 | PX | M | 33 | 0.23 |
3 | 7 | 16 | 1978 | 1 | RO | F | 14 | 1.23 |
Foreign keys are checked before nonsensical references end up in the data:
INSERT INTO animals (id, plot_id)
VALUES (35550, 789);
Query
The “Data” tab is an example of a query. It is a temporary table generated on-demand by the RDBMS that takes a subset of the first 25 records from a particular table and displays all columns.
Structured Query Language (SQL)
SQL is a high-level language for interacting with relational databases. Commands use intuitive English words but can be strung together and nested in powerful ways.
We have used the “CREATE TABLE” and “INSERT” statements, but most SQL is written to perform “SELECT” statements.
Basic queries
Let’s write a SQL query that selects only the year column from the animals table. Select the button, paste in and run the following script.
SELECT year FROM animals;
A note on style: we have capitalized the words SELECT and FROM because they are SQL keywords. Unlike R, SQL is case insensitive, but it helps for readability – good style.
To select data from multiple fields, include multiple fields as a comma-separated list right after SELECT:
SELECT year, month, day
FROM animals;
The line break before FROM
is also good form, particularly as the length of the query grows.
Or select all of the columns in a table using a wildcard:
SELECT *
FROM animals;
Limit
We can use the LIMIT statement to select only the first few rows. This is particularly helpful when getting a feel for very large tables.
SELECT year, species_id
FROM animals
LIMIT 10;
Unique values
If we want only the unique values so that we can quickly see what species have
been sampled we use DISTINCT
SELECT DISTINCT species_id
FROM animals;
If we select more than one column, then the distinct pairs of values are returned
SELECT DISTINCT year, species_id
FROM animals;
Calculated values
We can also do calculations with the values in a query. For example, if we wanted to look at the mass of each individual on different dates, but we needed it in kg instead of g we would use
SELECT year, month, day, weight / 1000.0
FROM animals;
The expression weight / 1000.0
is evaluated for each row
and appended to that row, in a new column.
Expressions can use any fields, any arithmetic operators (+ - * /) and a variety of built-in functions. For example, we could round the values to make them easier to read.
SELECT plot_id, species_id, sex, weight, ROUND(weight / 1000.0, 2)
FROM animals;
The underlying data in the wgt column of the table does not change. The query, which exists separately from the data, simply displays the calculation we requested in the query result window pane.
You can assign the new column a name by typing “AS weight_kg” after the expression
SELECT plot_id, species_id, sex, weight, ROUND(weight / 1000.0, 2) AS weight_kg
FROM animals;
Exercise 3
Write a query that returns the year, month, day, species id, and weight in mg of each animal.
Filtering
Databases can also filter data – selecting only those records meeting certain criteria. For example, let’s say we only want data for the species “Dipodomys merriami”, which has a species code of “DM”. We need to add a WHERE clause to our query:
SELECT *
FROM animals
WHERE species_id = 'DM';
We can do the same thing with numbers. Here, we only want the data since 2000:
SELECT *
FROM animals
WHERE year >= 2000;
More sophisticated conditions arise from combining tests with AND and OR. For example, suppose we want the data on Dipodomys merriami starting in the year 2000:
SELECT *
FROM animals
WHERE year >= 2000 AND species_id = 'DM';
Parentheses can be used to help with readability and to ensure that AND and OR are combined in the way that we intend. If we wanted to get all the animals for “DM” since 2000 or up to 1990 we could combine the tests using OR:
SELECT *
FROM animals
WHERE (year >= 2000 OR year <= 1990) AND species_id = 'DM';
Normalized data is Tidy
Proper use of table relationships is a challenging part of database design. The objective is normalization, or taking steps to define logical “observational units” and minimize data redundency.
For example, the genus and species names are not attributes of an animal: they are attributes of the species attributed to an animal. Data about a species belongs in a different observational unit from data about the animal captured in a survey.
With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.
- Question
- Currently,
plots
is pretty sparse. What other kind of data might go into plots? - Answer
- Additional properties, such as location, that do not change between surveys.
Un-tidy data with JOINs
Record, store, and read data in the most normalized form possible. Only un-tidy your tables as needed for particular analyses. The SQL “JOIN” clause lets you create records with fields from multiple tables.
For example, to carry out a regression of animal weight against plot treatment using the R command:
lm(weight ~ treatment, data = portal)
you’ll need a data.frame called portal
with rows for each animal that also includes a “treatment” inferred from “plot_id”.
Now suppose you want to account for genus in this regression, using the R command:
lm(weight ~ genus + treatment, data = portal)
You’ll need an additional column for genus in the portal
data.frame inferred from the “species_id” and species table.
Relations
There are two kinds of relations – schemas that use primary and foreign key references – that permit table joins:
- One-To-Many
- Many-To-Many
One-To-Many Relationship
The primary key in the first table is referred to multiple times in the foreign key of the second table.
Exercise 4
Examine the primary and foreign keys on the three tables by checking the “Columns” and “Constraints” tabs. Identify all the one-to-many relationships.
The SQL keyword “JOIN” matches up two tables in the way dictated by the constraint following “ON”, duplicating records as necessary.
SELECT weight, treatment
FROM animals
JOIN plots ON animals.plot_id = plots.id;
The resulting table could be the basis for the portal
data.frame needed in the R command lm(weight ~ treatment, data = portal)
.
Many-To-Many Relationship
Each primary key from the first table may relate to any number of primary keys from the second table, and vice versa. eachother any number of times. A many-to-many relationship is induced by the existance of an “association table” involved in two one-to-many relations.
Animals is an “association table” because it includes two foreign keys.
SELECT weight, genus, treatment
FROM animals
JOIN plots ON animals.plot_id = plots.id
JOIN species ON animals.species_id = species.id;
The resulting table could be the basis for the portal
data.frame needed in the R command lm(weight ~ genus + treatment, data = portal)
.
Exercise 5
In a previous exercise, you created an observers table. Use a CREATE TABLE
statement to make a new association table establishing a many-to-many relation between animals and observers. The resulting table should allow you to record information on which observer(s) caught and measured each animal. (The result will then also record which animals were handled by each observer.)
Hint: Examine the “Script” tab for the animals table to see the keyword “REFERENCES” in action.
Summary
Databases are a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.
The ability to precisely define keys and data types is the primary database feature that guaranties reliability. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs in your code.
The third major feature to motivate databae use, scaleability, remains for you to discover. Very large tables can be queried, sorted and combined quickly when the work is done by a powerful relational database management system (RDBMS), such as PostgreSQL.
Solutions
Solution 1
CREATE TABLE public.species (
id text NOT NULL PRIMARY KEY,
genus text NOT NULL,
species text,
taxa text
);
Solution 2
CREATE TABLE observers (
id integer PRIMARY KEY,
name text
);
Solution 3
SELECT year, month, day, species_id, ROUND(weight * 1000) AS weight_mg
FROM species;
Solution 4
First Table | Second Table |
plots(id) | animals(plot_id) |
species(id) | animals(species_id) |
Solution 5
CREATE TABLE captures (
observer_id integer REFERENCES observers(id),
animal_id integer REFERENCES animals(id)
);