Data Storage & Access for All

Instructor: Mary Shelley

Why use a database? The laundry list of reasons a computer scientist might launch into boils down to three features: concurrency, reliability, and scalability. For a team of researchers implementing a collaborative workflow, it’s concurrency – allowing multiple users, or programs, to interact with the data simultaneously – that answers the question once and for all.

Top of Section


The Portal Project

Acknowledgement: Photos from portalproject.wordpress.com.

The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate. A portion of the data has become widely distributed among instructors seeking a real-world dataset on which trainees can hone their data skills.

The research site consists of many “plots” – patches of the Arizona desert that become intensively manipulated and observed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.

The plots have a lot of dynamic characteristics too, and are regularly surveyed with a particular focus on their changing species compostion. Every survey of each plot produces a distinct set of observations to be recorded in a database. A well-structured database has been designed to store & access the bounty of information produced by this long-term ecological experiment.

Top of Section


Objectives for this Lesson

Specific Achievements

Top of Section


Characteristics of a Database I

Let’s begin with defining some terminology that is familiar from any system for data storage. For concreteness, we will use the case study just introduced to provide examples.

Field
The smallest unit of information, each having a label and holding a value of the same type.
e.g. The time of the survey in which a small mammal was captured.
Record
A collection of related values, from different fields, that all describe the same entity.
e.g. The species, sex, size and location of a small mammal observed during a given survey.
Table
A collection of records, each one uniquely identified by the value of one or more fields.
e.g. All records of small mammals that were observed in any survey.

Stopping here, we may as well be describing a spreadsheet that can be curated in a data file (e.g. a CSV file) and read by almost every program for data applications.

Limitations of Data Files

Collaboration
Emailing copies among collaborators. Anybody see a problem here?
Size
Reading an entire data file into memory isn’t scaleable. Some file types (e.g. MS Excel files) have size limits.
Quality
Enforcing restrictions on values is not the default behaviour of software that writes data files.
Extensions
Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
API
Binary or ASCII string manipulation? Yikes.

Database Solutions

Collaboration
A database accepts simultaneous users, which is most beneficial when the database is hosted on a network. There are never multiple copies of the data (aside from your backups!)
Size
The database management system only reads requested parts of the data into memory. There is no size limit!
Quality
Data types are enforced by default. No database lets you enter “9I1” in an integer field.
Extension
Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
API
There are packages native to every programming language that ease reading and writing to databases.

Top of Section


Characteristics of a Database II

Returning to terminology, some database concepts don’t apply to data files at all. Databases include a collection of tables. You might call a mere collection of tables a data wharehouse – or an IKEA. The collection of tables in a relational database is structured by relationships between records from different tables. These relations are specified through primary and foreign keys, but which is which depends on context.

Primary Key
One or more fields (but usually one) that uniquely identify a record in a table.
e.g. plot_id in the plots table, or more succinctly plots(plot_id)
Foreign Key
A primary key from table A used in table B to express their relationship.
e.g. plot_id in the surveys table, or surveys(plot_id)

In the plots table, plot_id is the primary key.

plot_id plot_type
1 Control
2 Rodent Exclosure
3 Control

In the surveys table, record_id is the primary key and plot_id is a foreign key.

record_id year month day plot_id species_id sex weight
1 1977 11 5 2 ST M 0.45  
2 1977 11 5 2 PX M 0.23  
3 1978 1 23 1 RO F 1.23  

One-To-Many Relationship

Many-To-Many Relationship

Proper use of table relationships is a great challenge in database design. The guiding principle is normalization, or taking steps to minimize data redundency.

With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.

Question
Based on your first look at the plots and surveys tables, does it look possible to further normalize this database?

Top of Section


Communicating with a Database

A database management system that resides on a network is ideal for collaborative, data-driven projects. Communication is therefore designed for networks: it involves a server application and a client application.

We are going to look at the Portal Mammals data on a server running PostgreSQL, which is an open-source database management system. The client we will use to communicate with the server is a browser-based application called PostgreSQL Studio, which is just one of very many clients that connects to a PostgreSQL database.

Point a browser to pgstudio.research.sesync.org and login.

Exercise
Examine the primary and foreign keys on the three tables. What columns make up the primary keys for each table? Are there many-to-one relationships? Are there many-to-many relationships?
Hint #1
The “Columns” tab has a field labeled “PK”.
Hint #2
The “Constraints” tab shows different types of constraints that govern a table.

Reviews came back from the manuscript you submitted to Nature on the portal mammals project. The always humble Reviewer #2 suspects bias introduced by the surveyor and recommends controling for the identity of the observer that conducted each survey.

The “Script” tab tells us about an existing table by showing a command that would create an empty table with the same fields.

CREATE TABLE public.plots (
    plot_id integer NOT NULL PRIMARY KEY,
    plot_type text
)

The command is written in Structured Query Language (SQL), which is a language all database management systems support for database communication. In the next steps, you will use SQL to alter the portal database in response to the Reviewer.

Select the button and run the following two scripts. The first script creates a new table, the second script adds a relationship to the surveys table.

CREATE TABLE surveyor (
    person_id serial PRIMARY KEY,
    first_name text,
    last_name text
);
ALTER TABLE surveys
    ADD COLUMN person_id integer REFERENCES surveyor(person_id);

All of you? Yes, go ahead and try it.

In the plots table we see integer as the data type of the primary key. We used serial, a simple extension to the integer data type. A serial value is an integer sequence that populates a record automatically.

INSERT INTO surveyor (first_name, last_name)
    VALUES ('%your_first_name%', '%your_last_name%');
Question
What’s your person_id? {.fragment}

The foreign key on surveys.person_id exists, now we need to populate the column with data. Choose a record_id you will use as %record_id% in the next command. This is a survey on which you claim to be the surveyor – let’s update the data to reflect your work!

UPDATE surveys SET person_id = "%your_person_id%"
WHERE record_id = "%record_id%";

To view the result of your update, and everyone elses, run the following query on the database. Two new SQL phrases come into play here: SELECT ... FROM and JOIN.

SELECT record_id, surveyor.* FROM surveys
JOIN surveyor USING(person_id);
Exercise
How would the solution be different if there were multiple surveyors for each survey? Pseudocode a couple database commands that would create tables to address the situation with multiple surveyors for each survey.

Top of Section


Data Types

A key tactic in data quality control is precise (and enforceable!) definitions of the values allowed for each field. We just saw integer, text and the slightly more complicated serial.

Some of the essential data types, specifically for a PostgreSQL database.

Type Description
integer whole numbers between -2,147,483,648 and 2,147,483,647
double Inexact, variable-precision numeric value that can represent almost any number, and several non-numbers
decimal Exact numeric value with user-specified precision (decimal places)
char Fixed-length character string
varchar Variable-length (up to a limit) character string
boolean True or False
date day with no time specification

Data Granularity

The degree to which information is broken up into diferent fields is called granularity, and it’s an important consideration in database design. On one hand, because it is easier to aggregate fields than disaggregate them down the road, initially erring on the side of too granular is best. On the other hand, some consideration of how each field will get used can lead to less granularity.

Question
In the surveys table, the date is broken into three integer fields. What are some advantages and disadvantages of such fine granularity over using a single field of type date?

Extensions

The most complicated data types encountered so far are serial and date, but these come built-in to the standard PostgreSQL database management system (and many others). A database feature noted earlier is the ability to extend the available data types, wither by installing plugins or writing your own.

A popular extension for PostgreSQL is PostGIS, which adds spatially aware data types and functions.

Data type Description
Geometry Planar/Cartesian coordinates and calculations
Geography Spheroidal geodetic objects, drawn on the earth’s curved surface.
Raster Space as a grid of rectangular cells, associated with a numeric array/bands.
Topology Models spatial objects as a network of connected nodes, edges, and faces.

PostGIS functions allow you to query your database on spatial relations, which act sort of like foreign key relationships. Jumping way ahead of ourselves, let’s see an example of what PostGIS can do.

Log out of the portal database, and log in to the database postgis_in_action then select ch01 from the Schema drop-down menu.

Question
Look at the different tables and find the fields with spatial data types. What type does restaurants have, and what type does highways have?

The following query will list the restaurant franchises in Maryland that are within 500 meters of a highway.

SELECT franchise, name
FROM ch01.restaurants
JOIN ch01.highways on ST_DWithin(restaurants.geom, highways.geom, 500)
WHERE state = 'MD';
Exercise
PostGIS includes spatially aware function, such as ST_DWithin which tests distance relationships. The function ST_AsText() will convert a geometry into human readable text, and the command ST_Transform(geom, 4326) will transform geom to geographical coordinates (i.e. latitude and longitude). Modify the SQL so that it prints an additional field that gives you coordinates (rather than the name of the highway) of the franchises in Maryland.

Learn more about PostGIS and this database from “PostGIS in Action” by Regina O. Obe and Leo S. Hsu.

Top of Section


Summary

As we have seen, a database will be a core element of a centralized workflow, accomodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.

The ability to precisely define data types if the primary database feature that guaranties reliability in your dataset. As you develop scripts for analysis and vizualization, certainty that you’ll never encounter a “NaN” when you expect an Integer will prevent, or help you catch, bugs and errors down the road.

The third major feature to motivate databae use, scaleability, remains for you to discover. Did you bring a large dataset? Our coaches will help you load it into the PostgreSQL server, where you will be able to witness the speed with which very large tables can be queried, sorted and combined as you work on your “data2doc” project this week.

Top of Section