Data Storage & Access for All
Instructor: Mary Shelley
Why use a database? The laundry list of reasons a computer scientist might launch into boils down to three features: concurrency, reliability, and scalability. For a team of researchers implementing a collaborative workflow, it’s concurrency – allowing multiple users or programs to interact with the data simultaneously – that settles the question.
The Portal Project
Acknowledgement: Photos from portalproject.wordpress.com.
The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been a primary focus of research on interactions among rodents, ants and plants and their respective responses to climate. A portion of the data has become widely distributed among instructors seeking a real-world dataset on which trainees can hone their data skills.
The research site consists of many “plots” – patches of the Arizona desert that are intensively manipulated and observed. The plots have some fixed characteristics, such as the type of manipulation, geographic location, aspect, etc.
The plots have many dynamic characteristics too, and are regularly surveyed with a particular focus on their changing species composition. Every survey of each plot produces a distinct set of observations to be recorded in a database. A well-structured database has been designed to store & access the bounty of information produced by this long-term ecological experiment.
Objectives for this Lesson
- Understand how a database differs from a data file
- Discover relational databases
- Meet Structured Query Language (SQL)
- Recognize the value of typed data
- Glimpse spatial extensions in action
Specific Achievements
- Access a database on a PostgreSQL server
- Create a table and view table definitions
- Insert records one at a time into a table
- Alter the definition of an existing table
Characteristics of a Database I
Let’s begin with defining some terminology that is familiar from any system for data storage. For concreteness, we will use the case study just introduced to provide examples.
- Field
- The smallest unit of information, each having a label and holding a value of the same type.
- e.g. The time of the survey in which a small mammal was captured.
- Record
- A collection of related values, from different fields, that all describe the same entity.
- e.g. The species, sex, size and location of a small mammal observed during a given survey.
- Table
- A collection of records, each one uniquely identified by the value of one or more fields.
- e.g. All records of small mammals that were observed in any survey.
Stopping here, we may as well be describing a spreadsheet that can be curated in a data file (e.g. a CSV file) and read by almost every program for data applications.
Limitations of Data Files
- Collaboration
- Emailing copies among collaborators. Anybody see a problem here?
- Size
- Reading an entire data file into memory isn’t scalable. Some file types (e.g. MS Excel files) have size limits.
- Quality
- Enforcing restrictions on values is not the default behaviour of software that writes data files.
- Extensions
- Specialized files are needed for non-traditional data types (e.g. ESRI Shapefiles).
- API
- Binary or ASCII string manipulation? Yikes.
Database Solutions
- Collaboration
- A database accepts simultaneous users, which is most beneficial when the database is hosted on a network. There are never multiple copies of the data (aside from your backups!).
- Size
- The database management system only reads requested parts of the data into memory. There is no size limit!
- Quality
- Data types are enforced by default. No database lets you enter “9I1” in an integer field.
- Extension
- Special packages are needed for non-traditional data types, but they can live in the same table as traditional ones.
- API
- There are packages native to every programming language that ease reading and writing to databases.
Characteristics of a Database II
Returning to terminology, some database concepts don’t apply to data files at all. Databases include a collection of tables. You might call a mere collection of tables a data warehouse – or an IKEA. The collection of tables in a relational database is structured by relationships between records from different tables. These relations are specified through primary and foreign keys, but which is which depends on context.
- Primary Key
- One or more fields (but usually one) that uniquely identify a record in a table.
- e.g. `plot_id` in the `plots` table, or more succinctly `plots(plot_id)`
- Foreign Key
- A primary key from table A used in table B to express their relationship.
- e.g. `plot_id` in the `surveys` table, or `surveys(plot_id)`
In the `plots` table, `plot_id` is the primary key.
| plot_id | plot_type |
|---|---|
| 1 | Control |
| 2 | Rodent Exclosure |
| 3 | Control |
In the `surveys` table, `record_id` is the primary key and `plot_id` is a foreign key.
| record_id | year | month | day | plot_id | species_id | sex | weight | … |
|---|---|---|---|---|---|---|---|---|
| 1 | 1977 | 11 | 5 | 2 | ST | M | 0.45 | |
| 2 | 1977 | 11 | 5 | 2 | PX | M | 0.23 | |
| 3 | 1978 | 1 | 23 | 1 | RO | F | 1.23 | |
One-To-Many Relationship
Many-To-Many Relationship
Proper use of table relationships is a great challenge in database design. The guiding principle is normalization, or taking steps to minimize data redundancy.
With an ideal database design, any value discovered to be erroneous should only have to be corrected in one record in one table.
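To make the principle concrete, notice how the foreign key keeps these data normalized: `plot_type` lives only in the `plots` table, so a correction touches one record, yet every related survey sees the fix. Here is a minimal sketch (a hypothetical correction for illustration, not a change to run against the shared server):

```sql
-- plot_type is stored once per plot, not once per survey.
-- A hypothetical correction therefore touches a single record:
UPDATE plots SET plot_type = 'Rodent Exclosure' WHERE plot_id = 3;

-- Every survey on plot 3 now reflects the corrected type, because
-- surveys reach plot_type through the plot_id foreign key:
SELECT record_id, plot_type
FROM surveys
JOIN plots USING (plot_id)
WHERE plot_id = 3;
```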
- Question
- Based on your first look at the `plots` and `surveys` tables, does it look possible to further normalize this database?
Communicating with a Database
A database management system that resides on a network is ideal for collaborative, data-driven projects. Communication is therefore designed for networks: it involves a server application and a client application.
We are going to look at the Portal Mammals data on a server running PostgreSQL, an open-source database management system. The client we will use to communicate with the server is a browser-based application called PostgreSQL Studio, just one of many clients that can connect to a PostgreSQL database.
Point a browser to `pgstudio.research.sesync.org` and log in.
- Exercise
- Examine the primary and foreign keys on the three tables. What columns make up the primary keys for each table? Are there many-to-one relationships? Are there many-to-many relationships?
- Hint #1
- The “Columns” tab has a field labeled “PK”.
- Hint #2
- The “Constraints” tab shows different types of constraints that govern a table.
Reviews came back from the manuscript you submitted to Nature on the portal mammals project. The always humble Reviewer #2 suspects bias introduced by the surveyor and recommends controlling for the identity of the observer who conducted each survey.
The “Script” tab tells us about an existing table by showing a command that would create an empty table with the same fields.
```sql
CREATE TABLE public.plots (
    plot_id integer NOT NULL PRIMARY KEY,
    plot_type text
);
```
The command is written in Structured Query Language (SQL), which is a language all database management systems support for database communication. In the next steps, you will use SQL to alter the portal database in response to the Reviewer.
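The same definition can also be inspected with SQL itself. Here is a minimal sketch using the standard `information_schema` catalog, which PostgreSQL and most other database management systems provide:

```sql
-- List each column of the plots table together with its declared
-- data type, by querying the standard information_schema catalog.
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'plots';
```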
Select the button and run the following two scripts.
The first script creates a new table; the second adds a relationship to the `surveys` table.
```sql
CREATE TABLE surveyor (
    person_id serial PRIMARY KEY,
    first_name text,
    last_name text
);
```

```sql
ALTER TABLE surveys
    ADD COLUMN person_id integer REFERENCES surveyor (person_id);
```
All of you? Yes, go ahead and try it.
In the `plots` table we saw `integer` as the data type of the primary key. Here we used `serial`, a simple extension of the `integer` data type. A `serial` value is an integer sequence that populates a record automatically.
```sql
INSERT INTO surveyor (first_name, last_name)
VALUES ('%your_first_name%', '%your_last_name%');
```
- Question
- What’s your `person_id`?
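One way to answer without hunting through the table: PostgreSQL’s `RETURNING` clause reports the generated key as part of the insert itself. A sketch, with the same placeholder names as above:

```sql
-- RETURNING makes the INSERT report the serial value it generated,
-- so you learn your person_id immediately.
INSERT INTO surveyor (first_name, last_name)
VALUES ('%your_first_name%', '%your_last_name%')
RETURNING person_id;
```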
The foreign key on `surveys.person_id` exists; now we need to populate the column with data. Choose a `record_id` you will use as `%record_id%` in the next command. This is a survey on which you claim to be the surveyor – let’s update the data to reflect your work!
```sql
-- Note: no quotes around the integer values. In PostgreSQL, double
-- quotes denote identifiers (column names), not values.
UPDATE surveys SET person_id = %your_person_id%
WHERE record_id = %record_id%;
```
To view the result of your update, and everyone else’s, run the following query on the database.
Two new SQL phrases come into play here: `SELECT ... FROM` and `JOIN`.

```sql
SELECT record_id, surveyor.* FROM surveys
JOIN surveyor USING (person_id);
```
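A plain `JOIN` only returns surveys that have a surveyor assigned. As a sketch of a common variation, a `LEFT JOIN` keeps every survey and fills the surveyor fields with NULL where no match exists:

```sql
-- LEFT JOIN keeps all rows from surveys; the surveyor columns are
-- NULL for surveys that nobody has claimed yet.
SELECT record_id, surveyor.first_name, surveyor.last_name
FROM surveys
LEFT JOIN surveyor USING (person_id);
```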
- Exercise
- How would the solution differ if there were multiple surveyors for each survey? Pseudocode a couple of database commands that would create tables to handle multiple surveyors per survey.
Data Types
A key tactic in data quality control is precise (and enforceable!) definitions of the values allowed for each field.
We just saw `integer`, `text`, and the slightly more complicated `serial`.
Here are some of the essential data types, specifically for a PostgreSQL database.
| Type | Description |
|---|---|
| integer | whole numbers between -2,147,483,648 and 2,147,483,647 |
| double precision | inexact, variable-precision numeric value that can represent almost any number, and several non-numbers |
| decimal | exact numeric value with user-specified precision (decimal places) |
| char | fixed-length character string |
| varchar | variable-length (up to a limit) character string |
| boolean | true or false |
| date | day with no time specification |
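To see type enforcement in action, here is a hedged sketch using a throwaway table (the table and column names are illustrative, not part of the Portal schema):

```sql
-- A throwaway demonstration table: declared types are enforced
-- on every insert.
CREATE TABLE measurement_demo (
    reading_id serial PRIMARY KEY,
    weight decimal(6, 2),     -- exact, at most 2 decimal places
    flagged boolean,
    measured_on date
);

-- Accepted: every value matches its column's type.
INSERT INTO measurement_demo (weight, flagged, measured_on)
VALUES (12.34, true, '1977-11-05');

-- Rejected with an error: '9I1' is not a valid number, so the
-- server refuses the whole record rather than storing bad data.
-- INSERT INTO measurement_demo (weight) VALUES ('9I1');
```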
Data Granularity
The degree to which information is broken up into different fields is called granularity, and it’s an important consideration in database design. On one hand, because it is easier to aggregate fields than to disaggregate them down the road, initially erring on the side of too granular is best. On the other hand, some consideration of how each field will get used can lead to less granularity.
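As a sketch of the aggregation direction, PostgreSQL’s `make_date` function (available since version 9.4) recombines three integer fields into a single `date` value on the fly, which is one reason erring toward granular storage is relatively safe:

```sql
-- Recombine the granular year, month, and day integer fields of
-- surveys into a single value of type date, at query time.
SELECT record_id, make_date(year, month, day) AS survey_date
FROM surveys;
```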
- Question
- In the surveys table, the date is broken into three integer fields. What are some advantages and disadvantages of such fine granularity over using a single field of type date?
Extensions
The most complicated data types encountered so far are `serial` and `date`, but these come built in to the standard PostgreSQL database management system (and many others).
A database feature noted earlier is the ability to extend the available data types, either by installing plugins or writing your own.
A popular extension for PostgreSQL is PostGIS, which adds spatially aware data types and functions.
| Data type | Description |
|---|---|
| Geometry | planar/Cartesian coordinates and calculations |
| Geography | spheroidal geodetic objects, drawn on the earth’s curved surface |
| Raster | space as a grid of rectangular cells, associated with a numeric array/bands |
| Topology | models spatial objects as a network of connected nodes, edges, and faces |
PostGIS functions allow you to query your database on spatial relations, which act sort of like foreign key relationships. Jumping way ahead of ourselves, let’s see an example of what PostGIS can do.
Log out of the `portal` database, log in to the database `postgis_in_action`, then select `ch01` from the Schema drop-down menu.
- Question
- Look at the different tables and find the fields with spatial data types. What type does `restaurants` have, and what type does `highways` have?
The following query will list the restaurant franchises in Maryland that are within 500 meters of a highway.
```sql
SELECT franchise, name
FROM ch01.restaurants
JOIN ch01.highways ON ST_DWithin(restaurants.geom, highways.geom, 500)
WHERE state = 'MD';
```
- Exercise
- PostGIS includes spatially aware functions, such as `ST_DWithin`, which tests distance relationships. The function `ST_AsText()` will convert a geometry into human-readable text, and `ST_Transform(geom, 4326)` will transform `geom` to geographic coordinates (i.e. latitude and longitude). Modify the SQL so that it prints an additional field giving the coordinates (rather than the name of the highway) of the franchises in Maryland.
Learn more about PostGIS and this database from “PostGIS in Action” by Regina O. Obe and Leo S. Hsu.
Summary
As we have seen, a database will be a core element of a centralized workflow, accommodating simultaneous use by all members of a collaborative team. We have just skimmed the topic of concurrency in database interactions: there is a lot going on under the hood to prevent data corruption.
The ability to precisely define data types is the primary database feature that guarantees reliability in your dataset. As you develop scripts for analysis and visualization, certainty that you’ll never encounter a “NaN” when you expect an integer will prevent, or help you catch, bugs and errors down the road.
The third major feature to motivate database use, scalability, remains for you to discover. Did you bring a large dataset? Our coaches will help you load it into the PostgreSQL server, where you will be able to witness the speed with which very large tables can be queried, sorted, and combined as you work on your “data2doc” project this week.