Online Data
Handouts for this lesson need to be saved on your computer. Download and unzip this material into the directory (a.k.a. folder) where you plan to work.
Lesson Objectives
- Distinguish three types of online data
- Break down how web services use HTTP
- Learn Python tools for data acquisition
Specific Achievements
- Programmatically acquire data embedded in a web page
- Request data through a REST API
- Use the census package to acquire data
- Use SQLite for caching
Why script data acquisition?
- Too time intensive to acquire manually
- Integrate updated or new data
- Reproducibility
- There’s an API between you and the data
Acquiring Online Data
Data is available on the web in many different forms. How difficult is it to acquire that data to run analyses? It depends which of three approaches the data source requires:
- Web scraping
- Web service or API
- API wrapper
Web Scraping 🙁
If a web browser can read HTML and JavaScript and display a human readable page, why can’t you write a program (a “bot”) to read HTML and JavaScript and store the data?
Web Service or API 😉
An Application Programming Interface (API, as opposed to a GUI) that is compatible with passing data around the internet using HTTP (Hypertext Transfer Protocol). This is not the fastest protocol for moving large datasets, but it is universal (it underpins web browsers, after all).
API Wrapper 😂
Major data providers can justify writing a package, specific to your language of choice (e.g. Python or R), that facilitates accessing the data they provide through a web service. Sadly, not all do so.
Requests
That “http” at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol. After all, servers exist to serve.
The requests package provides a simple interface to issuing HTTP requests and handling the response. Here’s an example using an XKCD comic.
import requests
response = requests.get('https://xkcd.com/869')
response
<Response [200]>
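A status code of 200 means the request succeeded. In a script it is worth checking this before trying to parse the content; a minimal sketch using the response object above:

import requests

response = requests.get('https://xkcd.com/869')
# 200 means OK; raise_for_status() raises an exception for 4xx/5xx responses,
# which is safer than silently parsing an error page.
print(response.status_code)
response.raise_for_status()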
The response is still binary. It takes a browser-like parser to translate the raw content into an HTML document. BeautifulSoup does a fair job, while making no attempt to “render” a human-readable page.
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'lxml')
'\n'.join(doc.prettify().splitlines()[0:10])
'<!DOCTYPE html>\n<html>\n <head>\n <link href="/s/7d94e0.css" rel="stylesheet" title="Default" type="text/css"/>\n <title>\n xkcd: Server Attention Span\n </title>\n <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n <link href="/s/919f27.ico" rel="shortcut icon" type="image/x-icon"/>\n <link href="/s/919f27.ico" rel="icon" type="image/x-icon"/>'
Searching the document for desired content is the hard part. This search uses a CSS query to find the image below the section of the document with the attribute id="comic".
img = doc.select('#comic > img')
img
[<img alt="Server Attention Span" src="//imgs.xkcd.com/comics/server_attention_span.png" title="They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"/>]
It is possible to query by CSS for a single element and extract attributes such as the image title.
img = doc.select_one('#comic > img')
img['title']
"They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"
Was that so bad?
Pages designed for humans are increasingly harder to parse programmatically.
- Servers provide different responses based on client “metadata” (see the sketch after this list)
- JavaScript often needs to be executed by the client
- The HTML <table> is drifting into obscurity (mostly for the better)
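For the first point, the “metadata” is carried in HTTP request headers, which a scripted client can set explicitly. A minimal sketch with requests (the User-Agent string here is a made-up example, not a required value):

import requests

# Identify the client explicitly; some servers vary or refuse responses
# depending on the User-Agent header.
headers = {'User-Agent': 'online-data-lesson/0.1 (contact: you@example.org)'}
response = requests.get('https://xkcd.com/869', headers=headers)
print(response.status_code)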
HTML Tables
Sites with easily accessible HTML tables nowadays may be specifically intended to be parsed programmatically, rather than browsed by a human reader. The US Census provides some documentation for their data services in a massive table:
> import pandas as pd
+
+ vars = (
+ pd
+ .read_html('https://api.census.gov/data/2017/acs/acs5/variables.html')
+ .pop()
+ )
+ vars.head()
Name Label ... Group Unnamed: 8
0 AIANHH Geography ... NaN NaN
1 AIHHTL Geography ... NaN NaN
2 AIRES Geography ... NaN NaN
3 ANRC Geography ... NaN NaN
4 B00001_001E Estimate!!Total ... B00001 NaN
[5 rows x 9 columns]
We can use our data manipulation tools to search this unwieldy documentation for variables of interest.
idx = (
vars['Label']
.str
.contains('Median household income')
)
vars.loc[idx, ['Name', 'Label']]
Name Label
11214 B19013_001E Estimate!!Median household income in the past ...
11215 B19013A_001E Estimate!!Median household income in the past ...
11216 B19013B_001E Estimate!!Median household income in the past ...
11217 B19013C_001E Estimate!!Median household income in the past ...
11218 B19013D_001E Estimate!!Median household income in the past ...
11219 B19013E_001E Estimate!!Median household income in the past ...
11220 B19013F_001E Estimate!!Median household income in the past ...
11221 B19013G_001E Estimate!!Median household income in the past ...
11222 B19013H_001E Estimate!!Median household income in the past ...
11223 B19013I_001E Estimate!!Median household income in the past ...
11932 B19049_001E Estimate!!Median household income in the past ...
11933 B19049_002E Estimate!!Median household income in the past ...
11934 B19049_003E Estimate!!Median household income in the past ...
11935 B19049_004E Estimate!!Median household income in the past ...
11936 B19049_005E Estimate!!Median household income in the past ...
19332 B25099_001E Estimate!!Median household income!!Total
19333 B25099_002E Estimate!!Median household income!!Total!!Medi...
19334 B25099_003E Estimate!!Median household income!!Total!!Medi...
19643 B25119_001E Estimate!!Median household income in the past ...
19644 B25119_002E Estimate!!Median household income in the past ...
19645 B25119_003E Estimate!!Median household income in the past ...
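The Label column is truncated in the display above. To read one label in full, you can pull it out of the table directly; a quick sketch for the variable used later in this lesson:

# Print the complete label for the median household income variable
print(vars.loc[vars['Name'] == 'B19013_001E', 'Label'].iloc[0])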
Web Services
The US Census Bureau provides access to its vast stores of demographic data over the Web via their API at https://api.census.gov.
The I in GUI is for interface—it’s the same in API, where buttons and drop-down menus are replaced by functions and object attributes.
Instead of interfacing with a user, this kind of interface is suitable for use in programming another software application. In the case of the Census, the main component of the application is some relational database management system. There are several GUIs designed for humans to query the Census database; the Census API is meant for communication between your program (i.e. script) and their application.
You’ll often see the acronym “REST API.” In this context, REST stands for Representational state transfer. This refers to a set of standards that help ensure that the Web service works well with any computer system it may interact with.
Inspect this URL in your browser: https://api.census.gov/data/2015/acs5?get=NAME&for=county&in=state:*#irrelephant
In a web service, the already universal system for transferring data over the internet, known as HTTP, is half of the interface. All you really need is documentation for how to construct the URL in a standards-compliant way that the service will recognize.
Section | Description
--- | ---
https:// | scheme
api.census.gov | authority, or simply domain if there’s no user authentication
/data/2015/acs5 | path to a resource within a hierarchy
? | beginning of the query component of a URL
get=NAME | first query parameter
& | query parameter separator
for=county | second query parameter
& | query parameter separator
in=state:* | third query parameter
# | beginning of the fragment component of a URL
irrelephant | a document section; it isn’t even sent to the server
path = 'https://api.census.gov/data/2017/acs/acs5'
query = {
'get': 'NAME,B19013_001E',
'for': 'tract:*',
'in': 'state:24',
}
response = requests.get(path, params=query)
response
<Response [200]>
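The requests package assembled the query dictionary into the URL for us, percent-encoding characters as needed. You can inspect the exact URL it sent:

# The fully constructed URL, with the query string appended after '?'
print(response.url)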
Response Header
The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a “header” with information about how to decode the body of the response.
Most REST APIs return as the “content” either:
- JavaScript Object Notation (JSON)
  - a UTF-8 encoded string of key-value pairs, where values may be lists
  - e.g. {'a': 24, 'b': ['x', 'y', 'z']}
- eXtensible Markup Language (XML)
  - a nested <tag></tag> hierarchy serving the same purpose
The header from Census says the content type is JSON.
response.headers['Content-Type']
'application/json;charset=utf-8'
Response Content
Use a JSON reader to extract a Python object. To read it into a pandas DataFrame, use pandas’ read_json.
data = pd.read_json(response.content)
data.head()
0 1 ... 3 4
0 NAME B19013_001E ... county tract
1 Census Tract 105.01, Wicomico County, Maryland 68652 ... 045 010501
2 Census Tract 5010.02, Carroll County, Maryland 75069 ... 013 501002
3 Census Tract 5077.04, Carroll County, Maryland 88306 ... 013 507704
4 Census Tract 5061.02, Carroll County, Maryland 84810 ... 013 506102
[5 rows x 5 columns]
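Notice that the Census API returns the column names as the first row of the JSON array, so row 0 of the DataFrame holds headers rather than data. A small post-processing sketch to promote that row to column names:

# Use the first row as column names, then drop it from the data
data.columns = data.iloc[0]
data = data.drop(0).reset_index(drop=True)
data.head()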
API Keys & Limits
Most servers request good behavior, others enforce it.
- Size of single query
- Rate of queries (calls per second, or per day)
- User credentials specified by an API key
From the Census FAQ What Are the Query Limits?:
You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day… Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address.
Specialized Packages
The third tier of access to online data is much preferred, if it exists: a dedicated package in your programming language’s repository (PyPI or CRAN).
- Additional guidance on query parameters
- Returns data in native formats
- Handles all “encoding” problems
The census package is a user-contributed suite of tools that streamline access to the API.
from census import Census
key = None
c = Census(key, year=2017)
c.acs5
<census.core.ACS5Client object at 0x7efe9cb75b00>
Compared to using the API directly via the requests package:
Pros
- More concise code, quicker development
- Package documentation (if present) is usually more user-friendly than API documentation.
- May allow seamless update if API changes
Cons
- No guarantee of updates
- Possibly limited in scope
Query the Census ACS5 (American Community Survey) for the variable B19013_001E (median annual household income, in dollars) and each entity’s NAME.
The American Community Survey (ACS) is a yearly survey that provides detailed population and housing information at fine geographic scale across the United States. Much of the census package is dedicated to accessing the ACS data. ACS5 refers to a five-year average of the annual surveys.
variables = ('NAME', 'B19013_001E')
This code pulls the variables NAME and B19013_001E from all census tracts and all counties in the state with ID 24 (Maryland). The census package converts the JSON response into a list of Python dictionaries. (No need to check headers.)
response = c.acs5.state_county_tract(
variables,
state_fips='24',
county_fips=Census.ALL,
tract=Census.ALL,
)
response[0]
{'state': '24', 'county': '045', 'tract': '010501', 'NAME': 'Census Tract 105.01, Wicomico County, Maryland', 'B19013_001E': 68652.0}
The pandas DataFrame() constructor will accept the list of dictionaries as the sole argument, taking column names from the “keys”. This code also removes values less than zero.
df = (
pd
.DataFrame(response)
.query("B19013_001E >= 0")
)
The seaborn package provides some nice, quick visualizations. Here we create boxplots showing the income distribution among census tracts within each county in Maryland.
import seaborn as sns
sns.boxplot(
data = df,
x = 'county',
y = 'B19013_001E',
)
Paging & Stashing
A common strategy that web service providers take to balance their load is to limit the number of records a single API request can return. The user ends up having to flip through “pages” with the API, handling the response content at each iteration. Options for stashing data are:
- Store it all in memory, write to file at the end.
- Append each response to a file, writing frequently.
- Offload these decisions to database management software.
The data.gov API provides a case in point. Data.gov is a service provided by the U.S. federal government to make data available from across many government agencies. It hosts a catalog of raw data and of many other APIs from across government. Among the APIs catalogued by data.gov is the FoodData Central API. The U.S. Department of Agriculture maintains a data system of nutrition information for thousands of foods. We might be interested in the relative nutrient content of different fruits.
To repeat the exercise below at home, request an API key at https://api.data.gov/signup/, and store it in a file named api_key.py in your working directory. The file should contain the single line API_KEY = 'your many digit key', with the key written as a quoted string.
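For example, the entire contents of api_key.py would look like this (the value shown is a placeholder, not a real key):

# api_key.py -- keep this file out of version control
API_KEY = 'your-many-digit-key'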
Load the API_KEY variable by importing it from the file you saved it in.
> from api_key import API_KEY
Run an API query for all foods with "fruit" in their name.
import requests
api = 'https://api.nal.usda.gov/fdc/v1/'
path = 'foods/search'
query = {
'api_key':API_KEY,
'query':'fruit',
}
response = (
requests
.get(api + path, params=query)
)
doc = response.json()
Extract data from the returned JSON object, which gets mapped to a Python dictionary called doc. To inspect the return, we can list the dictionary keys.
> list(doc.keys())
['foods', 'foodSearchCriteria', 'totalHits', 'currentPage', 'totalPages']
We can print the value associated with the key totalHits to see how many foods matched our search term, "fruit".
> doc['totalHits']
18801
The reported number of results is much larger than the length of the foods array contained in this response. The query returned only the first page, with 50 items.
> len(doc['foods'])
50
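The response also reports totalPages (one of the keys listed above), i.e. how many pages of this size it would take to retrieve every match:

# Total number of pages available at the current page size
doc['totalPages']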
The following commands prepare Python to connect to a database-in-a-file, and create empty tables in the database if they do not already exist (meaning that it is safe to re-run after you have populated the database).
Step 1: Boilerplate
The SQLAlchemy package has a lot of features, and requires you to be very precise about how to get started.
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
Base = declarative_base()
Step 2: Table Definition
Define the tables that are going to live in the database using Python classes. For each class, its attributes will map to columns in a table. Then create a session engine.
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, Text, Numeric
Base = declarative_base()
class Food(Base):
    __tablename__ = 'food'
    id = Column(Integer, primary_key=True)
    name = Column(Text)
    sugar = Column(Numeric)

engine = create_engine('sqlite:///fruits.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
For each fruit, we’ll store its name and the amount of sugar (grams of sugar per 100 grams of fruit) found in the API response.
> fruit = doc['foods'].pop()
+ fruit['description']
'Fruit peel, candied'
Extract the names and values of the first nine nutrients reported for this item.
> [ (nutrient['nutrientName'], nutrient['value']) for nutrient in fruit['foodNutrients'][:9] ]
[('Protein', 0.34), ('Total lipid (fat)', 0.07), ('Carbohydrate, by difference', 82.74), ('Energy', 322.0), ('Alcohol, ethyl', 0.0), ('Water', 16.7), ('Caffeine', 0.0), ('Theobromine', 0.0), ('Sugars, total including NLEA', 80.68)]
Step 3: Connect (and Initialize)
from schema import Session, Food
session = Session()
engine = session.bind
You could inspect the fruit database now using any sqlite3 client: you would find one empty “food” table with fields “id”, “name”, and “sugar”.
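For instance, a quick check with Python’s built-in sqlite3 module (one such client) confirms the table exists; this assumes the fruits.db file created by schema.py above:

import sqlite3

# List the tables in the SQLite file created by schema.py
con = sqlite3.connect('fruits.db')
print(con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'").fetchall())
con.close()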
Add a new pageSize parameter to request 100 documents per page.
query['pageSize'] = 100
In each request, advance the query parameter pageNumber by one. The first record retrieved will be pageNumber * pageSize.
Insert the fruits (the key:value pairs stored in values) in bulk to the database with engine.execute().
In each iteration of the loop, we use a list comprehension to extract the value corresponding to the amount of sugar from each of the foods in the page of results returned by the query.
table = Food.metadata.tables['food']
for i in range(0, 10):
    # advance page and query
    query['pageNumber'] = i
    response = requests.get(api + path, params=query)
    page = response.json()
    fruits = page['foods']
    # save page with session engine
    values = [
        {'name': fruit['description'],
         'sugar': next(
             iter([nutrient['value']
                   for nutrient in fruit['foodNutrients']
                   if nutrient['nutrientName'][0:5] == 'Sugar']),
             None)}
        for fruit in fruits
    ]
    insert = table.insert().values(values)
    engine.execute(insert)
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9b686da0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9b492f60>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9db07160>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9dda33c8>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9db07cc0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9dda7630>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9da90cc0>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9ddb05f8>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9da9ad68>
<sqlalchemy.engine.result.ResultProxy object at 0x7efe9ddb2278>
View the records in the database by reading everything we have so far back into a DataFrame.
import pandas as pd
df = pd.read_sql_table('food', engine)
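As a quick check on what was stashed, you might sort by the sugar column (a sketch; the column names come from the Food table defined above):

# Peek at the highest-sugar items retrieved so far
df.sort_values('sugar', ascending=False).head()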
Don’t forget to disconnect from your database!
engine.dispose()
Takeaway
- Web scraping is hard and unreliable, but sometimes there is no other option.
- Web services are the most common resource.
- Search PyPI for an API you plan to use.
Web services do not always have great documentation: what parameters are acceptable or necessary may not be clear. Some may even be poorly documented on purpose if the API wasn’t designed for public use! Even if you plan to acquire data using the “raw” web service, try searching PyPI for a relevant package. The package documentation could help.
For more resources, and to discover more APIs, visit:
- DataOne: A network of data repositories making data more accessible and usable.
  - Data Discovery: Portal to search DataOne
- SODA Developers: An open data API to access open data resources from many organizations and governments.
- Public APIs: A repository listing free APIs for use in your research and projects.
Exercises
Exercise 1
Identify the name of the census variable in the table of ACS variables whose label includes “COUNT OF THE POPULATION”. Next, use the Census API to collect the data for this variable, for every county in the U.S. state with FIPS code ‘24’, into a pandas DataFrame.
Exercise 2
Request an API key for data.gov, which will enable you to access the FoodData
Central API. Use the API to collect 3 “pages” of food results matching a search
term of your choice. Modify schema.py
to save the names and sugar contents of the
foods into a new SQLite file.