Web Services and APIs with Python

Objectives for this lesson

  • Address programmatic data acquisition
  • Learn principles of web services
  • Recognize vast opportunities of APIs

Specific achievements

  • Programmatically acquire data embedded in a web page
  • Request data through a REST API
  • Use the census package to acquire data

Why script data acquisition?

  • Too time-intensive to acquire manually
  • Update or reuse for new data
  • Reproducibility
  • Only available through an Application Programming Interface (API)

Tiers of access to online data

  • Scraping: download static data displayed on a web page meant for people
  • REST API: send HTTP requests for data using a URI, following the provider's documentation
  • Specialized package: import a "wrapper" created by a data provider

Requests

That "http" at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol.

In [ ]:
import requests
from bs4 import BeautifulSoup
import re

# fetch the page over HTTP and parse the HTML into a navigable tree
response = requests.get('https://xkcd.com/869')
doc = BeautifulSoup(response.text, 'lxml')
In [ ]:
doc
In [ ]:
# search the page text for the first URL ending in .png (the comic image)
match = re.search(r'https://.*\.png', doc.body.text)
In [ ]:
from IPython.display import Image

Image(url=match.group(0))

Range of complexity

Pages designed for humans are increasingly harder to parse programmatically.

  • Servers provide different responses based on the client's "metadata", such as request headers (see the sketch after this list)
  • JavaScript often needs to be executed by the client
  • The HTML <table> is drifting into obscurity (mostly for the better)
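A minimal sketch of how that "metadata" travels: requests lets you set headers such as User-Agent, and the server may tailor its response to them. The header value below is only an illustrative placeholder, and what the server does with it is entirely up to the server.

In [ ]:
import requests

# send the same request with an explicit User-Agent header;
# the value here is a placeholder, not one any particular server expects
headers = {'User-Agent': 'data-lesson-example/0.1'}
response = requests.get('https://xkcd.com/869', headers=headers)
response.request.headers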

HTML Tables

Sites with easily accessible HTML tables are nowadays often geared toward non-human agents. The US Census provides some documentation for its data services in one such massive table:

http://api.census.gov/data/2015/acs5/variables.html

In [ ]:
import pandas as pd

# read_html returns a list of DataFrames, one per <table> on the page
acs5_variables = pd.read_html('http://api.census.gov/data/2015/acs5/variables.html')
acs5_variables = acs5_variables.pop()
acs5_variables.head()
In [ ]:
# boolean mask of variables whose Concept mentions household income
rows = acs5_variables['Concept'].str.contains('Household Income', na=False)
acs5_variables.loc[rows, :]

REST API

The US Census Bureau provides access to its vast stores of demographic data via its API at https://api.census.gov.

The I in API is the entry point into an application: it's the steering wheel and dashboard for whatever more or less complicated vehicle you're driving. In the case of the Census, the main component of the application is a relational database management system. There are probably several GUIs designed for human readers; the Census API is meant for communication between your software and their application.

In a REST API, half of the interface is HTTP, the already universal system for transferring data over the internet between applications (e.g. a web server and your browser). From there, we just need documentation for how to construct the URL in a standards-compliant way.

https://api.census.gov/data/2015/acs5?get=NAME,AIANHH&for=county&in=state:24#irrelevant

Section            Description
https://           scheme
api.census.gov     authority, or simply host if there's no user authentication
/data/2015/acs5    path to a resource within a hierarchy
?                  beginning of the "query" component of the URL
get=NAME,AIANHH    first query parameter
&                  query parameter separator
for=county         second query parameter
&                  query parameter separator
in=state:24        third query parameter
#                  beginning of the "fragment" component of the URL
irrelevant         the fragment is a client-side pointer; it isn't even sent to the server
In [ ]:
# requests assembles and percent-encodes the query string from the params dict
path = 'https://api.census.gov/data/2015/acs5'
query = {'get': 'NAME,AIANHH', 'for': 'county', 'in': 'state:24'}
response = requests.get(path, params=query)
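As a quick sanity check (assuming the request above succeeded), the Response object records the URL that was actually sent, which should match the anatomy in the table above.

In [ ]:
# the fully assembled URL, query string included
response.url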

Interpreting the response

The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a "header" with information about reading the body.

Most REST APIs return:

  • JavaScript Object Notation (JSON)
    • a UTF-8 encoded string of key-value pairs, where values may be lists
    • e.g. {"a": 24, "b": ["x", "y", "z"]}
  • eXtensible Markup Language (XML)
    • hierarchy of <tag></tag>s that do the same thing
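As a minimal sketch of the JSON case, the standard library's json module turns such a string into native Python objects; the string below is just the toy example above, not a real API response.

In [ ]:
import json

# parse a JSON string into native Python types (dict, list, int, str)
json.loads('{"a": 24, "b": ["x", "y", "z"]}')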
In [ ]:
for k, v in response.headers.items():
    print('{}\t{}'.format(k, v))
In [ ]:
# the body is a JSON array of arrays; the first row holds the column names
data = pd.read_json(response.text)
In [ ]:
data

API Keys & Limits

Most servers request good behavior; others enforce it.

  • Size of single query
  • Rate of queries (calls per second, or per day)
  • User credentials specified by an API key

From the Census Bureau

What Are the Query Limits?

You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day. More than 500 queries per IP address per day requires that you register for a Census key. That key will be part of your data request URL string.

Please keep in mind that a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address. If multiple employees were making queries, the 500-query limit would be for the proxy server/firewall, not the individual user.
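A minimal sketch of how a registered key could ride along with the earlier request, assuming you have one; 'YOUR_KEY_HERE' is a placeholder, not a real key.

In [ ]:
import requests

# the key travels as just another query parameter in the request URL
query = {'get': 'NAME,AIANHH', 'for': 'county', 'in': 'state:24', 'key': 'YOUR_KEY_HERE'}
response = requests.get('https://api.census.gov/data/2015/acs5', params=query)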

Specialized Packages

The third tier of access to online data is the most convenient, if it exists. The data provider may also maintain a package in your programming language's repository (PyPI or CRAN).

  • Additional guidance on query parameters
  • Returns data in native formats
  • Handles all "encoding" problems
In [ ]:
from census import Census

# replace None with your Census API key once you register for one
key = None
c = Census(key, year=2015)
In [ ]:
# NAME and the household-income table total (B19001_001E)
# for every census tract in state 24 (Maryland)
variables = ('NAME', 'B19001_001E')
params = {'for': 'tract:*', 'in': 'state:24'}
response = c.acs5.get(variables, params)
response = pd.DataFrame(response)
response.dtypes
In [ ]:
import ggplot as gg

response[variables[1]] = pd.to_numeric(response[variables[1]])
gg.ggplot(response, gg.aes(x='county', y=variables[1])) + gg.geom_boxplot()
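The ggplot package for Python is no longer actively maintained, so the cell above may not run with current pandas. As an alternative sketch (not part of the original lesson), the plotting built into pandas can draw a comparable figure from the same response DataFrame built above.

In [ ]:
import pandas as pd

# make sure the income column is numeric, then draw one box of
# tract-level household-income totals per county
response[variables[1]] = pd.to_numeric(response[variables[1]])
response.boxplot(column=variables[1], by='county')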