Web Services and APIs with Python

Objectives for this lesson

Address programmatic data acquisition
Learn principles of web-services
Recognize vast opportunities of APIs

Specific achievements

Programmatically acquire data embedded in a web page
Request data through a REST API
Use the census package to acquire data

Why script data acquisition?

Too time intensive to acquire manually
Update or reuse for new data
Reproducibility
Only available through an Application Programming Interface (API)

Tiers of access to online data

Scraping: download static data displayed on a web page for people
REST API: send HTTP requests for data using a URI that follows the provider's documentation
Specialized package: import a "wrapper" created by a data provider

Requests

That "http" at the beginning of the URL for a possible data source is a protocol: an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol.
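
To make the idea of a protocol concrete, the sketch below writes out roughly what a client sends for a simple HTTP/1.1 GET request. It is a simplification: the helper name `build_get_request` is made up for illustration, real clients such as requests add more headers, and for an https URL this text travels inside an encrypted connection.

```python
def build_get_request(host, path):
    """Return the text of a minimal HTTP/1.1 GET request (illustrative only)."""
    lines = [
        'GET {} HTTP/1.1'.format(path),  # request line: method, path, protocol version
        'Host: {}'.format(host),         # the Host header is required in HTTP/1.1
        'Connection: close',             # ask the server to close the connection afterwards
        '',                              # a blank line marks the end of the headers
        '',
    ]
    # HTTP separates header lines with a carriage return plus line feed
    return '\r\n'.join(lines)

request_text = build_get_request('xkcd.com', '/869')
print(request_text)
```

Any program that can produce and read messages of this shape can act as a client, which is why a Python script works as well as a browser.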

import requests
from bs4 import BeautifulSoup
import re

response = requests.get('https://xkcd.com/869')
doc = BeautifulSoup(response.text, 'lxml')

doc

# a raw string keeps the backslash in \. from being read as a string escape
match = re.search(r'https://.*\.png', doc.body.text)

from IPython.display import Image

Image(url=match.group(0))
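
The pattern-matching step can be tried without a network connection. The sketch below runs a similar regular expression against a short, made-up piece of page text (the sample string and the helper name `find_png_url` are assumptions for illustration) and returns `None` rather than failing when no image link is present.

```python
import re

def find_png_url(text):
    """Return the first https URL ending in .png, or None if there is none."""
    # \S*? matches as few non-whitespace characters as possible before ".png"
    match = re.search(r'https://\S*?\.png', text)
    return match.group(0) if match else None

sample = 'Image URL (for hotlinking/embedding): https://example.com/comics/sample.png and more text'
print(find_png_url(sample))
print(find_png_url('no image here'))
```

Checking for `None` before calling `group` avoids an `AttributeError` when a page changes and the pattern stops matching.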

Range of complexity

Pages designed for humans are increasingly difficult to parse programmatically.

Servers provide different responses based on the client's "metadata"
JavaScript often needs to be executed by the client
The HTML <table> is drifting into obscurity (mostly for the better)
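
One piece of that client "metadata" is the User-Agent header. As a rough sketch using only the standard library, the request below is built (not sent) with a custom User-Agent; the header value is a made-up placeholder. With requests, the equivalent is the `headers` argument of `requests.get`.

```python
from urllib.request import Request

# Build a request object without sending it; the User-Agent string is a hypothetical example
req = Request('https://xkcd.com/869', headers={'User-Agent': 'my-data-script/0.1'})

print(req.get_method())              # the HTTP method that would be used
print(req.get_header('User-agent'))  # urllib stores header names capitalized this way
```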

HTML Tables

Sites with easily accessible HTML tables nowadays are often geared toward non-human agents. The US Census Bureau provides some documentation for its data services in one such massive table:

http://api.census.gov/data/2015/acs5/variables.html

import pandas as pd

acs5_variables = pd.read_html('http://api.census.gov/data/2015/acs5/variables.html')
acs5_variables = acs5_variables.pop()
acs5_variables.head()

# keep rows whose Concept mentions household income; missing values count as no match
rows = acs5_variables['Concept'].str.contains('Household Income', na=False)
acs5_variables.loc[rows]
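
To see roughly what a table reader has to do, here is a minimal sketch that pulls rows out of a small, made-up HTML table with the standard library and filters them in the same spirit as the cell above. The class name `TableParser`, the variable names, and the sample rows are assumptions for illustration; `pd.read_html` handles far more cases.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text pieces of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hand-written sample; the variable names are placeholders, not real Census variables
html = """<table>
<tr><th>Name</th><th>Concept</th></tr>
<tr><td>VAR_A</td><td>Household Income</td></tr>
<tr><td>VAR_B</td><td>Total Population</td></tr>
</table>"""

parser = TableParser()
parser.feed(html)
header, *records = parser.rows
matches = [r for r in records if 'Household Income' in r[1]]
print(header)
print(matches)
```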

REST API

The US Census Bureau provides access to its vast stores of demographic data via its API at https://api.census.gov.

The "I" in API, the interface, is the entry point into an application: it's the steering wheel and dashboard for whatever more or less complicated vehicle you're driving. In the case of the Census, the main component of the application is a relational database management system. There are probably several GUIs designed for human readers; the Census API is meant for communication between your software and their application.

In a REST API, half of the interface is HTTP, the already universal system for transferring data over the internet between applications (such as a web server and your browser). From there, we just need documentation for how to construct the URL in a standards-compliant way.

https://api.census.gov/data/2015/acs5?get=NAME,AIANHH&for=county&in=state:24#irrelevant

Section	Description
`https://`	scheme
`api.census.gov`	authority, or simply host if there's no user authentication
`/data/2015/acs5`	path to a resource within a hierarchy
`?`	beginning of the "query" component of a URL
`get=NAME,AIANHH`	first query parameter
`&`	query parameter separator
`for=county`	second query parameter
`&`	query parameter separator
`in=state:24`	third query parameter
`#`	beginning of the "fragment" component of a URL
`irrelevant`	the fragment is a client side pointer, it isn't even sent to the server
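
The same breakdown can be reproduced with Python's standard library, which is a handy way to check a URL before sending it. This is a small sketch using `urllib.parse`; nothing is sent over the network.

```python
from urllib.parse import urlsplit, parse_qs

url = 'https://api.census.gov/data/2015/acs5?get=NAME,AIANHH&for=county&in=state:24#irrelevant'
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.netloc)    # api.census.gov
print(parts.path)      # /data/2015/acs5
print(parts.fragment)  # irrelevant

# parse_qs maps each query parameter name to a list of its values
params = parse_qs(parts.query)
print(params)
```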

path = 'https://api.census.gov/data/2015/acs5'
query = {'get':'NAME,AIANHH', 'for':'county', 'in':'state:24'}
response = requests.get(path, params=query)
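
Passing a dictionary as `params` leaves the encoding of the query string to the library. A rough standard-library equivalent, shown only to illustrate the idea (requests may differ in detail), is `urllib.parse.urlencode`, which percent-encodes reserved characters such as the comma and colon.

```python
from urllib.parse import urlencode

path = 'https://api.census.gov/data/2015/acs5'
query = {'get': 'NAME,AIANHH', 'for': 'county', 'in': 'state:24'}

# Join the path and the encoded query with "?" to form the full URL
url = path + '?' + urlencode(query)
print(url)
```

After a request has been sent with requests, `response.url` shows the URL that was actually used.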

Interpreting the response

The body of the response from the API is just a sequence of bytes, but HTTP also includes "headers" with information about how to read the body, such as its content type and encoding.

Most REST APIs return:

JavaScript Object Notation (JSON)
- text (usually UTF-8 encoded) made of key-value pairs, where values may themselves be lists or further key-value pairs
- e.g. {"a": 24, "b": ["x", "y", "z"]}
eXtensible Markup Language (XML)
- a hierarchy of nested <tag></tag> elements that serves the same purpose
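
As a small illustration of the two formats, the sketch below parses the same made-up record from JSON and from XML using the standard library. The XML layout (the `record` and `item` tag names) is an assumption chosen for this example; real services define their own.

```python
import json
import xml.etree.ElementTree as ET

json_text = '{"a": 24, "b": ["x", "y", "z"]}'
xml_text = '<record><a>24</a><b><item>x</item><item>y</item><item>z</item></b></record>'

# JSON maps directly onto Python dictionaries, lists, numbers and strings
from_json = json.loads(json_text)

# XML has no built-in types, so the structure and conversions are spelled out by hand
root = ET.fromstring(xml_text)
from_xml = {
    'a': int(root.find('a').text),
    'b': [item.text for item in root.find('b')],
}

print(from_json == from_xml)
```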

for k, v in response.headers.items():
    print('{}\t{}'.format(k, v))

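from io import StringIO  # newer pandas versions expect a file-like object rather than a raw JSON string
data = pd.read_json(StringIO(response.text))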

data
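
The Census API returns its JSON as a list of rows whose first row holds the column names. The sketch below shows that layout being turned into labeled records with the standard library; the response text is a shortened, hand-written sample with invented county names, not real output.

```python
import json

# Hand-written stand-in for response.text
sample_text = '''[["NAME", "state", "county"],
 ["Example County, Maryland", "24", "001"],
 ["Sample County, Maryland", "24", "003"]]'''

rows = json.loads(sample_text)
header, records = rows[0], rows[1:]

# Pair each value with its column name
table = [dict(zip(header, record)) for record in records]
print(table[0])
```

With pandas, the same split can be passed along as `pd.DataFrame(records, columns=header)` so that the first row becomes the column labels rather than a row of data.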

API Keys & Limits

Most servers request good behavior; others enforce it.

Size of single query
Rate of queries (calls per second, or per day)
User credentials specified by an API key
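
A script can stay under a rate limit by pausing between calls. The following is a minimal sketch of that idea, not a feature of any particular API: the helper `call_with_delay` and the stand-in `fake_request` are hypothetical names, and the interval is arbitrary.

```python
import time

def call_with_delay(func, arguments, min_interval=1.0):
    """Call func once per argument, keeping calls at least min_interval seconds apart."""
    results = []
    last_call = None
    for argument in arguments:
        if last_call is not None:
            # Sleep only for whatever part of the interval has not already passed
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        results.append(func(argument))
    return results

# Stand-in for a function that would send one request per state code
def fake_request(state):
    return 'state:{}'.format(state)

start = time.monotonic()
results = call_with_delay(fake_request, ['24', '10', '51'], min_interval=0.1)
elapsed = time.monotonic() - start
print(results, round(elapsed, 2))
```

In real use, `fake_request` would be replaced by a function that calls `requests.get`, and `min_interval` would be chosen from the provider's documented limits.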

From the Census Bureau

What Are the Query Limits?

You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day. More than 500 queries per IP address per day requires that you register for a Census key. That key will be part of your data request URL string.

Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address. If multiple employees were making queries, the 500-query limit would be for the proxy server/firewall, not the individual user.

Specialized Packages

The third tier of access to online data is the most convenient, if it exists. The data provider may also maintain a package in your programming language's repository (PyPI or CRAN).

Additional guidance on query parameters
Returns data in native formats
Handles all "encoding" problems

from census import Census
key = None
c = Census(key, year=2015)

variables = ('NAME', 'B19001_001E')
params = {'for':'tract:*', 'in':'state:24'}
response = c.acs5.get(variables, params)
response = pd.DataFrame(response)
response.dtypes
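
Under the hood, a wrapper package mostly builds the request for you. The class below is a hypothetical, much-simplified illustration of that idea, not the census package's actual implementation: the name `TinyACSClient` and its method are invented, and it only constructs the URL rather than sending it.

```python
from urllib.parse import urlencode

class TinyACSClient:
    """Hypothetical wrapper that hides URL construction behind a method call."""

    base = 'https://api.census.gov/data/{year}/acs5'

    def __init__(self, key=None, year=2015):
        self.key = key
        self.year = year

    def build_url(self, variables, geo):
        # Variables are joined with commas, as in the "get" parameter of the REST query
        query = {'get': ','.join(variables)}
        query.update(geo)
        if self.key is not None:
            query['key'] = self.key  # the key travels as one more query parameter
        # Keep ",", ":" and "*" readable instead of percent-encoding them
        return self.base.format(year=self.year) + '?' + urlencode(query, safe=',:*')

client = TinyACSClient(year=2015)
url = client.build_url(('NAME', 'B19001_001E'), {'for': 'tract:*', 'in': 'state:24'})
print(url)
```

A real package adds the parts this sketch leaves out: sending the request, checking for errors, and converting the response into native Python objects.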

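# plotnine is a maintained grammar-of-graphics package with the same interface as the older ggplot package
import plotnine as gg

# the API returns every value as a string, so convert the estimate before plotting
response[variables[1]] = pd.to_numeric(response[variables[1]])
gg.ggplot(response, gg.aes(x='county', y=variables[1])) + gg.geom_boxplot()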