That "http" at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol.
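To illustrate that any client can speak the protocol, here is a minimal sketch using only Python's standard library. It starts a throwaway local server and sends a hand-written HTTP request over a plain socket, so it needs neither a browser nor internet access. The names `HelloHandler` and `raw_http_get` are invented for this illustration.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server so the example runs offline."""

    def do_GET(self):
        body = b"hello from the server\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress request logging to keep the example quiet


def raw_http_get(host, port, path="/"):
    """Speak HTTP by hand: send a plain-text request, return the raw reply."""
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    with socket.create_connection((host, port)) as sock:
        sock.sendall(request.encode("ascii"))
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # the server closed the connection
                break
            chunks.append(chunk)
    return b"".join(chunks).decode("utf-8")


server = HTTPServer(("127.0.0.1", 0), HelloHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

reply = raw_http_get("127.0.0.1", server.server_port)
server.shutdown()
server.server_close()

status_line = reply.split("\r\n")[0]
print(status_line)
```

The request is nothing more than a few lines of text in an agreed layout, which is all "knowing the protocol" amounts to here.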
import requests
from bs4 import BeautifulSoup
import re
response = requests.get('https://xkcd.com/869')
doc = BeautifulSoup(response.text, 'lxml')
doc
# raw string so that \. is read as a regex escape (a literal dot), not a string escape
match = re.search(r'https://.*\.png', doc.body.text)
from IPython.display import Image
Image(url=match.group(0))
Pages designed for humans are increasingly hard to parse programmatically.
The HTML `<table>` is drifting into obscurity (mostly for the better). Sites with easily accessible HTML tables nowadays are often geared toward non-human agents. The US Census provides some documentation for its data services in one such massive table:
import pandas as pd
acs5_variables = pd.read_html('http://api.census.gov/data/2015/acs5/variables.html')
acs5_variables = acs5_variables.pop()
acs5_variables.head()
rows = acs5_variables['Concept'].str.contains('Household Income', na = False)
acs5_variables.loc[rows,]
The US Census Bureau provides access to its vast stores of demographic data via its API at https://api.census.gov.
The I in API stands for interface, the entry point into an application: it's the steering wheel and dashboard for whatever more or less complicated vehicle you're driving. In the case of the Census, the main component of the application is a relational database management system. There are probably several GUIs designed for human readers; the Census API is meant for communication between your software and their application.
In a REST API, HTTP, the already universal system for transferring data over the internet between applications (such as a web server and your browser), is half of the interface. From there we just need documentation for how to construct the URL in a standards-compliant way.
https://api.census.gov/data/2015/acs5?get=NAME,AIANHH&for=county&in=state:24#irrelevant
Section | Description |
---|---|
https:// | scheme |
api.census.gov | authority, or simply host if there's no user authentication |
/data/2015/acs5 | path to a resource within a hierarchy |
? | beginning of the "query" component of a URL |
get=NAME,AIANHH | first query parameter |
& | query parameter separator |
for=county | second query parameter |
& | query parameter separator |
in=state:24 | third query parameter |
\# | beginning of the "fragment" component of a URL |
irrelevant | the fragment is a client-side pointer; it isn't even sent to the server |
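The same breakdown can be checked mechanically. The following is a small sketch using the standard library's `urllib.parse`, applied to the example URL above; the variable names are just for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://api.census.gov/data/2015/acs5'
       '?get=NAME,AIANHH&for=county&in=state:24#irrelevant')

parts = urlsplit(url)
print(parts.scheme)    # scheme
print(parts.netloc)    # authority (host)
print(parts.path)      # path to the resource
print(parts.query)     # everything between ? and #
print(parts.fragment)  # client-side pointer, never sent to the server

# Split the query component into its individual parameters.
parameters = parse_qs(parts.query)
print(parameters)
```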
path = 'https://api.census.gov/data/2015/acs5'
query = {'get':'NAME,AIANHH', 'for':'county', 'in':'state:24'}
response = requests.get(path, params=query)
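Passing `params` asks the client library to serialize the dictionary into the query component of the URL. The sketch below approximates that step with the standard library only; it is not a reproduction of what `requests` does internally, just an illustration of how reserved characters such as `,` and `:` get percent-encoded and how the encoding can be undone.

```python
from urllib.parse import urlencode, unquote

path = 'https://api.census.gov/data/2015/acs5'
query = {'get': 'NAME,AIANHH', 'for': 'county', 'in': 'state:24'}

# Serialize the dictionary; reserved characters are percent-encoded.
query_string = urlencode(query)
url = path + '?' + query_string
print(url)

# Decoding the percent-escapes recovers the human-readable form.
readable = unquote(query_string)
print(readable)
```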
The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a "header" with information about reading the body.
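Most REST APIs return one of:
- JSON: key-value pairs where values may be lists, e.g. {'a':24, 'b':['x', 'y', 'z']}
- XML: nested `<tag></tag>`s that do the same thing

# print each response header as a tab-separated name and value
for k, v in response.headers.items():
    print('{}\t{}'.format(k, v))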
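from io import StringIO
# wrap the JSON text in a buffer; recent pandas versions deprecate passing literal JSON
data = pd.read_json(StringIO(response.text))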
data
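To show how the header guides reading the body, here is a minimal stdlib-only sketch. The `decode_body` helper is invented for this illustration, and the header value and body bytes are made-up stand-ins rather than a real API response. It also assumes the body is a JSON array of rows whose first row holds the column names.

```python
import json
from email.message import Message


def decode_body(content_type, body):
    """Decode raw bytes according to a Content-Type header value."""
    header = Message()
    header['Content-Type'] = content_type
    charset = header.get_content_charset() or 'utf-8'
    text = body.decode(charset)
    if header.get_content_type() == 'application/json':
        return json.loads(text)
    return text  # fall back to plain text for other media types


# Hypothetical header and body, invented for illustration only.
content_type = 'application/json;charset=utf-8'
body = b'[["NAME","state","county"],["Example County","24","001"]]'

payload = decode_body(content_type, body)

# Assumption: the first row carries column names, the rest are data rows.
column_names, *rows = payload
records = [dict(zip(column_names, row)) for row in rows]
print(records)
```

Under that assumption, promoting the first row to column names gives labeled records instead of positional ones.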
Most servers request good behavior; others enforce it.
You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day. More than 500 queries per IP address per day requires that you register for a Census key. That key will be part of your data request URL string.
Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address. If multiple employees were making queries, the 500-query limit would be for the proxy server/firewall, not the individual user.
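A client can respect these limits by splitting long variable lists into batches and counting queries before sending them. The following is a rough sketch under the limits quoted above (50 variables per query, 500 queries per day without a key). `batch_variables` and `QueryBudget` are hypothetical helpers, the variable names are placeholders, and the actual request is left out.

```python
def batch_variables(variables, max_per_query=50):
    """Split a list of variable names into chunks of at most max_per_query."""
    return [variables[i:i + max_per_query]
            for i in range(0, len(variables), max_per_query)]


class QueryBudget:
    """Count queries and refuse to exceed a daily limit."""

    def __init__(self, daily_limit=500):
        self.daily_limit = daily_limit
        self.used = 0

    def spend(self):
        if self.used >= self.daily_limit:
            raise RuntimeError('daily query limit reached')
        self.used += 1


# Placeholder variable names, not real Census variables.
variables = ['VAR_{:03d}'.format(i) for i in range(120)]

batches = batch_variables(variables)
budget = QueryBudget()
for batch in batches:
    budget.spend()  # the API request for this batch would go here

print(len(batches), budget.used)
```

Because the limit applies per IP address, a shared proxy or firewall means colleagues draw on the same budget, and this local counter would undercount.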
The third tier of access to online data is the most convenient, if it exists. The data provider may also maintain a package in your programming language's package repository (such as PyPI or CRAN).
from census import Census
key = None
c = Census(key, year=2015)
variables = ('NAME', 'B19001_001E')
params = {'for':'tract:*', 'in':'state:24'}
response = c.acs5.get(variables, params)
response = pd.DataFrame(response)
response.dtypes
import ggplot as gg
response[variables[1]] = pd.to_numeric(response[variables[1]])
gg.ggplot(response, gg.aes(x = 'county', y = variables[1])) + gg.geom_boxplot()