Online Data
Lesson 8 with Ian Carroll
Lesson Objectives
- Distinguish three sources for online data
- Understand how HTTP and web-services work
- Learn Python idioms for requesting data
Specific Achievements
- Programatically aquire data embedded in a web page
- Request data through a REST API
- Use the census package to aquire data
=== <!–
- requests, html, “scraping”
- api, parameters, key’s and rules
- Census package as a python api, google earth example JS and Python API? –>
Why script data aquistion?
- Too time intensive to aquire manually
- Integrate updated or new data
- Reproducibility
- There’s an API between you and the data
Aquiring Online Data
Data can be available on the web in many different forms. The difficulty you will have aquiring that data for local analysis depends on which of three approaches you need.
Scraping 🙁
If a web browser can read HTML and JavaScript to display a page, why can’t your “bot” read HTML and JavaScript to store the data?
RESTful Web Services 😉
An Application Programming Interface (API, as opposed to a GUI) that is compatible with passing data around the web using HTTP. The Hyper-text Transfer Protocol may not be the fastest, but it is universal (it underpins web browsers, after all).
API Wrapper 😂
Major data providers can justify writing a package, specific to your language of choice (e.g. Python or R), that facilitates accessing the data they provide through a web service.
Requests
That “http” at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol. After all, servers exist to serve.
The requests package provides a simple interface to issueing HTTP requests and handling the response.
import requests
response = requests.get('https://xkcd.com/869')
response
<Response [200]>
The response is still binary, it takes a browser-like parser to translate the raw content into an HTML document. BeautifulSoup does a fair job, while making no attempt to “render” a human readable page.
from bs4 import BeautifulSoup
doc = BeautifulSoup(response.text, 'lxml')
print('\n'.join(doc.prettify().splitlines()[0:10]))
<!DOCTYPE html>
<html>
<head>
<link href="/s/b0dcca.css" rel="stylesheet" title="Default" type="text/css"/>
<title>
xkcd: Server Attention Span
</title>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="/s/919f27.ico" rel="shortcut icon" type="image/x-icon"/>
<link href="/s/919f27.ico" rel="icon" type="image/x-icon"/>
Searching the document for desired content is the hard part. This search
uses a CSS query, to find the image below a section of the document with
attribute id = comic
.
img = doc.select('#comic > img').pop()
img
<img alt="Server Attention Span" src="//imgs.xkcd.com/comics/server_attention_span.png" title="They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"/>
It makes sense to query by CSS if the content being scraped always appears the same in a browser; stylesheets are separate from delivered content.
from textwrap import fill
print(fill(img['title'], width = 42))
They have to keep the adjacent rack units
empty. Otherwise, half the entries in
their /var/log/syslog are just 'SERVER
BELOW TRYING TO START CONVERSATION
*AGAIN*.' and 'WISH THEY'D STOP GIVING HIM
SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'
Was that so bad?
Pages designed for humans are increasingly harder to parse programmatically.
- Servers provide different responses based on client “metadata”
- Javascript often needs to be executed by the client
- The HTML
<table>
is drifting into obscurity (mostly for the better)
HTML Tables
Sites with easilly accessible html tables nowadays may be specifically geared toward non-human agents. The US Census provides some documentation for their data services in a massive such table:
http://api.census.gov/data/2015/acs5/variables.html
import pandas as pd
acs5_variables = pd.read_html(
'https://api.census.gov/data/2016/acs/acs5/variables.html'
)
vars = acs5_variables[0]
vars.head()
Name Label ... Group Values
0 AIANHH FIPS AIANHH code ... NaN NaN
1 AIHHTLI American Indian Trust Land/Hawaiian Home Land ... ... NaN NaN
2 AITSCE American Indian Tribal Subdivision (FIPS) ... NaN NaN
3 ANRC Alaska Native Regional Corporation (FIPS) ... NaN NaN
4 B00001_001E Estimate!!Total ... B00001 NaN
[5 rows x 9 columns]
We can use our data manipulation tools to search this unwieldy documentation for variables of interest
rows = (
vars['Label']
.str.contains(
'household income',
na = False,
)
)
for idx, row in vars.loc[rows].iterrows():
print('{}:\t{}'.format(row['Name'], row['Label']))
B19013_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013A_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013B_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013C_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013D_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013E_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013F_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013G_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013H_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013I_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025A_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025B_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025C_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025D_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025E_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025F_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025G_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025H_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025I_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19049_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total
B19049_002E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder under 25 years
B19049_003E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 25 to 44 years
B19049_004E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 45 to 64 years
B19049_005E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 65 years and over
B19050_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19050_002E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder under 25 years
B19050_003E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 25 to 44 years
B19050_004E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 45 to 64 years
B19050_005E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 65 years and over
B19202_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202A_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202B_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202C_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202D_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202E_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202F_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202G_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202H_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202I_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19214_001E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19215_001E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total (dollars)
B19215_002E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Total (dollars)
B19215_003E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Total (dollars)
B19215_004E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Householder 15 to 64 years (dollars)
B19215_005E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Householder 65 years and over (dollars)
B19215_006E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Total (dollars)
B19215_007E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Householder 15 to 64 years (dollars)
B19215_008E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Householder 65 years and over (dollars)
B19215_009E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Total (dollars)
B19215_010E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Total (dollars)
B19215_011E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Householder 15 to 64 years (dollars)
B19215_012E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Householder 65 years and over (dollars)
B19215_013E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Total (dollars)
B19215_014E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Householder 15 to 64 years (dollars)
B19215_015E: Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Householder 65 years and over (dollars)
B19216_001E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19216_002E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)
B19216_003E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)
B19216_004E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_005E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)!!Householder 65 years and over (dollars)
B19216_006E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)
B19216_007E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_008E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)!!Householder 65 years and over (dollars)
B19216_009E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)
B19216_010E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)
B19216_011E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_012E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)!!Householder 65 years and over (dollars)
B19216_013E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)
B19216_014E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_015E: Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)!!Householder 65 years and over (dollars)
B25071_001E: Estimate!!Median gross rent as a percentage of household income
B25092_001E: Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Total
B25092_002E: Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Housing units with a mortgage
B25092_003E: Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Housing units without a mortgage
B25099_001E: Estimate!!Median household income!!Total
B25099_002E: Estimate!!Median household income!!Total!!Median household income for units with a mortgage
B25099_003E: Estimate!!Median household income!!Total!!Median household income for units without a mortgage
B25119_001E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total
B25119_002E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied (dollars)
B25119_003E: Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Renter occupied (dollars)
B25120_001E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B25120_002E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied
B25120_003E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied!!Housing units with a mortgage
B25120_004E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied!!Housing units without a mortgage
B25120_005E: Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Renter occupied
REST API
The US Census Burea provides access to its vast stores of demographic data via their API at https://api.census.gov.
The I in API is all the buttons and dials on the same kind of black box you need a GUI for (it’s the same I). Instead of interfacing with a user, those buttons and dials are meant for another software application.
In the case of the Census, the main component of the application is some relational database management system. There probabably are several GUIs designed for humans to query the Census database; the Census API is meant for communication between your program (i.e. script) and their application.
Inspect this URL in your browser.
In a RESTful web service, the already universal system for transferring data over the internet, known as HTTP is half of the interface. All you really need is documentation for how to construct the URL in a standards compliant way that the service will accept.
Section | Description |
---|---|
https:// |
scheme |
api.census.gov |
authority, or simply host if there’s no user authentication |
/data/2015/acs5 |
path to a resource within a hierarchy |
? |
beginning of the query component of a URL |
get=NAME |
first query parameter |
& |
query parameter separator |
for=county |
second query parameter |
& |
query parameter separator |
in=state:* |
third query parameter |
# |
beginning of the fragment component of a URL |
irrelevant |
the fragment is a client side pointer, it isn’t even sent to the server |
path = 'https://api.census.gov/data/2016/acs/acs5'
query = {
'get':'NAME,B19013_001E',
'for':'tract:*',
'in':'state:24',
}
response = requests.get(path, params=query)
response
<Response [200]>
Response Header
The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a “header” with information about how to decode the body of the response.
Most REST APIs return as the “content” either:
- Javascript Object Notation (JSON)
- a UTF-8 encoded string of key-value pairs, where values may be lists
- e.g.
{'a':24, 'b': ['x', 'y', 'z']}
- eXtensible Markup Language (XML)
- a nested
<tag></tag>
hierarchy serving the same purpose
- a nested
The header from Census says the content type is JSON.
for k, v in response.headers.items():
print('{}: {}'.format(k, v))
Server: Apache-Coyote/1.1
Cache-Control: max-age=60, must-revalidate
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
Content-Type: application/json;charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 26 Jul 2018 11:09:26 GMT
Strict-Transport-Security: max-age=31536000
Response Content
Use a JSON reader to extract a Python object. To read it into
a Panda’s DataFrame
, use Panda’s read_json
.
data = pd.read_json(response.content)
data.head()
0 1 2 3 4
0 NAME B19013_001E state county tract
1 Census Tract 1, Allegany County, Maryland 42292 24 001 000100
2 Census Tract 2, Allegany County, Maryland 44125 24 001 000200
3 Census Tract 3, Allegany County, Maryland 39571 24 001 000300
4 Census Tract 4, Allegany County, Maryland 39383 24 001 000400
API Keys & Limits
Most servers request good behavior, others enforce it.
- Size of single query
- Rate of queries (calls per second, or per day)
- User credentials specified by an API key
From the Census FAQ What Are the Query Limits?:
You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day… Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address.
Specialized Packages
The third tier of access to online data is much preferred, if it exists: a dedicated package in your programming language’s repository (PyPI or CRAN).
- Additional guidance on query parameters
- Returns data in native formats
- Handles all “encoding” problems
The census package is a user contributed suite of tools that streamline access to the API.
from census import Census
key = None
c = Census(key, year=2016)
c.acs5
<census.core.ACS5Client at 0x1148c50f0>
Compared to using the API directly via the requests package:
Pros
- More concise code, quicker development
- Package documentation (if present) is usually more user friendly than API documentaion.
- May allow seemless update if API changes
Cons
- No guarantee of updates
- Possibly limited in scope
Query the Census ACS5 survey for the variable B19001_001E
and each
entity’s NAME
.
variables = ('NAME', 'B19013_001E')
The census package converts the JSON string into a Python dictionary. (No need to check headers.)
response = c.acs5.state_county_tract(
variables,
'24',
Census.ALL,
Census.ALL
)
response[0]
{'NAME': 'Census Tract 1, Allegany County, Maryland',
'B19013_001E': 42292.0,
'state': '24',
'county': '001',
'tract': '000100'}
The Pandas DataFrame()
constructor will accept the list of
dictionaries as the sole argument, taking column names from “keys”.
df = pd.DataFrame(response)
mask = df['B19013_001E'] == -666666666.0
df = df.loc[~mask, :]
The seaborn package provides some nice, quick visualizations.
import seaborn as sns
sns.boxplot(
data = df,
x = 'county',
y = 'B19013_001E',
)
<matplotlib.axes._subplots.AxesSubplot at 0x11357d0f0>
Response Stashing
A common strategy that web service providers take to balance their load, is to limit the number of records a single API request can return. The user ends up having to flip through “pages” with the API, handling the response content at each iteration. Options for stashing data are:
- Store it all in memory, write to file at the end.
- Append each response to a file, writing frequently.
- Offload these decisions to database management software.
To repeat the exercise below at home, request an API key at
https://api.data.gov/signup/, and store it in an adjacent api_key.py
file with the single variable API_KEY = your many digit key
.
The “data.gov” API provides a case in point. Take a look at the request for comments posted by the US Department of Interior about Bears Ear National Monument. The document received over two million comments, all accessible through Regulations.gov.
import requests
from api_key import API_KEY
api = 'https://api.data.gov/regulations/v3/'
path = 'document.json'
query = {
'documentId':'DOI-2017-0002-0001',
'api_key':API_KEY,
}
response = requests.get(
api + path,
params=query)
Extract data from the returned JSON object, which gets mapped to a
Python dictionary called doc
.
doc = response.json()
print('{}: {}'.format(
doc['numItemsRecieved']['label'],
doc['numItemsRecieved']['value'],
))
Number of Comments Received: 2839046
Initiate a new API query for public submission (PS) comments and print the dictionary keys in the response.
query = {
'dktid': doc['docketId']['value'],
'dct': 'PS',
'api_key': API_KEY,
}
path = 'documents.json'
response = requests.get(
api + path, params=query)
dkt = response.json()
To inspect the return, we can list the keys in the
parsed dkt
.
list(dkt.keys())
['documents', 'totalNumRecords']
The purported claimed number of results is much larger than the length of the documents array contained in this response.
print('Number received: {}\nTotal number: {}'
.format(
len(dkt['documents']),
dkt['totalNumRecords'],
))
Number received: 25
Total number: 782468
The following commands prepare Python to connect to a database-in-a-file, and creates empty tables in the database if they do not already exist (i.e. it is safe to re-run after you have populated the database).
Step 1: Boilerplate
The SQLAlchemy package has a lot of features, and requires you to be very precise about how to get started.
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
Base = declarative_base()
Step 2: Table Definition
Define the tables that are going to live in the database using Python classes. For each class, its attributes will map to columns in a table.
from sqlalchemy import Column, Integer, Text
class Comment(Base):
__tablename__ = 'comment'
id = Column(Integer, primary_key=True)
comment = Column(Text)
For each document, we’ll just store the “commentText” found in the API response.
doc = dkt['documents'].pop()
doc['commentText']
'I am appalled that our treasured National monuments are up for review at all. Every single one of our parks, monuments and cultural or historic sites is worthwhile and belongs as a part of the American story. I am adamantly opposed to any effort to eliminate or diminish protections for national monuments and I urge you to support our public lands and waters and recommend that our current national monuments remain protected. The short review you are undertaking makes a mockery of the decades of work that local communities have invested to protect these places for future generations, especially Bears Ears National monument, which is the first on the list for this review. Five Tribal nations, Hopi, Navajo, Uintah and Ouray Ute Indian Tribe, Ute Mountain Ute and Zuni tribes came together, for the first time ever, to protect their shared sacred land by advocating for Bears Ears to be made a national monument. Now the Bears Ears Inter-Tribal Coalition is working to protect the national monument, and maintain its integrity. Hear me, and the overwhelming number of people who agree with me: PUBLIC LANDS BELONG IN PUBLIC HANDS. It is your job as the Secretary of the Dept. of Interior to protect and safeguard our national treasures. Please make sure you side with the people who support national parks, monuments, historical and cultural sites. '
Step 3: Connect (and Initialize)
engine = create_engine('sqlite:///BENM.db')
Session = sessionmaker(bind=engine)
Base.metadata.create_all(engine)
You could inspect the BENM database now using any sqlite3 client: you would find one empty “comment” table with fields “id” and “comment”.
Add a new rpp
parameter to request 100
documents per page.
query['rpp'] = 10
In each request, advance the query parameter po
to the number of the
record you want the response to begin with. Insert the documents (the
key:value pairs stored in values
) in bulk to the database with
engine.execute()
.
for i in range(0, 15):
query['po'] = i * query['rpp']
print(query['po'])
response = requests.get(api + path, params=query)
page = response.json()
docs = page['documents']
values = [{'comment': doc['commentText']} for doc in docs]
insert = Comment.__table__.insert().values(values)
engine.execute(insert)
0
10
20
30
40
50
60
70
80
90
100
110
120
130
140
View the records in the database by reading
everyting we have so far back into a DataFrame
.
df = pd.read_sql_table('comment', engine)
Don’t forget to disconnect from your database!
engine.dispose()
Takeaway
-
Web scraping is hard and unreliable, but sometimes there is no other option.
-
RESTful web services are the most common resource.
-
Search PyPI for the specific API you plan to use.
RESTful web services do not always have great documentation—what parameters are acceptable or necessary may not be clear. Some may even be poorly documented on purpose, if the API wasn’t designed for public use. Even if you plan to aquire data using the “raw” web service, try a search for a relevant package on Python. The package documention could help.
If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.
# Nothing here yet!