Online Data

Lesson 8 with Ian Carroll

Lesson Objectives

Distinguish three sources for online data
Understand how HTTP and web-services work
Learn Python idioms for requesting data

Specific Achievements

Programatically aquire data embedded in a web page
Request data through a REST API
Use the census package to aquire data

=== <!–

requests, html, “scraping”
api, parameters, key’s and rules
Census package as a python api, google earth example JS and Python API? –>

Why script data aquistion?

Too time intensive to aquire manually
Integrate updated or new data
Reproducibility
There’s an API between you and the data

Top of Section

Aquiring Online Data

Data can be available on the web in many different forms. The difficulty you will have aquiring that data for local analysis depends on which of three approaches you need.

Scraping 🙁

If a web browser can read HTML and JavaScript to display a page, why can’t your “bot” read HTML and JavaScript to store the data?

RESTful Web Services 😉

An Application Programming Interface (API, as opposed to a GUI) that is compatible with passing data around the web using HTTP. The Hyper-text Transfer Protocol may not be the fastest, but it is universal (it underpins web browsers, after all).

API Wrapper 😂

Major data providers can justify writing a package, specific to your language of choice (e.g. Python or R), that facilitates accessing the data they provide through a web service.

Top of Section

Requests

That “http” at the beginning of the URL for a possible data source is a protocol—an understanding between a client and a server about how to communicate. The client does not have to be a web browser, so long as it knows the protocol. After all, servers exist to serve.

The requests package provides a simple interface to issueing HTTP requests and handling the response.

import requests

response = requests.get('https://xkcd.com/869')
response

<Response [200]>

The response is still binary, it takes a browser-like parser to translate the raw content into an HTML document. BeautifulSoup does a fair job, while making no attempt to “render” a human readable page.

from bs4 import BeautifulSoup

doc = BeautifulSoup(response.text, 'lxml')
print('\n'.join(doc.prettify().splitlines()[0:10]))

<!DOCTYPE html>
<html>
 <head>
  <link href="/s/b0dcca.css" rel="stylesheet" title="Default" type="text/css"/>
  <title>
   xkcd: Server Attention Span
  </title>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <link href="/s/919f27.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/s/919f27.ico" rel="icon" type="image/x-icon"/>

Searching the document for desired content is the hard part. This search uses a CSS query, to find the image below a section of the document with attribute id = comic.

img = doc.select('#comic > img').pop()
img

<img alt="Server Attention Span" src="//imgs.xkcd.com/comics/server_attention_span.png" title="They have to keep the adjacent rack units empty. Otherwise, half the entries in their /var/log/syslog are just 'SERVER BELOW TRYING TO START CONVERSATION *AGAIN*.' and 'WISH THEY'D STOP GIVING HIM SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'"/>

It makes sense to query by CSS if the content being scraped always appears the same in a browser; stylesheets are separate from delivered content.

from textwrap import fill

print(fill(img['title'], width = 42))

They have to keep the adjacent rack units
empty. Otherwise, half the entries in
their /var/log/syslog are just 'SERVER
BELOW TRYING TO START CONVERSATION
*AGAIN*.' and 'WISH THEY'D STOP GIVING HIM
SO MUCH COFFEE IT SPLATTERS EVERYWHERE.'

Was that so bad?

Pages designed for humans are increasingly harder to parse programmatically.

Servers provide different responses based on client “metadata”
Javascript often needs to be executed by the client
The HTML <table> is drifting into obscurity (mostly for the better)

HTML Tables

Sites with easilly accessible html tables nowadays may be specifically geared toward non-human agents. The US Census provides some documentation for their data services in a massive such table:

http://api.census.gov/data/2015/acs5/variables.html

import pandas as pd

acs5_variables = pd.read_html(
    'https://api.census.gov/data/2016/acs/acs5/variables.html'
    )
vars = acs5_variables[0]
vars.head()

          Name                                              Label  ...     Group Values
     AIANHH                                   FIPS AIANHH code  ...       NaN    NaN
    AIHHTLI  American Indian Trust Land/Hawaiian Home Land ...  ...       NaN    NaN
     AITSCE          American Indian Tribal Subdivision (FIPS)  ...       NaN    NaN
       ANRC          Alaska Native Regional Corporation (FIPS)  ...       NaN    NaN
B00001_001E                                    Estimate!!Total  ...    B00001    NaN

[5 rows x 9 columns]

We can use our data manipulation tools to search this unwieldy documentation for variables of interest

rows = (
    vars['Label']
    .str.contains(
        'household income',
        na = False,
        )
    )
for idx, row in vars.loc[rows].iterrows():
    print('{}:\t{}'.format(row['Name'], row['Label']))

B19013_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013A_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013B_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013C_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013D_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013E_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013F_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013G_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013H_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19013I_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025A_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025B_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025C_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025D_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025E_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025F_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025G_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025H_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19025I_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19049_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total
B19049_002E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder under 25 years
B19049_003E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 25 to 44 years
B19049_004E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 45 to 64 years
B19049_005E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 65 years and over
B19050_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19050_002E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder under 25 years
B19050_003E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 25 to 44 years
B19050_004E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 45 to 64 years
B19050_005E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Householder 65 years and over
B19202_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202A_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202B_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202C_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202D_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202E_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202F_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202G_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202H_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19202I_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19214_001E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19215_001E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total (dollars)
B19215_002E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Total (dollars)
B19215_003E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Total (dollars)
B19215_004E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Householder 15 to 64 years (dollars)
B19215_005E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Living alone!!Householder 65 years and over (dollars)
B19215_006E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Total (dollars)
B19215_007E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Householder 15 to 64 years (dollars)
B19215_008E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder!!Not living alone!!Householder 65 years and over (dollars)
B19215_009E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Total (dollars)
B19215_010E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Total (dollars)
B19215_011E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Householder 15 to 64 years (dollars)
B19215_012E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Living alone!!Householder 65 years and over (dollars)
B19215_013E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Total (dollars)
B19215_014E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Householder 15 to 64 years (dollars)
B19215_015E:	Estimate!!Median nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder!!Not living alone!!Householder 65 years and over (dollars)
B19216_001E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)
B19216_002E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)
B19216_003E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)
B19216_004E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_005E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Living alone (dollars)!!Householder 65 years and over (dollars)
B19216_006E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)
B19216_007E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_008E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Male householder (dollars)!!Not living alone (dollars)!!Householder 65 years and over (dollars)
B19216_009E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)
B19216_010E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)
B19216_011E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_012E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Living alone (dollars)!!Householder 65 years and over (dollars)
B19216_013E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)
B19216_014E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)!!Householder 15 to 64 years (dollars)
B19216_015E:	Estimate!!Aggregate nonfamily household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Female householder (dollars)!!Not living alone (dollars)!!Householder 65 years and over (dollars)
B25071_001E:	Estimate!!Median gross rent as a percentage of household income
B25092_001E:	Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Total
B25092_002E:	Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Housing units with a mortgage
B25092_003E:	Estimate!!Median selected monthly owner costs as a percentage of household income in the past 12 months!!Housing units without a mortgage
B25099_001E:	Estimate!!Median household income!!Total
B25099_002E:	Estimate!!Median household income!!Total!!Median household income for units with a mortgage
B25099_003E:	Estimate!!Median household income!!Total!!Median household income for units without a mortgage
B25119_001E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Total
B25119_002E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied (dollars)
B25119_003E:	Estimate!!Median household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Renter occupied (dollars)
B25120_001E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)
B25120_002E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied
B25120_003E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied!!Housing units with a mortgage
B25120_004E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Owner occupied!!Housing units without a mortgage
B25120_005E:	Estimate!!Aggregate household income in the past 12 months (in 2016 inflation-adjusted dollars)!!Renter occupied

Top of Section

REST API

The US Census Burea provides access to its vast stores of demographic data via their API at https://api.census.gov.

The I in API is all the buttons and dials on the same kind of black box you need a GUI for (it’s the same I). Instead of interfacing with a user, those buttons and dials are meant for another software application.

In the case of the Census, the main component of the application is some relational database management system. There probabably are several GUIs designed for humans to query the Census database; the Census API is meant for communication between your program (i.e. script) and their application.

Inspect this URL in your browser.

In a RESTful web service, the already universal system for transferring data over the internet, known as HTTP is half of the interface. All you really need is documentation for how to construct the URL in a standards compliant way that the service will accept.

Section	Description
`https://`	scheme
`api.census.gov`	authority, or simply host if there’s no user authentication
`/data/2015/acs5`	path to a resource within a hierarchy
`?`	beginning of the query component of a URL
`get=NAME`	first query parameter
`&`	query parameter separator
`for=county`	second query parameter
`&`	query parameter separator
`in=state:*`	third query parameter
`#`	beginning of the fragment component of a URL
`irrelevant`	the fragment is a client side pointer, it isn’t even sent to the server

path = 'https://api.census.gov/data/2016/acs/acs5'
query = {
  'get':'NAME,B19013_001E',
  'for':'tract:*',
  'in':'state:24',
}
response = requests.get(path, params=query)
response

<Response [200]>

Response Header

The response from the API is a bunch of 0s and 1s, but part of the HTTP protocol is to include a “header” with information about how to decode the body of the response.

Most REST APIs return as the “content” either:

Javascript Object Notation (JSON)
- a UTF-8 encoded string of key-value pairs, where values may be lists
- e.g. {'a':24, 'b': ['x', 'y', 'z']}
eXtensible Markup Language (XML)
- a nested <tag></tag> hierarchy serving the same purpose

The header from Census says the content type is JSON.

for k, v in response.headers.items():
    print('{}: {}'.format(k, v))

Server: Apache-Coyote/1.1
Cache-Control: max-age=60, must-revalidate
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET,POST
Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept
Content-Type: application/json;charset=utf-8
Transfer-Encoding: chunked
Date: Thu, 26 Jul 2018 11:09:26 GMT
Strict-Transport-Security: max-age=31536000

Response Content

Use a JSON reader to extract a Python object. To read it into a Panda’s DataFrame, use Panda’s read_json.

data = pd.read_json(response.content)
data.head()

                                           0            1      2       3       4
                                     NAME  B19013_001E  state  county   tract
Census Tract 1, Allegany County, Maryland        42292     24     001  000100
Census Tract 2, Allegany County, Maryland        44125     24     001  000200
Census Tract 3, Allegany County, Maryland        39571     24     001  000300
Census Tract 4, Allegany County, Maryland        39383     24     001  000400

API Keys & Limits

Most servers request good behavior, others enforce it.

Size of single query
Rate of queries (calls per second, or per day)
User credentials specified by an API key

From the Census FAQ What Are the Query Limits?:

You can include up to 50 variables in a single API query and can make up to 500 queries per IP address per day… Please keep in mind that all queries from a business or organization having multiple employees might employ a proxy service or firewall. This will make all of the users of that business or organization appear to have the same IP address.

Top of Section

Specialized Packages

The third tier of access to online data is much preferred, if it exists: a dedicated package in your programming language’s repository (PyPI or CRAN).

Additional guidance on query parameters
Returns data in native formats
Handles all “encoding” problems

The census package is a user contributed suite of tools that streamline access to the API.

from census import Census

key = None
c = Census(key, year=2016)
c.acs5

<census.core.ACS5Client at 0x1148c50f0>

Compared to using the API directly via the requests package:

Pros

More concise code, quicker development
Package documentation (if present) is usually more user friendly than API documentaion.
May allow seemless update if API changes

Cons

No guarantee of updates
Possibly limited in scope

Query the Census ACS5 survey for the variable B19001_001E and each entity’s NAME.

variables = ('NAME', 'B19013_001E')

The census package converts the JSON string into a Python dictionary. (No need to check headers.)

response = c.acs5.state_county_tract(
    variables,
    '24',
    Census.ALL,
    Census.ALL
    )
response[0]

{'NAME': 'Census Tract 1, Allegany County, Maryland',
 'B19013_001E': 42292.0,
 'state': '24',
 'county': '001',
 'tract': '000100'}

The Pandas DataFrame() constructor will accept the list of dictionaries as the sole argument, taking column names from “keys”.

df = pd.DataFrame(response)
mask = df['B19013_001E'] == -666666666.0
df = df.loc[~mask, :]

The seaborn package provides some nice, quick visualizations.

import seaborn as sns

sns.boxplot(
  data = df,
  x = 'county',
  y = 'B19013_001E',
)

<matplotlib.axes._subplots.AxesSubplot at 0x11357d0f0>

Top of Section

Response Stashing

A common strategy that web service providers take to balance their load, is to limit the number of records a single API request can return. The user ends up having to flip through “pages” with the API, handling the response content at each iteration. Options for stashing data are:

Store it all in memory, write to file at the end.
Append each response to a file, writing frequently.
Offload these decisions to database management software.

To repeat the exercise below at home, request an API key at https://api.data.gov/signup/, and store it in an adjacent api_key.py file with the single variable API_KEY = your many digit key.

The “data.gov” API provides a case in point. Take a look at the request for comments posted by the US Department of Interior about Bears Ear National Monument. The document received over two million comments, all accessible through Regulations.gov.

import requests
from api_key import API_KEY

api = 'https://api.data.gov/regulations/v3/'
path = 'document.json'
query = {
    'documentId':'DOI-2017-0002-0001',
    'api_key':API_KEY,
    }
response = requests.get(
    api + path,
    params=query)

Extract data from the returned JSON object, which gets mapped to a Python dictionary called doc.

doc = response.json()
print('{}: {}'.format(
    doc['numItemsRecieved']['label'],
    doc['numItemsRecieved']['value'],
))

Number of Comments Received: 2839046

Initiate a new API query for public submission (PS) comments and print the dictionary keys in the response.

query = {
    'dktid': doc['docketId']['value'],
    'dct': 'PS',
    'api_key': API_KEY,
    }
path = 'documents.json'
response = requests.get(
    api + path, params=query)
dkt = response.json()

To inspect the return, we can list the keys in the parsed dkt.

list(dkt.keys())

['documents', 'totalNumRecords']

The purported claimed number of results is much larger than the length of the documents array contained in this response.

print('Number received: {}\nTotal number: {}'
    .format(
        len(dkt['documents']),
        dkt['totalNumRecords'],
))

Number received: 25
Total number: 782468

The following commands prepare Python to connect to a database-in-a-file, and creates empty tables in the database if they do not already exist (i.e. it is safe to re-run after you have populated the database).

Step 1: Boilerplate

The SQLAlchemy package has a lot of features, and requires you to be very precise about how to get started.

from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine

Base = declarative_base()

Step 2: Table Definition

Define the tables that are going to live in the database using Python classes. For each class, its attributes will map to columns in a table.

from sqlalchemy import Column, Integer, Text

class Comment(Base):
    __tablename__ = 'comment'
    
    id = Column(Integer, primary_key=True)
    comment = Column(Text)

For each document, we’ll just store the “commentText” found in the API response.

doc = dkt['documents'].pop()
doc['commentText']

'I am appalled that our treasured National monuments are up for review at all.  Every single one of our parks, monuments and cultural or historic sites is worthwhile and belongs as a part of the American story. I am adamantly opposed to any effort to eliminate or diminish protections for national monuments and I urge you to support our public lands and waters and recommend that our current national monuments remain protected. The short review you are undertaking makes a mockery of the decades of work that local communities have invested to protect these places for future generations, especially Bears Ears National monument, which is the first on the list for this review. Five Tribal nations, Hopi, Navajo, Uintah and Ouray Ute Indian Tribe, Ute Mountain Ute and Zuni tribes came together, for the first time ever, to protect their shared sacred land by advocating for Bears Ears to be made a national monument. Now the Bears Ears Inter-Tribal Coalition is working to protect the national monument, and maintain its integrity. Hear me, and the overwhelming number of people who agree with me: PUBLIC LANDS BELONG IN PUBLIC HANDS. It is your job as the Secretary of the Dept. of Interior to protect and safeguard our national treasures. Please make sure you side with the people who support national parks, monuments, historical and cultural sites. '

Step 3: Connect (and Initialize)

engine = create_engine('sqlite:///BENM.db')
Session = sessionmaker(bind=engine)

Base.metadata.create_all(engine)

You could inspect the BENM database now using any sqlite3 client: you would find one empty “comment” table with fields “id” and “comment”.

Add a new rpp parameter to request 100 documents per page.

query['rpp'] = 10

In each request, advance the query parameter po to the number of the record you want the response to begin with. Insert the documents (the key:value pairs stored in values) in bulk to the database with engine.execute().

for i in range(0, 15):
    query['po'] = i * query['rpp']
    print(query['po'])
    response = requests.get(api + path, params=query)
    page = response.json()
    docs = page['documents']
    values = [{'comment': doc['commentText']} for doc in docs]
    insert = Comment.__table__.insert().values(values)
    engine.execute(insert)

View the records in the database by reading everyting we have so far back into a DataFrame.

df = pd.read_sql_table('comment', engine)

Don’t forget to disconnect from your database!

engine.dispose()

Top of Section

Takeaway

Web scraping is hard and unreliable, but sometimes there is no other option.
RESTful web services are the most common resource.
Search PyPI for the specific API you plan to use.

RESTful web services do not always have great documentation—what parameters are acceptable or necessary may not be clear. Some may even be poorly documented on purpose, if the API wasn’t designed for public use. Even if you plan to aquire data using the “raw” web service, try a search for a relevant package on Python. The package documention could help.

Top of Section

If you need to catch-up before a section of code will work, just squish it's 🍅 to copy code above it into your clipboard. Then paste into your interpreter's console, run, and you'll be ready to start in on that section. Code copied by both 🍅 and 📋 will also appear below, where you can edit first, and then copy, paste, and run again.

# Nothing here yet!