Introduction to Python, Pandas and SQLAlchemy
Lesson 10 with Kelly Hondula
Contents
- Why learn Python?
- Variables
- Data types
- Data structures
- Iteration
- Flow control
- Function definition
- Python packages
- Pandas
- SQLAlchemy
- Summary
Why learn Python?
- Write scripts clearly and quickly
- Community created extensions support high-performance scientific computing (e.g. NumPy and SciKits)
- Common in- and out-side of academia
- Helpful user community on https://stackoverflow.com
- The language for scripting ArcGIS and QGIS pipelines
- Object-oriented design that facilitates database operations
Objectives for this Lesson
- Earn your Python “learner’s permit”
- Work with one basic and one advanced data science package
- Recognize differences between R and Python
Specific Achievements
- Differentiate between data types and structures
- Learn to use indentation as syntax
- Import and query tabular data
- Bind Python objects to database records
Jupyter
Open up worksheet-10.ipynb
after signing into JupyterHub. This worksheet is an IPython Notebook document: it is divided into “cells” that are run independently but access the same Python interpreter. Use the Notebook to write and annotate code.
For “throwaway” evaluations, it’s handy to have a console. After opening the worksheet, go back to the tab showing your documents, and select New > Terminal. In the shell that opens, enter
jupyter console --existing
Variables
Variable assignment attaches the label left of an =
to the return value of the expression on its right.
a = 'xyz'
Colloquially, you might say the new variable a
equals 'xyz'
, but Python makes it easy to “go deeper”. There can be only one string 'xyz'
, so the Python interpreter makes a
into another label for the same 'xyz'
, which we can verify by id()
.
The “in-memory” location of a
returned by id()
…
>>> id(a)
4362913584
… is equal to that of xyz
itself:
>>> id('xyz')
4362913584
The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice.
>>> a is 'xyz'
True
Equal but not the Same
The id()
function helps demonstrate that “equal” is not the “same”.
b = [1, 2, 3]
>>> id(b)
4392204616
The “in-memory” location of the list labeled b
isn’t the same as a list generated on-the-fly:
>>> id([1, 2, 3])
4392203592
Even though b == [1, 2, 3]
returns True
, these are not the same object:
>>> b is [1, 2, 3]
False
Side-effects
The reason to be aware of what b
is has to do with “side-effects”, an import part of Python programming. A side-effect occurs when an expression generates some ripples other than its return value. And side-effects don’t change the label, they effect what the label is assigned to (i.e. what it is).
>>> b.pop()
3
>>> b
[1, 2]
- Question
- Re-check the “in-memory” location—is it the same
b
? - Answer
- Yes! The list got shorter but it is the same list.
Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:
c = b
b.pop()
>>> c
[1]
The assignment to c
does not create a new list, so the side-effect of popping off the tail of b
ripples into c
.
A common mistake for those coming to Python from R, is to write b = b.append(4)
, which overwrites b
with the value None
that happens to be returned by the append()
method.
Not every object is mutable; for example, the a
assigned earlier is not.
>>> x = a
>>> a.upper()
'XYZ'
>>> x
'xyz'
The string ‘xyz’ hasn’t changed—it’s immutable. So it is also a safe guess that there has been no side-effect:
>>> a
'xyz'
Data types
The immutable data types are
'int' |
Integer |
'float' |
Real number |
'str' |
Character string |
'bool' |
True /False |
'tuple' |
Immutable sequence |
Any object can be queried with type()
T = ('x', 3, True)
>>> type(T)
<class 'tuple'>
Python supports the usual arithmetic operators for numeric types:
+ |
addition |
- |
subtraction |
* |
multiplication |
/ |
floating-point division |
** |
exponent |
% |
modulus |
// |
floor division |
Some operators have natural extensions to non-numeric types:
>>> a * 2
'xyzxyz'
>>> T + (3.14, 'y')
('x', 3, True, 3.14, 'y')
Comparison operators are symbols or plain english:
== |
equal |
!= |
non-equal |
> , < |
greater, lesser |
>= , <= |
greater or equal, lesser or equal |
and |
logical and |
or |
logical or |
not |
logical negation |
in |
logical membership |
Exercise 1
Explore the use of in
to test membership in a list. Create a list of multiple integers, and use in
to test membership of some other numbers in your list.
Data structures
The built-in structures for holding multiple values are:
- Tuple
- List
- Set
- Dictionary
Tuple
The simplest kind of sequence, a tuple is declared with comma-separated values inside ()
as we have already seen, or inside nothing at all. Note that to declare a one-tuple, a trailing “,” is required.
T = 'cat',
>>> type(T)
<class 'tuple'>
List
The more common kind of sequence in Python is the list, which is declared with comma-separated values inside []
. Unlike a tuple, a list is mutable.
L = [3.14, 'xyz', T]
>>> type(L)
<class 'list'>
Subsetting Tuples and Lists
Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.
>>> L[0]
3.14
Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.
>>> L[-1]
('cat',)
The syntax L[i:j]
selects a sub-list starting with the element at index
i
and ending with the element at index j - 1
.
>>> L[0:2]
[3.14, 'xyz']
A blank space before or after the “:” indicates the start or end of the list,
respectively. For example, the previous example could have been written
L[:2]
.
A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.
0 1 2 3
| 3.14 | 'xyz' | ('cat',) |
-3 -2 -1
Positive indices are written at the top and negative indices at the bottom.
L[i]
returns the element to the right of i
whereas L[i:j]
returns
elements between i
and j
.
Exercise 2
Create a Python list containing zero as a float, the integer 2, and a tuple of three separate characters. Now, assume you did not know the length of the list and extract the last two elements.
Set
The third and last “sequence” data structure is a set, used mainly for quick access to set operations like “union” and “difference”. Declare a set with comma-separated values inside {}
or by casting another sequence with set()
.
S = set(L)
>>> S.difference({3.14})
{'xyz', ('cat',)}
Python is a principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using []
.
Dictionary
Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers.
Construct a dictionary with comma-separated key:value
pairs in {}
.
toons = {'Snowy':'dog', 'Garfield':'cat', 'Bugs':'bunny'}
>>> type(toons)
<class 'dict'>
Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.
>>> toons['Bugs']
'bunny'
To add a single new element to the dictionary, define a new key:value
pair by assigning a value to a novel key in the dictionary.
toons['Goofy'] = 'dog'
>>> toons
{'Snowy': 'dog', 'Garfield': 'cat', 'Bugs': 'bunny', 'Goofy': 'dog'}
Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value.
Exercise 3
Based on what we have learned so far about lists and dictionaries, think up a data structure suitable for an address book. Using what you come up with, store the contact information (i.e. the name and email address) of three or four (hypothetical) persons as a variable addr
.
Iteration
The data structures just discussed have multiple values. Subsetting is one way to get at them individually. The common case of getting at each one individually is called iterating.
Python formally declares a thing “iterable” if it can be used in an expression for x in y
. where y
is the iterable thing and x
will label each element in turn.
Declarations with Iterables
Packing the for x in y
expression inside a sequence declaration is one way to build a sequence.
letters = [x for x in 'abcde']
>>> letters
['a', 'b', 'c', 'd', 'e']
This way of declaring with for
and in
is called a “comprehension” in Python.
Dictionary Comprehension
To declare a dictionary in this way, specify a key:value
pair.
>>> CAPS = {x:x.upper() for x in 'abcde'}
>>> CAPS
{'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'}
Flow control
The list and dictionary comprehensions embed a short form of the expression used to initiate a looping control statement.
For loops
A for
loop takes any iterable object and executes a block of code once for each
element in
the iterable..
squares = []
for i in range(1, 5):
j = i ** 2
squares.append(j)
n_squares = len(squares)
>>> n_squares
4
The range(i, j)
function creates a list of integers from i
up through j - 1
; just
like in the case of list slices, the range is not inclusive of the upper bound.
Indentation
Note the pattern of the block above:
- the
for x in y
expression is followed by a colon - the following lines are indented equally
- un-indenting indicates the end of the loop (rather than any special character)
Compared with other programming languages in which code indentation only serves to enhance readability, Python uses indentation (and only indentation) to define “code blocks”, a.k.a. statements.
Nesting indentation
A for
loop can be used to iterate over the elements of any list. In the
following example, we create a contact list (as a list of dictionaries), then
perform a loop over all contacts. Within the loop, we use a conditional
statement (if
) to check if the name is ‘Alice’. If so, print the phone
number; otherwise print the name (else
block).
contacts = [
{'name':'Alice', 'phone':'555-111-2222'},
{'name':'Bob', 'phone':'555-333-4444'},
]
for c in contacts:
if c['name'] == 'Alice':
print(c['phone'])
else:
print(c['name'])
555-111-2222
Bob
Exercise 4
Write a for loop that prints all even numbers between 1 and 9. Use the modulo
operator (%
) to check for evenness: if i
is even, then i % 2
returns 0
, because %
gives the remainder after division of the first number by the second.
Function definition
We already saw examples of a few built-in functions, such as type()
and len()
.
New functions are defined as a block of code starting with a def
keyword and (optionally) finishing with a return
.
def add_two(x):
result = x + 2
return result
>>> add_two(10)
12
The def
keyword is followed by the function name, its arguments enclosed in
parentheses (separated by commas if there are more than one), and a colon.
The return
statement is needed to make the function provide output.
The lack of a return
, or return
followed by nothing, causes the function to return the value None
.
After it is defined, the function is invoked using its name and specifying the arguments in parentheses, in the same order as in its definition.
Default arguments
A default value can be “assigned” during function definition.
def add_any(x, y=0):
result = x + y
return result
Then the function can be called without that argument:
>>> add_any(10)
10
… as well as with an argument that will override the default:
>>> add_any(10, 5)
15
Exercise 5
Create a function that takes a list as an argument and returns its first and last elements as a new list.
Methods
The period is a special character in Python that accesses an object’s attributes and methods. In either the Jupyter Notebook or Console, typing an object’s name followed by .
and then pressing the TAB
key brings up suggestions.
>>> squares.index(4)
1
We call this index()
function a method of lists (recall that squares
is of type 'list'
). A useful feature of having methods attached to objects is that we can dial up help on a method as it applies to a particular type.
>>> help(squares.index)
Help on built-in function index:
index(...) method of builtins.list instance
L.index(value, [start, [stop]]) -> integer -- return first index of value.
Raises ValueError if the value is not present.
A major differnce between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R a “dispatcher” examines the attributes of a function call’s arguments and chooses a the particular function to use.
A dictionary method
The update()
method allows you to extend a dictionary with another dictionary of key:value
pairs, while simultaneously overwriting values for existing keys.
toons.update({
'Tweety':'bird',
'Bob':'sponge',
'Bugs':'rabbit',
})
- Question
- How many
key:value
pairs are there now in toons? - Answer
- Five. The key
'Bugs'
is only inserted once.
Note a couple “Pythonic” style choices of the above:
- Leave no space around the
:
when declaringkey:value
pairs - Trailing null arguments are syntactically correct, even advantageous
- White space can be used to improve readiability; indentation within
()
is meaningless
Python packages
So far we have only covered elements of the core Python language. However, most of Python’s useful tools for scientific programming can be found in contributed packages that extend Python in every direction.
PyPI
The Python Package Index is the online directory of contributed Python packages. Interaction with PyPI typically
take place outside of Python, using the command line utility pip
or a GUI package manager.
The vast majority of Python packages can be installed with the command pip install
. For example, to install the pandas package:
pip install pandas
The packages for this lesson are already installed in SESYNC’s “teaching-lab” Docker container.
To access the tools provided by a package, once it has been installed, use the import
command. Their are several variations that
preciesly control the tools made available.
Syntax | Result |
---|---|
import pandas |
all modules are available following pandas. |
import pandas as pd |
all modules are available following pd. |
from pandas import DataFrame |
the DataFrame module is available |
from pandas import DataFrame as DF |
the label DF points to pandas.DataFrame |
Pandas
If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.
The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter reshape and aggregate data stored in data frames.
After importing pandas, we call its read_csv
function to load the Portal
animals data from the file animals.csv
.
import pandas as pd
animals = pd.read_csv("data/animals.csv")
>>> animals.dtypes
id int64
month int64
day int64
year int64
plot_id int64
species_id object
sex object
hindfoot_length float64
weight float64
dtype: object
There are many ways to slice a DataFrame
. To select a subset of rows and/or columns by name, use the loc
attribute and [
for indexing.
animals.loc[:, ['plot_id', 'species_id']]
As with lists, :
by itself indicates all the rows (or columns). Unlike lists, the loc
attribute returns both endpoints of a slice.
>>> animals.loc[2:4, 'plot_id':'sex']
plot_id species_id sex
2 7 DM M
3 3 DM M
4 1 PF M
Use the iloc
attribute of a DataFrame to get rows and/or columns by position, which behaves identically to list indexing.
>>> animals.iloc[2:4, 4:6]
plot_id species_id
2 7 DM
3 3 DM
The default indexing for a DataFrame, without using the loc
or iloc
attributes, is by column name.
animals[['hindfoot_length', 'weight']].describe()
The loc
attribute also allows logical indexing, i.e. the use of a boolean array of appropriate length for the selected dimension.
The subset of animals
where the species is “DM” is extracted into a new data frame.
animals_dm = animals.loc[animals['species_id'] == 'DM', ]
>>> animals_dm.head()
id month day year plot_id species_id sex hindfoot_length weight
1 3 7 16 1977 2 DM F 37.0 NaN
2 4 7 16 1977 7 DM M 36.0 NaN
3 5 7 16 1977 3 DM M 35.0 NaN
6 8 7 16 1977 1 DM M 37.0 NaN
7 9 7 16 1977 1 DM F 34.0 NaN
Aggregation of records in a DataFrame by value of a given variable is performed with the groupby()
method. The resulting “grouped” DataFrame has additional methods (like mean()
) that summarize each group, producing a DataFrame with one record for each group.
dm_stats = (
animals_dm
.groupby('sex')
['hindfoot_length', 'weight']
.mean()
)
>>> dm_stats
hindfoot_length weight
sex
F 35.712692 41.609685
M 36.188229 44.353134
Exercise 6
The count
method for DataFrames (e.g. animals.count()
) returns the number of rows
in a data frame. Find out which month had the most observations recorded
in animals
using groupby()
and count()
.
SQLAlchemy
Working with databases from a high level programming language (Python or R) is usually a one way street: the language helps with complicated queries for pulling out data, but provides little assistance with pushing data back, especially for complicated modifications. SQLAlchemy lays the groundwork for bi-directional flow: Python objects become synchronized with database records.
After some upfront alignment between the Portal Mammals database in PostgreSQL and a Python environment, we’ll be ready to go.
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine
# Create the canvas for a map between Python objects and database records
Base = automap_base()
# Identify database, create the map from the schema, and prepare to create sessions
engine = create_engine('postgresql+pygresql://@localhost/portal')
Base.prepare(engine, reflect=True)
Session = sessionmaker(bind=engine)
Object Relational Mapper (ORM)
The ORM sets up a link between Python objects and records in a database. You could think of a table as defining a class or data structure, in which each field corresponds to an attribute. Each record corresponds to an instance of the class.
>>> Base.classes.keys()
['species', 'animals', 'plots']
The Base
class contains a list of tables that automap_base()
found in the PostgreSQL database.
Pull class definitions directly from the Base
after it has been prepared.
Plots = Base.classes['plots']
Animals = Base.classes['animals']
Species = Base.classes['species']
Then create a new instance by calling the class “constructor” with values specified for each field.
plot = Plots(treatment='Control')
>>> plot.id
Note that plot.id
is empty. We did not specify a value for id
because we don’t know what values would duplicate an existing primary key.
Session
Connections to the database are made within a “session”, as a safety protocol that protects against network disruptions and provides a session.rollback()
method of aborting changes. After adding the new plot
object to a session and commiting the change, the database and Python environments are synched.
session = Session()
session.add(plot)
session.commit()
>>> plot.id
26
The RDBMS inserted a new record, incrementing the id
value as designed, and reflected it back into the object plot
.
Relations
The full reach of the ORM extends into relationships between tables.
Let’s pull a species
record using the session.query()
method.
query = session.query(Species).filter_by(id='RO')
species = query.one()
Not only do we have the species’ attributes
>>> species.id
'RO'
>>> species.genus
'Reithrodontomys'
… we have the list of every animal
linked to this species
by the many-to-one relationship dictated by the use of foreign keys.
>>> species.animals_collection
[<sqlalchemy.ext.automap.animals object at 0x7fa66638b3c8>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b438>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b4e0>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b5c0>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b668>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b748>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b828>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b8d0>]
Update
With SQLAlchemy, you can programatically make complex updates to the data in Python, and then commit the state to the database in one transaction.
Suppose a surveyor mistakenly reversed the labels of two plots during a particular survey, and your job is to correct the database. Here are the two problematic plots.
plot_a = session.query(Plots).filter_by(id=2).one()
plot_b = session.query(Plots).filter_by(id=12).one()
Here are the animals incorrectly associated with each plot.
query = session.query(Animals).filter_by(year=2002, month=1, day=12)
animals_a = query.filter(Animals.plots == plot_a).all()
animals_b = query.filter(Animals.plots == plot_b).all()
>>> animals_a[0].plots is plot_a
True
To swap the plots associated with all these animal records, assign the opposite plot to each animal.plot
attribute.
for animal in animals_a:
animal.plots = plot_b
for animal in animals_b:
animal.plots = plot_a
Check your work …
>>> animals_a[0].plots is plot_b
True
and commit the result to the database.
session.commit()
animals_a[0].plots is plot_b
Query
At any time, collect the result of a SQLAlchemy query into a data fram
query = (session
.query(Animals)
.join(Species)
.filter_by(taxa='Rodent')
)
rodents = pd.read_sql(query.statement, engine)
>>> rodents.head()
id month day year plot_id species_id sex hindfoot_length weight
0 2 7 16 1977 3 NL M 33.0 NaN
1 3 7 16 1977 2 DM F 37.0 NaN
2 4 7 16 1977 7 DM M 36.0 NaN
3 5 7 16 1977 3 DM M 35.0 NaN
4 6 7 16 1977 1 PF M 14.0 NaN
Summary
Python is said to have a gentle learning curve, but all new languages take practice.
Key concepts covered in this lesson include:
- Variable assignment
- Data structures
- Functions and methods
- Python packages
Additional critical packages for data science:
- matplotlib, plotly, seaborn for vizualization
- StatsModels, scikit-learn, and pystan for model fitting
- PyQGIS, Shapely, and Cartopy for GIS