Introduction to Python, Pandas and SQLAlchemy

Lesson 10 with Kelly Hondula

Contents


Why learn Python?

Objectives for this Lesson

Specific Achievements

Jupyter

Open up worksheet-10.ipynb after signing into JupyterHub. This worksheet is an IPython Notebook document: it is divided into “cells” that are run independently but access the same Python interpreter. Use the Notebook to write and annotate code.

For “throwaway” evaluations, it’s handy to have a console. After opening the worksheet, go back to the tab showing your documents, and select New > Terminal. In the shell that opens, enter

jupyter console --existing

Top of Section


Variables

Variable assignment attaches the label left of an = to the return value of the expression on its right.

a = 'xyz'

Colloquially, you might say the new variable a equals 'xyz', but Python makes it easy to “go deeper”. There can be only one string 'xyz', so the Python interpreter makes a into another label for the same 'xyz', which we can verify by id().

The “in-memory” location of a returned by id()

>>> id(a)
4362913584

… is equal to that of xyz itself:

>>> id('xyz')
4362913584

The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice.

>>> a is 'xyz'
True

Equal but not the Same

The id() function helps demonstrate that “equal” is not the “same”.

b = [1, 2, 3]
>>> id(b)
4392204616

The “in-memory” location of the list labeled b isn’t the same as a list generated on-the-fly:

>>> id([1, 2, 3])
4392203592

Even though b == [1, 2, 3] returns True, these are not the same object:

>>> b is [1, 2, 3]
False

Side-effects

The reason to be aware of what b is has to do with “side-effects”, an import part of Python programming. A side-effect occurs when an expression generates some ripples other than its return value. And side-effects don’t change the label, they effect what the label is assigned to (i.e. what it is).

>>> b.pop()
3
>>> b
[1, 2]
Question
Re-check the “in-memory” location—is it the same b?
Answer
Yes! The list got shorter but it is the same list.

Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:

c = b
b.pop()
>>> c
[1]

The assignment to c does not create a new list, so the side-effect of popping off the tail of b ripples into c.

A common mistake for those coming to Python from R, is to write b = b.append(4), which overwrites b with the value None that happens to be returned by the append() method.

Not every object is mutable; for example, the a assigned earlier is not.

>>> x = a
>>> a.upper()
'XYZ'
>>> x
'xyz'

The string ‘xyz’ hasn’t changed—it’s immutable. So it is also a safe guess that there has been no side-effect:

>>> a
'xyz'

Top of Section


Data types

The immutable data types are

'int' Integer
'float' Real number
'str' Character string
'bool' True/False
'tuple' Immutable sequence

Any object can be queried with type()

T = ('x', 3, True)
>>> type(T)
<class 'tuple'>

Python supports the usual arithmetic operators for numeric types:

+ addition
- subtraction
* multiplication
/ floating-point division
** exponent
% modulus
// floor division

Some operators have natural extensions to non-numeric types:

>>> a * 2
'xyzxyz'
>>> T + (3.14, 'y')
('x', 3, True, 3.14, 'y')

Comparison operators are symbols or plain english:

== equal
!= non-equal
>, < greater, lesser
>=, <= greater or equal, lesser or equal
and logical and
or logical or
not logical negation
in logical membership

Exercise 1

Explore the use of in to test membership in a list. Create a list of multiple integers, and use in to test membership of some other numbers in your list.

Top of Section


Data structures

The built-in structures for holding multiple values are:

Tuple

The simplest kind of sequence, a tuple is declared with comma-separated values inside () as we have already seen, or inside nothing at all. Note that to declare a one-tuple, a trailing “,” is required.

T = 'cat',
>>> type(T)
<class 'tuple'>

List

The more common kind of sequence in Python is the list, which is declared with comma-separated values inside []. Unlike a tuple, a list is mutable.

L = [3.14, 'xyz', T]
>>> type(L)
<class 'list'>

Subsetting Tuples and Lists

Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.

>>> L[0]
3.14

Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.

>>> L[-1]
('cat',)

The syntax L[i:j] selects a sub-list starting with the element at index i and ending with the element at index j - 1.

>>> L[0:2]
[3.14, 'xyz']

A blank space before or after the “:” indicates the start or end of the list, respectively. For example, the previous example could have been written L[:2].

A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.

 0      1       2          3 
 | 3.14 | 'xyz' | ('cat',) |
-3     -2      -1

Positive indices are written at the top and negative indices at the bottom. L[i] returns the element to the right of i whereas L[i:j] returns elements between i and j.

Exercise 2

Create a Python list containing zero as a float, the integer 2, and a tuple of three separate characters. Now, assume you did not know the length of the list and extract the last two elements.

Set

The third and last “sequence” data structure is a set, used mainly for quick access to set operations like “union” and “difference”. Declare a set with comma-separated values inside {} or by casting another sequence with set().

S = set(L)
>>> S.difference({3.14})
{'xyz', ('cat',)}

Python is a principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using [].

Dictionary

Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers.

Construct a dictionary with comma-separated key:value pairs in {}.

toons = {'Snowy':'dog', 'Garfield':'cat', 'Bugs':'bunny'}
>>> type(toons)
<class 'dict'>

Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.

>>> toons['Bugs']
'bunny'

To add a single new element to the dictionary, define a new key:value pair by assigning a value to a novel key in the dictionary.

toons['Goofy'] = 'dog'
>>> toons
{'Snowy': 'dog', 'Garfield': 'cat', 'Bugs': 'bunny', 'Goofy': 'dog'}

Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value.

Exercise 3

Based on what we have learned so far about lists and dictionaries, think up a data structure suitable for an address book. Using what you come up with, store the contact information (i.e. the name and email address) of three or four (hypothetical) persons as a variable addr.

Top of Section


Iteration

The data structures just discussed have multiple values. Subsetting is one way to get at them individually. The common case of getting at each one individually is called iterating.

Python formally declares a thing “iterable” if it can be used in an expression for x in y. where y is the iterable thing and x will label each element in turn.

Declarations with Iterables

Packing the for x in y expression inside a sequence declaration is one way to build a sequence.

letters = [x for x in 'abcde']
>>> letters
['a', 'b', 'c', 'd', 'e']

This way of declaring with for and in is called a “comprehension” in Python.

Dictionary Comprehension

To declare a dictionary in this way, specify a key:value pair.

>>> CAPS = {x:x.upper() for x in 'abcde'}
>>> CAPS
{'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'}

Top of Section


Flow control

The list and dictionary comprehensions embed a short form of the expression used to initiate a looping control statement.

For loops

A for loop takes any iterable object and executes a block of code once for each element in the iterable..

squares = []
for i in range(1, 5):
    j = i ** 2
    squares.append(j)
n_squares = len(squares)
>>> n_squares
4

The range(i, j) function creates a list of integers from i up through j - 1; just like in the case of list slices, the range is not inclusive of the upper bound.

Indentation

Note the pattern of the block above:

Compared with other programming languages in which code indentation only serves to enhance readability, Python uses indentation (and only indentation) to define “code blocks”, a.k.a. statements.

Nesting indentation

A for loop can be used to iterate over the elements of any list. In the following example, we create a contact list (as a list of dictionaries), then perform a loop over all contacts. Within the loop, we use a conditional statement (if) to check if the name is ‘Alice’. If so, print the phone number; otherwise print the name (else block).

contacts = [
    {'name':'Alice', 'phone':'555-111-2222'},
    {'name':'Bob', 'phone':'555-333-4444'},
    ]
for c in contacts:
    if c['name'] == 'Alice':
        print(c['phone'])
    else:
        print(c['name'])
555-111-2222
Bob

Exercise 4

Write a for loop that prints all even numbers between 1 and 9. Use the modulo operator (%) to check for evenness: if i is even, then i % 2 returns 0, because % gives the remainder after division of the first number by the second.

Top of Section


Function definition

We already saw examples of a few built-in functions, such as type() and len(). New functions are defined as a block of code starting with a def keyword and (optionally) finishing with a return.

def add_two(x):
    result = x + 2
    return result
>>> add_two(10)
12

The def keyword is followed by the function name, its arguments enclosed in parentheses (separated by commas if there are more than one), and a colon.

The return statement is needed to make the function provide output. The lack of a return, or return followed by nothing, causes the function to return the value None.

After it is defined, the function is invoked using its name and specifying the arguments in parentheses, in the same order as in its definition.

Default arguments

A default value can be “assigned” during function definition.

def add_any(x, y=0):
    result = x + y
    return result

Then the function can be called without that argument:

>>> add_any(10)
10

… as well as with an argument that will override the default:

>>> add_any(10, 5)
15

Exercise 5

Create a function that takes a list as an argument and returns its first and last elements as a new list.

Methods

The period is a special character in Python that accesses an object’s attributes and methods. In either the Jupyter Notebook or Console, typing an object’s name followed by . and then pressing the TAB key brings up suggestions.

>>> squares.index(4)
1

We call this index() function a method of lists (recall that squares is of type 'list'). A useful feature of having methods attached to objects is that we can dial up help on a method as it applies to a particular type.

>>> help(squares.index)
Help on built-in function index:

index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.

A major differnce between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R a “dispatcher” examines the attributes of a function call’s arguments and chooses a the particular function to use.

A dictionary method

The update() method allows you to extend a dictionary with another dictionary of key:value pairs, while simultaneously overwriting values for existing keys.

toons.update({
  'Tweety':'bird',
  'Bob':'sponge',
  'Bugs':'rabbit',
  })
Question
How many key:value pairs are there now in toons?
Answer
Five. The key 'Bugs' is only inserted once.

Note a couple “Pythonic” style choices of the above:

  1. Leave no space around the : when declaring key:value pairs
  2. Trailing null arguments are syntactically correct, even advantageous
  3. White space can be used to improve readiability; indentation within () is meaningless

Top of Section


Python packages

So far we have only covered elements of the core Python language. However, most of Python’s useful tools for scientific programming can be found in contributed packages that extend Python in every direction.

PyPI

The Python Package Index is the online directory of contributed Python packages. Interaction with PyPI typically take place outside of Python, using the command line utility pip or a GUI package manager.

The vast majority of Python packages can be installed with the command pip install. For example, to install the pandas package:

pip install pandas

The packages for this lesson are already installed in SESYNC’s “teaching-lab” Docker container.

To access the tools provided by a package, once it has been installed, use the import command. Their are several variations that preciesly control the tools made available.

Syntax Result
import pandas all modules are available following pandas.
import pandas as pd all modules are available following pd.
from pandas import DataFrame the DataFrame module is available
from pandas import DataFrame as DF the label DF points to pandas.DataFrame

Top of Section


Pandas

If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.

The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter reshape and aggregate data stored in data frames.

After importing pandas, we call its read_csv function to load the Portal animals data from the file animals.csv.

import pandas as pd
animals = pd.read_csv("data/animals.csv")
>>> animals.dtypes
id                   int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

There are many ways to slice a DataFrame. To select a subset of rows and/or columns by name, use the loc attribute and [ for indexing.

animals.loc[:, ['plot_id', 'species_id']]

As with lists, : by itself indicates all the rows (or columns). Unlike lists, the loc attribute returns both endpoints of a slice.

>>> animals.loc[2:4, 'plot_id':'sex']
   plot_id species_id sex
2        7         DM   M
3        3         DM   M
4        1         PF   M

Use the iloc attribute of a DataFrame to get rows and/or columns by position, which behaves identically to list indexing.

>>> animals.iloc[2:4, 4:6]
   plot_id species_id
2        7         DM
3        3         DM

The default indexing for a DataFrame, without using the loc or iloc attributes, is by column name.

animals[['hindfoot_length', 'weight']].describe()

The loc attribute also allows logical indexing, i.e. the use of a boolean array of appropriate length for the selected dimension.

The subset of animals where the species is “DM” is extracted into a new data frame.

animals_dm = animals.loc[animals['species_id'] == 'DM', ]
>>> animals_dm.head()
   id  month  day  year  plot_id species_id sex  hindfoot_length  weight
1   3      7   16  1977        2         DM   F             37.0     NaN
2   4      7   16  1977        7         DM   M             36.0     NaN
3   5      7   16  1977        3         DM   M             35.0     NaN
6   8      7   16  1977        1         DM   M             37.0     NaN
7   9      7   16  1977        1         DM   F             34.0     NaN

Aggregation of records in a DataFrame by value of a given variable is performed with the groupby() method. The resulting “grouped” DataFrame has additional methods (like mean()) that summarize each group, producing a DataFrame with one record for each group.

dm_stats = (
  animals_dm
  .groupby('sex')
  ['hindfoot_length', 'weight']
  .mean()
  )
>>> dm_stats
     hindfoot_length     weight
sex                            
F          35.712692  41.609685
M          36.188229  44.353134

Exercise 6

The count method for DataFrames (e.g. animals.count()) returns the number of rows in a data frame. Find out which month had the most observations recorded in animals using groupby() and count().

Top of Section


SQLAlchemy

Working with databases from a high level programming language (Python or R) is usually a one way street: the language helps with complicated queries for pulling out data, but provides little assistance with pushing data back, especially for complicated modifications. SQLAlchemy lays the groundwork for bi-directional flow: Python objects become synchronized with database records.

After some upfront alignment between the Portal Mammals database in PostgreSQL and a Python environment, we’ll be ready to go.

from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy import create_engine

# Create the canvas for a map between Python objects and database records
Base = automap_base()

# Identify database, create the map from the schema, and prepare to create sessions
engine = create_engine('postgresql+pygresql://@localhost/portal')
Base.prepare(engine, reflect=True)
Session = sessionmaker(bind=engine)

Object Relational Mapper (ORM)

The ORM sets up a link between Python objects and records in a database. You could think of a table as defining a class or data structure, in which each field corresponds to an attribute. Each record corresponds to an instance of the class.

>>> Base.classes.keys()
['species', 'animals', 'plots']

The Base class contains a list of tables that automap_base() found in the PostgreSQL database.

Pull class definitions directly from the Base after it has been prepared.

Plots = Base.classes['plots']
Animals = Base.classes['animals']
Species = Base.classes['species']

Then create a new instance by calling the class “constructor” with values specified for each field.

plot = Plots(treatment='Control')
>>> plot.id

Note that plot.id is empty. We did not specify a value for id because we don’t know what values would duplicate an existing primary key.

Session

Connections to the database are made within a “session”, as a safety protocol that protects against network disruptions and provides a session.rollback() method of aborting changes. After adding the new plot object to a session and commiting the change, the database and Python environments are synched.

session = Session()
session.add(plot)
session.commit()
>>> plot.id
26

The RDBMS inserted a new record, incrementing the id value as designed, and reflected it back into the object plot.

Relations

The full reach of the ORM extends into relationships between tables.

Let’s pull a species record using the session.query() method.

query = session.query(Species).filter_by(id='RO')
species = query.one()

Not only do we have the species’ attributes

>>> species.id
'RO'
>>> species.genus
'Reithrodontomys'

… we have the list of every animal linked to this species by the many-to-one relationship dictated by the use of foreign keys.

>>> species.animals_collection
[<sqlalchemy.ext.automap.animals object at 0x7fa66638b3c8>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b438>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b4e0>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b5c0>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b668>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b748>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b828>, <sqlalchemy.ext.automap.animals object at 0x7fa66638b8d0>]

Update

With SQLAlchemy, you can programatically make complex updates to the data in Python, and then commit the state to the database in one transaction.

Suppose a surveyor mistakenly reversed the labels of two plots during a particular survey, and your job is to correct the database. Here are the two problematic plots.

plot_a = session.query(Plots).filter_by(id=2).one()
plot_b = session.query(Plots).filter_by(id=12).one()

Here are the animals incorrectly associated with each plot.

query = session.query(Animals).filter_by(year=2002, month=1, day=12)
animals_a = query.filter(Animals.plots == plot_a).all()
animals_b = query.filter(Animals.plots == plot_b).all()
>>> animals_a[0].plots is plot_a
True

To swap the plots associated with all these animal records, assign the opposite plot to each animal.plot attribute.

for animal in animals_a:
    animal.plots = plot_b
for animal in animals_b:
    animal.plots = plot_a

Check your work …

>>> animals_a[0].plots is plot_b
True

and commit the result to the database.

session.commit()
animals_a[0].plots is plot_b

Query

At any time, collect the result of a SQLAlchemy query into a data fram

query = (session
    .query(Animals)
    .join(Species)
    .filter_by(taxa='Rodent')
    )
rodents = pd.read_sql(query.statement, engine)
>>> rodents.head()
   id  month  day  year  plot_id species_id sex  hindfoot_length  weight
0   2      7   16  1977        3         NL   M             33.0     NaN
1   3      7   16  1977        2         DM   F             37.0     NaN
2   4      7   16  1977        7         DM   M             36.0     NaN
3   5      7   16  1977        3         DM   M             35.0     NaN
4   6      7   16  1977        1         PF   M             14.0     NaN

Top of Section


Summary

Python is said to have a gentle learning curve, but all new languages take practice.

Key concepts covered in this lesson include:

Additional critical packages for data science:

Top of Section