Introduction to Python with Pandas
Lesson 6 with Ian Carroll
Contents
- Why learn Python?
- Variables
- Data types
- Data structures
- Iteration
- Flow control
- Function definition
- Pandas
- Summary
- Exercise solutions
Why learn Python?
- Write scripts clearly and quickly
- High-performance scientific computing
    - e.g. NumPy and SciKits
- Common in and out of academia
- Helpful user community on https://stackoverflow.com
- The scripting language for ArcGIS and QGIS
Objectives for this Lesson
- Earn your Python “learner’s permit”
- Work with Pandas, the DataFrame package
- Recognize differences between R and Python
Specific Achievements
- Differentiate between data types and structures
- Use “comprehensions” and define functions
- Learn to use indentation as syntax
- Import data and try a simple “split-apply-combine”
Jupyter
Open up worksheet-6.ipynb after signing into JupyterHub. This worksheet
is a Jupyter Notebook document: it is divided into “cells” that are run
independently but access the same Python interpreter. Use the Notebook to write
and annotate code.
After opening worksheet-6.ipynb, right click anywhere in your notebook and choose “Create Console for Notebook”. Drag-and-drop the tabs into whatever arrangement you like.
Variables
Variable assignment attaches the label left of an = to the return
value of the expression on its right.
a = 'xyz'
a
Out[1]: 'xyz'
Colloquially, you might say the new variable a equals 'xyz', but
Python makes it easy to “go deeper”. The interpreter here keeps just
one copy of the string 'xyz', so it makes a into another label for
that same 'xyz', which we can verify with id().
The “in-memory” location of a returned by id() …
id(a)
Out[1]: 4388719672
… is equal to that of xyz itself:
id('xyz')
Out[1]: 4388719672
The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice.
a is 'xyz'
Out[1]: True
Equal but not the Same
The id() function helps demonstrate that “equal” is not the “same”.
b = [1, 2, 3]
id(b)
Out[1]: 4388891208
id([1, 2, 3])
Out[1]: 4388401672
Even though b == [1, 2, 3] returns True, these are not the same
object:
b is [1, 2, 3]
Out[1]: False
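As a minimal sketch of the same distinction, the names first, second and alias below are made up for illustration and are not part of the worksheet. Two lists built separately compare as equal, while only a second label for one list is the “same” object.

```python
# Hypothetical names, chosen so the worksheet's own variables are untouched.
first = [1, 2, 3]
second = [1, 2, 3]   # built separately, with equal contents
alias = first        # a second label for the very same list

print(first == second)          # True: the contents are equal
print(first is second)          # False: two distinct objects
print(alias is first)           # True: one object with two labels
print(id(alias) == id(first))   # True: "is" compares what id() reports
```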
Side-effects
The reason to be aware of what b is has to do with
“side-effects”, an important part of Python programming. A side-effect
occurs when an expression generates some ripples other than its return
value. And side-effects don’t change the label, they affect what the
label is assigned to (i.e. what it is).
b.pop()
b
Out[1]: [1, 2]
- Question
- Re-check the “in-memory” location—is it the same b?
- Answer
- Yes! The list got shorter but it is the same list.
Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:
c = b
b.pop()
Out[1]: 2
c
Out[1]: [1]
The assignment to c does not create a new list, so the side-effect
of popping off the tail of b ripples into c.
A common mistake for those coming to Python from R is to write
b = b.append(4), which overwrites b with the value None that happens
to be returned by the append() method.
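The sketch below pulls these points together, using hypothetical names (original, alias, duplicate, result). It assumes the list method copy(), which is not covered in this lesson, as one way to get an independent list when the shared side-effect is unwanted.

```python
# Hypothetical names for illustration only.
original = [1, 2, 3]
alias = original              # second label, same list
duplicate = original.copy()   # new, independent list with the same values

original.append(4)            # modifies the list in place
print(alias)                  # [1, 2, 3, 4]: the side-effect shows through the alias
print(duplicate)              # [1, 2, 3]: the copy is unaffected

result = original.append(5)   # append() returns None, not the list
print(result)                 # None
```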
Not every object is “mutable” like our list b. For example, the a
assigned earlier is not.
x = a
a.upper()
Out[1]: 'XYZ'
x
Out[1]: 'xyz'
The string ‘xyz’ hasn’t changed—it’s immutable. So it is also a safe
guess that there has been no side-effect on the original a.
a
Out[1]: 'xyz'
Data types
The immutable data types are
| 'int' | Integer | 
| 'float' | Real number | 
| 'str' | Character string | 
| 'bool' | True/False | 
| 'tuple' | Immutable sequence | 
Any object can be queried with type()
T = 'x', 3, True
type(T)
type('x')
Out[1]: str
Operators
Python supports the usual arithmetic operators for numeric types:
| + | addition | 
| - | subtraction | 
| * | multiplication | 
| / | floating-point division | 
| ** | exponent | 
| % | modulus | 
| // | floor division | 
One or both of these might be a surprise:
5 ** 2
Out[1]: 25
2 // 3
Out[1]: 0
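For comparison, a short sketch of the two division operators and the modulus applied to the same pair of numbers. The numbers are arbitrary examples.

```python
print(7 / 2)     # 3.5: floating-point division
print(7 // 2)    # 3: floor division drops the fractional part
print(7 % 2)     # 1: the remainder left over
print(-7 // 2)   # -4: floor division rounds down, not toward zero
print(2 ** 10)   # 1024: ** is the exponent

# The quotient and remainder recombine into the original number.
quotient = 7 // 2
remainder = 7 % 2
print(quotient * 2 + remainder == 7)   # True
```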
Some operators have natural extensions to non-numeric types:
a * 2
Out[1]: 'xyzxyz'
T + (3.14, 'y')
Out[1]: ('x', 3, True, 3.14, 'y')
Comparison and logical operators are symbols or plain English:
| == | equal | 
| != | not equal | 
| >,< | greater, lesser | 
| >=,<= | greater or equal, lesser or equal | 
| and | logical and | 
| or | logical or | 
| not | logical negation | 
| in | logical membership | 
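A brief sketch of how these operators combine. The variable n and its value are arbitrary, and membership is shown on a string so that the list case is left for the exercise.

```python
n = 5   # arbitrary example value

in_range = n > 3 and n < 10
either = n == 5 or n == 6
negated = not n != 5        # "not" applies to the whole comparison n != 5
has_y = 'y' in 'xyz'        # membership also works on strings

print(in_range)   # True
print(either)     # True
print(negated)    # True
print(has_y)      # True
```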
Exercise 1
Explore the use of in to test membership in a list. Create a list of
multiple integers, and use in to test membership of some other
numbers in your list.
Data structures
The built-in structures for holding multiple values are:
- Tuple
- List
- Set
- Dictionary
Tuple
The simplest kind of sequence, a tuple is declared with
comma-separated values, optionally inside ().
T = 'x', 3, True
type(T)
Out[1]: tuple
Note that a one-tuple requires a trailing “,”, whether or not it is written inside “()”.
T = 'cat',
type(T)
Out[1]: tuple
List
The more common kind of sequence in Python is the list, which is
declared with comma-separated values inside []. Unlike a tuple, a
list is mutable.
L = [3.14, 'xyz', T]
type(L)
Out[1]: list
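To illustrate what “mutable” means here, the sketch below assigns to an element of a list and then attempts the same on a tuple. The names values and point are hypothetical, and the try/except construct is not covered in this lesson; it is used only so the sketch runs to the end.

```python
values = [1, 2]   # hypothetical list
point = (1, 2)    # hypothetical tuple

values[0] = 99    # allowed: a list can be changed in place
print(values)     # [99, 2]

try:
    point[0] = 99                 # not allowed: a tuple cannot be changed
except TypeError as err:
    print('TypeError:', err)

print(point)      # (1, 2): still the original tuple
```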
Subsetting Tuples and Lists
Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.
L[0]
Out[1]: 3.14
Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.
L[-1]
Out[1]: ('cat',)
The syntax L[i:j] selects a sub-list starting with the element at index
i and ending with the element at index j - 1.
L[0:2]
Out[1]: [3.14, 'xyz']
Leaving out the index before or after the “:” indicates the start or end of the list,
respectively. For example, the previous example could have been written 
L[:2].
A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.
 0      1       2          3 
 | 3.14 | 'xyz' | ('cat',) |
-3     -2      -1
Positive indices are written at the top and negative indices at the bottom. 
L[i] returns the element to the right of i whereas L[i:j] returns
elements between i and j.
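The “dividers” picture can be tried out on a slightly longer list. The list items below is made up for illustration.

```python
items = ['a', 'b', 'c', 'd', 'e']   # hypothetical list

print(items[1:3])    # ['b', 'c']: between dividers 1 and 3
print(items[:2])     # ['a', 'b']: from the start up to divider 2
print(items[-2:])    # ['d', 'e']: from divider -2 to the end
print(items[1:-1])   # ['b', 'c', 'd']: everything but the first and last

# A slice i:j holds j - i elements.
print(len(items[1:4]) == 4 - 1)   # True
```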
Set
The third and last “sequence” data structure is a set, used mainly for quick access to set operations like “union” and “difference”. Declare a set with comma-separated values inside {} or by casting another sequence with set().
S1 = set(L)
S2 = {3.14, 'z'}
S1.difference(S2)
Out[1]: {('cat',), 'xyz'}
Python is a rather principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using [].
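A small sketch of the other set operations mentioned above, using two made-up sets A and B. Results are compared against expected sets rather than printed directly, since the display order of a set is not guaranteed.

```python
A = {1, 2, 3}   # hypothetical sets
B = {3, 4}

print(A.union(B) == {1, 2, 3, 4})   # True
print(A.intersection(B) == {3})     # True
print(A.difference(B) == {1, 2})    # True

# Casting a list to a set drops duplicate values.
unique = set([1, 1, 2, 2])
print(len(unique))                  # 2
```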
Dictionary
Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers.
Construct a dictionary with comma-separated key: value pairs in {}.
toons = {
  'Snowy': 'dog',
  'Garfield': 'cat',
  'Bugs': 'bunny',
}
type(toons)
Out[1]: dict
Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.
toons['Bugs']
Out[1]: 'bunny'
To add a single new element to the dictionary, define a new
key:value pair by assigning a value to a novel key in the
dictionary.
toons['Goofy'] = 'dog'
toons
Out[1]: {'Bugs': 'bunny', 'Garfield': 'cat', 'Goofy': 'dog', 'Snowy': 'dog'}
Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value.
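A minimal sketch of that overwriting behavior, using a made-up dictionary pets so that toons is left as it is.

```python
pets = {'Rex': 'dog'}   # hypothetical dictionary

pets['Tom'] = 'cat'     # novel key: adds a new pair
pets['Rex'] = 'puppy'   # existing key: overwrites the old value

print(len(pets))        # 2
print(pets['Rex'])      # puppy
print('Tom' in pets)    # True: "in" tests membership among the keys
```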
Exercise 2
Based on what we have learned so far about lists and dictionaries,
think up a data structure suitable for an address book of names and
emails. Now create it! Enter the name and email address for yourself
and your neighbor in a new variable called addr.
Iteration
The data structures just discussed have multiple values. Subsetting is one way to get at them individually. Stepping through all values is called iterating.
Python formally declares a thing “iterable” if it can be used in an
expression for x in y, where y is the iterable thing and x will
label each element in turn.
Declarations with Iterables
Packing the for x in y expression inside a sequence declaration is
one way to build a sequence.
letters = [x for x in 'abcde']
letters
Out[1]: ['a', 'b', 'c', 'd', 'e']
This way of declaring with for and in is called a “comprehension” in Python.
Dictionary Comprehension
To declare a dictionary in this way, specify a key:value pair.
CAPS = {x: x.upper() for x in 'abcde'}
CAPS
Out[1]: {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'}
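Comprehensions are not limited to strings: any iterable can stand in for y, and any expression can be applied to x. The sketch below assumes a made-up list called words.

```python
words = ['cat', 'horse', 'ox']   # hypothetical list

# List comprehension: apply len() to each element.
lengths = [len(w) for w in words]
print(lengths)             # [3, 5, 2]

# Dictionary comprehension: each word becomes a key, its length the value.
by_word = {w: len(w) for w in words}
print(by_word['horse'])    # 5
```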
Flow control
The list and dictionary comprehensions embed a short form of the expression used to initiate a looping control statement.
For loops
A for loop takes any iterable object and executes a block of code
once for each element in the iterable.
squares = []
for i in range(1, 5):
    j = i ** 2
    squares.append(j)
len(squares)
Out[1]: 4
The range(i, j) function produces the sequence of integers from i up
through j - 1; just as with list slices, the range is
not inclusive of the upper bound.
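As a quick check, the sketch below wraps range() in list() to display its values, and rebuilds the squares from the loop above as a list comprehension. The names squares_loop and squares_comp are hypothetical.

```python
print(list(range(1, 5)))   # [1, 2, 3, 4]: the upper bound 5 is excluded

# The for loop from above ...
squares_loop = []
for i in range(1, 5):
    squares_loop.append(i ** 2)

# ... and the same result written as a comprehension.
squares_comp = [i ** 2 for i in range(1, 5)]

print(squares_comp)                   # [1, 4, 9, 16]
print(squares_loop == squares_comp)   # True
```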
Indentation
Note the pattern of the block above:
- the for x in y expression is followed by a colon
- the following lines are indented equally
- un-indenting indicates the end of the block
Compared with other programming languages in which code indentation only serves to enhance readability, Python uses indentation (and only indentation) to define “code blocks”, a.k.a. statements.
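To see that indentation alone decides what belongs to a block, compare the two assignments in this sketch. The variable names are made up for illustration.

```python
total_inside = 0
count_outside = 0

for i in range(3):
    total_inside = total_inside + i   # indented: runs on every pass (0, 1, 2)
count_outside = count_outside + 1     # un-indented: runs once, after the loop ends

print(total_inside)    # 3
print(count_outside)   # 1
```

Indenting the last assignment by four spaces would move it into the loop, and count_outside would end up as 3.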
Nesting indentation
Each level of indentation indicates blocks within blocks. Nesting a conditional within a for-loop is a common case.
The following example creates a contact list (as a list of
dictionaries), then performs a loop over all contacts. Within the
loop, a conditional statement (if) checks if the name is ‘Alice’. If
so, the interpreter prints the phone number; otherwise it prints the
name (else block).
contacts = [
    {'name':'Alice', 'phone':'555-111-2222'},
    {'name':'Bob', 'phone':'555-333-4444'},
    ]
for c in contacts:
    if c['name'] == 'Alice':
        print(c['phone'])
    else:
        print(c['name'])
555-111-2222
Bob
Exercise 3
Write a for loop that prints all even numbers between 1 and 9. Use the
modulo operator (%) to check for evenness: if i is even, then i %
2 returns 0, because % gives the remainder after division of the
first number by the second.
Function definition
We already saw examples of a few built-in functions, such as type()
and len().  New functions are defined as a block of code starting
with a def keyword and (optionally) finishing with a return.
def add_two(x):
    result = x + 2
    return result
The def keyword is followed by the function name, its arguments enclosed in
parentheses (separated by commas if there is more than one), and a colon.
The return statement is needed to make the function provide output.
The lack of a return, or return followed by nothing, causes the function to return the value None.
add_two(10)
Out[1]: 12
This function is invoked by name followed by any arguments in parentheses and in the order defined.
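The effect of leaving out the return statement can be sketched with two hypothetical functions, with_return and without_return, that differ only in that respect.

```python
def with_return(x):
    return x + 2

def without_return(x):
    result = x + 2   # computed, then discarded when the function ends

print(with_return(10))               # 12
print(without_return(10))            # None
print(without_return(10) is None)    # True
```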
Default arguments
A default value can be “assigned” during function definition.
def add_any(x, y=0):
    result = x + y
    return result
Then the function can be called without that argument:
add_any(10)
Out[1]: 10
Adding an argument will override the default:
add_any(10, 5)
Out[1]: 15
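An argument with a default can also be supplied by name when calling the function. The sketch below assumes a made-up function scale with a default factor of 1.

```python
def scale(x, factor=1):
    # Hypothetical function: multiply x by factor, which defaults to 1.
    return x * factor

print(scale(4))             # 4: the default factor is used
print(scale(4, 3))          # 12: the default is overridden by position
print(scale(4, factor=3))   # 12: the default is overridden by name
```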
Methods
The period is a special character in Python that accesses an object’s
attributes and methods. In either the Jupyter Notebook or Console,
typing an object’s name followed by . and then pressing the TAB
key brings up suggestions.
squares.index(4)
Out[1]: 1
We call this index() function a method of lists (recall that
squares is of type 'list'). A useful feature of having methods
attached to objects is that we can dial up help on a method as it
applies to any instance of a type.
help(squares.index)
Help on built-in function index:
index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.
A major difference between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R a “dispatcher” examines the attributes of a function call’s arguments and chooses the particular function to use.
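On the Python side, this can be seen by calling a method of the same name on objects of different types: each type supplies its own index(), and a type that defines none simply lacks the attribute. The values below are arbitrary examples.

```python
in_list = [1, 4, 9].index(4)     # list method
in_tuple = (1, 4, 9).index(4)    # tuple method
in_string = 'abc'.index('c')     # string method

print(in_list)     # 1
print(in_tuple)    # 1
print(in_string)   # 2

# A dictionary has no index() method at all.
print(hasattr({}, 'index'))      # False
```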
A dictionary method
The update() method allows you to extend a dictionary with another dictionary of key:value pairs, while simultaneously overwriting values for existing keys.
toons.update({
  'Tweety': 'bird',
  'Bob': 'sponge',
  'Bugs': 'rabbit',
})
- Question
- How many key: value pairs are there now in toons?
- Answer
- Six. The key 'Bugs' is only inserted once; its value is overwritten with 'rabbit'.
Note a couple of “Pythonic” style choices in the above:
- Leave a space after the : when declaring key: value pairs
- A trailing comma after the last item is syntactically correct, even advantageous
- White space within () has no meaning and can improve readability
Pandas
If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.
The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter, reshape and aggregate data stored in data frames.
After importing pandas, we call its read_csv function
to load the Portal animals data from the file animals.csv.
import pandas as pd
animals = pd.read_csv("data/animals.csv")
animals.dtypes
Out[1]: 
id                   int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object
There are many ways to slice a DataFrame. To select a subset of rows
and/or columns by name, use the loc attribute and [] for indexing.
animals.loc[:, ['plot_id', 'species_id']]
Out[1]: 
       plot_id species_id
0            3         NL
1            2         DM
2            7         DM
3            3         DM
4            1         PF
5            2         PE
6            1         DM
7            1         DM
8            6         PF
9            5         DS
10           7         DM
11           3         DM
12           8         DM
13           6         DM
14           4         DM
15           3         DS
16           2         PP
17           4         PF
18          11         DS
19          14         DM
20          15         NL
21          13         DM
22          13         SH
23           9         DM
24          15         DM
25          15         DM
26          11         DM
27          11         PP
28          10         DS
29          15         DM
...        ...        ...
35519        9         DM
35520        9         DM
35521        9         DM
35522        9         PB
35523        9         OL
35524        8         OT
35525       13         DO
35526       13         US
35527       13         PB
35528       13         OT
35529       13         PB
35530       14         DM
35531       14         DM
35532       14         DM
35533       14         DM
35534       14         DM
35535       14         DM
35536       15         PB
35537       15         SF
35538       15         PB
35539       15         PB
35540       15         PB
35541       15         PB
35542       15         US
35543       15         AH
35544       15         AH
35545       10         RM
35546        7         DO
35547        5        NaN
35548        2         NL
[35549 rows x 2 columns]
As with lists, : by itself indicates all the rows (or
columns). Unlike lists, the loc attribute returns both endpoints of
a slice.
animals.loc[2:4, 'plot_id':'sex']
Out[1]: 
   plot_id species_id sex
2        7         DM   M
3        3         DM   M
4        1         PF   M
Use the iloc attribute of a DataFrame to get rows and/or columns by
position, which behaves identically to list indexing.
animals.iloc[2:4, 4:6]
Out[1]: 
   plot_id species_id
2        7         DM
3        3         DM
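The difference between the two attributes is easy to check on a small table. The sketch below assumes pandas is installed and uses a made-up five-row DataFrame called df rather than the Portal animals file.

```python
import pandas as pd

# Hypothetical toy data, not read from animals.csv.
df = pd.DataFrame({
    'plot_id': [3, 2, 7, 3, 1],
    'species_id': ['NL', 'DM', 'DM', 'DM', 'PF'],
    'sex': ['M', 'M', 'F', 'M', 'M'],
})

by_label = df.loc[2:4, 'plot_id':'species_id']   # labels: both endpoints included
by_position = df.iloc[2:4, 0:2]                  # positions: upper bound excluded

print(len(by_label))               # 3: rows labeled 2, 3 and 4
print(len(by_position))            # 2: rows at positions 2 and 3
print(list(by_position.columns))   # ['plot_id', 'species_id']
```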
The default indexing for a DataFrame, without using the loc or
iloc attributes, is by column name.
animals[['hindfoot_length', 'weight']].describe()
Out[1]: 
       hindfoot_length        weight
count     31438.000000  32283.000000
mean         29.287932     42.672428
std           9.564759     36.631259
min           2.000000      4.000000
25%          21.000000     20.000000
50%          32.000000     37.000000
75%          36.000000     48.000000
max          70.000000    280.000000
The loc attribute also allows logical indexing, i.e. the use of a
boolean array of appropriate length for the selected dimension. The
subset of animals where the species is “DM” can be extracted with a
logical test.
idx = animals['species_id'] == 'DM'
animals_dm = animals.loc[idx]
animals_dm.head()
Out[1]: 
   id  month  day  year  plot_id species_id sex  hindfoot_length  weight
1   3      7   16  1977        2         DM   F             37.0     NaN
2   4      7   16  1977        7         DM   M             36.0     NaN
3   5      7   16  1977        3         DM   M             35.0     NaN
6   8      7   16  1977        1         DM   M             37.0     NaN
7   9      7   16  1977        1         DM   F             34.0     NaN
Aggregation of records in a DataFrame by value of a given variable is
performed with the groupby() method. The resulting “grouped”
DataFrame can apply aggregations to each group, and combine the result
into a DataFrame with one record for each group.
dm_stats = (
  animals_dm
  .groupby('sex')
  .agg({'hindfoot_length': ['mean', 'std']})
)
dm_stats
Out[1]: 
    hindfoot_length          
               mean       std
sex                          
F         35.712692  1.433067
M         36.188229  1.455396
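The same split-apply-combine steps can be traced by hand on a few rows. The sketch below assumes pandas is installed and uses a made-up DataFrame called toy, with values small enough to check the group means mentally.

```python
import pandas as pd

# Hypothetical toy data, not read from animals.csv.
toy = pd.DataFrame({
    'species_id': ['DM', 'DM', 'DM', 'PF', 'DM'],
    'sex': ['F', 'M', 'F', 'M', 'M'],
    'hindfoot_length': [36.0, 35.0, 38.0, 15.0, 37.0],
})

is_dm = toy['species_id'] == 'DM'   # logical test, one True/False per row
toy_dm = toy.loc[is_dm]             # keep only the "DM" rows

# Split by sex, apply mean() to each group, combine into one result.
means = toy_dm.groupby('sex')['hindfoot_length'].mean()

print(means['F'])   # 37.0: the mean of 36.0 and 38.0
print(means['M'])   # 36.0: the mean of 35.0 and 37.0
```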
Exercise 4
The count() method can be used in a pandas aggregation step to
count non-NA values in a column. Find out which month had the most
observations recorded in animals using groupby() and count(). If
you are feeling adventurous, calculate the average weight in each
month and rename() the columns to “n” and “mean_weight”.
Summary
Python is said to have a gentle learning curve, but all new languages take practice.
Key concepts covered in this lesson include:
- Variable assignment
- Data structures
- Functions and methods
- pandas
Additional critical packages for data science:
- matplotlib, plotly, seaborn for visualization
- StatsModels, scikit-learn, and pystan for model fitting
- PyQGIS, Shapely, and Cartopy for GIS
Exercise solutions
Solution 1
answers = [2, 15, 42, 19]
42 in answers
Out[1]: True
Solution 2
addr = [
  {'name': 'Alice', 'email': 'alice@gmail.com'},
  {'name': 'Bob', 'email': 'bob59@aol.com'},
]
Solution 3
for i in range(1, 9):
  if i % 2 == 0:
    print(i)
2
4
6
8
Solution 4
(
animals
  .groupby('month')
  .agg({'id': 'count', 'weight': 'mean'})
  .rename(columns={'id': 'n', 'weight': 'mean_weight'})
)
Out[1]: 
          n  mean_weight
month                   
1      2518    41.656815
2      2796    40.569822
3      3390    42.558372
4      3443    45.290231
5      3073    46.651155
6      2697    41.161753
7      3633    39.923124
8      2369    42.575729
9      2751    43.830675
10     3064    43.879402
11     3016    43.046996
12     2799    40.408385