Introduction to Python with Pandas
Lesson 6 with Ian Carroll
Contents
- Why learn Python?
- Variables
- Data types
- Data structures
- Iteration
- Flow control
- Function definition
- Pandas
- Summary
- Exercise solutions
Why learn Python?
- Write scripts clearly and quickly
- High-performance scientific computing
  - e.g. NumPy and SciKits
- Common in and out of academia
- Helpful user community on https://stackoverflow.com
- The scripting language for ArcGIS and QGIS
Objectives for this Lesson
- Earn your Python “learner’s permit”
- Work with Pandas, the DataFrame package
- Recognize differences between R and Python
Specific Achievements
- Differentiate between data types and structures
- Use “comprehensions” and define functions
- Learn to use indentation as syntax
- Import data and try a simple “split-apply-combine”
Jupyter
Open up worksheet-6.ipynb after signing into JupyterHub. This worksheet
is a Jupyter Notebook document: it is divided into “cells” that are run
independently but access the same Python interpreter. Use the Notebook to write
and annotate code.
After opening worksheet-6.ipynb, right click anywhere in your notebook and choose “Create Console for Notebook”. Drag-and-drop the tabs into whatever arrangement you like.
Variables
Variable assignment attaches the label left of an = to the return
value of the expression on its right.
a = 'xyz'
a
Out[1]: 'xyz'
Colloquially, you might say the new variable a equals 'xyz', but
Python makes it easy to “go deeper”. Here the interpreter keeps only one
copy of the string 'xyz' (a CPython optimization for short string
literals, not a language guarantee), so it makes a into another label for
the same 'xyz', which we can verify with id().
The “in-memory” location of a returned by id() …
id(a)
Out[1]: 4388719672
… is equal to that of 'xyz' itself:
id('xyz')
Out[1]: 4388719672
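The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice. (Python 3.8 and later print a SyntaxWarning when is is used with a literal, since == is usually what is meant; here identity is the point.)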
a is 'xyz'
Out[1]: True
Equal but not the Same
The id() function helps demonstrate that “equal” is not the “same”.
b = [1, 2, 3]
id(b)
Out[1]: 4388891208
id([1, 2, 3])
Out[1]: 4388401672
Even though b == [1, 2, 3] returns True, these are not the same object:
b is [1, 2, 3]
Out[1]: False
Side-effects
The reason to be aware of what b is has to do with
“side-effects”, an important part of Python programming. A side-effect
occurs when an expression generates some ripples other than its return
value. Side-effects don’t change the label; they affect what the
label is assigned to (i.e. what it is).
b.pop()
b
Out[1]: [1, 2]
- Question
  - Re-check the “in-memory” location: is it the same b?
- Answer
  - Yes! The list got shorter but it is the same list.
Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:
c = b
b.pop()
Out[1]: 2
c
Out[1]: [1]
The assignment to c does not create a new list, so the side-effect
of popping off the tail of b ripples into c.
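When an independent list is wanted, one option is to copy it before mutating. The sketch below is illustrative only; the names original, alias and snapshot are hypothetical and not part of the worksheet.

```python
# Hypothetical names for illustration; not part of the worksheet.
original = [1, 2, 3]
alias = original             # a second label for the same list
snapshot = original.copy()   # a new list holding the same elements

original.pop()               # side-effect: removes the last element in place

print(alias is original)     # True: both labels refer to one object
print(snapshot is original)  # False: the copy is a separate object
print(alias)                 # [1, 2]
print(snapshot)              # [1, 2, 3]
```

The copy is unaffected by the pop() because it is a different object that merely started out equal.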
A common mistake for those coming to Python from R is to write
b = b.append(4), which overwrites b with the value None that happens
to be returned by the append() method.
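A short sketch of that pitfall, using a hypothetical list named values so the worksheet's b is left alone:

```python
# Hypothetical list for illustration.
values = [1, 2, 3]

returned = values.append(4)  # append() mutates the list in place
print(returned)              # None
print(values)                # [1, 2, 3, 4]

# Concatenation builds a new list and leaves the original untouched.
extended = values + [5]
print(extended)              # [1, 2, 3, 4, 5]
print(values)                # [1, 2, 3, 4]
```

Calling values.append(4) on its own line, without assignment, is the usual form.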
Not every object is “mutable” like our list b. For example, the a
assigned earlier is not.
x = a
a.upper()
Out[1]: 'XYZ'
x
Out[1]: 'xyz'
The string 'xyz' hasn’t changed: it’s immutable. So it is also a safe
guess that there has been no side-effect on the original a.
a
Out[1]: 'xyz'
Data types
The immutable data types are:
'int'   | Integer
'float' | Real number
'str'   | Character string
'bool'  | True/False
'tuple' | Immutable sequence
Any object can be queried with type().
T = 'x', 3, True
type(T)
type('x')
Out[1]: str
Operators
Python supports the usual arithmetic operators for numeric types:
+  | addition
-  | subtraction
*  | multiplication
/  | floating-point division
** | exponent
%  | modulus
// | floor division
One or both of these might be a surprise:
5 ** 2
Out[1]: 25
2 // 3
Out[1]: 0
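To see how the division-related operators differ, here is a small illustrative sketch; the variable names are hypothetical and chosen only to label each result.

```python
true_div = 7 / 2      # 3.5: / always performs floating-point division
floor_pos = 7 // 2    # 3: floor division rounds down to a whole number
floor_neg = -7 // 2   # -4: rounded down, not toward zero
remainder = 7 % 2     # 1: the remainder left over by floor division
power = 2 ** 10       # 1024: ** is the exponent operator (^ is bitwise XOR)

print(true_div, floor_pos, floor_neg, remainder, power)
```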
Some operators have natural extensions to non-numeric types:
a * 2
Out[1]: 'xyzxyz'
T + (3.14, 'y')
Out[1]: ('x', 3, True, 3.14, 'y')
Comparison and logical operators are symbols or plain English:
==     | equal
!=     | non-equal
>, <   | greater, lesser
>=, <= | greater or equal, lesser or equal
and    | logical and
or     | logical or
not    | logical negation
in     | logical membership
Exercise 1
Explore the use of in to test membership in a list. Create a list of
multiple integers, and use in to test membership of some other
numbers in your list.
Data structures
The built-in structures for holding multiple values are:
- Tuple
- List
- Set
- Dictionary
Tuple
The simplest kind of sequence, a tuple is declared with
comma-separated values, optionally inside ().
T = 'x', 3, True
type(T)
Out[1]: tuple
Note that to declare a one-tuple, with or without “()”, a trailing “,” is required.
T = 'cat',
type(T)
Out[1]: tuple
List
The more common kind of sequence in Python is the list, which is
declared with comma-separated values inside []. Unlike a tuple, a
list is mutable.
L = [3.14, 'xyz', T]
type(L)
Out[1]: list
Subsetting Tuples and Lists
Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.
L[0]
Out[1]: 3.14
Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.
L[-1]
Out[1]: ('cat',)
The syntax L[i:j] selects a sub-list starting with the element at index
i and ending with the element at index j - 1.
L[0:2]
Out[1]: [3.14, 'xyz']
Omitting the index before or after the “:” indicates the start or end of the list,
respectively. For example, the previous example could have been written L[:2].
A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.
  0      1       2          3
  | 3.14 | 'xyz' | ('cat',) |
 -3     -2      -1
Positive indices are written at the top and negative indices at the bottom.
L[i] returns the element to the right of i whereas L[i:j] returns
elements between i and j.
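The divider picture can be checked with a few slices. The list is rebuilt here so the sketch runs on its own; the names on the left are hypothetical.

```python
# The same list as above, rebuilt so this block is self-contained.
L = [3.14, 'xyz', ('cat',)]

middle_to_end = L[1:3]   # elements between dividers 1 and 3
last_two = L[-2:]        # from divider -2 to the end of the list
all_but_last = L[:-1]    # from the start up to divider -1
nothing = L[1:1]         # no elements lie between a divider and itself

print(middle_to_end)     # ['xyz', ('cat',)]
print(last_two)          # ['xyz', ('cat',)]
print(all_but_last)      # [3.14, 'xyz']
print(nothing)           # []
```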
Set
The third structure, a set, is an unordered collection of unique values, used mainly for quick access to set operations like “union” and “difference”. Declare a set with comma-separated values inside {} or by casting another collection with set().
S1 = set(L)
S2 = {3.14, 'z'}
S1.difference(S2)
Out[1]: {('cat',), 'xyz'}
Python is a rather principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using [].
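A brief sketch of the other common set operations, written with operator shorthand. The sets are rebuilt here so the block is self-contained, and results are compared rather than printed directly because set display order is not guaranteed.

```python
S1 = {3.14, 'xyz', ('cat',)}
S2 = {3.14, 'z'}

union = S1 | S2         # same as S1.union(S2)
common = S1 & S2        # same as S1.intersection(S2)
only_in_s1 = S1 - S2    # same as S1.difference(S2)

print(len(union))                        # 4
print(common == {3.14})                  # True
print(only_in_s1 == {'xyz', ('cat',)})   # True

# Duplicates collapse, because set elements are unique.
print(len(set([1, 1, 2])))               # 2

# Subsetting with [] is not defined for sets and raises a TypeError.
try:
    S1[0]
    indexable = True
except TypeError:
    indexable = False
print(indexable)                         # False
```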
Dictionary
Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers.
Construct a dictionary with comma-separated key: value pairs in {}.
toons = {
    'Snowy': 'dog',
    'Garfield': 'cat',
    'Bugs': 'bunny',
}
type(toons)
Out[1]: dict
Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.
toons['Bugs']
Out[1]: 'bunny'
To add a single new element to the dictionary, define a new
key: value pair by assigning a value to a novel key in the
dictionary.
toons['Goofy'] = 'dog'
toons
Out[1]: {'Bugs': 'bunny', 'Garfield': 'cat', 'Goofy': 'dog', 'Snowy': 'dog'}
Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value.
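The following sketch illustrates overwriting versus adding, plus two ways to check for a key before using it. The dictionary colors is hypothetical, used so that toons is left as it is.

```python
# Hypothetical dictionary for illustration.
colors = {'apple': 'red', 'banana': 'yellow'}

colors['apple'] = 'green'      # existing key: its value is overwritten
colors['cherry'] = 'red'       # novel key: a new pair is added

print(len(colors))             # 3
print('apple' in colors)       # True: in tests the keys
print('green' in colors)       # False: values are not searched
print(colors.get('durian'))    # None, where colors['durian'] would raise KeyError
```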
Exercise 2
Based on what we have learned so far about lists and dictionaries,
think up a data structure suitable for an address book of names and
emails. Now create it! Enter the name and email address for yourself
and your neighbor in a new variable called addr.
Iteration
The data structures just discussed have multiple values. Subsetting is one way to get at them individually. Stepping through all values is called iterating.
Python formally declares a thing “iterable” if it can be used in an
expression for x in y, where y is the iterable thing and x will
label each element in turn.
Declarations with Iterables
Packing the for x in y expression inside a sequence declaration is
one way to build a sequence.
letters = [x for x in 'abcde']
letters
Out[1]: ['a', 'b', 'c', 'd', 'e']
This way of declaring with for and in is called a “comprehension” in Python.
Dictionary Comprehension
To declare a dictionary in this way, specify a key: value pair.
CAPS = {x: x.upper() for x in 'abcde'}
CAPS
Out[1]: {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'}
Flow control
The list and dictionary comprehensions embed a short form of the expression used to initiate a looping control statement.
For loops
A for loop takes any iterable object and executes a block of code
once for each element in the iterable.
squares = []
for i in range(1, 5):
    j = i ** 2
    squares.append(j)
len(squares)
Out[1]: 4
The range(i, j) function produces the integers from i up
through j - 1; just like in the case of list slices, the range is
not inclusive of the upper bound. In Python 3 it returns a lazy range
object rather than a list; wrap it in list() to see the values.
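In Python 3, range() returns a lazy range object, which can be inspected by converting it to a list. The sketch below also rewrites the loop above as a list comprehension, as one possible shorter form.

```python
r = range(1, 5)
print(r)              # range(1, 5)
print(list(r))        # [1, 2, 3, 4]
print(5 in r)         # False: the upper bound is excluded

# The for loop above, written as a list comprehension.
squares = [i ** 2 for i in range(1, 5)]
print(squares)        # [1, 4, 9, 16]
```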
Indentation
Note the pattern of the block above:
- the for x in y expression is followed by a colon
- the following lines are indented equally
- un-indenting indicates the end of the block
Compared with other programming languages in which code indentation only serves to enhance readability, Python uses indentation (and only indentation) to define “code blocks”, a.k.a. statements.
Nesting indentation
Each level of indentation indicates blocks within blocks. Nesting a conditional within a for-loop is a common case.
The following example creates a contact list (as a list of
dictionaries), then performs a loop over all contacts. Within the
loop, a conditional statement (if) checks if the name is 'Alice'. If
so, the interpreter prints the phone number; otherwise it prints the
name (else block).
contacts = [
    {'name': 'Alice', 'phone': '555-111-2222'},
    {'name': 'Bob', 'phone': '555-333-4444'},
]
for c in contacts:
    if c['name'] == 'Alice':
        print(c['phone'])
    else:
        print(c['name'])
555-111-2222
Bob
Exercise 3
Write a for loop that prints all even numbers between 1 and 9. Use the
modulo operator (%) to check for evenness: if i is even, then i % 2
returns 0, because % gives the remainder after division of the
first number by the second.
Function definition
We already saw examples of a few built-in functions, such as type()
and len(). New functions are defined as a block of code starting
with a def keyword and (optionally) finishing with a return.
def add_two(x):
    result = x + 2
    return result
The def keyword is followed by the function name, its arguments enclosed in
parentheses (separated by commas if there is more than one), and a colon.
The return statement is needed to make the function provide output.
The lack of a return, or return followed by nothing, causes the function to return the value None.
add_two(10)
Out[1]: 12
This function is invoked by name followed by any arguments in parentheses and in the order defined.
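The effect of a missing return can be checked directly. In this sketch add_two_no_return is a hypothetical variant written only for comparison.

```python
def add_two(x):
    result = x + 2
    return result

def add_two_no_return(x):   # hypothetical variant, for comparison only
    result = x + 2          # computed, then discarded when the function ends

with_return = add_two(10)
without_return = add_two_no_return(10)

print(with_return)          # 12
print(without_return)       # None
```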
Default arguments
A default value can be “assigned” during function definition.
def add_any(x, y=0):
    result = x + y
    return result
Then the function can be called without that argument:
add_any(10)
Out[1]: 10
Adding an argument will override the default:
add_any(10, 5)
Out[1]: 15
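Arguments can also be passed by name, in which case their order no longer matters. A minimal sketch, repeating the definition so the block runs on its own:

```python
def add_any(x, y=0):
    result = x + y
    return result

by_position = add_any(10, 5)      # positional: matched in the order defined
by_keyword = add_any(10, y=5)     # y is named explicitly
reordered = add_any(y=5, x=10)    # named arguments can come in any order

print(by_position, by_keyword, reordered)   # 15 15 15
```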
Methods
The period is a special character in Python that accesses an object’s
attributes and methods. In either the Jupyter Notebook or Console,
typing an object’s name followed by . and then pressing the TAB
key brings up suggestions.
squares.index(4)
Out[1]: 1
We call this index() function a method of lists (recall that
squares is of type 'list'). A useful feature of having methods
attached to objects is that we can dial up help on a method as it
applies to any instance of a type.
help(squares.index)
Help on built-in function index:
index(...) method of builtins.list instance
L.index(value, [start, [stop]]) -> integer -- return first index of value.
Raises ValueError if the value is not present.
A major difference between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R a “dispatcher” examines the attributes of a function call’s arguments and chooses the particular function to use.
A dictionary method
The update() method allows you to extend a dictionary with another dictionary of key: value
pairs, while simultaneously overwriting values for existing keys.
toons.update({
    'Tweety': 'bird',
    'Bob': 'sponge',
    'Bugs': 'rabbit',
})
- Question
  - How many key: value pairs are there now in toons?
- Answer
  - Six. The four existing pairs (including 'Goofy') gain 'Tweety' and 'Bob'; the key 'Bugs' is only inserted once.
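Like append() on a list, update() works by side-effect and returns None. The sketch below shows this on a small hypothetical dictionary named pets, so that toons is not modified again.

```python
# Hypothetical dictionary for illustration.
pets = {'Rex': 'dog', 'Tom': 'cat'}

result = pets.update({'Tom': 'tomcat', 'Jerry': 'mouse'})

print(result)        # None: update() changes pets in place
print(len(pets))     # 3: 'Tom' was overwritten, 'Jerry' was added
print(pets['Tom'])   # tomcat
```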
Note a couple of “Pythonic” style choices in the above:
- Leave a space after the : when declaring key: value pairs
- A trailing comma after the last item is syntactically correct, even advantageous
- White space within () has no meaning and can improve readability
Pandas
If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.
The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter, reshape and aggregate data stored in data frames.
After importing pandas, we call its read_csv function
to load the Portal animals data from the file animals.csv.
import pandas as pd
animals = pd.read_csv("data/animals.csv")
animals.dtypes
Out[1]:
id int64
month int64
day int64
year int64
plot_id int64
species_id object
sex object
hindfoot_length float64
weight float64
dtype: object
There are many ways to slice a DataFrame. To select a subset of rows
and/or columns by name, use the loc attribute and [] for indexing.
animals.loc[:, ['plot_id', 'species_id']]
Out[1]:
plot_id species_id
0 3 NL
1 2 DM
2 7 DM
3 3 DM
4 1 PF
5 2 PE
6 1 DM
7 1 DM
8 6 PF
9 5 DS
10 7 DM
11 3 DM
12 8 DM
13 6 DM
14 4 DM
15 3 DS
16 2 PP
17 4 PF
18 11 DS
19 14 DM
20 15 NL
21 13 DM
22 13 SH
23 9 DM
24 15 DM
25 15 DM
26 11 DM
27 11 PP
28 10 DS
29 15 DM
... ... ...
35519 9 DM
35520 9 DM
35521 9 DM
35522 9 PB
35523 9 OL
35524 8 OT
35525 13 DO
35526 13 US
35527 13 PB
35528 13 OT
35529 13 PB
35530 14 DM
35531 14 DM
35532 14 DM
35533 14 DM
35534 14 DM
35535 14 DM
35536 15 PB
35537 15 SF
35538 15 PB
35539 15 PB
35540 15 PB
35541 15 PB
35542 15 US
35543 15 AH
35544 15 AH
35545 10 RM
35546 7 DO
35547 5 NaN
35548 2 NL
[35549 rows x 2 columns]
As with lists, : by itself indicates all the rows (or
columns). Unlike lists, the loc attribute returns both endpoints of
a slice.
animals.loc[2:4, 'plot_id':'sex']
Out[1]:
plot_id species_id sex
2 7 DM M
3 3 DM M
4 1 PF M
Use the iloc attribute of a DataFrame to get rows and/or columns by
position, which behaves identically to list indexing.
animals.iloc[2:4, 4:6]
Out[1]:
plot_id species_id
2 7 DM
3 3 DM
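The difference between the two slicing rules is easy to check on a small table. The sketch below builds a tiny hypothetical DataFrame (not the real animals.csv) and applies the same numbers to loc and iloc.

```python
import pandas as pd

# A tiny hypothetical table standing in for animals.csv.
df = pd.DataFrame({
    'plot_id': [3, 2, 7, 3, 1],
    'species_id': ['NL', 'DM', 'DM', 'DM', 'PF'],
    'sex': ['M', 'F', 'M', 'M', 'M'],
})

# loc slices by label and includes both endpoints: rows 2, 3 and 4.
by_label = df.loc[2:4, 'plot_id':'species_id']

# iloc slices by position and excludes the upper bound: rows 2 and 3.
by_position = df.iloc[2:4, 0:2]

print(len(by_label))               # 3
print(len(by_position))            # 2
print(list(by_position.columns))   # ['plot_id', 'species_id']
```

The row labels happen to equal the row positions here because the default index counts from zero; after filtering or sorting, labels and positions can diverge.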
The default indexing for a DataFrame, without using the loc or
iloc attributes, is by column name.
animals[['hindfoot_length', 'weight']].describe()
Out[1]:
hindfoot_length weight
count 31438.000000 32283.000000
mean 29.287932 42.672428
std 9.564759 36.631259
min 2.000000 4.000000
25% 21.000000 20.000000
50% 32.000000 37.000000
75% 36.000000 48.000000
max 70.000000 280.000000
The loc attribute also allows logical indexing, i.e. the use of a
boolean array of appropriate length for the selected dimension. The
subset of animals where the species is “DM” can be extracted with a
logical test.
idx = animals['species_id'] == 'DM'
animals_dm = animals.loc[idx]
animals_dm.head()
Out[1]:
id month day year plot_id species_id sex hindfoot_length weight
1 3 7 16 1977 2 DM F 37.0 NaN
2 4 7 16 1977 7 DM M 36.0 NaN
3 5 7 16 1977 3 DM M 35.0 NaN
6 8 7 16 1977 1 DM M 37.0 NaN
7 9 7 16 1977 1 DM F 34.0 NaN
Aggregation of records in a DataFrame by value of a given variable is
performed with the groupby() method. The resulting “grouped”
DataFrame can apply aggregations to each group, and combine the result
into a DataFrame with one record for each group.
dm_stats = (
    animals_dm
    .groupby('sex')
    .agg({'hindfoot_length': ['mean', 'std']})
)
dm_stats
Out[1]:
hindfoot_length
mean std
sex
F 35.712692 1.433067
M 36.188229 1.455396
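The same split-apply-combine steps can be traced on a table small enough to verify by hand. This is a sketch with made-up values, not the Portal data; the names df, is_dm, stats and f_mean are hypothetical.

```python
import pandas as pd

# Made-up measurements, small enough to check by hand.
df = pd.DataFrame({
    'species_id': ['DM', 'DM', 'DM', 'PF', 'DM'],
    'sex': ['F', 'M', 'M', 'M', 'F'],
    'hindfoot_length': [37.0, 36.0, 35.0, 15.0, 34.0],
})

# Split: keep only the "DM" rows, then group them by sex.
is_dm = df['species_id'] == 'DM'
grouped = df.loc[is_dm].groupby('sex')

# Apply and combine: one row per group, one column per statistic.
stats = grouped.agg({'hindfoot_length': ['mean', 'count']})
print(stats)

# The result has two-level column names: (variable, statistic).
f_mean = stats.loc['F', ('hindfoot_length', 'mean')]
print(f_mean)   # 35.5
```

Passing a list of statistics is what produces the two-level column header seen in dm_stats above.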
Exercise 4
The count() method can be used in a pandas aggregation step to
count non-NA values in a column. Find out which month had the most
observations recorded in animals using groupby() and count(). If
you are feeling adventurous, calculate the average weight in each
month and rename() the columns to “n” and “mean_weight”.
Summary
Python is said to have a gentle learning curve, but all new languages take practice.
Key concepts covered in this lesson include:
- Variable assignment
- Data structures
- Functions and methods
- pandas
Additional critical packages for data science:
- matplotlib, plotly, seaborn for visualization
- StatsModels, scikit-learn, and pystan for model fitting
- PyQGIS, Shapely, and Cartopy for GIS
Exercise solutions
Solution 1
answers = [2, 15, 42, 19]
42 in answers
Out[1]: True
Solution 2
addr = [
    {'name': 'Alice', 'email': 'alice@gmail.com'},
    {'name': 'Bob', 'email': 'bob59@aol.com'},
]
Solution 3
for i in range(1, 9):
    if i % 2 == 0:
        print(i)
2
4
6
8
Solution 4
(
    animals
    .groupby('month')
    .agg({'id': 'count', 'weight': 'mean'})
    # columns= is needed; a bare dict would rename row labels instead
    .rename(columns={'id': 'n', 'weight': 'mean_weight'})
)
Out[1]:
          n  mean_weight
month
1 2518 41.656815
2 2796 40.569822
3 3390 42.558372
4 3443 45.290231
5 3073 46.651155
6 2697 41.161753
7 3633 39.923124
8 2369 42.575729
9 2751 43.830675
10 3064 43.879402
11 3016 43.046996
12 2799 40.408385