Introduction to Python with Pandas

Lesson 6 with Ian Carroll


Why learn Python?

Objectives for this Lesson

Specific Achievements

Jupyter

Open up worksheet-6.ipynb after signing into JupyterHub. This worksheet is a Jupyter Notebook document: it is divided into “cells” that are run independently but access the same Python interpreter. Use the Notebook to write and annotate code.

After opening worksheet-6.ipynb, right click anywhere in your notebook and choose “Create Console for Notebook”. Drag-and-drop the tabs into whatever arrangement you like.


Variables

Variable assignment attaches the label left of an = to the return value of the expression on its right.

a = 'xyz'
a
Out[1]: 'xyz'

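Colloquially, you might say the new variable a equals 'xyz', but Python makes it easy to “go deeper”. In this session the interpreter keeps a single copy of the string 'xyz' and makes a into another label for it, which we can verify with id(). (Sharing identical string literals is a detail of the standard CPython interpreter, not a guarantee of the language.)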

The “in-memory” location of a returned by id()

id(a)
Out[1]: 4388719672

… is equal to that of 'xyz' itself:

id('xyz')
Out[1]: 4388719672

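The idiom to test this “sameness” is typical of the Python language: it uses plain English when words will suffice. (Python 3.8 and later emit a SyntaxWarning when is is used with a literal, because == is usually what is meant; the comparison appears here only to illustrate identity.)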

a is 'xyz'
Out[1]: True

Equal but not the Same

The id() function helps demonstrate that “equal” is not the “same”.

b = [1, 2, 3]
id(b)
Out[1]: 4388891208
id([1, 2, 3])
Out[1]: 4388401672

Even though b == [1, 2, 3] returns True, these are not the same object:

b is [1, 2, 3]
Out[1]: False

Side-effects

The reason to be aware of what b is has to do with “side-effects”, an important part of Python programming. A side-effect occurs when an expression generates some ripples other than its return value. Side-effects don’t change the label; they affect the object the label is assigned to (i.e. what it is).

b.pop()
b
Out[1]: [1, 2]
Question
Re-check the “in-memory” location—is it the same b?
Answer
Yes! The list got shorter but it is the same list.

Side-effects trip up Python programmers when an object has multiple labels, which is not so unusual:

c = b
b.pop()
Out[1]: 2
c
Out[1]: [1]

The assignment to c does not create a new list, so the side-effect of popping off the tail of b ripples into c.
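
The difference between a second label and a separate copy can be sketched as follows. The names original, alias and duplicate are hypothetical and not part of the worksheet; list() is used here as one common way to build a new list with the same contents.

```python
# Hypothetical names, used only for this illustration.
original = [1, 2, 3]
alias = original             # a second label for the same list
duplicate = list(original)   # a new list with equal contents

original.pop()               # side-effect on the shared list

print(alias is original)     # True
print(alias)                 # [1, 2]
print(duplicate)             # [1, 2, 3]
```

Only the labels that point at the same object see the side-effect.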

A common mistake for those coming to Python from R is to write b = b.append(4), which overwrites b with None, the value returned by the append() method.
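
A minimal sketch of this pitfall, using a throwaway list called nums (a hypothetical name chosen so the worksheet's b is left alone):

```python
# nums is a hypothetical list used only for this illustration.
nums = [1, 2, 3]

# append() changes the list in place and returns None.
returned = nums.append(4)
print(returned)   # None
print(nums)       # [1, 2, 3, 4]

# Overwriting the label with the return value loses the list.
nums = nums.append(5)
print(nums)       # None
```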

Not every object is “mutable” like our list b. For example, the a assigned earlier is not.

x = a
a.upper()
Out[1]: 'XYZ'
x
Out[1]: 'xyz'

The string 'xyz' hasn’t changed because it is immutable: upper() returned a new string instead. So it is also a safe guess that there has been no side-effect on the original a.

a
Out[1]: 'xyz'


Data types

The immutable data types are

'int' Integer
'float' Real number
'str' Character string
'bool' True/False
'tuple' Immutable sequence

Any object can be queried with type()

T = 'x', 3, True
type(T)
type('x')
Out[1]: str

Operators

Python supports the usual arithmetic operators for numeric types:

+ addition
- subtraction
* multiplication
/ floating-point division
** exponent
% modulus
// floor division

One or both of these might be a surprise:

5 ** 2
Out[1]: 25
2 // 3
Out[1]: 0

Some operators have natural extensions to non-numeric types:

a * 2
Out[1]: 'xyzxyz'
T + (3.14, 'y')
Out[1]: ('x', 3, True, 3.14, 'y')

Comparison and logical operators are symbols or plain English words:

== equal
!= not equal
>, < greater than, less than
>=, <= greater than or equal, less than or equal
and logical and
or logical or
not logical negation
in logical membership
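
A few of these operators in combination, as a rough sketch. The variable x and its value are made up for the example.

```python
x = 7  # hypothetical value

is_different = x != 5          # comparison with a symbol
in_window = x >= 7 and x < 10  # two comparisons joined by "and"
negated = not x > 3            # "not" applies to the whole comparison x > 3
has_letter = 'y' in 'xyz'      # membership also works on strings

print(is_different, in_window, negated, has_letter)
# True True False True
```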

Exercise 1

Explore the use of in to test membership in a list. Create a list of multiple integers, and use in to test membership of some other numbers in your list.


Data structures

The built-in structures for holding multiple values are the tuple, list, set, and dictionary.

Tuple

The simplest kind of sequence, a tuple is declared with comma-separated values, optionally inside ().

T = 'x', 3, True
type(T)
Out[1]: tuple

Note that a one-tuple always needs a trailing “,”, whether or not it is wrapped in “()”.

T = 'cat',
type(T)
Out[1]: tuple
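
The role of the trailing comma can be checked directly. In this sketch the variable names are hypothetical; parentheses alone only group an expression, so the first value stays a string.

```python
not_a_tuple = ('cat')    # parentheses only group; this is still a str
one_tuple = ('cat',)     # the trailing comma makes it a tuple

print(type(not_a_tuple))   # <class 'str'>
print(type(one_tuple))     # <class 'tuple'>
print(len(one_tuple))      # 1
```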

List

The more common kind of sequence in Python is the list, which is declared with comma-separated values inside []. Unlike a tuple, a list is mutable.

L = [3.14, 'xyz', T]
type(L)
Out[1]: list

Subsetting Tuples and Lists

Subsetting elements from a tuple or list is performed with square brackets in both cases, and selects elements using their integer position starting from zero—their “index”.

L[0]
Out[1]: 3.14

Negative indices are allowed, and refer to the reverse ordering: -1 is the last item in the list, -2 the second-to-last item, and so on.

L[-1]
Out[1]: ('cat',)

The syntax L[i:j] selects a sub-list starting with the element at index i and ending with the element at index j - 1.

L[0:2]
Out[1]: [3.14, 'xyz']

Omitting the index before or after the “:” indicates the start or end of the list, respectively. For example, the previous example could have been written L[:2].

A potentially useful trick to remember the list subsetting rules in Python is to picture the indices as “dividers” between list elements.

 0      1       2          3 
 | 3.14 | 'xyz' | ('cat',) |
-3     -2      -1

Positive indices are written at the top and negative indices at the bottom. L[i] returns the element to the right of i whereas L[i:j] returns elements between i and j.
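
The “dividers” picture can be tested against a few slices. This sketch rebuilds the same three-element list so that it runs on its own.

```python
L = [3.14, 'xyz', ('cat',)]

middle = L[1]         # element to the right of divider 1
tail = L[1:3]         # elements between dividers 1 and 3
last_two = L[-2:]     # from divider -2 to the end
all_but_last = L[:-1] # from the start up to divider -1
empty = L[0:0]        # nothing lies between divider 0 and itself

print(middle)         # xyz
print(tail)           # ['xyz', ('cat',)]
print(last_two)       # ['xyz', ('cat',)]
print(all_but_last)   # [3.14, 'xyz']
print(empty)          # []
```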

Set

The third and last built-in collection of this kind is a set, used mainly for quick access to set operations like “union” and “difference”. Strictly speaking a set is not a sequence, because its elements have no order. Declare a set with comma-separated values inside {} or by converting another collection with set().

S1 = set(L)
S2 = {3.14, 'z'}
S1.difference(S2)
Out[1]: {('cat',), 'xyz'}

Python is a rather principled language: a set is technically unordered, so its elements do not have an index. You cannot subset a set using [].
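
A short sketch of what does and does not work on a set. The set S below is made up for the example; membership tests and set operations are available, while indexing with [] raises a TypeError.

```python
S = {3.14, 'xyz', ('cat',)}   # hypothetical set for illustration

has_xyz = 'xyz' in S          # membership works without an index
bigger = S.union({'z'})       # set operations return new sets

try:
    S[0]                      # a set has no positions to index
    indexing_failed = False
except TypeError:
    indexing_failed = True

print(has_xyz, len(bigger), indexing_failed)   # True 4 True
```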

Dictionary

Lists are useful when you need to access elements by their position in a sequence. In contrast, a dictionary is needed to find values based on arbitrary identifiers.

Construct a dictionary with comma-separated key: value pairs in {}.

toons = {
  'Snowy': 'dog',
  'Garfield': 'cat',
  'Bugs': 'bunny',
}
type(toons)
Out[1]: dict

Individual values are accessed using square brackets, as for lists, but the key must be used rather than an index.

toons['Bugs']
Out[1]: 'bunny'

To add a single new element to the dictionary, define a new key:value pair by assigning a value to a novel key in the dictionary.

toons['Goofy'] = 'dog'
toons
Out[1]: {'Bugs': 'bunny', 'Garfield': 'cat', 'Goofy': 'dog', 'Snowy': 'dog'}

Dictionary keys are unique. Assigning a value to an existing key overwrites its previous value.
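
As a small illustration of overwriting, the sketch below uses a hypothetical dictionary named pets rather than toons, so the worksheet's data is left untouched.

```python
pets = {'Snowy': 'dog', 'Garfield': 'cat'}   # hypothetical dictionary

pets['Garfield'] = 'tabby'   # existing key: the value is replaced
pets['Odie'] = 'dog'         # novel key: a new pair is added

print(len(pets))             # 3
print(pets['Garfield'])      # tabby
```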

Exercise 2

Based on what we have learned so far about lists and dictionaries, think up a data structure suitable for an address book of names and emails. Now create it! Enter the name and email address for yourself and your neighbor in a new variable called addr.


Iteration

The data structures just discussed have multiple values. Subsetting is one way to get at them individually. Stepping through all values is called iterating.

Python formally declares a thing “iterable” if it can be used in an expression for x in y, where y is the iterable thing and x will label each element in turn.

Declarations with Iterables

Packing the for x in y expression inside a sequence declaration is one way to build a sequence.

letters = [x for x in 'abcde']
letters
Out[1]: ['a', 'b', 'c', 'd', 'e']

This way of declaring with for and in is called a “comprehension” in Python.

Dictionary Comprehension

To declare a dictionary in this way, specify a key:value pair.

CAPS = {x: x.upper() for x in 'abcde'}
CAPS
Out[1]: {'a': 'A', 'b': 'B', 'c': 'C', 'd': 'D', 'e': 'E'}
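
Comprehensions accept any iterable, not only strings. The sketch below uses made-up names (pets, names, lengths, doubled) and relies on the fact that iterating over a dictionary yields its keys.

```python
pets = {'Snowy': 'dog', 'Garfield': 'cat'}   # hypothetical dictionary

names = [k for k in pets]                # iterating a dict yields its keys
lengths = {k: len(k) for k in names}     # dict comprehension over a list
doubled = [2 * n for n in (1, 2, 3)]     # list comprehension over a tuple

print(names)     # ['Snowy', 'Garfield']
print(lengths)   # {'Snowy': 5, 'Garfield': 8}
print(doubled)   # [2, 4, 6]
```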


Flow control

The list and dictionary comprehensions embed a short form of the expression used to initiate a looping control statement.

For loops

A for loop takes any iterable object and executes a block of code once for each element in the iterable.

squares = []
for i in range(1, 5):
    j = i ** 2
    squares.append(j)
len(squares)
Out[1]: 4

The range(i, j) function creates a sequence of integers from i up through j - 1; just like in the case of list slices, the range is not inclusive of the upper bound.
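
To see the values a range produces, it can be converted with list(). In Python 3 the range object itself generates its integers on demand, so this sketch wraps it in list() only for display.

```python
r = range(1, 5)

values = list(r)          # materialize the integers for inspection
print(values)             # [1, 2, 3, 4]
print(len(r))             # 4
print(5 in r)             # False: the upper bound is excluded
print(list(range(3)))     # [0, 1, 2]: the start defaults to 0
```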

Indentation

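Note the pattern of the block above: the for statement ends with a colon, and the lines that make up its body are indented beneath it. The first unindented line, len(squares), is outside the loop.

Unlike programming languages in which code indentation only serves to enhance readability, Python uses indentation (and only indentation) to define “code blocks”, the bodies of compound statements such as for and if.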

Nesting indentation

Each level of indentation indicates blocks within blocks. Nesting a conditional within a for-loop is a common case.

The following example creates a contact list (as a list of dictionaries), then performs a loop over all contacts. Within the loop, a conditional statement (if) checks if the name is ‘Alice’. If so, the interpreter prints the phone number; otherwise it prints the name (else block).

contacts = [
    {'name':'Alice', 'phone':'555-111-2222'},
    {'name':'Bob', 'phone':'555-333-4444'},
    ]
for c in contacts:
    if c['name'] == 'Alice':
        print(c['phone'])
    else:
        print(c['name'])
555-111-2222
Bob

Exercise 3

Write a for loop that prints all even numbers between 1 and 9. Use the modulo operator (%) to check for evenness: if i is even, then i % 2 returns 0, because % gives the remainder after division of the first number by the second.


Function definition

We already saw examples of a few built-in functions, such as type() and len(). New functions are defined as a block of code starting with a def keyword and (optionally) finishing with a return.

def add_two(x):
    result = x + 2
    return result

The def keyword is followed by the function name, its arguments enclosed in parentheses (separated by commas if there is more than one), and a colon.

The return statement is needed to make the function provide output. The lack of a return, or return followed by nothing, causes the function to return the value None.
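
The effect of a missing return can be sketched with a deliberately incomplete variant of the function. The name add_two_silently is hypothetical.

```python
def add_two_silently(x):
    # The sum is computed but never returned.
    result = x + 2

out = add_two_silently(10)
print(out)           # None
print(out is None)   # True
```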

add_two(10)
Out[1]: 12

This function is invoked by name followed by any arguments in parentheses and in the order defined.

Default arguments

A default value can be “assigned” during function definition.

def add_any(x, y=0):
    result = x + y
    return result

Then the function can be called without that argument:

add_any(10)
Out[1]: 10

Adding an argument will override the default:

add_any(10, 5)
Out[1]: 15
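
Arguments can also be passed by name, in which case their order no longer matters. This sketch repeats the definition of add_any so that it runs on its own.

```python
def add_any(x, y=0):
    result = x + y
    return result

by_position = add_any(10, 5)       # matched in the order defined
by_name = add_any(10, y=5)         # y given by name
reordered = add_any(y=5, x=10)     # both by name, in any order

print(by_position, by_name, reordered)   # 15 15 15
```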

Methods

The period is a special character in Python that accesses an object’s attributes and methods. In either the Jupyter Notebook or Console, typing an object’s name followed by . and then pressing the TAB key brings up suggestions.

squares.index(4)
Out[1]: 1

We call this index() function a method of lists (recall that squares is of type 'list'). A useful feature of having methods attached to objects is that we can dial up help on a method as it applies to any instance of a type.

help(squares.index)
Help on built-in function index:

index(...) method of builtins.list instance
    L.index(value, [start, [stop]]) -> integer -- return first index of value.
    Raises ValueError if the value is not present.

A major difference between Python and R has to do with the process for making functions behave differently for different objects. In Python, a function is attached to an object as a “method”, while in R a “dispatcher” examines the attributes of a function call’s arguments and chooses the particular function to use.
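
One way to see methods attached to objects is to call a method of the same name on values of different types. In this sketch each type (list, str, tuple) supplies its own index() method; the variable names are made up.

```python
in_list = [1, 4, 9, 16].index(4)     # list.index
in_str = 'xyz'.index('z')            # str.index
in_tuple = ('x', 3, True).index(3)   # tuple.index

print(in_list, in_str, in_tuple)     # 1 2 1
```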

A dictionary method

The update() method allows you to extend a dictionary with another dictionary of key:value pairs, while simultaneously overwriting values for existing keys.

toons.update({
  'Tweety': 'bird',
  'Bob': 'sponge',
  'Bugs': 'rabbit',
})
Question
How many key: value pairs are there now in toons?
Answer
Six. 'Tweety' and 'Bob' are new keys, while 'Bugs' already existed, so its value was overwritten rather than a new pair being added.

Note a few “Pythonic” style choices in the above:

  1. Leave a space after the : when declaring key: value pairs
  2. A trailing comma after the last item is syntactically correct, even advantageous
  3. White space within () has no meaning and can improve readability
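
The way update() both extends and overwrites can be checked by counting keys before and after. The dictionary stock and its contents are hypothetical.

```python
stock = {'apples': 3, 'pears': 5}   # hypothetical dictionary

before = len(stock)
stock.update({
    'pears': 6,    # existing key: value overwritten
    'plums': 2,    # novel key: pair added
})
after = len(stock)

print(before, after)     # 2 3
print(stock['pears'])    # 6
```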


Pandas

If you have used the statistical programming language R, you are familiar with “data frames”, two-dimensional data structures where each column can hold a different type of data, as in a spreadsheet.

The data analysis library pandas provides a data frame object type for Python, along with functions to subset, filter, reshape, and aggregate data stored in data frames.

After importing pandas, we call its read_csv function to load the Portal animals data from the file animals.csv.

import pandas as pd
animals = pd.read_csv("data/animals.csv")
animals.dtypes
Out[1]: 
id                   int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

There are many ways to slice a DataFrame. To select a subset of rows and/or columns by name, use the loc attribute with [] for indexing.

animals.loc[:, ['plot_id', 'species_id']]
Out[1]: 
       plot_id species_id
0            3         NL
1            2         DM
2            7         DM
3            3         DM
4            1         PF
5            2         PE
6            1         DM
7            1         DM
8            6         PF
9            5         DS
10           7         DM
11           3         DM
12           8         DM
13           6         DM
14           4         DM
15           3         DS
16           2         PP
17           4         PF
18          11         DS
19          14         DM
20          15         NL
21          13         DM
22          13         SH
23           9         DM
24          15         DM
25          15         DM
26          11         DM
27          11         PP
28          10         DS
29          15         DM
...        ...        ...
35519        9         DM
35520        9         DM
35521        9         DM
35522        9         PB
35523        9         OL
35524        8         OT
35525       13         DO
35526       13         US
35527       13         PB
35528       13         OT
35529       13         PB
35530       14         DM
35531       14         DM
35532       14         DM
35533       14         DM
35534       14         DM
35535       14         DM
35536       15         PB
35537       15         SF
35538       15         PB
35539       15         PB
35540       15         PB
35541       15         PB
35542       15         US
35543       15         AH
35544       15         AH
35545       10         RM
35546        7         DO
35547        5        NaN
35548        2         NL

[35549 rows x 2 columns]

As with lists, : by itself indicates all the rows (or columns). Unlike list slicing, a loc slice includes both endpoints.

animals.loc[2:4, 'plot_id':'sex']
Out[1]: 
   plot_id species_id sex
2        7         DM   M
3        3         DM   M
4        1         PF   M

Use the iloc attribute of a DataFrame to get rows and/or columns by position, which behaves identically to list indexing.

animals.iloc[2:4, 4:6]
Out[1]: 
   plot_id species_id
2        7         DM
3        3         DM
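
The endpoint rules for loc and iloc can be compared side by side on a small table. The DataFrame below is made up for the example (it is not the Portal data) and assumes pandas is installed.

```python
import pandas as pd

# A tiny hypothetical table with the default integer row labels 0..4.
df = pd.DataFrame({
    'plot_id': [3, 2, 7, 3, 1],
    'species_id': ['NL', 'DM', 'DM', 'DM', 'PF'],
    'sex': ['M', 'M', 'F', 'M', 'M'],
})

by_label = df.loc[2:4, 'plot_id':'species_id']   # both endpoints included
by_position = df.iloc[2:4, 0:2]                  # upper bounds excluded

print(by_label.shape)      # (3, 2)
print(by_position.shape)   # (2, 2)
```

With the default row labels, the same 2:4 slice returns three rows through loc but two rows through iloc.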

The default indexing for a DataFrame, without using the loc or iloc attributes, is by column name.

animals[['hindfoot_length', 'weight']].describe()
Out[1]: 
       hindfoot_length        weight
count     31438.000000  32283.000000
mean         29.287932     42.672428
std           9.564759     36.631259
min           2.000000      4.000000
25%          21.000000     20.000000
50%          32.000000     37.000000
75%          36.000000     48.000000
max          70.000000    280.000000

The loc attribute also allows logical indexing, i.e. the use of a boolean array of appropriate length for the selected dimension. The subset of animals where the species is “DM” can be extracted with a logical test.

idx = animals['species_id'] == 'DM'
animals_dm = animals.loc[idx]
animals_dm.head()
Out[1]: 
   id  month  day  year  plot_id species_id sex  hindfoot_length  weight
1   3      7   16  1977        2         DM   F             37.0     NaN
2   4      7   16  1977        7         DM   M             36.0     NaN
3   5      7   16  1977        3         DM   M             35.0     NaN
6   8      7   16  1977        1         DM   M             37.0     NaN
7   9      7   16  1977        1         DM   F             34.0     NaN

Aggregation of records in a DataFrame by value of a given variable is performed with the groupby() method. The resulting “grouped” DataFrame can apply aggregations to each group, and combine the result into a DataFrame with one record for each group.

dm_stats = (
  animals_dm
  .groupby('sex')
  .agg({'hindfoot_length': ['mean', 'std']})
)
dm_stats
Out[1]: 
    hindfoot_length          
               mean       std
sex                          
F         35.712692  1.433067
M         36.188229  1.455396
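
The same groupby() and agg() pattern can be tried on a table small enough to check by hand. The data below are invented for the sketch and assume pandas is installed; the result has one row per group and a two-level column index.

```python
import pandas as pd

# Hypothetical measurements, two per sex.
df = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'hindfoot_length': [34.0, 36.0, 36.0, 38.0],
})

stats = df.groupby('sex').agg({'hindfoot_length': ['mean', 'std']})
print(stats)

# Columns are addressed by (variable, statistic) pairs.
mean_f = stats.loc['F', ('hindfoot_length', 'mean')]
mean_m = stats.loc['M', ('hindfoot_length', 'mean')]
print(mean_f, mean_m)   # 35.0 37.0
```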

Exercise 4

The count() method can be used in a pandas aggregation step to count non-NA values in a column. Find out which month had the most observations recorded in animals using groupby() and count(). If you are feeling adventurous, calculate the average weight in each month and rename() the columns to “n” and “mean_weight”.


Summary

Python is said to have a gentle learning curve, but all new languages take practice.

Key concepts covered in this lesson include:

Additional critical packages for data science:


Exercise solutions

Solution 1

answers = [2, 15, 42, 19]
42 in answers
Out[1]: True

Solution 2

addr = [
  {'name': 'Alice', 'email': 'alice@gmail.com'},
  {'name': 'Bob', 'email': 'bob59@aol.com'},
]

Solution 3

for i in range(1, 9):
  if i % 2 == 0:
    print(i)
2
4
6
8

Solution 4

(
  animals
  .groupby('month')
  .agg({'id': 'count', 'weight': 'mean'})
  .rename(columns={'id': 'n', 'weight': 'mean_weight'})
)
Out[1]: 
          n  mean_weight
month                   
1      2518    41.656815
2      2796    40.569822
3      3390    42.558372
4      3443    45.290231
5      3073    46.651155
6      2697    41.161753
7      3633    39.923124
8      2369    42.575729
9      2751    43.830675
10     3064    43.879402
11     3016    43.046996
12     2799    40.408385
