- 1 Fundamentals (Week 1)
- 2 Data manipulation with Pandas (Week 2)
- 2.1 (Optional) Review collections
- 2.2 A very brief introduction to NumPy
- 2.3 A very brief introduction to Pandas
- 2.4 (Optional) Where are we?
- 2.5 Reading tabular data into data frames
- 2.6 Data frames are objects that can tell you about their contents
- 2.7 Subsetting Data
- 2.8 Filtering (i.e. masking) data on contents
- 2.9 Working with missing data
- 2.10 Sorting and grouping
- 2.11 Write output
- 2.12 Working with multiple tables
- 2.13 (Optional) Text processing in Pandas
- 2.14 (Optional) Adding rows to DataFrames
- 2.15 (Optional) Scientific Computing Libraries
- 2.16 (Optional) Things we didn’t talk about
- 2.17 (Optional) Pandas method chaining in the wild
- 2.18 (Optional) Introspecting on the DataFrame object
- 2.19 (Carpentries version) Group By: split-apply-combine
- 3 Building Programs (Week 3)
- 4 Visualization with Matplotlib and Seaborn (Week 4)
- 5 Special Topics
- 6 Endnotes
Fundamentals (Week 1)
Orientation
What programming language should I use?
- Use the language that your friends use (so you can ask them for help)
- Use a language that has a community of practice for your desired use case (you can find documentation, bug reports, sample code, etc.)
- Use a language that is “best” by some technical definition
Python is pretty good at lots of things
- “Glue” language intended to replace shell and Perl
- Concise, readable, good for rapid prototyping
- Access to linear algebra libraries in FORTRAN/C → user-friendly numeric computing
- General purpose, not just an academic language; we will spend more time on some of the general purpose aspects.
Literate programming and notebooks
- Blend code, documentation, and visualization
- Good for trying things, demos
- Bad for massive or long-running processes
- You can export notebooks as .py files when they outgrow the notebook format
Jupyter commands
How to start Jupyter Lab
- Method 1
  - Open Anaconda Navigator
  - Run Jupyter Lab
- Method 2: Open Terminal (macOS/Linux) or Anaconda Prompt (Windows)
cd Desktop/data
jupyter lab
Navigation
- Navigate to where you want to be before creating new notebook
- Rename your notebook to something informative
- Use drag-and-drop interface to move .ipynb file to new location
Writing code
- Execute cell with CTRL-Enter
3 + 7
- Execute cell and move to new cell with Shift-Enter
# This is a comment
print("hello")
- Cells can be formatted as Code or Markdown
- Many keyboard shortcuts are available; see https://gist.github.com/discdiver/9e00618756d120a8c9fa344ac1c375ac
- (Optional) Jupyter Lab understands (some) terminal commands
ls
- (Optional) Jupyter Lab (IPython, actually) has “magic” commands that start with % (line) or %% (cell)
# List the variables currently in memory
%who
# Get environment variables
%env
# Cell magic: Run bash in a subprocess
%%bash
# Cell magic: Time cell execution
%%time
  - The magic command must be in the first line of the cell (no comments)
  - Some commands are not available on Windows (e.g. %%bash)
Variables and Assignment
Use variables to store values
Variables are names for values.
first_name = 'Derek'
age = 42
Rules for naming things
- Can only contain letters, digits, and underscore
- Cannot start with a digit
- Are case sensitive: age, Age and AGE are different variables
Use print() to display values
print(first_name, 'is', age, 'years old')
- Functions are verbs
- Functions end in ()
- Functions take arguments (i.e. they do stuff with the values that you give them)
- print() is useful for tracking progress and debugging
Jupyter Lab will always echo the last value in a cell
- Python will evaluate and echo the last item
first_name
age
- If you want to see multiple items, you should explicitly print them
print(first_name)
print(age)
(Optional) Variables must be created before they are used
# Prints an informative error message; more about this later
print(last_name)
Variables can be used in calculations
print(age)
age = age + 3
print(age)
Challenge: Variables only change value when something is assigned to them
Order of operations matters!
first = 1
second = 5 * first
first = 2
# What will this print?
print('first:', first)
print('second:', second)
Data Types and Type Conversion
Every value has a type
Most data is text and numbers, but there are many other types.
- Integers: whole numbers (counting)
- Floats: real numbers (math)
- Strings: text
- Files
- Various collections (lists, sets, dictionaries, data frames, arrays)
- More abstract stuff (e.g., database connection)
The type determines what operations you can perform with a given value
- Example 1: Subtraction makes sense for some kinds of data but not others
print(5 - 3)
print('hello' - 'h')
- Example 2: Some things have length and some don’t. Note that we can put functions inside other functions!
print(len('hello'))
print(len(5))
Use the built-in function type() to find the type of a value
- Variables point to values
print(type(53), type(age))
- There are many types
print(type(3.12), type("hello"), type(True), type([]))
(Optional) Python is strongly-typed.
- It will (mostly) refuse to convert things automatically. You can explicitly convert data to a different type.
- Can’t do math with text
1 + '2'
- If you have string data, you can explicitly convert it to numeric data…
print(1 + float('2'))
print(1 + int('2'))
- …and vice-versa
text = str(3)
print(text)
print(type(text))
- The exception is mathematical operations with integers and floats.
int_sum = 3 + 4
mixed_sum = 3 + 4.0
print(type(int_sum))
print(type(mixed_sum))
- What’s going on under the hood? int, float, and str are types. More precisely, they are classes. int(), float(), and str() are functions that create new instances of their respective classes. The argument to the creation function (e.g., '2') is the raw material for creating the new instance.
- This can work for more complex data types as well, e.g. Pandas data frames and NumPy arrays.
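As a small illustration of the point about classes and instances, the sketch below builds new values from existing ones and checks their types. The variable names are chosen only for this example.

```python
# int, float, and str are classes; calling one creates a new instance
raw_text = '2'                  # illustrative starting value

as_int = int(raw_text)          # new int built from the string
as_float = float(raw_text)      # new float built from the string
as_str = str(as_int)            # new str built from the int

print(type(int))                # the class itself is an object of type 'type'
print(isinstance(as_int, int))
print(isinstance(as_float, float))
print(as_str == raw_text)
```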
Challenge: Explain what each operator does
# Floor division (quotient rounded down)
print('5 // 3:', 5 // 3)
# True division (always returns a float)
print('5 / 3:', 5 / 3)
# Modulus (remainder)
print('5 % 3:', 5 % 3)
Built-in Functions and Help
A function may take zero or more arguments
print('before')
print()
print('after')
Functions can have optional arguments
# By default, we round to the nearest integer
round(3.712)
# You can optionally specify the number of digits after the decimal point
round(3.712, 1)
Use the built-in function help() to get help for a function
- View the documentation for round()
help(round)
  - 1 mandatory argument
  - 1 optional argument with a default value: ndigits=None
- You can provide arguments implicitly by order, or explicitly by name in any order
# You can optionally specify the number of digits after the decimal point
round(4.712823, ndigits=2)
- Getting more help
  - Python.org tutorial
  - Standard library reference (we will discuss libraries in the next section)
  - References section of this document
  - Stack Overflow
- Use comments to add documentation to your own programs
# This line isn't executed by Python
print("This cell has many comments") # The rest of this line isn't executed either
Every function returns something
- Collect the results of a function in a new variable. This is one of the ways we build complex programs.
# You can optionally specify the number of digits after the decimal point
rounded_num = round(4.712823, ndigits=2)
print(rounded_num)
result = len("hello")
print(result)
- (Optional) Some functions only have “side effects”; they return None
result = print("hello")
print(result)
# print(type(result))
(Optional) Functions will typically generalize in sensible ways
- max() and min() do the intuitively correct thing with numerical and text data
print(max(1, 2, 3))
print(min('a', 'A', '0')) # sort order is 0-9, A-Z, a-z
- Mixed numbers and text aren’t meaningfully comparable
max(1, 'a')
(Optional) Python produces informative error messages
- Python reports a syntax error when it can’t understand the source of a program
name = 'Bob
age = = 54
print("Hello world"
- Python reports a runtime error when something goes wrong while a program is executing
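The notes give no example of a runtime error, so here is a minimal sketch of one. The code is syntactically valid, but the division fails while the program is running. The try/except wrapper is used only so the cell keeps running and we can inspect the error; it is not covered in this section, and the variable names are illustrative.

```python
numerator = 6
denominator = 0

try:
    # Valid syntax, but dividing by zero fails at run time
    result = numerator / denominator
except ZeroDivisionError as err:
    # Python names the kind of error and explains what went wrong
    error_name = type(err).__name__
    print(error_name, '-', err)
```

Without the try/except, Python would stop at the division and print a traceback that points to the failing line.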
Beginner Challenge: What happens when?
Explain in simple terms the order of operations in the following program: when does the addition happen, when does the subtraction happen, when is each function called, etc. Extra credit: What is the final value of radiance?
radiance = 1.0
radiance = max(2.1, 2.0 + min(radiance, 1.1 * radiance - 0.5))
Libraries
Most of the power of a programming language is in its libraries
https://docs.python.org/3/library/index.html
A program must import a library module before using it
import math
print(math.pi)
print(math.cos(math.pi))
- Refer to things from the module as module-name.thing-name
- Python uses “.” to mean “part of” or “belongs to”.
Use help() to learn about the contents of a library module
help(math) # user friendly
dir(math) # brief reminder, not user friendly
(Optional) Import shortcuts
- Import specific items from a library module. You want to be careful with this. It’s safer to keep the namespace.
from math import cos, pi
cos(pi)
- Create an alias for a library module when importing it
import math as m
print(m.cos(m.pi))
Python has opinions about how to write your programs
import this
Lists
Lists are the central data structure in Python; we will explain many things by making analogies to lists.
A list stores many values in a single structure
fruits = ["apple", "banana", "cherry", "date", "elderberry", "fig"]
print(fruits)
print(len(fruits))
Lists are indexed by position, counting from 0
# First item
print(fruits[0])
# Fifth item
print(fruits[4])
You can get a subset of the list by slicing it
- You slice a list from the start position up to, but not including, the stop position
print(fruits[0:3])
print(fruits[2:5])
- You can omit the start position if you’re starting at the beginning…
# Two ways to get the first 5 items
print(fruits[0:5])
print(fruits[:5])
- …and you must omit the end position if you’re going to the end (otherwise it’s up to, but not including, the end!). This is useful if you don’t know how long the list is:
# Everything but the first 3 items
print(fruits[3:])
- You can add an optional step interval (every 2nd item, every 3rd item, etc.)
# First 5 items, every other item
print(fruits[0:5:2])
# Every third item
print(fruits[::3])
(Optional) Why are lists indexed from 0?
cf. https://stackoverflow.com/a/11364711
- Slice endpoints are complements. In both cases, the number you see represents what you want to do.
# Get the first two items
print(fruits[:2])
# Get everything except the first two items
print(fruits[2:])
- For non-negative indices, the length of a slice is the difference of the indices
len(fruits[1:3]) == 2
Challenge: Some other properties of indexes
Try these statements. What are they doing? Can you explain the differences in their behavior?
fruits[-1]
fruits[20]
fruits[-3:]
Solution
- You can count backwards from the end with negative integers
- Indexing beyond the end of the collection is an error
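The two points in the solution can be checked with a short sketch. It redefines the fruits list so that it runs on its own, and it uses try/except only to capture the error instead of stopping the cell. The last line adds one related behavior that the solution does not mention: slices that run past the end are shortened rather than raising an error.

```python
fruits = ["apple", "banana", "cherry", "date", "elderberry", "fig"]

last_fruit = fruits[-1]        # count backwards from the end
last_three = fruits[-3:]       # from the third-to-last item through the end

try:
    fruits[20]                 # a single index past the end is an error
    out_of_range = False
except IndexError:
    out_of_range = True

beyond_end = fruits[4:20]      # a slice past the end is shortened instead

print(last_fruit, last_three, out_of_range, beyond_end)
```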
Lists are mutable
- You can replace a value at a specific index location
fruits[0] = "apricot"
print(fruits)
- Add an item to a list with append(). This is a method of the list (more on this later!).
fruits.append("grape")
print(fruits)
- Add the items from one list to another with extend()
more_fruits = ["honeydew", "imbe", "jackfruit"]
# Add all of the elements of more_fruits to fruits
fruits.extend(more_fruits)
print(fruits)
Many functions take collections as arguments
# Assessing the overall productivity of our wide receivers
receiving_yards = [450, 370, 870, 150]
mean_yards = sum(receiving_yards)/len(receiving_yards)
print(mean_yards)
(Optional) Removing items from a list
- Use del to remove an item at an index location
print(more_fruits)
del more_fruits[1]
print(more_fruits)
- Use pop() to remove the last item and assign it to a variable. This is useful for destructive iteration.
f = fruits.pop()
print('Last fruit in list:', f)
print(fruits)
Lists can contain anything
- You can put anything in a list
ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]
- (Optional) You could use this to manage complex data, but you shouldn’t
# Get first pair
print(ages[0:2])
# Get all the names
print(ages[::2])
# Get all the ages
print(ages[1::2])
- You can put lists inside other lists
ages.append(more_fruits)
# List in our list
print(ages)
# The last item is a list
print(ages[-1])
# Get an item from that list
print(ages[-1][0])
(Optional) Challenge: Reversing a list
Create a new list that contains all of the items from fruits in the reverse order.
Solution
# Start at the last index and step backwards; fruits[::-1] is an equivalent, shorter form
rev_fruits = fruits[len(fruits)-1::-1]
print(rev_fruits)
For Loops
Usually you don’t need to find list items by index. What you actually want to do is go through each item in the list and use it for something.
A for loop executes commands once for each value in a collection
“For each thing in this group, do these operations”
for fruit in fruits:
    print(fruit)
- A for loop is made up of a collection, a loop variable, and a body
- The collection, fruits, is what the loop is being run on.
- The loop variable, fruit, is what changes for each iteration of the loop (i.e. the “current thing”)
- The body, print(fruit), specifies what to do for each value in the collection.
Whitespace is syntactically meaningful in Python!
for fruit in fruits:
    print(fruit)
Loop variables can be called anything
for bob in fruits:
    print(bob)
The body of a loop can contain many statements
primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)
Create a new collection from an existing collection
We will learn how to vectorize this when we get to NumPy and Pandas
prime_exponents = []
for p in primes:
    prime_exponents.append(p**2)
print(prime_exponents)
Challenge: Accumulation
Get the total length of all the words in the fruits list.
Solution 1
total = 0
for f in fruits:
    total = total + len(f)
print(total)
Solution 2
lengths = []
for f in fruits:
    lengths.append(len(f))
print(sum(lengths))
Solution 3
sum(len(f) for f in fruits)
(Optional) Helpful tools for iteration
- Use range() to iterate over a sequence of numbers
for number in range(0, 3):
    print(number)
  - range() produces numbers on demand (a lazy sequence, similar in spirit to a “generator”)
  - useful for tracking progress
- Use enumerate() to iterate over a sequence of items and their positions
for number, fruit in enumerate(fruits):
    print(number, ":", fruit)
- Use functional programming idioms
  - Comprehensions: generator, list, dictionary
  - itertools library
- Test to see if an object is iterable
# Lists, dictionaries, and strings are iterable
hasattr(fruits, "__iter__")
# Integers are not iterable
hasattr(5, "__iter__")
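The list above mentions comprehensions and the itertools library without showing them. The following is a minimal sketch of each, reusing the primes list from earlier; the other variable names are illustrative.

```python
from itertools import chain

primes = [2, 3, 5]

squares_list = [p ** 2 for p in primes]        # list comprehension
squares_dict = {p: p ** 2 for p in primes}     # dictionary comprehension
squares_total = sum(p ** 2 for p in primes)    # generator expression

# itertools.chain iterates over several collections as if they were one
combined = list(chain(primes, squares_list))

print(squares_list, squares_dict, squares_total, combined)
```

Each comprehension replaces the "create an empty collection, loop, append" pattern shown in the for loop examples with a single expression.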
Strings and methods
Strings are (kind of) like lists
- Strings are indexed like lists
# Use an index to get a single character from a string
fruit = "gooseberry"
print(fruit[0])
print(fruit[0:3])
- Strings have length
len(fruit)
But! Strings are immutable
- Can’t change a string in place
fruit[0] = 'G'
- Solution: String methods create a new string
fruit_title = fruit.capitalize()
print(fruit_title)
Use the built-in string methods to clean up data
bad_str1 = " Hello world! "
bad_str2 = "|...goodbye cruel world|"
good_str1 = bad_str1.strip()
good_str2 = bad_str2.strip("|")
print(good_str1, "\n", good_str2)
(Optional) Methods are functions that belong to objects
- An object packages data together with functions that operate on that data. This is a very common organizational strategy in Python.
sentence = "Hello world!"
# Call the swapcase method on the sentence object
print(sentence.swapcase())
- You can chain methods into processing pipelines
print(sentence.isupper()) # Check whether all letters are uppercase
print(sentence.upper()) # Capitalize all the letters
# The output of upper() is a string; you can use more string methods on it
sentence.upper().isupper()
- You can view an object’s attributes (i.e. methods and fields) using help() or dir(). Some attributes are “private”; you’re not supposed to use these directly.
# More verbose help
help(str)
# The short, short version
dir(sentence)
(Optional) Challenge: Putting it all together
You want to iterate through the fruits list in a random order. For each randomly-selected fruit, capitalize the fruit and print it.
- Which standard library module could help you? https://docs.python.org/3/library/
- Which function would you select from that module? Are there alternatives?
- Try to write a program that uses the function.
Solution 1 (shuffle)
import random
random.shuffle(fruits)
for f in fruits:
    print(f.title())
Solution 2 (sample)
import random
random_fruits = random.sample(fruits, len(fruits))
for f in random_fruits:
    print(f.title())
(Optional) Beginner Challenge: From Strings to Lists and Back
- Given this Python code…
print('string to list:', list('tin'))
print('list to string:', ''.join(['g', 'o', 'l', 'd']))
- What does list('some string') do?
- What does '-'.join(['x', 'y', 'z']) generate?
Dictionaries
Dictionaries are sets of key/value pairs. Instead of being indexed by position, they are indexed by key.
ages = {'Derek': 42,
'Bill': 24,
'Susan': 37}
ages["Derek"]
Update dictionaries by assigning a key/value pair
- Update a pre-existing key with a new value
ages["Derek"] = 44
print(ages)
- Add a new key/value pair
ages["Beth"] = 19
print(ages)
Check whether the dictionary contains an item
- Does a key already exist?
"Derek" in ages
- (Optional) Does a value already exist (you generally don’t want to do this; keys are unique but values are not)?
24 in ages.values()
(Optional) Delete an item using del or pop()
print("Original dictionary", ages)
del ages["Derek"]
print("1st deletion", ages)
susan_age = ages.pop("Susan")
print("2nd deletion", ages)
print("Returned value", susan_age)
Dictionaries are the natural way to store tree-structured data
As with lists, you can put anything in a dictionary.
location = {'latitude': [37.28306, 'N'],
'longitude': [-120.50778, 'W']}
print(location['longitude'][0])
Dictionary iteration
- Iterate over key: value pairs
for key, val in ages.items():
    print(key, ":", val)
- (Optional) You can iterate over keys and values separately
# Iterate over keys; you can also explicitly call .keys()
for key in ages:
    print(key)
# Iterate over values
for val in ages.values():
    print(val)
- (Optional) Iteration can be useful for unpacking complex dictionaries
for key, val in location.items():
    print(key, 'is', val[0], val[1])
Challenge: Generate a dictionary
- You have the following key/value pairs:
'Derek' 42
'Bill' 24
'Susan' 37
- Create a dictionary that contains all of them. You may find the following useful:
help(dict)
help(zip)
Solution 1 of many
names = ["Derek", "Bill", "Susan"]
ages = [42, 24, 37]
ages_dict = dict(zip(names, ages))
(Optional) Advanced Challenge: Convert a list to a dictionary
How can you convert our list of names and ages into a dictionary? Hint: You will need to populate the dictionary with a list of keys and a list of values.
# Starting data
ages = ['Derek', 42, 'Bill', 24, 'Susan', 37]
# Get dictionary help
help({})
Solution
ages_dict = dict(zip(ages[::2], ages[1::2]))
(Optional) Other containers
- Tuples
- Sets
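Tuples and sets are only named here, so the following is a brief sketch of what each one offers. The values are made up for illustration.

```python
# A tuple is an ordered, immutable sequence
point = (37.28306, -120.50778)      # illustrative latitude/longitude pair
latitude, longitude = point         # tuples unpack into variables

# A set is an unordered collection of unique items
letters = set("banana")             # duplicate letters are dropped
vowels = {"a", "e", "i", "o", "u"}

print(latitude, longitude)
print(sorted(letters))
print(sorted(letters & vowels))     # intersection: items found in both sets
```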
Data manipulation with Pandas (Week 2)
(Optional) Review collections
Lists and dictionaries
- Reference item by index/key
- Insert item by index/key
- Indices/keys must be unique
Strings
- Similar to lists: Reference item by index, have length
- Immutable, so need to use string methods
- '/'.join() is a very useful method
A very brief introduction to NumPy
Introductory documentation: https://numpy.org/doc/stable/user/quickstart.html
- NumPy is the linear algebra library for Python
import numpy as np
# Create an array of random numbers
m_rand = np.random.rand(3, 4)
print(m_rand)
- Arrays are indexed like lists
print(m_rand[0,0])
- Arrays have attributes
print(m_rand.shape)
print(m_rand.size)
print(m_rand.ndim)
- Arrays are fast but inflexible - the entire array must be of a single type.
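To see the single-type rule in action, the sketch below builds arrays from mixed inputs and inspects the resulting dtype. NumPy converts the values to one common type rather than raising an error. The variable names are illustrative, and the exact integer width depends on your platform.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # the integers are converted to floats
with_text = np.array([1, "two"])     # everything is converted to strings

print(ints.dtype, mixed.dtype, with_text.dtype)
```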
Linear algebra with NumPy
Don’t use for loops with DataFrames or NumPy matrices. There is almost always a faster vectorized function that does what you want.
x = np.arange(9)
y = np.arange(9)
print(x)
print(y)
- Operations are element-wise by default
print(x * y)
- Matrix-wise operations (e.g. dot product) use NumPy functions
# Use a special operator if it exists
print(x @ y)
# Otherwise, use a numpy function
print(np.dot(x, y))
- You can rearrange the same array into different configurations
# reshape() returns a new arrangement of the same values
x1 = x.reshape(3,3)
x2 = x.reshape(9,1)
print(x1)
print(x2)
- (Optional) Matlab gotcha: 1-D arrays have no transpose
print(x)
print(x.T)
print(x.reshape(-1,1))
Challenge: Matrix operations
- Create a 3x3 matrix containing the numbers 0-8. Hint: Consult the NumPy Quickstart documentation here: https://numpy.org/doc/stable/user/quickstart.html
- Multiply the matrix by itself (element-wise).
- Multiply the matrix by its transpose.
- Divide the matrix by itself. What happens?
Solutions
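# Use method chaining to link actions together
x = np.arange(9).reshape(3,3)
# Element-wise product of the matrix with itself
print(x * x)
# Element-wise product with the transpose; use x @ x.T for the matrix product
print(x * x.T)
# 0/0 in the first element produces nan and a RuntimeWarning
print(x / x)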
A very brief introduction to Pandas
- Pandas is a library for working with spreadsheet-like data (“DataFrames”)
- A DataFrame is a collection (dict) of Series columns
- Each Series is a 1-dimensional NumPy array with optional row labels (dict-like, similar to R vectors)
- Therefore, each series inherits many of the abilities (linear algebra) and limitations (single data type) of NumPy
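The description above can be made concrete with a tiny hand-built data frame. This is a sketch with made-up values, not the workshop data files.

```python
import pandas as pd

# Made-up values for illustration
gdp = pd.DataFrame(
    {"1952": [1.0, 2.0], "1957": [3.0, 4.0]},
    index=["Australia", "New Zealand"],
)

column = gdp["1952"]             # look up a column by key, as with a dict
print(type(column))              # each column is a Series
print(column.to_numpy())         # the values as a NumPy array
print(column["Australia"])       # row labels work like dictionary keys
print(gdp.dtypes)                # each column has a single data type
```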
(Optional) Where are we?
Python provides functions for working with the file system.
import os
# print current directory
print("Current working directory:", os.getcwd())
# print all of the files and directories
print("Working directory contents:", os.listdir())
These provide a rich Python alternative to shell functions
# Get 1 level of subdirectories
print("Just print the sub-directories:", sorted(next(os.walk('.'))[1]))
# Move down one directory
os.chdir("data")
print(os.getcwd())
# Move up one directory
os.chdir("..")
print(os.getcwd())
Reading tabular data into data frames
Import tabular data using the Pandas library
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
# Jupyter Lab will give you nice formatting if you echo
data
- File and directory names are strings
- You can use relative or absolute file paths
Use index_col to use a column’s values as row indices
Rows are indexed by number by default (0, 1, 2, …). For convenience, we want to index by country:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
- By default, rows are indexed by position, like lists.
- Setting the index_col parameter lets us index rows by label, like dictionaries. For this to work, the index column needs to have unique values for every row.
- You can verify the contents of the CSV by double-clicking on the file in Jupyter Lab
Pandas help files are dense; you should prefer the online documentation
- Main documentation link: https://pandas.pydata.org/docs/user_guide/index.html
- Pandas can read many different data formats: https://pandas.pydata.org/docs/user_guide/io.html
Data frames are objects that can tell you about their contents
Data frames have methods (i.e. functions) that perform operations using the data frame’s contents as input
- Use .info() to find out more about a data frame
data.info()
- Use .describe() to get summary statistics about data
data.describe()
- (Optional) Look at the first few rows
data.head(1)
Data frames have fields (i.e. variables) that hold additional information
A “field” is a variable that belongs to an object.
- The .index field stores the row Index (list of row labels)
print(data.index)
- The .columns field stores the column Index (list of column labels)
print(data.columns)
- The .shape field stores the matrix shape
print(data.shape)
- Use DataFrame.T to transpose a DataFrame. This doesn’t copy or modify the data, it just changes the caller’s view of it.
print(data.T)
print(data.T.shape)
(Optional) Pandas introduces some new types
# DataFrame type
type(data)
type(data.T)
# Series type
type(data['gdpPercap_1952'])
# Index type
type(data.columns)
- You can convert data between NumPy arrays, Series, and DataFrames
- You can read data into any of the data structures from files or from standard Python containers
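A minimal sketch of the conversions mentioned in the two points above, using small made-up values. All names here are illustrative.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)                   # NumPy array

frame = pd.DataFrame(values, columns=["a", "b"])      # array -> DataFrame
series = pd.Series([10, 20, 30], name="c")            # list -> Series
from_dict = pd.DataFrame({"a": [0, 2, 4], "b": [1, 3, 5]})  # dict -> DataFrame

back_to_array = frame.to_numpy()                      # DataFrame -> array
series_as_frame = series.to_frame()                   # Series -> DataFrame

print(frame)
print(back_to_array.shape, series_as_frame.shape)
```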
(Optional) Beginner Challenge
- Read the data in gapminder_gdp_americas.csv into a variable called americas and display its summary statistics.
- After reading the data for the Americas, use help(americas.head) and help(americas.tail) to find out what DataFrame.head and DataFrame.tail do.
  - How can you display the first three rows of this data?
  - How can you display the last three columns of this data? (Hint: You may need to change your view of the data).
- As well as the read_csv function for reading data from a file, Pandas provides a to_csv function to write DataFrames to files. Applying what you’ve learned about reading from files, write one of your DataFrames to a file called processed.csv. You can use help to get information on how to use to_csv.
Solution
americas = pd.read_csv('data/gapminder_gdp_americas.csv', index_col='country')
americas.describe()
americas.head(3)
americas.T.tail(3)
americas.to_csv('processed.csv')
Subsetting Data
Treat the data frame as a matrix and select values by position
Use DataFrame.iloc[..., ...] to select values by their (entry) position. The i in iloc stands for “integer”.
#import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.iloc[0,0]
Treat the data frame as a table and select values by label
This is the most common way to get data
- Use DataFrame.loc[..., ...] to select values by their label
# This returns a value
data.loc["Albania", "gdpPercap_1952"]
Shorten the column names using vectorized string methods
- Standard Python has string methods
big_hello = "hello".title()
print(big_hello)
help("hello".title)
print(dir("hello"))
- Pandas data frames are complex objects
print(data.columns)
print(dir(data.columns.str))
- Use built-in methods to transform the entire data frame
# The columns index can update all of its values in a single operation
data.columns = data.columns.str.strip("gdpPercap_")
print(data.columns)
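One caveat about strip() is worth seeing on plain strings: its argument is treated as a set of characters to remove from both ends, not as a prefix. It works on these column names because the years contain only digits. The sketch below uses a made-up label to show where it can go wrong, and removeprefix() (Python 3.9+) as a stricter alternative. Recent pandas versions appear to offer a matching .str.removeprefix(); treat that as an assumption to check against your installed version.

```python
label = "gdpPercap_1952"

# strip() removes any of the listed characters from both ends
print(label.strip("gdpPercap_"))

# The same call can remove more than intended
other = "pear_crop_gap"                  # made-up label for illustration
print(repr(other.strip("gdpPercap_")))

# removeprefix() removes only the exact leading text, if present
print(label.removeprefix("gdpPercap_"))
print(other.removeprefix("gdpPercap_"))
```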
Use list slicing notation to get subsets of the data frame
- Select multiple columns or rows using .loc and a named slice. This generalizes the concept of a slice to include labeled indexes.
# This returns a DataFrame
data.loc['Italy':'Poland', '1962':'1972']
- Use : on its own to mean all columns or all rows. This is Python’s usual slicing notation, which allows you to treat data frames as multi-dimensional lists.
# This returns a DataFrame
data.loc['Italy':'Poland', :]
- (Optional) If you want specific rows or columns, pass in a list
data.loc[['Italy','Poland'], :]
- (Optional) .iloc follows list index conventions (“up to, but not including”), but .loc does the intuitive right thing (“A through B”)
index_subset = data.iloc[0:2, 0:2]
label_subset = data.loc["Albania":"Belgium", "1952":"1962"]
print(index_subset)
print(label_subset)
- Result of slicing can be used in further operations
subset = data.loc['Italy':'Poland', '1962':'1972']
print(subset.describe())
print(subset.max())
- (Optional) Insert new values using .at (for label indexing) or .iat (for numerical indexing)
subset.at["Italy", "1962"] = 2000
print(subset)
Challenge: Collection types
- Calculate subset.max() and assign the result to a variable. What kind of thing is it? What are its properties?
- What is the maximum value of the new variable? Can you determine this without creating an intermediate variable?
Solution
- Pandas always drills down to the most parsimonious representation. On one hand, this is convenient; on the other, it violates the Pythonic expectation for strong types.
Shape of data selection    Pandas return type
2D                         DataFrame
1D                         Series
0D                         single value
- Use method chaining
print(subset.max().max())
# Alternatively
print(subset.max(axis=None))
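The relationship between selection shape and return type can be checked on a small hand-built data frame. This is a sketch with made-up values; the names small, two_d, one_d, and zero_d are illustrative.

```python
import pandas as pd

# Made-up values for illustration
small = pd.DataFrame(
    {"1962": [1.0, 2.0, 3.0], "1967": [4.0, 5.0, 6.0]},
    index=["Italy", "Malta", "Poland"],
)

two_d = small.loc["Italy":"Poland", "1962":"1967"]   # 2D selection
one_d = small.loc["Italy", "1962":"1967"]            # 1D selection
zero_d = small.loc["Italy", "1962"]                  # single value

print(type(two_d).__name__, type(one_d).__name__, type(zero_d).__name__)
print(small.max().max())                             # chained: Series, then value
```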
(Optional) Filter on label properties
- .filter() always returns the same type as the original item, whereas .loc and .iloc might return a data frame or a series.
italy = data.filter(items=["Italy"], axis="index")
print(italy)
print(type(italy))
- .filter() is a general-purpose, flexible method
help(data.filter)
data.filter(like="200", axis="columns")
data.filter(like="200", axis="columns").filter(items=["Italy"], axis="index")
Filtering (i.e. masking) data on contents
Use comparisons to select data based on value
- Show which data frame elements match a criterion.
# Which GDPs are greater than 10,000?
subset > 10000
- Use the criterion match to filter the data frame’s contents. This uses index notation:
df = subset[subset > 10000]
print(df)
  - subset > 10000 returns a data frame of True/False values
  - subset[subset > 10000] filters its contents based on that True/False data frame. All True values are returned, element-wise.
  - This section is more properly called “Masking Data,” because it involves operations for overlaying a data frame’s values without changing the data frame’s shape. We don’t drop anything from the data frame, we just replace it with NaN.
- (Optional) Use .where() method to find elements that match the criterion:
df = subset.where(subset > 10000)
print(df)
You can filter using any method that returns a data frame
For example, get the GDP for all countries greater than the median.
# Get the overall median
subset.median() # Returns Series
subset.median(axis=None) # Returns single value
# Which data points are above the median
subset > subset.median(axis=None)
# Return the masked data set
subset[subset > subset.median(axis=None)]
Use method chaining to create final output without creating intermediate variables
# The .rank() method turns numerical scores into ranks
data.rank()
# Get mean rank over time and sort the output
mean_rank = data.rank().mean(axis=1).sort_values()
print(mean_rank)
Working with missing data
By default, most numerical operations ignore missing data
Examples include min, max, mean, std, etc.
- Missing values ignored by default
print("Column means")
print(df.mean())
print("Row means")
print(df.mean(axis=1))
- Force inclusions with the skipna argument
print("Column means")
print(df.mean(skipna=False))
print("Row means")
print(df.mean(axis=1, skipna=False))
Check for missing values
- Show which items are missing. “NA” includes NaN and None. It doesn’t include empty strings or numpy.inf.
# Show which items are NA
df.isna()
- Count missing values
# Missing by column
print(df.isna().sum())
# Missing by row
print(df.isna().sum(axis=1))
# Aggregate sum
df.isna().sum().sum()
- Are any values missing?
df.isna().any(axis=None)
- (Optional) Are all of the values missing?
df.isna().all(axis=None)
Replace missing values
- Replace with a fixed value
df_fixed = df.fillna(99)
print(df_fixed)
- Replace values that don’t meet a criterion with an alternate value
subset_fixed = subset.where(subset > 10000, 99)
print(subset_fixed)
- (Optional) Impute missing values. Read the docs, this may or may not be sufficient for your needs.
df_imputed = df.interpolate()
Drop missing values
Drop all rows with missing values
df_drop = df.dropna()
Hard Challenge: The perils of missing data
- Create an array of random numbers matching the data data frame
random_filter = np.random.rand(30, 12) * data.max(axis=None)
- Create a new data frame that filters out all numbers lower than the random numbers
- Interpolate new values for the missing values in the new data frame. How accurate do you think they are?
Solution
new_data = data[data > random_filter]
# Data is not missing randomly
print(new_data)
new_data.interpolate()
new_data.interpolate().mean(axis=None)
(Optional) Challenge: Filter and trim with a boolean vector
A DataFrame is a dictionary of Series columns. With this in mind, experiment with the following code and try to explain what each line is doing. What operation is it performing, and what is being returned?
Feel free to use print(), help(), type(), etc. as you investigate.
df["1962"]
df["1962"].notna()
df[df["1962"].notna()]
Solution
- Line 1 returns the column as a Series vector
- Line 2 returns a boolean Series vector (True/False)
- Line 3 performs boolean indexing on the DataFrame using the Series vector. It only returns the rows that are True (i.e. it performs true filtering).
Sorting and grouping
Motivating example: Calculate the wealth Z-score for each country
# Calculate z scores for all elements
# z = (data - data.mean(axis=None))/data.std()
# As of July 2024, pandas dataframe.std(axis=None) doesn't work. We are dropping down to
# Numpy to use the .std() method on the underlying values array.
z = (data - data.mean(axis=None))/data.values.std(ddof=1)
# Get the mean z score for each country (i.e. across all columns)
mean_z = z.mean(axis=1)
# Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
z_bool = mean_z > 0
print(mean_z)
print(z_bool)
Append new columns to the data frame containing our summary statistics
Data frames are dictionaries of Series:
data["mean_z"] = mean_z
data["wealthy"] = z_bool
Sort and group by new columns
data.sort_values(by="mean_z")
# Get descriptive statistics for the group
data.groupby("wealthy").mean()
data.groupby("wealthy").describe()
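The whole sequence (Z scores, summary columns, sorting, grouping) can be tried on a table small enough to check by hand. This is a sketch under stated assumptions: the countries A to D and their values are invented, and the NumPy fallback for the standard deviation mirrors the workaround described above.

```python
import pandas as pd

# Hypothetical GDP values for four invented countries
toy = pd.DataFrame(
    {"1962": [100.0, 200.0, 300.0, 400.0], "1967": [200.0, 300.0, 400.0, 500.0]},
    index=["A", "B", "C", "D"],
)

# Z score of every cell relative to the whole table
values = toy.to_numpy()
z = (toy - values.mean()) / values.std(ddof=1)

# Summarize by row, then add the summaries as new columns
toy["mean_z"] = z.mean(axis=1)
toy["wealthy"] = toy["mean_z"] > 0

# Sort by the new column and average the original columns within each group
ranked = toy.sort_values(by="mean_z", ascending=False)
group_means = toy.groupby("wealthy")[["1962", "1967"]].mean()

print(ranked)
print(group_means)
```

Selecting the year columns before calling `.mean()` keeps the summary columns out of the group averages.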
Write output
Capture the results of your filter in a new file, rather than overwriting your original data.
# Save to a new CSV, preserving your original data
data.to_csv('gapminder_gdp_europe_normed.csv')
# If you don't want to preserve row names:
#data.to_csv('gapminder_gdp_europe_normed.csv', index=False)
Working with multiple tables
Concatenating data frames
surveys = pd.read_csv('data/surveys.csv', index_col="record_id")
print(surveys.shape)
df1 = surveys.head(10)
df2 = surveys.tail(10)
df3 = pd.concat([df1, df2])
print(df3.shape)
(Optional) Joining data frames (in an SQL-like manner)
-
Import species data
species = pd.read_csv('data/species.csv', index_col="species_id")
print(species.shape)
-
Join tables on common column. The “left” join is a strategy for augmenting the first table (surveys) with information from the second table (species).
df_join = surveys.merge(species, on="species_id", how="left")
print(df_join.head())
print(df_join.shape)
-
The resulting table loses its index because surveys.record_id is not being used in the join. To keep record_id as the index for the final table, we need to retain it as an explicit column.
# Don't set record_id as index during initial import
surveys = pd.read_csv('data/surveys.csv')
df_join = surveys.merge(species, on="species_id", how="left").set_index("record_id")
df_join.index
-
Get the subset of species that match a criterion, and join on that subset. The “inner” join only includes rows where both tables match on the key column; it’s a strategy for filtering the first table by the second table.
# Get the rows of the species table whose taxa value is "Bird"
birds = species[species["taxa"] == "Bird"]
# join() defaults to a left join; how="inner" keeps only the survey rows
# whose species_id appears in birds
df_birds = surveys.join(birds, on="species_id", how="inner").set_index("record_id")
print(df_birds.head())
print(df_birds.shape)
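The difference between the two join strategies is easiest to see on miniature tables. The following is an illustrative sketch; `surveys_toy`, `species_toy`, and their contents are invented stand-ins for the real files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
surveys_toy = pd.DataFrame(
    {"record_id": [1, 2, 3, 4], "species_id": ["NL", "DM", "XX", "DM"]}
)
species_toy = pd.DataFrame(
    {"species_id": ["NL", "DM", "AB"], "taxa": ["Rodent", "Rodent", "Bird"]}
)

# Left join: every survey row is kept; keys with no match get NaN
left = surveys_toy.merge(species_toy, on="species_id", how="left")

# Inner join: only survey rows whose key appears in both tables
inner = surveys_toy.merge(species_toy, on="species_id", how="inner")

print(left)
print(inner)
```

The survey with the unknown species code "XX" survives the left join with a missing taxa value, but disappears from the inner join.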
(Optional) Text processing in Pandas
cf. https://pandas.pydata.org/docs/user_guide/text.html
-
Import tabular data that contains strings
species = pd.read_csv('data/species.csv', index_col='species_id')
# You can explicitly set all of the columns to type string
# species = pd.read_csv('data/species.csv', index_col='species_id', dtype='string')
# ...or specify the type of individual columns
# species = pd.read_csv('data/species.csv', index_col='species_id',
#                       dtype={"genus": "string",
#                              "species": "string",
#                              "taxa": "string"})
print(species.head())
# info() prints its own report and returns None, so don't wrap it in print()
species.info()
print(species.describe())
-
A Pandas Series has string methods that operate on the entire Series at once
# Two ways of getting an individual column
print(type(species.genus))
print(type(species["genus"]))
# Inspect the available string methods
print(dir(species["genus"].str))
-
Use string methods for filtering
# Which species are in the taxa "Bird"?
print(species["taxa"].str.startswith("Bird"))
# Filter the dataset to only look at Birds
print(species[species["taxa"].str.startswith("Bird")])
-
Use string methods to transform and combine data
binomial_name = species["genus"].str.cat(species["species"].str.title(), " ")
species["binomial"] = binomial_name
print(species.head())
(Optional) Adding rows to DataFrames
A row is a view onto the nth item of each of the column Series. Appending rows is a performance bottleneck because it requires a separate append operation for each Series. You should concatenate data frames instead.
-
Create a single row as a data frame and concatenate it.
row = pd.DataFrame({"1962": 5000, "1967": 5000, "1972": 5000}, index=["Latveria"])
pd.concat([subset, row])
-
If you have individual rows as Series, pd.concat() with axis=1 will produce a data frame.
# Get each row as a Series
italy = data.loc["Italy", :]
poland = data.loc["Poland", :]
# Omitting the axis argument (or axis=0) concatenates the 2 Series end-to-end
# axis=1 creates a 2D data frame
# Transpose recovers original orientation
# Column labels come from Series index
# Row labels come from Series name
pd.concat([italy, poland], axis=1).T
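To see how the axis argument changes the result, here is a small self-contained sketch. The two Series and their values are invented for illustration; only the names "Italy" and "Poland" are borrowed from the example above.

```python
import pandas as pd

# Hypothetical rows; the Series name becomes the row label later on
italy = pd.Series({"1962": 8000.0, "1967": 10000.0}, name="Italy")
poland = pd.Series({"1962": 5000.0, "1967": 6500.0}, name="Poland")

# Default (axis=0): one long Series containing all four values
stacked = pd.concat([italy, poland])

# axis=1 makes each Series a column; transposing turns them back into rows
table = pd.concat([italy, poland], axis=1).T

print(stacked)
print(table)
```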
(Optional) Scientific Computing Libraries
Libraries
- SciPy projects
- Numpy: Linear algebra
- Pandas
- Scipy.stats: Probability distributions and basic tests
- Statsmodels: Statistical models and formulae built on Scipy.stats
- Scikit-Learn: Machine learning tools built on NumPy
- Tensorflow/PyTorch: Deep learning and other voodoo
The basics of Scikit-Learn
Scikit-Learn documentation: https://scikit-learn.org/stable/
-
Motivating example: Ordinary least squares regression
from sklearn import linear_model
# Create some random data
x_train = np.random.rand(20)
y = np.random.rand(20)
# Fit a linear model
reg = linear_model.LinearRegression()
reg.fit(x_train.reshape(-1,1), y)
print("Regression slope:", reg.coef_)
-
Estimate model fit
from sklearn.metrics import mean_squared_error, r2_score
# Test model fit with new data
x_test = np.random.rand(20)
y_prediction = reg.predict(x_test.reshape(-1,1))
# Get model stats
mse = mean_squared_error(y, y_prediction)
r2 = r2_score(y, y_prediction)
print("R squared:", "{:.3f}".format(r2))
-
(Optional) Inspect our prediction
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(x_train, y, color="black")
ax.plot(x_test, y_prediction, color="blue")
# `fig` in Jupyter Lab
fig.show()
-
(Optional) Compare with Statsmodels
# Load modules and data
import statsmodels.api as sm
# Fit and summarize OLS model (center data to get accurate model fit)
mod = sm.OLS(y - y.mean(), x_train - x_train.mean())
res = mod.fit()
print(res.summary())
(Optional) Statsmodels regression example with applied data
-
Import data
data = pd.read_csv('surveys.csv')
# Check for NaN
print("Valid weights:", data['weight'].count())
print("NaN weights:", data['weight'].isna().sum())
print("Valid lengths:", data['hindfoot_length'].count())
print("NaN lengths:", data['hindfoot_length'].isna().sum())
-
Fit OLS regression model
from statsmodels.formula.api import ols
model = ols("weight ~ hindfoot_length", data, missing='drop').fit()
print(model.summary())
-
Generic parameters for all models
# Import the submodule explicitly so that the attribute lookup works
import statsmodels.base.model
help(statsmodels.base.model.Model)
(Optional) Things we didn’t talk about
- pipe
- map/applymap/apply (in general you should prefer vectorized functions)
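As a brief illustration of why vectorized functions are usually preferred, the sketch below computes the same result both ways. The Series `s` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column of values
s = pd.Series([1.0, 4.0, 9.0])

# Element by element: calls a Python function once per item
roots_apply = s.apply(lambda x: x ** 0.5)

# Vectorized: a single call that operates on the whole Series at once
roots_vectorized = np.sqrt(s)

print(roots_apply)
print(roots_vectorized)
```

Both produce a Series with the same index; on large data the vectorized form is typically much faster because the loop runs in compiled code.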
(Optional) Pandas method chaining in the wild
wine.rename(columns={"color_intensity": "ci"})
.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
.query("alcohol > 14 and color_filter == 1")
.sort_values("alcohol", ascending=False)
.reset_index(drop=True)
.loc[:, ["alcohol", "ci", "hue"]]
(Optional) Introspecting on the DataFrame object
-
DataFrames have a huge number of fields and methods, so dir() is not very useful
print(dir(data))
-
Create a new list that filters out internal attributes
df_public = [item for item in dir(data) if not item.startswith('_')]
print(df_public)
-
(Optional) Pretty-print the new list
import pprint
pp = pprint.PrettyPrinter(width=100, compact=True, indent=2)
pp.pprint(df_public)
-
Objects have fields (i.e. data/variables) and methods (i.e. functions/procedures). The difference between a method and a function is that methods are attached to objects, whereas functions are free-floating (“first-class citizens”). Methods and functions are “callable”:
# Generate a list of public methods and a list of public fields. We do this
# by testing each attribute to determine whether it is "callable".
# NB: Because Python allows you to override any attribute at runtime,
# testing with `callable` is not always reliable.
# List of methods (callable attributes)
df_methods = [item for item in dir(data) if not item.startswith('_')
              and callable(getattr(data, item))]
# List of fields (non-callable attributes)
df_attr = [item for item in dir(data) if not item.startswith('_')
           and not callable(getattr(data, item))]
pp.pprint(df_methods)
pp.pprint(df_attr)
(Carpentries version) Group By: split-apply-combine
-
Split data according to criterion, do numeric transformations, then recombine.
# Get all GDPs greater than the mean
mask_higher = data > data.mean()
# Count the number of time periods in which each country exceeds the mean
higher_count = mask_higher.aggregate('sum', axis=1)
# Create a normalized wealth-over-time score
wealth_score = higher_count / len(data.columns)
wealth_score
-
A DataFrame is a spreadsheet, but it is also a dictionary of columns.
data['gdpPercap_1962']
-
Add column to data frame
# wealth_score is a Series
type(wealth_score)
data['normalized_wealth'] = wealth_score
Building Programs (Week 3)
Notebooks vs Python scripts
Differences between .ipynb and .py
- Export notebook to .py file
- Move .py file into data directory
- Compare files in TextEdit/Notepad
Workflow differences between notebooks and scripts
Broadly, a trade-off between managing big code bases and making it easy to experiment. See: https://github.com/elliewix/Ways-Of-Installing-Python/blob/master/ways-of-installing.md#why-do-you-need-a-specific-tool
- Interactive testing and debugging
- Graphics integration
- Version control
- Remote scripts
(Optional) Python from the terminal
-
Python is an interactive interpreter (REPL)
python
-
Python is a command line program
# hello.py
print("Hello!")
python hello.py
-
(Optional) Python programs can accept command line arguments as inputs
- List of command line inputs: sys.argv (https://docs.python.org/3/library/sys.html#sys.argv)
- Utility for working with arguments: argparse (https://docs.python.org/3/library/argparse.html)
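A minimal sketch of an argparse-based script follows. The script name greet.py, the `name` argument, and the `--shout` flag are hypothetical, chosen only to show the pattern; passing an explicit list to `parse_args()` simulates the command line so the example also runs inside a notebook.

```python
import argparse

def parse_args(argv=None):
    """Parse command line arguments; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Print a greeting.")
    parser.add_argument("name", help="who to greet")
    parser.add_argument("--shout", action="store_true", help="use upper case")
    return parser.parse_args(argv)

def make_greeting(args):
    """Build the greeting string from the parsed arguments."""
    greeting = f"Hello, {args.name}!"
    return greeting.upper() if args.shout else greeting

# Simulate running: python greet.py World --shout
args = parse_args(["World", "--shout"])
print(make_greeting(args))
```

In a real script you would call `parse_args()` with no argument, and argparse would read the values from `sys.argv` and generate a `--help` message automatically.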
Looping Over Data Sets
File paths as an example of increasing abstraction in program development
- File paths as literal strings
- File paths as string patterns
- File paths as abstract Path objects
Use a for loop to process files given a list of their names
import pandas as pd
file_list = ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']
for filename in file_list:
    data = pd.read_csv(filename, index_col='country')
    print(filename)
    print(data.head(1))
Use glob.glob to find sets of files whose names match a pattern
-
Get a list of all the CSV files
import glob
glob.glob('data/*.csv')
-
In Unix, the term “globbing” means “matching a set of files with a pattern”. It uses shell expansion rules, not regular expressions, so there’s an upper limit to how flexible it can be. The most common patterns are:
- `*` meaning “match zero or more characters”
- `?` meaning “match exactly one character”
-
(Optional) Get a list of all CSV or TSV files
glob.glob('data/*.?sv')
-
Get a list of all the Gapminder CSV files
glob.glob('data/gapminder_gdp_*.csv')
-
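(Optional) Exclude the “all” CSV file. Note that [!all] matches a single character that is neither a nor l; it does not exclude the string “all”. The pattern works here only because the names of the other files continue with a g (gdp).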
glob.glob('data/gapminder_[!all]*.csv')
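You can experiment with these patterns without touching the file system by using the standard library's fnmatch module, which applies essentially the same wildcard rules to a list of strings. The file names below are hypothetical, picked to show where the `[!all]` pattern behaves unexpectedly.

```python
import fnmatch

# Hypothetical file names
names = [
    "gapminder_all.csv",
    "gapminder_gdp_africa.csv",
    "gapminder_gdp_asia.csv",
    "gapminder_lifeExp.csv",
    "notes.tsv",
]

# '*' matches any run of characters; '?' matches exactly one character
csv_or_tsv = fnmatch.filter(names, "*.?sv")

# '[!all]' matches ONE character that is not 'a' or 'l', so it also
# drops the lifeExp file, which merely starts with 'l'
not_a_or_l = fnmatch.filter(names, "gapminder_[!all]*.csv")

# A more explicit way to skip exactly one file
explicit = [name for name in fnmatch.filter(names, "gapminder_*.csv")
            if not name.endswith("all.csv")]

print(csv_or_tsv)
print(not_a_or_l)
print(explicit)
```

When the exclusion rule is about a whole word rather than a single character, filtering the matched list in Python is usually clearer than stretching the pattern.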
Use glob and a for loop to process batches of files
data_frames = []
for filename in glob.glob('data/gapminder_gdp_*.csv'):
    print(filename)
    data = pd.read_csv(filename)
    data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
Conditionals
Evaluating the truth of a statement
-
Does a file end in "all.csv"?
for filename in glob.glob('data/gapminder*.csv'):
    print("Current file:", filename)
    print(filename.endswith("all.csv"))
-
Value of a variable
mass = 3
print(mass == 3)
print(mass > 5)
print(mass < 4)
-
Membership in a collection
primes = [2, 3, 5]
print(2 in primes)
print(7 in primes)
-
Missing values
# Recreate data frame with missing data if necessary
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
subset = data.loc['Italy':'Poland', '1962':'1972']
df = subset[subset > 10000]
# Which values are missing?
print(df.isna())
# Are any values missing?
print(df.isna().any(axis=None))
-
(Optional) Truth of a collection
Note that any() and all() evaluate each item using .__bool__() or .__len__(), which tells you whether an item is “truthy” or “falsey” (i.e. interpreted as being true or false).
my_list = [2.75, "green", 0]
print(any(my_list))
print(all(my_list))
-
(Optional) Understanding “truthy” and “falsey” values in Python (cf. https://stackoverflow.com/a/53198991)
Every value in Python, regardless of type, is interpreted as being True except for the following values (which are interpreted as False). “Truthy” values satisfy if or while statements; “Falsey” values do not.
- Constants defined to be false: None and False.
- Zero of any numeric type: 0, 0.0, 0j, Decimal(0), Fraction(0, 1)
- Empty sequences and collections: '', (), [], {}, set(), range(0)
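The rules above can be checked directly with bool(). In this short sketch the `falsey` list reproduces the values named above, while the `truthy` list is an invented set of values that often surprise people.

```python
from decimal import Decimal
from fractions import Fraction

# The values listed above: all of these are interpreted as False
falsey = [None, False, 0, 0.0, 0j, Decimal(0), Fraction(0, 1),
          "", (), [], {}, set(), range(0)]

# Values that look "empty" or "zero-like" but are interpreted as True
truthy = [1, -1, "0", " ", [0], (None,), {"a": 0}]

for value in falsey + truthy:
    print(repr(value), "->", bool(value))

# any() and all() apply the same rule to each item in a collection
my_list = [2.75, "green", 0]
print(any(my_list), all(my_list))
```

Note that the string "0" and the list [0] are truthy: only the container's emptiness matters, not what it contains.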
Use if statements to control whether or not a block of code is executed
-
An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
mass = 3.5
if mass > 3.0:
    print(mass, 'is large')
mass = 2.0
if mass > 3.0:
    print(mass, 'is large')
-
Structure is similar to a for statement:
- First line opens with if and ends with a colon
- Body containing one or more statements is indented (usually by 4 spaces)
-
Use conditionals to decide which files to process
data_frames = []
for filename in glob.glob('data/gapminder*.csv'):
    print("Current file:", filename)
    if not filename.endswith("all.csv"):
        print("Passes test:", filename)
        data = pd.read_csv(filename)
        data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
Use else to execute a block of code when an if condition is not true
-
else can be used following an if. This allows us to specify an alternative to execute when the if branch isn’t taken.
m = 2.0
if m > 3.0:
    print(m, 'is large')
else:
    print(m, 'is small')
-
This lets you handle both cases explicitly
data_frames = []
for filename in glob.glob('data/gapminder*.csv'):
    if filename.endswith("all.csv"):
        print("I don't want any of that")
    else:
        print("Passes test:", filename)
        data = pd.read_csv(filename)
        data_frames.append(data)
all_data = pd.concat(data_frames)
print(all_data.shape)
Challenge: Process small files
Iterate through all of the CSV files in the data directory. Print the name and length (number of lines) of any file that is less than 30 lines long.
Solution
Note that the data frame will report the number of data rows, which doesn’t include the column headers (the actual file has a leading row with the header names).
for filename in glob.glob('data/*.csv'):
    data = pd.read_csv(filename)
    if len(data) < 30:
        print(filename, len(data))
Challenge: Find the European data
Iterate through all of the CSV files in the data directory. Print the file name that includes “europe”, then print the column names for the file.
Solution
for filename in glob.glob('data/*.csv'):
    if "europe" in filename.lower():
        print(filename)
        data = pd.read_csv(filename)
        print(data.columns)
Use elif to specify additional tests
You may want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.
if m > 9.0:
    print(m, 'is huge')
elif m > 3.0:
    print(m, 'is large')
else:
    print(m, 'is small')
- Always associated with an if.
- Must come before the else (which is the “catch all”).
(Optional) Conditions are tested once, in order
Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
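The example above reports a C for a grade of 85 because the first test that passes wins. One way to repair it, sketched below, is to test the most restrictive condition first. The function name `letter_grade` and the final catch-all label are additions for illustration; the cutoffs come from the example above.

```python
def letter_grade(grade):
    """Return a letter for a numeric grade, testing the highest cutoff first."""
    if grade >= 90:
        return "A"
    elif grade >= 80:
        return "B"
    elif grade >= 70:
        return "C"
    else:
        return "below C"  # hypothetical catch-all, not in the original example

print("grade is", letter_grade(85))
```

Because each branch is only reached when all of the earlier tests have failed, the `elif grade >= 80` branch implicitly means "at least 80 but less than 90".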
Compound Relations Using and, or, and Parentheses
Often, you want some combination of things to be true. You can combine relations within a conditional using and and or.
mass = [1, 2, 3, 4, 5]
velocity = [5, 4, 3, 2, 5]
for m, v in zip(mass, velocity):
    if m <= 3 and v <= 3:
        print("Small and slow")
    elif m <= 3 and v > 3:
        print("Small and fast")
    elif m > 3 and v <= 3:
        print("Large and slow")
    else:
        print("Check data")
- Use () to group subsets of conditions
- Aside: zip(), used in the loop above, is a natural way of iterating over several lists in parallel
(Optional) Use the modulus to print occasional status messages
for count, filename in enumerate(glob.glob('data/gapminder_*.csv')):
    # Print every other filename
    if count % 2 == 0:
        print(count, filename)
(Optional) Use pathlib to write code that works across operating systems
-
Pathlib provides cross-platform path objects
from pathlib import Path
# Create Path objects
raw_path = Path("data")
processed_path = Path("data/processed")
print("Relative path:", processed_path)
print("Absolute path:", processed_path.absolute())
-
Path objects have methods that provide much better information about files and directories.
# Note the careful testing at each level of the code.
data_frames = []
if raw_path.exists():
    for filename in raw_path.glob('gapminder_gdp_*.csv'):
        if filename.is_file():
            data = pd.read_csv(filename)
            print(filename)
            data_frames.append(data)
all_data = pd.concat(data_frames)
# Check for destination folder and create if it doesn't exist
if not processed_path.exists():
    processed_path.mkdir()
all_data.to_csv(processed_path.joinpath("combined_data.csv"))
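If you want to try the Path methods without touching your real data folder, the sketch below works inside a temporary directory that is deleted automatically. The file names and contents are invented for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "data"            # the / operator joins path parts
    processed_path = raw_path / "processed"

    # parents=True creates missing parent folders;
    # exist_ok=True avoids an error if the folder is already there
    processed_path.mkdir(parents=True, exist_ok=True)

    # Create two hypothetical input files
    for name in ["gapminder_gdp_a.csv", "gapminder_gdp_b.csv"]:
        (raw_path / name).write_text("country,gdp\nX,1\n")

    # Path.glob() yields Path objects, which can describe themselves
    found = sorted(p.name for p in raw_path.glob("gapminder_gdp_*.csv") if p.is_file())
    out_file = processed_path / "combined_data.csv"
    print(found)
    print(out_file.name, out_file.suffix, out_file.parent.name)
```

Using `mkdir(parents=True, exist_ok=True)` is a common alternative to the explicit `exists()` test shown above.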
(Optional) Generic file handling
Pandas understands specific file types, but what if you need to work with a generic file?
Open the file with a context manager
with open("data/bouldercreek_09_2013.txt", "r") as infile:
    lines = infile.readlines()
- The context manager closes the file when you’re done reading it
- "bouldercreek_09_2013.txt" is the name of the file
- infile is a variable that refers to the file on disk
A file is a collection of lines
.readlines() produces the file contents as a list of lines; each line is a string.
print(len(lines))
print(type(lines))
# View the first 10 lines
print(lines[:10])
Strings contain formatting marks
Compare the following:
# This displays the nicely-formatted document
print(lines[0])
# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
lines[0]
(Optional) Text processing and data cleanup
Use string methods to determine which lines to keep
-
The file contains front matter that we can discard
tabular_lines = []
for line in lines:
    if not line.startswith("#"):
        tabular_lines.append(line)
-
Now the first line is tab-separated data. Note that print() renders each tab as whitespace, whereas evaluating the bare expression shows the \t characters.
tabular_lines[0]
Open an output file for writing
outfile_name = "data/tabular_data.txt"
with open(outfile_name, "w") as outfile:
    outfile.writelines(tabular_lines)
Format output as a comma-delimited text file
-
Strip trailing whitespace
stripped_line = tabular_lines[0].strip()
stripped_line
-
Split each line into a list, using the tabs as delimiters.
split_line = stripped_line.split("\t")
split_line
-
Use a special-purpose library to create a correctly-formatted CSV file
import csv
outfile_name = "data/csv_data.csv"
# newline="" lets the csv module control the line endings
with open(outfile_name, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in tabular_lines:
        csv_line = line.strip().split("\t")
        writer.writerow(csv_line)
-
You can initialize csv.reader and csv.writer with different “dialects” or with custom delimiters and quotechars; see https://docs.python.org/3/library/csv.html
(Optional) Avoid memory limitations by processing the input file one line at a time
infile_name = "data/bouldercreek_09_2013.txt"
outfile_name = "data/csv_data.csv"
# newline="" lets the csv module control the line endings
with open(infile_name, "r") as infile, open(outfile_name, "w", newline="") as outfile:
    writer = csv.writer(outfile)
    for line in infile:
        if not line.startswith("#"):
            writer.writerow(line.strip().split("\t"))
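The same streaming pattern can be tested without any files by substituting in-memory text buffers for the input and output. This is only a sketch: `raw_text` is an invented stand-in for the real file, and io.StringIO is used here purely so the example is self-contained.

```python
import csv
import io

# Hypothetical stand-in for the input file: comment lines, then tab-separated rows
raw_text = (
    "# front matter\n"
    "# more front matter\n"
    "site\tdate\tflow\n"
    "06730200\t2013-09-01\t18\n"
)

infile = io.StringIO(raw_text)
outfile = io.StringIO(newline="")

writer = csv.writer(outfile)
for line in infile:                  # only one line is held in memory at a time
    if not line.startswith("#"):
        writer.writerow(line.strip().split("\t"))

csv_text = outfile.getvalue()
print(csv_text)
```

Iterating over the file object directly, rather than calling `.readlines()`, is what keeps memory use constant regardless of file size.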
(Optional) Notes
-
Pandas has utilities for reading fixed-width files: https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
-
Saving datasets with new-style string formatting (f-strings)
for i in datasets_list:
    do_something(f'{i}.png')
Writing Functions
Break programs down into functions to make them easier to understand
- Human beings can only keep a few items in working memory at a time.
- Understand larger/more complicated ideas by understanding and combining pieces
- Functions serve the same purpose in programs:
- Encapsulate complexity so that we can treat it as a single “thing”
- Removes complexity from remaining code, making it easier to test
- Enables re-use: Write one time, use many times
Define a function using def with a name, parameters, and a block of code
def print_greeting():
    print('Hello!')
- Begin the definition of a new function with def, followed by the name of the function.
- Must obey the same rules as variable names.
- Parameters in parentheses; empty parentheses if the function doesn’t take any inputs.
- Indent function body
Defining a function does not run it
print_greeting()
- Like assigning a value to a variable
- Must call the function to execute the code it contains.
Arguments in call are matched to parameters in definition
-
Positional arguments
def print_date(year, month, day):
    # str.join() only accepts strings, so convert the numbers first
    joined = '/'.join([str(year), str(month), str(day)])
    print(joined)
print_date(1871, 3, 19)
-
(Optional) Keyword arguments
print_date(month=3, day=19, year=1871)
Functions may return a result to their caller using return
-
Use return ... to give a value back to the caller. return ends the function’s execution and returns you to the code that originally called the function.
def average(values):
    """Return average of values, or None if no values are supplied."""
    if len(values) == 0:
        return None
    else:
        return sum(values) / len(values)
a = average([1, 3, 4])
print(a)
-
You should explicitly handle common problems:
print(average([]))
-
Notes:
- return can occur anywhere in the function, but functions are easier to understand if return occurs:
  - At the start to handle special cases
  - At the very end, with a final result
- Docstring provides function help. Use triple quotes if you need the docstring to span multiple lines.
(Optional) challenge: Encapsulate text processing in a function
Write a function that takes line as an input and returns the information required by writer.writerow().
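One possible solution is sketched below. The function name `parse_line` and the optional `delimiter` parameter are choices made for this illustration rather than part of the original exercise.

```python
def parse_line(line, delimiter="\t"):
    """Return the fields of one line of text as a list of strings."""
    # Remove the trailing newline, then split on the delimiter
    return line.strip().split(delimiter)

row = parse_line("06730200\t2013-09-01\t18\n")
print(row)
```

Inside the processing loop, the call would then read `writer.writerow(parse_line(line))`.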
Challenge (data normalization): Encapsulate Z score calculations in a function
Write a function that encapsulates the Z-score calculations from the Pandas workshop. Your function needs to do the following:
- Read a CSV file into a data frame
- Calculate the Z score for each item
- Calculate the mean Z score for each country
- Append the mean Z scores as a new column
- Return the data frame
Solution
def norm_data(filename):
    """Read a CSV file and return a data frame with a mean Z score column."""
    df = pd.read_csv(filename, index_col="country")
    # If you need to drop the continent column
    if "continent" in df.columns:
        # drop() returns a new data frame, so keep the result
        df = df.drop("continent", axis=1)
    # Calculate individual Z scores
    z = (df - df.mean(axis=None))/df.values.std(ddof=1)
    # Get the mean z score for each country
    mean_z = z.mean(axis=1)
    df["mean_z"] = mean_z
    return df
df = norm_data("data/gapminder_gdp_europe.csv")
(Optional) Use the function to process all files
for filename in glob.glob('data/gapminder_*.csv'):
    # Print a status message
    print("Current file:", filename)
    # Read the data into a DataFrame and add the mean Z score column
    data = norm_data(filename)
    # Group countries into "wealthy" (z > 0) and "not wealthy" (z <= 0)
    data["wealthy"] = data["mean_z"] > 0
    # Generate an output file name
    parts = filename.split(".csv")
    newfile = ''.join([parts[0], "_normed.csv"])
    data.to_csv(newfile)
(Optional) A worked example: The Lorenz attractor
https://matplotlib.org/stable/gallery/mplot3d/lorenz_attractor.html
(Carpentries version) Conditionals
Use if statements to control whether or not a block of code is executed
An if statement (more properly called a conditional statement) controls whether some block of code is executed or not.
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')
mass = 2.07
if mass > 3.0:
    print(mass, 'is large')
Structure is similar to a for statement:
- First line opens with if and ends with a colon
- Body containing one or more statements is indented (usually by 4 spaces)
Conditionals are often used inside loops
Not much point using a conditional when we know the value (as above), but useful when we have a collection to process.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
Use else to execute a block of code when an if condition is not true
else can be used following an if. This allows us to specify an alternative to execute when the if branch isn’t taken.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
Use elif to specify additional tests
You may want to provide several alternative choices, each with its own test; use elif (short for “else if”) and a condition to specify these.
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')
- Always associated with an if.
- Must come before the else (which is the “catch all”).
Conditions are tested once, in order
Python steps through the branches of the conditional in order, testing each in turn. Order matters! The following is wrong:
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
Use conditionals in a loop to “evolve” the values of variables
velocity = 10.0
for i in range(5): # execute the loop 5 times
    print(i, ':', velocity)
    if velocity > 20.0:
        velocity = velocity - 5.0
    else:
        velocity = velocity + 10.0
print('final velocity:', velocity)
- This is how dynamical systems simulations work
Compound Relations Using and, or, and Parentheses (optional)
Often, you want some combination of things to be true. You can combine relations within a conditional using and and or. Continuing the example above, suppose you have:
mass = [ 3.54, 2.07, 9.22, 1.86, 1.71]
velocity = [10.00, 20.00, 30.00, 25.00, 20.00]
i = 0
for i in range(5):
    if mass[i] > 5 and velocity[i] > 20:
        print("Fast heavy object. Duck!")
    elif mass[i] > 2 and mass[i] <= 5 and velocity[i] <= 20:
        print("Normal traffic")
    elif mass[i] <= 2 and velocity[i] <= 20:
        print("Slow light object. Ignore it")
    else:
        print("Whoa! Something is up with the data. Check it")
- Use () to group subsets of conditions
- Aside: For a more natural way of working with many lists, look at zip()
Visualization with Matplotlib and Seaborn (Week 4)
Orientation
Briefly revisit week 1
- Python orientation
- Jupyter orientation
A brief history of plotting in Matplotlib
- Multiple interfaces
- Local graphs and global settings
- Matplotlib is the substrate for higher-level libraries
- Drawing things is verbose in any language
Plotting with Matplotlib
The basic plot
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]
ax.plot(time, position)
Two kinds of plotting objects
type(fig)
print(type(fig))
print(type(ax))
- Figure objects handle display, printing, saving, etc.
- Axes objects contain graph information
(Optional) Three ways of showing a figure
-
Show figure inline (Jupyter Lab default)
fig
-
Show figure in a separate window (command line default)
fig.show()
-
Show figure in a separate window from Jupyter Lab. You may need to specify a different “backend” parameter for matplotlib.use() depending on your exact setup: https://matplotlib.org/stable/tutorials/introductory/usage.html#the-builtin-backends
import matplotlib
matplotlib.use('TkAgg')
fig.show()
The lifecycle of a custom plot
-
Create mock data
import numpy as np
y = np.random.random(10) # outputs an array of 10 random numbers between 0 and 1
x = np.arange(1980,1990,1) # generates an ordered array of numbers from 1980 to 1989
# Check that x and y contain the same number of values
assert len(x) == len(y)
-
Inspect our data
print("x:", x)
print("y:", y)
-
Create the basic plot
# Convert y axis into a percentage
y = y * 100
# Draw plot
fig, ax = plt.subplots()
ax.plot(x, y)
-
Show available styles
# What are the global styles?
plt.style.available
# Set a global figure style
plt.style.use("dark_background")
# The style is only applied to new figures, not pre-existing figures
fig
# Re-creating the figure applies the new style
fig, ax = plt.subplots()
ax.plot(x, y)
-
Customize the graph
In principle, nearly every element on a Matplotlib figure is independently modifiable.
# Set figure size
fig, ax = plt.subplots(figsize=(8,6))
# Set line attributes
ax.plot(x, y, color='darkorange', linewidth=2, marker='o')
# Add title and labels
ax.set_title("Percent Change in Stock X", fontsize=22, fontweight='bold')
ax.set_xlabel(" Years ", fontsize=20, fontweight='bold')
ax.set_ylabel(" % change ", fontsize=20, fontweight='bold')
# Adjust the tick labels
ax.tick_params(axis='both', which='major', labelsize=18)
# Add a grid
ax.grid(True)
-
Save your figure
fig.savefig("mygraph_dark.png", dpi=300)
Plotting multiple data sets
In this example, plot GDP over time for multiple countries.
-
Import data
import pandas as pd
data = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
# Inspect our data
data.head(3)
-
Transform column headers into an ordinal scale
-
(Optional) Original column names are object (i.e. string) data
data.columns
-
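Strip off non-numeric portion of each column title. Note that str.strip() removes any of the listed characters from both ends of the string rather than a literal prefix; it works here because the years contain only digits.
years = data.columns.str.strip('gdpPercap_')
years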
-
Convert years strings into integers and replace original data frame column headers
data.columns = years.astype(int)
-
-
Extract rows from the DataFrame
x_years = data.columns
y_austria = data.loc['Austria']
y_bulgaria = data.loc['Bulgaria']
-
Create the plot object
# Change global background back to default
plt.style.use("default")
# Create GDP figure
fig, ax = plt.subplots(figsize=(8,6))
# Create GDP plot
ax.plot(x_years, y_austria, label='Austria', color='darkgreen', linewidth=2, marker='x')
ax.plot(x_years, y_bulgaria, label='Bulgaria', color='maroon', linewidth=2, marker='o')
# Decorate the plot
ax.legend(fontsize=16, loc='upper center') # automatically uses labels
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("GDP", fontsize=20, fontweight='bold')
(Optional) Plot directly from Pandas
Don’t do this.
-
The basic plot syntax
ax = data.loc['Austria'].plot()
fig = ax.get_figure()
fig
-
Decorate your Pandas plot
ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
ax.set_title("GDP of Austria", fontsize=22, fontweight='bold')
ax.set_xlabel("Years", fontsize=20, fontweight='bold')
ax.set_ylabel("GDP", fontsize=20, fontweight='bold')
fig = ax.get_figure()
fig
-
Overlaying multiple plots on the same figure with Pandas. This is super unintuitive.
# Create an Axes object with the Austria data
ax = data.loc['Austria'].plot(figsize=(8,6), color='darkgreen', linewidth=2, marker='*')
print("Austria graph", id(ax))
# Overlay the Bulgaria data on the same Axes object
ax = data.loc['Bulgaria'].plot(color='maroon', linewidth=2, marker='o')
print("Bulgaria graph", id(ax))
-
The equivalent Matplotlib plot (optional)
# Extract the x and y values from the data frame
x_years = data.columns
y_gdp = data.loc['Austria']
# Create the plot
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(x_years, y_gdp, color='darkgreen', linewidth=2, marker='x')
# etc.
Visualization Strategy
There are many kinds of plots
## Visualize the same data using a scatterplot
plt.style.use('ggplot')
# Create a scatter plot
fig, ax = plt.subplots(figsize=(8,6))
ax.scatter(y_austria, y_bulgaria, color='blue', linewidth=2, marker='o')
# Decorate the plot
ax.set_title("GDP of Austria vs Bulgaria", fontsize=22, fontweight='bold')
ax.set_xlabel("GDP of Austria",fontsize=20, fontweight='bold' )
ax.set_ylabel("GDP of Bulgaria",fontsize=20, fontweight='bold' )
Read the docs
- Matplotlib gallery: https://matplotlib.org/stable/gallery/index.html
- “Plotting categorical variables” example of multiple subplots
- Download code examples
- .py vs .ipynb
- Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html
- Seaborn gallery: https://seaborn.pydata.org/examples/index.html
- Seaborn tutorials: https://seaborn.pydata.org/tutorial.html
Workflow strategy
- Get in the ball park
- Look at lots of data
- Try lots of presets
- Customize judiciously
- Build collection of interactive and publication code snippets
Fast visualization and theming with Seaborn
Seaborn is a set of high-level pre-sets for Matplotlib.
Seaborn is a nice way to look at your data
# Import the Seaborn library
import seaborn as sns
ax = sns.lineplot(data=data.T, legend=False, dashes=False)
- Doing more with this data set requires transforming the data from wide form to long form; see https://seaborn.pydata.org/tutorial/data_structure.html
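As a rough illustration of the wide-to-long transformation, the sketch below reshapes a small invented table with `DataFrame.melt`. The column names `country`, `year`, and `gdp` are chosen here for illustration and do not come from the Gapminder files.

```python
import pandas as pd

# Invented wide-form table: one row per country, one column per year
wide = pd.DataFrame(
    {1952: [6000.0, 2400.0], 1957: [8800.0, 3000.0]},
    index=pd.Index(["Austria", "Bulgaria"], name="country"),
)

# Long form: one row per (country, year) observation
long_form = wide.reset_index().melt(id_vars="country", var_name="year", value_name="gdp")
print(long_form)
```

With long-form data, a call such as `sns.lineplot(data=long_form, x='year', y='gdp', hue='country')` can map columns to plot roles by name.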
Using preset styles
Let’s make a poster!
-
Import Iris data set https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/0e7a9b0a5d22642a06d3d5b9bcbad9890c8ee534/iris.csv
iris = pd.read_csv("../data/iris.csv")
iris.head()
-
Create a basic scatter plot
ax = sns.scatterplot(data=iris, x='sepal_length',y='petal_length')
-
Change plotting theme
plt.style.use("dark_background")
# Fix grid if necessary
#plt.rcParams["axes.grid"] = False
# Make everything visible at a distance
sns.set_context('poster')
# Color by species
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length', hue='species', palette='colorblind', size='petal_width')
# Place legend
ax.legend(bbox_to_anchor=(2,1))
- Read more about loc vs. bbox_to_anchor in the legend documentation: https://matplotlib.org/stable/api/legend_api.html
-
The Seaborn plot uses Matplotlib under the hood
# Set the figure size
fig = ax.get_figure()
fig.set_size_inches(8,6)
fig
(Optional) There are many styling options
-
Add styling to individual points
ax = sns.scatterplot(data=iris, x='sepal_length', y='petal_length', hue='species', palette='colorblind', style='species')
-
Prettify column names
words = [' '.join(i) for i in iris.columns.str.split('_')]
iris.columns = words
-
Make a regression plot
# Scatter plot with a fitted regression line
# (the columns were renamed above, so use the names with spaces)
ax = sns.regplot(data=iris, x='sepal length', y='petal length', scatter=True, scatter_kws={'color':'white'})
(Optional) Bar Charts
-
Bar Plot
ax = sns.barplot(data=iris, x='species', y='sepal_width', palette='colorblind')
- Default summary statistic is mean, and default error bars are 95% confidence interval.
-
Add custom parameters
# Error bars show standard deviation
# (seaborn 0.12 and later deprecate ci= in favor of errorbar='sd')
ax = sns.barplot(data=iris, x='species', y='sepal_width', ci='sd', edgecolor='black')
-
(Optional) count plot counts the records in each category
ax = sns.countplot(data=iris, x='species', palette='colorblind')
(Optional) Histograms
-
Histogram of overall data set
ax = sns.histplot(data=iris, x='petal_length', kde=True)
- KDE: If True, compute a kernel density estimate to smooth the distribution and show on the plot as (one or more) line(s).
- There seems to be a bimodal distribution of petal length. What factors underlie this distribution?
-
Histogram of data decomposed by category
ax = sns.histplot(data=iris, x='petal_length', hue='species', palette='Set2')
-
Create multiple subplots to compare binning strategies
# This generates 3 subplots (ncols=3) on the same figure
fig, axes = plt.subplots(figsize=(12,4), nrows=1, ncols=3)
# Note that we can use Seaborn to draw on our Matplotlib figure
sns.histplot(data=iris, x='petal_length', bins=5, ax=axes[0], color='#f5a142')
sns.histplot(data=iris, x='petal_length', bins=10, ax=axes[1], color='maroon')
sns.histplot(data=iris, x='petal_length', bins=15, ax=axes[2], color='darkmagenta')
(Optional) Box Plots and Swarm Plots
-
Box plot
ax = sns.boxplot(data=iris, x='species', y='petal_length')
-
Swarm plot
ax = sns.swarmplot(data=iris, x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation=45)
This gives us a format warning.
-
Strip plot
ax = sns.stripplot(data=iris, x='species', y='petal_length', hue='species', palette='Set1')
ax.legend(loc='upper left', fontsize=16)
ax.tick_params(axis='x', labelrotation=45)
-
Overlapping plots
ax = sns.boxplot(data=iris, x='species', y='petal_length')
sns.stripplot(data=iris, x='species', y='petal_length', ax=ax, palette='Set1')
(Optional) How Matplotlib works
Understanding Matplotlib
- Everything is an Artist (object)
- Multiple levels of specificity: plt vs axes
- rcParams vs temporary stylings
- Simplified high-level interfaces, aka “syntactic sugar”: legend() vs get legend handles and patches
Matplotlib object syntax
- The object.set_field(value) usage is taken from Java, which was popular in 2003 when Matplotlib was developing its object-oriented syntax
- You get values back out with object.get_field()
- The Pythonic way to set a value would be object.field = value. However, the Matplotlib getters and setters do a lot of internal bookkeeping, so if you try to set field values directly you will get errors. For example, compare ax.get_ylabel() with ax.yaxis.label.
- Read “The Lifecycle of a Plot”: https://matplotlib.org/stable/tutorials/introductory/lifecycle.html
- Read “Why you hate Matplotlib”: https://ryxcommar.com/2020/04/11/why-you-hate-matplotlib/
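The difference between a getter and the underlying attribute can be seen in a short sketch. It assumes nothing beyond a fresh figure: the getter returns a plain string, while the attribute is a Matplotlib Text artist that carries the string plus its styling.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_ylabel("GDP")

# The getter returns the label text as a plain string
print(ax.get_ylabel())
# The underlying attribute is a Text artist, not a string
print(type(ax.yaxis.label).__name__)
# The Text artist has its own getter for the string it holds
print(ax.yaxis.label.get_text())
```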
Special Topics
Environments
Working with unstructured files
Open the file with a context manager
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.read()
print(len(text))
Strings contain formatting marks
Compare the following:
# This displays the nicely-formatted document
print(text[:300])
# This shows the true nature of the string; you can see newlines (\n),
# tabs (\t), and other hidden characters
text[:300]
Many ways of handling a file
.read() produces the file contents as one string
type(text)
.readlines() produces the file contents as a list of lines; each line is a string
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    text = file_in.readlines()
print(len(text))
print(type(text))
Inspect parts of the file using list syntax
# View the first 10 lines
text[:10]
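The difference between the two methods can be checked on a small throwaway file. This sketch writes three invented lines to a temporary file (a stand-in, not the Pettigrew letters) and reads them back both ways.

```python
import os
import tempfile

# Write a small stand-in file (invented content, not the Pettigrew letters)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w") as file_out:
    file_out.write("Box 1\n\tFolder 2\nDear Sir,\n")

with open(path, "r") as file_in:
    as_string = file_in.read()        # one string for the whole file

with open(path, "r") as file_in:
    as_lines = file_in.readlines()    # list of strings, newlines kept

# repr() shows the hidden \n and \t characters
print(repr(as_string))
print(as_lines)
```

Joining the list back together with `"".join(as_lines)` reproduces the single string, which shows that the two forms hold the same characters.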
Working with unstructured file data
Contents of pettigrew_letters_ORIGINAL.txt
- Intro material
- Manifest of letters
- Individual letters
Query: Are all the letters in the manifest actually there?
- check if all the letters reported in the manifest appear in the actual file
- check if all the letters in the file are reported in the manifest
- Therefore, construct two variables: (1) A list of every location line from the manifest, and (2) a list of every location line within the file proper
Get the manifest by visual inspection
manifest_list = text[14:159]
Use string functions to clean up and inspect text
Demonstrate string tests with manifest_list:
# Raw text
for location in manifest_list[:10]:
    print(location)
# Remove extra whitespace
for location in manifest_list[:10]:
    print(location.strip())
# Test whether the cleaned line starts with 'Box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('Box '))
# Test whether the cleaned line starts with 'box '
for location in manifest_list[:10]:
    stripped_line = location.strip()
    print(stripped_line.startswith('box '))
Gather all the locations in the full document
letters = text[162:]
for line in letters[:25]:
    # Create variables to hold the current line and the truth value of is_box
    stripped_line = line.strip()
    is_box = stripped_line.startswith('Box ')
    if is_box:
        print(stripped_line)
    # If the line is empty, don't print anything
    # (strip() has already removed the newline, so an empty line is '')
    elif stripped_line == '':
        continue
    # Indent non-Box lines
    else:
        print('---', stripped_line)
- Before automating everything, we run the code with lots of print() statements so that we can see what’s happening
Collect the positive results
letter_locations = []
for line in letters:
    stripped_line = line.strip()
    is_box = stripped_line.startswith("Box ")
    if is_box:
        letter_locations.append(stripped_line)
Compare the manifest and the letters
print('Items in manifest:', len(manifest_list))
print('Letters:', len(letter_locations))
Follow-up questions
- Which items are in one list but not the other?
- Are there other structural regularities you could use to parse the data? (Note that in the letters, sometimes there are multiple letters under a single box header)
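For the first follow-up question, one possible approach is to compare the two lists as sets. The sketch below uses short invented location strings in place of the real `manifest_list` and `letter_locations`, and assumes that entries match exactly once whitespace is stripped.

```python
# Invented stand-ins for manifest_list and letter_locations
manifest_list = ["Box 1 Folder 1\n", "Box 1 Folder 2\n", "Box 2 Folder 1\n"]
letter_locations = ["Box 1 Folder 1", "Box 2 Folder 1", "Box 3 Folder 4"]

# Normalize whitespace, then convert to sets so they can be subtracted
manifest_set = {item.strip() for item in manifest_list}
letters_set = set(letter_locations)

# In the manifest but missing from the letters
print(sorted(manifest_set - letters_set))
# In the letters but missing from the manifest
print(sorted(letters_set - manifest_set))
```

Sets discard duplicates, so this answers which locations differ, not how many times each one appears.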
Exception handling
Explicitly handle common errors, rather than waiting for your code to blow up.
def average(values):
    "Return average of values, or None if no values are supplied."
    if len(values) == 0:
        return None
    return sum(values) / len(values)
print(average([3, 4, 5])) # Prints expected output
print(average([])) # Explicitly handles possible divide-by-zero error
print(average(4)) # Unhandled exception
def average(values):
    "Return average of values, or an informative error if bad values are supplied."
    try:
        return sum(values) / len(values)
    except ZeroDivisionError as err:
        return err
    except TypeError as err:
        return err
print(average([3, 4, 5]))
print(average(4))
print(average([]))
- Use judiciously, and be as specific as possible. When in doubt, allow your code to blow up rather than silently commit errors.
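One way to stay specific is to raise a clearly named exception and let the caller decide how to respond, instead of returning the error object. This is a sketch of that alternative, not a replacement for the versions above; the name `average_strict` is made up for illustration.

```python
def average_strict(values):
    "Return average of values; raise ValueError if no values are supplied."
    if len(values) == 0:
        raise ValueError("average_strict() requires at least one value")
    return sum(values) / len(values)

# The caller handles only the error it expects
for sample in ([3, 4, 5], []):
    try:
        print(average_strict(sample))
    except ValueError as err:
        print("Skipping:", err)
```

Anything other than a `ValueError` (for example, passing an integer) still stops the program, which is the "blow up rather than silently commit errors" behavior.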
Performance and profiling
import time
import cProfile
import pstats
def my_fun(val):
    # Get 1st timestamp
    t1 = time.time()
    # do work
    # Get 2nd timestamp
    t2 = time.time()
    print(round(t2 - t1, 3))
# Run the function with the profiler and collect stats
# (`val` must be defined before this call)
cProfile.run('my_fun(val)', 'dumpstats')
s = pstats.Stats('dumpstats')
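The template above leaves the work and the argument unspecified. A runnable version might look like the following sketch, in which the body of `my_fun` is a hypothetical workload chosen only so there is something to measure. It uses `time.perf_counter()` for timing and prints the profiler report sorted by cumulative time.

```python
import cProfile
import io
import pstats
import time

def my_fun(val):
    "Hypothetical workload: sum the squares of the first `val` integers."
    return sum(i * i for i in range(val))

# Wall-clock timing with a high-resolution counter
t1 = time.perf_counter()
result = my_fun(100_000)
t2 = time.perf_counter()
print(round(t2 - t1, 3))

# Profile the same call, collecting stats in memory instead of a dump file
profiler = cProfile.Profile()
profiler.enable()
my_fun(100_000)
profiler.disable()

# Show the five entries with the largest cumulative time
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```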
Reducing memory usage
Read a file one line at a time
with open('pettigrew_letters_ORIGINAL.txt', 'r') as file_in:
    for line in file_in:
        # Do stuff to current line
        pass
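As a concrete instance of "do stuff to current line", this sketch counts the lines that start with 'Box ' without ever holding the whole file in memory. The file contents are invented and written to a temporary location so the example is self-contained.

```python
import os
import tempfile

# Invented stand-in file (not the Pettigrew letters)
path = os.path.join(tempfile.mkdtemp(), "letters_sample.txt")
with open(path, "w") as file_out:
    file_out.write("Box 1 Folder 1\nDear Sir,\n\nBox 1 Folder 2\nDear Madam,\n")

box_count = 0
with open(path, "r") as file_in:
    # The file object yields one line at a time
    for line in file_in:
        if line.strip().startswith("Box "):
            box_count += 1

print(box_count)
```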
Use a SQLite database
import sqlite3
conn = sqlite3.connect('my_database_name.db')
with conn:
    c = conn.execute("SELECT column_name FROM table_name WHERE criterion")
    results = c.fetchall()
    c.close()
# Do stuff with `results`
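The query pattern can be tried without an existing database file. In this sketch the table name, columns, and rows are all hypothetical, and `":memory:"` is used only to keep the example self-contained; with a database file on disk, only the rows matching the query are loaded into Python.

```python
import sqlite3

# ":memory:" creates a temporary database for the duration of the connection
conn = sqlite3.connect(":memory:")
with conn:
    # Hypothetical table and rows, for illustration only
    conn.execute("CREATE TABLE letters (location TEXT, year INTEGER)")
    conn.executemany(
        "INSERT INTO letters VALUES (?, ?)",
        [("Box 1 Folder 1", 1850), ("Box 1 Folder 2", 1861), ("Box 2 Folder 1", 1849)],
    )

# The ? placeholder passes the criterion as a parameter rather than by string formatting
cursor = conn.execute(
    "SELECT location FROM letters WHERE year < ? ORDER BY location", (1855,)
)
results = cursor.fetchall()
conn.close()

print(results)
```

Each row comes back as a tuple, even when only one column is selected.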
Other optional topics
- Checking performance
- List comprehensions
- Defensive programming
- Debugging and Testing
Endnotes
Credits
- Plotting and Programming in Python (Pandas-oriented): http://swcarpentry.github.io/python-novice-gapminder/
- Programming with Python (NumPy-oriented): https://swcarpentry.github.io/python-novice-inflammation/index.html
- Python for Ecology: https://datacarpentry.org/python-ecology-lesson/
- Humanities Python Tour (file and text processing): https://github.com/elliewix/humanities-python-tour/blob/master/Two-Hour-Beginner-Tour.ipynb
- Introduction to Cultural Analytics & Python: https://melaniewalsh.github.io/Intro-Cultural-Analytics/welcome.html
- Rhondene Wint: Matplotlib and Seaborn notes
- Fruit Alphabet: https://en.wikibooks.org/wiki/Wikijunior:Fruit_Alphabet
References
Standard Python
- Python tutorial: https://docs.python.org/3/tutorial/index.html
- Python standard library: https://docs.python.org/3/library/
- String formatting: https://pyformat.info/
- True and False in Python: https://docs.python.org/3/library/stdtypes.html#truth-value-testing
Scientific Computing Libraries
- NumPy documentation: https://numpy.org/doc/stable/user/index.html
- Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/
- Pandas user guide: https://pandas.pydata.org/docs/user_guide/index.html
- SciPy user guide: https://docs.scipy.org/doc/scipy/tutorial/index.html
- Statsmodels library: https://www.statsmodels.org/stable/index.html
- Scikit-Learn documentation: https://scikit-learn.org/stable/
- Statistics in Python tutorial: https://scipy-lectures.org/packages/statistics/
Data Visualization Libraries
- Matplotlib gallery of examples: https://matplotlib.org/gallery/index.html
- Matplotlib tutorials: https://matplotlib.org/stable/tutorials/index.html
- Seaborn gallery of examples: https://seaborn.pydata.org/examples/index.html
- Seaborn tutorials: https://seaborn.pydata.org/tutorial.html
Marginalia
- How to choose a code editor: https://github.com/elliewix/Ways-Of-Installing-Python/blob/master/ways-of-installing.md#why-do-you-need-a-specific-tool
- IPython magic commands: https://ipython.readthedocs.io/en/stable/interactive/magics.html
- Writing documentation in Markdown: https://docs.github.com/en/get-started/writing-on-github/getting-started-with-writing-and-formatting-on-github/basic-writing-and-formatting-syntax
Data Sources
- Gapminder data: http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
- Ecology data (field surveys): https://datacarpentry.org/python-ecology-lesson/data/portal-teachingdb-master.zip
- Social Science data (SAFI): https://datacarpentry.org/socialsci-workshop/data/
- Humanities data (Pettigrew letters): http://dx.doi.org/10.5334/data.1335350291