Pandas Reference Guide

Useful API
Plotting with Matplotlib
- Basic Plotting
- Customizing Plots
Merging DataFrames
read_csv
DataFrame Plot API Summary

Useful API

Both DataFrame and Series

API	Comments
`describe()`	Outputs useful statistical information about the data
`unique()`	Returns only unique rows/values in the data. TODO: validate on DF
`drop_duplicates()`	Removes all duplicate rows/values in the data
`isnull()`	Replaces all values that are NaN with True. False otherwise
`notnull()`	Replaces all values that are NaN with False. True otherwise
`dropna()`	Removes all rows that have any NaN value
`fillna(value)`	Fills all values that are NaN with the value given
`apply(fn)`	Applies the function to all values (TODO: Validate works on DF)
`index`	The name of the index
`nunique()`	Gets a count of the unique rows/values in the data
`copy()`	Makes a copy of the DF. Should be used judiciously
`take(iterable)`	Returns the rows found in the iterable TODO: Series?
`concat()`	Not the same as ‘append’. TODO: Series. Describe better.

Question: How to append multiple columns to a DataFrame?

DataFrame only

API	Comments
`columns`	An iterable of all the column names
`loc[rows, cols]`	Returns a value, Series or DataFrame filtered as specified
`set_index('col')`	Sets the index of the DF to ‘col’
`reset_index()`	Restores the index of the DF to the default ordinal values
`sort_index()`	Sorts the DF by the index. TODO: is this Series too?
`sort_values(by='col')`	Sorts the DF by the values in column, ascending
`nlargest(count, 'col')`	Returns a DataFrame, sorted by ‘col’, descending, of size count
`nsmallest(count, 'col')`	Returns a DataFrame, sorted by ‘col’, ascending, of size count
`groupby('groupcol')['value_col'].xxx()`	Returns a Series where the index are the values in ‘groupcol’, the values are computed from the column ‘value_col’ and the computation is ‘xxx’ which is one of many mathematical functions.
`rename(dict)`	Uses the dictionary to rename the columns

Available on DataFrame, Series and Groupby Result

Math API	Comments
`min()`	Finds the minimum value
`max()`	Finds the maximum value
`mean()`	Finds the average/mean value
`idxmin()`	Finds the index of the minimum value
`idxmax()`	Finds the index of the maximum value

Math methods

add, sub, mul, div, mod, pow

# do element-by-element math
result = df1.add(df2)

Series only

API	Comments
`nlargest(count)`	Returns a sorted Series, descending, of size count
`nsmallest(count)`	Returns a sorted Series, ascending, of size count
`isin(iterable)`	Returns a boolean Series, True if the value is found in the iterable

API	Comments
`to_csv()`	Saves the DF as a CSV file. Handy during Final Project preprocessing step.
`read_csv()`	Details below

If you set a location to None (e.g. df.iloc[3, 2] = None), The value may become np.NaN if column is numeric
read_csv and parse_dates to get TimeSeries
Reading Unicode files: with (‘filename.txt, encoding=’utf-8’)
- This is helpful when copying content from Google Docs or Word
Series
- s.isin(list) # a mask
- (most of the DataFrame methods):
Iterating through a DataFrame
df[ X ] where X can be:
- ‘column_name’ - returns a Series
- Boolean Series - acts as a mask and returns a DataFrame
- ['col1', 'col2'] - using a list of column names returns a DataFrame
- Does a slice work? ['col1':'col2']
df.loc[ row, col ] where row/col can be:
- index value (or column name)
- list of values
- slice of values
- Will a boolean mask work?

import pandas as pd

# do details and examples of
# unique vs drop_duplicates (on df and series)
# groupby()
# .loc[]

more topics:

dropna or fillna
keep select columns with df = df[ col_list ]
perhaps coerce column dtypes: df = df.apply(pd.to_numeric, errors='coerce')
rename columns to something short and readable: df = df.rename(dict)

Plotting with Matplotlib

Matplotlib is a powerful Python library for creating static, animated, and interactive visualizations. While Pandas provides built-in plotting via .plot(), you can use Matplotlib directly for more customization.

Basic Plotting

import matplotlib.pyplot as plt

# Basic line plot
plt.plot(df['col1'], df['col2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.title('Line Plot Example')
plt.show()

# Scatter plot
plt.scatter(df['col1'], df['col2'])
plt.xlabel('Column 1')
plt.ylabel('Column 2')
plt.title('Scatter Plot Example')
plt.show()

# Histogram
plt.hist(df['col1'], bins=20)
plt.xlabel('Column 1')
plt.ylabel('Frequency')
plt.title('Histogram Example')
plt.show()

You can combine Pandas and Matplotlib for flexible plotting:

ax = df.plot(kind='bar')
ax.set_title('Bar Plot with Pandas and Matplotlib')
plt.show()

Customizing Plots

Here are some common customizations for Matplotlib plots:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(df['col1'], df['col2'], color='green', marker='o', linestyle='--', label='Data')
ax.set_xlabel('Column 1')
ax.set_ylabel('Column 2')
ax.set_title('Customized Line Plot')
ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(['A', 'B', 'C', 'D'])
ax.legend(loc='upper left')
plt.grid(True)
plt.tight_layout()
plt.show()

Colormap: For plots with multiple series, you can use cmap:
```
df.plot(kind='bar', colormap='viridis')
plt.show()
```
Legend: Use ax.legend() to show or customize the legend.
Grid: Use plt.grid(True) or ax.grid(True) to show grid lines.
Figure size: Use plt.figure(figsize=(8, 4)) or fig, ax = plt.subplots(figsize=(8, 4)).
Saving plots: Use plt.savefig('filename.png') to save the figure.

For more, see the Matplotlib documentation.

Merging DataFrames

Merging (joining) DataFrames is a common operation in Pandas. Use pd.merge() to combine DataFrames based on common columns or indices.

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})

# Inner join (only keys present in both)
merged = pd.merge(df1, df2, on='key', how='inner')

# Left join (all keys from df1)
merged_left = pd.merge(df1, df2, on='key', how='left')

# Outer join (all keys from both)
merged_outer = pd.merge(df1, df2, on='key', how='outer')

how can be 'inner', 'left', 'right', or 'outer'.
You can merge on multiple columns using on=['col1', 'col2'].
For index-based merging, use left_index=True and/or right_index=True.

See the Pandas merge documentation for more details.

read_csv

For full documentation, see Pandas API reference

Formatting the Date:
In Pandas version 2.0, there is a date_format argument. But the version that NCHS has installed is older (v 1.3) and does not. In Jupyter Notebook, you can see your version using: print(pd._version).

Older versions support data_parser=fun. fun is a function to parse the date. It is recommended to use pd.to_datetime(date_string, format='%Y-%m-%d').

Here is an example that will parse MM-YYYY-DD formatted dates such as 03-2023-31.

df = pd.read_csv('data.csv', index_col='date', parse_dates=True, 
                 date_parser=lambda s: pd.to_datetime(s, format='%m-%Y-%d'))

When parse_dates=True, then we will parse the index as our date.

parse_dates: bool, list of Hashable, list of lists or dict of {Hashablelist}, default False

The line above has the following behavior:

bool. If True -> try parsing the index.
list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
list of list. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
dict, e.g. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result foo

If a column or index cannot be represented as an array of datetime, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after read_csv().

DataFrame Plot API Summary

For full documentation, see Pandas DataFrame plot.

Note that in this documentation: label means the column name in the DataFrame, and position is a integer representing the ordinal of the column.

x : label or position, default None (use the index as the x-axis)
y : label, position, list of labels or positions, default None.
kind : One of the following strings describing the kind of plot to produce.

‘line’ : line plot (default)
‘bar’ : vertical bar plot
‘barh’ : horizontal bar plot
‘hist’ : histogram
‘box’ : boxplot
‘kde’ : Kernel Density Estimation plot
‘density’ : same as ‘kde’
‘area’ : area plot
‘pie’ : pie plot
‘scatter’ : scatter plot (DataFrame only)
‘hexbin’ : hexbin plot (DataFrame only) ax : The axes object to plot to subplot : bool, True means to make separate subplots for each column.

There are many more arguments such as: title, grid, legend, style, xticks, yticks, xlim, ylim, xlabel, ylabel, rot, colormap, table, stacked

import pandas as pd
df = pd.DataFrame()
help(df.plot)