Useful API
Both DataFrame and Series
API | Comments |
---|---|
describe() | Outputs useful statistical information about the data |
unique() | Returns only unique rows/values in the data. TODO: validate on DF |
drop_duplicates() | Removes all duplicate rows/values in the data |
isnull() | Replaces all values that are NaN with True. False otherwise |
notnull() | Replaces all values that are NaN with False. Tue otherwise |
dropna() | Removes all rows that have any NaN value |
fillna(value) | Fills all values that are NaN with the value given |
apply(fn) | Applies the function to all values (TODO: Validate works on DF) |
index | The name of the index |
nunique() | Gets a count of the unique rows/values in the data |
copy() | Makes a copy of the DF. Should be used judiciously |
take(iterable) | Returns the rows found in the iterable TODO: Series? |
concat() | Not the same as ‘append’. TODO: Series. Describe better. |
Question: How to append multiple columns to a DataFrame?
DataFrame only
|API|Comments|
|—|——–| |columns
|An iterable of all the column names|
|loc[rows, cols]
|Returns a value, Series or DataFrame filtered as specified|
|set_index('col')
|Sets the index of the DF to ‘col’|
|reset_index()
|Restores the index of the DF to the default ordinal values|
|sort_index()
|Sorts the DF by the index. TODO: is this Series too?|
|sort_values(by='col')
|Sorts the DF by the values in column, ascending|
|nlargest(count, 'col')
|Returns a DataFrame, sorted by ‘col’, descending, of size count|
|nsmallest(count, 'col')
|Returns a DataFrame, sorted by ‘col’, ascending, of size count|
|groupby('groupcol')['value_col'].xxx()
|Returns a Series where the index are the values in ‘groupcol’, the values are computed from the column ‘value_col’ and the computation is ‘xxx’ which is one of many mathematical functions.|
|rename(dict)
|Uses the dictionary to rename the columns|
Available on DataFrame, Series and Groupby Result
|Math API|Comments|
|——–|——–|
|min()
|Finds the minimum value|
|max()
|Finds the maximum value|
|mean()
|Finds the average/mean value|
|idxmin()
|Finds the index of the minimum value|
|idxmax()
|Finds the index of the maximum value|
Math methods
add
, sub
, mul
, div
, mod
, pow
# do element-by-element math
result = df1.add(df2)
Series only
|API|Comments|
|—|——–| |nlargest(count)
|Returns a sorted Series, descending, of size count|
|nsmallest(count)
|Returns a sorted Series, ascending, of size count|
|isin(iterable)
|Returns a boolean Series, True if the value is found in the iterable|
File Related
|API|Comments|
|—|——–|
|to_csv()
|Saves the DF as a CSV file. Handy during Final Project preprocessing step.|
|read_csv()
| Details below|
- If you set a location to
None
(e.g.df.iloc[3, 2] = None
), The value may becomenp.NaN
if column is numeric - read_csv and parse_dates to get TimeSeries
- Reading Unicode files: with (‘filename.txt, encoding=’utf-8’)
- This is helpful when copying content from Google Docs or Word
- Series
s.isin(list)
# a mask- (most of the DataFrame methods):
- Iterating through a DataFrame
df[ X ]
where X can be:- ‘column_name’ - returns a Series
- Boolean Series - acts as a mask and returns a DataFrame
['col1', 'col2']
- using a list of column names returns a DataFrame- Does a slice work?
['col1':'col2']
df.loc[ row, col ]
where row/col can be:- index value (or column name)
- list of values
- slice of values
- Will a boolean mask work?
- API on DataFrame
describe
unique
,index
,columns
copy
to_csv
:- Math:
add
,sub
,mul
,div
,mod
,pow
: elementwise math of two dataframes min
,max
,idxmin
,idxmax
: returns a Series with information for each columnnunique
:dropna
:isnull
andnotnull
nlargest(count, 'col')
nsmallest(count, 'col')
groupby
sort_values('col')
concat([df1, df2])
vsappend
which is deprecated.sort_index
set_index
andreset_index
rename(dict)
where dict is{ 'orig': 'New_name' }
apply
plot
boxplot
df = df.apply(pd.to_numeric, errors='coerce')
: forces values to be numeric
import pandas as pd
# do details and examples of
# unique vs drop_duplicates (on df and series)
# groupby()
# .loc[]
more topics:
- dropna or fillna
- keep select columns with
df = df[ col_list ]
- perhaps coerce column dtypes:
df = df.apply(pd.to_numeric, errors='coerce')
- rename columns to something short and readable:
df = df.rename(dict)
read_csv
For full documentation, see Pandas API reference
Formatting the Date:
In Pandas version 2.0, there is a date_format
argument. But the version that NCHS has installed is older (v 1.3) and does not. In Jupyter Notebook, you can see your version using: print(pd._version)
.
Older versions support data_parser=fun
. fun is a function to parse the date. It is recommended to use pd.to_datetime(date_string, format='%Y-%m-%d')
.
Here is an example that will parse MM-YYYY-DD
formatted dates such as 03-2023-31
.
df = pd.read_csv('data.csv', index_col='date', parse_dates=True,
date_parser=lambda s: pd.to_datetime(s, format='%m-%Y-%d'))
When parse_dates=True
, then we will parse the index as our date.
```{admonition} parse_dates :class: dropdown seealso
parse_dates: bool, list of Hashable, list of lists or dict of {Hashablelist}, default False
The line above has the following behavior:
- bool. If True -> try parsing the index.
- list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
- list of list. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
- dict, e.g.
{'foo' : [1, 3]}
-> parse columns 1, 3 as date and call resultfoo
If a column or index cannot be represented as an array of datetime, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use to_datetime() after read_csv().
## Iterating a DataFrame
There are several ways to iterate through a DataFrame.
```python
# iterates through the columns
for n in df2:
print(n)
# iterates through the tuples (col_name, col_series)
for col_name, col_series in df2.iteritems():
print(col_name)
print(col_series)
# identical to the above
for col_name, col_series in df2.items():
print(col_name)
print(col_series)
DataFrame Plot API Summary
For full documentation, see Pandas DataFrame plot.
```{admonition} Documentation Notes :class: dropdown Note that in this documentation:
label
means the column name in the DataFrame.position
is a integer representing the ordinal of the column.
```
x : label or position, default None
(use the index as the x-axis)
y : label, position, list of labels or positions, default None.
kind : One of the following strings describing the kind of plot to produce.
- ‘line’ : line plot (default)
- ‘bar’ : vertical bar plot
- ‘barh’ : horizontal bar plot
- ‘hist’ : histogram
- ‘box’ : boxplot
- ‘kde’ : Kernel Density Estimation plot
- ‘density’ : same as ‘kde’
- ‘area’ : area plot
- ‘pie’ : pie plot
- ‘scatter’ : scatter plot (DataFrame only)
- ‘hexbin’ : hexbin plot (DataFrame only) ax : The axes object to plot to subplot : bool,
True
means to make separate subplots for each column.
There are many more arguments such as: title
, grid
, legend
, style
, xticks
, yticks
, xlim
, ylim
, xlabel
, ylabel
, rot
, colormap
, table
, stacked
import pandas as pd
df = pd.DataFrame()
help(df.plot)