DataFrame
Table of Contents
- Creating a DataFrame
- DataFrame Indexing Methods
- Iterating a DataFrame
- Accessing/Changing Attributes & Values
- Handy Solutions to Common problems
Here we will review common and useful actions with a DataFrame.
First, it is helpful to understand how the Pandas DataFrame documentation is structured. Painfully, it is not alphabetic! However, wonderfully, it is structured by activity.
Creating a DataFrame
Creating from a list of dictionaries
Each dictionary is a row in the DataFrame
. Each key in the dictionary is the column name with a value for the cell.
d1 = { 'City':'New York', 'Pop': 8.6, 'Rating': 3 }
d2 = { 'City':'Seattle', 'Pop': .8, 'Rating': 4 }
d3 = { 'City':'Los Angeles', 'Pop': 3.9, 'Rating': 3.5 }
d4 = { 'City':'Houston', 'Pop': 2.1, 'Rating': 4.1 }
my_list = [d1, d2, d2, d4]
pd = pd.DataFrame(my_list)
# The above creates a DataFrame with 4 rows, 3 columns.
# The column names are: 'City', 'Pop', and 'Rating'
Creating from a list of lists
In this list of lists, each inner list is a row.
# This will have 5 rows, 3 columns, using default indexing.
# The column names are provided in the constructor below
data = [["New York", 8.6, 20],
["Chicago", 2.7, 20],
["Los Angeles", 3.9, 20],
["Philadelphia", 1.5, 20],
["Houston", 2.1, 20]]
# Give names to the columns when constructing
df = pd.DataFrame(data, columns=["City", "Population", "Year(2020)"])
Creating by adding columns
Here, we create an empty DataFrame
and then add on columns.
df = pd.DataFrame()
# add a column with the name 'seat_number'
df['seat_number'] = [ 1, 2, 3, 4 ]
# add a columns with the name 'student_id`
df['student_id'] = [ 1113, 1125, 1427, 1211]
# df has 4 rows, 2 columns, using default indexing.
Creating from a dictionary
The keys are the column names, the values are lists of column values.
data = { "New York": [8.6, 20],
"Chicago": [2.7, 20],
"Los Angeles": [3.9, 20],
"Philadelphia": [1.5, 20],
"Houston": [2.1, 20]
}
# creates 5 columns, 2 rows, default indexing
df = pd.DataFrame(data)
The structure of the dictionary can take many forms. To see the various structures take a look at the converse function DataFrame.to_dict().
Creating from a List
This is highly unusual, but possible.
# to get a dataframe with 1 column from a list of values
list = [ 2, 4, 6, 8 ]
df = pd.DataFrame(list)
# The above code creates 4 rows, 1 column, using default indexing. Column name is 0.
DataFrame Indexing Methods
We will explore four methods that index data within a Pandas DataFrame. The abundance of approaches can be overwhelming and this guide aims to provide clarity. For additional information, refer to the Pandas DataFrame Documentation.
API | Index/Position | Comment |
---|---|---|
df[] | rows by position, columns by name | One argument only |
df.loc[] | by index | [inclusive : inclusive] |
df.iloc[] | by position | Columns by position, too |
df.take() | by position |
axis=0 take rowsaxis=1 take columns |
Except for loc
the API are [inclusive : exclusive]
.
Argument Types
Type | Example | Notes |
---|---|---|
Number |
df.iloc[3] df.loc[3]
| Describes the position, or, index |
Numerical Slice |
df[0:3] df.loc[1:3]
| A slice using numbers (positional) (index) |
Numerical List |
df.take([0, 3]) df.loc[[1,3]]
| List of numbers (positional) (index) |
Mask | df[[True, False, True]] | Any iterable of True/False values |
String | df['A'] | A column name (or index value) |
String Slice | df.loc[:, 'C':'words'] | Represents a range of columns |
String List | df[['C','D','words']] | A list of column names |
API Details
-
-
df[]
takes a single argument only. - Slices are
[ Inclusive : Exclusive ]
. - The most common usage is to get a
Series
(column) from theDataFrame
viadf['col_name']
. - One can add a column to a
DataFrame
withdf['new_column'] = list_of_values
It is important to note here that when a numerical slice is used, the values represent positions, not the index. When string values are used, the values represent column names.
Valid Argument Types:
- Numerical Slice: represents the Row Positions
- Mask: represents which Rows to keep
- String: column name
- String List: list of column names
Invalid Argument Types:
-
Numerical List: You cannot use
df[[0, 5, 8]]
to access a list of rows. -
String Slice: You cannot use
df['first':'exclude_me']
to access a slice of columns. Instead, you must use a list of strings.
Examples:
Valid Results Invalid Note df[0:1]
DataFrame
All columns, 1 rowdf[0]
single number NOT allowed df[1:3]
DataFrame
All columns, 2 rowsdf[[0, 3]]
List must be column names df[2:10]
DataFrame
All columns. Many rowsAll rows at those positions. If the slice is completely out of bounds, an empty DataFrame is returned. df[[True, False]]
DataFrame
All columns, rows where mask==TrueLength of mask must be len(df) df[['C', 'D']]
DataFrame
All rows with columns ‘C’ and ‘D’The columns are ordered as they appear in the list. df[['C']]
DataFrame
All rows, one column ‘C’A list of just 1 still results in a DataFrame. df['C']
Series. Column ‘C’ df['C':'words']
Column slices NOT allowed. Note that, except for the last example, all valid API return a DataFrame.
-
-
-
df.loc[rows, cols]
takes one or two arguments. - To get all rows, use an ‘empty’ slice:
df.loc[ : , 'Name']
- Each argument can describe multiple values or a single value.
- The type returned depends on the argument types (vectors vs scalar [1][2]).
- If both arguments are vectors, a
DataFrame
is returned. - If exactly one argument is a vector, then a
Series
is returned. - If both arguments are scalar, then the contents of the single cell is returned.
- If both arguments are vectors, a
- Slices are
[ Inclusive : Inclusive ]
. - One can set a value.
df.loc[1, 'Name'] = 5
will set theDataFrame
’s value to 5 at index 1 and row ‘Name’.
The first argument describes the indices of the rows desired.
The second argument describes the column names desired.Valid Argument Types:
- Slice: For rows, the slice must match the index type. For columns, the slice indicates the names of the columns to include and every column between.
-
Mask: Represents which rows/cols to keep. Mask is an iterable of True/False values (such as a
Series
or a list) - List: A list of indices or column names to include.
Examples:
Valid Note df.loc[0:3]
Rows with indices 0 & 3, and everything in between df.loc[11:5]
Since indices don’t need to be sorted, this will have index 11 as the first row and index 5 as the last row, and all rows that are positioned between them. df.loc[mask1, mask2]
Gets all the rows where mask1 is True, and all the columns where mask2 is True. len(mask1) == count of rows. len(mask2) == count of columns. df.loc[ : , df.columns != 'Bad']
Gets all the rows and all the columns except for ‘Bad’. df.columns is a list of column names. The result of comparing the list to ‘Bad’ is a list of True/False values… a mask df.loc[ [0, 8, 11], ['One', 'Three'] ]
Each list represents which row/column to keep. The result has the same order as that provided in the list. ```{admonition} Why Inclusive? :class: note The reason this API is inclusive is because one will often want to get a set of columns that one knows about. It is better to say:
“I want all the columns starting with ‘One’ and ending with ‘Last’”
than it is to say:
“I want all the columns starting with ‘One’ and goes until the column after the ‘Last’ column that I want named… gosh, what is the name of the column that I don’t want?
The same can be true for rows when the index is not sorted.
``` -
-
-
df.iloc[rows, cols]
takes 1 or 2 arguments. - Arguments are primarily integers that represent position, but can also be a boolean mask.
- Arguments can also be a lambda for with the argument of the lambda is the calling object, either a DataFrame or Series.
- There are no string arguments.
Valid Argument Types:
- integer that represents the position of the row or column.
- Numerical Slice that represents the positions of the rows or columns.
- Numerical List that represents the positions of the rows or columns.
-
-
df.take()
is primarily a positional based indexer/slicer. It will return the columns or rows in the position specified by the inputs, which will either be a single value, list, or slice. It has another parameter,axis
, which is set to 0 by default. The default parameter refers to the index/rows, meaning we are selecting rows to extract. If you setaxis
to 1, then it will select columns.Since
df.take()
is positional based, you cannot input labels. They must be integers. Additionally, it ignores whatever custom index you may have applied, even if it is an integer index. It will use the underlying positional index to collect the data. Remember, you need to include the square brackets, because the function takes only lists.''' Series based on a row ''' s1 = df.take([1]) # returns a series representing row 1 ''' Series based on a column ''' s2 = df.take([1], axis=1) # returns a series representing column 1 ''' DataFrame based on rows ''' d1 = df.take([1, 3, 0]) # returns a dataframe containing rows 1, 3, and 0 # Setting `axis` to 1 indicates whether the list of indexes is for **rows** or **columns** d2 = df.take([1, 3, 0], axis=1) # returns a dataframe containing columns 1, 3, and 0
Exceptions and Oddities
- Make sure to look out for
IndexError
exceptions, which will arise if you input a number that is not in the index. This means you can’t uselen(df)
as a parameter in order to get the last row of the dataframe. - You can use negative indices. This works similarly to slicing in lists in base python, where the -1 index corresponds to the last value of the set.
-
df.take([0])
returns the first (0 position) row.
- Make sure to look out for
Examples
-
df =
Result API df[0:1]
df.loc[0:0]
df.iloc[0:1]
df.take([0])
df[1:3]
df.loc[1:2]
df.iloc[1:3]
df.take([1,2])
mask=[True,False,True,False]
df[mask]
df.loc[mask]
df.loc[[0, 2]]
df.iloc[mask]
df.iloc[[0, 2]]
df.take([0, 2])
df[[0, 2]]
FAILS!!mask=[False,False,True,True,True,False]
df[['C', 'D', 'words']]
df.loc[:, mask]
df.loc[:, 'C':'words']
df.loc[:, ['C', 'D', 'words']]
df.iloc[:, mask]
df.iloc[:, [0, 2]]
df.take([2, 3, 4], axis=1)
df['C':'words']
FAILS!!df.loc[:, mask]
df.loc[:, 'A':'A']
df.loc[:, ['A']]
df.iloc[:, 0:1]
df.iloc[:, mask]
df.iloc[:, [0]]
df.take([0], axis=1)
df.loc[:, 'A']
WRONG!! Returns SeriesSeries: Column ‘A’
<pre>0 1
1 2
2 3
3 4
Name: A, dtype: int64</pre>df['A']
df.loc[:, 'A']
df.loc[:, mask]
df.iloc[:, 0]
df.iloc[:, mask]
Series: 2nd Row with index 1
<pre>A 2
B 6
C 10
D 14
words cat
jumpy 2
Name: 1, dtype: object0</pre>df.loc[1, :]
df.iloc[1, :]
Scalar Value: 7
df.loc[2, 'B']
df.iloc[2, 1]
-
DataFrame Result API df[0:1]
df.loc[9:9]
df.iloc[0:1]
df.take([0])
df[1:3]
df.loc[2:11]
df.iloc[1:3]
df.take([1,2])
-
DataFrame Result API df[0:1]
df.loc['ax']
df.iloc[0:1]
df.take([0])
df[1:3]
df.loc['cat':'dog']
df.iloc[1:3]
df.take([1,2])
Footnotes
[1] Vector is intended to a data structure that can hold multiple values such as a list, slice, or array, even when these structures contain a single value.
[2] Scalar is intended to mean a single value such as a string or integer. It is NOT a list, slice, or array.
Iterating a DataFrame
There are several ways to iterate through a DataFrame.
# iterates through the columns
for n in df2:
print(n)
# iterates through the tuples (col_name, col_series)
for col_name, col_series in df2.iteritems():
print(col_name)
print(col_series)
# identical to the above
for col_name, col_series in df2.items():
print(col_name)
print(col_series)
Accessing/Changing Attributes & Values
You can access and modify DataFrame attributes and values in several ways:
Renaming Columns
Rename columns using the rename()
method with a dictionary mapping old names to new names:
df = df.rename(columns={'old_name': 'new_name', 'A': 'Alpha'})
Rename all columns at once by assigning to df.columns
:
df.columns = ['col1', 'col2', 'col3']
Index and Reindexing
Get the index (row labels) or columns:
df.index # returns the index (row labels)
df.columns # returns the column labels
Set a column as the index:
df = df.set_index('column_name')
Reset the index to default integer values:
df = df.reset_index(drop=True)
Reindex to change the order or add new indices:
df = df.reindex([2, 0, 1]) # reorder rows by index
Accessing Values
Access a single value by label or position:
value = df.at[3, 'col1'] # by label
value = df.iat[0, 1] # by position
Set a value:
df.at[3, 'col1'] = 42
df.iat[0, 1] = 99
Changing Data Types
Convert a column to a different type:
df['col1'] = df['col1'].astype(float)
Convert all columns to numeric (invalid parsing will be set as NaN):
df = df.apply(pd.to_numeric, errors='coerce')
Adding/Removing Columns
Add a new column:
df['new_col'] = [1, 2, 3, 4]
Remove a column:
df = df.drop(columns=['col_to_remove'])
Adding/Removing Rows
Add a new row (returns a new DataFrame):
df = pd.concat([df, pd.DataFrame([{'col1': 5, 'col2': 6}])], ignore_index=True)
Remove a row by index:
df = df.drop(index=2)
Summary Table
Operation | Example |
---|---|
Rename columns | df.rename(columns={'A': 'Alpha'}) |
Set column as index | df.set_index('col') |
Reset index | df.reset_index(drop=True) |
Access value (label) | df.at[3, 'col1'] |
Access value (position) | df.iat[0, 1] |
Change dtype | df['col1'].astype(float) |
Add column | df['new'] = ... |
Remove column | df.drop(columns=['col']) |
Add row | pd.concat([...], ignore_index=True) |
Remove row | df.drop(index=2) |
Handy Solutions to Common problems
-
pd.read_csv
documenationExample:
pd.read_csv(name, thousands=',', na_values='---', usecols=['c1', 'c2'], dtype={'c1':np.float64, 'c2':np.int32}, skiprows=3)
- setting NaN (
na_values='---'
), or the comma separator (viathousands=','
for currency - changing all columns to be numeric:
df = df.apply(pd.to_numeric, errors='coerce')
- one can reduce time & memory of reading a file using the argument
usecols=['col1_name', 'col4_name']'
-
dtype
argument can set the types of each columns -
skiprows=5
can be used to skip the first 5 lines of a CSV file if there are some header lines to be ignored - df.sort_index()
- df.sample(n=30)
- df.reset_index(inplace=True)
- df.melt(value_name=’value’, var_name=’source’, ignore_index=False)