DataFrame

Creating a DataFrame
DataFrame Indexing Methods
Iterating a DataFrame
Accessing/Changing Attributes & Values
Handy Solutions to Common problems

Here we will review common and useful actions with a DataFrame.

First, it is helpful to understand how the Pandas DataFrame documentation is structured. Painfully, it is not alphabetic! However, wonderfully, it is structured by activity.

Creating a DataFrame

Creating from a list of dictionaries
Each dictionary is a row in the DataFrame. Each key in the dictionary is the column name with a value for the cell.

d1 = { 'City':'New York', 'Pop': 8.6, 'Rating': 3 }
d2 = { 'City':'Seattle', 'Pop': .8, 'Rating': 4 }
d3 = { 'City':'Los Angeles', 'Pop': 3.9, 'Rating': 3.5 }
d4 = { 'City':'Houston', 'Pop': 2.1, 'Rating': 4.1 }
my_list = [d1, d2, d2, d4]
pd = pd.DataFrame(my_list)

# The above creates a DataFrame with 4 rows, 3 columns.
# The column names are: 'City', 'Pop', and 'Rating'

Creating from a list of lists
In this list of lists, each inner list is a row.

# This will have 5 rows, 3 columns, using default indexing.
# The column names are provided in the constructor below
data = [["New York", 8.6, 20],
        ["Chicago", 2.7, 20],
        ["Los Angeles", 3.9, 20],
        ["Philadelphia", 1.5, 20],
        ["Houston", 2.1, 20]]
 
# Give names to the columns when constructing
df = pd.DataFrame(data, columns=["City", "Population", "Year(2020)"])

Creating by adding columns
Here, we create an empty DataFrame and then add on columns.

df = pd.DataFrame()
# add a column with the name 'seat_number'
df['seat_number'] = [ 1, 2, 3, 4 ]
# add a columns with the name 'student_id`
df['student_id'] = [ 1113, 1125, 1427, 1211]

# df has 4 rows, 2 columns, using default indexing.

Creating from a dictionary
The keys are the column names, the values are lists of column values.

data = { "New York": [8.6, 20],
         "Chicago": [2.7, 20],
         "Los Angeles": [3.9, 20],
         "Philadelphia": [1.5, 20],
         "Houston": [2.1, 20]
       }
# creates 5 columns, 2 rows, default indexing
df = pd.DataFrame(data)

The structure of the dictionary can take many forms. To see the various structures take a look at the converse function DataFrame.to_dict().

Creating from a List
This is highly unusual, but possible.

# to get a dataframe with 1 column from a list of values
list = [ 2, 4, 6, 8 ]
df = pd.DataFrame(list)
# The above code creates 4 rows, 1 column, using default indexing. Column name is 0.

DataFrame Indexing Methods

We will explore four methods that index data within a Pandas DataFrame. The abundance of approaches can be overwhelming and this guide aims to provide clarity. For additional information, refer to the Pandas DataFrame Documentation.

API	Index/Position	Comment
`df[]`	rows by position, columns by name	One argument only
`df.loc[]`	by index	`[inclusive : inclusive]`
`df.iloc[]`	by position	Columns by position, too
`df.take()`	by position	`axis=0` take rows `axis=1` take columns

Except for loc the API are [inclusive : exclusive].

Argument Types

Type	Example	Notes
Number	`df.iloc[3]` `df.loc[3]`	Describes the position, or, index
Numerical Slice	`df[0:3]` `df.loc[1:3]`	A slice using numbers (positional) (index)
Numerical List	`df.take([0, 3])` `df.loc[[1,3]]`	List of numbers (positional) (index)
Mask	`df[[True, False, True]]`	Any iterable of True/False values
String	`df['A']`	A column name (or index value)
String Slice	`df.loc[:, 'C':'words']`	Represents a range of columns
String List	`df[['C','D','words']]`	A list of column names

API Details

df[] takes a single argument only.
Slices are [ Inclusive : Exclusive ].
The most common usage is to get a Series (column) from the DataFrame via df['col_name'].
One can add a column to a DataFrame with df['new_column'] = list_of_values

It is important to note here that when a numerical slice is used, the values represent positions, not the index. When string values are used, the values represent column names.

Valid Argument Types:

Numerical Slice: represents the Row Positions
Mask: represents which Rows to keep
String: column name
String List: list of column names

Invalid Argument Types:

Numerical List: You cannot use df[[0, 5, 8]] to access a list of rows.
String Slice: You cannot use df['first':'exclude_me'] to access a slice of columns. Instead, you must use a list of strings.

Examples:

Valid	Results	Invalid	Note
`df[0:1]`	DataFrame All columns, 1 row	`df[0]`	single number NOT allowed
`df[1:3]`	DataFrame All columns, 2 rows	`df[[0, 3]]`	List must be column names
`df[2:10]`	DataFrame All columns. Many rows		All rows at those positions. If the slice is completely out of bounds, an empty DataFrame is returned.
`df[[True, False]]`	DataFrame All columns, rows where mask==True		Length of mask must be len(df)
`df[['C', 'D']]`	DataFrame All rows with columns ‘C’ and ‘D’		The columns are ordered as they appear in the list.
`df[['C']]`	DataFrame All rows, one column ‘C’		A list of just 1 still results in a DataFrame.
`df['C']`	Series. Column ‘C’	`df['C':'words']`	Column slices NOT allowed.

Note that, except for the last example, all valid API return a DataFrame.

df.loc[rows, cols] takes one or two arguments.
To get all rows, use an ‘empty’ slice: df.loc[ : , 'Name']
Each argument can describe multiple values or a single value.
The type returned depends on the argument types (vectors vs scalar [1][2]).
- If both arguments are vectors, a DataFrame is returned.
- If exactly one argument is a vector, then a Series is returned.
- If both arguments are scalar, then the contents of the single cell is returned.
Slices are [ Inclusive : Inclusive ].
One can set a value. df.loc[1, 'Name'] = 5 will set the DataFrame’s value to 5 at index 1 and row ‘Name’.

The first argument describes the indices of the rows desired.
The second argument describes the column names desired.

Valid Argument Types:

Slice: For rows, the slice must match the index type. For columns, the slice indicates the names of the columns to include and every column between.
Mask: Represents which rows/cols to keep. Mask is an iterable of True/False values (such as a Series or a list)
List: A list of indices or column names to include.

Examples:

Valid	Note
`df.loc[0:3]`	Rows with indices 0 & 3, and everything in between
`df.loc[11:5]`	Since indices don’t need to be sorted, this will have index 11 as the first row and index 5 as the last row, and all rows that are positioned between them.
`df.loc[mask1, mask2]`	Gets all the rows where mask1 is True, and all the columns where mask2 is True. len(mask1) == count of rows. len(mask2) == count of columns.
`df.loc[ : , df.columns != 'Bad']`	Gets all the rows and all the columns except for ‘Bad’. df.columns is a list of column names. The result of comparing the list to ‘Bad’ is a list of True/False values… a mask
`df.loc[ [0, 8, 11], ['One', 'Three'] ]`	Each list represents which row/column to keep. The result has the same order as that provided in the list.

```{admonition} Why Inclusive? :class: note The reason this API is inclusive is because one will often want to get a set of columns that one knows about. It is better to say:

“I want all the columns starting with ‘One’ and ending with ‘Last’”

than it is to say:

“I want all the columns starting with ‘One’ and goes until the column after the ‘Last’ column that I want named… gosh, what is the name of the column that I don’t want?

The same can be true for rows when the index is not sorted.
```

- df.iloc[rows, cols] takes 1 or 2 arguments.
- Arguments are primarily integers that represent position, but can also be a boolean mask.
- Arguments can also be a lambda for with the argument of the lambda is the calling object, either a DataFrame or Series.
- There are no string arguments.
Valid Argument Types:
- integer that represents the position of the row or column.
- Numerical Slice that represents the positions of the rows or columns.
- Numerical List that represents the positions of the rows or columns.
df.take() is primarily a positional based indexer/slicer. It will return the columns or rows in the position specified by the inputs, which will either be a single value, list, or slice. It has another parameter, axis, which is set to 0 by default. The default parameter refers to the index/rows, meaning we are selecting rows to extract. If you set axis to 1, then it will select columns.

Since df.take() is positional based, you cannot input labels. They must be integers. Additionally, it ignores whatever custom index you may have applied, even if it is an integer index. It will use the underlying positional index to collect the data. Remember, you need to include the square brackets, because the function takes only lists.
```
'''
Series based on a row
'''
s1 = df.take([1]) # returns a series representing row 1

'''
Series based on a column
'''
s2 = df.take([1], axis=1) # returns a series representing column 1

'''
DataFrame based on rows
'''
d1 = df.take([1, 3, 0]) # returns a dataframe containing rows 1, 3, and 0

# Setting `axis` to 1 indicates whether the list of indexes is for **rows** or **columns**
d2 = df.take([1, 3, 0], axis=1) # returns a dataframe containing columns 1, 3, and 0
```
Exceptions and Oddities
- Make sure to look out for IndexError exceptions, which will arise if you input a number that is not in the index. This means you can’t use len(df) as a parameter in order to get the last row of the dataframe.
- You can use negative indices. This works similarly to slicing in lists in base python, where the -1 index corresponds to the last value of the set.
- df.take([0]) returns the first (0 position) row.

Examples

df = Default Index

Result	API
	`df[0:1]` `df.loc[0:0]` `df.iloc[0:1]` `df.take([0])`
	`df[1:3]` `df.loc[1:2]` `df.iloc[1:3]` `df.take([1,2])`
	`mask=[True,False,True,False]` `df[mask]` `df.loc[mask]` `df.loc[[0, 2]]` `df.iloc[mask]` `df.iloc[[0, 2]]` `df.take([0, 2])` `df[[0, 2]]` FAILS!!
	`mask=[False,False,True,True,True,False]` `df[['C', 'D', 'words']]` `df.loc[:, mask]` `df.loc[:, 'C':'words']` `df.loc[:, ['C', 'D', 'words']]` `df.iloc[:, mask]` `df.iloc[:, [0, 2]]` `df.take([2, 3, 4], axis=1)` `df['C':'words']` FAILS!!
	`df.loc[:, mask]` `df.loc[:, 'A':'A']` `df.loc[:, ['A']]` `df.iloc[:, 0:1]` `df.iloc[:, mask]` `df.iloc[:, [0]]` `df.take([0], axis=1)` `df.loc[:, 'A']` WRONG!! Returns Series
Series: Column ‘A’ <pre>0 1 1 2 2 3 3 4 Name: A, dtype: int64</pre>	`df['A']` `df.loc[:, 'A']` `df.loc[:, mask]` `df.iloc[:, 0]` `df.iloc[:, mask]`
Series: 2nd Row with index 1 <pre>A 2 B 6 C 10 D 14 words cat jumpy 2 Name: 1, dtype: object0</pre>	`df.loc[1, :]` `df.iloc[1, :]`
Scalar Value: `7`	`df.loc[2, 'B']` `df.iloc[2, 1]`

DataFrame Result API

df[0:1]
df.loc[9:9]
df.iloc[0:1]
df.take([0])

df[1:3]
df.loc[2:11]
df.iloc[1:3]
df.take([1,2])
DataFrame Result API

df[0:1]
df.loc['ax']
df.iloc[0:1]
df.take([0])

df[1:3]
df.loc['cat':'dog']
df.iloc[1:3]
df.take([1,2])

Footnotes

[1] Vector is intended to a data structure that can hold multiple values such as a list, slice, or array, even when these structures contain a single value.
[2] Scalar is intended to mean a single value such as a string or integer. It is NOT a list, slice, or array.

Iterating a DataFrame

There are several ways to iterate through a DataFrame.

# iterates through the columns
for n in df2:
    print(n)

# iterates through the tuples (col_name, col_series)
for col_name, col_series in df2.iteritems():
    print(col_name)
    print(col_series)

# identical to the above
for col_name, col_series in df2.items():
    print(col_name)
    print(col_series)

Accessing/Changing Attributes & Values

You can access and modify DataFrame attributes and values in several ways:

Renaming Columns

Rename columns using the rename() method with a dictionary mapping old names to new names:

df = df.rename(columns={'old_name': 'new_name', 'A': 'Alpha'})

Rename all columns at once by assigning to df.columns:

df.columns = ['col1', 'col2', 'col3']

Index and Reindexing

Get the index (row labels) or columns:

df.index      # returns the index (row labels)
df.columns    # returns the column labels

Set a column as the index:

df = df.set_index('column_name')

Reset the index to default integer values:

df = df.reset_index(drop=True)

Reindex to change the order or add new indices:

df = df.reindex([2, 0, 1])  # reorder rows by index

Accessing Values

Access a single value by label or position:

value = df.at[3, 'col1']      # by label
value = df.iat[0, 1]          # by position

Set a value:

df.at[3, 'col1'] = 42
df.iat[0, 1] = 99

Changing Data Types

Convert a column to a different type:

df['col1'] = df['col1'].astype(float)

Convert all columns to numeric (invalid parsing will be set as NaN):

df = df.apply(pd.to_numeric, errors='coerce')

Adding/Removing Columns

Add a new column:

df['new_col'] = [1, 2, 3, 4]

Remove a column:

df = df.drop(columns=['col_to_remove'])

Adding/Removing Rows

Add a new row (returns a new DataFrame):

df = pd.concat([df, pd.DataFrame([{'col1': 5, 'col2': 6}])], ignore_index=True)

Remove a row by index:

df = df.drop(index=2)

Summary Table

Operation	Example
Rename columns	`df.rename(columns={'A': 'Alpha'})`
Set column as index	`df.set_index('col')`
Reset index	`df.reset_index(drop=True)`
Access value (label)	`df.at[3, 'col1']`
Access value (position)	`df.iat[0, 1]`
Change dtype	`df['col1'].astype(float)`
Add column	`df['new'] = ...`
Remove column	`df.drop(columns=['col'])`
Add row	`pd.concat([...], ignore_index=True)`
Remove row	`df.drop(index=2)`

Handy Solutions to Common problems

pd.read_csv documenation

Example: pd.read_csv(name, thousands=',', na_values='---', usecols=['c1', 'c2'], dtype={'c1':np.float64, 'c2':np.int32}, skiprows=3)
setting NaN (na_values='---'), or the comma separator (via thousands=',' for currency
changing all columns to be numeric: df = df.apply(pd.to_numeric, errors='coerce')
one can reduce time & memory of reading a file using the argument usecols=['col1_name', 'col4_name']'
dtype argument can set the types of each columns
skiprows=5 can be used to skip the first 5 lines of a CSV file if there are some header lines to be ignored
df.sort_index()
df.sample(n=30)
df.reset_index(inplace=True)
df.melt(value_name=’value’, var_name=’source’, ignore_index=False)

DataFrame	Result	API
		`df[0:1]` `df.loc[9:9]` `df.iloc[0:1]` `df.take([0])`
		`df[1:3]` `df.loc[2:11]` `df.iloc[1:3]` `df.take([1,2])`

DataFrame	Result	API
		`df[0:1]` `df.loc['ax']` `df.iloc[0:1]` `df.take([0])`
		`df[1:3]` `df.loc['cat':'dog']` `df.iloc[1:3]` `df.take([1,2])`