Machine Learning

In this section, we will predict earthquake magnitudes using two scikit-learn models: LinearRegression and DecisionTreeRegressor.

Dataset: Earthquakes and Countries

For today’s activity, we will use the earthquake data from the lectures on pandas!

Earthquake: id (str), year (int), month (int), day (int), latitude (float), longitude (float), name (str), magnitude (float)
Countries: POP_EST (float), GDP_MD (int), CONTINENT (str), SUBREGION (str), geometry (geometry)

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error

# Load in Earthquake data
earthquakes = pd.read_csv("earthquakes.csv").set_index("id")
earthquakes = gpd.GeoDataFrame(
    earthquakes,
    # crs="EPSG:4326" specifies WGS84 or GPS coordinate system, see https://epsg.io/4326
    geometry=gpd.points_from_xy(earthquakes["longitude"], earthquakes["latitude"], crs="EPSG:4326")
)
earthquakes["month"] = earthquakes["month"].astype('category')

# Load in Country data
columns = ["POP_EST", "GDP_MD", "CONTINENT", "SUBREGION", "geometry"]
countries = gpd.read_file("ne_110m_admin_0_countries.shp").set_index("NAME")[columns]

Before we create our models, let’s take a look at where the earthquakes in our data are located. This will be important later when we interpret the results of our models. Run the cell below to see our plot!

fig, ax = plt.subplots(figsize=(13, 5))
countries.plot(ax=ax, color="#EEE")
earthquakes.plot(ax=ax, column="magnitude", markersize=0.1, legend=True)
ax.set(title="Earthquakes between July 27, 2016 and August 25, 2016")
ax.set_axis_off()

Predicting Earthquake Magnitude

Now that we have a decent sense of the data we are working with, let’s start our main task: predicting earthquake magnitudes using a LinearRegression model!

Iteration 1: Using `longitude` and `latitude`

For our first model, let’s try using longitude and latitude. Location seems like the most intuitive predictor, so let’s start from there!

Linear Regression Using Longitude and Latitude

Below, fill in the code cells to perform each task.

First, we need to split our earthquake data into training and testing sets. Here, our

Features contain [longitude, latitude],
Label is magnitude,
test_size=0.2

# TODO: Split the Earthquake data into the training, testing sets
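
As a rough illustration of the split described above, here is a minimal sketch on a small synthetic stand-in for the earthquake table. The `toy` DataFrame, its random values, and `random_state=0` are assumptions made for this example only; they are not part of the activity's data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the earthquakes table (synthetic values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=100),
    "latitude": rng.uniform(-90, 90, size=100),
    "magnitude": rng.uniform(0, 7, size=100),
})

X = toy[["longitude", "latitude"]]  # features
y = toy["magnitude"]                # label

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

With 100 rows and test_size=0.2, 80 rows go to the training set and 20 to the testing set.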

Now that we have done our train-test split, all we need to do is fit and analyze the model! Store the predictions for the test set in y_predict, since the plotting cell below uses that name.

# TODO: Fit the model
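
One possible shape for the fit-and-evaluate step is sketched below, again on synthetic data rather than the real earthquake table. The `toy` DataFrame and `random_state=0` are assumptions for illustration; the name `y_predict` is chosen to match the plotting cell in this activity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the earthquakes table (synthetic values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=100),
    "latitude": rng.uniform(-90, 90, size=100),
    "magnitude": rng.uniform(0, 7, size=100),
})
X_train, X_test, y_train, y_test = train_test_split(
    toy[["longitude", "latitude"]], toy["magnitude"],
    test_size=0.2, random_state=0,
)

# Fit on the training set only, then predict on the held-out test set
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Mean squared error: average squared gap between observed and predicted values
mse = mean_squared_error(y_test, y_predict)
print(f"Test MSE: {mse:.3f}")
```

A lower MSE means the predictions sit closer to the observed magnitudes. On this synthetic data the label is random noise, so the number itself carries no meaning.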



# Do not alter
grid = sns.relplot(x=y_test, y=y_predict)
grid.set(title="Predicted Magnitude v. Observed Magnitude",
       xlabel="Observed Magnitude (test data)",
       ylabel="Predicted Magnitude (predictions)",
       yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')

Iteration 2: Using `longitude`, `latitude`, `name`, and `month`

For our second model, let’s incorporate some of our categorical variables in addition to longitude and latitude. With categorical inputs such as name and month, the relationship between magnitude (our output) and our inputs is unlikely to be captured well by a simple linear relationship. Instead, let’s use a non-linear regression model: a DecisionTreeRegressor! Note that because name and month are categorical rather than continuous numeric values, we’ll need to create dummy variables so sklearn can handle them properly.

Decision Tree Regression Using Longitude, Latitude, Name, and Month

Below, fill in the code cells to perform each task.

First, as before, we need to select our features and label, and then do a train-test split. Here, our

Features contain [longitude, latitude, name, month]
Label is magnitude,
test_size=0.2

However, unlike before, we also need to create “dummy” variables (one-hot encode) for our features since they contain categorical data.

# TODO: Create "dummy" variables for categorical features
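
To see what one-hot encoding does, here is a minimal sketch using `pd.get_dummies` on a four-row table. The `toy` values are made up for illustration and are not taken from the earthquake data.

```python
import pandas as pd

# Hypothetical four-row feature table (made-up values)
toy = pd.DataFrame({
    "longitude": [-122.3, 139.7, -70.6, 28.9],
    "latitude": [47.6, 35.7, -33.4, 41.0],
    "name": ["Washington", "Japan", "Chile", "Turkey"],
    "month": pd.Series([7, 8, 8, 7]).astype("category"),
})

# get_dummies leaves numeric columns alone and replaces each string or
# category column with one indicator column per distinct value
X = pd.get_dummies(toy)
print(X.columns.tolist())
```

Each categorical column becomes several indicator columns named `<column>_<value>`, such as `name_Japan` or `month_8`, while longitude and latitude pass through unchanged.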

# TODO: Split the Earthquake data into the training, testing sets

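Now, let’s fit and analyze the model! Name the fitted model model and store the test-set predictions in predictions, since the cells below use those names.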

# TODO: Fit the model
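
The whole second iteration can be sketched end to end as follows. This is an illustration on synthetic data, not the activity's solution: the `toy` DataFrame, the three place names, and both `random_state=0` settings are assumptions. The names `model`, `X`, and `predictions` are chosen to match the later cells in this activity.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the earthquakes table (synthetic values)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=n),
    "latitude": rng.uniform(-90, 90, size=n),
    "name": rng.choice(["Alaska", "California", "Japan"], size=n),
    "month": pd.Series(rng.choice([7, 8], size=n)).astype("category"),
    "magnitude": rng.uniform(0, 7, size=n),
})

# One-hot encode the categorical features, then split
X = pd.get_dummies(toy[["longitude", "latitude", "name", "month"]])
y = toy["magnitude"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the tree on the training set and predict on the held-out test set
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
print(f"Test MSE: {mse:.3f}")
```

An unconstrained tree like this one can fit the training rows very closely, so comparing the training and test MSE is one way to check for overfitting.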



# Do not alter
grid = sns.relplot(x=y_test, y=predictions)
grid.set(title="Predicted Magnitude v. Observed Magnitude",
       xlabel='Observed Magnitude (test data)',
       ylabel='Predicted Magnitude (predictions)',
       yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')

Next, let’s visualize the decision tree that we used for our model! It’s okay if you don’t understand all of the code. The plot serves as a step-by-step view of how our model makes decisions. Since max_depth is set to 2 in the call to plot_tree, only the top levels of the tree are drawn, so we don’t see all of the decisions that our model makes. That’s okay!

plt.figure(dpi=300)
plot_tree(
    model,
    feature_names=list(X.columns),  # a plain list works across sklearn versions
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False,
    max_depth=2,
    fontsize=5
) and None # Hide return value of plot_tree

Debrief

How did our models do at predicting the magnitude of our earthquakes?

Why do you think this was the case?
