In this section, we will be predicting the magnitude of earthquakes using LinearRegression and DecisionTreeRegressor.
Dataset: Earthquakes and Countries
For today’s activity, we will be utilizing the earthquake data from the lectures on pandas!
- Earthquake: id (str), year (int), month (int), day (int), latitude (float), longitude (float), name (str), magnitude (float)
- Countries: POP_EST (float), GDP_MD (int), CONTINENT (str), SUBREGION (str), geometry (geometry)
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_squared_error
# Load in Earthquake data
earthquakes = pd.read_csv("earthquakes.csv").set_index("id")
earthquakes = gpd.GeoDataFrame(
    earthquakes,
    # crs="EPSG:4326" specifies WGS84 or GPS coordinate system, see https://epsg.io/4326
    geometry=gpd.points_from_xy(earthquakes["longitude"], earthquakes["latitude"], crs="EPSG:4326")
)
earthquakes["month"] = earthquakes["month"].astype('category')
# Load in Country data
columns = ["POP_EST", "GDP_MD", "CONTINENT", "SUBREGION", "geometry"]
countries = gpd.read_file("ne_110m_admin_0_countries.shp").set_index("NAME")[columns]
Before we continue with creating our models, let’s take a look at where our earthquakes are located. This will be important later when we interpret the results of our models. Run the cell below to see our plot!
fig, ax = plt.subplots(figsize=(13, 5))
countries.plot(ax=ax, color="#EEE")
earthquakes.plot(ax=ax, column="magnitude", markersize=0.1, legend=True)
ax.set(title="Earthquakes between July 27, 2016 and August 25, 2016")
ax.set_axis_off()
Predicting Earthquake Magnitude
Now that we have a decent sense of the data we are working with, let’s start our main task: predicting earthquake magnitudes using a LinearRegression model!
Iteration 1: Using longitude and latitude
For our first model, let’s try using longitude and latitude. Location seems like the most intuitive predictor, so let’s start from there!
Linear Regression Using Longitude and Latitude
Below, fill in the code cells to perform each task.
First, we need to split our earthquake data into training and testing sets. Here, we use:
- Features: [longitude, latitude]
- Label: magnitude
- test_size=0.2
# TODO: Split the Earthquake data into the training and testing sets
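The split described above can be sketched roughly as follows. This is only an illustration on a synthetic stand-in table (the toy values and random_state=0 are assumptions made for the example, not part of the activity data); the same pattern should apply to the real earthquakes table.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the earthquakes table (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=100),
    "latitude": rng.uniform(-90, 90, size=100),
    "magnitude": rng.uniform(0, 7, size=100),
})

X = toy[["longitude", "latitude"]]  # features
y = toy["magnitude"]                # label

# Hold out 20% of the rows for testing.
# random_state is an arbitrary choice so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Keeping the features as a two-column DataFrame (double brackets) rather than a Series matters here, because sklearn models expect a 2D feature matrix.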
Since we have now done our train-test split, all we need to do is fit and analyze the model!
# TODO: Fit the model
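As a minimal sketch of the fit-and-predict step, assuming a synthetic stand-in for the earthquake table (the toy data, its made-up linear trend, and random_state=0 are illustrative assumptions), fitting a LinearRegression and scoring it with mean_squared_error might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the earthquakes table (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=200),
    "latitude": rng.uniform(-90, 90, size=200),
})
# Made-up label: a weak linear trend in latitude plus noise.
toy["magnitude"] = 3 + 0.01 * toy["latitude"] + rng.normal(0, 0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    toy[["longitude", "latitude"]], toy["magnitude"],
    test_size=0.2, random_state=0,
)

# Fit on the training set only, then predict on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

# Mean squared error on the test set: lower is better.
test_mse = mean_squared_error(y_test, y_predict)
print(f"Test MSE: {test_mse:.3f}")
```

The name y_predict is chosen to match the variable used by the plotting cell that follows.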
# Do not alter
grid = sns.relplot(x=y_test, y=y_predict)
grid.set(title="Predicted Magnitude v. Observed Magnitude",
         xlabel="Observed Magnitude (test data)",
         ylabel="Predicted Magnitude (predictions)",
         yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')
Iteration 2: Using longitude, latitude, name, and month
For our second model, let’s try incorporating some of our categorical variables in addition to longitude and latitude. Because we’ll be incorporating categorical variables, it’s unlikely that a simple linear relationship can capture how magnitude (our output) depends on longitude, latitude, name, and month (our inputs). Instead, let’s use a non-linear regression model: a DecisionTreeRegressor! Note that because name and month are not numeric, we’ll need to create dummy variables so sklearn can handle them properly.
Decision Tree Regression Using Longitude, Latitude, Name, and Month
Below, fill in the code cells to perform each task.
First, as before, we need to filter for our features and label, and then do a train-test split. Here, we use:
- Features: [longitude, latitude, name, month]
- Label: magnitude
- test_size=0.2
However, unlike before, we also need to create “dummy” variables (one-hot encode) for our features since they contain categorical data.
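To illustrate what one-hot encoding does, here is a small sketch using pd.get_dummies on a four-row toy table (the rows are invented for the example). Numeric columns pass through unchanged, while each distinct value of a string or category column becomes its own indicator column.

```python
import pandas as pd

# Tiny made-up table with two numeric and two categorical features.
toy = pd.DataFrame({
    "longitude": [-122.3, 139.7, -70.6, -122.3],
    "latitude": [47.6, 35.7, -33.4, 47.6],
    "name": ["Washington", "Japan", "Chile", "Washington"],
    "month": pd.Series([7, 8, 8, 7]).astype("category"),
})

# String and category columns are expanded into one indicator column
# per distinct value; numeric columns are left as they are.
X = pd.get_dummies(toy)
print(list(X.columns))
```

Note that month is only encoded because it was cast to a category dtype; a plain integer month column would be treated as numeric and left unchanged.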
# TODO: Create "dummy" variables for categorical features
# TODO: Split the Earthquake data into the training and testing sets
Now, let’s fit and analyze the model!
# TODO: Fit the model
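A hedged sketch of the whole second iteration, again on synthetic data (the region names, the made-up relationship between name and magnitude, and random_state=0 are all assumptions for illustration), could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the earthquakes table (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=n),
    "latitude": rng.uniform(-90, 90, size=n),
    "name": rng.choice(["Alaska", "California", "Japan"], size=n),
    "month": pd.Series(rng.choice([7, 8], size=n)).astype("category"),
})
# Made-up label: each region gets its own typical magnitude, plus noise.
base = toy["name"].map({"Alaska": 2.0, "California": 1.0, "Japan": 4.5})
toy["magnitude"] = base + rng.normal(0, 0.3, size=n)

# One-hot encode the categorical features, then split.
X = pd.get_dummies(toy[["longitude", "latitude", "name", "month"]])
y = toy["magnitude"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the tree on the training set and predict on the test set.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

test_mse = mean_squared_error(y_test, predictions)
print(f"Test MSE: {test_mse:.3f}")
```

Encoding before splitting keeps the training and testing sets on the same set of dummy columns. An unrestricted tree like this one can fit the training set almost perfectly, so the test-set error is the more honest measure of how well it predicts.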
# Do not alter
grid = sns.relplot(x=y_test, y=predictions)
grid.set(title="Observed Magnitude v. Predicted Magnitude",
         xlabel='Observed Magnitude (test data)',
         ylabel='Predicted Magnitude (predictions)',
         yticks=list(range(0, 8)), xticks=list(range(0, 8)))
grid.ax.axline((0, 0), slope=1, color='k', ls='--')
Next, let’s visualize the decision tree that we used for our model! It’s okay if you don’t understand all of the code. The plot serves as a step-by-step view of how our model makes decisions.
Since the max_depth passed to plot_tree is set to 2, the plot only shows the top levels of the tree, so we don’t see all of the decisions that our model makes, and that’s okay!
plt.figure(dpi=300)
plot_tree(
    model,
    # Pass a plain list; some sklearn versions reject a pandas Index here
    feature_names=list(X.columns),
    label="root",
    filled=True,
    impurity=False,
    proportion=True,
    rounded=False,
    max_depth=2,
    fontsize=5
) and None # Hide return value of plot_tree
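If the plot is hard to read, the same decisions can also be printed as text. The sketch below uses sklearn's export_text on a small synthetic tree (the toy data and the step in magnitude at latitude 0 are made up for illustration); its max_depth argument truncates the printout in the same spirit as the max_depth argument to plot_tree.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: magnitude jumps from about 2 to about 4 north of the equator.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "longitude": rng.uniform(-180, 180, size=100),
    "latitude": rng.uniform(-90, 90, size=100),
})
y = np.where(X["latitude"] > 0, 4.0, 2.0) + rng.normal(0, 0.1, size=100)

model = DecisionTreeRegressor(random_state=0).fit(X, y)

# Print only the top levels of the tree as indented if/else rules.
rules = export_text(model, feature_names=list(X.columns), max_depth=2)
print(rules)
```

Each indented line is one split (for example, a threshold on latitude), and deeper branches beyond max_depth are summarized rather than printed in full.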
Debrief
How did our models do at predicting the magnitude of our earthquakes?
Why do you think this was the case?