Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

In the previous lessons, we learned about random forests, how XGBoost works, and how to tune its hyperparameters. In this final lesson, we’ll put it all together in a complete machine learning pipeline — from raw data exploration to final model evaluation — and answer a question that matters: how much should a California home cost?

Learning Objectives

By the end of this lesson, students will be able to:

  • Build a complete ML pipeline: EDA → preprocessing → model selection → tuning → evaluation

  • Justify model choices using train/test error, cross-validation, and feature analysis

  • Compare Decision Tree, Random Forest, and XGBoost on the same problem

  • Apply the hyperparameter tuning skills from Lesson 3 in a realistic workflow

try:
    import xgboost
except ImportError:
    %pip install xgboost
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error
from xgboost import XGBRegressor
sns.set_theme()

Step 1: Understand the Problem and Data

Before writing a single line of modeling code, a data scientist always asks:

  1. What are we predicting? Median house value (MedHouseVal) — a continuous number. This is a regression problem.

  2. What features do we have? 8 numerical features from the 1990 Census.

  3. How will we measure success? Root Mean Squared Error (RMSE) in units of $100,000.

  4. What’s the baseline? Predicting the mean value for all houses.

Let’s start by exploring the data thoroughly before any modeling.

housing = fetch_california_housing(as_frame=True)
df = housing.frame

print(f"Dataset shape: {df.shape}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nTarget variable statistics:")
print(df["MedHouseVal"].describe())
# Distribution of target variable
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.hist(df["MedHouseVal"], bins=50, edgecolor="white")
ax1.set_xlabel("Median House Value ($100k)")
ax1.set_ylabel("Count")
ax1.set_title("Distribution of House Values")

ax2.hist(np.log1p(df["MedHouseVal"]), bins=50, edgecolor="white", color="salmon")
ax2.set_xlabel("log(Median House Value)")
ax2.set_ylabel("Count")
ax2.set_title("Log-Transformed Distribution")

plt.tight_layout();

The raw distribution is right-skewed with a hard cap at 5.0 ($500,000). The log- transformed distribution is more bell-shaped. For tree-based models like XGBoost, the skew matters less — trees split based on thresholds, not distances — so we’ll use the raw target.

Correlation Analysis

# Correlation matrix
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # show only lower triangle
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1, center=0,
            square=True, linewidths=0.5)
plt.title("Feature Correlation Matrix")
plt.tight_layout();

Which pairs of features are strongly correlated with each other? Which features are most correlated with the target MedHouseVal? Does anything surprise you?

Geographic Distribution

# Plot house values geographically
plt.figure(figsize=(8, 6))
scatter = plt.scatter(
    df["Longitude"], df["Latitude"],
    c=df["MedHouseVal"], cmap="viridis",
    alpha=0.4, s=5
)
plt.colorbar(scatter, label="Median House Value ($100k)")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("California Housing Values by Location")
plt.tight_layout();

The geographic plot reveals a clear spatial pattern: coastal areas (especially the Bay Area and Los Angeles) have much higher home values than inland regions. This explains why Latitude and Longitude appear as important features — the model can learn these geographic clusters.

Step 2: Establish a Baseline

Before training any ML model, we always establish a baseline: the simplest possible prediction. For regression, the baseline is predicting the mean of y for every house. Any model we train must beat this baseline to be useful.

X = df.drop("MedHouseVal", axis=1)
y = df["MedHouseVal"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Baseline: predict the mean for every house
baseline_pred = np.full(len(y_test), y_train.mean())
baseline_rmse = root_mean_squared_error(y_test, baseline_pred)
print(f"Baseline RMSE (predicting mean): {baseline_rmse:.4f}")
print(f"(Roughly ${baseline_rmse * 100_000:,.0f} average prediction error)")

Step 3: Model Progression

Good ML practice is to start simple and add complexity gradually. We’ll train three models in order of complexity:

  1. Decision Tree — our starting point

  2. Random Forest — parallel ensemble to reduce variance

  3. XGBoost — sequential ensemble to reduce bias and variance

For each model, we’ll report both training and testing RMSE to detect overfitting.

results = {}

# Model 1: Decision Tree
dt = DecisionTreeRegressor(max_depth=6, random_state=42)
dt.fit(X_train, y_train)
results["Decision Tree"] = {
    "train": root_mean_squared_error(y_train, dt.predict(X_train)),
    "test":  root_mean_squared_error(y_test,  dt.predict(X_test))
}

# Model 2: Random Forest
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
results["Random Forest"] = {
    "train": root_mean_squared_error(y_train, rf.predict(X_train)),
    "test":  root_mean_squared_error(y_test,  rf.predict(X_test))
}

# Model 3: XGBoost (default parameters)
xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=6,
                   random_state=42, verbosity=0)
xgb.fit(X_train, y_train)
results["XGBoost"] = {
    "train": root_mean_squared_error(y_train, xgb.predict(X_train)),
    "test":  root_mean_squared_error(y_test,  xgb.predict(X_test))
}

print(f"{'Model':<20} {'Train RMSE':>12} {'Test RMSE':>12}")
print("-" * 45)
print(f"{'Baseline (mean)':<20} {'—':>12} {baseline_rmse:>12.4f}")
for name, r in results.items():
    print(f"{name:<20} {r['train']:>12.4f} {r['test']:>12.4f}")

Looking at the results, which model overfits the most (large gap between train and test)? Which model generalizes best? Does the complexity increase always improve test performance?

Step 4: Tune XGBoost

Now we’ll improve on the default XGBoost using a systematic grid search. Remember: we only use the training data here. The test set remains locked until the very end.

param_grid = {
    "max_depth":        [3, 4, 5, 6],
    "learning_rate":    [0.05, 0.10],
    "min_child_weight": [1, 3],
    "subsample":        [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

search = GridSearchCV(
    estimator=XGBRegressor(n_estimators=300, reg_lambda=1,
                           random_state=42, verbosity=0),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",
    cv=5,
    n_jobs=-1,
    verbose=1
)
search.fit(X_train, y_train)

print("\nBest parameters:")
for k, v in search.best_params_.items():
    print(f"  {k}: {v}")
print(f"\nBest CV RMSE: {-search.best_score_:.4f}")

Step 5: Final Evaluation

We’ve tuned our model using cross-validation on the training set. Now — and only now — we evaluate the final model on the test set to report our true performance.

best_xgb = search.best_estimator_
tuned_test_rmse = root_mean_squared_error(y_test, best_xgb.predict(X_test))

print("=" * 50)
print("FINAL MODEL COMPARISON")
print("=" * 50)
print(f"{'Baseline':<25} {baseline_rmse:.4f}")
for name, r in results.items():
    print(f"{name:<25} {r['test']:.4f}")
print(f"{'XGBoost (tuned)':<25} {tuned_test_rmse:.4f}")
print()
improvement = (baseline_rmse - tuned_test_rmse) / baseline_rmse * 100
print(f"Improvement over baseline: {improvement:.1f}%")
# Visualize: actual vs. predicted values for the tuned XGBoost
preds = best_xgb.predict(X_test)

plot_df = pd.DataFrame({"Actual": y_test.values, "Predicted": preds})
sns.scatterplot(plot_df, x="Actual", y="Predicted", alpha=0.3, s=10)
plt.plot([0, 5], [0, 5], "r--", linewidth=2, label="Perfect prediction")
plt.xlabel("Actual Value ($100k)")
plt.ylabel("Predicted Value ($100k)")
plt.title("Tuned XGBoost: Actual vs. Predicted")
plt.legend()
plt.tight_layout();

The red dashed line represents a perfect prediction. Points close to this line indicate accurate predictions; scatter around it shows prediction error.

What patterns do you notice in the residuals? Does the model perform differently at high vs. low house values? What might explain this?

Step 6: Feature Importance Analysis

To understand why our model makes the predictions it does, we can examine feature importances. We’ll compare the importances from all three models.

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
model_list = [("Decision Tree", dt), ("Random Forest", rf), ("XGBoost (tuned)", best_xgb)]

for ax, (name, model) in zip(axes, model_list):
    imp = pd.DataFrame({
        "Feature": X.columns,
        "Importance": model.feature_importances_
    }).sort_values("Importance", ascending=True)
    ax.barh(imp["Feature"], imp["Importance"])
    ax.set_title(name)
    ax.set_xlabel("Feature Importance")

plt.suptitle("Feature Importance Comparison Across Models", fontsize=13)
plt.tight_layout();

Going Deeper: Further Resources

You now have an end-to-end picture of XGBoost. If you want to understand why the algorithm works the way it does — the math behind residuals, structure scores, and pruning — these two resources are the clearest available:

  • The official XGBoost tutorial: Introduction to Boosted Trees. A self-contained, principled derivation with the diagrams used throughout these lessons.

  • StatQuest’s 4-part XGBoost series (video, intuition-first):

StatQuest: XGBoost Part 1 — Regression
  1. Part 1 — Regression (start here)

  2. Part 2 — Classification

  3. Part 3 — Mathematical Details

  4. Part 4 — Crazy Cool Optimizations

Video: StatQuest with Josh Starmer. Click any thumbnail or link to open on YouTube.

Do the three models agree on which features matter most? When they disagree, what might explain the difference? Which features were consistently important across all models?

Summary: Complete ML Pipeline Checklist

Here is the workflow we followed — a template you can apply to any supervised ML problem:

StepActionTool Used
1Understand the problem (regression vs. classification)domain knowledge
2Explore the data (distributions, correlations, missing values)pandas, seaborn
3Establish a baselinemean prediction
4Start simple, add complexity graduallyDecisionTree → RF → XGBoost
5Tune hyperparameters using cross-validationGridSearchCV
6Evaluate the final model on the held-out test set (once!)RMSE
7Interpret the modelfeature importances

What we’ve learned across all 4 lessons:

  • Decision trees are interpretable but suffer from high variance

  • Random forests fix variance through parallel bagging and feature randomness

  • XGBoost goes further: sequential boosting corrects each previous tree’s mistakes

  • Good ML practice involves systematic tuning, honest evaluation, and interpretation


Practice Problems

Problem 1

The model comparison above uses default random forest and decision tree settings, which may not be fair. Apply what you learned in Lesson 1 to train a tuned random forest using GridSearchCV. Compare its best test RMSE to the tuned XGBoost.

Is there still a meaningful gap between tuned RF and tuned XGBoost on this dataset? Under what circumstances might a random forest be preferred despite lower accuracy?

from sklearn.model_selection import GridSearchCV

rf_param_grid = {
    "n_estimators": [100, 200],
    "max_depth":    [10, 20],
    "max_features": ["sqrt", 0.5],
}

# TODO: Create a GridSearchCV for RandomForestRegressor
rf_search = ...
rf_search.fit(X_train, y_train)

print("Best RF params:", rf_search.best_params_)
print(f"Best RF CV RMSE:   {-rf_search.best_score_:.4f}")
print(f"Best RF test RMSE: "
      f"{root_mean_squared_error(y_test, rf_search.predict(X_test)):.4f}")
print(f"XGBoost test RMSE: {tuned_test_rmse:.4f}")

Problem 2

So far we’ve only used the 8 original features. Real-world feature engineering often creates new features from existing ones. Create at least 2 new features from the existing ones (e.g., rooms_per_person = AveRooms / AveOccup) and retrain XGBoost with the same tuned parameters. Does adding these engineered features improve the test RMSE?

df_eng = df.copy()

# Feature engineering ideas:
# TODO: Add at least 2 new features
df_eng["rooms_per_person"]    = df_eng["AveRooms"] / df_eng["AveOccup"]
df_eng["bedrooms_per_room"]   = df_eng["AveBedrms"] / df_eng["AveRooms"]
# df_eng["your_feature"]      = ...

X_eng = df_eng.drop("MedHouseVal", axis=1)
y_eng = df_eng["MedHouseVal"]

X_eng_tr, X_eng_te, y_eng_tr, y_eng_te = train_test_split(
    X_eng, y_eng, test_size=0.2, random_state=42)

# TODO: Train an XGBRegressor with the same best params as the tuned model
xgb_eng = XGBRegressor(...)
xgb_eng.fit(X_eng_tr, y_eng_tr)
eng_rmse = root_mean_squared_error(y_eng_te, xgb_eng.predict(X_eng_te))

print(f"Original features RMSE: {tuned_test_rmse:.4f}")
print(f"Engineered features RMSE: {eng_rmse:.4f}")

Problem 3

Plot a residual analysis for the tuned XGBoost model. The residual for each prediction is actual - predicted. A well-calibrated model should have residuals that are roughly normally distributed around 0 with no obvious patterns.

Create two plots:

  1. A histogram of residuals

  2. A scatter plot of residuals vs. predicted values

Do you see any systematic patterns (e.g., are residuals larger for higher-value homes)? What does this suggest about where the model struggles?

preds_train = best_xgb.predict(X_train)
preds_test  = best_xgb.predict(X_test)
residuals   = y_test.values - preds_test

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# TODO: Plot 1 — histogram of residuals
# Hint: Use .hist(...)
# Hint: Use .axvline(...)

# TODO: Plot 2 — residuals vs. predicted values
# Hint: Use .scatter(...)
# Hint: Use .axhline(...)


plt.tight_layout();

Problem 4

The eval_set parameter in XGBoost’s .fit() method allows early stopping: training halts automatically when the validation error stops improving. This prevents overfitting without needing to specify n_estimators in advance.

Use early stopping with a validation set split from the training data. Set n_estimators=1000 and early_stopping_rounds=50.

How many rounds does the model train before stopping? Compare the RMSE of early-stopped XGBoost to the GridSearchCV-tuned model.

# Split training data into train and validation
X_tr2, X_val, y_tr2, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42)

xgb_early = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    verbosity=0,
    early_stopping_rounds=50,
    eval_metric="rmse"
)
xgb_early.fit(
    X_tr2, y_tr2,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {xgb_early.best_iteration}")
early_rmse = root_mean_squared_error(y_test, xgb_early.predict(X_test))
print(f"Early stopping RMSE: {early_rmse:.4f}")
print(f"Grid search RMSE:    {tuned_test_rmse:.4f}")

Problem 5 (Challenge)

Across all four lessons, we’ve evaluated models using RMSE. But RMSE is just one metric.

Compute the following additional metrics on the tuned XGBoost’s test predictions:

  • MAE (Mean Absolute Error): average of |actual − predicted|

  • R² score: proportion of variance in y explained by the model (1.0 = perfect)

  • MAPE (Mean Absolute Percentage Error): average of |actual − predicted| / actual × 100

Then, answer: if you were reporting your model’s performance to a non-technical audience (e.g., a real estate executive), which metric would you emphasize, and why?

from sklearn.metrics import mean_absolute_error, r2_score

preds_final = best_xgb.predict(X_test)

# TODO: Compute each metric
mae  = ...
r2   = ...
mape = ...   # hint: np.mean(np.abs((y_test.values - preds_final) / y_test.values)) * 100

print(f"RMSE: {root_mean_squared_error(y_test, preds_final):.4f} ($100k)")
print(f"MAE:  {mae:.4f} ($100k)")
print(f"R²:   {r2:.4f}")
print(f"MAPE: {mape:.2f}%")