Regression - Distance Study

In this study, we will use Machine Learning to create a model to predict the distance a person can throw a baseball.

Predicting Distance

In this hypothetical example, we will use the data found in the Distance Case Study; please go to that page to view and learn more about the data. To summarize, the data records how far each person can throw a baseball. The features include: age, gender, and sport.

Since we are predicting distance, a real number, we need a model that performs regression. Examine all the tabs below to see the Data, the Code, the results of our two models, and graphs of the models’ predictions.
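
As a point of reference, fitting these two kinds of regressors with scikit-learn looks roughly like this. This is a minimal sketch, not the case study’s exact code: it assumes features and labels are the dummified feature matrix and distance labels, and the train/test split is an illustrative choice.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.tree import DecisionTreeRegressor

    # hold out some data so the MSE reflects unseen examples
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
    for name, m in [('LinearRegression', LinearRegression()),
                    ('DecisionTreeRegressor', DecisionTreeRegressor())]:
        m.fit(X_train, y_train)
        print(f'{name} MSE : {mean_squared_error(y_test, m.predict(X_test)):.2f}')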

Summary

The graphs presented in the Model Graphs tab are the most insightful. They provide a lot of information about how each model makes its predictions. While the MSE values help us determine that the LinearRegression model is the better predictor, MSE alone falls far short of providing real insight.

Listing the coefficients and feature importances helps us understand which features have the biggest impact on the predictions. But, again, the real value comes from the Model Graphs.

  • Please look at the Distance Case Study. Here is a scatter plot of all the data; we can see it is pretty messy. There are some patterns one can extract, but very little is obvious.
    All Distance Data
    Code to generate plot

    import seaborn as sns
    import matplotlib.pyplot as plt

    # df comes from load_data(); color points by sport, size them by gender
    sns.scatterplot(x=df['age'], y=df['distance'], hue=df['sport'], size=df['gender'])
    plt.title('All Distance Data')
    plt.legend(loc='upper left', bbox_to_anchor=(1.0, 1.1))
    
  • # load the case-study data: the raw df plus dummified features and labels
    df, features, labels = load_data()

    print('LinearRegression Info')
    model = model_regression(features, labels, linear=True)   # fit the linear model
    show_coefficients(model, features)
    print('\nDecisionTreeRegressor Info')
    dtm = model_regression(features, labels, linear=False)    # fit the tree model
    show_importance(dtm, features)
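
    The helpers load_data, model_regression, show_coefficients, and show_importance come from the case study code. As a rough sketch, the two show_* helpers could look like this, assuming features is a DataFrame whose columns line up with the fitted model:

    def show_coefficients(model, features):
        # sketch: pair each feature column with its learned coefficient
        print(f'Intercept: {model.intercept_:.0f}')
        for name, coef in zip(features.columns, model.coef_):
            print(f'{name} : {coef:.3f}')

    def show_importance(model, features):
        # sketch: feature_importances_ sums to 1.0 across all features
        for name, imp in zip(features.columns, model.feature_importances_):
            print(f'Feature: {name}, Importance: {imp:.2%}')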
    
  • Let’s look at the MSE (Mean Squared Error) for both the LinearRegression model and the DecisionTreeRegressor. We see that the LinearRegression model does much better, with an MSE of about 274, while the DecisionTreeRegressor has an MSE of about 453.

    We can see from the coefficients learned by the LinearRegression model that being male adds about 17.8 to the predicted distance. Playing baseball adds about 15.3, while playing no sport at all subtracts about 6.1. Each additional year of age reduces the prediction by about 0.47.

    The DecisionTreeRegressor reveals how important it considers each feature. Most important to the model is age at about 50%, then gender at about 29%, followed by the sport_Baseball, sport_None, and sport_Track dummy features.

    The LinearRegression model is much more transparent about how it makes its predictions. Approximately:
    $dist \approx 54 + 17.8\,male - 0.47\,age + 15.3\,Baseball + 4.9\,Football - 6.1\,None - \dots$
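
    Using these rounded coefficients, a 30-year-old male baseball player, for example, would be predicted to throw about $54 + 17.8 - 0.47 \cdot 30 + 15.3 \approx 73$.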

  • LinearRegression Info

    MSE : 273.95
    Intercept: 54
    male : 17.787
    age : -0.466
    sport_Baseball : 15.329
    sport_Basketball : -0.119
    sport_Cycling : -1.767
    sport_Football : 4.918
    sport_Golf : -1.514
    sport_Hockey : 0.182
    sport_None : -6.140
    sport_Soccer : -2.176
    sport_Swimming : -3.630
    sport_Tennis : -1.729
    sport_Track : -1.290
    sport_XCountry : -2.064

    DecisionTreeRegressor Info
    MSE : 452.93
    Feature: male, Importance: 28.59%
    Feature: age, Importance: 50.04%
    Feature: sport_Baseball, Importance: 4.48%
    Feature: sport_Basketball, Importance: 1.27%
    Feature: sport_Cycling, Importance: 1.05%
    Feature: sport_Football, Importance: 0.71%
    Feature: sport_Golf, Importance: 1.00%
    Feature: sport_Hockey, Importance: 0.21%
    Feature: sport_None, Importance: 3.28%
    Feature: sport_Soccer, Importance: 1.60%
    Feature: sport_Swimming, Importance: 1.86%
    Feature: sport_Tennis, Importance: 1.83%
    Feature: sport_Track, Importance: 2.27%
    Feature: sport_XCountry, Importance: 1.83%

  • To get a truly insightful understanding of how a model makes its predictions, it is handy to graph the predictions as a function of the features. For example, below you’ll see how the LinearRegression model predicts that women who play baseball can throw farther than men who play no sport at all.

    To generate these graphs, we had to create a DataFrame with every possible combination of the features and then predict on each row. The predictions are graphed here:

    Linear Regression Predictions
    Decision Tree Regressor Predictions

    NOTE: The DecisionTreeRegressor was not bounded when learning. The complexity of the predictions could have been reduced, and perhaps the MSE lowered, if we had optimized max_depth via hyperparameter tuning; for simplicity, we did not do this (a sketch of what that could look like follows). Leaving the tree unconstrained also makes plain how complicated its predictions are, which likely reflects some overfitting.
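
    As a sketch of what that tuning could look like, using scikit-learn’s GridSearchCV (the max_depth range is illustrative, not from the case study):

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeRegressor

    # search a range of tree depths, scoring by cross-validated MSE
    search = GridSearchCV(DecisionTreeRegressor(),
                          param_grid={'max_depth': range(2, 11)},
                          scoring='neg_mean_squared_error', cv=5)
    search.fit(features, labels)
    print(search.best_params_, -search.best_score_)  # best depth and its MSE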

    Code to create graphs

    def create_all_features():
        # let's create a set of data to make predictions on:
        # get all ages for each sport and gender
        sports = df['sport'].unique()
        ages = list(range(18, 101))
        genders = [False] * (len(ages)*len(sports))
        gender_m = [True] * len(genders)
        genders.extend(gender_m)
        # replicate ages to be for all sports (multiply by number of sports)
        # then, double itself to get both genders
        all_ages = []
        for i in range(len(sports)):
            all_ages.extend(ages)
        all_ages.extend(all_ages)
        # replicate sports to be for all ages (multiply by number of ages)
        # then, double itself to get both genders
        all_sports = []
        for i in range(len(ages)):
            all_sports.extend(sports)
        all_sports.extend(all_sports)
    
        # create the dataframe structured like original features during training
        df_features = pd.DataFrame({'male':genders, 'age':all_ages, 'sport':all_sports})
        return df_features
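
    def create_all_features_alt():
        # alternative sketch (not from the original case study): build the same
        # grid of combinations with itertools.product instead of manual list
        # replication; assumes the same global df as create_all_features above
        import itertools
        combos = list(itertools.product([False, True], range(18, 101), df['sport'].unique()))
        return pd.DataFrame(combos, columns=['male', 'age', 'sport'])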
    
    def graph_linear_predictions(features):
        # View the predictions made by the LinearRegression model
        # make some predictions and graph them!
        # sport is categorical data. We definitely need to get_dummies!
        df_dummified_features = pd.get_dummies(features)
    
        label_predictions = model.predict(df_dummified_features)
    
        sns.lineplot(x=features['age'], y=label_predictions, hue=features['sport'], style=features['male'])
        plt.title('LinearRegression\nPredicted Distances By Age, Sport & Gender')
        plt.legend(loc='upper left', bbox_to_anchor=(1.0, 1.1))
        plt.ylabel('Predicted Distance')
    
    def graph_dtr_predictions(features):
        # View the predictions made by DecisionTreeRegressor
        # make some predictions and graph them!
        # sport is categorical data. We definitely need to get_dummies!
        df_dummified_features = pd.get_dummies(features)
    
        label_predictions = dtm.predict(df_dummified_features)
    
        sns.scatterplot(x=features['age'], y=label_predictions, hue=features['sport'], size=features['male'])
        plt.title('DecisionTreeRegressor\nPredicted Distances By Age, Sport & Gender')
        plt.legend(loc='upper left', bbox_to_anchor=(1.0, 1.1))
        plt.ylabel('Predicted Distance')
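
    Putting it together, a minimal driver might look like this (a sketch; it assumes matplotlib.pyplot is imported as plt and that model and dtm were fit as in the Code tab):

    df_features = create_all_features()
    graph_linear_predictions(df_features)
    plt.show()  # render the LinearRegression prediction lines
    graph_dtr_predictions(df_features)
    plt.show()  # render the DecisionTreeRegressor prediction scatter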