Abstract

This page presents graphs for a ‘fake’ research project. It will show you both good graphs and bad graphs and a brief summary of what the graphs mean. It will provide the code used to generate the graphs as well as a short discussion on the code: how the code works, why the code was selected, and potential pitfalls to watch out for when generating your own.

This case study explores how far humans are able to throw a ball. The data set contains individual distance values, each recorded alongside the thrower's age and gender.

This exercise is intended to show how one can be creative in examining and visualizing data to surface true insights.

This exercise is NOT intended to show how to clean the data nor how to test the code.

Author: Jeff Stride

Setup

Note that the original data set used for this activity is no longer available but can be recreated with some creative use of numpy randomized data (e.g. x = np.random.normal(loc=80, scale=20, size=100))
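
As a rough illustration of that idea, here is a minimal sketch of a stand-in data generator. Everything in it is an assumption made for illustration: the function name make_fake_throws, the gender split, the list of sports, and the effect sizes are invented, and only the column names and the 1,866 row count come from this page.

```python
import numpy as np

def make_fake_throws(n=1866, seed=0):
    """Generate a hypothetical stand-in for the lost data set."""
    rng = np.random.default_rng(seed)
    gender = rng.choice(['Male', 'Female'], size=n, p=[0.6, 0.4])
    age = rng.integers(18, 100, size=n)  # ages 18..99 inclusive
    sport = rng.choice(['None', 'Baseball', 'Football', 'Golf', 'Hockey'], size=n)
    # Assumed effect sizes, chosen only so the plots have something to show.
    base = np.where(gender == 'Male', 80.0, 60.0)
    age_penalty = 0.4 * (age - 18)
    distance = rng.normal(loc=base - age_penalty, scale=20.0)
    distance = np.clip(distance, 1.0, None)  # no negative throws
    return {'gender': gender, 'age': age, 'sport': sport, 'distance': distance}
```

Passing the result to pd.DataFrame(make_fake_throws()) gives a df with the columns that the functions on this page expect.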

Here are the imports and Seaborn initialization used in the code on this page. Because the code was developed in Jupyter Notebook, we include %matplotlib inline. The code does not save the images to files.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns
import scipy.integrate as integrate
from scipy.optimize import curve_fit
from scipy import stats
sns.set()
%matplotlib inline

Background of Hypothetical Project

This is a hypothetical research project: the data set was fabricated and is not real-life data. We pretend that the data was collected in the following hypothetical situation:

A total of 1,866 people’s responses were recorded.

Here is a sample of the data. These are the people with the top 10 and bottom 10 distance values:

Distance Data

Note that our data is very “clean” and is not riddled with NaN or errors.


Worthless Graphs

Here is a set of common and essentially worthless graphs. A final report containing only these plots would be poor. These graphs are simple and not very insightful. Furthermore, one of the plots is mislabeled and misleading.

Average
Dist vs Sport
M/F Sport & Age
Stacked Bars
Sports Histogram

The most common question is the average distance thrown by gender. Here is the basic bar plot of that data. The small black line (the "error bar") at the top of each bar marks a 95% confidence interval for the mean. It treats our data as a sample from a larger population that it does not represent perfectly. Given the number of data points we have, the line shows the range in which we are 95% confident the true population mean falls.

Is the bar helpful?

Bar Chart

By default, ci=95 (the same as not defining it at all). What you see below uses ci='sd' (standard deviation) instead, which is much better here because we are less interested in our confidence in the average than in the spread of the distances.

def sns_bar_stats(df, ci):
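    # Draw the error bar with something other than the default 95% confidence
    # interval, e.g. ci='sd' for the standard deviation.
    # Note: seaborn 0.12+ deprecates `ci` in favor of `errorbar` (e.g. errorbar='sd').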
    sns.barplot(data=df, x='gender', y='distance', ci=ci)
    plt.ylabel('Distance')
    plt.xlabel('')
    plt.title('Average Distance')
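
To make the difference between the two error bars concrete, here is a minimal sketch that computes both numbers by hand. The helper name mean_sd_and_ci and the synthetic sample are assumptions for illustration. The interval is estimated by bootstrapping, which is also how Seaborn estimates a confidence interval, though its exact settings may differ.

```python
import numpy as np

def mean_sd_and_ci(values, n_boot=2000, seed=0):
    """Return (mean, standard deviation, (ci_low, ci_high)) for a 1-D sample."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Bootstrap: resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # The middle 95% of the resampled means is the confidence interval.
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    return values.mean(), values.std(ddof=1), (ci_low, ci_high)

# Synthetic sample, not the study's data.
rng = np.random.default_rng(1)
sample = rng.normal(loc=80, scale=20, size=1000)
mean, sd, (lo, hi) = mean_sd_and_ci(sample)
print(f"mean={mean:.1f}  sd={sd:.1f}  95% CI of the mean=({lo:.1f}, {hi:.1f})")
```

With many data points the confidence interval becomes very narrow even though individual throws still vary widely, which is why the standard deviation says more about the spread.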

Data by Gender

It is always a good idea to get a firm idea of what the data comprises. In these next few plots we show how the data differs by gender.

Pie
Average
Histogram
Swarm

The following pie chart shows us how many men and women there are. We can see that in this (fake) study, we had more men than women likely due to some bias in how the data was collected.

Gender Pie

The code to generate this pie chart is a simplified version of the code for the more complicated pie charts below. In short, it was:

# Calculate the percentages of each gender
counts = df['gender'].value_counts()
percentages = counts / counts.sum() * 100
plt.figure(figsize=(8, 6))
plt.pie(percentages, labels=percentages.index, autopct=lambda pct: f'{pct:.1f}%', startangle=45)

Data by Sport

Let’s see if we can learn how the sport impacts the distance thrown. We already saw that a line plot was horrible. When examining the data by gender, however, a histogram provided some decent data and a swarm plot was better. Attempts at these two plots by sport showed that we needed to do better still, so we moved on to the better option, a box plot.

Pie Chart
Swarm Plot
Distance Box Plot
Age Box Plot
Area Plots

These pie charts show how each gender breaks down by sport. We can see that a larger share of women than men do not affiliate with any sport at all, and that no women are a part of football.

Sport Pie Chart by Gender
def plot_pie(ax, df, col_name, display_name):
    # Calculate the percentages of each category
    counts = df[col_name].value_counts()
    percentages = counts / counts.sum() * 100
   
    label_threshold = 2
    # Define the autopct formatting function
    def autopct_format(pct):
        return f'{pct:.1f}%' if pct >= label_threshold else ''

    # Create a list of labels based on the threshold
    labels = [name if percentages[name] >= label_threshold else '' for name in percentages.index]

    ax.pie(percentages, labels=labels, autopct=autopct_format, startangle=45)

    # Use the passed-in axes; plt.axis would only affect the current pyplot axes.
    ax.axis('equal')
    ax.set_title('Percentage Makeup of ' + display_name)
    
    # Identify sports for the table that were not annotated
    not_annotated = [name for name in percentages.index if percentages[name] < label_threshold]

    # Create a table for not annotated sports
    if not_annotated:
        table_data = pd.DataFrame({display_name: not_annotated, 'Percentage': [f'{percentages[name]:.1f}%' for name in not_annotated]})
        table = ax.table(cellText=table_data.values, colLabels=table_data.columns, cellLoc='center', loc='bottom', bbox=[1, .75, .4, 0.25])
        table.auto_set_font_size(False)
        table.set_fontsize(10)
        table.scale(1, 1.5)
        
def plot_pie_by_side(df):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    plot_pie(ax1, df[df['gender'] == 'Female'], 'sport', 'Sport (F)')
    plot_pie(ax2, df[df['gender'] == 'Male'], 'sport', 'Sport (M)')

Data by Age

Pie Chart
Actual vs Sample
Box Plot
Area of Percent

The largest age groups happen to be: 38-47, then 28-37, followed by 18-27. It appears that there is a bias in the data which could skew the results of the research. Given that we will soon see that younger people throw a bit farther on average, having our data skewed a bit older makes the results less accurate overall. The number of people in the general population should get smaller as the ages go up, and surely this distribution by age does not match what we expect. This triggered another plot (Actual vs Sample) to examine more closely the age distribution relative to the actual population.

Age Pie

See Code
See the code for the plot_pie function in the section above, “Data by Sport - Pie Chart”.

def pie_by_age(df):
    # add an age bracket to a copy of the DF
    df2 = df.copy()
    bracket_names = [str(n) + "-" + str(n + 9) for n in range(18, 81, 10)]
    bracket_names.append('88-99')
    # Ages above 87 all go into the final '88-99' bracket.
    age_bracket = [bracket_names[-1] if age > 87 else bracket_names[(age - 18) // 10] for age in df2['age']]
    df2['age_bracket'] = age_bracket
    df2 = df2.sort_values(by='age_bracket')
    fig, ax = plt.subplots()
    plot_pie(ax, df2, 'age_bracket', 'Age Bracket')
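
The bracket arithmetic is easy to get wrong at the edges, so here is a small sketch that isolates it in a function that can be checked on its own. The name age_to_bracket and its parameters are hypothetical; the bracket boundaries mirror the ones used in pie_by_age.

```python
def age_to_bracket(age, start=18, width=10, last_start=88):
    """Map an age to a label such as '18-27'; ages of 88 and up share '88-99'."""
    if age < start:
        raise ValueError(f"age {age} is below the youngest bracket ({start})")
    if age >= last_start:
        return f"{last_start}-99"
    # Integer division finds how many whole brackets lie below this age.
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

for age in (18, 27, 28, 87, 88):
    print(age, age_to_bracket(age))
```

A helper like this could be applied with df['age'].map(age_to_bracket) in place of the list comprehension.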

Examining Distance

Scatter
Regression
Histogram
Cumulative Histogram
KDE

These two plots clearly show how scattered the distance values are. The plot on the left shows how men throw farther than women. The density of the dots gets thinner above 80 years old and beyond 100 yards.

The plot on the right gives a glimpse into how many people are in a particular sport (e.g., very few in Golf and Hockey). The colors allow one to see that the older folks, by and large, throw shorter distances. Baseball also has a few outliers.

Distance Scatter Plots

See Code
We use Seaborn to present these plots so that we can take advantage of the named argument, hue. Initially the two plots sat too close to each other and the sport names overlapped the plot on the left, so we set the spacing using plt.subplots_adjust. Most of this relatively simple code is setting the titles and labels.

def scatter_plots(df):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
    sns.scatterplot(data=df, ax=ax1, x='distance', y='age', hue='gender')
    sns.scatterplot(data=df, ax=ax2, x='distance', y='sport', hue='age')
    # space out the two plots horizontally
    plt.subplots_adjust(wspace=0.5)
    ax1.set_title('Distance vs Age')
    ax1.legend(title='Gender')
    ax1.set_ylabel('Age')
    for ax in (ax1, ax2):
        ax.set_xlabel('Distance')
    
    ax2.set_title('Distance vs Sport w/ Age')
    ax2.set_ylabel('')
    ax2.legend(title='Age', loc='lower left', bbox_to_anchor=(.95, .05))

Human Limits

In this section we will attempt to discover the farthest a human being can throw a baseball. This was inspired by the Cumulative Histogram, where we could see the slope of the chart flatten out as the distance got farther and farther. It suggests (as does common sense) that eventually no one will be able to throw farther: there is a limit to how far a human being can throw a ball. What is that distance?

Summary
The Curves
Bootstrap Resamples
Code

The shape of the curve in the Cumulative chart is sigmoid-like (an S-curve). In that graph the asymptote is at 100%: it is a limit on the percentage, not on the distance. We want to rotate the graph so that the y-axis is the distance thrown and the x-axis is “time”. Unfortunately, we can’t just swap the x and y axes to get the graph we want. What do we do?

I did a little thought experiment inspired by a statistical technique called bootstrapping: when you don’t have access to the true population data, you simulate it by resampling your sample data over and over. Here, I pretend that my data sample is the entire population of the world over an infinite amount of time. Let’s call this the “True Population” (even though it is definitely NOT)! Then, I “simulate” a moment in time by randomly drawing a sample from the “True Population” to get a small data set for that moment.

So, I’m assuming that I have “True Population” data and I’m simulating time. For each resample, I take the maximum value to see if anyone has set a record for the farthest distance thrown. I repeat this until I get the maximum value in my “True Population” dataset.
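
The resampling loop described above could be sketched roughly as follows. This is not the original code: the function name simulate_records, the sample size of 25, and the choice to resample with replacement are all assumptions.

```python
import numpy as np

def simulate_records(distances, sample_size=25, max_steps=10_000, seed=0):
    """Return the running record (longest throw so far) after each resample."""
    distances = np.asarray(distances, dtype=float)
    rng = np.random.default_rng(seed)
    overall_max = distances.max()
    records = []
    record = -np.inf
    for _ in range(max_steps):
        # One "moment in time": a small random draw from the pretend population.
        resample = rng.choice(distances, size=sample_size, replace=True)
        record = max(record, resample.max())
        records.append(record)
        # Stop once the all-time best throw in the data has shown up.
        if record >= overall_max:
            break
    return np.array(records)

# Synthetic distances, not the study's data.
fake = np.random.default_rng(1).normal(loc=80, scale=20, size=500)
records = simulate_records(fake)
print(len(records), "steps until the overall maximum was reached")
```

Each element of the returned array is one point on the "records over time" plot, so the array never decreases and ends at the largest value in the data.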

With this set of resampled data, I generated a plot of “records” over “time.” Then, I used some Curve of Best Fit techniques to find the best fitting curve to our data. This has some limits: in real life, the curve would always have its asymptote above the farthest distance thrown, but a curve of best fit minimizes the mean squared error (MSE) and may place the asymptote below the farthest distance thrown. Also, the resampling process is random and can create a simulated record set that is far from real-looking. I had to make multiple attempts at resampling to find something that looks convincingly real.

I did four different Bootstrap Resamples, graphed them, did a curve of best fit to a Sigmoid curve, and then picked the best one.
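
For readers who want to try the fitting step, here is a minimal sketch using scipy.optimize.curve_fit with a three-parameter logistic curve. The exact curve form, the parameter names, and the initial guesses are assumptions, not the original code, and the data below is synthetic with a known asymptote of 150.

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(t, upper, rate, midpoint):
    """Logistic curve that levels off at `upper` as t grows."""
    return upper / (1.0 + np.exp(-rate * (t - midpoint)))

def fit_sigmoid(t, y):
    """Fit the logistic curve and return (upper, rate, midpoint)."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Initial guesses: asymptote near the largest value, midpoint where the
    # data first reaches half of that value.
    half_idx = np.argmax(y >= y.max() / 2)
    p0 = [y.max(), 0.1, t[half_idx]]
    params, _ = curve_fit(sigmoid, t, y, p0=p0, maxfev=10_000)
    return params

# Synthetic check with a known asymptote (not the study's data).
rng = np.random.default_rng(0)
t = np.arange(60, dtype=float)
y = sigmoid(t, 150.0, 0.2, 10.0) + rng.normal(scale=1.0, size=t.size)
upper, rate, midpoint = fit_sigmoid(t, y)
print(f"estimated asymptote: {upper:.1f}")
```

The fitted `upper` parameter plays the role of the asymptote. As noted above, nothing forces it to sit above the largest observed value, so it is worth comparing the two after fitting.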

Final Result
The most likely y-upper-limit (asymptote) that represents the upper limit of human performance is:
149 Yards
See the content in the other tabs for details.

Machine Learning

You can go here to view more results that show how creating an ML model can help predict distance based on three features: gender, sport, and age.

Do you think the results show that the model is accurate at predicting, or not?