In this section, we show how to identify a Curve of Best Fit for a set of data points. The curve can be anything an equation can express, although discontinuous curves and curves with asymptotes come with some limitations. The easiest case is a line of best fit. Second, you can find the polynomial that best fits the data points. Third, you can fit a continuous curve built from a combination of functions such as sine or logarithm. Lastly, you can attempt a curve with asymptotes or a discontinuity, such as a sigmoid or `y = 1 / (x - 3)`.

We will review four different ways to get the equation of a fitted curve.

  1. scipy.stats.linregress: a statistical method that identifies a line’s coefficients as well as some statistical information.

  2. numpy.polyfit: a Numpy method that identifies the coefficients of a polynomial of any degree n.

  3. scipy.optimize.curve_fit: an optimization method that identifies the coefficients of any function you supply, along with the covariance of those coefficients.

  4. ML LinearRegression: a Machine Learning model, which is studied on a separate page.

When identifying a Curve of Best Fit, it is good to know how good the curve is at representing the data points. We want an objective quantification that tells us whether a straight line is better at fitting the points, or some other curved line such as a parabola. We can compare how well two different lines fit using the Mean Squared Error.
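As a rough illustration of that comparison, the sketch below computes the Mean Squared Error by hand for two candidate curves. The data values and the helper name `mse` are made up for this example; the helper is intended to give the same number as `sklearn.metrics.mean_squared_error`, which is imported later in this section.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between observed and predicted y values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Hypothetical observations that happen to follow y = x**2 exactly
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2

# Two candidate curves evaluated at the same x values
line_pred = 4 * x - 2      # the least-squares straight line for these five points
parabola_pred = x ** 2     # a parabola

mse_line = mse(y, line_pred)
mse_parabola = mse(y, parabola_pred)

# The smaller MSE identifies the curve that sits closer to the points
print(f"line MSE = {mse_line:.2f}, parabola MSE = {mse_parabola:.2f}")
```

A lower MSE means the curve is closer to the points on average, so here the parabola wins.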

Often we want to know how strongly the x and y data points are correlated. After all, we may have just asked Seaborn to plot a regression line, and the line presented is the best line that fits the data, but is the data linearly related at all? This is a different, but also important, question! It is answered by looking at the Coefficient of Determination.

Understanding R² (Coefficient of Determination)

R-squared (R²) is a statistical measure that provides an indication of how well the line of best fit (the regression line) fits the data points. It quantifies the proportion of the variance in the dependent variable (y) that can be explained by the independent variable (x) in a linear regression model.

The R-squared value ranges from 0 to 1. A value of 1 indicates that the regression line perfectly predicts the dependent variable, meaning all the data points lie exactly on the line. A value of 0 indicates that the regression line does not explain any of the variance in the dependent variable, and the points are scattered randomly.

In general, a higher R-squared value indicates a better fit of the line to the data. For example, an R-squared value of 0.8 means that 80% of the variability in the dependent variable is accounted for by the independent variable(s) in the model.
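One way to make this concrete is to compute R² directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares, and compare it with the squared correlation coefficient that `linregress` reports. The data values below are invented for illustration.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical, nearly linear data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line through the points
result = linregress(x, y)
y_pred = result.slope * x + result.intercept

# R² = 1 - (unexplained variation) / (total variation)
ss_res = np.sum((y - y_pred) ** 2)       # variation left over after the fit
ss_tot = np.sum((y - np.mean(y)) ** 2)   # variation around the mean of y
r_squared = 1 - ss_res / ss_tot

# For a straight-line fit this matches the squared correlation coefficient
print(f"R² from definition: {r_squared:.4f}")
print(f"rvalue squared:     {result.rvalue ** 2:.4f}")
```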

Here is a table of the APIs used most frequently in this section.

| API | Notes |
| --- | --- |
| plt.plot | Draws a simple line plot on the current axis. It does not offer an ‘ax=ax’ argument. |
| plt.scatter | Draws a simple scatter plot on the current axis. It does not offer an ‘ax=ax’ argument. |
| np.linspace | Creates a numpy array of values linearly spaced between [start, end] with a specific number of points. It is helpful in generating x_data during plotting. |
| np.polyfit | Identifies the coefficients of a polynomial of any degree n. |
| linregress | A statistical method that identifies a line’s coefficients as well as some statistical information. |
| curve_fit | Identifies the coefficients of any function as well as the covariance of those coefficients. |
| mean_squared_error | Calculates the Mean Squared Error (MSE) from two sets of y_data points. |

Required Imports

Below are the necessary imports for the code examples in this section:

import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.optimize import curve_fit
from scipy.stats import linregress
from scipy import stats
from sklearn.metrics import mean_squared_error

sns.set()

# This is for Jupyter Notebook only
%matplotlib inline

Seaborn Regression

In many research projects, it is beneficial to find a curve that best fits the data. Students will often make a scatter plot of the data points and have Seaborn draw a regression plot, which shows a line of best fit with a shaded area around the line representing a 95% confidence interval. There is nothing wrong with that. It is convenient to have a line of best fit, if there is one, and regplot will draw one for you by default.
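A minimal sketch of such a plot follows. The data is synthetic and the column names `x` and `y` are placeholders chosen for this example; `ci=95` is spelled out only to make the confidence interval explicit.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, roughly linear data (hypothetical values)
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"x": np.linspace(0, 10, 50)})
df["y"] = 3 * df["x"] + 5 + rng.normal(scale=2.0, size=len(df))

fig, ax = plt.subplots(figsize=(8, 5))
# Scatter plot, line of best fit, and shaded 95% confidence band in one call
sns.regplot(data=df, x="x", y="y", ci=95, ax=ax)
ax.set_title("Simple regplot")
```

Note that regplot draws the line but does not hand back its slope or intercept, which is one reason the following sections compute the fit explicitly.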

[Figure: Simple regplot]

SciPy Linregress

[Figure: Annotated Regression]

This plot shows a scatter plot of data drawn with plt.scatter. It then finds the line of best fit using linregress. It plots it out, extending an extra 20 values on the x-axis, and it provides data on the graph itself. It shows what the slope and y-intercept are (m and b), gives the R² value, and lastly it presents the Mean Squared Error.
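The sketch below shows one way such an annotated plot could be put together. It is not the code behind the figure: the data is synthetic, the variable names are placeholders, and "an extra 20 values" is interpreted here as extending the line 20 units past the largest x value.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import linregress

# Synthetic data scattered around y = 2.5x + 10 (hypothetical values)
rng = np.random.default_rng(seed=1)
x_data = np.linspace(0, 50, 40)
y_data = 2.5 * x_data + 10 + rng.normal(scale=8.0, size=x_data.size)

# Line of best fit plus its statistics
fit = linregress(x_data, y_data)
m, b = fit.slope, fit.intercept
r_squared = fit.rvalue ** 2
mse = np.mean((y_data - (m * x_data + b)) ** 2)

# Extend the fitted line 20 units beyond the data (an interpretation, see above)
x_line = np.linspace(x_data.min(), x_data.max() + 20, 100)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(x_data, y_data, label="data")
ax.plot(x_line, m * x_line + b, color="red", label="line of best fit")

# Write the fit details onto the plot itself
summary = f"m = {m:.2f}\nb = {b:.2f}\n$R^2$ = {r_squared:.3f}\nMSE = {mse:.1f}"
ax.text(0.05, 0.95, summary, transform=ax.transAxes, va="top")
ax.legend(loc="lower right")
```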

Numpy Polyfit

Here we use np.polyfit to find the best fit for a line and a parabola. Put another way, we find the coefficients of the best-fitting polynomials of degree 1 and degree 2.
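A small sketch of that idea, using synthetic data generated from a known parabola so the result can be checked by eye. The data and variable names are assumptions for this example only.

```python
import numpy as np

# Synthetic data scattered around y = 0.8x² - 1.5x + 3 (hypothetical values)
rng = np.random.default_rng(seed=2)
x = np.linspace(-5, 5, 60)
y = 0.8 * x**2 - 1.5 * x + 3 + rng.normal(scale=1.0, size=x.size)

# Coefficients are returned highest power first
line_coeffs = np.polyfit(x, y, deg=1)       # [m, b]
parabola_coeffs = np.polyfit(x, y, deg=2)   # [a, b, c]

# Evaluate each fitted polynomial at the original x values
line_pred = np.polyval(line_coeffs, x)
parabola_pred = np.polyval(parabola_coeffs, x)

# Compare the two fits with the Mean Squared Error
mse_line = np.mean((y - line_pred) ** 2)
mse_parabola = np.mean((y - parabola_pred) ** 2)

print("parabola coefficients:", np.round(parabola_coeffs, 2))
print(f"MSE line = {mse_line:.2f}, MSE parabola = {mse_parabola:.2f}")
```

Because the data was generated from a parabola, the degree 2 fit should show a much smaller MSE than the line.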

[Figures: Best Fit Plot; Correlation plot]

The graph shows many things: scatter plot of data, line of best fit, parabola of best fit, a legend, coefficients of the parabola of best fit, and MSE of both the parabola and line. Whew!
It shows that the parabola is a much better fit for the points.

Curve Fit

Here we see how to fit a custom, continuous curve: a sinusoidal wave whose amplitude decays as x grows (the code calls it logarithmic_sinusoidal_wave, although the decay is implemented with np.exp). This is a continuous curve with no x-value that causes a division by zero, which allows us to use the curve_fit API. It is highly unlikely that you’ll ever want to fit a set of points to a curve like this. If you ever encounter wave data, you’ll more likely want to leverage Fourier Transforms. The reason we use a sinusoidal wave is simply to illustrate how curve_fit can be used on any curve.

To fit a curve to a set of points:

  1. Identify an equation that you want to fit. You’ll create a function that has the generalized equation with coefficients that are unknown. For example, a line is: y = mx + b where curve_fit will identify the values for m and b. You may have a polynomial curve such as: y = a*x**4 + b*x**3 + c*x**2 + d*x + e and curve_fit will find the values for a through e. If you already know one of the values and want to fix it, replace that coefficient with a constant set to the known value. Write a function that takes x along with the coefficients as arguments. For example:

def my_function(x, a, b, c):
    return a*x**2 + b*x + c
  2. Select a set of values for the coefficients that represents a valid guess. Assign these coefficients to a list, often named p0. You’ll provide p0 as an argument to curve_fit to help it get started as well as to tell it how many coefficients to solve for.

  3. Call curve_fit with the following arguments: the function, the set of points you want to fit, and the initial guess. For example:

# curve_fit returns a tuple. The first item is an array of the fitted coefficients. The second
# is the estimated covariance matrix of those coefficients. In this example we ignore
# the covariance values, so we unpack them into the identifier '_', which is a conventional
# name for a variable that goes unused.
coeffs, _ = curve_fit(logarithmic_sinusoidal_wave, x_data, y_data, p0=p0)
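The three steps above can be sketched end to end with the quadratic `my_function` from step 1. The data is synthetic, generated from known coefficients so the result is easy to check, and the names `true_coeffs` and `std_errors` are introduced only for this illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Step 1: the generalized equation, with x first and the unknown coefficients after it
def my_function(x, a, b, c):
    return a * x**2 + b * x + c

# Synthetic points scattered around y = 2x² - x + 0.5 (hypothetical values)
rng = np.random.default_rng(seed=3)
x_data = np.linspace(-3, 3, 50)
true_coeffs = (2.0, -1.0, 0.5)
y_data = my_function(x_data, *true_coeffs) + rng.normal(scale=0.2, size=x_data.size)

# Step 2: an initial guess, one value per coefficient
p0 = [1.0, 1.0, 1.0]

# Step 3: fit; this time we keep the covariance matrix instead of discarding it
coeffs, covariance = curve_fit(my_function, x_data, y_data, p0=p0)

# The square roots of the diagonal give a one-standard-deviation error per coefficient
std_errors = np.sqrt(np.diag(covariance))

for name, value, err in zip("abc", coeffs, std_errors):
    print(f"{name} = {value:.3f} ± {err:.3f}")
```

Keeping the covariance is optional, but it gives a quick sense of how tightly each coefficient is pinned down by the data.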
[Figures: Sinusoidal; Best Fit]

In our custom curve, we have coefficients for each of the following: amplitude, frequency, rate of logarithmic degradation, phase and offset.

The top plot passes no extra arguments to the function logarithmic_sinusoidal_wave, which means that it uses all the default values (10, 1, 0.2, 0, 0). This is why the title of the top plot is: “Arguments: ()”.

The bottom plot is the same equation but with specific coefficients provided (15, 0.5, 0.1, 0, 10). It shows how the coefficients impact the curve. Pay attention to the values on the y-axis.

See Code

def logarithmic_sinusoidal_wave(x, a=10, frequency=1, r=0.2, phase=0, offset=0):
    # The amplitude decays exponentially toward zero as x grows
    amplitude = a * np.exp(-r * x)
    return offset + amplitude * np.sin(2 * np.pi * frequency * x + phase)

def plot_with_args(ax, x_max, *args):
    # Generate x values
    x = np.linspace(0, x_max, 200)
    # Compute y values (plural) 
    y = logarithmic_sinusoidal_wave(x, *args)

    # Plot on the axis & show labels
    ax.plot(x, y)
    ax.set_xlabel('X')
    ax.set_ylabel('Amplitude')
    ax.set_title(f'Arguments: {args}')
        
def show_sinusoidal_curve():    
    fig, (ax1, ax2) = plt.subplots(2, figsize=(10, 9))
    # add some more vertical spacing between the two stacked subplots
    plt.subplots_adjust(hspace=0.3)
    plot_with_args(ax1, 20)
    
    wave_args = (15, 0.5, 0.1, 0, 10)
    plot_with_args(ax2, 20, *wave_args)
    
    plt.suptitle('Logarithmic Sinusoidal Waves', fontsize=18)
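
As a hedged sketch of how this wave could be handed to curve_fit, the example below generates noisy points from the second set of coefficients (15, 0.5, 0.1, 0, 10) and tries to recover them. The function is repeated so the block runs on its own; the noise level and the initial guess are assumptions. Oscillating curves are sensitive to the starting frequency, so the guess here is deliberately close to the true value.

```python
import numpy as np
from scipy.optimize import curve_fit

def logarithmic_sinusoidal_wave(x, a=10, frequency=1, r=0.2, phase=0, offset=0):
    # Same function as above: a sine wave whose amplitude decays with x
    amplitude = a * np.exp(-r * x)
    return offset + amplitude * np.sin(2 * np.pi * frequency * x + phase)

# Synthetic noisy samples of the wave (hypothetical noise level)
true_args = (15, 0.5, 0.1, 0, 10)
rng = np.random.default_rng(seed=4)
x_data = np.linspace(0, 20, 200)
y_data = logarithmic_sinusoidal_wave(x_data, *true_args)
y_data = y_data + rng.normal(scale=0.5, size=x_data.size)

# Initial guess: amplitude, frequency, decay rate, phase, offset
p0 = [12, 0.5, 0.15, 0.0, 8]

coeffs, _ = curve_fit(logarithmic_sinusoidal_wave, x_data, y_data, p0=p0)

labels = ["a", "frequency", "r", "phase", "offset"]
for label, fitted, actual in zip(labels, coeffs, true_args):
    print(f"{label:>9}: fitted {fitted:7.3f}   actual {actual}")
```

With a poor frequency guess the optimizer can settle on a wrong answer, so it is worth plotting the fitted curve over the data before trusting the coefficients.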

ML regressions

The study on distance using Machine Learning models is done on another page.

See ML Study

Predicting Distance