Classwork Efficiency - Intermediate Data Programming

This activity is more practice for performance.

Run the following code and observe the related timing graphs and answer the questions below. Note that the axis values are different between the different graphs. Note that the plotting is trying to do a line of best fit which may or may not match the actual data points. This line of best fit helps you determine what the data points are actually telling you.

In each of these graphs, we are using n values that grow exponentially. For example, the list of n values might be: [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25600] Note that the value of n starts at 100 and doubles each time. We call the method using these different n values and measure the time it takes. Because the slow method is so much slower, we stop with n = 3200. But, for the fast algorithm, we let n grow to 3276800.

NOTE

Try running a couple of times if you are getting strange results (what do you think can happen to throw this data off?)

get_list_XXX¶

%%time 
from time_graph import TimeGraph

def get_list_slow(n):
    result = []
    for num in range(1, n):
        result.insert(0, num)
    return result


def get_list_med(n):
    result = []
    for num in range(n - 1, 0, -1):
        result.append(num)
    return result


def get_list_fast(n):
    return [num for num in range(n - 1, 0, -1)]

timer = TimeGraph()
timer.time_and_graph(get_list_slow, max_iters=8)
timer.time_and_graph(get_list_med, max_iters=16)
timer.time_and_graph(get_list_fast, max_iters=18)

Question 1¶

Examine the code and graphs above. Explain why the code is slower or faster. Don’t forget to look at the scale. Explain what you see in the graphs. What is the Big-O for each method?

Explain with a complete sentence for each:
get_list_slow :
get_list_med :
get_list_fast :

get_sum¶

%%time
from time_graph import TimeGraph

def get_sum(n):
    sum = 0
    for i in range(n):
        for j in range(n):
            sum += j
    return sum

timer = TimeGraph()
timer.time_and_graph(get_sum, max_iters=7)

Question 2¶

Explain what the graph is saying about the code found in the method get_sum. What is the Big-O?

Explain in 1-3 sentences.

limit_list_to_range¶

The following code is working to include items in a list only if they are within a specific range. One common solution is very slow:

return [ n for n in range(low, high) if n in items]

Another solution is much better.

return [n for n in items if n >= low and n < high ]

%%time 
from time_graph import TimeGraph
import random

def list_to_range_slow(items, low, high):
    return [n for n in range(low, high) if n in items]


def list_to_range_fast(items, low, high):
    return [n for n in items if n >= low and n < high]

def limit_list_to_range_fast(n):
    # generate a random list of size n
    # then, create a range that grows linearly with n
    items = [random.randrange(0, 1000000) for n in range(n)]
    low = 0
    high = 100 * n
    list_to_range_fast(items, low, high)

def limit_list_to_range_slow(n):
    # generate a random list of size n
    # then, use a range of that grows linearly with n
    items = [random.randrange(0, 1000000) for n in range(n)]
    low = 0
    high = 100 * n
    list_to_range_slow(items, low, high)

timer = TimeGraph()
timer.time_and_graph(limit_list_to_range_slow, max_iters=4)
timer.time_and_graph(limit_list_to_range_fast, max_iters=14)

Question 3¶

Examine the graphs limit_range_fast.png and limit_range_slow.png. How do the points on each of the graphs grow on the y-axis? Does it look linear? Describe how the graphs look using Big-O notation.

Explain graph observations in 1-3 sentences.

Examine the code for the limit_range_slow. Explain why it is so slow. Explain the Big-O complexity as best you can. Do your best to show the work to derive the Big-O complexity. In this example, we will assume that there are n values in the list. AND, the range will be of size n. (e.g. n = high - low).

Explain the slowness of the code in 2-4 sentences.

species_count¶

%%time 

import pandas as pd
from time_graph import TimeGraph

def get_species_count_fns():
    """
    Create a closure that loads the data set that is
    held in memory for subsequent calls to run_species_count methods.
    We do it this way so that we don't have to continually reload the data
    or use a global.
    """
    data = pd.read_csv("pokemon_expanded.csv").to_dict("records")

    def species_count_set(n):
        rows = min(n // 100, len(data))
        df = data[0:rows]
        return get_species_count_set(df)

    def species_count_list(n):
        rows = min(n // 100, len(data))
        df = data[0:rows]
        return get_species_count_list(df)

    return species_count_set, species_count_list


def get_species_count_set(data):
    """
    From HW: species_count returns the number of unique Pokemon
    species (determined by the name attribute) found in
    the dataset. The data is a list of dictionaries.
    """
    unique = set()
    for pokemon in data:
        unique.add(pokemon["name"])

    return len(unique)


def get_species_count_list(data):
    """
    Use a list solution to demonstrate slower performance than a set
    """
    unique = []
    for pokemon in data:
        if pokemon["name"] not in unique:
            unique.append(pokemon["name"])

species_set, species_list = get_species_count_fns()
timer = TimeGraph()
timer.time_and_graph(species_set, max_iters=13)
timer.time_and_graph(species_list, max_iters=13)

Question 4¶

Examine the graphs and code for the two method:

species_count_list
species_count_set

Explain why the code is slower or faster. Don’t forget to look at the scale. Explain what you see in the graphs. What is the Big-O for each method?

Explain in 1-2 sentences.

# DON'T MODIFY FOLLOWING CODE! YOU MUST RUN THIS AS PART OF RUN ALL TO COLLECT YOUR ANSWERS

from IPython import get_ipython
from IPython.display import display, Markdown

ip = get_ipython()

import json
import nbformat

# Load the notebook file
import os
nb_path = os.path.basename(ip.user_ns["__vsc_ipynb_file__"]) if "__vsc_ipynb_file__" in ip.user_ns else "classwork-efficiency.ipynb"

with open(nb_path, "r", encoding="utf-8") as f:
    nb = nbformat.read(f, as_version=4)

collected = []

for cell in nb["cells"]:
    if cell["cell_type"] == "markdown":
        if "tags" in cell["metadata"] and "question" in cell["metadata"]["tags"]:
            collected.append(cell["source"])

for block in collected:
    display(Markdown(block))