Processing

In this homework, you’ll implement, document, and test a series of functions that apply control structures and data structures to problems that mimic different kinds of data cleanup and processing. Later, we’ll learn how to use real-world library functions to complete these tasks more effectively.

For each question, you’ll be asked to implement a function, document it with a docstring, and test it with doctests. For specific guidance, search for the “style guide” on the course website. Generally:

To fulfill the documentation requirements, use your own words to provide a brief description of only the details that a client needs to know to call the function.
To fulfill testing requirements, convert each provided valid function call example into a doctest and additionally write 2 more test cases of your own. You may need to change the given examples slightly to meet doctest requirements.

The run_docstring_examples function call at the end of each task will only print a message if test cases fail.
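As a minimal sketch of the expected shape, the hypothetical function `double` below (it is not part of this assignment) has a brief docstring followed by doctests. Each doctest is a line starting with `>>>`, followed by the exact output Python would display for that call.

```python
import doctest


def double(n):
    """
    Returns twice the given number n.

    >>> double(2)
    4
    >>> double(-1.5)
    -3.0
    """
    return n * 2


# Prints nothing when every example passes; prints a failure report otherwise.
doctest.run_docstring_examples(double, globals())
```

Note that the expected output must match what Python displays character for character, so `-3.0` and `-3` are not interchangeable in a doctest.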

import doctest

Outside Sources

Update the following Markdown cell to include your name and list your outside sources. Submitted work should be consistent with the curriculum and your sources.

Name: YOUR_NAME_HERE

Enter your outside sources as a list here, or remove this line if you did not consult any outside sources at all.

Task: `text_normalize`

Text normalization is the process of removing unwanted characters, such as whitespace or special characters, from a piece of text. Write and test a function text_normalize that takes a string and returns a new string that keeps only the alphabetical characters (dropping whitespace, digits, punctuation, and any other non-alphabetical characters) and converts them all to lowercase.

text_normalize("Hello") should return "hello"
text_normalize("Hello!") should return "hello"
text_normalize("heLLo tHEr3!!!") should return "hellother"
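Two built-in string methods may be useful building blocks here. The sketch below only shows how they behave on small inputs; the variable names `checks` and `shout` are illustrative, and whether to use these methods at all is up to you.

```python
# str.isalpha() is True only when every character in the string is a letter.
checks = {ch: ch.isalpha() for ch in "h3! "}
print(checks)  # {'h': True, '3': False, '!': False, ' ': False}

# str.lower() returns a new lowercase string; the original is unchanged.
shout = "LL"
print(shout.lower())  # ll
print(shout)          # LL
```

Keep in mind that `isalpha` also returns `True` for letters outside the English alphabet, such as `"é"`.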

...


doctest.run_docstring_examples(text_normalize, globals())

Task: `average_tokens_per_line`

Write and test a function average_tokens_per_line that takes the name of a .txt file and returns the average number of tokens per line in the file. For example, if the file song.txt contains the text:

Row, row, row your boat
Gently down the stream
Merrily, merrily, merrily, merrily,
Life is but a dream!

The first line has 5 tokens; the second has 4; the third has 4; and the fourth has 5. This gives an average of 4.5 tokens per line.
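The counts above can be reproduced with a short sketch. It assumes that a token is any run of characters separated by whitespace, so punctuation stays attached to its word; that assumption matches the counts in this example but is not stated elsewhere in the task. The lines are held in a list here rather than read from song.txt.

```python
# The song from the example, held in memory rather than read from a file.
lines = [
    "Row, row, row your boat",
    "Gently down the stream",
    "Merrily, merrily, merrily, merrily,",
    "Life is but a dream!",
]

# str.split() with no arguments splits on any run of whitespace.
token_counts = [len(line.split()) for line in lines]
print(token_counts)                            # [5, 4, 4, 5]
print(sum(token_counts) / len(token_counts))   # 4.5
```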

To write additional test cases, create new text files. From the JupyterLab File menu, choose New and then Text File.

...


doctest.run_docstring_examples(average_tokens_per_line, globals())

Task: `pair_up`

When creating and processing datasets, it’s sometimes useful to pair up identifiers with each data element. For this task, you are given some buggy code that is intended to take a set of identifiers and a set of elements and return a set of every identifier paired with every element. Since sets are unordered, there is no inherent ordering to the tuples in the result set.

Your task is to identify and correct the bugs, and then explain each bug you encountered and what led you to your specific fix. For this task, you do not need to write additional test cases.

TODO: Replace this text with your explanation.

def pair_up(identifiers, elements):
    """
    Given two sets, returns a set of tuples where each item in the first set is paired with each
    item in the second set.

    For the doctests, we use the sorted function to ensure a predictable ordering for the tuples
    because sets do not generally guarantee a specific ordering.

    >>> sorted(pair_up({10, 20}, {5, 6, 7}))
    [(10, 5), (10, 6), (10, 7), (20, 5), (20, 6), (20, 7)]
    >>> sorted(pair_up({10, 20}, {"I", "am", "Groot"}))
    [(10, 'Groot'), (10, 'I'), (10, 'am'), (20, 'Groot'), (20, 'I'), (20, 'am')]
    """
    result = {}
    for identifier in identifiers:
        for element in elements:
            result.add(identifier, element)
    return result


doctest.run_docstring_examples(pair_up, globals())
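The docstring's note about sorted can be seen in isolation with a small sketch. The set `pairs` below is an illustrative value, not output from pair_up.

```python
pairs = {(20, "am"), (10, "I"), (10, "Groot")}

# Iteration order over a set is not guaranteed, so printing the set directly
# is not a reliable doctest. sorted() returns a list in ascending order;
# tuples compare element by element, left to right.
print(sorted(pairs))  # [(10, 'Groot'), (10, 'I'), (20, 'am')]
```

Strings compare by character code, which is why the uppercase 'Groot' and 'I' sort before the lowercase 'am'.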

Task: `five_number_summary`

Write and test a function five_number_summary that takes a sorted list of at least 5 numbers and returns the five-number summary of the input as a tuple: (minimum, first-quartile, median, third-quartile, maximum). The first quartile is the median of the lower half of the data (including the minimum), and the third quartile is the median of the upper half of the data (including the maximum). When the list has an odd number of elements, the median element belongs to neither half and is excluded from the calculations of the first and third quartiles.

five_number_summary([1, 2, 3, 4, 5]) should return (1, 1.5, 3, 4.5, 5)
five_number_summary([1, 1, 1, 1, 1]) should return (1, 1, 1, 1, 1)
five_number_summary([30, 31, 31, 34, 36, 38, 39, 51, 53]) should return (30, 31, 36, 45, 53)
five_number_summary([5, 13, 14, 15, 16, 17, 25]) should return (5, 13, 15, 17, 25)
five_number_summary([5, 12, 12, 13, 13, 15, 16, 26, 26, 29, 29, 30]) should return (5, 12.5, 15.5, 27.5, 30)
five_number_summary([12, 12, 13, 13, 15, 16, 26, 26, 29, 29]) should return (12, 13, 15.5, 26, 29)

The following examples of invalid function calls should not be tested:

five_number_summary([1]) since the input list does not have at least five numbers
five_number_summary([5, 4, 3, 2, 1]) since the input list is not sorted from least to greatest

We recommend defining a helper function to find the median of a given list.
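To see how the halves are chosen, here is a trace of the third example. It uses `statistics.median` from the standard library purely as a stand-in for the helper you are asked to write; the variable names are illustrative.

```python
from statistics import median  # library stand-in for your own helper

data = [30, 31, 31, 34, 36, 38, 39, 51, 53]
middle = len(data) // 2          # index 4, the median element 36

lower_half = data[:middle]       # [30, 31, 31, 34]
upper_half = data[middle + 1:]   # [38, 39, 51, 53]; skips the median (odd length)

print(median(lower_half))  # 31.0
print(median(data))        # 36
print(median(upper_half))  # 45.0
```

For an even-length list there is no single middle element to skip, so the two halves together cover the whole list. Also note that averaging two middle values with `/` produces a float such as `31.0`, which matters when writing doctest output.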

...


doctest.run_docstring_examples(five_number_summary, globals())

Task: `num_outliers`

An outlier is an extreme data point that can influence the shape and distribution of numeric data. A data point $x$ is considered an outlier if either of the following holds:

$x$ is less than the first quartile minus 1.5 times the interquartile range
$x$ is greater than the third quartile plus 1.5 times the interquartile range

The interquartile range is defined as the third quartile minus the first quartile.
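In symbols, writing $Q_1$ and $Q_3$ for the first and third quartiles (notation introduced here for convenience), $x$ is an outlier when:

$$
\text{IQR} = Q_3 - Q_1, \qquad x < Q_1 - 1.5 \cdot \text{IQR} \quad \text{or} \quad x > Q_3 + 1.5 \cdot \text{IQR}
$$

As a worked example, the list [5, 13, 14, 15, 16, 17, 25] has $Q_1 = 13$ and $Q_3 = 17$, so:

$$
\text{IQR} = 17 - 13 = 4, \qquad Q_1 - 1.5 \cdot 4 = 7, \qquad Q_3 + 1.5 \cdot 4 = 23
$$

Only 5 (less than 7) and 25 (greater than 23) fall outside these bounds, so this list has 2 outliers.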

Write and test a function num_outliers that takes a sorted list of at least five numbers and returns the number of data points that would be considered outliers. Use your five_number_summary function to calculate the first and third quartiles.

num_outliers([1, 2, 3, 4, 5]) should return 0
num_outliers([1, 99, 200, 500, 506, 507]) should return 0
num_outliers([5, 13, 14, 15, 16, 17, 25]) should return 2 (the outliers are 5 and 25)
num_outliers([33, 34, 35, 36, 36, 36, 37, 37, 100, 101]) should return 2 (the outliers are 100 and 101)
num_outliers([8, 10, 10, 11, 11, 12]) should return 1 (the outlier is 8)

The following examples of invalid function calls should not be tested:

num_outliers([3, 3, 3]) since the input list does not have at least five numbers
num_outliers([3, 2, 1, 0, 5]) since the input list is not sorted from least to greatest

...


doctest.run_docstring_examples(num_outliers, globals())

Task: `reformat_date`

Write and test a function reformat_date that takes three strings: a date string, an input date format, and an output date format. This function should return a new date string formatted according to the output date format.

A date string is a non-empty string of numbers separated by /, such as "2/20/1991" or "1991/02/20". The order of date fields (month, day, year) will depend on the date format, and the number of digits for each field can vary but there must be at least one digit for each field.

A date format is a non-empty string of the date symbols "D", "M", "Y" separated by /. Assume that the date string will match the input date format (they share the same number of /s), that any date symbol in the output date format will also appear in the input date format, and that date formats do not duplicate date symbols.

reformat_date("12/31/1998", "M/D/Y", "D/M/Y") should return "31/12/1998"
reformat_date("1/2/3", "M/D/Y", "Y/M/D") should return "3/1/2"
reformat_date("0/200/4", "Y/D/M", "M/Y") should return "4/0"
reformat_date("3/2", "M/D", "D") should return "2"
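One way to think about the first example is to line up each date symbol with the field in the same position. The sketch below shows only that pairing step, under the assumptions stated in the task; the variable names are illustrative, and this is one possible approach rather than the required one.

```python
date_string = "12/31/1998"
input_format = "M/D/Y"

# Splitting both strings on "/" yields parallel lists of symbols and fields.
symbols = input_format.split("/")   # ['M', 'D', 'Y']
fields = date_string.split("/")     # ['12', '31', '1998']

# zip pairs the items position by position; dict makes them easy to look up.
field_for = dict(zip(symbols, fields))
print(field_for)        # {'M': '12', 'D': '31', 'Y': '1998'}
print(field_for["D"])   # 31
```

The fields stay as strings throughout, so a field like "02" keeps its leading zero.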

The following examples of invalid function calls should not be tested:

reformat_date("3/2", "M/D/Y", "Y/M/D") since the date string and input date format do not match
reformat_date("3/2", "M/D", "Y/M/D") since the input date format is missing a field present in the output date format
reformat_date("1/2/3/4", "M/D/Y/S", "M/D") since the input date format contains a field that is not “D”, “M”, or “Y”
reformat_date("1/2/3", "M/M/Y", "M/Y") since the input date format contains a duplicate date symbol
reformat_date("", "", "") since date strings and date formats must be non-empty

...


doctest.run_docstring_examples(reformat_date, globals())

Testing

Double check that each task other than pair_up has 2 of your own additional test cases.

test_results = doctest.testmod()
print(test_results)
assert test_results.failed == 0, "There are failed doctests"
assert test_results.attempted >= 31, "There should be at least 31 total doctests"
