2020/4/8 P7: COVID-19 Growth Trend Clustering

https://canvas.wisc.edu/courses/176717/assignments/804556

P7: COVID-19 Growth Trend Clustering

Due Thursday by 9:29am | Points 100 | Submitting a file upload | File Types py

Available until Apr 9 at 9:29am


NOTE: A tie-breaking protocol has been added to the HAC description below.

Assignment Goals

Implement hierarchical clustering

Process real-world data of particular contemporary relevance

Summary

There is a lot of analysis happening with the various datasets for COVID-19 right now. One of the goals of these analyses is to help figure out which countries are "beating" the pandemic.

The Ten Hundred Plot

(An explanatory video is embedded here on the original page; the "x" and "y" features referenced below are defined in it.)


Using the publicly available Johns Hopkins COVID-19 data, you'll be performing clustering on time series data for different regions in the world. Each region is defined by a row in the data set, which can be a country, a province, etc. The time series data represents the number of (cumulative) confirmed cases on each day. Because different regions have different onsets of the outbreak and differ in magnitude, it is often desirable to convert a raw time series into a shorter feature vector. For this assignment, you will represent a time series by two numbers: the "x" and "y" values in the above ten-hundred plot video. After each region becomes that two-dimensional feature vector, you will cluster all regions with HAC.

Program Specification

Download the data in CSV format: time_series_covid19_confirmed_global.csv

This is a snapshot of the data from the morning of April 2, 2020. If you wish to test your code on current data, the Johns Hopkins University git repo (https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv) has constantly updating data, but we will only be testing your code with the snapshot.

Write the following Python functions:

1. load_data(filepath) — takes in a string with a path to a CSV file formatted as in the link above, and returns the data (without the lat/long columns but retaining all other columns) in a single structure.

2. calculate_x_y(time_series) — takes in one row from the data loaded from the previous function, calculates the corresponding x, y values for that region as specified in the video, and returns them in a single structure.

Notes:

1. The "n/10 day" is the latest day with LESS THAN OR EQUAL TO n/10 cases, similarly for the "n/100 day".

2. If the "n/10 day" is day i, and today is day j, then x = j - i, not j - i + 1.

3. Some x or y can be NaN if the time series doesn't contain enough growth.

4. There is a link to Matlab code in the video description on YouTube; please consult that for the precise definition of x and y.

3. hac(dataset) — performs single linkage hierarchical agglomerative clustering on the regions with the (x, y) feature representation, and returns a data structure representing the clustering.

You may implement other helper functions as necessary, but these are the functions we will be testing.

Load Data

Read in the file specified in the argument (the DictReader from Python's csv module (https://docs.python.org/3/library/csv.html#csv.DictReader) will be of use) and return a list of dictionaries, where each row in the dataset is a dictionary with the column headers as keys and the row elements as values. These dictionaries should not include the lat/long columns, as we will not be using them in this program, but retain the province (possibly empty) and country columns so that data points can be uniquely identified.

You may assume the file exists and is a properly formatted CSV.
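A minimal sketch of such a loader, assuming the geographic columns in the snapshot are headed "Lat" and "Long" (check the actual file — the header names are an assumption here, not part of the spec):

```python
import csv

def load_data(filepath):
    """Return the CSV as a list of dicts keyed by column header,
    with the Lat/Long columns dropped but all other columns kept."""
    rows = []
    with open(filepath, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Drop the geographic columns; "Lat"/"Long" are the
            # header names assumed from the April 2 snapshot.
            row.pop("Lat", None)
            row.pop("Long", None)
            rows.append(row)
    return rows
```

Because DictReader preserves column order, the date columns in each dictionary remain in chronological order, which the feature calculation below relies on.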

Calculate Feature Values

This function takes in the data from a single row of the raw dataset as read in the previous function (i.e. a single dictionary, without the lat/long values but retaining all other columns). As explained in the video above, this function should return the x, y values in a tuple, formatted as (x, y).
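The Matlab code linked from the video is the authoritative definition; the sketch below is one plausible reading of it, assuming x = j - i (today minus the "n/10 day"), y = ("n/10 day") - ("n/100 day"), and identifier columns named "Province/State" and "Country/Region":

```python
import math

def calculate_x_y(time_series):
    """Ten-hundred (x, y) features for one region (one reading of the
    video's definition). time_series is a dict from load_data, with the
    date columns, in chronological order, as the only non-identifier keys."""
    counts = [int(v) for k, v in time_series.items()
              if k not in ("Province/State", "Country/Region")]
    j = len(counts) - 1   # "today" is the last day, day j
    n = counts[j]         # today's cumulative case count
    if n == 0:
        return (math.nan, math.nan)  # no cases yet: not enough growth

    def latest_day_leq(limit):
        # Latest day with LESS THAN OR EQUAL TO `limit` cases; NaN if none.
        day = math.nan
        for d, c in enumerate(counts):
            if c <= limit:
                day = d
        return day

    i = latest_day_leq(n / 10)   # the "n/10 day"
    k = latest_day_leq(n / 100)  # the "n/100 day"
    return (j - i, i - k)        # x = j - i, y = i - k
```

Note that if no day satisfies a threshold, the NaN propagates through the subtraction, matching note 3 above.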

Perform HAC

For this function, we would like you to mimic the behavior of SciPy's HAC function (https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html), linkage(). You may not use this function in your implementation, but we strongly recommend using it to verify your results!

Input: A collection of m observation vectors in n dimensions may be passed as an m by n array. All elements of the condensed distance matrix must be finite, i.e. no NaNs or infs. If you follow the Matlab code from the YouTube video description, you will occasionally have NaN values; such rows should be filtered out within this function and should not count toward your total number of regions. In our case, m is the number of regions and n is 2: the x and y features for each region.

Using single linkage, perform the hierarchical agglomerative clustering algorithm as detailed on slide 19 of this presentation (http://pages.cs.wisc.edu/~jerryzhu/cs540/handouts/introML.pdf). Use a standard Euclidean distance function for calculating the distance between two points.

Output: An (m-1) by 4 matrix Z. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster m + i. A cluster with an index less than m corresponds to one of the m original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.

That is:

Number each of your starting data points from 0 to m-1. These are their original cluster numbers.

Create an (m-1)x4 array or list. Iterate through the list row by row.

For each row, determine which two clusters you will merge and put their numbers into the first and second elements of the row. The first index listed should be the smaller of the two cluster indexes.

The single-linkage distance between the two clusters goes into the third element of the row. The total number of points in the cluster goes into the fourth element.

If you merge a cluster containing more than one data point, its number (for the first or second element of the row) is m + the index of the row in which that cluster was created.

Before returning the data structure, convert it into a NumPy matrix.
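The steps above can be sketched as a naive O(m^3) single-linkage implementation (a sketch under the assumptions stated in the comments, not an optimized or required design; dataset is assumed to be a list of (x, y) tuples from calculate_x_y):

```python
import numpy as np

def hac(dataset):
    """Single-linkage HAC over (x, y) tuples, mimicking the output
    format of scipy.cluster.hierarchy.linkage (naive O(m^3) sketch)."""
    # Filter out NaN rows inside this function, as the spec requires.
    points = [p for p in dataset if not any(np.isnan(v) for v in p)]
    m = len(points)
    # clusters maps a cluster index to the original point indices it holds.
    clusters = {i: [i] for i in range(m)}
    Z = []
    for it in range(m - 1):
        best = None  # (distance, smaller cluster index, larger cluster index)
        for a in sorted(clusters):
            for b in sorted(clusters):
                if a >= b:
                    continue
                # Single linkage: minimum pairwise Euclidean distance.
                d = min(np.hypot(points[p][0] - points[q][0],
                                 points[p][1] - points[q][1])
                        for p in clusters[a] for q in clusters[b])
                # Tuple comparison applies the tie-breaking protocol:
                # smallest distance, then smallest first index, then second.
                if best is None or (d, a, b) < best:
                    best = (d, a, b)
        d, a, b = best
        Z.append([a, b, d, len(clusters[a]) + len(clusters[b])])
        # The merged cluster gets index m + current row, per the spec.
        clusters[m + it] = clusters.pop(a) + clusters.pop(b)
    return np.asmatrix(Z)
```

Recomputing the single-linkage distance from scratch each iteration keeps the sketch short; a real implementation would cache a distance matrix.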


If you follow these guidelines for input and output, your result should match the result of scipy.cluster.hierarchy.linkage() and you can use that function to verify your results. Be aware that this function does not contain code to filter NaN values, so this filtering should be performed before calling the function.

Tie Breaking

In the event that there are multiple pairs of points with equal distance for the next cluster:

Given a set of pairs with equal distance {(x_i, x_j)} where i < j, we prefer the pair with the smallest first cluster index i. If there are still ties (x_i, x_j), ..., (x_i, x_k) where i is that smallest first index, we prefer the pair with the smallest second cluster index.

Be aware that this tie-breaking strategy may not produce identical results to scipy.cluster.hierarchy.linkage().
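One convenient way to realize this protocol: represent each candidate merge as a (distance, i, j) tuple, since Python compares tuples element by element. The candidate values below are hypothetical, purely for illustration:

```python
# Hypothetical candidate merges at one HAC step, as (distance, i, j)
# tuples with i < j. min() compares tuples element by element, which is
# exactly the protocol: smallest distance, then smallest first cluster
# index, then smallest second cluster index.
candidates = [(2.0, 1, 4), (2.0, 0, 5), (2.0, 0, 3), (2.5, 0, 1)]
best = min(candidates)
print(best)  # (2.0, 0, 3)
```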

Challenge Options

If you wish to continue exploring the data, we challenge you to implement a complete linkage option and compare the results with single linkage. You may also wish to use the results of your HAC algorithm to choose an appropriate k for k-means, and implement that clustering as well. Do you get the same clusters every time? If not, how do they differ? Is there anything meaningful there, do you think?
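For the complete-linkage option, only the cluster-to-cluster distance rule changes. A small sketch of the two rules side by side (the function name and signature are illustrative, not part of the required interface):

```python
import numpy as np

def cluster_distance(points, A, B, linkage="single"):
    """Distance between clusters A and B (lists of point indices
    into points, a list of (x, y) tuples)."""
    d = [np.hypot(points[p][0] - points[q][0],
                  points[p][1] - points[q][1])
         for p in A for q in B]
    # Single linkage: closest pair. Complete linkage: farthest pair.
    return min(d) if linkage == "single" else max(d)
```

Complete linkage tends to produce more compact, similarly sized clusters, which can make the comparison with single linkage interesting on this data.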

Submission

Please submit your code in a file called ten_hundred.py. All code should be contained in functions or under an

if __name__ == "__main__":

check so that it will not run if your code is imported into another program.
