
Clustering

Since our data contains many attributes and characteristics, we examined relationships within the data through clustering. To that end, we attempted clustering on our combined CDC and EPA data set using the KMeans, DBSCAN and Hierarchical algorithms. After running these algorithms on our data set, we examined the resulting clusterings and tried to identify what distinguishes the clusters and what they might signify in terms of the data attributes.

 

Attributes used for clustering:

['STATE_ABBR', 'AGE_ADJUSTED_CANCER_RATE', 'AVG_REL_EST_TOTAL_PER_CAPITA']
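
As a rough illustration of the comparison described above, the following is a minimal sketch using scikit-learn. It is not the original analysis code: the data frame is filled with synthetic placeholder values, the feature scaling and the DBSCAN parameters (`eps`, `min_samples`) are assumptions, and `STATE_ABBR` is assumed to serve only as a row label rather than as a clustering input.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data: 51 rows of random values, NOT the real CDC/EPA data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "STATE_ABBR": [f"S{i:02d}" for i in range(51)],  # hypothetical labels
    "AGE_ADJUSTED_CANCER_RATE": rng.normal(450, 30, 51),
    "AVG_REL_EST_TOTAL_PER_CAPITA": rng.gamma(2.0, 5.0, 51),
})

# Assumption: only the two numeric columns are clustered.
features = ["AGE_ADJUSTED_CANCER_RATE", "AVG_REL_EST_TOTAL_PER_CAPITA"]
X = StandardScaler().fit_transform(df[features])  # scaling is an assumption

models = {
    "KMeans": KMeans(n_clusters=6, n_init=10, random_state=0),
    "Hierarchical": AgglomerativeClustering(n_clusters=6),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=3),  # parameter values are guesses
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN marks noise points with the label -1; do not count it as a cluster.
    n_clusters = len(set(labels) - {-1})
    # The silhouette score is only defined for 2..n_samples-1 distinct labels.
    # Note: for DBSCAN, noise points are treated here as one extra group.
    if 2 <= len(set(labels)) < len(X):
        score = silhouette_score(X, labels)
    else:
        score = float("nan")
    results[name] = (n_clusters, score)
    print(f"{name}: {n_clusters} clusters, silhouette = {score:.3f}")
```

The silhouette score ranges from -1 to 1, with values near 0 indicating overlapping clusters, so scores computed this way can be compared directly across the three algorithms.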

Hierarchical

cool clustersH6.png

6 Clusters

Silhouette Score: 0.381

KMeans

cool clustersK6.png

6 Clusters

Silhouette Score: 0.376

As the two plots above show, clustering on our merged data set did not yield well-defined clusters, and the silhouette scores are fairly low. Each plot has 51 points representing the US states, and each state's axis values are averages over 1999 to 2016. Colors indicate the cluster labels. With 6 clusters, KMeans and Hierarchical clustering produced nearly identical clusterings, which is consistent with their close silhouette scores (0.376 and 0.381). DBSCAN was not able to form any meaningful clusters while keeping the number of noise points low. The states toward the center of the plot seem to be characterized by middle-valued cancer rates but a wide spread of chemical release levels. To better visualize the clusters, we created a map with each state colored by its cluster, and we included the average pollution amounts and cancer rates for each cluster.
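
The agreement between two clusterings and the per-cluster averages can be computed along the following lines. This is a hedged sketch on synthetic placeholder data, not the original code: the column names come from the attribute list above, while the label column names (`kmeans_label`, `hier_label`), the scaling step, and the use of the adjusted Rand index as the agreement measure are our assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder data, NOT the real CDC/EPA values.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "STATE_ABBR": [f"S{i:02d}" for i in range(51)],  # hypothetical labels
    "AGE_ADJUSTED_CANCER_RATE": rng.normal(450, 30, 51),
    "AVG_REL_EST_TOTAL_PER_CAPITA": rng.gamma(2.0, 5.0, 51),
})
features = ["AGE_ADJUSTED_CANCER_RATE", "AVG_REL_EST_TOTAL_PER_CAPITA"]
X = StandardScaler().fit_transform(df[features])  # scaling is an assumption

# Hypothetical column names for the two sets of cluster labels.
df["kmeans_label"] = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)
df["hier_label"] = AgglomerativeClustering(n_clusters=6).fit_predict(X)

# Adjusted Rand index: 1.0 means the two partitions are identical
# (up to renaming the clusters); values near 0 mean chance-level agreement.
ari = adjusted_rand_score(df["kmeans_label"], df["hier_label"])
print(f"Adjusted Rand index, KMeans vs Hierarchical: {ari:.3f}")

# Average cancer rate and release level per cluster, in the original units.
cluster_means = df.groupby("kmeans_label")[features].mean()
print(cluster_means)
```

A table like `cluster_means` is the kind of summary that could accompany a map colored by cluster label.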

This choropleth map shows that timezone-based regions do not quite capture the regional differences in cancer rates and chemical release levels, but nearby states in different, smaller regional groupings do have noticeable similarities. Many of the East Coast states had higher cancer rates but lower average chemical release levels, while other groups of states, such as NM, AZ, HI, CA and CO, had lower cancer rates and lower average chemical release levels. Perhaps these groups are related to regional industries, like agriculture or chemical production, that have nothing to do with time zone. Altogether, the clustering results indicate that a state's neighbors and chemical release characteristics could be used to identify an approximate cancer rate, or at least a range of likely values, for the state. Importantly, neither a state's region nor its geographic location was included in any way in the attributes used for the clustering analysis, so these clusterings were generated from the pollution and cancer values alone.

dendrogram.PNG

The dendrogram above for the Hierarchical clustering algorithm shows how the clusters used above were iteratively split. The n = 6 clustering corresponds to a cut at a height of about 13.
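
The relationship between a cut height and the number of clusters can be sketched with SciPy as follows. This is an illustration on synthetic data under stated assumptions: Ward linkage is our choice (the linkage actually used is not stated), and the cut height it produces will not match the value of 13 read from the dendrogram above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic placeholder data: 51 points in 2 dimensions, NOT the real data.
rng = np.random.default_rng(2)
X = rng.normal(size=(51, 2))
n = X.shape[0]

# Ward linkage is an assumption; Z has n-1 rows, one per merge,
# and Z[:, 2] holds the merge heights in increasing order.
Z = linkage(X, method="ward")

# Option 1: ask directly for (at most) 6 flat clusters.
labels = fcluster(Z, t=6, criterion="maxclust")

# Option 2: cut at a height. After merge i (0-indexed) there are n-(i+1)
# clusters, so any height between the merge that leaves 6 clusters and the
# merge that leaves 5 clusters yields a 6-cluster partition.
lower = Z[n - 7, 2]  # height of the merge that leaves 6 clusters
upper = Z[n - 6, 2]  # height of the merge that leaves 5 clusters
cut = (lower + upper) / 2
labels_by_height = fcluster(Z, t=cut, criterion="distance")

print(f"cut height {cut:.2f} gives {len(set(labels_by_height))} clusters")
```

Reading a cut height off a dendrogram works the same way: a horizontal line at that height crosses one vertical branch per cluster.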
