Conclusion
CONCLUSIONS AND KEY TAKEAWAYS
Our analyses demonstrated several things about the relationship between cancer rates and carcinogen emissions in the United States. First, our linear regression revealed that time-zone-based regions have distinct differences in cancer rates, while the differences in emissions are less significant. However, while time zone captured some of the regional differences, our clustering and network analysis revealed that other, more localized regional groups may better capture the variance between the states.
Looking at the individual data sets, beginning with the carcinogen emissions, we found that Alaska had a very high average rate of total toxics release for the years 1999 to 2016. Besides Alaska, the majority of states with high rates of toxics release were located in the Mountain time zone: Idaho, Wyoming, Nevada, and Arizona. More generally, both the cancer rates and the toxics releases followed a downward trend across almost all states over the 18-year period.
For the average age-adjusted cancer rate over time, we analyzed the rates for the years 1999-2016 and observed that Kentucky had one of the highest rates in the nation, while other states, such as New Mexico, had lower rates. The majority of the remaining states had rates very similar to one another. As with the toxics release rate over time, there was a general downward trend across all time zone regions we observed, though the East and Central regions saw a comparatively smaller decrease overall.
Significant Findings
One of the more interesting relationships we observed was a relatively high correlation in age-adjusted cancer rates between regions. The Alaska/Hawaii and Central regions had much lower correlations, however, suggesting that their rates did not track the other regions linearly over time to the same degree.
Directly plotting the average total chemical release rate (an attribute we calculated from EPA data) and the cancer rate over time, and labeling each point with the state's region, revealed some interesting links between these two attributes as well. Overall, we observed higher cancer rates for the East and lower cancer rates for the Mountain region states. The Eastern states also appear to have a statistically significantly different cancer rate compared to other regions, according to our ANOVA test and t-test heatmap.
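As an illustration of the kind of regional comparison described above, the following is a minimal sketch of a one-way ANOVA F statistic computed by hand. It assumes the test compared mean cancer rates across regions; the function name `one_way_anova_f` and the sample values are hypothetical and are not taken from our data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a dict of {group: [values]}."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    k = len(groups)          # number of groups (regions)
    n = len(all_values)      # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (v - mean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical age-adjusted rates per region (illustrative values only).
example_rates = {
    "East": [470.0, 465.0, 480.0],
    "Central": [455.0, 450.0, 460.0],
    "Mountain": [420.0, 415.0, 430.0],
}
f_stat = one_way_anova_f(example_rates)
print(f"F = {f_stat:.2f}")
```

A large F value indicates that the differences between region means are large relative to the spread within each region; the p-value would then be read from the F distribution with (k - 1, n - k) degrees of freedom.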
After using a z-score to identify outliers in the cancer rate data and labeling each data point with one of five class labels (very low, low, medium, high, very high), we observed that, aside from a small number of outliers, the majority of states fell within the “medium” range for cancer rates. This affected our machine learning and classification techniques in two ways: some class labels, such as “very high”, had no data points, and the distribution of class labels was very uneven, with “medium” rate states occurring much more often than the others.
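The labeling step can be sketched as follows. This is a minimal example, assuming cutoffs at one and two standard deviations from the mean; those thresholds, the function name `zscore_labels`, and the sample rates are assumptions for illustration rather than the exact values we used.

```python
from collections import Counter
from statistics import mean, pstdev

def zscore_labels(rates):
    """Map each state to one of five labels based on its z-score."""
    mu = mean(rates.values())
    sigma = pstdev(rates.values())  # population standard deviation
    labels = {}
    for state, rate in rates.items():
        z = (rate - mu) / sigma
        # Assumed cutoffs: |z| >= 2 is "very", 1 <= |z| < 2 is low/high.
        if z <= -2:
            labels[state] = "very low"
        elif z <= -1:
            labels[state] = "low"
        elif z < 1:
            labels[state] = "medium"
        elif z < 2:
            labels[state] = "high"
        else:
            labels[state] = "very high"
    return labels

# Hypothetical rates (illustrative values only, not our data).
sample_rates = {"A": 400, "B": 410, "C": 405, "D": 395, "E": 390, "F": 480}
sample_labels = zscore_labels(sample_rates)
print(Counter(sample_labels.values()))
```

Even in this small example, nearly every state lands in "medium" and several labels are empty, which is the same class imbalance that complicated our classification step.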
We also discovered that because Alaska was such an extreme outlier in terms of toxics release, including or omitting its data in our experimental calculations yielded very different results; the data from that single state was enough to skew all of our results.
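To show how a single extreme point can change a result, here is a minimal sketch that computes a Pearson correlation with and without one outlying state. The helper names and the numbers are hypothetical and chosen only to make the effect visible; they do not reproduce our calculations.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_with_and_without(releases, rates, outlier):
    """Return (r including the outlier, r excluding the outlier)."""
    states = sorted(releases)
    r_all = pearson([releases[s] for s in states], [rates[s] for s in states])
    kept = [s for s in states if s != outlier]
    r_kept = pearson([releases[s] for s in kept], [rates[s] for s in kept])
    return r_all, r_kept

# Hypothetical release and cancer rates (illustrative values only).
releases = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0, "Outlier": 50.0}
rates = {"A": 440.0, "B": 450.0, "C": 460.0, "D": 470.0, "Outlier": 420.0}
r_all, r_kept = correlation_with_and_without(releases, rates, "Outlier")
print(f"with outlier: {r_all:.2f}, without outlier: {r_kept:.2f}")
```

In this constructed example the sign of the correlation flips when the outlier is removed, which is why it is worth reporting results both with and without such a state.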
Throughout the process of developing our research methodology, we altered the goals and implementation of our procedure slightly from our original plan. Though we had originally collected Twitter data to supplement our analysis of the quantity of pollution in an area, we found it difficult to procure a sufficient number of Tweets that both contained relevant key terms about pollution and included the geographic information necessary to tie the Tweet to a specific location. We decided that a network analysis would instead illuminate more notable relationships between certain states and their relative cancer rates and toxics release levels.
While the time zone region does not tend to be a highly accurate indicator of a state’s age-adjusted cancer rate, we did observe that states adjacent to or near each other tended to have similar cancer rates, based on our clustering and network analysis. For the network, we used a state’s time zone as one of the attributes, though we gave this factor a weight of only 0.10. Interestingly, the clustering analysis did not consider the time zone label at all, yet it still produced groupings quite similar to those of the network analysis. This is particularly notable given that the two methods find groups within our data in different ways and were built differently.
In conclusion, although we did not find any significant linear correlation between chemical release amounts and age-adjusted cancer rates, we did observe that states that were near one another (not necessarily in the same time zone region) or that had similar chemical release characteristics appeared to have similar cancer rates. This suggests that a relationship exists between these variables, but that it is not easily seen in simple linear regressions or scatter plots.
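One way to picture a network edge in which time zone carries a weight of 0.10 is a weighted similarity score between two states. The sketch below is only an assumption about how such a score could be built: the 0.45/0.45 split for the remaining weight, the min-max normalized inputs, and the function name `edge_weight` are all hypothetical.

```python
def edge_weight(a, b, w_cancer=0.45, w_release=0.45, w_zone=0.10):
    """Similarity in [0, 1] between two states.

    Each state is a dict with "cancer" and "release" values already
    normalized to [0, 1] (an assumption) and a "zone" label.
    Only w_zone = 0.10 comes from the text; the other weights are assumed.
    """
    cancer_sim = 1.0 - abs(a["cancer"] - b["cancer"])
    release_sim = 1.0 - abs(a["release"] - b["release"])
    same_zone = 1.0 if a["zone"] == b["zone"] else 0.0
    return w_cancer * cancer_sim + w_release * release_sim + w_zone * same_zone

# Hypothetical normalized values (illustrative only).
state_x = {"cancer": 0.60, "release": 0.20, "zone": "Central"}
state_y = {"cancer": 0.55, "release": 0.25, "zone": "East"}
print(f"edge weight: {edge_weight(state_x, state_y):.3f}")
```

With a weight this small, two states in different time zones can still be strongly connected if their cancer and release values are close, which is consistent with the network and the time-zone-free clustering producing similar groups.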