Next Steps
Several factors kept our research from being as specific as we originally intended. The EPA dataset reports toxics release quantities separately by release source. We summed these quantities into a single “total toxics” attribute so that we could visualize and compare two variables directly: age-adjusted cancer rate and total toxics release by state. This aggregation simplified the comparison but discarded the per-source breakdown.
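The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the column names (air_release, water_release, land_release) and the sample rows are hypothetical stand-ins for the EPA release-source columns, and treating a missing quantity as zero is an assumption.

```python
# Hypothetical column names; the real EPA release-source columns may differ.
RELEASE_COLUMNS = ["air_release", "water_release", "land_release"]


def add_total_toxics(rows, release_columns=RELEASE_COLUMNS):
    """Return copies of the rows with a 'total_toxics' field added.

    Each row is a dict. Missing or None quantities are counted as 0.0,
    which is an assumption made for this sketch.
    """
    result = []
    for row in rows:
        total = sum(float(row.get(col) or 0.0) for col in release_columns)
        result.append({**row, "total_toxics": total})
    return result


# Made-up sample rows, for illustration only.
sample = [
    {"state": "A", "year": 2010, "air_release": 10.0, "water_release": 2.5, "land_release": 0.5},
    {"state": "B", "year": 2010, "air_release": 4.0, "water_release": None, "land_release": 1.0},
]
totals = add_total_toxics(sample)
```

Keeping the original per-source fields alongside the new total means the breakdown remains available if a later analysis needs it.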
A future analysis could use lagged pollution as a predictor of cancer rates, since pollution likely affects health only over time. Adding a lag of several years between the total chemical release estimates and the cancer rate might provide greater insight into the cancer and pollution trends we originally hoped to understand.
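One way such a lag could be set up is sketched below. The function name pair_with_lag, the (state, year) keying, the five-year lag, and the sample numbers are all assumptions chosen for illustration; the appropriate lag length is an open question that we did not investigate.

```python
def pair_with_lag(cancer_rate, total_toxics, lag_years):
    """Pair each cancer rate with the toxics total from lag_years earlier.

    Both inputs are dicts keyed by (state, year). Observations with no
    toxics value available lag_years earlier are dropped.
    Returns a list of (state, year, lagged_toxics, rate) tuples.
    """
    pairs = []
    for (state, year), rate in sorted(cancer_rate.items()):
        earlier = total_toxics.get((state, year - lag_years))
        if earlier is not None:
            pairs.append((state, year, earlier, rate))
    return pairs


# Made-up values, for illustration only.
cancer = {("A", 2005): 440.0, ("A", 2010): 450.0, ("A", 2011): 455.0}
toxics = {("A", 2005): 100.0, ("A", 2006): 90.0}

lagged = pair_with_lag(cancer, toxics, lag_years=5)
```

The resulting pairs could then be fed to the same visualizations and classifiers used for the unlagged data. Note that each added year of lag removes the earliest years of cancer data from the comparison.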
Since our CDC cancer data lacked county-level granularity, we were unable to investigate more localized trends between chemical release estimates and cancer incidence. Future work with county-level data could examine where different chemicals are released geographically and what the cancer rates are in those locales.
Although we grouped states into timezone-based regions, comparing urban and rural states may reveal new patterns in the cancer and chemical release estimate data. Rural states and regions likely have less chemical pollution than more urban areas, so such differences could account for hidden patterns in the datasets that we examined.
Finally, when binning the cancer rate and chemical release levels, we assumed that both were normally distributed. Based on the data we collected, this appeared true, especially for the CDC cancer rate data, so we believed that a z-score based method would be sufficient. However, if the data were not truly normally distributed, this binning method would have imposed bin boundaries suited to a normal distribution and biased our results. In particular, it could have reduced the ability of our classification algorithms, such as Gaussian Naive Bayes, to predict every class label with sufficient accuracy, since some class labels had very few values. In the future, other statistical measures could be used to identify the specific type of distribution, such as Boltzmann or exponential, and the data could be binned based on the characteristics of that distribution.
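The sparse-class problem can be seen in a small sketch of z-score binning. The cutoff of one standard deviation, the three labels, and the sample values below are assumptions for illustration and are not the thresholds or data from our analysis.

```python
from collections import Counter
from statistics import mean, pstdev


def zscore_bins(values, cutoff=1.0):
    """Label each value 'low', 'medium', or 'high' by its z-score.

    Values more than `cutoff` standard deviations below the mean are 'low',
    more than `cutoff` above are 'high', and everything else is 'medium'.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    labels = []
    for value in values:
        z = (value - mu) / sigma
        if z < -cutoff:
            labels.append("low")
        elif z > cutoff:
            labels.append("high")
        else:
            labels.append("medium")
    return labels


# A deliberately right-skewed, made-up sample.
skewed = [1, 1, 1, 2, 2, 3, 4, 20]
counts = Counter(zscore_bins(skewed))
```

On this skewed sample the single large value inflates the mean and standard deviation, so the 'low' bin stays empty and the 'high' bin holds one value. Checking bin counts like this before training would flag classes that are too sparse for a classifier to learn.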