Data Sources and Research Question


We chose to examine cancer data from the Centers for Disease Control and Prevention (CDC) and chemical release data from the Environmental Protection Agency (EPA) because together these databases contain the information needed to determine whether a relationship exists between the rate of cancer diagnoses and the rate of chemical toxin release within different states. For each year within our timeframe, we can obtain the age-adjusted rate of cancer mortality for each state (for specific types of cancer as well as an aggregated rate across all types) and the rates of chemical toxin release, separated by source of pollution.
We have collected two datasets from the CDC: a United States Cancer Statistics CSV file containing very clean data for the years 1999-2016, and a second CDC cancer dataset obtained through an API, with data for 2010-2016. The latter contains many more patient attributes than the former, but the former has already been preprocessed well and provides the overall age-adjusted cancer rate, which is central to our investigation.
Other researchers have explored this relationship using the CDC’s cancer data and the EPA’s toxin release data. While we use the same two sources, we are enriching the pollution data with an unconventional source: tweets by environmental organizations and news organizations. Our motivation for collecting this additional Twitter data is that the EPA’s toxics release data is self-reported by companies. Self-reported data can have flaws, including, but not limited to, intentional underreporting and human error. The tweets may capture chemical pollution documented by individuals, newspapers, and other researchers who observed pollution that companies did not report to the EPA. We hope this will offer a different perspective that enhances understanding of the topic.
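The state-by-year comparison described in this section can be sketched as follows. This is a minimal illustration only, not our actual pipeline: the field names (age_adjusted_rate, release_per_capita), the state labels, and all numeric values are hypothetical placeholders, and the real CDC and EPA files use their own column names. The sketch joins the two sources on (state, year) and computes a Pearson correlation over the matched pairs.

```python
from math import sqrt

# Hypothetical sample records; values are invented for illustration only.
cancer_rows = [
    {"state": "State A", "year": 2010, "age_adjusted_rate": 100.0},
    {"state": "State A", "year": 2011, "age_adjusted_rate": 110.0},
    {"state": "State B", "year": 2010, "age_adjusted_rate": 120.0},
    {"state": "State B", "year": 2011, "age_adjusted_rate": 130.0},  # no matching release row
]
release_rows = [
    {"state": "State A", "year": 2010, "release_per_capita": 1.0},
    {"state": "State A", "year": 2011, "release_per_capita": 2.0},
    {"state": "State B", "year": 2010, "release_per_capita": 3.0},
]


def join_on_state_year(cancer, releases):
    """Return (release, cancer_rate) pairs for every (state, year) present in both inputs."""
    release_by_key = {(r["state"], r["year"]): r["release_per_capita"] for r in releases}
    pairs = []
    for row in cancer:
        key = (row["state"], row["year"])
        if key in release_by_key:  # inner join: unmatched rows are dropped
            pairs.append((release_by_key[key], row["age_adjusted_rate"]))
    return pairs


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two matched pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


matched = join_on_state_year(cancer_rows, release_rows)
r = pearson(matched)
```

A correlation computed this way only measures linear association between the two rates; it does not by itself establish that toxin release causes cancer, and any real analysis would also need to handle missing years and differences in how each source defines its rates.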
Area of Research and Ethical Considerations
Our project addresses a healthcare-oriented data science research question, and we have based our research methods on existing studies of the link between certain carcinogenic chemicals and the development of cancer in patients. Though the theory of such a relationship is not wholly new, we determined that this area of oncology could benefit significantly from additional analyses of this potential link, as confirming or refuting our hypothesis may imply numerous changes not only to medical policy, but to environmental and social policy as well.
Our project relies on empirical evidence to determine whether a relationship exists between these two variables, but we have also placed a number of ethical considerations at the center of this inquiry. All of the data that we have collected is readily available to the public, as both the CDC and EPA are funded by the United States government. Additionally, the patient data has been anonymized through aggregation, which protects patient privacy.
This is an important issue because every day, Americans are developing cancer as a result of carcinogenic toxins released by various industries. In theory, these toxin-induced cancer cases could be prevented or reduced if the relationship between the toxins and cancer were better understood. Our project will enrich this body of knowledge, and in doing so will support efforts to reduce toxin-induced cancer, even if only in a small way.
Our Goal
Given the prominence of environmental issues in recent federal legislation, this problem has only grown more relevant. Our goal is for the conclusions of our research to inform measurable and effective changes to federal legislation, give medical professionals additional insight into the link between cancer and toxin release, and give the public an easily comprehensible summary of this relationship so that they can make appropriate decisions to reclaim agency over their health and protect it if necessary.