
Analyses

Basic Statistics and Data Characteristics:

To get an overview of our data and its general characteristics, we calculated basic statistics such as the mean, median, and mode for the important attributes of our data and plotted these attributes in several ways. We then used these statistics to identify outliers in our data and to decide how to handle those data points. Based on the plots and statistics, we assumed that the age-adjusted cancer rate and the chemical levels, among other attributes, were normally distributed, and we chose a z-score normalization method to identify outliers based on the z-score.
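The z-score outlier check described above can be sketched as follows; the column name and values are illustrative stand-ins, not taken from the real dataset, and the 2.5 cutoff is one reasonable choice:

```python
import numpy as np
import pandas as pd

# Toy stand-in for one attribute (e.g. an age-adjusted cancer rate column);
# the column name and values here are illustrative, not from the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame({"age_adjusted_rate": np.append(rng.normal(450, 8, 50), 600.0)})

# z-score each value against the column mean and standard deviation
col = df["age_adjusted_rate"]
z = (col - col.mean()) / col.std(ddof=0)

# flag values far from the mean (here, more than 2.5 standard deviations)
outliers = df[z.abs() > 2.5]
print(outliers)
```

Under the normality assumption, a |z| cutoff of 2.5 flags roughly the most extreme 1% of values, which is why a z-score rule gives a consistent outlier definition across attributes with different scales.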

Linear Regression:

To examine our continuous data attributes, such as the age-adjusted cancer rate and the amounts of chemical pollutants, we created scatter plots between these data series and performed a least-squares regression analysis to determine whether there is any linear relationship between them. This type of analysis gives us a better idea of the trends between different variables in our data sources and the strength of those trends, if any.
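A minimal sketch of the least-squares step, using `scipy.stats.linregress`; the two arrays are made-up stand-ins for a pollutant-release series and a cancer-rate series:

```python
import numpy as np
from scipy import stats

# Illustrative arrays standing in for per-state chemical release and
# age-adjusted cancer rate; real values would come from the merged dataset.
release = np.array([1.2, 2.5, 3.1, 4.8, 5.0, 6.3, 7.7, 8.4])
cancer_rate = np.array([430.0, 441.0, 445.5, 452.0, 455.3, 462.1, 470.9, 476.4])

# Ordinary least-squares fit; r measures the strength of the linear trend
# and the p-value tests whether the slope differs from zero.
fit = stats.linregress(release, cancer_rate)
print(f"slope={fit.slope:.2f}, r={fit.rvalue:.3f}, p={fit.pvalue:.4f}")
```

The correlation coefficient `r` quantifies the strength of the trend, which is what distinguishes a visually suggestive scatter plot from a statistically meaningful relationship.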

Hypothesis Testing:

Since the values of our continuous variables, such as the age-adjusted cancer rate, might vary across our categorical variables, such as each state’s (timezone) region, we performed an ANOVA (Analysis of Variance) and Student’s paired t-tests to investigate the nature of these differences and whether they are statistically significant at the 95% confidence level. Importantly, we performed these hypothesis tests to identify whether the categorical variables encapsulated the differences in the continuous variables analyzed.
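A sketch of this testing workflow with `scipy.stats`; the region groupings and values are hypothetical, and an independent-samples t-test is shown here as the pairwise follow-up (the pairing scheme for the real data is not reproduced):

```python
from scipy import stats

# Hypothetical cancer-rate samples grouped by timezone region
# (the group names are real regions, the values are made up).
eastern = [455.1, 462.3, 470.0, 458.8, 465.2]
central = [448.9, 452.4, 450.1, 447.3, 455.0]
pacific = [431.2, 428.8, 435.4, 430.0, 433.1]

# One-way ANOVA: do the group means differ somewhere?
f_stat, p_anova = stats.f_oneway(eastern, central, pacific)

# Pairwise follow-up t-test between two specific regions
t_stat, p_t = stats.ttest_ind(eastern, pacific)

# significant at the 95% confidence level when p < 0.05
print(p_anova < 0.05, p_t < 0.05)
```

ANOVA answers only whether any group mean differs; the pairwise t-tests then localize which specific regions drive the difference.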

Clustering:

Since our data contains many attributes and characteristics, we looked to examine relationships within the data in terms of clustering. To this end, we applied the k-means, DBSCAN, and hierarchical clustering algorithms to our combined CDC and EPA dataset. After running these algorithms, we examined the resulting clusterings and attempted to identify the important aspects of these clusters and what they might signify in terms of the data attributes. As an example, we tried to determine whether the clusters had any relation to the timezone regions in which states are found.
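The three algorithms can be run side by side with scikit-learn, as sketched below on synthetic two-feature data standing in for the merged CDC/EPA attributes (all names and parameter values here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature stand-in for the merged CDC/EPA attributes:
# two well-separated groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (20, 2)),
               rng.normal([3, 3], 0.3, (20, 2))])

# Standardize so no single attribute dominates the distance metric.
X = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print(len(set(kmeans_labels)), len(set(hier_labels)))
```

Comparing the label assignments against a categorical attribute such as timezone region (e.g. with a cross-tabulation) is one way to check whether the clusters track that grouping.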

Classification and Machine Learning:

To investigate our additional hypothesis that a state's cancer rate level (e.g., high, medium, or low, based on z-score) can be predicted by a model, we also applied machine learning classifiers to our combined dataset. Importantly, we applied the Random Forest, Naive Bayes, Decision Tree, and k-Nearest Neighbors classifiers to our dataset to examine the ability of these classification models to predict the cancer rate level for different states over time.
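A sketch of how these four classifiers can be compared on a held-out split; the features, labels, and labeling rule below are synthetic stand-ins for the pollutant attributes and the z-score-derived cancer-rate level:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic features and a toy label rule standing in for the real
# attributes and the binned cancer-rate level.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "RandomForest": RandomForestClassifier(random_state=0),
    "NaiveBayes": GaussianNB(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
print(scores)
```

Evaluating every model on the same train/test split keeps the accuracy comparison between classifiers fair.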

Networks:

Finally, to examine our data in terms of the similarity of states, we constructed a similarity-based network in which each node is a state (including DC) and the edge connecting each pair of nodes is weighted by the attributes the two states share. However, in the final network, only the subset of edges at or above a certain weight threshold was included. Essentially, we created a network that displays only significant similarities between states.
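The edge-weighting and thresholding idea can be sketched in plain Python; the attribute set and threshold below are hypothetical choices, not the ones used in the actual network:

```python
from itertools import combinations

# Tiny illustrative attribute vectors per state
# (state abbreviations are real; attributes and values are made up).
states = {
    "NY": {"high_cancer": 1, "high_release": 0, "eastern": 1},
    "NJ": {"high_cancer": 1, "high_release": 0, "eastern": 1},
    "AK": {"high_cancer": 0, "high_release": 1, "eastern": 0},
}

# Edge weight = number of attributes the two states share.
edges = []
for a, b in combinations(states, 2):
    weight = sum(states[a][k] == states[b][k] for k in states[a])
    edges.append((a, b, weight))

# Keep only edges at or above a chosen weight threshold.
THRESHOLD = 3
network = [e for e in edges if e[2] >= THRESHOLD]
print(network)
```

Starting from the complete graph and pruning low-weight edges is what leaves only the "significant similarity" edges in the final network.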

Data Preprocessing

To clean the attributes collected from the CDC and EPA, we chose to perform a number of preprocessing techniques on the data to prepare it for further analysis, handle missing/bad values, and create new numerical attributes.

 

In the CDC CDI data, we found that certain values were missing for certain states early in the collection period: some attributes, such as racial/ethnic stratification, do not contain data until several years into the time period being analyzed, and there are missing values for confidence limits and certain cancer rates scattered throughout the dataset. Since deleting rows with null values did not significantly reduce the size of our dataset, we chose this technique to make the CDC CDI dataset more suitable for further calculations. The USCS data was near-perfect aside from a small number of missing values very early in the collection period for the age-adjusted cancer rate, which we did not remove.
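The null-row deletion step amounts to a `dropna` call in pandas; the miniature frame below (column names included) is an illustrative stand-in for the CDC CDI table:

```python
import numpy as np
import pandas as pd

# Miniature stand-in for the CDC CDI table; column names are illustrative.
cdi = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AR"],
    "cancer_rate": [448.2, np.nan, 455.7, 450.1],
    "lower_ci": [440.0, 430.5, np.nan, 443.3],
})

# Drop any row containing a null value, as done for the CDC CDI data.
cleaned = cdi.dropna()
print(len(cdi), "->", len(cleaned))
```

Checking the before/after row counts, as printed here, is how one verifies that row deletion does not shrink the dataset by a significant amount.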

For the EPA data, we calculated a cleanliness score column based on the number of null values in each row. We dropped rows containing null values or values of 0, since they were not relevant to our analysis and would likely have reduced the accuracy of our results by skewing the data toward lower values.
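A sketch of the cleanliness score and the null/zero filter in pandas; the column names and values are illustrative stand-ins for the EPA table:

```python
import numpy as np
import pandas as pd

# Miniature stand-in for the EPA table; names and values are illustrative.
epa = pd.DataFrame({
    "state": ["AL", "AK", "AZ", "AR"],
    "air_release": [1200.0, np.nan, 800.0, 0.0],
    "water_release": [300.0, 150.0, np.nan, 90.0],
})

value_cols = ["air_release", "water_release"]

# Cleanliness score: count of non-null values in each row's numeric columns.
epa["cleanliness"] = epa[value_cols].notna().sum(axis=1)

# Drop rows containing nulls or zeros, as in the preprocessing step.
mask = epa[value_cols].notna().all(axis=1) & (epa[value_cols] != 0).all(axis=1)
cleaned = epa[mask]
print(cleaned[["state", "cleanliness"]])
```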

After performing our basic preprocessing techniques, we created a new merged dataset containing data from both the CDC USCS dataset (age-adjusted cancer rate) and the EPA dataset (rates of toxics release from different sources). Outliers were identified using z-scores, though we did not eliminate them, since deleting those data points would have left certain states and years without data.
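The merge step can be sketched with a pandas join on state and year; the key and column names below are illustrative, not necessarily those of the real files:

```python
import pandas as pd

# Toy USCS and EPA frames; keys and column names are illustrative.
uscs = pd.DataFrame({
    "state": ["AL", "AK"], "year": [2010, 2010],
    "cancer_rate": [448.2, 420.5],
})
epa = pd.DataFrame({
    "state": ["AL", "AK"], "year": [2010, 2010],
    "total_release": [1500.0, 98000.0],
})

# Inner join on state and year, keeping only rows present in both sources.
merged = uscs.merge(epa, on=["state", "year"], how="inner")
print(merged)
```

An inner join keeps only state-year pairs present in both sources, which avoids reintroducing missing values into the merged dataset.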

Though we have chosen to keep the data from states/years (such as Alaska and Utah) where the reported toxics release value was abnormally high compared to other points, we have omitted this data from certain visualizations in order to better present the variation of this attribute between the remaining states.

Importantly, to examine the cancer incidence rate and the total chemical release estimates (both overall and per capita), we assumed these variables were normally distributed. We therefore used a z-score-based method to bin the values into very low, low, medium, high, and very high categories, based on the z-score of the cancer rate or chemical release amount, as follows:

Very Low: (-inf, -2.5); Low: [-2.5, -1); Medium: [-1, 1]; High: (1, 2.5]; Very High: (2.5, inf)
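The binning above maps naturally onto `pd.cut` over z-scores, as sketched below; the z-score values are made up, and `pd.cut` uses uniformly half-open intervals, so the closure at the boundary points 1 and 2.5 is approximated rather than matched exactly:

```python
import numpy as np
import pandas as pd

# Illustrative z-scores for one attribute.
z = pd.Series([-3.0, -1.5, 0.0, 2.0, 4.0])

# Bin edges follow the stated cutoffs; right=False makes the intervals
# left-closed, matching the lower bounds [-2.5 and [-1 (boundary handling
# at exactly 1 and 2.5 is approximated).
bins = [-np.inf, -2.5, -1.0, 1.0, 2.5, np.inf]
labels = ["very low", "low", "medium", "high", "very high"]

levels = pd.cut(z, bins=bins, labels=labels, right=False)
print(list(levels))
```

Because the bins are defined in z-score units, the same cutoffs apply uniformly to any attribute after standardization, which is what makes the scheme consistent across variables.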

We chose this method because it is statistically grounded and offers a consistent way of binning continuous numerical attributes.
