Machine Learning and Classification
Decision Tree
The decision tree classifier is a supervised classification method that repeatedly splits the data to build a model for predicting class labels. It implements a recursive partitioning algorithm and terminates when the data in a node is satisfactorily "homogeneous" or another stopping condition is met. For this classifier, the accuracy was 0.83 for predicting cancer rate level. While precision is high for the well-represented medium class label, it decreases for the other classes and is quite low for very high. This is likely due to the lack of data points with those class labels, as seen in the confusion matrix below.
Figure 1. Confusion Matrix for Decision Tree.

Figure 2. Classifier Results for Decision Tree.

Attributes in model:
['YEAR', 'STATE_ABBR', 'AVG_REL_EST_AIR_STACK',
'AVG_REL_EST_ENERGY_RECOVERY', 'AVG_REL_EST_RECYCLING', 'AVG_REL_EST_OTH_DISP', 'AVG_REL_EST_POTW_NON_METALS', 'AVG_REL_EST_WATER', 'AVG_REL_EST_SURF_IMP', 'AVG_REL_EST_UNINJ_IIV', 'AVG_REL_EST_AIR_FUG', 'AVG_REL_EST_LAND_TREA', 'AVG_REL_EST_WASTE_TREATMENT', 'AVG_REL_EST_OTH_LANDF', 'AVG_REL_EST_RCRA_C', 'AVG_REL_EST_TOTAL_ON_OFFSITE_RELEASE', 'AVG_REL_EST_UNINJ_I', 'AVG_REL_EST_SI_5.5.3B', 'AVG_REL_EST_TOTAL_ONSITE_RELEASE', 'AVG_REL_EST_SI_5.5.3A', 'AVG_REL_EST_POTW_RELEASE', 'AVG_REL_EST_POTW_TREATMENT', 'AVG_REL_EST_TOTAL', 'POPULATION', 'AVG_REL_EST_TOTAL_PER_CAPITA']
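As a rough sketch of this workflow (not the project's actual code), the snippet below fits a scikit-learn DecisionTreeClassifier on a subset of these attributes and reports accuracy, the confusion matrix, and per-class precision. The DataFrame name df, the target column name CANCER_RATE_LEVEL, the 80/20 train/test split, and the max_depth setting are all assumptions.

```python
# Illustrative sketch only -- `df`, the target column name 'CANCER_RATE_LEVEL',
# the 80/20 split, and max_depth are assumptions, not the project's actual code.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

FEATURES = [
    'YEAR', 'STATE_ABBR', 'POPULATION',
    'AVG_REL_EST_TOTAL', 'AVG_REL_EST_TOTAL_PER_CAPITA',
]  # shortened here; the model used the full attribute list above
TARGET = 'CANCER_RATE_LEVEL'  # placeholder name for the class-label column

def train_decision_tree(df: pd.DataFrame) -> DecisionTreeClassifier:
    X = df[FEATURES].copy()
    # STATE_ABBR is a string column, so encode it as integer codes before fitting.
    X['STATE_ABBR'] = X['STATE_ABBR'].astype('category').cat.codes
    y = df[TARGET]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Recursive partitioning: splits continue until leaves are homogeneous
    # or the stopping condition (here, max_depth) is reached.
    clf = DecisionTreeClassifier(max_depth=5, random_state=42)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print('accuracy:', accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
    return clf
```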
Random Forest
Random Forest is an ensemble classifier that builds multiple decision trees and combines (cross-references) each individual tree's predicted label. These largely uncorrelated models work in conjunction to produce results that are generally more accurate, and the classifier achieved an accuracy score of 0.89 in predicting the class labels indicating the cancer rate level for a particular state and year. Precision was fairly high for the medium, high, and low class labels but zero for very low. Again, the lack of data points for some class labels likely reduced the model's precision. This suggests that a Random Forest model works quite well for predicting a state's cancer rate level from various other characteristics, including chemical release levels.
Figure 3. Confusion Matrix for Random Forest.

Figure 4. Classifier Results for Random Forest.

Attributes in model:
['YEAR', 'STATE_ABBR', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'CHEM_RATE_LEVEL']
0 : high
1 : low
2 : medium
3 : very low
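A minimal sketch of this setup is shown below, assuming a pandas DataFrame df with the four attributes above plus a target column (called CANCER_RATE_LEVEL here as a placeholder); the 80/20 split and n_estimators=100 are also assumptions. Note that scikit-learn's LabelEncoder sorts labels alphabetically, which is consistent with the 0-3 mapping listed above.

```python
# Illustrative sketch only -- `df`, 'CANCER_RATE_LEVEL', the split, and
# n_estimators are assumptions, not the project's actual code.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

FEATURES = ['YEAR', 'STATE_ABBR', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'CHEM_RATE_LEVEL']
TARGET = 'CANCER_RATE_LEVEL'  # placeholder name for the class-label column

def train_random_forest(df: pd.DataFrame) -> RandomForestClassifier:
    X = df[FEATURES].copy()
    # Encode the categorical columns as integer codes for scikit-learn.
    for col in ('STATE_ABBR', 'CHEM_RATE_LEVEL'):
        X[col] = X[col].astype('category').cat.codes

    # LabelEncoder sorts labels alphabetically, matching the mapping above:
    # 0 = high, 1 = low, 2 = medium, 3 = very low.
    y = LabelEncoder().fit_transform(df[TARGET])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Each tree is grown on a bootstrap sample with random feature subsets;
    # the forest's majority vote gives the predicted class.
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print('accuracy:', accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
    return clf
```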
Naive Bayes
The Gaussian Naive Bayes algorithm assumes that each feature has an independent and equal effect on the predicted class label. For this dataset, the algorithm was less successful than the others we used at predicting a state's cancer rate level, achieving an accuracy of 0.67. This suggests that certain attributes carried more weight than others and must be examined together to adequately understand their effect on a state's cancer rate level. For Naive Bayes, precision was high only for the medium class label and low or zero for the other classes. This likely reflects the inability of this type of statistical model to predict class labels for a data set with a very uneven class distribution.
Figure 5. Confusion Matrix for Naive Bayes.

Figure 6. Classifier Results for Naive Bayes.

Attributes in model:
['YEAR', 'STATE_ABBR', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'CHEM_RATE_LEVEL']
0 : high
1 : low
2 : medium
3 : very low
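A minimal sketch of this setup follows, under the same assumptions as before (DataFrame df, placeholder target column CANCER_RATE_LEVEL, 80/20 split); it is not the project's actual code.

```python
# Illustrative sketch only -- `df` and 'CANCER_RATE_LEVEL' are assumed names.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

FEATURES = ['YEAR', 'STATE_ABBR', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'CHEM_RATE_LEVEL']
TARGET = 'CANCER_RATE_LEVEL'  # placeholder name for the class-label column

def train_naive_bayes(df: pd.DataFrame) -> GaussianNB:
    X = df[FEATURES].copy()
    # Encode the categorical columns as integer codes for scikit-learn.
    for col in ('STATE_ABBR', 'CHEM_RATE_LEVEL'):
        X[col] = X[col].astype('category').cat.codes
    y = df[TARGET]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # GaussianNB models each feature as a class-conditional Gaussian and treats
    # features as independent -- the "naive" assumption described above.
    clf = GaussianNB().fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print('accuracy:', accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, zero_division=0))
    return clf
```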

K Nearest Neighbors (KNN)
The KNN classifier performed fairly well on the training set but not as well on the test set, which comprised 80% of the data. Together with the confusion matrix and the ROC curve plot (Figure 9), this indicates that the KNN classifier did not predict the class label especially well, though it still outperformed a random classifier. Importantly, in this case we classified whether a state's cancer rate was above or below the mean across all states.
Figure 7. Confusion Matrix for K Nearest Neighbors.

Figure 8. Classifier Results for K Nearest Neighbor.

Attributes in Model:
['YEAR', 'STATE_ABBR', 'AVG_REL_EST_AIR_STACK', 'AVG_REL_EST_ENERGY_RECOVERY', 'AVG_REL_EST_RECYCLING', 'AVG_REL_EST_OTH_DISP', 'AVG_REL_EST_POTW_NON_METALS', 'AVG_REL_EST_WATER', 'AVG_REL_EST_SURF_IMP', 'AVG_REL_EST_UNINJ_IIV', 'AVG_REL_EST_AIR_FUG', 'AVG_REL_EST_LAND_TREA', 'AVG_REL_EST_WASTE_TREATMENT', 'AVG_REL_EST_OTH_LANDF', 'AVG_REL_EST_RCRA_C', 'AVG_REL_EST_TOTAL_ON_OFFSITE_RELEASE', 'AVG_REL_EST_UNINJ_I', 'AVG_REL_EST_SI_5.5.3B', 'AVG_REL_EST_TOTAL_ONSITE_RELEASE', 'AVG_REL_EST_SI_5.5.3A', 'AVG_REL_EST_POTW_RELEASE', 'AVG_REL_EST_POTW_TREATMENT', 'AVG_REL_EST_TOTAL', 'POPULATION', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'region']
Figure 9. ROC Curve for K Nearest Neighbor.
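The sketch below illustrates this binary setup with scikit-learn's KNeighborsClassifier. The DataFrame df, the raw cancer-rate column name CANCER_RATE, k = 5, and the feature-scaling step are assumptions; the 80% test split follows the description above.

```python
# Illustrative sketch only -- `df`, 'CANCER_RATE', and k=5 are assumed names/values.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

def train_knn(df: pd.DataFrame, k: int = 5):
    # Shortened feature set; the model used the full attribute list above.
    X = df[['YEAR', 'AVG_REL_EST_TOTAL_PER_CAPITA', 'POPULATION']].copy()
    # Encode the categorical columns as integer codes.
    for col in ('STATE_ABBR', 'region'):
        X[col] = df[col].astype('category').cat.codes

    # Binary target: is the state's cancer rate above the mean across all states?
    y = (df['CANCER_RATE'] > df['CANCER_RATE'].mean()).astype(int)

    # 80% of the data held out for testing, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.8, random_state=42, stratify=y)

    # KNN is distance-based, so scale features using statistics from the training split only.
    scaler = StandardScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)

    print('train accuracy:', clf.score(X_train_s, y_train))
    print('test accuracy:', clf.score(X_test_s, y_test))
    print(confusion_matrix(y_test, clf.predict(X_test_s)))

    # ROC curve from the predicted probability of the positive class;
    # AUC > 0.5 indicates performance better than a random classifier.
    proba = clf.predict_proba(X_test_s)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, proba)
    print('AUC:', roc_auc_score(y_test, proba))
    return clf, (fpr, tpr)
```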
