
Data Attributes and Relevance

Twitter

Date tweet was posted - allows us to see a timeline of tweets and bin them by year if necessary. This data appears very clean, with little noise and few discrepancies.

Tweet text - allows us to look for terms related to pollution and the environment in addition to conducting sentiment analysis. This data has a large amount of noise, and only a small subset of tweets contains the terms we are looking for. However, the large volume of data likely compensates for this issue: about 5% of the data contains pollution-related terms that we have identified, totaling ~3,500 tweets out of ~60,000. Some tweets contain emoji characters or are in another language, so this noise must be filtered out or interpreted correctly.
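A minimal sketch of how this kind of term matching and emoji filtering could be done is shown below. The term list, the sample tweets, and the function names are hypothetical and only illustrate the idea; they are not the terms or code used in the project.

```python
import re

# Hypothetical term list; the project's actual pollution-related terms are not shown here.
POLLUTION_TERMS = ["pollution", "smog", "emissions", "toxic"]

# One case-insensitive pattern with word boundaries, so "toxic" does not match "intoxicated".
PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, POLLUTION_TERMS)) + r")\b",
    re.IGNORECASE,
)


def mentions_pollution(text):
    """Return True if the tweet text contains any of the listed terms."""
    return bool(PATTERN.search(text))


def strip_non_ascii(text):
    """Drop emoji and other non-ASCII characters (a crude filter that also removes accented letters)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# Made-up tweets for illustration.
tweets = [
    "Smog over the city again today 😷",
    "Great game last night!",
    "New report on toxic waste near the river",
    "I was intoxicated by the view",
]

matches = [strip_non_ascii(t).strip() for t in tweets if mentions_pollution(t)]
share = len(matches) / len(tweets)  # fraction of tweets with at least one term
```

Stripping non-ASCII characters is the simplest option but discards information; interpreting emoji or detecting language would need additional tooling.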

Tweet location - allows us to determine what country/city the tweet was sent from. However, this value seems to be empty for a majority of the Twitter data, so it is likely not helpful.

Tweet handle - the Twitter handle the tweet came from. This has no noise and is set within the code.

The original, uncleaned dataset we collected contained 67,589 rows of data and 7 different attributes.

CDC Chronic Disease Indicators (Cancer)

Year start and end - allows us to see the time period during which the cancer diagnoses occurred, though the data collection period is limited to the years 2010-2016.

Racial/Ethnic Stratification - allows us to analyze differences in the frequency of certain cancer diagnoses for people of different races/ethnicities. Though this data may help identify populations that are more vulnerable to cancer diagnoses, data for this attribute does not exist until 2012. Additionally, some values for this category, including “multi-racial,” can be quite ambiguous.

Question - allows us to see the type of cancer, for example, lung, breast, colon, prostate, and more, as well as the gender and age bracket of the subject. A limitation of this attribute is that it excludes procedures that test for certain cancers, so it is not a comprehensive analysis of every cancer.

Confidence Limit (upper and lower) - allows us to see the certainty of the value range given for the data. This data is relatively clean. Some values are missing in these columns, though the data value, which shows the overall rate of cancer prevalence, is not affected by the lack of confidence limits.

Geolocation and State - allows us to see the state in which a particular subject’s data was collected. This is very helpful for seeing which areas have the highest rates of cancer diagnoses, and also allows us to incorporate EPA data to see the prevalence of toxins in these same states. Issues: none; this attribute did not require cleaning.

Data Value Type - allows us to see rates of cancer prevalence according to different scales; examples: average annual age-adjusted rate, average annual crude rate, and average annual number. Since this data is aggregated, it is very helpful in seeing the overall rates of cancer data by state.

The original, uncleaned dataset we collected contained 14,101 rows of data and 27 different attributes.

CDC United States Cancer Statistics

Year - gives the year in which the diagnosis was recorded, with the earliest year being 1999 and the most recent data coming from 2016. Issues: None

Cancer Type - indicates which cancer the subject was diagnosed with. Issues: None

Age-Adjusted Rate - the rate of cancer incidence, adjusted for the age distribution of the population. Issues: some missing values, but otherwise perfect data.

Lower and Upper Confidence Intervals - the lower and upper bounds of the confidence interval for the rate of cancer occurrences within the population. Issues: some missing values, but otherwise perfect data.

Case Count - the total number of cancer cases in a state during the given year. Issues: some missing values, but otherwise perfect data.

Population - the population of the state in a given year. Issues: some missing values, but otherwise perfect data.
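As a rough illustration of how Case Count and Population relate to a rate, the sketch below computes a crude (not age-adjusted) rate per 100,000 people. The function name and the figures are made up for illustration; the age-adjusted rate in the dataset additionally weights by a standard age distribution, which is not reproduced here.

```python
def crude_rate_per_100k(case_count, population):
    """Crude incidence rate: cases per 100,000 people.

    Returns None when either input is missing, mirroring the
    missing values noted for these columns.
    """
    if case_count is None or population is None or population == 0:
        return None
    return case_count * 100_000 / population


# Made-up figures for illustration, not values from the dataset.
rate = crude_rate_per_100k(25_000, 5_000_000)
missing = crude_rate_per_100k(None, 5_000_000)
```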

The original, uncleaned dataset we collected contained 918 rows of data and 8 different attributes.

Environmental Protection Agency Toxic Release Inventory

State – gives the state abbreviation, which allows us to locate where the pollution occurred. Issues: None

Category (i.e., Air Stack, Waste Treatment, Water) – allows us to determine what kind of pollution we are looking at.  Issues: not all categories are very clear or comparable

Sum/Avg/Min/Max Release Estimates – give numerical values for how much waste or what amount of chemicals was released, which allows us to gauge the severity of the pollution problem. Issues: some rows have the same value across all four estimate columns (probably because only one report was made). It is unclear whether this is a problem, since we may only look at the average estimate rather than all four values.

Chemical Name – if we choose to do analyses based on specific chemicals, we have that option (maybe cancer shows a greater correlation with some chemicals over others). Issues: None

Carcinogen – Y/N value, tells us whether or not the chemical is considered a carcinogen. Issues: None

Clean Air – Y/N value, tells us whether or not the chemical is monitored by the Federal Clean Air Act. Issues: None

The original, uncleaned dataset we collected contained 1,149,501 rows of data and 17 different attributes.

Merged Dataset (EPA and CDC USCS)

We combined attributes from the EPA toxics release dataset and the CDC USCS dataset to directly compare the rates of cancer and toxics release for each US state during the 18-year timeframe for which we were able to obtain both cancer and pollution data.

This new merged dataset has one row of data for each state and year from 1999-2016. It contains multiple attributes from the two datasets, including the rates of toxics release categorized by the source of the pollution, the age-adjusted cancer rate for the corresponding state and year, and a new attribute holding the sum of the toxics release across all the types of pollution that were recorded.
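The structure of the merged dataset can be sketched as follows, in plain Python with no external libraries. The field names (`state`, `year`, `category`, `avg_release`, `age_adjusted_rate`, `total_release`) and all values are assumptions made for illustration, not the actual column names or data.

```python
# Hypothetical, simplified rows; field names and values are illustrative only.
epa_rows = [
    {"state": "OH", "year": 2010, "category": "Air Stack", "avg_release": 120.0},
    {"state": "OH", "year": 2010, "category": "Water", "avg_release": 30.0},
    {"state": "TX", "year": 2010, "category": "Air Stack", "avg_release": 200.0},
]
cancer_rows = [
    {"state": "OH", "year": 2010, "age_adjusted_rate": 460.5},
    {"state": "TX", "year": 2010, "age_adjusted_rate": 410.2},
]

# Pivot the EPA rows so each (state, year) key holds one release value per category.
merged = {}
for row in epa_rows:
    record = merged.setdefault((row["state"], row["year"]), {})
    record[row["category"]] = row["avg_release"]

# Add the new attribute: total release across all recorded categories.
# This runs before any non-category fields are added to the records.
for record in merged.values():
    record["total_release"] = sum(record.values())

# Attach the cancer rate for the matching state and year (None if there is no match).
rates = {(r["state"], r["year"]): r["age_adjusted_rate"] for r in cancer_rows}
for key, record in merged.items():
    record["age_adjusted_rate"] = rates.get(key)
```

Keying both sources on the (state, year) pair is what yields one row per state per year.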

The uncleaned merged dataset contained 1,008 rows of data and 24 different attributes.

After dropping US territories (we only focused on the 50 states + DC anyway) and rows with null/empty values for cancer incidence or average chemical release total, we ended up with 909 rows of data in the cleaned merged dataset, which was used for much of the analysis (~5% real loss).
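A minimal sketch of this cleaning step, in plain Python, might look like the following. The field names and sample rows are hypothetical, and the function name `clean` is ours for illustration.

```python
# The 50 states plus DC; any other abbreviation (e.g. PR, GU) is treated as a territory here.
STATES_AND_DC = set(
    """AL AK AZ AR CA CO CT DE DC FL GA HI ID IL IN IA KS KY LA ME
    MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI
    SC SD TN TX UT VT VA WA WV WI WY""".split()
)


def clean(rows):
    """Keep rows for the 50 states + DC that have both a cancer rate and a release total."""
    kept = []
    for row in rows:
        if row["state"] not in STATES_AND_DC:
            continue  # drop US territories
        if row.get("age_adjusted_rate") is None or row.get("total_release") is None:
            continue  # drop rows with missing values
        kept.append(row)
    return kept


# Made-up rows for illustration.
rows = [
    {"state": "OH", "year": 2010, "age_adjusted_rate": 460.5, "total_release": 150.0},
    {"state": "PR", "year": 2010, "age_adjusted_rate": 350.0, "total_release": 40.0},
    {"state": "TX", "year": 2010, "age_adjusted_rate": None, "total_release": 200.0},
]
cleaned = clean(rows)
```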
