We worked with multiple datasets to analyze possible links between drug overdoses and other factors such as opioid prescriptions and income within the U.S. from 2015-2020. Our first dataset is The Provisional Drug Overdose Death Counts from the Center for Disease Control (CDC). It contains the specific types of overdoses and total counts in every state for the U.S. We plan on comparing this dataset to the Medicare Prescribing Rates Dataset. This dataset contains information sorted by state that describes a multitude of fatures for how states have prescribed opioids for the years we are interested in. We want to investigate if there is a correlation between the number of drug overdoses and the rate at which these drugs are prescribed. The data will help answer this question as it is sorted regionally, so we can study trends in specific places. Additionally, we loaded in a state average income dataset. We want to use this dataset to see if this economic factor has some correlation with drug overdose levels in states. Finally, we brought in a state population dataset from from the U.S. Census website. This enables us to normalize the different datasets we have, and therefore we can compare the trends between states accurately.
We aim to be able to be able to create a model that predicts the rate of drug overdose deaths in individual states within the United States.
Importing necessary libraries for project.
import matplotlib.pyplot as py
import numpy as np
import pandas as pd
import seaborn as sns
We first load in the data for drug overdose deaths from the CDC website. The data is displayed in the table below. It contains the number of deaths for a class of drugs for each state and year/month combination. The year and month columns will prove especially helpful, as we can examine the date in a much more detailed manner than if we just had the year. This will help to answer our question of whether there is a causation/correlation between drug related deaths and prescriptions, as we can easily visualize trends over time in a detailed manner. We filtered the orginal dataset to remove extraneous columns and keep our data tidy. Additionally, we removed NaN values from the table.
#Reads in datafile containg data on different drug deaths for every state 2015-2022
drug_deaths = pd.read_csv("https://data.cdc.gov/api/views/xkb8-kh2a/rows.csv?accessType=DOWNLOAD&bom=true&format=true")
#Gets rid of commas in data values column and converts object column to float
drug_deaths["Data Value"]= drug_deaths["Data Value"].str.replace(',','').astype(float)
#Filters dataframe for only select columns we want and renames Data Values column to Deaths
drug_deaths= drug_deaths[["State Name","Year","Month","Indicator","Data Value"]].loc[drug_deaths["Data Value"].isnull()==False]
drug_deaths.columns = drug_deaths.columns.str.replace('Data Value', 'Deaths')
display(drug_deaths)
display(drug_deaths.dtypes)
Below we filter the table further to obtain the total number of drug overdose deaths and the number of drug overdose deaths caused by opioids.
#Filters drug deaths dataframe to only contain values for total drug overdose deaths
drug_deaths_total = drug_deaths.loc[drug_deaths["Indicator"]=="Number of Drug Overdose Deaths"]
drug_deaths_total.columns = drug_deaths_total.columns.str.replace("Deaths","Total Drug Deaths")
#Filters drug deaths dataframe to only contain values for opioid drug overdose deaths
drug_deaths_opioids = drug_deaths.loc[drug_deaths["Indicator"]=="Natural, semi-synthetic, & synthetic opioids, incl. methadone (T40.2-T40.4)"]
drug_deaths_opioids.columns = drug_deaths_opioids.columns.str.replace("Deaths","Opioid Deaths")
Next, we read in the Medicare Opioid Prescribing Rates for the years 2013-2020. This dataset contains the opioid prescription rates and total number of opioids dispensed for a state in a year. The data is collected by the Centers for Medicare & Medicaid Services, and it is based on claims data for Medicare beneficiaries. Unlike the previous drug overdose dataset, we do not receive a months column because this data is updated annually, so it will be a bit more difficult to visualize this data trend over small time intervals. We initially filter the table to only contain relevant data columns, which from this dataset was the Opioid Prescribing Rate and the Total Number of Opioid Claims.
#Reads in datafile containg opioid prescription data
opiod_presc = pd.read_csv("https://raw.githubusercontent.com/tcutro1/final-project/main/PrescriptionRates.csv")
opiod_presc = opiod_presc.loc[opiod_presc['Plan_Type']=="All"]
#Filters dataframe for only select columns we want and renames Geo_Desc column to State Name
opiod_presc = opiod_presc[["Year","Geo_Desc","Tot_Opioid_Clms","Opioid_Prscrbng_Rate"]].loc[opiod_presc["Opioid_Prscrbng_Rate"].isnull()==False]
opiod_presc.columns = opiod_presc.columns.str.replace('Geo_Desc', 'State Name')
display(opiod_presc)
display(opiod_presc.dtypes)
The data loaded in below contains every states average income over a certain amount of years. We will be using this data to test for any trends that income has with drug overdoses. First, we load in the data and then rename the columns to be the value of the first row, which contains the years for each column.
Source: https://nces.ed.gov/programs/digest/d20/tables/dt20_102.30.asp
income = pd.read_csv("https://raw.githubusercontent.com/tcutro1/final-project/main/Income%20Data.csv")
income.columns = income.iloc[1,:]
income = income.iloc[3:55,0:18]
Next, we select the certain columns we want, melt the table to make it easier to combine with our other data, and then rename the columns.
income = income[["State","2015","2016","2017","2018","2019"]]
income = pd.melt(income, id_vars="State")
income.columns = ["State Name","Year","Income"]
Now we reformat the income column, so it can be converted to an int. We also had to fix the formatting of the State's column because a few of them had extra spaces.
income["Year"]= pd.to_numeric(income["Year"])
income["Income"]= income["Income"].str.replace(',','')
income["Income"]= income["Income"].str.replace('$','').astype(float)
income["State Name"] = income["State Name"].apply(lambda x: x.strip())
display(income)
display(income.dtypes)
The dataset below was retrieved from the US Census Data. It provides a breakdown of state population by year. First the dataset is read into a dataframe from the URL below.
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/totals/nst-est2020-popchg2010-2020.csv'
population_df = pd.read_csv(url)
Filtering the dataframe down to the columns desired, which are the state names and population count for each year. Dropping extraneous columns
population_columns = ['NAME', 'POPESTIMATE2010','POPESTIMATE2011', 'POPESTIMATE2012',
'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015',
'POPESTIMATE2016', 'POPESTIMATE2017', 'POPESTIMATE2018',
'POPESTIMATE2019', 'POPESTIMATE2020']
population_df = population_df[population_columns]
population_df = population_df.drop(population_df.index[1:5])
Setting the index for future melting. Removing the "popestimate" from each column so that the years are easily accessible and can be merged into the full dataframe.
population_df.set_index('NAME')
population_df.columns = population_df.columns.str.replace("POPESTIMATE", "")
population_df.head(10)
Melting the dataframe so that it is tidy and not too wide. Each row now represents a state's popuoation for a year ranging from 2010-2020. Setting the Year datatype to numeric.
population_melt = pd.melt(population_df, id_vars=['NAME'])
population_melt.rename(columns={'NAME': 'State Name', 'variable': 'Year', 'value':'Population'}, inplace=True)
population_melt["Year"]= pd.to_numeric(population_melt["Year"])
display(population_melt)
display(population_melt.dtypes)
We now combine the different datasets we have into one dataframe. This will allow us to model the data and do more effective anlysis in the following EDA section. We begin by using the groupby() function to get each dataset in the same form so that we can merge them together using the state name and year. We do this for the overdose deaths, opioid prescriptions, incomes, and population datasets.
#Summarizes and groups both dataframes by State Name and year
deaths_by_year = drug_deaths_total.groupby(["State Name","Year"]).mean()
opioids_deaths_by_year = drug_deaths_opioids.groupby(["State Name","Year"]).mean()
prescr_by_year = opiod_presc.groupby(["State Name","Year"]).mean()
income_by_year = income.groupby(["State Name","Year"]).mean()
population_by_year = population_melt.groupby(["State Name","Year"]).mean()
We then merge the tables into each other one by one. We do this on the state name and year using an inner join.
prescr_and_deaths = deaths_by_year.merge(prescr_by_year, on = ["State Name","Year"],how = "inner")
prescr_and_two_deaths = prescr_and_deaths.merge(opioids_deaths_by_year,on = ["State Name","Year"],how = "inner" )
presc_deaths_income = prescr_and_two_deaths.merge(income_by_year, on= ["State Name","Year"],how = "inner")
presc_deaths_income_pop = presc_deaths_income.merge(population_by_year, on= ["State Name","Year"],how = "inner")
Below we create some new data columns that normalize the deaths and total opioid claims by dividing them with the population of each state for that year.
presc_deaths_income_pop["Opiod Claim Rate"] = presc_deaths_income_pop["Tot_Opioid_Clms"]/presc_deaths_income_pop['Population']
presc_deaths_income_pop["Death Rate"] = (presc_deaths_income_pop["Total Drug Deaths"]/presc_deaths_income_pop['Population'])*10000
display(presc_deaths_income_pop)
#Group data by State and combine datasets to get Average Deaths and Average Opiod Prescriptions in one dataframe
avg_state_deaths = drug_deaths_total.groupby(["State Name"])[["Total Drug Deaths"]].mean()
avg_state_deaths_opiods = drug_deaths_opioids.groupby(["State Name"])[["Opioid Deaths"]].mean()
avg_state_presc = opiod_presc.groupby(["State Name"])[['Tot_Opioid_Clms','Opioid_Prscrbng_Rate']].mean()
avg_state_income = income.groupby(["State Name"])[["Income"]].mean()
avg_state_population = population_melt.groupby(["State Name"])[["Population"]].mean()
avg_state = avg_state_deaths.merge(avg_state_presc, on = "State Name", how = "inner" )
avg_state = pd.merge(avg_state,avg_state_deaths_opiods,on = "State Name", how = "inner")
avg_state = pd.merge(avg_state,avg_state_income,on = "State Name", how = "inner")
avg_state = pd.merge(avg_state, avg_state_population, on = "State Name", how = "inner")
avg_state['Opioid_Clms_Rate'] = avg_state['Tot_Opioid_Clms']/avg_state['Population']
avg_state['Deaths_Rate'] = (avg_state["Total Drug Deaths"]/avg_state['Population'])
avg_state["Opioid_Deaths_Rate"] = (avg_state["Opioid Deaths"]/avg_state['Population'])
avg_state
Below is a plot of the Drug Deaths per 10,000 people in the United States from 2015-2020. Visualizes the increase in drug overdose deaths over the past five years.
py.plot((normalized_deaths.loc["United States"]["Total Drug Deaths"]/normalized_deaths.loc["United States"]["Population"])*10000)
py.title("Drug Deaths per 10,000 People in United States From 2015-2020")
py.xlabel("Year")
py.ylabel("Death Rate per 10,000 People")
To get an inital look at the issue we aim to explore, we graphed the number of drug overdose deaths from 2015 to 2022 for DC, Louisiana, and West Virginia. In order to do this, we grouped the total drug deaths dataframe by state name and year. We plotted the non-normalized graph first to show just how big the issue is, and it is only getting worse. We hope to evince the necessity to take action in certain states based on the high overdose death rates, and other factors that correlate with this.
drug_deaths_summary= drug_deaths_total.groupby(["State Name","Year"]).mean()
state_names = ["District of Columbia", "Louisiana", "West Virginia"]
for i in state_names:
py.plot(drug_deaths_summary.loc[i])
py.title("Drug Deaths From 2015-2022 For Select States")
py.xlabel("Year")
py.ylabel("Number Of Deaths")
We then graph the drug overdose deaths per person in DC, Louisiana, and West Virginia from 2015-2020. From this graph, you can tell that many of the states with the highest drug overdose death rates over these years are still trending upward. We plan on focusing on the states with the highest overdose death rates, and look at factors that could be playing a major role in these increases.
normalized_deaths = pd.merge(drug_deaths_summary, population_by_year, on = ["State Name","Year"], how = "inner")
for i in state_names:
py.plot(normalized_deaths.loc[i]["Total Drug Deaths"]/normalized_deaths.loc[i]["Population"])
py.title("Normalized Drug Deaths based on States Population From 2015-2020")
py.xlabel("Year")
py.ylabel("Death Rate per Person")
Below is a scatterplot of the Total Drug Overdose Deaths vs Total Opioid Claims. We plotted this data to see if there was any correlation present between the two variables. Our results manifested the need to normalize our data based on populations. It is clear that states with very large or very small populations will appear as the outliers on this plot.
#Scatter Plot comparing Average Deaths and Average Opiod Prescriptions for each state
py.figure(figsize=(10,5))
py.scatter(avg_state['Tot_Opioid_Clms']**.1,avg_state["Total Drug Deaths"])
py.title('Total Drug Overdose Deaths vs Total Opioid Claims in the U.S.')
py.xlabel("Total Opiod Claims")
py.ylabel("Drug Overdose Deaths")
The first two entries shown below, Florida and California, represent the states with the most drug overdose deaths and the most total opioid claims respectively. This proves interesting once you look at the state with the least overdose deaths, South Dakota, and the state with the least total opioid claims, Wyoming. While the opioid prescription rates between the different states are similar, the total number of deaths and total opioid claims differs by thousands between the entries.
#state with highest amount of opiod overdoses
display(avg_state.loc[avg_state["Total Drug Deaths"]== avg_state["Total Drug Deaths"].max()])
#state with highest amount of opiod prescriptions
display(avg_state.loc[avg_state["Tot_Opioid_Clms"]== avg_state["Tot_Opioid_Clms"].max()])
#state with lowest amount of opiod overdoses
display(avg_state.loc[avg_state["Total Drug Deaths"]== avg_state["Total Drug Deaths"].min()])
#state with lowest amount of opiod prescriptions
display(avg_state.loc[avg_state["Tot_Opioid_Clms"]== avg_state["Tot_Opioid_Clms"].min()])
Even though California stands out as having the most opiod overdoses and opiod claims, this is primarily due to California's high population. Based on these results, it is clear that we need to normalize this data using the population of each state. Thus, we can visualize trends in our data.
Below is the normalized scatterplot, which displayed the Drug Overdose Death Rate (deaths/population). We added a best fit line to show the data trend that seems to manifest itself on this graph.
#Scatter Plot comparing Average Deaths and Average Opiod Prescriptions for each state
py.figure(figsize=(10,5))
py.scatter(avg_state['Opioid_Clms_Rate'], avg_state["Deaths_Rate"])
#Line of Best Fit
a, b = np.polyfit(avg_state['Opioid_Clms_Rate'], avg_state["Deaths_Rate"], 1)
py.plot(avg_state['Opioid_Clms_Rate'], a*avg_state['Opioid_Clms_Rate']+b)
#Annotating Outliers that are shown below
py.annotate("West Virginia", (avg_state.loc["West Virginia"]["Opioid_Clms_Rate"]-.015, avg_state.loc["West Virginia"]["Deaths_Rate"]-.00003))
py.annotate("South Dakota", (avg_state.loc["South Dakota"]["Opioid_Clms_Rate"]+.002, avg_state.loc["South Dakota"]["Deaths_Rate"]))
py.annotate("District of Columbia", (avg_state.loc["District of Columbia"]["Opioid_Clms_Rate"]+.002, avg_state.loc["District of Columbia"]["Deaths_Rate"]))
py.title('Drug Overdose Deaths Rate vs Opioid Claims Rate')
py.ylabel("Drug Overdose Deaths Rate")
py.xlabel('Opioid Claims Rate')
py.show()
From this graph, it appaears there is a correlation between the rate at which opioids are being claimed and used as medication in a state, and the number of overdose deaths that these states experience. Below we extrapolate a few of the extreme cases to get a better idea of the data we are dealing with. In an upcoming section we will use a correlation matrix to see if we can corroborate these results.
#state with highest rate of opiod overdoses
display(avg_state.loc[avg_state["Deaths_Rate"]== avg_state["Deaths_Rate"].max()])
#state with highest rate of opiod prescriptions
display(avg_state.loc[avg_state["Opioid_Clms_Rate"]== avg_state["Opioid_Clms_Rate"].max()])
#state with lowest rate of opiod overdoses
display(avg_state.loc[avg_state["Deaths_Rate"]== avg_state["Deaths_Rate"].min()])
#state with lowest rate of opiod prescriptions
display(avg_state.loc[avg_state["Opioid_Clms_Rate"]== avg_state["Opioid_Clms_Rate"].min()])
West Virginia immediately stands out with the highest rate of drug overdoses and opioid prescriptions of any state. West Virginia is amidst an opioid epidemic worse than they have ever seen. The high rate of drug usage and overdoses there have created a situation where employees are sparse and children struggle to grow up with parents or friends unaffected by the crisis. More information can be found from the DEA at this link.
To further investigate if there was correlations present within the data, specifically whether any features had a correlation with the drug overdose death rate, we took a look at a correlation matrix.
First, we looked at the correlation matrix for our combined table which has the average data for each state from 2015-2020.
py.figure(figsize=(12,10))
cor = avg_state.corr()
sns.heatmap(cor, annot=True, cmap=py.cm.Reds)
py.show()
As expected, the population of a state has a strong correlation with the total drug overdose deaths and the total opioid claims. Of significant interest to us was the correlation value of 0.48 between the overdose death rate and the opioid claims rate. We graphed this relationship in Section B, and noticed a potentially linear relationship. The correlation value further supports our findings.
py.figure(figsize=(12,10))
cor = presc_deaths_income_pop.corr()
sns.heatmap(cor, annot=True, cmap=py.cm.Reds)
py.show()
This correlation matrix does not provide as interesting of results. When the data is not grouped together for each state the results do not seem to be as significant.
After reviewing the correlation matrix, we wanted to visualize the correlation between income and opioid claim rates per state. Although it is slight, the graph below shows correlation between the two variables, and as average income increases opioid prescription rate decreases. I believe the average income across an entire state is not the best representation of economic factors for that region, so other economic data would be needed to further study this.
py.scatter(avg_state['Income'], avg_state["Opioid_Clms_Rate"]**2.8)
py.title("State's Average Income vs Average Opioid Claims Rate")
py.xlabel("Income")
py.ylabel("Opioid Claim Rate")
py.annotate("West Virginia", (avg_state.loc["West Virginia"]["Income"]+1000, avg_state.loc["West Virginia"]["Opioid_Clms_Rate"]**2.8-.0002))
py.show()
Our model predicts the drug overdose rate in a state for a given year, based on the features given and data from different years. We identified mutliple states that have climbing overdose death totals, so being able to predict the increase or decrease a state will experience would prove helpful in stopping the overdose problem throughout our country and helping to better prepare different areas of the country. Our independent variables are the state name, opioid prescription rate, income, population, and opioid claim rate. Our dependent variable is the number of drug overdose death rate (deaths per 10,000 people). We will test whether this model is correct by comparing our predicted values for the drug overdose deaths per year to the actual values contained in our datasets. This will be using an 80/20 train and test split for a Linear Regression model
Below is a spread of all the variables contained in our current combined dataset. For the model, we will be dropping some of these features which contain repeat information or have an extreme skew.
presc_deaths_income_pop.hist(bins=50, figsize=(20,15))
py.show()
We drop the total opioid claims because we created a normalized rate which is the feature "Opioid Claim Rate". We also drop the "Opioid Deaths" column, which gets into some assumptions we made about the model. We assumed that with information on opioids and the rates at which they are prescribed and claimed, we can predict the drug overdose death rate per 10,000 people. This generalizes opioid information to encompass drug overdoses, but there are a multtitude of other factors that influence this death rate year to year. Another possible issue could stem from people overdosing in a state they are not a resident in, yet their overdose death would count towards that specific state's total. A possible future step would be to examine and account for all the different biases present, something we tried to do but it is not perfect. Additionally, we assumed grouping by state could yield significant results, but we definitely think having data by county and parish would be extremely beneficial. Many states are large and have a huge variation in different prescribing rates, income, and deaths even by county, so being able to examine the regions more specifically would be beneficial.
model_df = presc_deaths_income_pop.drop(["Tot_Opioid_Clms", "Opioid Deaths", "Total Drug Deaths"], axis = 1)
This is the dataframe that will be used for the model. The Death Rate is the target variable we are trying to predict with a Linear Regression model.
model_df
First, the data is split into an 80% training set and 20% test set for model evaluation.
from sklearn.model_selection import train_test_split
X = model_df.drop(["Death Rate"], axis = 1)
y = model_df["Death Rate"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=0)
We fit a linear regression model to the training data. This is done using sklearns Linear Regression function.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
We use the mean absolute error to evaluate our model's ability to predict the death rate. This gives us the distance our prediction was from the true value.
#The mean absolute error of training set
from sklearn.metrics import mean_absolute_error
y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_val)
train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_val, y_test_pred)
print('Train MAE: ' + str(train_mae))
print('Test MAE: ' + str(test_mae))
The training set mean absolute error was 0.536. This can be interpreted as the predicted drug overdose death rate being within 0.536 of the actual death rate per 10,0000 people. The test MAE was close, at 0.613.
While our model did a decent job of predicting the death rate, bringing in more features and a larger dataset would increase the accuracy of the model. We believe that with improvement we could help states predict trends for their future drug overdose deaths to stop a problem before it gets worse. If we were able to predict the states with the greatest number of opioid deaths, then our model could be used to target regions that need more drug education funding, narcan supplies, and other resources in hopes of lowering the predicted death number to save lives.