Analysis of Opioid Drug Overdoses and Prescriptions in the U.S. from 2015 - 2020

Thomas Cutro

Joeseph Wagner

Website

Project Plan

We worked with multiple datasets to analyze possible links between drug overdoses and other factors, such as opioid prescriptions and income, within the U.S. from 2015-2020. Our first dataset is the Provisional Drug Overdose Death Counts from the Centers for Disease Control and Prevention (CDC). It contains the specific types of overdoses and total counts for every state in the U.S. We plan on comparing this dataset to the Medicare Prescribing Rates dataset, which contains state-level information describing a multitude of features for how states prescribed opioids during the years we are interested in. We want to investigate whether there is a correlation between the number of drug overdoses and the rate at which these drugs are prescribed. The data will help answer this question because it is sorted regionally, so we can study trends in specific places. Additionally, we loaded in a state average income dataset to see whether this economic factor correlates with drug overdose levels in states. Finally, we brought in a state population dataset from the U.S. Census website, which enables us to normalize our datasets so we can compare trends between states accurately.

Goals

We aim to create a model that predicts the rate of drug overdose deaths in individual states within the United States.

Setup

Importing the necessary libraries for the project.

In [26]:
import matplotlib.pyplot as py
import numpy as np
import pandas as pd 
import seaborn as sns

ETL

Loading in Drug Overdose Deaths Data from CDC

We first load in the data for drug overdose deaths from the CDC website. The data is displayed in the table below. It contains the number of deaths for each class of drugs for every state and year/month combination. The year and month columns will prove especially helpful, as we can examine the dates in much more detail than if we just had the year. This will help answer our question of whether there is a causal link or correlation between drug-related deaths and prescriptions, as we can easily visualize trends over time in a detailed manner. We filtered the original dataset to remove extraneous columns and keep our data tidy. Additionally, we removed NaN values from the table.

In [27]:
#Reads in data file containing data on different drug deaths for every state, 2015-2022
drug_deaths = pd.read_csv("https://data.cdc.gov/api/views/xkb8-kh2a/rows.csv?accessType=DOWNLOAD&bom=true&format=true")
#Gets rid of commas in data values column and converts object column to float
drug_deaths["Data Value"]= drug_deaths["Data Value"].str.replace(',','').astype(float)
In [28]:
#Filters dataframe for only select columns we want and renames Data Values column to Deaths
drug_deaths= drug_deaths[["State Name","Year","Month","Indicator","Data Value"]].loc[drug_deaths["Data Value"].isnull()==False]
drug_deaths.columns = drug_deaths.columns.str.replace('Data Value', 'Deaths')
display(drug_deaths)
display(drug_deaths.dtypes)
State Name Year Month Indicator Deaths
6 Alaska 2015 April Number of Deaths 4133.000000
9 Alaska 2015 April Percent with drugs specified 88.095238
11 Alaska 2015 April Number of Drug Overdose Deaths 126.000000
14 Alaska 2015 August Percent with drugs specified 87.903226
20 Alaska 2015 August Number of Deaths 4222.000000
... ... ... ... ... ...
53776 New York City 2022 May Percent with drugs specified 99.544499
53777 New York City 2022 May Natural & semi-synthetic opioids, incl. methad... 710.000000
53778 New York City 2022 May Number of Drug Overdose Deaths 2854.000000
53779 New York City 2022 May Methadone (T40.3) 333.000000
53780 New York City 2022 May Synthetic opioids, excl. methadone (T40.4) 2255.000000

43761 rows × 5 columns

State Name     object
Year            int64
Month          object
Indicator      object
Deaths        float64
dtype: object

Below we filter the table further to obtain the total number of drug overdose deaths and the number of drug overdose deaths caused by opioids.

In [29]:
#Filters drug deaths dataframe to only contain values for total drug overdose deaths
drug_deaths_total = drug_deaths.loc[drug_deaths["Indicator"]=="Number of Drug Overdose Deaths"]
drug_deaths_total.columns = drug_deaths_total.columns.str.replace("Deaths","Total Drug Deaths")
#Filters drug deaths dataframe to only contain values for opioid drug overdose deaths
drug_deaths_opioids = drug_deaths.loc[drug_deaths["Indicator"]=="Natural, semi-synthetic, & synthetic opioids, incl. methadone (T40.2-T40.4)"]
drug_deaths_opioids.columns = drug_deaths_opioids.columns.str.replace("Deaths","Opioid Deaths")

Loading in Medicare Opioid Prescribing Rates

Next, we read in the Medicare opioid prescribing rates for the years 2013-2020. This dataset contains the opioid prescribing rate and the total number of opioid claims for each state in a year. The data is collected by the Centers for Medicare & Medicaid Services and is based on claims data for Medicare beneficiaries. Unlike the drug overdose dataset, there is no month column because this data is updated annually, so it will be harder to visualize trends over small time intervals. We initially filter the table to contain only the relevant columns: the opioid prescribing rate and the total number of opioid claims.

In [30]:
#Reads in data file containing opioid prescription data
opiod_presc = pd.read_csv("https://raw.githubusercontent.com/tcutro1/final-project/main/PrescriptionRates.csv")
opiod_presc = opiod_presc.loc[opiod_presc['Plan_Type']=="All"]
In [31]:
#Filters dataframe for only select columns we want and renames Geo_Desc column to State Name
opiod_presc = opiod_presc[["Year","Geo_Desc","Tot_Opioid_Clms","Opioid_Prscrbng_Rate"]].loc[opiod_presc["Opioid_Prscrbng_Rate"].isnull()==False]
opiod_presc.columns = opiod_presc.columns.str.replace('Geo_Desc', 'State Name')
display(opiod_presc)
display(opiod_presc.dtypes)
Year State Name Tot_Opioid_Clms Opioid_Prscrbng_Rate
0 2020 National 20144712.0 3.21
3 2020 Alabama 182840.0 2.71
6 2020 Alaska 57160.0 4.29
9 2020 Arizona 496944.0 3.68
12 2020 Arkansas 183044.0 3.88
... ... ... ... ...
1233 2013 Virginia 730742.0 6.95
1236 2013 Washington 620072.0 8.82
1239 2013 West Virginia 401118.0 6.92
1242 2013 Wisconsin 899000.0 8.85
1245 2013 Wyoming 41202.0 7.78

416 rows × 4 columns

Year                      int64
State Name               object
Tot_Opioid_Clms         float64
Opioid_Prscrbng_Rate    float64
dtype: object

Loading in US Income Data

The data loaded in below contains each state's average income over a number of years. We will use this data to test for trends between income and drug overdoses. First, we load in the data and then rename the columns using the header row that contains the years.

Source: https://nces.ed.gov/programs/digest/d20/tables/dt20_102.30.asp

In [32]:
income = pd.read_csv("https://raw.githubusercontent.com/tcutro1/final-project/main/Income%20Data.csv")
income.columns = income.iloc[1,:]
income = income.iloc[3:55,0:18]

Next, we select the columns we want, melt the table to make it easier to combine with our other data, and then rename the columns.

In [33]:
income = income[["State","2015","2016","2017","2018","2019"]]
income = pd.melt(income, id_vars="State")
income.columns = ["State Name","Year","Income"]

Now we reformat the Income column so it can be converted to a numeric type. We also had to fix the formatting of the State column because a few entries had extra spaces.

In [34]:
income["Year"]= pd.to_numeric(income["Year"])
income["Income"]= income["Income"].str.replace(',','', regex=False)
income["Income"]= income["Income"].str.replace('$','', regex=False).astype(float)
income["State Name"] = income["State Name"].apply(lambda x: x.strip())
display(income)
display(income.dtypes)
State Name Year Income
0 United States 2015 60200.0
1 Alabama 2015 48300.0
2 Alaska 2015 79200.0
3 Arizona 2015 55600.0
4 Arkansas 2015 45300.0
... ... ... ...
255 Virginia 2019 76500.0
256 Washington 2019 78700.0
257 West Virginia 2019 48900.0
258 Wisconsin 2019 64200.0
259 Wyoming 2019 65000.0

260 rows × 3 columns

State Name     object
Year            int64
Income        float64
dtype: object

Loading in State Populations by Year for Normalization

The dataset below was retrieved from the U.S. Census Bureau. It provides a breakdown of state population by year. First the dataset is read into a dataframe from the URL below.

Source: https://www.census.gov/programs-surveys/popest/technical-documentation/research/evaluation-estimates/2020-evaluation-estimates/2010s-state-total.html

In [35]:
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/totals/nst-est2020-popchg2010-2020.csv'
population_df = pd.read_csv(url)

Filtering the dataframe down to the desired columns: the state names and the population count for each year. Extraneous columns are dropped.

In [36]:
population_columns = ['NAME', 'POPESTIMATE2010','POPESTIMATE2011', 'POPESTIMATE2012',
       'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015',
       'POPESTIMATE2016', 'POPESTIMATE2017', 'POPESTIMATE2018',
       'POPESTIMATE2019', 'POPESTIMATE2020']
population_df = population_df[population_columns]
population_df = population_df.drop(population_df.index[1:5])

Removing the "POPESTIMATE" prefix from each column name so that the years are easily accessible and can be merged into the full dataframe. (We leave "NAME" as a regular column rather than setting it as the index, since pd.melt below uses it as an id variable; the original set_index('NAME') call here was never assigned back and had no effect.)

In [37]:
population_df.columns = population_df.columns.str.replace("POPESTIMATE", "")
population_df.head(10)
Out[37]:
NAME 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 United States 309327143 311583481 313877662 316059947 318386329 320738994 323071755 325122128 326838199 328329953 329484123
5 Alabama 4785514 4799642 4816632 4831586 4843737 4854803 4866824 4877989 4891628 4907965 4921532
6 Alaska 713982 722349 730810 737626 737075 738430 742575 740983 736624 733603 731158
7 Arizona 6407342 6473416 6556344 6634690 6732873 6832810 6944767 7048088 7164228 7291843 7421401
8 Arkansas 2921998 2941038 2952876 2960459 2968759 2979732 2991815 3003855 3012161 3020985 3030522
9 California 37319550 37636311 37944551 38253768 38586706 38904296 39149186 39337785 39437463 39437610 39368078
10 Colorado 5047539 5121900 5193660 5270774 5352637 5454328 5543844 5617421 5697155 5758486 5807719
11 Connecticut 3579173 3588632 3595211 3595792 3595697 3588561 3579830 3575324 3574561 3566022 3557006
12 Delaware 899647 907590 915518 924062 933131 942065 949989 957942 966985 976668 986809
13 District of Columbia 605282 620290 635737 651559 663603 677014 687576 697079 704147 708253 712816

Melting the dataframe so that it is tidy and not too wide. Each row now represents a state's population for a year ranging from 2010-2020. We also set the Year column to a numeric dtype.

In [38]:
population_melt = pd.melt(population_df, id_vars=['NAME'])
population_melt.rename(columns={'NAME': 'State Name', 'variable': 'Year', 'value':'Population'}, inplace=True)
population_melt["Year"]= pd.to_numeric(population_melt["Year"])
display(population_melt)
display(population_melt.dtypes)
State Name Year Population
0 United States 2010 309327143
1 Alabama 2010 4785514
2 Alaska 2010 713982
3 Arizona 2010 6407342
4 Arkansas 2010 2921998
... ... ... ...
578 Washington 2020 7693612
579 West Virginia 2020 1784787
580 Wisconsin 2020 5832655
581 Wyoming 2020 582328
582 Puerto Rico 2020 3159343

583 rows × 3 columns

State Name    object
Year           int64
Population     int64
dtype: object

Combining all Our Data Tables

We now combine the different datasets into one dataframe. This will allow us to model the data and do more effective analysis in the following EDA section. We begin by using the groupby() function to get each dataset into the same form so that we can merge them on state name and year. We do this for the overdose deaths, opioid prescriptions, income, and population datasets.

In [39]:
#Groups each dataframe by State Name and Year and averages the values
deaths_by_year = drug_deaths_total.groupby(["State Name","Year"]).mean()
opioids_deaths_by_year = drug_deaths_opioids.groupby(["State Name","Year"]).mean()
prescr_by_year = opiod_presc.groupby(["State Name","Year"]).mean()
income_by_year = income.groupby(["State Name","Year"]).mean()
population_by_year = population_melt.groupby(["State Name","Year"]).mean()

We then merge the tables into each other one by one. We do this on the state name and year using an inner join.

In [40]:
prescr_and_deaths = deaths_by_year.merge(prescr_by_year, on = ["State Name","Year"],how = "inner")
prescr_and_two_deaths = prescr_and_deaths.merge(opioids_deaths_by_year,on = ["State Name","Year"],how = "inner" )
presc_deaths_income = prescr_and_two_deaths.merge(income_by_year, on= ["State Name","Year"],how = "inner")
presc_deaths_income_pop = presc_deaths_income.merge(population_by_year, on= ["State Name","Year"],how = "inner")
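Chained inner joins silently drop any state/year pair missing from either side, which is part of why the combined table shrinks to 165 rows. As a sketch (on tiny synthetic frames, not the real data), pandas' `indicator=True` option on an outer join can audit what an inner join would discard before committing to it:

```python
import pandas as pd

# Two tiny synthetic frames keyed on (State Name, Year); the year coverage
# deliberately overlaps only partially, as with the real datasets above.
deaths = pd.DataFrame({
    "State Name": ["Alaska", "Alaska", "Wyoming"],
    "Year": [2015, 2016, 2021],
    "Deaths": [122.7, 132.8, 70.0],
})
income = pd.DataFrame({
    "State Name": ["Alaska", "Alaska", "Wyoming"],
    "Year": [2015, 2016, 2019],
    "Income": [79200.0, 81400.0, 65000.0],
})

# An outer join with indicator=True labels each row 'both', 'left_only',
# or 'right_only', showing exactly which keys each side would lose.
audit = deaths.merge(income, on=["State Name", "Year"], how="outer", indicator=True)
print(audit["_merge"].value_counts())

# The inner join keeps only the overlapping (state, year) pairs.
merged = deaths.merge(income, on=["State Name", "Year"], how="inner")
print(len(merged))
```

Running this audit once per merge in the chain above would show where rows are lost (e.g. years outside 2015-2019, or differing state-name spellings).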

Below we create some new data columns that normalize the deaths and total opioid claims by dividing them with the population of each state for that year.

In [41]:
presc_deaths_income_pop["Opiod Claim Rate"] = presc_deaths_income_pop["Tot_Opioid_Clms"]/presc_deaths_income_pop['Population']
presc_deaths_income_pop["Death Rate"] = (presc_deaths_income_pop["Total Drug Deaths"]/presc_deaths_income_pop['Population'])*10000
display(presc_deaths_income_pop)
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opiod Claim Rate Death Rate
State Name Year
Alaska 2015 122.666667 73451.0 7.61 65.000000 79200.0 738430.0 0.099469 1.661182
2016 132.833333 96490.0 7.90 71.333333 81400.0 742575.0 0.129940 1.788820
2017 130.916667 97745.0 6.50 60.750000 76300.0 740983.0 0.131913 1.766797
2018 120.166667 80393.0 5.76 62.916667 75700.0 736624.0 0.109137 1.631316
2019 128.666667 65394.0 4.56 55.166667 75500.0 733603.0 0.089141 1.753900
... ... ... ... ... ... ... ... ... ...
Wisconsin 2018 1122.833333 591217.0 5.40 710.416667 61900.0 5809319.0 0.101770 1.932814
2019 1134.750000 503993.0 4.70 743.916667 64200.0 5824581.0 0.086529 1.948209
Wyoming 2017 72.583333 29216.0 5.85 36.111111 63000.0 579994.0 0.050373 1.251450
2018 63.583333 23396.0 4.94 37.416667 62700.0 579054.0 0.040404 1.098055
2019 66.416667 19614.0 4.43 37.333333 65000.0 580116.0 0.033810 1.144886

165 rows × 8 columns

Getting averages for the states

In [42]:
#Group data by state and combine datasets to get average deaths and average opioid claims in one dataframe
avg_state_deaths = drug_deaths_total.groupby(["State Name"])[["Total Drug Deaths"]].mean()
avg_state_deaths_opiods = drug_deaths_opioids.groupby(["State Name"])[["Opioid Deaths"]].mean()
avg_state_presc = opiod_presc.groupby(["State Name"])[['Tot_Opioid_Clms','Opioid_Prscrbng_Rate']].mean()
avg_state_income = income.groupby(["State Name"])[["Income"]].mean()
avg_state_population = population_melt.groupby(["State Name"])[["Population"]].mean()
avg_state = avg_state_deaths.merge(avg_state_presc, on = "State Name", how = "inner" )
avg_state = pd.merge(avg_state,avg_state_deaths_opiods,on = "State Name", how = "inner")
avg_state = pd.merge(avg_state,avg_state_income,on = "State Name", how = "inner")
avg_state = pd.merge(avg_state, avg_state_population, on = "State Name", how = "inner")
avg_state['Opioid_Clms_Rate'] = avg_state['Tot_Opioid_Clms']/avg_state['Population']
avg_state['Deaths_Rate'] = (avg_state["Total Drug Deaths"]/avg_state['Population'])
avg_state["Opioid_Deaths_Rate"] = (avg_state["Opioid Deaths"]/avg_state['Population'])
avg_state
Out[42]:
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
Alaska 147.274725 77177.000 6.52125 84.580247 77620.0 7.332014e+05 0.105260 0.000201 0.000115
Arizona 1873.978022 951808.125 6.46500 1245.370968 58820.0 6.864346e+06 0.138660 0.000273 0.000181
Arkansas 444.538462 291670.000 5.96125 368.800000 47440.0 2.980382e+06 0.097863 0.000149 0.000124
California 6625.769231 3534421.625 4.71125 4838.243243 74740.0 3.867048e+07 0.091398 0.000171 0.000125
Colorado 1200.538462 619904.875 8.37375 730.340426 72300.0 5.442315e+06 0.113905 0.000221 0.000134
Connecticut 1127.791209 426297.375 4.86000 900.384615 77820.0 3.581437e+06 0.119030 0.000315 0.000251
Delaware 362.417582 131720.625 5.27625 376.528302 66740.0 9.418551e+05 0.139852 0.000385 0.000400
District of Columbia 345.549451 86177.750 4.00625 238.065934 85380.0 6.693960e+05 0.128740 0.000516 0.000356
Florida 5564.417582 1154344.750 4.14250 5336.125000 55620.0 2.026482e+07 0.056963 0.000275 0.000263
Georgia 1598.791209 839710.875 5.32875 926.488095 58560.0 1.020295e+07 0.082301 0.000157 0.000091
Hawaii 220.362637 137171.125 5.75000 56.462687 80920.0 1.407845e+06 0.097433 0.000157 0.000040
Idaho 255.989011 133868.125 5.75375 184.800000 55880.0 1.674511e+06 0.079945 0.000153 0.000110
Illinois 2787.912088 1078913.250 4.77875 1857.617978 66060.0 1.280115e+07 0.084283 0.000218 0.000145
Indiana 1822.164835 789527.250 5.86250 1629.317073 56220.0 6.619022e+06 0.119282 0.000275 0.000246
Iowa 351.230769 282087.250 5.36250 163.764045 60560.0 3.115596e+06 0.090540 0.000113 0.000053
Kansas 405.428571 188696.750 5.48250 269.250000 59380.0 2.898537e+06 0.065101 0.000140 0.000093
Kentucky 1615.714286 601814.375 4.02000 1350.092593 50500.0 4.424540e+06 0.136017 0.000365 0.000305
Maine 398.329670 119402.750 4.81250 328.945055 57260.0 1.334291e+06 0.089488 0.000299 0.000247
Maryland 2203.648352 712545.750 5.52625 1831.065934 84320.0 5.961754e+06 0.119519 0.000370 0.000307
Massachusetts 2139.923077 570301.500 3.83750 1886.000000 80840.0 6.771586e+06 0.084220 0.000316 0.000279
Michigan 2490.208791 1608325.125 6.36250 2130.194444 57120.0 9.937251e+06 0.161848 0.000251 0.000214
Minnesota 820.285714 537136.125 4.62500 909.062500 71180.0 5.489749e+06 0.097843 0.000149 0.000166
Mississippi 415.505495 310771.125 5.63500 332.666667 45000.0 2.983226e+06 0.104173 0.000139 0.000112
Missouri 1578.824176 560065.875 5.00500 1260.918367 55640.0 6.075600e+06 0.092183 0.000260 0.000208
Montana 138.065934 182078.875 8.43750 76.592593 55180.0 1.033568e+06 0.176165 0.000134 0.000074
Nevada 780.252747 410353.250 7.24250 365.813187 59780.0 2.888736e+06 0.142053 0.000270 0.000127
New Hampshire 428.901099 68196.625 4.28750 373.230769 76440.0 1.339741e+06 0.050903 0.000320 0.000279
New Jersey 2425.054945 669843.000 3.86625 2451.160000 82300.0 8.863289e+06 0.075575 0.000274 0.000277
New Mexico 637.032967 296910.500 5.82500 343.230769 49500.0 2.090117e+06 0.142055 0.000305 0.000164
New York 2361.340659 1620774.125 2.59500 1745.384615 68320.0 1.954416e+07 0.082929 0.000121 0.000089
North Carolina 2416.736264 886286.750 5.58125 1733.055556 54520.0 1.006706e+07 0.088038 0.000240 0.000172
Ohio 4431.615385 1757815.125 4.79625 3359.909091 56580.0 1.162010e+07 0.151274 0.000381 0.000289
Oklahoma 759.109890 378104.750 6.55750 322.461538 52860.0 3.887131e+06 0.097271 0.000195 0.000083
Oregon 666.989011 586564.000 6.78875 295.285714 62840.0 4.036383e+06 0.145319 0.000165 0.000073
Rhode Island 352.032967 112548.750 3.78375 277.604396 66120.0 1.056424e+06 0.106537 0.000333 0.000263
South Carolina 1232.417582 329041.625 4.97750 873.406593 53180.0 4.906767e+06 0.067059 0.000251 0.000178
South Dakota 74.285714 41600.125 4.98625 32.434211 58180.0 8.561392e+05 0.048590 0.000087 0.000038
Tennessee 2221.494505 755051.000 5.82750 1570.715909 53120.0 6.609986e+06 0.114229 0.000336 0.000238
Texas 3341.846154 1090523.500 3.52375 1436.045455 61560.0 2.736909e+07 0.039845 0.000122 0.000052
Utah 634.241758 171272.500 6.22500 372.824176 71600.0 3.001934e+06 0.057054 0.000211 0.000124
Vermont 148.813187 83535.875 5.68625 113.967033 61560.0 6.253857e+05 0.133575 0.000238 0.000182
Virginia 1624.296703 971224.125 7.25875 1208.230769 73820.0 8.345389e+06 0.116379 0.000195 0.000145
Washington 1389.945055 832328.375 6.54375 663.725275 73760.0 7.202045e+06 0.115568 0.000193 0.000092
West Virginia 1005.318681 365201.125 4.30625 798.098901 46120.0 1.832192e+06 0.199325 0.000549 0.000436
Wisconsin 1229.516484 748482.250 6.87250 875.237500 61700.0 5.764201e+06 0.129850 0.000213 0.000152
Wyoming 86.230769 30742.375 6.10500 44.890625 63900.0 5.788710e+05 0.053107 0.000149 0.000078

EDA

Below is a plot of the drug deaths per 10,000 people in the United States from 2015-2020. It visualizes the increase in drug overdose deaths over these years.

In [59]:
#Note: normalized_deaths is constructed in a later cell (In [45]); run that cell before this one
py.plot((normalized_deaths.loc["United States"]["Total Drug Deaths"]/normalized_deaths.loc["United States"]["Population"])*10000)
py.title("Drug Deaths per 10,000 People in United States From 2015-2020")
py.xlabel("Year")
py.ylabel("Death Rate per 10,000 People")
Out[59]:
Text(0, 0.5, 'Death Rate per 10,000 People')

To get an initial look at the issue we aim to explore, we graphed the number of drug overdose deaths from 2015 to 2022 for DC, Louisiana, and West Virginia. To do this, we grouped the total drug deaths dataframe by state name and year. We plotted the non-normalized graph first to show just how big the issue is, and it is only getting worse. We hope to demonstrate the necessity of taking action in certain states based on their high overdose death rates and the other factors that correlate with them.

In [44]:
drug_deaths_summary= drug_deaths_total.groupby(["State Name","Year"]).mean()
state_names = ["District of Columbia", "Louisiana", "West Virginia"]
for i in state_names:
  py.plot(drug_deaths_summary.loc[i])
py.title("Drug Deaths From 2015-2022 For Select States")
py.xlabel("Year")
py.ylabel("Number Of Deaths")
Out[44]:
Text(0, 0.5, 'Number Of Deaths')

We then graph the drug overdose deaths per person in DC, Louisiana, and West Virginia from 2015-2020. From this graph, you can tell that many of the states with the highest drug overdose death rates over these years are still trending upward. We plan on focusing on the states with the highest overdose death rates and looking at factors that could be playing a major role in these increases.

In [45]:
normalized_deaths = pd.merge(drug_deaths_summary, population_by_year, on = ["State Name","Year"], how = "inner")
for i in state_names:
  py.plot(normalized_deaths.loc[i]["Total Drug Deaths"]/normalized_deaths.loc[i]["Population"])
py.title("Normalized Drug Deaths Based on State Population From 2015-2020")
py.xlabel("Year")
py.ylabel("Death Rate per Person")
Out[45]:
Text(0, 0.5, 'Death Rate per Person')

Comparing Drug Overdose Deaths and Total Opioid Claims in each state

Below is a scatterplot of the total drug overdose deaths vs. total opioid claims. We plotted this data to see if there was any correlation between the two variables. Our results made clear the need to normalize the data by population: states with very large or very small populations appear as outliers on this plot.

In [46]:
#Scatter plot comparing average deaths and average opioid claims for each state
py.figure(figsize=(10,5))
#x-axis is raised to the 0.1 power to compress the large spread in claim counts
py.scatter(avg_state['Tot_Opioid_Clms']**.1,avg_state["Total Drug Deaths"])
py.title('Total Drug Overdose Deaths vs Total Opioid Claims in the U.S.')
py.xlabel("Total Opioid Claims ($x^{0.1}$ scale)")
py.ylabel("Drug Overdose Deaths")
Out[46]:
Text(0, 0.5, 'Drug Overdose Deaths')

California appears in both of the first two entries below: it has both the most drug overdose deaths and the most total opioid claims. This proves interesting once you look at the state with the fewest overdose deaths, South Dakota, and the state with the fewest total opioid claims, Wyoming. While the opioid prescribing rates between these states are similar, the total number of deaths and total opioid claims differ by thousands between the entries.

In [47]:
#state with the most drug overdose deaths
display(avg_state.loc[avg_state["Total Drug Deaths"]== avg_state["Total Drug Deaths"].max()])
#state with the most opioid claims
display(avg_state.loc[avg_state["Tot_Opioid_Clms"]== avg_state["Tot_Opioid_Clms"].max()])
#state with the fewest drug overdose deaths
display(avg_state.loc[avg_state["Total Drug Deaths"]== avg_state["Total Drug Deaths"].min()])
#state with the fewest opioid claims
display(avg_state.loc[avg_state["Tot_Opioid_Clms"]== avg_state["Tot_Opioid_Clms"].min()])
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
California 6625.769231 3534421.625 4.71125 4838.243243 74740.0 3.867048e+07 0.091398 0.000171 0.000125
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
California 6625.769231 3534421.625 4.71125 4838.243243 74740.0 3.867048e+07 0.091398 0.000171 0.000125
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
South Dakota 74.285714 41600.125 4.98625 32.434211 58180.0 856139.181818 0.04859 0.000087 0.000038
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
Wyoming 86.230769 30742.375 6.105 44.890625 63900.0 578871.0 0.053107 0.000149 0.000078

Even though California stands out as having the most opioid overdose deaths and opioid claims, this is primarily due to California's large population. Based on these results, it is clear that we need to normalize the data using each state's population so that we can visualize trends accurately.

Below is the normalized scatterplot, which displays the drug overdose death rate (deaths/population) against the opioid claims rate (claims/population). We added a best-fit line to show the trend that seems to manifest itself on this graph.

In [48]:
#Scatter plot comparing the normalized death rate and opioid claims rate for each state

py.figure(figsize=(10,5))
py.scatter(avg_state['Opioid_Clms_Rate'], avg_state["Deaths_Rate"])

#Line of Best Fit
a, b = np.polyfit(avg_state['Opioid_Clms_Rate'], avg_state["Deaths_Rate"], 1)
py.plot(avg_state['Opioid_Clms_Rate'], a*avg_state['Opioid_Clms_Rate']+b)

#Annotating Outliers that are shown below
py.annotate("West Virginia", (avg_state.loc["West Virginia"]["Opioid_Clms_Rate"]-.015, avg_state.loc["West Virginia"]["Deaths_Rate"]-.00003))
py.annotate("South Dakota", (avg_state.loc["South Dakota"]["Opioid_Clms_Rate"]+.002, avg_state.loc["South Dakota"]["Deaths_Rate"]))
py.annotate("District of Columbia", (avg_state.loc["District of Columbia"]["Opioid_Clms_Rate"]+.002, avg_state.loc["District of Columbia"]["Deaths_Rate"]))

py.title('Drug Overdose Deaths Rate vs Opioid Claims Rate')
py.ylabel("Drug Overdose Deaths Rate")
py.xlabel('Opioid Claims Rate')
py.show()

From this graph, it appears there is a correlation between the rate at which opioids are claimed and used as medication in a state and the overdose death rate those states experience. Below we highlight a few of the extreme cases to get a better idea of the data we are dealing with. In an upcoming section we use a correlation matrix to see if we can corroborate these results.

In [49]:
#state with the highest drug overdose death rate
display(avg_state.loc[avg_state["Deaths_Rate"]== avg_state["Deaths_Rate"].max()])
#state with the highest opioid claims rate
display(avg_state.loc[avg_state["Opioid_Clms_Rate"]== avg_state["Opioid_Clms_Rate"].max()])
#state with the lowest drug overdose death rate
display(avg_state.loc[avg_state["Deaths_Rate"]== avg_state["Deaths_Rate"].min()])
#state with the lowest opioid claims rate
display(avg_state.loc[avg_state["Opioid_Clms_Rate"]== avg_state["Opioid_Clms_Rate"].min()])
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
West Virginia 1005.318681 365201.125 4.30625 798.098901 46120.0 1.832192e+06 0.199325 0.000549 0.000436
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
West Virginia 1005.318681 365201.125 4.30625 798.098901 46120.0 1.832192e+06 0.199325 0.000549 0.000436
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
South Dakota 74.285714 41600.125 4.98625 32.434211 58180.0 856139.181818 0.04859 0.000087 0.000038
Total Drug Deaths Tot_Opioid_Clms Opioid_Prscrbng_Rate Opioid Deaths Income Population Opioid_Clms_Rate Deaths_Rate Opioid_Deaths_Rate
State Name
Texas 3341.846154 1090523.5 3.52375 1436.045455 61560.0 2.736909e+07 0.039845 0.000122 0.000052

West Virginia immediately stands out with the highest rate of drug overdose deaths and opioid claims of any state. West Virginia is amidst an opioid epidemic worse than it has ever seen; the high rates of drug use and overdose there have created a situation where workers are scarce and children struggle to grow up with parents or friends unaffected by the crisis. More information can be found from the DEA at this link.

Correlation Matrix for Combined Dataframes

To further investigate whether correlations were present within the data, specifically whether any features correlated with the drug overdose death rate, we took a look at a correlation matrix.

First, we looked at the correlation matrix for our combined table which has the average data for each state from 2015-2020.

In [50]:
py.figure(figsize=(12,10))
cor = avg_state.corr()
sns.heatmap(cor, annot=True, cmap=py.cm.Reds)
py.show()

As expected, the population of a state has a strong correlation with the total drug overdose deaths and the total opioid claims. Of particular interest to us was the correlation value of 0.48 between the overdose death rate and the opioid claims rate. We graphed this relationship above and noticed a potentially linear relationship; the correlation value further supports our findings.
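A single pairwise value like this one can also be pulled out directly with pandas' `Series.corr`, without drawing the full heatmap. A minimal sketch on synthetic data (the column names mirror `avg_state`; the values are made up for illustration):

```python
import pandas as pd

# Synthetic stand-in for avg_state; these values are illustrative only.
avg_state_demo = pd.DataFrame({
    "Opioid_Clms_Rate": [0.05, 0.09, 0.12, 0.14, 0.20],
    "Deaths_Rate": [0.00009, 0.00017, 0.00025, 0.00031, 0.00055],
})

# Pearson correlation between the two rate columns: the same number
# the corresponding heatmap cell would report.
r = avg_state_demo["Opioid_Clms_Rate"].corr(avg_state_demo["Deaths_Rate"])
print(round(r, 2))
```

On the real `avg_state` table this call would reproduce the 0.48 cell from the heatmap; the synthetic data above is constructed to be nearly linear, so its correlation is much higher.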

In [51]:
py.figure(figsize=(12,10))
cor = presc_deaths_income_pop.corr()
sns.heatmap(cor, annot=True, cmap=py.cm.Reds)
py.show()

This correlation matrix yields less interesting results. When the data is not aggregated by state, the relationships do not appear as strong.

Income Correlation

After reviewing the correlation matrix, we wanted to visualize the relationship between income and opioid claim rates per state. Although it is slight, the graph below shows a correlation between the two variables: as average income increases, the opioid claim rate decreases. We believe the average income across an entire state is not the best representation of economic factors for a region, so other economic data would be needed to study this further.

In [52]:
#y-axis is raised to the 2.8 power to spread out the tight cluster of low claim rates
py.scatter(avg_state['Income'], avg_state["Opioid_Clms_Rate"]**2.8)
py.title("State's Average Income vs Average Opioid Claims Rate")
py.xlabel("Income")
py.ylabel("Opioid Claim Rate ($x^{2.8}$ scale)")
py.annotate("West Virginia", (avg_state.loc["West Virginia"]["Income"]+1000, avg_state.loc["West Virginia"]["Opioid_Clms_Rate"]**2.8-.0002))
py.show()

Model Question

Can we predict the number of drug overdoses a specific state will experience based on data from different years?

Our model predicts the drug overdose death rate in a state for a given year, based on the features given and data from other years. We identified multiple states with climbing overdose death totals, so being able to predict the increase or decrease a state will experience would prove helpful in addressing the overdose problem throughout the country and in helping different areas prepare. Our independent variables are the state name, opioid prescribing rate, income, population, and opioid claim rate. Our dependent variable is the drug overdose death rate (deaths per 10,000 people). We will test the model by comparing its predicted drug overdose death rates to the actual values in our datasets, using an 80/20 train/test split with a linear regression model.
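The split-and-fit procedure just described can be sketched as follows. This is a minimal, numpy-only illustration on synthetic features (the real model would be fit on the state-level features above, presumably with scikit-learn's LinearRegression), not the notebook's actual fit:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the model dataframe: 100 rows, 3 numeric
# features, and a target that depends linearly on them plus noise.
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# 80/20 train/test split via a shuffled index.
idx = rng.permutation(len(X))
train, test = idx[:80], idx[80:]

# Ordinary least squares with an explicit intercept column.
Xtr = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)

# Evaluate on the held-out 20% with root-mean-squared error.
Xte = np.column_stack([np.ones(len(test)), X[test]])
pred = Xte @ coef
rmse = np.sqrt(np.mean((pred - y[test]) ** 2))
print(rmse)
```

The held-out RMSE is the comparison of predicted to actual values described above; on the real data, state name would need to be encoded (e.g. one-hot) before it can enter the design matrix.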

Model

Below is a spread of all the variables contained in our current combined dataset. For the model, we will drop the features that contain redundant information or have an extreme skew.

In [53]:
presc_deaths_income_pop.hist(bins=50, figsize=(20,15))
py.show()
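Skew can be checked numerically as well as visually. The sketch below uses pandas' `.skew()` on synthetic data to flag heavily skewed columns; the column names and values are illustrative, not the exact ones in `presc_deaths_income_pop`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Population": rng.lognormal(mean=13, sigma=1.2, size=200),  # right-skewed
    "Income": rng.normal(60000, 8000, size=200),                # roughly symmetric
})

# Sample skewness far from 0 suggests transforming (e.g. log) or
# dropping the feature before fitting a linear model.
skewness = df.skew()
highly_skewed = skewness[skewness.abs() > 1].index.tolist()
print(highly_skewed)
```

The threshold of 1 is a common rule of thumb, not a hard cutoff.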

We drop the total opioid claims because we created a normalized rate, the "Opioid Claim Rate" feature. We also drop the "Opioid Deaths" column, which gets into some assumptions we made about the model. We assumed that with information on opioids and the rates at which they are prescribed and claimed, we can predict the drug overdose death rate per 10,000 people. This generalizes opioid information to encompass all drug overdoses, but a multitude of other factors influence this death rate from year to year. Another possible issue stems from people overdosing in a state where they are not residents, yet their death counting toward that state's total. A future step would be to examine and account for all the different biases present, something we attempted but did not do perfectly. Additionally, we assumed grouping by state could yield significant results, but having data by county and parish would be extremely beneficial. Many states are large, with huge variation in prescribing rates, income, and deaths even between counties, so examining regions more specifically would strengthen the analysis.

In [54]:
model_df = presc_deaths_income_pop.drop(["Tot_Opioid_Clms", "Opioid Deaths", "Total Drug Deaths"], axis = 1)

This is the dataframe that will be used for the model. The Death Rate is the target variable we are trying to predict with a Linear Regression model.

In [55]:
model_df
Out[55]:
Opioid_Prscrbng_Rate Income Population Opioid Claim Rate Death Rate
State Name Year
Alaska 2015 7.61 79200.0 738430.0 0.099469 1.661182
2016 7.90 81400.0 742575.0 0.129940 1.788820
2017 6.50 76300.0 740983.0 0.131913 1.766797
2018 5.76 75700.0 736624.0 0.109137 1.631316
2019 4.56 75500.0 733603.0 0.089141 1.753900
... ... ... ... ... ... ...
Wisconsin 2018 5.40 61900.0 5809319.0 0.101770 1.932814
2019 4.70 64200.0 5824581.0 0.086529 1.948209
Wyoming 2017 5.85 63000.0 579994.0 0.050373 1.251450
2018 4.94 62700.0 579054.0 0.040404 1.098055
2019 4.43 65000.0 580116.0 0.033810 1.144886

165 rows × 5 columns
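The Death Rate column above is expressed per 10,000 residents. As a minimal sketch of how such a rate is derived from raw counts, the snippet below uses hypothetical death and population values; the real figures come from the CDC and Census datasets loaded earlier, and the column names here are assumptions.

```python
import pandas as pd

# Hypothetical raw counts for two years of one state.
raw = pd.DataFrame({
    "Total Drug Deaths": [123, 145],
    "Population": [738430, 742575],
}, index=["2015", "2016"])

# Deaths per 10,000 people, matching the units of the Death Rate column.
raw["Death Rate"] = raw["Total Drug Deaths"] / raw["Population"] * 10_000
print(raw["Death Rate"].round(3).tolist())
```

Normalizing by population is what makes rates comparable between small and large states.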

First, the data is split into an 80% training set and 20% test set for model evaluation.

In [56]:
from sklearn.model_selection import train_test_split
X = model_df.drop(["Death Rate"], axis = 1)
y = model_df["Death Rate"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20, random_state=0)

We fit a linear regression model to the training data using sklearn's LinearRegression class.

In [57]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
Out[57]:
LinearRegression()

We use the mean absolute error (MAE) to evaluate our model's ability to predict the death rate. It measures the average absolute distance between our predictions and the true values.

In [58]:
# Mean absolute error on the training and test sets
from sklearn.metrics import mean_absolute_error

y_train_pred = lin_reg.predict(X_train)
y_test_pred = lin_reg.predict(X_val)

train_mae = mean_absolute_error(y_train, y_train_pred)
test_mae = mean_absolute_error(y_val, y_test_pred)
print('Train MAE: ' + str(train_mae))
print('Test MAE: ' + str(test_mae))
Train MAE: 0.5373734949536381
Test MAE: 0.6623537393234372

The training set mean absolute error was 0.537. This can be interpreted as the predicted drug overdose death rate being, on average, within 0.537 of the actual death rate per 10,000 people. The test MAE was close, at 0.662.

Conclusions

While our model did a decent job of predicting the death rate, bringing in more features and a larger dataset would increase its accuracy. We believe that with improvement it could help states predict trends in future drug overdose deaths and stop a problem before it gets worse. If we could predict the states with the greatest number of opioid deaths, then our model could be used to target regions that need more drug education funding, Narcan supplies, and other resources, in hopes of lowering the predicted death count and saving lives.
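The targeting idea above amounts to ranking states by predicted rate. A minimal sketch, using hypothetical predicted values (in practice these would come from `lin_reg.predict` on feature rows for a future year):

```python
import pandas as pd

# Hypothetical model outputs; state names and rates are illustrative.
predicted = pd.Series(
    {"West Virginia": 4.1, "Ohio": 3.2, "Wyoming": 1.1, "Alaska": 1.7},
    name="Predicted Death Rate",
)

# States sorted from highest predicted rate down: a simple priority
# list for allocating education funding and Narcan supplies.
priority = predicted.sort_values(ascending=False)
print(priority.index.tolist())
```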