Final Tutorial

Analysis of Crime in Maryland

Group members: Zhifan Wang, Shanghua Yang

INTRODUCTION

The overall objective of this project is to discover the relationship between the year and the total number of crimes in Maryland and, finally, to predict the total number of crimes in 2021. Hypothesis tests will be set up and different models will be built. For citizens of Maryland, we hope this analysis introduces some helpful facts about their neighborhoods so that they can become more familiar with the environment they live in. For the state government, we believe this tutorial can point to potential responses to the crime problem and help prevent future crime based on our predictions.

Throughout this tutorial, we will attempt to uncover the trends in the numbers of typical violent and property crimes such as murder, rape, robbery, and larceny theft, as well as the proportion of each type of crime and the proportion of violent versus property crimes. We will also use a linear model and a polynomial model to predict the total number of crimes in Maryland.

Required Tools

  1. sqlite3
  2. sys
  3. numpy
  4. pandas
  5. matplotlib
  6. sklearn
  7. scipy
  8. seaborn
  9. statsmodels.api
In [1]:
import sqlite3
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics, datasets
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats
import seaborn as sns
import statsmodels.api as sm

Collecting & Tidying the Data

The dataset we are going to analyze comes from https://opendata.maryland.gov/Public-Safety/Violent-Crime-Property-Crime-by-County-1975-to-Pre/jwfa-fdxs. The dataset comes in the form of a CSV (Comma-Separated Values) file, so the pandas.read_csv function is used here to read the data and load it into a pandas DataFrame for us.

In [2]:
data = pd.read_csv("data.csv")
data.head()
Out[2]:
JURISDICTION YEAR POPULATION MURDER RAPE ROBBERY AGG. ASSAULT B & E LARCENY THEFT M/V THEFT ... B & E PER 100,000 PEOPLE LARCENY THEFT PER 100,000 PEOPLE M/V THEFT PER 100,000 PEOPLE MURDER RATE PERCENT CHANGE PER 100,000 PEOPLE RAPE RATE PERCENT CHANGE PER 100,000 PEOPLE ROBBERY RATE PERCENT CHANGE PER 100,000 PEOPLE AGG. ASSAULT RATE PERCENT CHANGE PER 100,000 PEOPLE B & E RATE PERCENT CHANGE PER 100,000 PEOPLE LARCENY THEFT RATE PERCENT CHANGE PER 100,000 PEOPLE M/V THEFT RATE PERCENT CHANGE PER 100,000 PEOPLE
0 Allegany County 1975 79655 3 5 20 114 669 1425 93 ... 839.9 1789.0 116.8 NaN NaN NaN NaN NaN NaN NaN
1 Allegany County 1976 83923 2 2 24 59 581 1384 73 ... 692.3 1649.1 87.0 -36.7 -62.0 13.9 -50.9 -17.6 -7.8 -25.5
2 Allegany County 1977 82102 3 7 32 85 592 1390 102 ... 721.1 1693.0 124.2 53.3 257.8 36.3 47.3 4.2 2.7 42.8
3 Allegany County 1978 79966 1 2 18 81 539 1390 100 ... 674.0 1738.2 125.1 -65.8 -70.7 -42.2 -2.2 -6.5 2.7 0.7
4 Allegany County 1979 79721 1 7 18 84 502 1611 99 ... 629.7 2020.8 124.2 0.3 251.1 0.3 4.0 -6.6 16.3 -0.7

5 rows × 38 columns

As shown in the dataframe above, several columns contain missing values.

To better understand them, all rows containing missing values are collected and displayed below so that we can choose the best way to handle them.

In [3]:
# Check whether the dataframe contains any missing values at all
print(data.isnull().values.any())
# Collect every row that has at least one missing value so we can inspect them
is_NaN = data.isnull()
rows = is_NaN.any(axis=1)
df_r = data[rows]
df_r
In [4]:
data = data.fillna(0)

We observe that the missing values are the rates of change for the year 1975. Since 1975 is the first year of our dataset, there is no previous year to compute a rate of change from, so we decide to just set these values to zero.
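As a quick sanity check (a minimal sketch using the df_r table collected above; if the claim holds it should print just [1975]), we can list the years that contain missing values:

In [ ]:
# Every row with a missing value should come from 1975, the first year of the
# dataset, since there is no earlier year to compute a rate of change from
print(df_r['YEAR'].unique())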

In [5]:
data.columns = data.columns.str.replace(" ", "_")

Exploratory Data Analysis

This is the exploratory data analysis section. In this section, we plot several charts, such as scatter plots and pie charts, to present the trends in crime from 1975 to 2017; the scatter and pie charts are drawn with matplotlib. 'GRAND_TOTAL' is the total number of crimes in a specific county, so we use the sum of GRAND_TOTAL over all counties to represent the number of crimes that happened in a given year. We then plot a scatter chart with the year on the x-axis and the number of crimes in that year on the y-axis to see how the number of crimes changes over time.

In [6]:
Total_crime = data.groupby('YEAR')['GRAND_TOTAL'].sum()
plt.title('Total crime across time')
x = Total_crime.index
y = Total_crime.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Total_Crime')
Out[6]:
Text(0, 0.5, 'Total_Crime')

From the scatter chart, we find that from 1975 to 1995 the number of crimes was increasing, but since 1995 it has been gradually going down, and the downward trend suggests that total crime numbers will keep decreasing in the future.

After obtaining the scatter chart of total crimes over time, we repeat the process for murder, rape, robbery, larceny theft, and M/V theft to see whether the general trend of each type of crime is consistent with the others.

In [7]:
Murder = data.groupby('YEAR')['MURDER'].sum()
plt.title('Murder number across time')
x = Murder.index
y = Murder.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Murder')
Out[7]:
Text(0, 0.5, 'Murder')
In [8]:
Rape = data.groupby('YEAR')['RAPE'].sum()
plt.title('Rape number across time')
x = Rape.index
y = Rape.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Rape')
Out[8]:
Text(0, 0.5, 'Rape')
In [9]:
Robbery = data.groupby('YEAR')['ROBBERY'].sum()
plt.title('Robbery across time')
x = Robbery.index
y = Robbery.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Robbery')
Out[9]:
Text(0, 0.5, 'Robbery')
In [10]:
larceny_theft = data.groupby('YEAR')['LARCENY_THEFT'].sum()
plt.title('Larceny theft across time')
x = larceny_theft.index
y = larceny_theft.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('larceny_theft')
Out[10]:
Text(0, 0.5, 'larceny_theft')
In [11]:
MV_theft = data.groupby('YEAR')['M/V_THEFT'].sum()
plt.title('M/V theft across time')
x = MV_theft.index
y = MV_theft.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('MV_theft')
Out[11]:
Text(0, 0.5, 'MV_theft')

After producing these six scatter charts and comparing them with each other, we find that the general trends of the five individual crime types are mostly consistent with the total crime trend, which means that almost every type of crime has been decreasing since about 2000.
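As an optional consolidation (a minimal sketch; the column names follow the renamed dataframe above), all six series can also be drawn on one chart, scaled to their 1975 values, so the trends are easier to compare directly:

In [ ]:
# Plot yearly state-wide totals for each crime type relative to 1975,
# so every trend shares one scale and can be compared on a single chart
crime_cols = ['MURDER', 'RAPE', 'ROBBERY', 'LARCENY_THEFT', 'M/V_THEFT', 'GRAND_TOTAL']
yearly = data.groupby('YEAR')[crime_cols].sum()
(yearly / yearly.iloc[0]).plot(figsize=(12, 6), title='Crime totals relative to 1975')
plt.xlabel('years')
plt.ylabel('total relative to 1975')
plt.show()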

In [12]:
sns.set(rc={'figure.figsize':(20,10)})
vp = sns.violinplot(x="YEAR", y="GRAND_TOTAL", data = data)

The chart above is a violin plot, created with seaborn (sns). A violin plot is a method of plotting numeric data that is similar to a box plot, with the addition of a rotated kernel density plot on each side, so it also shows the probability density of the data at different values, usually smoothed by a kernel density estimator. A violin plot typically includes everything a box plot does: a marker for the median, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.

In [13]:
Total_crime = data.groupby('YEAR')['PERCENT_CHANGE'].mean()
plt.title('Crime percentage change across time')
x = Total_crime.index
y = Total_crime.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Percentage')
Out[13]:
Text(0, 0.5, 'Percentage')

The chart above shows the percentage change in the number of crimes each year. When a dot is below 0, the number of crimes decreased in that year.

In [14]:
Violent_crime_ratio = data['VIOLENT_CRIME_TOTAL'].sum()
Property_crime_ratio = data['PROPERTY_CRIME_TOTALS'].sum()
fig1, ax1 = plt.subplots()
Ratio = [Violent_crime_ratio, Property_crime_ratio]

labels = 'Violent_Crime', 'Property_Crime'
explode = (0.1,0)
ax1.pie(Ratio, explode = explode, labels = labels, autopct='%1.1f%%', shadow=True, startangle=90) 
ax1.axis('equal')
plt.show()

The pie chart above shows the proportions of the two broad types of crime. Violent crime accounts for only about 15% of the grand total; most crimes are property crimes such as M/V theft and larceny theft.

In [15]:
Murder_type = data["MURDER"].mean()
Rape_type = data['RAPE'].mean()
Robbery_type = data['ROBBERY'].mean()
Larceny_theft_type = data['LARCENY_THEFT'].mean()
MV_theft_type = data['M/V_THEFT'].mean()
fig2, ax2 = plt.subplots()
pie_chart = [Murder_type, Rape_type, Robbery_type, Larceny_theft_type, MV_theft_type]
labels = 'Murder','Rape','Robbery','Larceny theft','M/V theft'
explode = (0,0,0,0,0)
ax2.pie(pie_chart, explode = explode, labels = labels,  autopct='%1.1f%%', shadow=True, startangle=90)
ax2.axis('equal')
plt.show()

The chart above shows the proportion of each specific type of crime.

Hypothesis Test & Machine Learning

Linear Model

By observing the chart above and using intuition, we think there might be a linear relationship between the independent variable YEAR and the dependent variable GRAND_TOTAL.

Therefore, we want to fit a linear regression model and run a hypothesis test with the null hypothesis that there is no relationship between YEAR and GRAND_TOTAL.

The linear regression model is created with the linear model from sklearn. For more information: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [16]:
data = data.sort_values('YEAR')
reg = linear_model.LinearRegression()
# Reshape YEAR into a 2D column vector, as sklearn expects
new_year = np.array(data.YEAR)[:, np.newaxis]
reg.fit(new_year, data.GRAND_TOTAL)
crime_predicted = reg.predict(new_year)
data['crime_prediction'] = crime_predicted
# Draw the fitted regression line
plt.plot(new_year, crime_predicted, color='blue', linewidth=3)
Out[16]:
[<matplotlib.lines.Line2D at 0x28bfee8f640>]
In [17]:
print('Intercept:', reg.intercept_)
print('Coeff:', reg.coef_)
print('Score:',reg.score(new_year,data.GRAND_TOTAL))
print('Mean Absolute Error:', metrics.mean_absolute_error(data.GRAND_TOTAL, crime_predicted))
print('Mean Squared Error:', metrics.mean_squared_error(data.GRAND_TOTAL, crime_predicted))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(data.GRAND_TOTAL, crime_predicted)))
print('Pearson correlation coefficient & p-value:', stats.pearsonr(data.GRAND_TOTAL, crime_predicted))
Intercept: 153380.66954847475
Coeff: [-71.72319541]
Score: 0.0026826458299703537
Mean Absolute Error: 12066.864583333327
Mean Squared Error: 294516762.1370686
Root Mean Squared Error: 17161.490673512853
Pearson correlation coefficient & p-value: (0.051794264450517144, 0.0963168536453429)

It turns out that the null hypothesis cannot be rejected. The p-value is 0.096, which is larger than the significance level of 0.05, so we fail to reject the null hypothesis: there is no evidence of a linear relationship between year and total crime number. The score, which is the R^2 value, is only about 0.0027.
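As a cross-check (a minimal sketch; for a simple regression, the t-test on the slope is equivalent to the Pearson correlation test), the same p-value can be read directly off a statsmodels fit of GRAND_TOTAL on YEAR:

In [ ]:
# Fit GRAND_TOTAL ~ const + YEAR with statsmodels; the p-value on the YEAR
# coefficient should agree with the Pearson p-value of about 0.096 reported above
simple_ols = sm.OLS(data.GRAND_TOTAL, sm.add_constant(data.YEAR)).fit()
print('YEAR p-value:', simple_ols.pvalues['YEAR'])
print('R-squared:', simple_ols.rsquared)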

We do notice a huge root mean squared error between the predicted and the actual total crime. So we decide to compute the residual by subtracting the predicted total crime from the actual total crime, and to make a violin plot of the residual distribution to get a better understanding of our prediction.

To create a violin plot we use the seaborn.violinplot() function. For more information: https://seaborn.pydata.org/generated/seaborn.violinplot.html

In [18]:
data['residual'] = data['GRAND_TOTAL'] - data['crime_prediction'] 
sns.set(rc={'figure.figsize':(40,20)})
ax = sns.violinplot(x="YEAR", y="residual", data = data)

From the graph above we can see that the prediction of total crime is quite poor. The range of the residuals is extremely large every year, meaning that the single state-wide line does not account for extreme values at all. Also, the predicted crime count seems to exceed the actual number almost every year.

Linear Model with Interaction Terms

This violin plot suggests an over-generalization. Therefore, we decide to take the differences between counties into account. We believe the residual depends on the county, since each county seems to have its own typical crime level.
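One quick way to see this dependence (a minimal sketch using the residual column computed above): if the county matters, the per-county mean residuals of the pooled linear model should sit far from zero and differ strongly from county to county.

In [ ]:
# Mean residual of the pooled linear model for each county; large positive or
# negative values mean the single state-wide line fits that county poorly
data.groupby('JURISDICTION')['residual'].mean().sort_values()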

A new linear regression model with interaction terms between year and county is implemented, with the equation:

Predicted_Total_Crime = constant + a * year + b1 * county1 + ... + bN * countyN + c1 * (year * county1) + ... + cN * (year * countyN)

[a, bi, and ci are the coefficients; each countyi is a 0/1 dummy variable for one jurisdiction]

In [19]:
Y = data.GRAND_TOTAL
X_inter = data[['YEAR','JURISDICTION']]
X_inter = pd.get_dummies(data=X_inter)  # counties are transformed to 0/1 dummy columns
# Add a YEAR*county interaction term for every county dummy
for i in X_inter.columns:
    if(i != 'YEAR'):
        X_inter['YEAR'+'*'+ i] = X_inter.YEAR * X_inter[i]
X_inter = sm.add_constant(X_inter)
inter_model = sm.OLS(Y, X_inter).fit()
data['interaction_prediction'] = inter_model.predict(X_inter).values

Here, we choose the OLS model from the statsmodels package so that we can inspect the regression results in a detailed summary table. For more information on the model: https://www.statsmodels.org/stable/regression.html

In [20]:
inter_model.summary()
Out[20]:
OLS Regression Results
Dep. Variable: GRAND_TOTAL R-squared: 0.958
Model: OLS Adj. R-squared: 0.956
Method: Least Squares F-statistic: 474.3
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.00
Time: 12:13:47 Log-Likelihood: -9895.8
No. Observations: 1032 AIC: 1.989e+04
Df Residuals: 984 BIC: 2.012e+04
Df Model: 47
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.472e+05 1.74e+04 8.466 0.000 1.13e+05 1.81e+05
YEAR -68.8543 8.714 -7.902 0.000 -85.954 -51.755
JURISDICTION_Allegany County -1.69e+05 8.69e+04 -1.945 0.052 -3.4e+05 1517.179
JURISDICTION_Anne Arundel County -5.012e+04 8.69e+04 -0.577 0.564 -2.21e+05 1.2e+05
JURISDICTION_Baltimore City 1.961e+06 8.69e+04 22.567 0.000 1.79e+06 2.13e+06
JURISDICTION_Baltimore County 6.69e+05 8.69e+04 7.699 0.000 4.98e+05 8.39e+05
JURISDICTION_Calvert County -1.959e+05 8.69e+04 -2.255 0.024 -3.66e+05 -2.54e+04
JURISDICTION_Caroline County -1.727e+05 8.69e+04 -1.987 0.047 -3.43e+05 -2159.812
JURISDICTION_Carroll County -1.708e+05 8.69e+04 -1.966 0.050 -3.41e+05 -330.172
JURISDICTION_Cecil County -2.221e+05 8.69e+04 -2.556 0.011 -3.93e+05 -5.16e+04
JURISDICTION_Charles County -2.172e+05 8.69e+04 -2.499 0.013 -3.88e+05 -4.66e+04
JURISDICTION_Dorchester County -1.425e+05 8.69e+04 -1.640 0.101 -3.13e+05 2.8e+04
JURISDICTION_Frederick County -1.755e+05 8.69e+04 -2.020 0.044 -3.46e+05 -4978.843
JURISDICTION_Garrett County -1.449e+05 8.69e+04 -1.667 0.096 -3.15e+05 2.56e+04
JURISDICTION_Harford County -9.848e+04 8.69e+04 -1.133 0.257 -2.69e+05 7.2e+04
JURISDICTION_Howard County -1.827e+05 8.69e+04 -2.102 0.036 -3.53e+05 -1.22e+04
JURISDICTION_Kent County -1.383e+05 8.69e+04 -1.591 0.112 -3.09e+05 3.22e+04
JURISDICTION_Montgomery County 2.637e+05 8.69e+04 3.035 0.002 9.32e+04 4.34e+05
JURISDICTION_Prince George's County 4.35e+05 8.69e+04 5.006 0.000 2.64e+05 6.06e+05
JURISDICTION_Queen Anne's County -1.588e+05 8.69e+04 -1.827 0.068 -3.29e+05 1.17e+04
JURISDICTION_Somerset County -1.486e+05 8.69e+04 -1.711 0.087 -3.19e+05 2.19e+04
JURISDICTION_St. Mary's County -1.812e+05 8.69e+04 -2.085 0.037 -3.52e+05 -1.07e+04
JURISDICTION_Talbot County -1.415e+05 8.69e+04 -1.628 0.104 -3.12e+05 2.9e+04
JURISDICTION_Washington County -1.536e+05 8.69e+04 -1.768 0.077 -3.24e+05 1.69e+04
JURISDICTION_Wicomico County -2.053e+05 8.69e+04 -2.362 0.018 -3.76e+05 -3.48e+04
JURISDICTION_Worcester County -1.122e+05 8.69e+04 -1.292 0.197 -2.83e+05 5.83e+04
YEAR*JURISDICTION_Allegany County 80.9490 43.531 1.860 0.063 -4.476 166.374
YEAR*JURISDICTION_Anne Arundel County 29.9397 43.531 0.688 0.492 -55.485 115.365
YEAR*JURISDICTION_Baltimore City -955.3972 43.531 -21.947 0.000 -1040.822 -869.972
YEAR*JURISDICTION_Baltimore County -321.2223 43.531 -7.379 0.000 -406.648 -235.797
YEAR*JURISDICTION_Calvert County 93.9514 43.531 2.158 0.031 8.526 179.377
YEAR*JURISDICTION_Caroline County 82.0047 43.531 1.884 0.060 -3.421 167.430
YEAR*JURISDICTION_Carroll County 82.1075 43.531 1.886 0.060 -3.318 167.533
YEAR*JURISDICTION_Cecil County 107.7928 43.531 2.476 0.013 22.368 193.218
YEAR*JURISDICTION_Charles County 105.9251 43.531 2.433 0.015 20.500 191.350
YEAR*JURISDICTION_Dorchester County 67.1136 43.531 1.542 0.123 -18.312 152.539
YEAR*JURISDICTION_Frederick County 85.2514 43.531 1.958 0.050 -0.174 170.677
YEAR*JURISDICTION_Garrett County 67.9588 43.531 1.561 0.119 -17.466 153.384
YEAR*JURISDICTION_Harford County 47.1690 43.531 1.084 0.279 -38.256 132.594
YEAR*JURISDICTION_Howard County 90.3082 43.531 2.075 0.038 4.883 175.733
YEAR*JURISDICTION_Kent County 64.5936 43.531 1.484 0.138 -20.832 150.019
YEAR*JURISDICTION_Montgomery County -123.5343 43.531 -2.838 0.005 -208.960 -38.109
YEAR*JURISDICTION_Prince George's County -198.6962 43.531 -4.564 0.000 -284.121 -113.271
YEAR*JURISDICTION_Queen Anne's County 75.0906 43.531 1.725 0.085 -10.335 160.516
YEAR*JURISDICTION_Somerset County 69.9024 43.531 1.606 0.109 -15.523 155.328
YEAR*JURISDICTION_St. Mary's County 87.0097 43.531 1.999 0.046 1.584 172.435
YEAR*JURISDICTION_Talbot County 66.4810 43.531 1.527 0.127 -18.944 151.906
YEAR*JURISDICTION_Washington County 73.7895 43.531 1.695 0.090 -11.636 159.215
YEAR*JURISDICTION_Wicomico County 99.9828 43.531 2.297 0.022 14.558 185.408
YEAR*JURISDICTION_Worcester County 52.6750 43.531 1.210 0.227 -32.750 138.100
Omnibus: 601.487 Durbin-Watson: 1.789
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24839.372
Skew: 2.016 Prob(JB): 0.00
Kurtosis: 26.694 Cond. No. 1.68e+19


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.52e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Our R^2 is now 0.958! We did expect a higher R^2 value after adding the interaction terms, but this model fits unexpectedly well.

Let's create a new violin plot to observe the distribution of the new residual.

In [21]:
data['inter_residual'] = data['GRAND_TOTAL'] - data['interaction_prediction'] 
sns.set(rc={'figure.figsize':(40,20)})
ax = sns.violinplot(x="YEAR", y="inter_residual", data = data)

From the violin plot shown above, we can conclude that the effect of year on total crime is influenced remarkably by the county, which matches our intuition.

Most of the mean residuals lie close to the 0 axis, which is good. However, there are clear outliers in the graph, especially between 1991 and 1996 and after 2010. After 2010, our model seems to predict larger numbers of total crime than the actual numbers.

Let's take a detailed look at the total crime of Prince George's (PG) County, which is our county, after 2010.

In [22]:
pg = data.loc[(data['YEAR'] >= 2010) & (data['JURISDICTION'] == 'Prince George\'s County')]
pg[['YEAR','GRAND_TOTAL','interaction_prediction']]
Out[22]:
YEAR GRAND_TOTAL interaction_prediction
723 2010 43073 44460.852009
724 2011 38110 44193.301572
725 2012 36072 43925.751134
726 2013 33058 43658.200696
727 2014 30671 43390.650258
728 2015 26386 43123.099820
729 2016 24230 42855.549382
730 2017 22558 42587.998944

It seems that the prediction matches pretty well at the beginning, but the actual number drops more and more sharply as time goes by.

Therefore, we decide to fit a model of higher degree to capture this change.

Polynomial Model

The polynomial model has degree two, with the function:

Predicted_Total_Crime = constant + a * year + b1 * county1 + ... + bN * countyN + c1 * (year * county1) + ... + cN * (year * countyN) + d * year^2 + e1 * (year * county1)^2 + ... + eN * (year * countyN)^2

We set up these squared terms manually (by squaring each column of the design matrix) instead of using PolynomialFeatures from sklearn or statsmodels, because a full degree-2 expansion would also include cross terms between different counties, which are always zero for one-hot dummies and therefore useless. Note that squaring a 0/1 county dummy leaves it unchanged, so the duplicated dummy columns are one source of the multicollinearity warning in the summary below.
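To illustrate why (a minimal toy sketch with a hypothetical 3-feature row: one year value plus two county dummies), a full degree-2 expansion produces every pairwise product of the inputs:

In [ ]:
# PolynomialFeatures(degree=2) expands [x1, x2, x3] into
# [1, x1, x2, x3, x1^2, x1*x2, x1*x3, x2^2, x2*x3, x3^2],
# i.e. it also creates county_i * county_j cross terms, which are always zero
# for one-hot county dummies and would only bloat the design matrix
toy = np.array([[2021.0, 1.0, 0.0]])
print(PolynomialFeatures(degree=2).fit_transform(toy))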

In [23]:
X_inter = X_inter.drop(columns = 'const')  # drop the constant; it is re-added after squaring below
In [24]:
# Append the square of every column (YEAR, the county dummies, and the YEAR*county terms)
xp = np.hstack((X_inter, X_inter**2))
xp = sm.add_constant(xp)
poly_model = sm.OLS(Y, xp).fit()
poly_pred = poly_model.predict(xp) 
data['poly_prediction'] = poly_pred
poly_model.summary()
Out[24]:
OLS Regression Results
Dep. Variable: GRAND_TOTAL R-squared: 0.978
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 595.5
Date: Mon, 21 Dec 2020 Prob (F-statistic): 0.00
Time: 12:13:48 Log-Likelihood: -9563.5
No. Observations: 1032 AIC: 1.927e+04
Df Residuals: 960 BIC: 1.963e+04
Df Model: 71
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -3.075e+07 2.21e+06 -13.924 0.000 -3.51e+07 -2.64e+07
x1 3.213e+04 2301.287 13.960 0.000 2.76e+04 3.66e+04
x2 1.519e+07 5.74e+06 2.648 0.008 3.93e+06 2.64e+07
x3 -6.788e+06 5.74e+06 -1.183 0.237 -1.8e+07 4.47e+06
x4 -9.863e+07 5.74e+06 -17.192 0.000 -1.1e+08 -8.74e+07
x5 -3.172e+07 5.74e+06 -5.530 0.000 -4.3e+07 -2.05e+07
x6 1.343e+07 5.74e+06 2.341 0.019 2.17e+06 2.47e+07
x7 1.433e+07 5.74e+06 2.497 0.013 3.07e+06 2.56e+07
x8 9.377e+06 5.74e+06 1.635 0.102 -1.88e+06 2.06e+07
x9 1.486e+07 5.74e+06 2.590 0.010 3.6e+06 2.61e+07
x10 4.729e+06 5.74e+06 0.824 0.410 -6.53e+06 1.6e+07
x11 1.512e+07 5.74e+06 2.636 0.009 3.87e+06 2.64e+07
x12 9.032e+06 5.74e+06 1.574 0.116 -2.23e+06 2.03e+07
x13 1.496e+07 5.74e+06 2.607 0.009 3.7e+06 2.62e+07
x14 9.092e+06 5.74e+06 1.585 0.113 -2.17e+06 2.04e+07
x15 1.141e+06 5.74e+06 0.199 0.842 -1.01e+07 1.24e+07
x16 1.522e+07 5.74e+06 2.653 0.008 3.96e+06 2.65e+07
x17 -3.351e+07 5.74e+06 -5.840 0.000 -4.48e+07 -2.22e+07
x18 -8.867e+07 5.74e+06 -15.457 0.000 -9.99e+07 -7.74e+07
x19 1.398e+07 5.74e+06 2.436 0.015 2.72e+06 2.52e+07
x20 1.406e+07 5.74e+06 2.451 0.014 2.8e+06 2.53e+07
x21 1.39e+07 5.74e+06 2.422 0.016 2.64e+06 2.52e+07
x22 1.4e+07 5.74e+06 2.441 0.015 2.74e+06 2.53e+07
x23 1.614e+07 5.74e+06 2.813 0.005 4.88e+06 2.74e+07
x24 7.162e+06 5.74e+06 1.248 0.212 -4.1e+06 1.84e+07
x25 1.286e+07 5.74e+06 2.242 0.025 1.61e+06 2.41e+07
x26 -3.177e+04 1.15e+04 -2.763 0.006 -5.43e+04 -9207.495
x27 1.235e+04 1.15e+04 1.074 0.283 -1.02e+04 3.49e+04
x28 1.974e+05 1.15e+04 17.182 0.000 1.75e+05 2.2e+05
x29 6.269e+04 1.15e+04 5.454 0.000 4.01e+04 8.52e+04
x30 -2.826e+04 1.15e+04 -2.456 0.014 -5.08e+04 -5680.206
x31 -3.004e+04 1.15e+04 -2.612 0.009 -5.26e+04 -7466.455
x32 -2.012e+04 1.15e+04 -1.750 0.080 -4.27e+04 2440.287
x33 -3.113e+04 1.15e+04 -2.708 0.007 -5.37e+04 -8571.645
x34 -1.083e+04 1.15e+04 -0.941 0.347 -3.34e+04 1.17e+04
x35 -3.162e+04 1.15e+04 -2.753 0.006 -5.42e+04 -9080.085
x36 -1.943e+04 1.15e+04 -1.691 0.091 -4.2e+04 3115.926
x37 -3.129e+04 1.15e+04 -2.725 0.007 -5.38e+04 -8755.638
x38 -1.951e+04 1.15e+04 -1.698 0.090 -4.21e+04 3040.629
x39 -3618.8127 1.15e+04 -0.315 0.753 -2.62e+04 1.89e+04
x40 -3.182e+04 1.15e+04 -2.767 0.006 -5.44e+04 -9248.751
x41 6.605e+04 1.15e+04 5.743 0.000 4.35e+04 8.86e+04
x42 1.767e+05 1.15e+04 15.365 0.000 1.54e+05 1.99e+05
x43 -2.933e+04 1.15e+04 -2.552 0.011 -5.19e+04 -6773.330
x44 -2.949e+04 1.15e+04 -2.565 0.010 -5.21e+04 -6926.561
x45 -2.918e+04 1.15e+04 -2.538 0.011 -5.17e+04 -6617.408
x46 -2.937e+04 1.15e+04 -2.555 0.011 -5.19e+04 -6808.967
x47 -3.366e+04 1.15e+04 -2.926 0.004 -5.62e+04 -1.11e+04
x48 -1.57e+04 1.15e+04 -1.366 0.172 -3.83e+04 6856.196
x49 -2.708e+04 1.15e+04 -2.355 0.019 -4.96e+04 -4517.967
x50 -59.8864 29.353 -2.040 0.042 -117.491 -2.282
x51 1.519e+07 5.74e+06 2.648 0.008 3.93e+06 2.64e+07
x52 -6.788e+06 5.74e+06 -1.183 0.237 -1.8e+07 4.47e+06
x53 -9.863e+07 5.74e+06 -17.192 0.000 -1.1e+08 -8.74e+07
x54 -3.172e+07 5.74e+06 -5.530 0.000 -4.3e+07 -2.05e+07
x55 1.343e+07 5.74e+06 2.341 0.019 2.17e+06 2.47e+07
x56 1.433e+07 5.74e+06 2.497 0.013 3.07e+06 2.56e+07
x57 9.378e+06 5.74e+06 1.635 0.102 -1.88e+06 2.06e+07
x58 1.486e+07 5.74e+06 2.590 0.010 3.6e+06 2.61e+07
x59 4.728e+06 5.74e+06 0.824 0.410 -6.53e+06 1.6e+07
x60 1.512e+07 5.74e+06 2.636 0.009 3.87e+06 2.64e+07
x61 9.032e+06 5.74e+06 1.574 0.116 -2.23e+06 2.03e+07
x62 1.496e+07 5.74e+06 2.607 0.009 3.7e+06 2.62e+07
x63 9.092e+06 5.74e+06 1.585 0.113 -2.17e+06 2.04e+07
x64 1.142e+06 5.74e+06 0.199 0.842 -1.01e+07 1.24e+07
x65 1.522e+07 5.74e+06 2.653 0.008 3.96e+06 2.65e+07
x66 -3.351e+07 5.74e+06 -5.840 0.000 -4.48e+07 -2.22e+07
x67 -8.867e+07 5.74e+06 -15.457 0.000 -9.99e+07 -7.74e+07
x68 1.398e+07 5.74e+06 2.436 0.015 2.72e+06 2.52e+07
x69 1.406e+07 5.74e+06 2.451 0.014 2.8e+06 2.53e+07
x70 1.39e+07 5.74e+06 2.422 0.016 2.64e+06 2.52e+07
x71 1.4e+07 5.74e+06 2.441 0.015 2.74e+06 2.53e+07
x72 1.614e+07 5.74e+06 2.813 0.005 4.88e+06 2.74e+07
x73 7.162e+06 5.74e+06 1.248 0.212 -4.1e+06 1.84e+07
x74 1.286e+07 5.74e+06 2.242 0.025 1.61e+06 2.41e+07
x75 59.7995 27.640 2.163 0.031 5.557 114.042
x76 48.7363 29.215 1.668 0.096 -8.596 106.069
x77 2.1246 29.553 0.072 0.943 -55.871 60.120
x78 36.0376 29.803 1.209 0.227 -22.449 94.524
x79 58.9233 29.482 1.999 0.046 1.067 116.780
x80 59.3668 29.775 1.994 0.046 0.936 117.798
x81 56.8823 29.864 1.905 0.057 -1.724 115.488
x82 59.6462 29.340 2.033 0.042 2.068 117.225
x83 54.5598 29.334 1.860 0.063 -3.007 112.126
x84 59.7602 29.772 2.007 0.045 1.335 118.186
x85 56.7098 29.486 1.923 0.055 -1.155 114.575
x86 59.6770 29.446 2.027 0.043 1.892 117.462
x87 56.7210 31.088 1.825 0.068 -4.288 117.730
x88 52.7506 28.179 1.872 0.062 -2.550 108.051
x89 59.8075 29.417 2.033 0.042 2.078 117.537
x90 35.2443 29.609 1.190 0.234 -22.861 93.350
x91 7.5055 29.025 0.259 0.796 -49.454 64.465
x92 59.1876 29.395 2.014 0.044 1.502 116.873
x93 59.2275 29.431 2.012 0.044 1.470 116.985
x94 59.1532 29.606 1.998 0.046 1.054 117.253
x95 59.1964 29.674 1.995 0.046 0.963 117.430
x96 60.2726 29.773 2.024 0.043 1.846 118.700
x97 55.7788 29.796 1.872 0.062 -2.695 114.252
x98 58.6179 29.630 1.978 0.048 0.471 116.765
Omnibus: 507.445 Durbin-Watson: 1.959
Prob(Omnibus): 0.000 Jarque-Bera (JB): 24468.076
Skew: 1.504 Prob(JB): 0.00
Kurtosis: 26.664 Cond. No. 7.41e+18


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.11e-22. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

We can observe a higher R^2 value of 0.978, meaning that the polynomial model now does a better job of fitting the data.
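For a rough numerical comparison (a minimal sketch; these are in-sample errors, not out-of-sample validation), we can compute the root mean squared error of the fitted values that each model stored in the dataframe:

In [ ]:
# In-sample RMSE of each model's fitted values against GRAND_TOTAL;
# a lower value means the fitted values track the actual totals more closely
for col in ['crime_prediction', 'interaction_prediction', 'poly_prediction']:
    rmse = np.sqrt(metrics.mean_squared_error(data.GRAND_TOTAL, data[col]))
    print(col, round(rmse, 1))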

Let's take another look at the polynomial model's predictions for PG County.

In [25]:
pg = data.loc[(data['YEAR'] >= 2010) & (data['JURISDICTION'] == 'Prince George\'s County')]
pg[['YEAR','GRAND_TOTAL','poly_prediction']]
Out[25]:
YEAR GRAND_TOTAL poly_prediction
723 2010 43073 42260.786981
724 2011 38110 40474.189745
725 2012 36072 38582.830675
726 2013 33058 36586.709770
727 2014 30671 34485.827031
728 2015 26386 32280.182457
729 2016 24230 29969.776049
730 2017 22558 27554.607806

This time, the predicted total crime drops much more quickly than the numbers predicted by the linear interaction model, and it tracks the actual values more closely.

Let's try to predict the GRAND_TOTAL of Prince George's County in 2021.

In [26]:
# Build a single 2021 feature row for Prince George's County in the same
# column order as X_inter, then append the squared terms and the constant
X_2021 = [0] * len(X_inter.columns)
for index, i in enumerate(X_inter.columns): 
    if(i == 'YEAR'):
        X_2021[index] = 2021
    if(i == 'JURISDICTION_Prince George\'s County'):
        X_2021[index] = 1
    if(i == 'YEAR*JURISDICTION_Prince George\'s County'):
        X_2021[index] = 2021
X_2021 = np.hstack((X_2021, np.array(X_2021)**2))
X_2021 = np.insert(X_2021, 0, 1)  # prepend the constant term
poly_model.predict(X_2021)
Out[26]:
array([16846.31648842])

Our model predicts about 16,846 total crimes for PG County in 2021, which looks reasonable given the data observed above.

We want to predict the 2021 crime number of all the Maryland counties using our model.

We choose a seaborn bar plot because it makes it easy to read off the predicted number of total crimes for each county and to compare the levels of safety among the counties. For more information on creating the bar plot: https://seaborn.pydata.org/generated/seaborn.barplot.html

In [27]:
# Column 0 of X_inter is YEAR; the remaining columns split evenly into the
# county dummies (row 0 below) and the matching YEAR*county terms (row 1)
columns = np.array(X_inter.columns)
columns = np.delete(columns,0)
columns = columns.reshape(2,-1)
predictions_2021 = {'county':[],'total_crime':[]}
for j in range(len(columns[0])):
    # Build the 2021 feature row for county j, as in the PG County example above
    X_2021 = [0] * len(X_inter.columns)
    for index, i in enumerate(X_inter.columns): 
        if(i == 'YEAR'):
            X_2021[index] = 2021
        if(i == columns[0][j]):
            X_2021[index] = 1
        if(i == columns[1][j]):
            X_2021[index] = 2021
    X_2021 = np.hstack((X_2021, np.array(X_2021)**2))
    X_2021 = np.insert(X_2021, 0, 1)
    pred = poly_model.predict(X_2021)
    # Keep only the county name by stripping the 'JURISDICTION_' prefix
    predictions_2021['county'].append(columns[0][j].split("_",1)[1])
    predictions_2021['total_crime'].append(pred[0])
df_2021 = pd.DataFrame(predictions_2021)
In [28]:
sns.set(font_scale=2) 
ax = sns.barplot(x = 'county', y = "total_crime", data = df_2021)
ax.set_title('Crime Predictions of Maryland in 2021')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

The predicted total crime of each county is easy to read off. We can see that our county has nearly the highest predicted number of crimes in Maryland, and the differences among the counties are huge.
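To make the comparison concrete (a minimal sketch using the df_2021 table built above), the predictions can simply be ranked:

In [ ]:
# Rank the 2021 predictions from the highest to the lowest predicted total crime
df_2021.sort_values('total_crime', ascending=False).head(10)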

Conclusion & Insight

After fitting the simple linear function y = ax + b to predict the total crime y in year x, we found that the impact of the county should also be taken into account. Therefore, based on the huge differences among the counties, we added year*county interaction terms, and the predictions became noticeably more accurate. However, we noticed a sharp drop in the total number of crimes in recent years, so we finally fit a polynomial model with interaction terms, whose predictions are impressive.

Based on our analysis and observations, we believe that the number of crimes in Maryland has dropped much more quickly since 2010 than before, meaning that our state has been getting safer in recent years. However, from the data and the predictions, we also conclude that total crime numbers show huge discrepancies among the different counties of Maryland.

Although the easiest suggestion that comes to mind might be to send more police officers to provide a better level of security in the counties with higher numbers of crimes, we think further analysis of each county, its crime characteristics, and its population should be conducted to provide a better solution.