The overall objective of this project is to discover the relationship between year and the total crime count in Maryland and, finally, to predict the total crime count in 2021. Hypothesis tests will be set up and different models will be built. For citizens of Maryland, we hope this analysis introduces some helpful facts about their neighborhoods so that they can become more familiar with the environment they live in. For the governor, we believe this tutorial can provide some potential solutions to the crime problem and help prevent future crime based on our predictions.
Throughout this tutorial, we will attempt to uncover the trends in the numbers of typical violent and property crimes such as murder, rape, robbery, and larceny theft, as well as the proportion of each type of crime and the proportion of violent versus property crimes. We will also use a linear model and a polynomial model to predict total crime in Maryland.
import sqlite3
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, metrics, datasets
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats
import seaborn as sns
import statsmodels.api as sm
The dataset we are going to analyze comes from https://opendata.maryland.gov/Public-Safety/Violent-Crime-Property-Crime-by-County-1975-to-Pre/jwfa-fdxs. The data comes in the form of a CSV (Comma-Separated Values) file, so the pandas.read_csv function is used to read it into a pandas DataFrame.
data = pd.read_csv("data.csv")
data.head()
As shown in the dataframe above, there are some missing values for several columns.
In order to better understand the missing values, all the rows containing them are displayed below so that we can choose the best way to handle them.
data.isnull().values.any()
is_NaN = data.isnull()
rows = is_NaN.any(axis=1)
df_r = data[rows]
data = data.fillna(0)
We observe that the missing values are all in the rate-of-change columns for 1975. Since 1975 is the starting year of our dataset, it makes sense that no rate of change can be computed for it, so we decide to simply set these values to zero.
data.columns = data.columns.str.replace(" ", "_")
This is the exploratory data analysis section. In this section, we plot several charts, such as scatter plots and pie charts, to present crime trends from 1975 to 2017; the scatter and pie charts are drawn with matplotlib. 'GRAND_TOTAL' is the total number of crimes in a specific county, so we use the sum of GRAND_TOTAL over all counties to represent the number of crimes that happened in a given year. We then plot a scatter chart with year on the x-axis and the number of crimes in that year on the y-axis to see how the number of crimes has trended over time.
Total_crime = data.groupby('YEAR')['GRAND_TOTAL'].sum()
plt.title('Total crime across time')
x = Total_crime.index
y = Total_crime.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Total_Crime')
From the scatter chart, we can see that from 1975 to 1995 the number of crimes was increasing, but since 1995 it has been gradually going down, and the decreasing trend suggests that the total crime count will keep falling in the future.
After we have the scatter chart of total crime across time, we repeat the process for murder numbers, rape numbers, robbery numbers, larceny theft, and M/V theft across time to see whether the general trend of each type of crime is consistent with the others.
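As a side note, the five charts below could also be produced with one short loop. The following is just a compact sketch of that alternative (assuming the column names used in the cells below); the individual cells are kept for clarity.
# Compact alternative (sketch): one scatter plot per crime type, using the same
# groupby-and-sum pattern as the individual cells below.
crime_cols = ['MURDER', 'RAPE', 'ROBBERY', 'LARCENY_THEFT', 'M/V_THEFT']
for col in crime_cols:
    counts = data.groupby('YEAR')[col].sum()
    plt.figure()
    plt.scatter(counts.index, counts.values)
    plt.title(col + ' across time')
    plt.xlabel('years')
    plt.ylabel(col)
plt.show()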
Murder = data.groupby('YEAR')['MURDER'].sum()
plt.title('Murder number across time')
x = Murder.index
y = Murder.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Murder')
Rape = data.groupby('YEAR')['RAPE'].sum()
plt.title('Rape number across time')
x = Rape.index
y = Rape.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Rape')
Robbery = data.groupby('YEAR')['ROBBERY'].sum()
plt.title('Robbery across time')
x = Robbery.index
y = Robbery.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Robbery')
larceny_theft = data.groupby('YEAR')['LARCENY_THEFT'].sum()
plt.title('Larceny theft across time')
x = larceny_theft.index
y = larceny_theft.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('larceny_theft')
MV_theft = data.groupby('YEAR')['M/V_THEFT'].sum()
plt.title('M/V theft across time')
x = MV_theft.index
y = MV_theft.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('MV_theft')
After comparing these six scatter charts with each other, we can see that the general trends of the five individual crime types are mostly consistent with the total crime trend, which means that almost every type of crime has been decreasing since about 2000.
sns.set(rc={'figure.figsize':(20,10)})
vp = sns.violinplot(x="YEAR", y="GRAND_TOTAL", data = data)
The chart above is called a violin plot and is created with seaborn (sns). A violin plot is a method of plotting numeric data that is similar to a box plot, with the addition of a rotated kernel density plot on each side, so it also shows the probability density of the data at different values, usually smoothed by a kernel density estimator. A violin plot typically includes everything a box plot does: a marker for the median of the data, a box or marker indicating the interquartile range, and possibly all sample points if the number of samples is not too high.
Percent_change = data.groupby('YEAR')['PERCENT_CHANGE'].mean()
plt.title('Crime percentage change across time')
x = Percent_change.index
y = Percent_change.values
plt.scatter(x,y)
plt.xlabel('years')
plt.ylabel('Percentage')
The chart above shows the mean percentage change in the number of crimes each year. When a dot is below 0, the number of crimes decreased that year.
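As a small illustrative check (a sketch, not part of the main analysis), we can list the years in which the average percent change was negative, i.e., the years in which crime fell on average across counties.
# Years in which the mean percent change across counties was negative (crime fell on average).
mean_change = data.groupby('YEAR')['PERCENT_CHANGE'].mean()
print(mean_change[mean_change < 0].index.tolist())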
Violent_crime_ratio = data['VIOLENT_CRIME_TOTAL'].sum()
Property_crime_ratio = data['PROPERTY_CRIME_TOTALS'].sum()
fig1, ax1 = plt.subplots()
Ratio = [Violent_crime_ratio, Property_crime_ratio]
labels = 'Violent_Crime', 'Property_Crime'
explode = (0.1,0)
ax1.pie(Ratio, explode = explode, labels = labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.show()
The pie chart above shows the proportion of the two crime categories. Violent crime accounts for only about 15% of the grand total; most crimes are property crimes such as M/V theft and larceny theft.
Murder_type = data["MURDER"].mean()
Rape_type = data['RAPE'].mean()
Robbery_type = data['ROBBERY'].mean()
Larceny_theft_type = data['LARCENY_THEFT'].mean()
MV_theft_type = data['M/V_THEFT'].mean()
fig2, ax2 = plt.subplots()
pie_chart = [Murder_type, Rape_type, Robbery_type, Larceny_theft_type, MV_theft_type]
labels = 'Murder','Rape','Robbery','Larceny theft','M/V theft'
explode = (0,0,0,0,0)
ax2.pie(pie_chart, explode = explode, labels = labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax2.axis('equal')
plt.show()
The chart above shows the proportion of each specific type of crime.
By observing the graph above and using intuition, we think there might be a linear relationship between the independent variable YEAR and the dependent variable GRAND_TOTAL.
Therefore, we want to fit a linear regression model and run a hypothesis test with the null hypothesis that there is no relationship between YEAR and GRAND_TOTAL.
The linear regression model is created with the linear model from sklearn. For more information: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
data = data.sort_values('YEAR')
reg = linear_model.LinearRegression()
new_year = np.array(data.YEAR)[:, np.newaxis]
reg.fit(new_year, data.GRAND_TOTAL)
crime_predicted = reg.predict(new_year)
data['crime_prediction']= crime_predicted
plt.plot(new_year, crime_predicted, color='blue', linewidth=3)
print('Intercept:', reg.intercept_)
print('Coeff:', reg.coef_)
print('Score:',reg.score(new_year,data.GRAND_TOTAL))
print('Mean Absolute Error:', metrics.mean_absolute_error(data.GRAND_TOTAL, crime_predicted))
print('Mean Squared Error:', metrics.mean_squared_error(data.GRAND_TOTAL, crime_predicted))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(data.GRAND_TOTAL, crime_predicted)))
print('Pearson correlation coefficient & p-value:', stats.pearsonr(data.GRAND_TOTAL, crime_predicted))
It turns out that the null hypothesis cannot be rejected: the p-value is 0.096, which is bigger than the significance level of 0.05. Therefore, we fail to reject the null hypothesis, meaning that we cannot conclude there is a linear relationship between year and total crime count. The score, which is the R^2 value, is only about 0.0026.
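As an optional cross-check (a minimal sketch, not part of the original pipeline), the same hypothesis test can be run with statsmodels, whose fitted model reports the p-value of the YEAR coefficient directly; it should agree with the Pearson p-value and the R^2 printed above.
# Cross-check: simple OLS of GRAND_TOTAL on YEAR; the p-value of the YEAR
# coefficient corresponds to the test of "no linear relationship".
simple_ols = sm.OLS(data.GRAND_TOTAL, sm.add_constant(data.YEAR)).fit()
print(simple_ols.pvalues)    # p-value for YEAR
print(simple_ols.rsquared)   # should match the score (R^2) above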
We do notice a huge Root Mean Squared Error between the predicted total crime and the real total crime. So we decide to compute the residual by subtracting the predicted total crime from the real total crime and make a violin plot of its distribution to get a better understanding of our prediction.
To create a violin plot we use the seaborn.violinplot() function. For more information: https://seaborn.pydata.org/generated/seaborn.violinplot.html
data['residual'] = data['GRAND_TOTAL'] - data['crime_prediction']
sns.set(rc={'figure.figsize':(40,20)})
ax = sns.violinplot(x="YEAR", y="residual", data = data)
From the graph above we can see that the prediction of total crime is really poor. The range of the residuals is extremely large every year, meaning that the model fails to account for extreme values, and the predicted crime count seems to exceed the real number almost every year.
This violin plot suggests an over-generalization. Therefore, we decide to take the differences between counties into account: we believe the residuals depend on the county, since each county seems to have its own typical crime level.
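To support this claim, a quick illustrative check (a sketch, using the residual column computed above) is to look at the mean residual of the year-only model for each county; large county-specific offsets suggest that the jurisdiction should enter the model.
# Mean residual of the year-only model per county: systematic offsets indicate
# that each county has its own baseline crime level.
data.groupby('JURISDICTION')['residual'].mean().sort_values()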
A new linear regression model with interaction terms between year and county is implemented, with the equation:

$$\text{GRAND\_TOTAL} = a_0 + a_1\,\text{YEAR} + \sum_{c} a_c\,\text{County}_c + \sum_{c} a'_c\,(\text{YEAR} \times \text{County}_c)$$

where each $a$ is the coefficient for its term and $\text{County}_c$ is the 0/1 dummy variable for county $c$.
Y = data.GRAND_TOTAL
X_inter = data[['YEAR','JURISDICTION']]
X_inter = pd.get_dummies(data=X_inter)  # counties are transformed to binary (0/1) dummy columns
for i in X_inter.columns:
    if i != 'YEAR':
        X_inter['YEAR' + '*' + i] = X_inter.YEAR * X_inter[i]
X_inter = sm.add_constant(X_inter)
inter_model = sm.OLS(Y, X_inter).fit()
data['interaction_prediction'] = inter_model.predict(X_inter).values
Here, we choose the model from the statsmodels package so that we can observe the regression results in more detail in the summary table it produces. For more information on using the model: https://www.statsmodels.org/stable/regression.html
inter_model.summary()
Our R^2 is 0.958 now! We did expect a higher R^2 value here, but this model fits unexpectedly well.
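Since we added a large number of dummy and interaction columns, it is worth glancing at the adjusted R^2 as well (a quick optional check), which penalizes the extra parameters.
# Adjusted R^2 penalizes the many dummy/interaction columns added to the model.
print(inter_model.rsquared, inter_model.rsquared_adj)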
Let's create a new violin plot to observe the distribution of the new residual.
data['inter_residual'] = data['GRAND_TOTAL'] - data['interaction_prediction']
sns.set(rc={'figure.figsize':(40,20)})
ax = sns.violinplot(x="YEAR", y="inter_residual", data = data)
From the violin plot shown above, we can conclude that the effect of year on total crime is influenced remarkably by the county, which matches our intuition.
Most of the mean residuals lie on the 0 axis, which is good. However, there are certain outliers shown in the graph, especially between 1991 and 1996 and after 2010. After 2010, our model seems to predict larger total crime counts compared to the real numbers.
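To see exactly which county/year combinations the interaction model misses the most, a quick illustrative peek at the rows with the largest absolute residuals (a sketch using the columns defined above) can help.
# Rows with the largest absolute residuals under the interaction model.
worst_idx = data['inter_residual'].abs().nlargest(10).index
data.loc[worst_idx, ['YEAR', 'JURISDICTION', 'GRAND_TOTAL', 'interaction_prediction', 'inter_residual']]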
Let's take a detailed look at the total crime of Prince George's (PG) County, which is our county, after 2010.
pg = data.loc[(data['YEAR'] >= 2010) & (data['JURISDICTION'] == 'Prince George\'s County')]
pg[['YEAR','GRAND_TOTAL','interaction_prediction']]
It seems that the prediction goes pretty well at the beginning, but the real number drops abruptly as time goes by.
Therefore, we decide to fit a model of a higher degree to capture this change.
The polynomial model has degree two, with the function:

$$Y = a_0 + \sum_{i} a_i\,x_i + \sum_{i} b_i\,x_i^2$$

where the $x_i$ are the columns of X_inter (YEAR, the county dummies, and the YEAR × county interaction terms) and the $a_i$, $b_i$ are the fitted coefficients.
We set up this function manually by adding squared terms, because expanding the features with PolynomialFeatures from sklearn (or the equivalent in statsmodels) would also include interaction terms between the counties, which would be useless here.
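As a rough illustration of why we expand the features manually (a sketch, not used for the model itself), we can compare the number of columns our manual expansion produces with the number PolynomialFeatures would generate, since the latter also creates pairwise products between the county dummies.
# Feature-count comparison (illustrative only; X_inter still contains the constant column here).
X_no_const = X_inter.drop(columns='const')
n_manual = 1 + 2 * X_no_const.shape[1]  # constant + each column + its square
n_sklearn = PolynomialFeatures(degree=2).fit_transform(X_no_const).shape[1]  # includes all pairwise products
print(n_manual, n_sklearn)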
X_inter = X_inter.drop(columns = 'const')
xp = np.hstack((X_inter, X_inter**2))
xp = sm.add_constant(xp)
poly_model = sm.OLS(Y, xp).fit()
poly_pred = poly_model.predict(xp)
data['poly_prediction'] = poly_pred
poly_model.summary()
We can observe a higher R^2 value of 0.978, meaning that the polynomial model is now doing a better job at prediction.
Let's take a look at the prediction of the PG county by the polynomial model again.
pg = data.loc[(data['YEAR'] >= 2010) & (data['JURISDICTION'] == 'Prince George\'s County')]
pg[['YEAR','GRAND_TOTAL','poly_prediction']]
This time, the predicted total crime count drops much more quickly than the number predicted by the linear model, giving a better fit.
Let's try to predict the GRAND_TOTAL of Prince George's County in 2021.
X_2021 = [0] * len(X_inter.columns)
for index, i in enumerate(X_inter.columns):
    if i == 'YEAR':
        X_2021[index] = 2021
    if i == 'JURISDICTION_Prince George\'s County':
        X_2021[index] = 1
    if i == 'YEAR*JURISDICTION_Prince George\'s County':
        X_2021[index] = 2021
X_2021 = np.hstack((X_2021, np.array(X_2021)**2))
X_2021 = np.insert(X_2021, 0, 1)
poly_model.predict(X_2021)
Our model predicts a 2021 crime count of 16846 for Prince George's County, which looks pretty decent based on the data observed before.
We want to predict the 2021 crime numbers of all the Maryland counties using our model.
We choose a seaborn bar plot because it makes it easy to observe the predicted number of total crimes for each county and to compare the levels of safety among the counties. For more information on creating the bar plot: https://seaborn.pydata.org/generated/seaborn.barplot.html
columns = np.array(X_inter.columns)
columns = np.delete(columns,0)
columns = columns.reshape(2,-1)
predictions_2021 = {'county':[],'total_crime':[]}
for j in range(len(columns[0])):
    X_2021 = [0] * len(X_inter.columns)
    for index, i in enumerate(X_inter.columns):
        if i == 'YEAR':
            X_2021[index] = 2021
        if i == columns[0][j]:
            X_2021[index] = 1
        if i == columns[1][j]:
            X_2021[index] = 2021
    X_2021 = np.hstack((X_2021, np.array(X_2021)**2))
    X_2021 = np.insert(X_2021, 0, 1)
    pred = poly_model.predict(X_2021)
    predictions_2021['county'].append(columns[0][j].split("_", 1)[1])
    predictions_2021['total_crime'].append(pred[0])
df_2021 = pd.DataFrame(predictions_2021)
sns.set(font_scale=2)
ax = sns.barplot(x = 'county', y = "total_crime", data = df_2021)
ax.set_title('Crime Predictions of Maryland in 2021')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
The differences in predicted total crime across counties are quite obvious. We can observe that our county has almost the highest predicted number of crimes committed in Maryland, and the gaps among the counties are huge.
After fitting the simple linear function y = ax + b to predict the total crime Y in year X, we found that the impact of the county should also be taken into account. Therefore, based on the huge differences among the counties, we added interaction terms of year * county, and the predictions became noticeably more accurate. However, we noticed a sudden decrease in the total number of crimes in recent years, so we finally fit a polynomial model with interaction terms, whose predictions are impressive.
Based on our analysis and observations, we believe that the number of crimes in Maryland has been dropping much more quickly since 2010 than before, meaning that our state is getting safer in recent years. However, from the data and predictions, we conclude that total crime counts show huge discrepancies among the different counties in Maryland.
Although the easiest suggestion that comes to mind might simply be to send more police officers to provide a better level of security in the counties with higher numbers of crimes committed, we think further analysis of each county, the characteristics of its crimes, and its population should be conducted to provide a better solution.