Yel Martinez - Consultor Seo

Analyzing the severity of car accidents

Business problem - Introduction

1. A description of the problem and a discussion of the background

Traffic accidents are one of the leading causes of death worldwide and a major source of economic expenditure. Despite the numerous measures and campaigns deployed every year to raise awareness of the seriousness of the problem, accidents still occur frequently. Their impact on society and the economy is high: human losses are compounded by large expenditures on health care, awareness campaigns, mobilization of specialized personnel, etc. The WHO puts the economic impact of road accidents in a developed country at 2 to 3% of GDP, a significant figure for any country. Reducing these losses has become an issue of general interest.

Defining the problem:

  • What are the factors that have a high impact on road accidents?

  • Is there a pattern to them?

  • Are these factors correlated with the severity of the accident?

We will have to analyze the data to get a clearer picture and draw conclusions.

Introduction

Note that this work is the final project of the IBM certification course, for which we were provided with the data used to develop the project.

These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.

The data cover the period from 2004 to the present and record information on the severity of each traffic accident, its location, the type of collision, weather and road conditions, visibility, the number of people involved, etc.

The objective is to define the problem and find the factors that carry relevant weight in the number and severity of accidents, so that any organization or company interested in reducing these figures can focus its resources on the points where these conditions converge.

To provide greater clarity, I will analyze the data to see whether there are relationships or patterns, especially in high-impact accidents, so that preventive measures can focus on these points as a first prevention strategy.

Data to be used

2. A description of the data and how it will be used to solve the problem

To predict the magnitude of the damage caused by accidents accurately, prediction models need to be trained on a large number of traffic accident reports with accurate data. The data set provided for this work contains a record of about 200,000 accidents in the city of Seattle, from 2004 to the date it was issued. It records 37 attributes or variables, and the type of accident is coded according to 84 codes. The following information can be extracted from it:

  • speed information
  • information on road conditions and visibility
  • type of collision
  • affected persons, etc.

The data will be used so that we can determine which attributes are most common in traffic accidents in order to target prevention at these high-incidence points.
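As a minimal sketch of how such frequencies could be counted with pandas, the example below uses a tiny hand-made DataFrame. The rows are invented for illustration; only the column names SEVERITYCODE and ROADCOND come from the data set.

```python
import pandas as pd

# Tiny illustrative sample; these rows are made up, not real records.
sample = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 1, 2, 1],
    "ROADCOND": ["Wet", "Wet", "Dry", "Dry", "Ice", "Dry"],
})

# How often each road condition appears in the sample.
road_counts = sample["ROADCOND"].value_counts()

# Share of each severity code within each road condition (each row sums to 1).
severity_share = pd.crosstab(
    sample["ROADCOND"], sample["SEVERITYCODE"], normalize="index"
)

print(road_counts)
print(severity_share)
```

Applied to the real data, the same two calls would show which conditions are most frequent and how severity is distributed within each of them.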

Data Source

Data Source: These data have been collected and shared by the Seattle Police Department (Traffic Records) and are provided by Coursera for downloading through a link.

Data Location: Coursera_Capstone/Data assets

Data set name: Data-Collisions (1)_shaped.csv

Methodology

Objective: The objective of this project is to predict the severity of a traffic accident based on the other characteristics contained in the report.

Packages and libraries: We will use libraries and packages for data manipulation, data visualization and modeling: pandas, NumPy, SciPy, Matplotlib, Seaborn and scikit-learn.

An exploratory data analysis will be performed to determine which machine learning methodology is the most appropriate, and to get a first look at the data that we find most relevant for this project.

Obtaining and cleaning data

Importing libraries and packages

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
print('imported')

Uploading the data

df_data_1 = pd.read_csv('Data-Collisions.csv')
df_data_1.head()

# choosing the data we will work with
test = ['SEVERITYCODE', 'SPEEDING','ROADCOND']
df_data_1 = df_data_1[test].copy()

# inspecting the unique values of each feature
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())

['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
'Standing Water' 'Oil']

# in SPEEDING we replace NaN with 'N' (no speeding reported)
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].fillna('N')

# in ROADCOND we replace NaN with 'Unknown'
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].fillna('Unknown')

# checking the values once again...
for feature in ["SPEEDING", "ROADCOND"]:
    print(df_data_1[feature].unique())

['N' 'Y']
['Wet' 'Dry' 'Unknown' 'Snow/Slush' 'Ice' 'Other' 'Sand/Mud/Dirt'
'Standing Water' 'Oil']
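Listing unique values shows which categories exist, but not how many entries are missing. As a hedged sketch, the missing values could be counted before and after the fill step with `isna().sum()`. The small DataFrame below is invented for illustration and reuses only the column names and fill values from this post.

```python
import numpy as np
import pandas as pd

# Illustrative rows only; NaN marks a missing report field.
sample = pd.DataFrame({
    "SPEEDING": ["Y", np.nan, np.nan, "Y"],
    "ROADCOND": ["Wet", np.nan, "Dry", "Ice"],
})

# Number of missing entries per column before filling.
missing_before = sample.isna().sum()

# Same fill values as in the steps above, given per column.
filled = sample.fillna({"SPEEDING": "N", "ROADCOND": "Unknown"})
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
```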

# We group the road conditions into two categories
road_groups = {
    'Wet': 'Dangerous', 'Dry': 'Normal', 'Unknown': 'Normal',
    'Snow/Slush': 'Dangerous', 'Ice': 'Dangerous', 'Other': 'Normal',
    'Sand/Mud/Dirt': 'Dangerous', 'Standing Water': 'Dangerous', 'Oil': 'Dangerous',
}
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].replace(road_groups)

# We encode both features as integers (N=0, Y=1; Dangerous=0, Normal=1)
df_data_1['SPEEDING'] = df_data_1['SPEEDING'].map({'N': 0, 'Y': 1})
df_data_1['ROADCOND'] = df_data_1['ROADCOND'].map({'Dangerous': 0, 'Normal': 1})
test_condition = df_data_1[['SPEEDING','ROADCOND']]
test_condition.head()


| | SPEEDING | ROADCOND |
| --- |:---:|:---:|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 0 |

Training the model

x = test_condition
y = df_data_1['SEVERITYCODE'].values.astype(str)
x = preprocessing.StandardScaler().fit(x).transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1234)

# obtaining data dimensions
print("Training set: ", x_train.shape, y_train.shape)
print("Testing set: ", x_test.shape, y_test.shape)


Training set: (155738, 2) (155738,)
Testing set: (38935, 2) (38935,)
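Note that in the code above the scaler is fitted on the whole data set before the split. A common alternative, sketched here under assumptions, is to wrap the scaler and the model in a pipeline so that the scaling statistics come from the training split only. The data is synthetic, and the names `X`, `y` and `model` are illustrative rather than taken from the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary features standing in for SPEEDING and ROADCOND.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2)).astype(float)
y = rng.integers(1, 3, size=200).astype(str)  # labels '1' and '2'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1234
)

# The pipeline fits the scaler on the training split only, then reuses
# those statistics when transforming the test split at prediction time.
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.01))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print(test_accuracy)
```

With only two binary features the difference in scores is likely to be small, but the pipeline keeps the test split untouched during fitting.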

Selecting the methods: Tree model, Logistic Regression and KNN methodology

#Tree model
Tree_model = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree_model.fit(x_train, y_train)
predicted = Tree_model.predict(x_test)
Tree_f1 = f1_score(y_test, predicted, average='weighted')
Tree_acc = accuracy_score(y_test, predicted)
#Logistic Regression
LR_model = LogisticRegression(C=0.01, solver='liblinear').fit(x_train, y_train)
predicted = LR_model.predict(x_test)
LR_f1 = f1_score(y_test, predicted, average='weighted')
LR_acc = accuracy_score(y_test, predicted)

#KNN methodology
KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(x_train, y_train)
predicted = KNN_model.predict(x_test)
KNN_f1 = f1_score(y_test, predicted, average='weighted')
KNN_acc = accuracy_score(y_test, predicted)

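When judging these three models, it can help to know what a trivial classifier would score. The sketch below is not part of the original analysis: it uses scikit-learn's `DummyClassifier` on synthetic labels with an assumed 70/30 class split (the real class balance is not stated in this post) to show how accuracy and weighted F1 behave when a model always predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels with a 70/30 imbalance, chosen only for illustration.
y_train = np.array(["1"] * 70 + ["2"] * 30)
y_test = np.array(["1"] * 7 + ["2"] * 3)

# The features are irrelevant for this baseline, so zeros are enough.
X_train = np.zeros((len(y_train), 2))
X_test = np.zeros((len(y_test), 2))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
predicted = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, predicted)
baseline_f1 = f1_score(y_test, predicted, average="weighted", zero_division=0)
print(baseline_acc, baseline_f1)
```

If a trained model scores close to this kind of baseline on the real data, the features probably add little predictive signal.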

Results

Comparing the results obtained

results = {
    "Method of Analysis": ["KNN", "Decision Tree", "LogisticRegression"],
    "F1-score": [KNN_f1, Tree_f1, LR_f1],
    "Accuracy": [KNN_acc, Tree_acc, LR_acc]
}

results = pd.DataFrame(results)
results

| | Method of Analysis | F1-score | Accuracy |
| --- |:---:|:---:|:---:|
| 0 | KNN | 0.591378 | 0.69675 |
| 1 | Decision Tree | 0.576051 | 0.699679 |
| 2 | LogisticRegression | 0.576051 | 0.699679 |
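Accuracy and weighted F1 summarize all classes in a single number. As a hedged sketch, the `classification_report` function imported earlier can break the scores down per class. The labels and predictions below are hand-made for illustration and are not outputs of the models above.

```python
from sklearn.metrics import classification_report

# Hand-made labels and predictions, for illustration only.
y_true = ["1", "1", "1", "2", "2", "1"]
y_pred = ["1", "1", "1", "1", "2", "1"]

# Dictionary form, convenient for reading individual scores.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
recall_class_2 = report["2"]["recall"]
precision_class_1 = report["1"]["precision"]

# Text form, convenient for printing in a notebook.
print(classification_report(y_true, y_pred, zero_division=0))
```

A per-class view like this makes it visible when a model rarely predicts the less frequent severity code, which a single accuracy figure can hide.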

# Inspecting the intercept and coefficients of the logistic regression model
results = {
    "Intercept": LR_model.intercept_,
    "SPEEDING": LR_model.coef_[:,0],
    "ROADCOND": LR_model.coef_[:,1],
}

results = pd.DataFrame(results)
results


| | Intercept | SPEEDING | ROADCOND |
| --- |:---:|:---:|:---:|
| 0 | -0.853729 | 0.067702 | -0.068295 |

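Looking at the results, the logistic regression coefficients are non-zero for both features (0.068 for SPEEDING and -0.068 for ROADCOND), which suggests that speed and road conditions are associated with the severity of traffic accidents. However, the small coefficients and the similar scores of the three models (about 70% accuracy) indicate that these two features alone have limited predictive power.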
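One way to read these coefficients is to convert them to odds ratios. The sketch below is only an interpretation aid. It copies the values from the table above and assumes that the coefficients refer to the standardized features produced by `StandardScaler`, and to whichever class scikit-learn treats as the positive one, which this post does not state.

```python
import math

# Values copied from the coefficient table above.
intercept = -0.853729
coef_speeding = 0.067702
coef_roadcond = -0.068295

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase of the feature (one standard deviation here,
# assuming the features were standardized).
odds_ratio_speeding = math.exp(coef_speeding)
odds_ratio_roadcond = math.exp(coef_roadcond)

# Sigmoid of the intercept: predicted probability of the positive class
# when both standardized features are zero, i.e. at their mean values.
baseline_probability = 1.0 / (1.0 + math.exp(-intercept))

print(round(odds_ratio_speeding, 3))
print(round(odds_ratio_roadcond, 3))
print(round(baseline_probability, 3))
```

Odds ratios close to 1 mean that a feature shifts the predicted odds only slightly.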
