Would you die in an accident? Machine Learning with Data.gov.uk

June 30, 2017

Summary:

In this blog post we use open data from https://data.gov.uk/dataset/road-accidents-safety-data on accidents, vehicles and casualties to try to predict how severe a casualty's injuries will be when someone is involved in a road accident.

Everything is done with Python 3.x, Pandas, Seaborn (optional) and Sklearn's out-of-the-box Random Forest Classifier, applied to the casualties dataset.

Short Steps:

1. Download Casualties_2015.csv at https://data.gov.uk/dataset/road-accidents-safety-data

2. Run https://gist.github.com/Lougarou/101c39a0a60ab02c16ee9d405d8c457f

Datasets

For this example we are going to use the casualties data for 2015 (CSV file).

Additionally, download the data guide which explains the encoding used in the casualties CSV file.

Normally we need to clean, encode and prepare our data, but in this case it is already done, which is great!

Necessary Libraries

import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, accuracy_score
import seaborn as sns

Reading CSV

casualties = pd.read_csv("Casualties_2015.csv", index_col=False)

Using Pandas we read the casualties CSV file into a DataFrame. You can check http://therandomtechadventure.blogspot.com/2017/06/handling-csv-files-with-pandas-in-python.html to quickly get started with handling CSV files with Pandas.

Peeking At The Data

print(casualties.describe())

Vehicle_Reference  Casualty_Reference  Casualty_Class  Sex_of_Casualty  \
count      186189.000000       186189.000000   186189.000000    186189.000000   
mean            1.494804            1.414896        1.482413         1.406614   
std             0.660141            1.085014        0.712847         0.493200   
min             1.000000            1.000000        1.000000        -1.000000   
25%             1.000000            1.000000        1.000000         1.000000   
50%             1.000000            1.000000        1.000000         1.000000   
75%             2.000000            2.000000        2.000000         2.000000   
max            32.000000           38.000000        3.000000         2.000000   

       Age_of_Casualty  Age_Band_of_Casualty  Casualty_Severity  \
count    186189.000000         186189.000000      186189.000000   
mean         36.094023              6.245213           2.862484   
std          19.136416              2.386039           0.370391   
min          -1.000000             -1.000000           1.000000   
25%          22.000000              5.000000           3.000000   
50%          33.000000              6.000000           3.000000   
75%          49.000000              8.000000           3.000000   
max         104.000000             11.000000           3.000000   

       Pedestrian_Location  Pedestrian_Movement  Car_Passenger  \
count        186189.000000        186189.000000  186189.000000   
mean              0.672510             0.481731       0.256025   
std               1.951537             1.663567       0.575981   
min              -1.000000            -1.000000      -1.000000   
25%               0.000000             0.000000       0.000000   
50%               0.000000             0.000000       0.000000   
75%               0.000000             0.000000       0.000000   
max              10.000000             9.000000       2.000000   

       Bus_or_Coach_Passenger  Pedestrian_Road_Maintenance_Worker  \
count           186189.000000                       186189.000000   
mean                 0.079333                            0.060390   
std                  0.533912                            0.345357   
min                 -1.000000                           -1.000000   
25%                  0.000000                            0.000000   
50%                  0.000000                            0.000000   
75%                  0.000000                            0.000000   
max                  4.000000                            2.000000   

       Casualty_Type  Casualty_Home_Area_Type  Casualty_IMD_Decile  
count  186189.000000            186189.000000        186189.000000  
mean        7.277186                 1.045647             3.848992  
std         7.504565                 0.959094             3.491078  
min         0.000000                -1.000000            -1.000000  
25%         3.000000                 1.000000             1.000000  
50%         9.000000                 1.000000             4.000000  
75%         9.000000                 1.000000             7.000000  
max        98.000000                 3.000000            10.000000

print(casualties.info())

RangeIndex: 186189 entries, 0 to 186188
Data columns (total 16 columns):
Accident_Index                        186189 non-null object
Vehicle_Reference                     186189 non-null int64
Casualty_Reference                    186189 non-null int64
Casualty_Class                        186189 non-null int64
Sex_of_Casualty                       186189 non-null int64
Age_of_Casualty                       186189 non-null int64
Age_Band_of_Casualty                  186189 non-null int64
Casualty_Severity                     186189 non-null int64
Pedestrian_Location                   186189 non-null int64
Pedestrian_Movement                   186189 non-null int64
Car_Passenger                         186189 non-null int64
Bus_or_Coach_Passenger                186189 non-null int64
Pedestrian_Road_Maintenance_Worker    186189 non-null int64
Casualty_Type                         186189 non-null int64
Casualty_Home_Area_Type               186189 non-null int64
Casualty_IMD_Decile                   186189 non-null int64
dtypes: int64(15), object(1)
memory usage: 22.7+ MB
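
One thing the summary above hints at: several columns have a minimum of -1. As far as I can tell from the data guide, -1 marks data that is missing or out of range, so it can be worth counting those rows before trusting a feature. Here is a minimal sketch of that check. The rows are made up for illustration (the name toy is mine); with the real file you would run the same two lines on casualties.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow Casualties_2015.csv.
toy = pd.DataFrame({
    "Sex_of_Casualty": [1, 2, -1, 1],
    "Age_Band_of_Casualty": [6, -1, -1, 8],
    "Car_Passenger": [0, 0, 1, 2],
})

# Count how many rows in each column carry the -1 placeholder.
missing_counts = (toy == -1).sum()
print(missing_counts)

# Keep only the rows where no column holds -1.
complete_rows = toy[(toy != -1).all(axis=1)]
print(len(complete_rows))
```

Whether to drop such rows or keep -1 as its own category is a judgment call; a tree-based model can often live with the placeholder.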

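import matplotlib.pyplot as plt

# Plot a random sample of 1000 rows; plotting every row would be slow and cluttered.
sample = casualties[['Sex_of_Casualty','Age_of_Casualty','Casualty_Severity']].sample(1000)
sns.set(style="ticks")
sns.pairplot(sample, hue="Casualty_Severity")

# Recent Seaborn versions no longer expose sns.plt, so call matplotlib directly.
plt.show()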

Seaborn can be handy for discovering relationships between features. You can find out more at https://seaborn.pydata.org/

Features

Feature engineering is a core part of machine learning, and personally I find it the most interesting one. It requires both logic and creativity. For this example, we are going to leave the data as it is and only use the features that we are interested in. In many cases, you can also create new features based on the problem that you are tackling.

Based on the features below we are going to predict the Casualty_Severity.

features = ['Sex_of_Casualty','Age_Band_of_Casualty','Pedestrian_Location',
            'Pedestrian_Movement','Car_Passenger','Bus_or_Coach_Passenger',
            'Pedestrian_Road_Maintenance_Worker']

Training

In order to train our model we need to split our data-set into a training set and a test set. Here 25% of the rows are held back for testing.

data_x_train, data_x_test, data_y_train, data_y_test = \
    train_test_split(casualties[features], casualties['Casualty_Severity'], test_size=0.25, random_state=42)
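
A variation worth knowing about, which the code in this post does not use: when the classes are unbalanced, train_test_split accepts a stratify argument that keeps the class proportions the same in both halves. The sketch below runs on synthetic data (X, y and the class counts are invented for illustration), so the exact numbers mean nothing; it only shows the call and the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, labels 1/2/3 with a heavy skew towards 3.
rng = np.random.RandomState(42)
X = rng.randint(0, 5, size=(1000, 3))
y = np.array([3] * 870 + [2] * 120 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
# Share of each class (1, 2, 3) in the test set stays close to the full data.
print(np.bincount(y_test)[1:] / len(y_test))
```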

Model

We are going to use the Random Forest Classifier to fit our model.

clf = RandomForestClassifier(n_estimators=16)
clf.fit(data_x_train, data_y_train)



Accuracy

clf_probs = clf.predict(data_x_test)  # predicted severity class for each test row
score = accuracy_score(data_y_test, clf_probs)
print("Single Score:", score)

Single Score: 0.868071667956
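
Before celebrating that number, it helps to compare it with a naive baseline. The describe() output earlier shows that even the 25th percentile of Casualty_Severity is 3, so at least three quarters of the casualties are "Slight", and a model that always answers "Slight" already scores well. Below is a minimal sketch of such a baseline using Sklearn's DummyClassifier. The label counts are invented for illustration; on the real data you would fit it on data_x_train and data_y_train.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels, skewed roughly like Casualty_Severity (mostly 3 = Slight).
y_train = np.array([3] * 870 + [2] * 120 + [1] * 10)
y_test = np.array([3] * 87 + [2] * 12 + [1] * 1)
# This baseline ignores the features, so a column of zeros is enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predict whichever class was most common in the training labels.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_score = accuracy_score(y_test, baseline.predict(X_test))
print("Baseline accuracy:", baseline_score)
```

If the Random Forest does not clearly beat this kind of baseline, accuracy alone is probably not telling us much about the rarer classes.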


Feature Importance

print("Features & Importance:")
print(clf.feature_importances_)

[ 0.13536568  0.31775353  0.19876226  0.23288952  0.04297272  0.01941277
  0.05284351]

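Once you have fitted your model you can check this attribute to see the relative impact of each feature on the predictions. The values come in the same order as the features list, so Age_Band_of_Casualty (0.32) carries the most weight here and Bus_or_Coach_Passenger (0.02) the least.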
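
A bare array of numbers is hard to read, so here is a small sketch that pairs each importance with its feature name and sorts them. The values are copied from the output printed above; with a fitted model you would pass clf.feature_importances_ instead. The name ranked is just mine.

```python
features = ['Sex_of_Casualty', 'Age_Band_of_Casualty', 'Pedestrian_Location',
            'Pedestrian_Movement', 'Car_Passenger', 'Bus_or_Coach_Passenger',
            'Pedestrian_Road_Maintenance_Worker']
# Values copied from the printed output in this post.
importances = [0.13536568, 0.31775353, 0.19876226, 0.23288952,
               0.04297272, 0.01941277, 0.05284351]

# Pair each name with its importance and sort from most to least important.
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print("%-36s %.3f" % (name, value))
```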

Cross Validation (Optional Here)

scores = cross_val_score(clf, casualties[features], casualties['Casualty_Severity'], cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))



Accuracy: 0.87 (+/- 0.01)

In many cases you will have to check whether you are overfitting your model, and CV comes in handy: it splits the data-set several times, much as we did before, and checks the results on each split. In this case the data is split into 5 folds. Note that it does not use our previously fitted model; the classifier is cloned and refitted inside every cross-validation fold.
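
The cloning behaviour is easy to verify. In this minimal sketch the data comes from make_classification, a synthetic stand-in for the casualties features, and the classifier is deliberately never fitted by hand. After cross_val_score returns, the original object still has no trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the casualties features and labels.
X, y = make_classification(n_samples=300, n_features=7, random_state=42)

clf = RandomForestClassifier(n_estimators=16, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
print(len(scores))

# cross_val_score fits clones, so clf itself has not been fitted.
print(hasattr(clf, "estimators_"))
```

So if you want to make predictions afterwards, you still need to call fit on the classifier yourself.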

Testing (And having fun)

Finally, now that we have a model ready, we can use it to test out some predictions. For example, let's create test.csv as follows:

Accident_Index,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker
test1,1,27,6,2,1,0,0,0
test2,2,27,6,2,1,0,0,0

The values are encoded, but using the data guide they can be read as follows:
test1: a 27-year-old male pedestrian crossing the road in zig-zag approach lines, from the driver's nearside
test2: same as test1 but with a female pedestrian



test = pd.read_csv("test.csv", index_col=False)
severities = clf.predict(test[features])
severity_verbose = {1: "Fatal", 2:"Serious", 3:"Slight"}
for severity in severities:
    print(severity_verbose.get(severity))

The results:


Serious

Slight


It seems that neither the man nor the woman would die (Fatal outcome), although the model predicts a serious injury for the man compared to a slight injury for the woman.
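
A single label hides how sure the model is. Random Forests in Sklearn also offer predict_proba, which returns one probability per class. The sketch below uses synthetic three-class data (not the casualties file) with the labels shifted to 1, 2, 3 to mimic the severity codes, so the printed numbers are only illustrative; on the real model you would call clf.predict_proba(test[features]).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class data; labels shifted to 1, 2, 3 to mimic the severity codes.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_classes=3, random_state=42)
y = y + 1

clf = RandomForestClassifier(n_estimators=16, random_state=42)
clf.fit(X, y)

severity_verbose = {1: "Fatal", 2: "Serious", 3: "Slight"}
probabilities = clf.predict_proba(X[:2])
for row in probabilities:
    # clf.classes_ gives the label that each probability column refers to.
    print({severity_verbose[int(label)]: round(float(p), 2)
           for label, p in zip(clf.classes_, row)})
```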

Final Words

Machine learning is a vast subject, but I hope that this blog post helps you get your feet wet.

Please comment if you have any questions, and subscribe to see me look at more data-sets and more machine learning snippets.


Full source code is available here
https://gist.github.com/Lougarou/101c39a0a60ab02c16ee9d405d8c457f



Final - Final Words

In case you want to dive deeper into the Road Safety Dataset and look into the other CSV files, you can use the snippet below to make a join.

vehicles = pd.read_csv("Vehicles_2015.csv", index_col=False)
# Inner join on Accident_Index, bringing in only Vehicle_Type from the vehicles file.
result = casualties.merge(vehicles[['Accident_Index','Vehicle_Type']],
                          how='inner', on='Accident_Index')
# Alternative with join(): the right-hand frame must be indexed by the join key.
result = casualties.join(vehicles.set_index('Accident_Index'), on='Accident_Index',
                         lsuffix="_l", rsuffix="_r")

You will notice that result has more rows than casualties; that is because there can be more than one vehicle involved in an accident. Cheers!
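
To see that row growth on something small, here is a sketch with two made-up frames (the names casualties_toy and vehicles_toy and all the values are invented). Accident "A1" involves two vehicles, so its single casualty row appears twice after the merge.

```python
import pandas as pd

# Made-up rows: accident "A1" involves two vehicles, "A2" involves one.
casualties_toy = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Casualty_Severity": [3, 2],
})
vehicles_toy = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2"],
    "Vehicle_Type": [9, 1, 9],
})

# Each casualty row is repeated once per matching vehicle row.
merged = casualties_toy.merge(vehicles_toy, how="inner", on="Accident_Index")
print(len(casualties_toy), len(merged))
```

If you want one row per casualty, you would need a key that also identifies the vehicle. Casualties_2015.csv has a Vehicle_Reference column; I am assuming the vehicles file carries the same column, in which case merging on both keys should avoid the duplication.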
