Would you die in an accident? Machine Learning with Data.gov.uk

Summary:

In this blog post we use open data from https://data.gov.uk/dataset/road-accidents-safety-data about accidents, vehicles and casualties to try to predict how severe a casualty's injuries will be when someone is involved in an accident.

Everything is done with Python 3.x, Pandas, Seaborn (optional) and Sklearn's out-of-the-box Random Forest Classifier, using the casualties dataset.

Short Steps:

1. Download Casualties_2015.csv at https://data.gov.uk/dataset/road-accidents-safety-data

Datasets

For this example we are going to use the casualties data for 2015 (CSV file).
Additionally, download the data guide, which explains the encoding used in the casualties CSV file.
Normally we would need to clean, encode and prepare our data, but in this case it is already done, which is great!

Necessary Libraries

import pandas as pd
# sklearn.cross_validation was removed in newer Sklearn versions; model_selection replaces it
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, accuracy_score
import seaborn as sns

Reading CSV

casualties = pd.read_csv("Casualties_2015.csv", index_col=False)

Using Pandas we read the casualties CSV file into a DataFrame. You can check http://therandomtechadventure.blogspot.com/2017/06/handling-csv-files-with-pandas-in-python.html to quickly get started with handling CSV files with Pandas.

Peeking At The Data

print(casualties.describe())

Vehicle_Reference  Casualty_Reference  Casualty_Class  Sex_of_Casualty  \
count      186189.000000       186189.000000   186189.000000    186189.000000   
mean            1.494804            1.414896        1.482413         1.406614   
std             0.660141            1.085014        0.712847         0.493200   
min             1.000000            1.000000        1.000000        -1.000000   
25%             1.000000            1.000000        1.000000         1.000000   
50%             1.000000            1.000000        1.000000         1.000000   
75%             2.000000            2.000000        2.000000         2.000000   
max            32.000000           38.000000        3.000000         2.000000   

       Age_of_Casualty  Age_Band_of_Casualty  Casualty_Severity  \
count    186189.000000         186189.000000      186189.000000   
mean         36.094023              6.245213           2.862484   
std          19.136416              2.386039           0.370391   
min          -1.000000             -1.000000           1.000000   
25%          22.000000              5.000000           3.000000   
50%          33.000000              6.000000           3.000000   
75%          49.000000              8.000000           3.000000   
max         104.000000             11.000000           3.000000   

       Pedestrian_Location  Pedestrian_Movement  Car_Passenger  \
count        186189.000000        186189.000000  186189.000000   
mean              0.672510             0.481731       0.256025   
std               1.951537             1.663567       0.575981   
min              -1.000000            -1.000000      -1.000000   
25%               0.000000             0.000000       0.000000   
50%               0.000000             0.000000       0.000000   
75%               0.000000             0.000000       0.000000   
max              10.000000             9.000000       2.000000   

       Bus_or_Coach_Passenger  Pedestrian_Road_Maintenance_Worker  \
count           186189.000000                       186189.000000   
mean                 0.079333                            0.060390   
std                  0.533912                            0.345357   
min                 -1.000000                           -1.000000   
25%                  0.000000                            0.000000   
50%                  0.000000                            0.000000   
75%                  0.000000                            0.000000   
max                  4.000000                            2.000000   

       Casualty_Type  Casualty_Home_Area_Type  Casualty_IMD_Decile  
count  186189.000000            186189.000000        186189.000000  
mean        7.277186                 1.045647             3.848992  
std         7.504565                 0.959094             3.491078  
min         0.000000                -1.000000            -1.000000  
25%         3.000000                 1.000000             1.000000  
50%         9.000000                 1.000000             4.000000  
75%         9.000000                 1.000000             7.000000  
max        98.000000                 3.000000            10.000000  

print(casualties.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186189 entries, 0 to 186188
Data columns (total 16 columns):
Accident_Index                        186189 non-null object
Vehicle_Reference                     186189 non-null int64
Casualty_Reference                    186189 non-null int64
Casualty_Class                        186189 non-null int64
Sex_of_Casualty                       186189 non-null int64
Age_of_Casualty                       186189 non-null int64
Age_Band_of_Casualty                  186189 non-null int64
Casualty_Severity                     186189 non-null int64
Pedestrian_Location                   186189 non-null int64
Pedestrian_Movement                   186189 non-null int64
Car_Passenger                         186189 non-null int64
Bus_or_Coach_Passenger                186189 non-null int64
Pedestrian_Road_Maintenance_Worker    186189 non-null int64
Casualty_Type                         186189 non-null int64
Casualty_Home_Area_Type               186189 non-null int64
Casualty_IMD_Decile                   186189 non-null int64
dtypes: int64(15), object(1)
memory usage: 22.7+ MB
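
Several columns in the describe() output have a minimum of -1. As far as I can tell from the data guide, -1 marks missing or out-of-range data, but treat that as an assumption and check the guide yourself. Here is a minimal sketch, on a small made-up DataFrame rather than the real file, of how you could count those values per column:

```python
import pandas as pd

# Toy stand-in for a few casualties columns (values invented for illustration).
toy = pd.DataFrame({
    "Sex_of_Casualty": [1, 2, -1, 1],
    "Age_of_Casualty": [27, -1, -1, 54],
    "Casualty_Severity": [3, 3, 2, 1],
})

# Count how often the assumed "missing" sentinel (-1) appears in each column.
missing_counts = (toy == -1).sum()
print(missing_counts)

# Optionally turn the sentinel into a real missing value (NaN).
toy_clean = toy.mask(toy == -1)
print(toy_clean.isna().sum())
```

On the real data you would run the same two lines on casualties instead of toy.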

import matplotlib.pyplot as plt

sample = casualties[['Sex_of_Casualty','Age_of_Casualty','Casualty_Severity']].sample(1000)
sns.set(style="ticks")
sns.pairplot(sample, hue="Casualty_Severity")
plt.show()  # sns.plt is not available in recent Seaborn versions


Seaborn can be handy in discovering relationships between features. You can find out more here https://seaborn.pydata.org/

Features

Feature engineering is a core part of machine learning, and personally I find it the most interesting one. It requires both logic and creativity. For this example, we are going to leave the data as it is and only use the features that we are interested in. In many cases, you can also create new features based on the problem that you are tackling.

Based on the features below we are going to predict the Casualty_Severity.

features = ['Sex_of_Casualty','Age_Band_of_Casualty','Pedestrian_Location',
'Pedestrian_Movement','Car_Passenger','Bus_or_Coach_Passenger',
'Pedestrian_Road_Maintenance_Worker']
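
To give an idea of what creating a new feature can look like, here is a small sketch on made-up rows. Is_Pedestrian and Is_Child are hypothetical names that are not used anywhere else in this post, the age threshold is arbitrary, and I am assuming that Pedestrian_Location code 0 means "not a pedestrian" (check the data guide).

```python
import pandas as pd

# Made-up rows, only for illustration.
toy = pd.DataFrame({
    "Pedestrian_Location": [0, 2, 0, 5],
    "Age_of_Casualty": [27, 8, 70, 35],
})

# Hypothetical feature: 1 when the casualty was a pedestrian.
# Assumes code 0 means "not a pedestrian".
toy["Is_Pedestrian"] = (toy["Pedestrian_Location"] > 0).astype(int)

# Hypothetical feature: 1 for children aged 0 to 15 (arbitrary threshold).
toy["Is_Child"] = toy["Age_of_Casualty"].between(0, 15).astype(int)

print(toy)
```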

Training

In order to train our model we need to split our data-set.

data_x_train, data_x_test, data_y_train, data_y_test = \
    train_test_split(casualties[features], casualties['Casualty_Severity'], test_size=0.25, random_state=42)
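
The describe() output suggests that most casualties fall in one severity class, so you may want the train and test sets to keep the same class proportions. A possible variation, sketched here on made-up data, is to pass the stratify argument of train_test_split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced data: 80 slight (3), 16 serious (2), 4 fatal (1).
toy_x = pd.DataFrame({"Age_Band_of_Casualty": list(range(1, 11)) * 10})
toy_y = pd.Series([3] * 80 + [2] * 16 + [1] * 4, name="Casualty_Severity")

# stratify keeps the class proportions the same in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    toy_x, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```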

Model

We are going to use the Random Forest Classifier to fit our model.

clf = RandomForestClassifier(n_estimators=16)
clf.fit(data_x_train, data_y_train)

Accuracy

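# Predict on the held-out test set, then compare with the true labels
clf_probs = clf.predict(data_x_test)
score = accuracy_score(data_y_test, clf_probs)
print("Single Score: %f" % score)

Single Score: 0.868072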
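
An accuracy of about 0.87 is easier to judge next to a baseline. In the describe() output the 25%, 50% and 75% quartiles of Casualty_Severity are all 3, so at least three quarters of the casualties are "Slight", and a model that always answers 3 would already score high. Below is a rough sketch of such a baseline with Sklearn's DummyClassifier, on made-up data rather than the real file:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up, imbalanced data: 170 slight (3), 25 serious (2), 5 fatal (1).
rng = np.random.RandomState(0)
toy_x = rng.randint(0, 5, size=(200, 3))
toy_y = np.array([3] * 170 + [2] * 25 + [1] * 5)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_x, toy_y)

baseline_score = accuracy_score(toy_y, baseline.predict(toy_x))
print("Baseline accuracy: %f" % baseline_score)
```

On the real data you would fit the baseline on data_x_train and data_y_train and score it on the test set, then compare that number with the Random Forest score.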

Feature Importance

print("Features & Importance:")
print(clf.feature_importances_)

[ 0.13536568  0.31775353  0.19876226  0.23288952  0.04297272  0.01941277
  0.05284351]

Once you have fitted your model you can check this attribute to see the impact of the features on predictions. The values are in the same order as the features list.
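
The raw array is hard to read, so here is a small sketch that pairs the numbers printed above with the feature names and sorts them. The importances are copied by hand from the output of this post; your own run will give slightly different values.

```python
import pandas as pd

features = ['Sex_of_Casualty', 'Age_Band_of_Casualty', 'Pedestrian_Location',
            'Pedestrian_Movement', 'Car_Passenger', 'Bus_or_Coach_Passenger',
            'Pedestrian_Road_Maintenance_Worker']

# Values copied from the printed output above.
importances = [0.13536568, 0.31775353, 0.19876226, 0.23288952,
               0.04297272, 0.01941277, 0.05284351]

# Label each importance with its feature name and sort from high to low.
ranked = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranked)
```

With a fitted model you would pass clf.feature_importances_ instead of the hard-coded list.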

Cross Validation (Optional Here)

scores = cross_val_score(clf, casualties[features], casualties['Casualty_Severity'], cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.87 (+/- 0.01)

In many cases you will have to check whether you are overfitting your model. Cross validation (CV) comes in handy here: it splits the data-set into train and test parts several times, as we did once before, and checks the result of each split. In this case the data-set is split into 5 folds.
Note that it is not using our previously fitted model; the classifier is cloned and refitted inside every cross validation fold.
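
To make the "cloned and refitted" part more concrete, here is a rough sketch of what happens for each fold, written by hand on made-up data. It is an approximation for illustration, not Sklearn's actual implementation; I use StratifiedKFold because, as far as I know, that is what cross_val_score picks for a classifier when cv is an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Made-up data: 3 random features and an imbalanced target.
rng = np.random.RandomState(0)
toy_x = rng.randint(0, 5, size=(100, 3))
toy_y = np.array([3] * 70 + [2] * 20 + [1] * 10)

clf = RandomForestClassifier(n_estimators=16, random_state=0)

fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(toy_x, toy_y):
    # A fresh, unfitted copy for every fold; clf itself is never fitted here.
    fold_clf = clone(clf)
    fold_clf.fit(toy_x[train_idx], toy_y[train_idx])
    fold_scores.append(fold_clf.score(toy_x[test_idx], toy_y[test_idx]))

fold_scores = np.array(fold_scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (fold_scores.mean(), fold_scores.std() * 2))
```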

Testing (And having fun)

Finally, now that we have a model ready, we can use it to test out some predictions. For example, let's create test.csv as follows:

Accident_Index,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker
test1,1,27,6,2,1,0,0,0
test2,2,27,6,2,1,0,0,0

The values are encoded, but using the data guide they can be read as follows:

test1: a male pedestrian crossing the road in zig-zag approach lines from a driver's nearside
test2: same as test1 but with a female pedestrian

test = pd.read_csv("test.csv", index_col=False)
severities = clf.predict(test[features])
severity_verbose = {1: "Fatal", 2:"Serious", 3:"Slight"}
for severity in severities:
    print(severity_verbose.get(severity))

The results:

Serious
Slight

According to the model, neither the man nor the woman would die (Fatal outcome), although the man
would sustain serious injury compared to slight injury for the woman.
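
A single label hides how sure the model is. If you are curious, predict_proba returns one probability per severity class. Here is a minimal sketch on a tiny made-up training set (so the numbers mean nothing); with the real model you would call clf.predict_proba(test[features]) instead.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny made-up training set, only to have a fitted model to call.
toy = pd.DataFrame({
    "Sex_of_Casualty":      [1, 2, 1, 2, 1, 2, 1, 2],
    "Age_Band_of_Casualty": [6, 6, 3, 8, 5, 7, 9, 4],
    "Casualty_Severity":    [2, 3, 3, 3, 1, 3, 2, 3],
})
cols = ["Sex_of_Casualty", "Age_Band_of_Casualty"]

toy_clf = RandomForestClassifier(n_estimators=16, random_state=0)
toy_clf.fit(toy[cols], toy["Casualty_Severity"])

# One row per casualty, one column per class (1=Fatal, 2=Serious, 3=Slight).
probs = pd.DataFrame(toy_clf.predict_proba(toy[cols].head(2)),
                     columns=toy_clf.classes_)
print(probs)
```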

Final Words

Machine learning is a vast subject but I hope that this blog post helps you get your feet wet.
Please comment if you have any questions, and subscribe to see me look at more data-sets and more
machine learning snippets.
Full source code is available here https://gist.github.com/Lougarou/101c39a0a60ab02c16ee9d405d8c457f

Final - Final Words

In case you want to dive deeper into the Road Safety Dataset and look into the other csv files, you can
use the snippet below to make a join.

vehicles = pd.read_csv("Vehicles_2015.csv", index_col=False)
# Option 1: merge on the shared Accident_Index column
result = casualties.merge(vehicles[['Accident_Index','Vehicle_Type']],
                          how='inner', on='Accident_Index')
# Option 2: join() matches against the other frame's index, so index vehicles by Accident_Index first
result = casualties.join(vehicles.set_index('Accident_Index'), on='Accident_Index', lsuffix="_l", rsuffix="_r")

You will notice that result has more rows than casualties, and that is because there can be
more than one vehicle involved in an accident. Cheers!
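
Here is a small sketch of that row growth on made-up tables. It also shows one possible way to avoid it, which is to match on Vehicle_Reference as well; I am assuming that the vehicles file has a Vehicle_Reference column like the casualties file does, so check the data guide before relying on it.

```python
import pandas as pd

# Made-up tables: accident A1 involves two vehicles, A2 involves one.
casualties_toy = pd.DataFrame({
    "Accident_Index": ["A1", "A2"],
    "Vehicle_Reference": [1, 1],
    "Casualty_Severity": [3, 2],
})
vehicles_toy = pd.DataFrame({
    "Accident_Index": ["A1", "A1", "A2"],
    "Vehicle_Reference": [1, 2, 1],
    "Vehicle_Type": [9, 1, 9],
})

# Joining on the accident only pairs each casualty with every vehicle in it.
by_accident = casualties_toy.merge(
    vehicles_toy[["Accident_Index", "Vehicle_Type"]],
    how="inner", on="Accident_Index")

# Joining on accident and vehicle pairs each casualty with its own vehicle.
by_vehicle = casualties_toy.merge(
    vehicles_toy, how="inner", on=["Accident_Index", "Vehicle_Reference"])

print(len(casualties_toy), len(by_accident), len(by_vehicle))
```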
