Would you die in an accident? Machine Learning with Data.gov.uk
Summary:
In this blog post we use the open data available at https://data.gov.uk/dataset/road-accidents-safety-data about accidents, vehicles and casualties to try to predict how severe a casualty's injuries will be when someone is involved in an accident.
Everything is done with Python 3.x, Pandas, Seaborn (optional), scikit-learn and its out-of-the-box Random Forest Classifier on the casualties dataset.
Short Steps:
1. Download Casualties_2015.csv from https://data.gov.uk/dataset/road-accidents-safety-data
Datasets
For this example we are going to use the casualties data for 2015 (CSV file).
Additionally, download the data guide which explains the encoding used in the casualties CSV file.
Normally we would need to clean, encode and prepare our data, but in this case it is already done, which is great!
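If your own dataset is not this tidy, the preparation step usually looks something like the sketch below. This is a hypothetical illustration only: the Weather and Speed_Limit columns are made up and are not part of the casualties file.
import pandas as pd
# Hypothetical raw data: a text category with a missing value
raw = pd.DataFrame({'Weather': ['Rain', 'Snow', None, 'Rain'],
                    'Speed_Limit': [30, 60, 30, 70]})
# Fill missing values with a sentinel, mirroring the -1 'unknown' codes
# that the casualties file already uses
raw['Weather'] = raw['Weather'].fillna('Unknown')
# Encode the text category as integer codes, as this dataset already does
raw['Weather'] = raw['Weather'].astype('category').cat.codes
print(raw)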
Necessary Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Note: train_test_split and cross_val_score live in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
Reading CSV
casualties = pd.read_csv("Casualties_2015.csv", index_col=False)
Using Pandas we read the casualties CSV file into a DataFrame. You can check http://therandomtechadventure.blogspot.com/2017/06/handling-csv-files-with-pandas-in-python.html to quickly get started with handling CSV files with Pandas.
Peeking At The Data
print(casualties.describe())
       Vehicle_Reference  Casualty_Reference  Casualty_Class  Sex_of_Casualty
count      186189.000000       186189.000000   186189.000000    186189.000000
mean            1.494804            1.414896        1.482413         1.406614
std             0.660141            1.085014        0.712847         0.493200
min             1.000000            1.000000        1.000000        -1.000000
25%             1.000000            1.000000        1.000000         1.000000
50%             1.000000            1.000000        1.000000         1.000000
75%             2.000000            2.000000        2.000000         2.000000
max            32.000000           38.000000        3.000000         2.000000

       Age_of_Casualty  Age_Band_of_Casualty  Casualty_Severity
count    186189.000000         186189.000000      186189.000000
mean         36.094023              6.245213           2.862484
std          19.136416              2.386039           0.370391
min          -1.000000             -1.000000           1.000000
25%          22.000000              5.000000           3.000000
50%          33.000000              6.000000           3.000000
75%          49.000000              8.000000           3.000000
max         104.000000             11.000000           3.000000

       Pedestrian_Location  Pedestrian_Movement  Car_Passenger
count        186189.000000        186189.000000  186189.000000
mean              0.672510             0.481731       0.256025
std               1.951537             1.663567       0.575981
min              -1.000000            -1.000000      -1.000000
25%               0.000000             0.000000       0.000000
50%               0.000000             0.000000       0.000000
75%               0.000000             0.000000       0.000000
max              10.000000             9.000000       2.000000

       Bus_or_Coach_Passenger  Pedestrian_Road_Maintenance_Worker
count           186189.000000                       186189.000000
mean                 0.079333                            0.060390
std                  0.533912                            0.345357
min                 -1.000000                           -1.000000
25%                  0.000000                            0.000000
50%                  0.000000                            0.000000
75%                  0.000000                            0.000000
max                  4.000000                            2.000000

       Casualty_Type  Casualty_Home_Area_Type  Casualty_IMD_Decile
count  186189.000000            186189.000000        186189.000000
mean        7.277186                 1.045647             3.848992
std         7.504565                 0.959094             3.491078
min         0.000000                -1.000000            -1.000000
25%         3.000000                 1.000000             1.000000
50%         9.000000                 1.000000             4.000000
75%         9.000000                 1.000000             7.000000
max        98.000000                 3.000000            10.000000
print(casualties.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 186189 entries, 0 to 186188
Data columns (total 16 columns):
Accident_Index 186189 non-null object
Vehicle_Reference 186189 non-null int64
Casualty_Reference 186189 non-null int64
Casualty_Class 186189 non-null int64
Sex_of_Casualty 186189 non-null int64
Age_of_Casualty 186189 non-null int64
Age_Band_of_Casualty 186189 non-null int64
Casualty_Severity 186189 non-null int64
Pedestrian_Location 186189 non-null int64
Pedestrian_Movement 186189 non-null int64
Car_Passenger 186189 non-null int64
Bus_or_Coach_Passenger 186189 non-null int64
Pedestrian_Road_Maintenance_Worker 186189 non-null int64
Casualty_Type 186189 non-null int64
Casualty_Home_Area_Type 186189 non-null int64
Casualty_IMD_Decile 186189 non-null int64
dtypes: int64(15), object(1)
memory usage: 22.7+ MB
sample = casualties[['Sex_of_Casualty','Age_of_Casualty','Casualty_Severity']].sample(1000)
sns.set(style="ticks")
sns.pairplot(sample, hue="Casualty_Severity")
plt.show()  # sns.plt was removed from Seaborn; use matplotlib's pyplot directly
Seaborn can be handy in discovering relationships between features. You can find out more here https://seaborn.pydata.org/
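Another quick way to scan for relationships, sketched below with the matplotlib import from above and an arbitrary pick of columns, is a correlation heatmap:
# Correlations between a few numeric columns and the target
cols = ['Sex_of_Casualty', 'Age_of_Casualty', 'Car_Passenger', 'Casualty_Severity']
sns.heatmap(casualties[cols].corr(), annot=True, cmap='coolwarm')
plt.show()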
Features
Feature engineering is a core part of machine learning and, personally, the part I find most interesting. It requires both logic and creativity. For this example we are going to leave the data as it is and only use the features we are interested in; in many cases you can also create new features based on the problem you are tackling.
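As a sketch of what a new feature could look like here (not something used in the rest of this post), you could derive a pedestrian flag from an existing encoded column:
# Hypothetical derived feature: 1 if the casualty was a pedestrian, else 0.
# Per the data guide, Pedestrian_Location is 0 for non-pedestrians (-1 is unknown).
casualties['Is_Pedestrian'] = (casualties['Pedestrian_Location'] > 0).astype(int)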
Based on the features below we are going to predict the Casualty_Severity.
features = ['Sex_of_Casualty','Age_Band_of_Casualty','Pedestrian_Location',
'Pedestrian_Movement','Car_Passenger','Bus_or_Coach_Passenger',
'Pedestrian_Road_Maintenance_Worker']
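Before training, it is also worth glancing at how the target is distributed (the codes, per the data guide, are 1 = Fatal, 2 = Serious, 3 = Slight):
# Class balance of the target; most casualties are coded 3 (Slight),
# which is worth remembering when judging raw accuracy later
print(casualties['Casualty_Severity'].value_counts())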
Training
In order to train our model we first need to split our dataset into training and test sets.
data_x_train, data_x_test, data_y_train, data_y_test = \
    train_test_split(casualties[features], casualties['Casualty_Severity'],
                     test_size=0.25, random_state=42)
Model
We are going to use the Random Forest Classifier to fit our model.
clf = RandomForestClassifier(n_estimators=16)
clf.fit(data_x_train, data_y_train)
Accuracy
predictions = clf.predict(data_x_test)
score = accuracy_score(data_y_test, predictions)
print("Single Score: %f" % score)
Single Score: 0.868072
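Since 'Slight' dominates the data, raw accuracy is flattering on its own, so a per-class breakdown can be worth a look; a small sketch with sklearn's confusion_matrix:
from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted: 1 = Fatal, 2 = Serious, 3 = Slight
print(confusion_matrix(data_y_test, predictions))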
Feature Importance
print("Features & Importance:") print(clf.feature_importances_)
Once you have fitted your model you can check this parameter to see the impact of the features on predictions.[ 0.13536568 0.31775353 0.19876226 0.23288952 0.04297272 0.01941277 0.05284351]
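The raw array is easier to read when paired with the feature names, as in this small sketch:
# Pair each feature name with its importance and sort, highest first
for name, importance in sorted(zip(features, clf.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print("%s: %.3f" % (name, importance))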
Cross Validation (Optional Here)
scores = cross_val_score(clf, casualties[features], casualties['Casualty_Severity'], cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.87 (+/- 0.01)
In many cases you will have to check whether you are overfitting your model, and cross-validation comes in handy: it splits the dataset, as we did before, several times over and checks the results. In this example it is split into 5 folds.
Note that it does not use our previously fitted model; the classifier is cloned and refitted inside every cross-validation run.
Testing (And having fun)
Finally, now that we have a model ready, we can use it to test out some predictions. For example, let's create test.csv as follows:
Accident_Index,Sex_of_Casualty,Age_of_Casualty,Age_Band_of_Casualty,Pedestrian_Location,Pedestrian_Movement,Car_Passenger,Bus_or_Coach_Passenger,Pedestrian_Road_Maintenance_Worker
test1,1,27,6,2,1,0,0,0
test2,2,27,6,2,1,0,0,0
The values are encoded, but using the data guide they can be read as follows:
test1: a male pedestrian crossing the road in zig-zag approach lines from the driver's nearside
test2: same as test1 but with a female pedestrian
The results:
test = pd.read_csv("test.csv", index_col=False)
severities = clf.predict(test[features])
severity_verbose = {1: "Fatal", 2: "Serious", 3: "Slight"}
for severity in severities:
    print(severity_verbose.get(severity))
Serious
Slight
It seems that neither the man nor the woman would die (a Fatal outcome), although the man would sustain a serious injury compared to a slight injury for the woman.
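If you want more nuance than a single label, the classifier can also report per-class probabilities; a sketch using the same test frame:
# Probability of each severity class; columns follow clf.classes_ (1, 2, 3)
probabilities = clf.predict_proba(test[features])
for row in probabilities:
    print(dict(zip(clf.classes_, row.round(3))))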
Final Words
Machine learning is a vast subject, but I hope this blog post helps you get your feet wet. Please comment if you have any questions and subscribe to see me look at more datasets and more machine learning snippets.
Full source code is available here: https://gist.github.com/Lougarou/101c39a0a60ab02c16ee9d405d8c457f
Final - Final Words
In case you want to dive deeper into the Road Safety dataset and look into the other CSV files, you can use the snippet below to make a join.
vehicles = pd.read_csv("Vehicles_2015.csv", index_col=False)
# Merge on the shared Accident_Index column, keeping just the vehicle type
result = casualties.merge(vehicles[['Accident_Index', 'Vehicle_Type']], how='inner',
                          left_on=['Accident_Index'], right_on=['Accident_Index'])
# An equivalent join; vehicles must be indexed by Accident_Index for join to match on it
result = casualties.join(vehicles.set_index('Accident_Index'), on='Accident_Index',
                         lsuffix="_l", rsuffix="_r")
You will notice that the size of result is greater than that of casualties, and that is because there can be more than one vehicle involved in an accident.
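A quick sanity check along these lines (a sketch) makes the row-count difference visible:
# result gains a row for every extra vehicle in an accident
print(len(casualties), len(result))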
Cheers!