This is a part of my IBM Data Science Capstone Project submission. Source Code is available on this GitHub repository.
The Open Data Program makes the data generated by the City of Seattle openly available to the public in order to improve the quality of life of residents, increase transparency, accountability and comparability, promote economic development and research, and improve internal performance management.
The Traffic Records Group, Traffic Management Division, Seattle Department of Transportation, provides data for all collisions and crashes that have occurred in the city from 2004 to the present day. The data is updated weekly and can be found at the Seattle Open GeoData Portal.
The objective is to use this data to extract the features that enable a model to predict the severity of future collisions in the city. This would in turn help the Department of Transportation prioritise its standard operating procedures (SOPs) and focus its efforts on reducing the number of fatalities that result from automobile collisions.
The dataset is available as comma-separated values (CSV) files, KML files, and ESRI shapefiles that can be downloaded from the Seattle Open GeoData Portal. The data is also available from RESTful API services in formats such as GeoJSON.
We download the dataset to our project directory and take a look at the data types and the dimensionality of the data. We can see that the dataset contains 221,389 records and 40 fields.
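A minimal sketch of this inspection step, assuming pandas is available. The file name in the comment is hypothetical, and a tiny in-memory sample with made-up rows stands in for the real download so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded file; with the real data this would be
# something like pd.read_csv("Collisions.csv") (hypothetical file name).
sample_csv = io.StringIO(
    "X,Y,SEVERITYCODE,WEATHER,VEHCOUNT\n"
    "-122.33,47.61,1,Clear,2\n"
    "-122.31,47.60,2,Raining,2\n"
    ",,2b,Overcast,1\n"
)
df = pd.read_csv(sample_csv)

n_records, n_fields = df.shape   # dimensionality: (rows, columns)
print(f"{n_records} records, {n_fields} fields")
print(df.dtypes)                 # data type inferred for each column
```

On the full dataset, the same two calls would report the record and field counts quoted above.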
The metadata of the dataset can be found on the website of the Seattle Department of Transportation. The dataset summary gives the description of each field and its possible values.
The data contains several categorical fields and corresponding descriptions which could help us in further analysis. We make an attempt at understanding the data in terms of the fields that we shall take into account for later stages of model building.
The X and Y fields denote the longitude and latitude of the collisions. We can visualize the first few non-null collisions on a map.
The WEATHER field contains a description of the weather conditions at the time of the collision. The ROADCOND field describes the condition of the road during the collision. The LIGHTCOND field describes the light conditions during the collision. The SPEEDING field classifies collisions based on whether or not speeding was a factor in the collision. Blanks indicate cases where the vehicle was not speeding.
The SEVERITYCODE field contains a code that corresponds to the severity of the collision, and SEVERITYDESC contains a detailed description of the severity of the collision. We can conclude that there were 349 collisions that resulted in at least one fatality, and 3,102 collisions that resulted in serious injuries. The following table lists the meaning of each of the codes used in the SEVERITYCODE field.
|Code|Meaning|
|---|---|
|1|Accidents resulting in property damage|
|2|Accidents resulting in injuries|
|2b|Accidents resulting in serious injuries|
|3|Accidents resulting in fatalities|
|0|Data unavailable (blanks)|
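The table above can be expressed as a lookup for tallying collisions by severity. This is only a sketch: the helper name `severity_counts` and the sample codes are made up for illustration, and blank values are treated as code 0 as the table describes.

```python
from collections import Counter

# Code-to-meaning mapping taken from the table above.
SEVERITY_MEANING = {
    "1": "Property damage",
    "2": "Injuries",
    "2b": "Serious injuries",
    "3": "Fatalities",
    "0": "Data unavailable",
}

def severity_counts(codes):
    """Count collisions per severity meaning; blank codes count as '0'."""
    normalized = [(c or "").strip() or "0" for c in codes]
    return Counter(SEVERITY_MEANING[c] for c in normalized)

# Made-up sample values, for illustration only.
counts = severity_counts(["1", "2", "2b", "", "1", "3", None])
print(counts)
```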
The UNDERINFL field describes whether or not a driver involved was under the influence of drugs or alcohol. Values such as N denote that the driver was not under the influence, while values such as Y denote that they were. The person-count field and VEHCOUNT indicate how many people and vehicles were involved in a collision, respectively.
As the dataset has possibly been sourced from a database table, it contains several unique identifiers and spatial features that may be irrelevant to further statistical analysis. These fields include REPORTNO. Other fields such as LOCATION, and their corresponding descriptions (if any), are categorical but have a large number of distinct values, which makes them of little use for analysis. Fields such as INCDTTM denote the date and time of the incident but may not be of use in further analyses. The data therefore needs to be pre-processed.
After dropping irrelevant columns and null values and performing data cleaning, we are left with a dataset of 171,380 rows.
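A minimal sketch of the dropping step, assuming pandas. Only the column names come from the text; the rows are made up, and the exact list of columns dropped in the project may be longer.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the text.
df = pd.DataFrame({
    "REPORTNO": ["A1", "A2", "A3", "A4"],
    "LOCATION": ["1ST AVE", "2ND AVE", None, "4TH AVE"],
    "INCDTTM": ["1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020"],
    "WEATHER": ["Clear", None, "Raining", "Overcast"],
    "SEVERITYCODE": ["1", "2", "2b", "3"],
})

# Identifier, location and timestamp fields described as irrelevant above.
irrelevant = ["REPORTNO", "LOCATION", "INCDTTM"]

cleaned = (
    df.drop(columns=irrelevant)   # remove columns not used for modelling
      .dropna()                   # remove rows with any remaining null value
      .reset_index(drop=True)
)
print(cleaned.shape)
```

Dropping the columns first matters here: a null in a column that is about to be discarded (such as LOCATION in the third row) should not cost a usable record.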
After fixing other data inconsistencies, we one-hot encode categorical fields such as LIGHTCOND. Shuffling the dataset is also necessary as it is an unbalanced dataset.
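A minimal sketch of one-hot encoding and shuffling, assuming pandas. The category values and the fixed `random_state` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows; LIGHTCOND and VEHCOUNT are field names from the dataset.
df = pd.DataFrame({
    "LIGHTCOND": ["Daylight", "Dark", "Daylight", "Dusk"],
    "VEHCOUNT": [2, 1, 3, 2],
})

# One binary column per category; the original categorical column is dropped.
encoded = pd.get_dummies(df, columns=["LIGHTCOND"])

# Shuffle all rows; random_state is fixed here only for reproducibility.
shuffled = encoded.sample(frac=1, random_state=42).reset_index(drop=True)
print(shuffled.columns.tolist())
```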
Finding the correlation among the features of the dataset helps understand the data better. For example, in the heatmap shown below, it can be observed that some features have a strong positive / negative correlation while most of them have weak / no correlation.
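A correlation matrix of this kind can be computed as sketched below, assuming pandas. The columns DOUBLED and REVERSED are made up to show the two extremes of the scale.

```python
import pandas as pd

# Made-up numeric features, for illustration only.
df = pd.DataFrame({
    "VEHCOUNT": [1, 2, 3, 4, 5],
    "DOUBLED": [2, 4, 6, 8, 10],   # perfectly positively correlated
    "REVERSED": [5, 4, 3, 2, 1],   # perfectly negatively correlated
})

corr = df.corr()   # pairwise Pearson correlation, values in [-1, 1]
print(corr.round(2))
```

The resulting matrix is what a heatmap function such as seaborn's `heatmap` would render.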
The sets x and y are then constructed. The set x contains all the training examples and y contains all the labels. Feature scaling is applied to normalize the data to a specific range.
After normalization, they are split into x_train, y_train, x_test and y_test. The first two sets are used for training and the last two for testing. With the chosen split ratio, 80% of the data is used for training and 20% is used for testing.
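A minimal sketch of scaling followed by an 80/20 split, assuming scikit-learn. The choice of `MinMaxScaler`, the synthetic data and the `random_state` are assumptions; the text does not say which scaler was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up data: 10 examples with 2 features, and one label per example.
x = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])

# Scale every feature to the [0, 1] range (scaler choice is an assumption).
x_scaled = MinMaxScaler().fit_transform(x)

# 80% of the rows for training, 20% for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.2, random_state=42
)
print(x_train.shape, x_test.shape)
```

This follows the order described in the text (normalize, then split). Fitting the scaler on the training rows only is a common alternative that keeps test-set statistics out of training.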
Modelling and Evaluation
Decision Tree Classifier
A Decision Tree makes decisions using a tree-like model. It splits the sample into two or more homogeneous sets (leaves) based on the most significant differentiators in the input variables. To choose a differentiator (predictor), the algorithm considers all features and does a binary split on each of them (for categorical data, split by category; for continuous data, pick a cut-off threshold). It then chooses the split with the least cost (i.e. highest accuracy) and repeats recursively until it has successfully split the data in all leaves (or reached the maximum depth).
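The threshold search for a single continuous feature can be sketched as follows. This is a simplified illustration that uses the misclassification count as the cost, not the implementation used in the project; the function names and sample values are made up.

```python
from collections import Counter

def misclassified(labels):
    """Errors made if a leaf predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_threshold(values, labels):
    """Return (threshold, cost) of the binary split with the fewest errors."""
    best = (None, misclassified(labels))   # cost of not splitting at all
    distinct = sorted(set(values))
    # Candidate cut-offs are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        cost = misclassified(left) + misclassified(right)
        if cost < best[1]:
            best = (t, cost)
    return best

# Made-up example: vehicle counts against a made-up severity label.
vehicles = [1, 1, 2, 2, 3, 4]
severity = ["1", "1", "1", "2", "2", "2"]
print(best_threshold(vehicles, severity))   # (1.5, 1)
```

A full tree would run this search over every feature, keep the cheapest split, and recurse into each side.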
Information gain for a decision tree classifier can be calculated using either the Gini index or the entropy measure, whichever gives a greater gain. A hyperparameter search over the Decision Tree Classifier (DTC) was used to decide which measure to use. The DTC using entropy had the greater information gain; hence it was used for this classification problem.
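As a worked illustration of the two measures, the sketch below computes the gain of one split under each. The helper names and the four-label example are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity: 1 - sum(p^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gain(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

parent = ["1", "1", "2", "2"]
split = [["1", "1"], ["2", "2"]]     # a perfect split into pure leaves
print(gain(parent, split, entropy))  # 1.0
print(gain(parent, split, gini))     # 0.5
```

The two measures are on different scales (1.0 versus 0.5 for the same perfect split), so a search over the criterion usually compares the resulting models by a shared score such as validation accuracy.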
Random Forest Classifier
Random Forest Classifier (RFC) is an ensemble, tree-based learning algorithm (an ensemble combines more than one algorithm, of the same or different kinds, to classify objects). An RFC is a set of decision trees, each built from a randomly selected subset of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object. Random forests can be used for both classification and regression.
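The two ingredients described above, random subsets and vote aggregation, can be sketched in plain Python. The tree predictions here are made up; training the individual trees is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: the random subset for one tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """The final class is the one predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                 # stand-in for training examples
subset = bootstrap_sample(rows, rng)   # what a single tree would be trained on

# Made-up per-tree predictions for one test object.
tree_votes = ["2", "1", "2", "2b", "2"]
print(majority_vote(tree_votes))       # 2
```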
Similar to DTC, RFC requires an input that specifies the measure to be used for classification, along with a value for the number of estimators (number of decision trees). A hyperparameter search was used to determine the best choices for these parameters. RFC using entropy as the measure gave the best accuracy when trained and tested on the pre-processed accident severity dataset.
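One way to run such a search is scikit-learn's `GridSearchCV`, sketched below on synthetic data. The candidate values, the 3-fold cross-validation and the data are assumptions; the text does not state which search method or grid was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the project itself uses the collision dataset.
x, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Candidate values are illustrative assumptions, not the project's grid.
param_grid = {
    "criterion": ["gini", "entropy"],
    "n_estimators": [10, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=3, scoring="accuracy"
)
search.fit(x, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same pattern applies to the decision tree by swapping in `DecisionTreeClassifier` and dropping `n_estimators` from the grid.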
Logistic Regression Classifier
Logistic Regression is a classifier that estimates discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. It predicts the probability of occurrence of an event by fitting data to a logistic function; hence it is also known as logit regression. The values obtained always lie between 0 and 1 since it predicts a probability. As the chosen dataset has more than two target categories in terms of the accident severity code assigned, a one-vs-one (OvO) strategy is employed.
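A short sketch of the two ideas in this paragraph: the logistic function that turns a linear score into a probability, and the number of binary classifiers a one-vs-one scheme needs. The weights, bias and feature values are hypothetical.

```python
import math
from itertools import combinations

def logistic(z):
    """Logistic (sigmoid) function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical weights and bias for two features, for illustration only.
weights, bias = [0.8, -0.5], 0.1
features = [1.0, 2.0]
z = sum(w * f for w, f in zip(weights, features)) + bias
print(logistic(z))   # probability of the positive class, between 0 and 1

# One-vs-one trains one binary classifier per pair of classes.
classes = ["1", "2", "2b", "3"]
pairs = list(combinations(classes, 2))
print(len(pairs))    # k * (k - 1) / 2 = 6 for k = 4
```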
Neural networks can be used to capture non-linearity between features. We have used a Sequential ANN with 4 hidden layers, in which sigmoid activation functions are used. The loss function used is categorical_crossentropy, as the target has more than two classes. Note that in Keras this loss expects one-hot encoded targets; integer-coded targets call for sparse_categorical_crossentropy.
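For reference, the one-hot and integer-coded forms of cross-entropy compute the same quantity and differ only in the label format they accept. The plain-Python functions below are illustrations named after the corresponding Keras losses, not the Keras implementation, and the probabilities are made up.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Loss for a one-hot target: -sum(t_i * log(p_i))."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t)

def sparse_categorical_crossentropy(class_index, probs):
    """Same loss for an integer-coded target: -log(p[class_index])."""
    return -math.log(probs[class_index])

probs = [0.1, 0.7, 0.2]   # made-up softmax output for 3 classes
a = categorical_crossentropy([0, 1, 0], probs)
b = sparse_categorical_crossentropy(1, probs)
print(a, b)               # both equal -log(0.7)
```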
The accuracy of all models was 100%, which means we can accurately predict the severity of an accident. A bar plot is plotted below with the bars representing the accuracy of each model.
Initially, the classifiers had a prediction accuracy of 66%-71%; however, upon going back to the data preparation phase, minor tweaking and the inclusion of additional fields from the dataset improved the overall accuracy of all models.
The accuracy of the classifiers is excellent, i.e. 100%. This means that the models have trained well: they fit the training data and perform equally well on the testing set. We can conclude that these models can accurately predict the severity of car accidents in Seattle.
The trained model can be deployed onto governance and monitoring web and mobile applications to predict the accident severity for a given set of parameters.