Abstract:
Road traffic collisions are among the most serious problems facing the world. They cause many fatalities, injuries, and financial losses, with low- and middle-income countries (LMICs) bearing a disproportionate share of the cases. Previous studies have examined the problem using a range of methods and strategies on different road sections and crossings. Conventional methods such as logit and probit models have been used extensively to predict road accidents. These techniques have limitations, however: they require a predetermined mathematical form, and missing values and outliers in the dataset degrade the outcomes of the prediction model. In contrast to statistical techniques, machine learning (ML) techniques can handle outliers and missing values in the dataset. Designing accurate predictive models for road accidents is an important task for the transportation network, and it has motivated researchers to develop prediction models (PMs) and to investigate the factors that contribute to these accidents. This thesis therefore aims to develop and evaluate a prediction model using a two-layer ensemble ML technique that combines four supervised ML algorithms, AdaBoost, K-Nearest Neighbors (K-NN), Decision Trees (DT), and Naive Bayes (NB), to predict road accidents and their patterns. A driving simulator was used as the research instrument to collect data. The collected data were then cleaned and normalized for analysis using the scikit-learn Python library. The synthetic minority oversampling technique (SMOTE) was applied to address the data imbalance before training the model. The particle swarm optimization (PSO) algorithm was used to identify the most important features in the dataset. Four performance indicators (testing accuracy, precision, recall, and F1 score) were used to assess the models and compare their outcomes.
The findings indicate that the two-layer ensemble model outperforms the four base classification models on the four performance indicators, achieving 88% testing accuracy, 86% precision, 83% recall, and an 84% F1 score. The proposed two-layer ensemble model can be used in future theoretical and practical applications, such as road safety management, to improve existing road network conditions and to inform evidence-based traffic safety policies. Overall, the results show that ML-based models outperform statistical models.
Keywords: machine learning, data imbalance, road safety, driving simulation, SMOTE
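
As a quick consistency check on the figures reported above, the following minimal sketch recomputes the F1 score from the stated precision and recall. It assumes F1 is the harmonic mean of those two values, which holds for binary or per-class scores; the abstract does not state which averaging scheme was used, and with macro averaging the relationship need not hold exactly. The function name `f1_from_precision_recall` is illustrative and does not come from the thesis.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the F1 score as the harmonic mean of precision and recall."""
    # Guard against division by zero when both inputs are zero.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract for the two-layer ensemble model.
reported_precision = 0.86
reported_recall = 0.83

f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.84"
```

Under that assumption, the recomputed value agrees with the reported 84% F1 score.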