A Machine Learning approach to predicting the likelihood of an accident insurance claim – Wonder Mahembe

Project Introduction

Insurance companies face the challenge of accurately predicting the likelihood of policyholders filing claims, as it directly affects policy pricing and risk assessment. Understanding the factors that influence this likelihood is crucial for insurers to establish fair and competitive pricing structures. However, predicting claims is not straightforward, because many factors are involved, such as the demographic characteristics of the policyholder, the type and value of the vehicle insured, and the location and driving conditions where the vehicle is used. Accurate predictive algorithms can help address this challenge by estimating the probability of a claim for each policyholder based on these factors.

The main objective of this project was to develop a supervised machine learning model, using a dataset from an unnamed French auto insurance provider, to predict whether a policyholder is likely to file an insurance claim within a year. The project compared two machine learning models, Logistic Regression and Random Forest, by evaluating their performance on metrics such as accuracy, precision, recall and F1-score. These models were selected to establish how predictive performance varies between a simple traditional model and a tree-based model. The study also sought to identify the strongest predictors of an insurance claim. The expected contribution of this project is twofold. Firstly, it provides insights into the factors that affect the likelihood of an insurance claim and demonstrates the applicability and effectiveness of machine learning techniques for this problem; this information can help insurers to price their policies better and optimise risk management. Secondly, the project highlights the importance of employing advanced predictive algorithms in the insurance industry, showcasing their potential for improving decision-making processes and enhancing overall operational efficiency.
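The comparison described above could be set up along the following lines. This is only a minimal sketch, assuming scikit-learn and a synthetic six-feature dataset generated with make_classification in place of the insurance data; the split ratio, hyperparameters and random seeds are illustrative assumptions and are not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared predictors and binary claim target (assumption).
X, y = make_classification(n_samples=1000, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# The two model families compared in the project; settings here are illustrative.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, zero_division=0),
        "recall": recall_score(y_test, predictions, zero_division=0),
        "f1": f1_score(y_test, predictions, zero_division=0),
    }

for name, metrics in scores.items():
    print(name, {metric: round(value, 3) for metric, value in metrics.items()})
```

Reporting precision, recall and F1-score alongside accuracy matters for claims data, where policies without a claim typically outnumber those with one and accuracy alone can look high for a model that rarely predicts a claim.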

Method overview

A. The Dataset

The study used the French Motor Claims dataset downloaded from Kaggle [9], which is a collection of motor insurance claims data from an unnamed French insurance company. It is composed of 678,013 policy records, each containing information on the policyholder, vehicle, and claims made during the policy period. Analysing the data can provide insight into the factors that influence motor insurance claims, which is of interest to both insurers and policyholders alike. The dataset could be used to develop more accurate risk models and inform pricing decisions for motor insurance policies.

B. Data Mining Method

To ensure that the predictive algorithm contained only the strongest predictors of a claim, feature engineering and feature selection were carried out. Firstly, the target variable was re-coded from a numeric claim frequency to a binary variable, with 0 representing no claim and 1 representing a claim. To better capture the factors associated with high claim frequency, some numeric variables, such as VehAge and BonusMalus, were re-coded into categorical variables. Moreover, categorical variables with several levels (Region and VehBrand) were re-coded to two levels to separate the regions and brands associated with the highest claim frequencies from the rest. Lastly, one-hot encoding was applied to the resultant categorical variables, which were: populous_region, low_miles, special_brands, new_vehicle, Area, and VehGas. One-hot encoding converts a categorical variable into a set of binary features, one for each unique category. This step is needed before modelling because it represents categorical data in a numerical format that machine learning algorithms can handle.
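The re-coding and encoding steps above can be illustrated with a short pandas sketch. The rows are invented, ClaimNb is assumed to be the name of the claim frequency column, and the one-year cut-off for a "new" vehicle is a hypothetical threshold; the project's actual cut-offs are not stated.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are invented for illustration only.
policies = pd.DataFrame({
    "ClaimNb": [0, 1, 0, 2],   # assumed name of the claim frequency column
    "VehAge": [0, 5, 12, 1],
    "VehGas": ["Regular", "Diesel", "Regular", "Diesel"],
})

# Re-code claim frequency into a binary target: 0 = no claim, 1 = a claim.
policies["claim"] = (policies["ClaimNb"] > 0).astype(int)

# Re-code a numeric variable into two categories (hypothetical threshold of 1 year).
policies["new_vehicle"] = np.where(policies["VehAge"] <= 1, "Yes", "No")

# One-hot encode the categorical variables into 0/1 columns, one per category.
encoded = pd.get_dummies(policies[["new_vehicle", "VehGas"]], dtype=int)
print(encoded)
```

Each two-level variable produces a pair of columns that always sum to one (for example new_vehicle_Yes and new_vehicle_No), so one column of every pair carries no extra information.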

The resultant data frame had 23 predictors, which were then subjected to feature selection using ANOVA (Analysis of Variance). In the context of feature selection, ANOVA tests whether the mean of a feature differs significantly between the classes of the target variable (here, claim versus no claim). ANOVA was used because it identifies the features that have a statistically significant relationship with the target variable. Selecting the features with higher F-values and lower p-values means that the algorithm focuses on the variables most likely to have a meaningful impact on predicting insurance claims. This feature selection process picked the top 10 most significant predictors of a claim. However, since ANOVA scores each feature on its own and cannot identify correlation between predictors, the results of a correlation analysis were used to remove four variables that were strongly correlated with others in the dataset. The correlation analysis showed that, for four two-level categorical variables, the two columns created through one-hot encoding were perfectly correlated with each other. As a result, one variable from each of the four perfectly correlated pairs was excluded from the analysis. The final dataset used six predictor variables: Exposure, DrivAge, populous_region_Yes, vehicle gas type (regular), new_vehicle_Yes, and low_miles_Yes.
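The two-stage selection described above (ANOVA ranking followed by removal of perfectly correlated columns) might look roughly like this. The sketch assumes scikit-learn's SelectKBest with f_classif as the ANOVA implementation, which the project does not confirm, and uses synthetic data with hypothetical column names (signal, noise, flag_Yes, flag_No).

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)           # synthetic binary claim target

# A one-hot pair derived from one two-level variable, plus two numeric features.
flip = rng.random(n) < 0.2               # 20% of rows disagree with the target
flag_yes = np.where(flip, 1 - y, y)
X = pd.DataFrame({
    "signal": y + rng.normal(0, 0.5, size=n),   # related to the target
    "noise": rng.normal(0, 1, size=n),          # unrelated to the target
    "flag_Yes": flag_yes,
    "flag_No": 1 - flag_yes,                    # mirror image of flag_Yes
})

# Stage 1: keep the k features with the highest ANOVA F-values.
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = list(X.columns[selector.get_support()])

# Stage 2: within the selected set, drop the second column of any
# (near-)perfectly correlated pair.
corr = X[selected].corr().abs()
to_drop = set()
for i, first in enumerate(selected):
    for second in selected[i + 1:]:
        if first not in to_drop and corr.loc[first, second] > 0.999:
            to_drop.add(second)
final_features = [col for col in selected if col not in to_drop]

print(selected)
print(final_features)
```

Both columns of a one-hot pair receive the same F-value, because one is simply one minus the other, which is why ANOVA alone keeps both and a separate correlation check is needed to remove the duplicate.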

Summary of insights

Based on the findings, it is concluded that supervised machine learning algorithms can be effectively used to predict the likelihood of policyholders filing accident insurance claims. The project highlights the importance of considering various demographic, vehicle, and location factors when predicting insurance claims and pricing policies. The results demonstrate that machine learning approaches offer a valuable solution by enabling accurate claim prediction, fairer pricing, and improved risk management. Specifically, the logistic regression model predicted claims with 78% accuracy, slightly higher than the Random Forest model used for comparison.

The results of this project demonstrate the value of machine learning algorithms in predicting the probability of an insurance claim. Feature engineering was important in preparing the data so that the algorithms could better identify differences in claim frequency between groups. In addition, feature selection ensured that the algorithms used only the most important features and disregarded the less important ones. With only the most important features selected, algorithms run faster and are less prone to overfitting the noise introduced by a large number of predictors. In this study, three predictors appeared to have the greatest impact on claim frequency: exposure, bonus-malus (claim history), and vehicle age. In addition to these, the probability of a claim was also affected by driver age, region of residence, and vehicle gas type. While a reasonably high prediction accuracy was achieved using these features, insurers' own domain knowledge could help in training and deploying even more accurate algorithms. By studying their customers' behaviour more closely, insurance companies may better pinpoint which client characteristics are associated with high and low claim frequency.

It should also be noted that the findings of this project offer only a general understanding of the applicability of machine learning to predicting claims. Data for the project were sourced from a single insurance company in France and may therefore not reflect the characteristics of clients in other contexts. As a result, the lesson to draw from this study concerns the use of machine learning algorithms and the processes for feature selection and engineering, rather than the specific predictors found. Decisions on which premiums to charge the policyholders of a specific company should be based on an analysis of that company’s own data instead of outside data.