__Insurance Rating with Advanced Machine Learning: Penalized Multinomial Regression and Boosting with Cross-Validation on a Real Data Set of 59,381 Records and 128 Predictors for a US Life Insurance Company (Based on Public Data) – By Pranshu Tiwari__

Pranshu Tiwari is a Managing Consultant & IBM Senior Inventor. He has done extensive work in Application Rationalization, Business Case Development, Program Management, Program Planning, IT Performance Management, Strategy Planning, and Cost Optimization. He has 1 patent granted and 4 patents filed with the USPTO, and over 15 research papers in the field of Applied Analytics. His areas of interest include Neural Networks, Supervised Machine Learning, Principal Component Analysis, Support Vector Machines, and Regression in the analytics space.

Machine learning helps insurance carriers better understand the risk of individuals at the time of underwriting by providing a risk rating for each proposal. Risk assessment has always been one of the most challenging and time-consuming pieces of the underwriting process, but with the help of machine learning, natural language understanding (NLU), and other techniques, underwriting is becoming more accurate and efficient. Machine learning streamlines and speeds up risk evaluation by drawing more accurate inferences and projections from large volumes of data, and algorithms run on unbiased machines can be valuable tools in the risk evaluation process.

To that end, this paper takes data from a public database, the Prudential data set, consisting of 59,381 observations and 128 predictors, with the response treated as a categorical variable. The overall objective of the project is to develop, through advanced machine learning, the best mathematical model for assigning a risk rating to individuals and thereby expediting the underwriting process. Key variables such as age, BMI, and height are plotted in the diagrams below.

Target Audience

The target audience of this analysis is insurance companies, which can improve their risk assessment during the underwriting process and improve customer experience by helping customers choose the right products, a win for both customers and insurance agents. There is also potential to use association rules to identify which products are appropriate for which risk factors, which could support upselling of insurance products; however, this is not in scope for this paper.

**Approach**

- We divided our data into 2 sets: training data (40,000 observations) and the remaining 19,381 observations for test. To overcome the limitation of heavy computing, we did not use double cross-validation
- We used the training data to select the best model, leveraging single (not double) cross-validation
- The best model is selected on the basis of classification error, using cross-validation on the training dataset and the resulting confusion matrix, among:
- Lasso Regression
- Ridge Regression
- Boosting

- The best model among the three classification techniques was used to predict the risk rating of individuals.
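The approach above can be sketched in code. This is a minimal illustration, assuming the predictors sit in `X` and the risk class in `y`; the synthetic arrays below stand in for the real Prudential records, and all names are illustrative only.

```python
# Sketch of the model-selection workflow: hold out a test set, then pick the
# candidate (ridge, lasso, boosting) with the best cross-validated accuracy.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))               # stand-in for 59,381 x 128
y = (X[:, 0] + X[:, 1] > 0).astype(int) + 1  # stand-in risk classes

# Hold out a test set, mirroring the training/test split in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.67, random_state=0)

candidates = {
    "ridge": LogisticRegression(penalty="l2", max_iter=1000),
    "lasso": LogisticRegression(penalty="l1", solver="saga", max_iter=1000),
    "boosting": GradientBoostingClassifier(n_estimators=50, max_depth=2),
}

# Select the model with the highest CV accuracy (lowest classification error)
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
          for name, m in candidates.items()}
best_name = max(scores, key=scores.get)
```

The chosen `best_name` model would then be refit on the full training set and evaluated once on the held-out test data.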

Appropriateness of the model and comparison with other mathematical techniques. (Double cross-validation was not considered because, with 40k records, it would be highly computationally intensive given limited CPU capacity.) Please refer to Table 1 for the various computational techniques.

| Mathematical Model | Consideration | Functional Constraint | Computational Constraint |
| --- | --- | --- | --- |
| Logistic Regression | Not considered | Cannot be applied as we have more than 2 levels of the response variable | N.A. |
| Artificial Neural Networks | Not considered | N.A. | While an ANN could be a good choice, the presence of >100 predictors and 40,000 records would have increased computation time with multiple hidden nodes and multiple weights |
| Boosting | Considered | Considered for analysis, though non-parametric in nature | Limited to 40k records instead of 50k |
| Lasso, Ridge, and Penalized Regression for the multinomial function | Considered | N.A. | Data limited to 40k training records instead of 50k for ease of computation |
| Linear Discriminant Analysis and Quadratic Discriminant Analysis | Not considered | The presence of 114 predictor variables may increase variance, and the chance of overfitting is high | Cross-validation may increase computation time |

__Scatter Plot of Data for Columns 1- 9 only.__

Figure 1: Scatter plot of the 1st to 9th predictor variables


__Box plot based on categories of responses for 2 predictor variables – Height & BMI__

Figure 2: Box plot of how responses vary with BMI

Since there are 128 predictors, it was computationally impractical to create scatter plots for all of them; hence two sample predictors are shown above.

Since there are 128 predictors, it was important to consider the key variables to ensure the model fits well without unduly inflating variance. We also need to shrink the coefficient estimates to reduce variance and ensure we do not overfit the model. Model selection has therefore been done to balance bias and variance so as to minimize the misclassification error.

**Ridge Regression:**

Ridge regression shrinks the coefficients by penalizing their squared magnitude; with $p$ the number of predictors:

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}\beta_j^2 \qquad \text{(Equation 1)}$$

As the number of predictors increases, bias decreases but variance increases, so we need to shrink the coefficients to optimize the model.

Based on ridge regression over the candidate grid

lambdalist = exp((100:-100)/100)

we arrive at the following coefficient pathways.

The misclassification rate for various values of lambda is shown in Figure 3.2, while the coefficient pathways are shown in Figure 3.1.

Figure 3.1: Coefficient pathways for various values of lambda

Figures 3.1 and 3.2 show how the coefficient values change for different values of lambda (also written phi). The misclassification rate is a function of lambda, and the best value is chosen to minimize classification error.
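The coefficient pathways can be sketched numerically. This is a hedged illustration using scikit-learn's L2-penalized logistic model as a stand-in for the penalized multinomial fit, refit over the lambda grid `exp((100:-100)/100)` from the text; the data is synthetic, not the Prudential set.

```python
# Refit a ridge-penalized multinomial model across the lambda grid and watch
# the coefficient magnitudes shrink as the penalty grows.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-1.0, 0.0, 1.0])  # 4 classes

# R's exp((100:-100)/100): 201 lambda values from e^1 down to e^-1
lambdalist = np.exp(np.arange(100, -101, -1) / 100)

l1_norms = []
for lam in lambdalist[::40]:                 # thin the grid for speed
    # scikit-learn parameterizes the penalty strength as C = 1 / lambda
    model = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
    model.fit(X, y)
    l1_norms.append(float(np.abs(model.coef_).sum()))
```

Plotting `l1_norms` against the lambda grid reproduces the pathway picture: the strongest penalty (largest lambda) gives the smallest coefficients.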


*Figures 3.3 and 3.4 show the different values of lambda and the coefficient pathways against the L1-norm values.*

As seen in Figures 3.3 and 3.4, the coefficients change with the value of lambda, and hence we need to identify the value that maximizes classification accuracy.

The best CV error is computed against lambda using 10-fold cross-validation (k = 10).

Misclassification as a function of lambda is shown in Figure 4, with the lowest misclassification at lambda = 0.36 (log lambda ≈ -1).

Similarly, we compute the penalized misclassification for lasso regression using the L1-penalized objective:

$$\min_{\beta}\;\sum_{i=1}^{n}\Big(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^{p}|\beta_j| \qquad \text{(Equation 2)}$$

Penalized regression with the lasso penalty can shrink coefficients exactly to zero. Depending on the value of phi (lambda), lasso regression can therefore produce a model involving any number of predictors; lasso models hence have higher interpretability.
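The sparsity property can be demonstrated with a small sketch, again using scikit-learn's L1-penalized logistic model as an assumed stand-in for the penalized multinomial fit, on synthetic data where only the first three columns carry signal.

```python
# With a strong enough L1 penalty, some coefficients become exactly zero,
# so the fitted model retains only a subset of the predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 15))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)  # 3 informative columns

lasso = LogisticRegression(penalty="l1", solver="saga", C=0.2, max_iter=5000)
lasso.fit(X, y)
n_nonzero = int(np.count_nonzero(lasso.coef_))  # predictors retained
```

A ridge fit on the same data would keep all 15 coefficients nonzero, which is exactly the interpretability difference described above.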

*Figures 4.3 and 4.4 show the classification rates against log lambda (phi). Lasso regression shows a near-constant misclassification rate compared with ridge regression.*

Tree-Based Modelling – since we are considering all 128 predictors, we may tend to overfit the model, as the variance would be high.

- Bagging – tree-based models aggregated over bootstrap samples (sampling with replacement). Bagging is useful for reducing variance, but not bias.
- Boosting – this technique reduces both bias and variance.

Steps followed in boosting:

- For each of the B trees, b = 1, 2, …, B:
  - Fit tree b, with d decision splits (on a subset of the p predictor variables), to the current residuals
  - Compute the misclassifications of the ensemble so far and update the residuals

The boosted prediction aggregates the B trees with a shrinkage parameter $\lambda$:

$$\hat{f}(x)=\sum_{b=1}^{B}\lambda\,\hat{f}^{\,b}(x)$$

Here b indexes the trees from 1 to B, and each tree has d decision splits (d variables, a subset of the p predictors). Hence we can create more trees to further reduce the residuals, i.e. the misclassification.
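The boosting steps above can be sketched with scikit-learn's gradient boosting as a stand-in: B shallow trees, each limited to a few decision splits, fitted sequentially so that each reduces the error left by its predecessors. Synthetic data for illustration only.

```python
# Gradient boosting sketch: staged_predict exposes the ensemble after each
# tree, so we can watch the test error fall as trees are added.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 10))
y = (X[:, 0] ** 2 + X[:, 1] > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# B = 100 trees; max_depth=2 limits each tree to a few decision splits (d)
booster = GradientBoostingClassifier(n_estimators=100, max_depth=2,
                                     learning_rate=0.1, random_state=0)
booster.fit(X_tr, y_tr)

# Test error of the ensemble after each successive tree
errors = [float(np.mean(pred != y_te))
          for pred in booster.staged_predict(X_te)]
```

The `learning_rate` plays the role of the shrinkage parameter in the aggregation formula above.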

Since only a standard-core CPU was available, we considered 10 variables, accepting some increase in bias in the model.

Performance of the 3 models and their error rates

Figure 6: Classification rate based on cross-validation on the training data for the 3 models. Boosting has the highest accuracy, and hence we used that model for the test data.

__Running the best validated model on Test Data__

We now run the analysis on a test data set of 19,381 users, based on training on 40,000 users.

The confusion matrix for the 8 levels of risk factors among the test users is given below.

Confusion Matrix on Test Data using Boosting

The classification success rate is more than 55%. The prediction should improve as we increase the number of trees; due to CPU memory limitations we used only 100 trees.
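The test-set evaluation can be sketched as follows. The predictions below are simulated (a classifier correct about 60% of the time, consistent with the >55% success rate reported); in the paper they come from the boosted model.

```python
# Build an 8x8 confusion matrix for the eight risk levels and compute the
# overall classification success rate.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

rng = np.random.default_rng(4)
y_true = rng.integers(1, 9, size=1000)       # stand-in for the test labels
# simulate a classifier that is right about 60% of the time
correct = rng.random(1000) < 0.6
y_pred = np.where(correct, y_true, rng.integers(1, 9, size=1000))

cm = confusion_matrix(y_true, y_pred, labels=list(range(1, 9)))
accuracy = accuracy_score(y_true, y_pred)    # the "classification success"
```

The diagonal of `cm` holds the correctly rated users per risk level; off-diagonal cells show which adjacent risk levels the model confuses most often.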

**Applicability & Future Work on EC2 & Big Data Computing.** The model can be applied to other companies' data sets, as census.gov has health insurance data for various customers that could be leveraged in the underwriting process for accurate risk rating of individuals and to help them select the right plan. We could also use neural networks; however, we would need high-memory CPUs/VMs to train at a practical speed (learning rate and weight updates). This was a limitation of our study.

Business Benefits

The Business Benefit has been outlined below:

- Predict the risk score of individuals, which helps customers as well as insurance companies know the actual risk while the application is being filled in. The risk computation mechanics could be used by insurance companies during group set-up for annual payments made to healthcare providers, and to customize products
- The risk rating of customers could then also help insurance companies run campaigns for which products could be sold against a given risk score.