
Exploratory data analysis of

Insurance premium renewal

Capstone project notes-1

Vikas Chauhan
12/01/2020
1. Introductory Phase

Project Objective and Business understanding

Insurance provides financial support and reduces uncertainty in business and personal life.
It offers safety and security against adverse events. There is always a risk of sudden
loss, and insurance provides cover against it.

An insurance premium is the amount of money an individual or business pays for an
insurance policy. Insurance premiums are paid for policies that cover healthcare, auto,
home, life, and others.
Once earned, the premium is income for the insurance company. It also represents a
liability, as the insurer must provide coverage for claims being made against the policy.
Failure to pay the premium may result in the cancellation of the policy.

Premiums paid by customers are the major revenue source for insurance companies.
Defaults on premium payments result in significant revenue losses, so insurance
companies would like to know upfront which types of customers are likely to default on
premium payments.

The price of the premium depends on a variety of factors including:

• The type of coverage
• Your age
• The area in which you live
• Any claims filed in the past
• Moral hazard and adverse selection

Problem statement

This project report focuses on the following problem: predicting whether a current customer
will default on future premium payments.

Points to consider

Whether a customer renews depends on various factors such as age, income, risk score and
number of premiums paid. This is a predictive modelling problem.

Assumptions

1. The data contains continuous and categorical variables that influence the
prediction of which customers will renew their premium.
2. We assume that all the predictors are on the same scale and that there are no
missing values.
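
The two assumptions above can be checked rather than taken on trust. The following is a minimal sketch on a small made-up data frame (the name `toy` and all of its values are illustrative, not the project data): it counts missing values per column and compares the ranges of the numeric columns.

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  age_in_years = c(31, 45, 52, NA),
  Income       = c(90050, 156080, 145020, 187560),
  risk_score   = c(98.8, 99.1, 99.2, 99.4)
)

# Assumption "no missing values": count NAs column by column.
na_per_column <- sapply(toy, function(x) sum(is.na(x)))

# Assumption "same scale": compare the min/max of each column.
ranges <- sapply(toy, range, na.rm = TRUE)

print(na_per_column)
print(ranges)
```

If the ranges differ by orders of magnitude, base R's `scale()` is one option for standardising the numeric columns before fitting a scale-sensitive model.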
2. Data Understanding and Preparation

Data Overview

• There are 17 columns in the dataset, named
"id", "perc_premium_paid_by_cash_credit", "age_in_days", "Income",
"Count_3.6_months_late", "Count_6.12_months_late", "Count_more_than_12_months_late",
"Marital.Status", "Veh_Owned", "No_of_dep", "Accomodation", "risk_score",
"no_of_premiums_paid", "sourcing_channel", "residence_area_type", "premium", "renewal",
and 79853 records.
• This is data on customers who have been with the insurance company for
many years. We want to predict whether a customer will renew their
premium in the future.

• Missing values:
o The data was checked for missing values.
o No missing values were found.
Structure of the dataset:
> str(premium)
'data.frame': 79853 obs. of 17 variables:
 $ id                              : int 1 2 3 4 5 6 7 8 9 10 ...
 $ perc_premium_paid_by_cash_credit: num 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
 $ age_in_days                     : num 11330 30309 16069 23733 19360 ...
 $ Income                          : num 90050 156080 145020 187560 103050 ...
 $ Count_3.6_months_late           : num 0 0 1 0 7 0 0 0 0 0 ...
 $ Count_6.12_months_late          : num 0 0 0 0 3 0 0 0 0 0 ...
 $ Count_more_than_12_months_late  : num 0 0 0 0 4 0 0 0 0 0 ...
 $ Marital.Status                  : Factor w/ 2 levels "0","1": 1 2 1 2 1 1 1 1 2 2 ...
 $ Veh_Owned                       : Factor w/ 3 levels "1","2","3": 3 3 1 1 2 1 3 3 2 3 ...
 $ No_of_dep                       : Factor w/ 4 levels "1","2","3","4": 3 1 1 1 1 4 4 2 4 3 ...
 $ Accomodation                    : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
 $ risk_score                      : num 98.8 99.1 99.2 99.4 98.8 ...
 $ no_of_premiums_paid             : num 8 3 14 13 15 4 8 4 8 8 ...
 $ sourcing_channel                : Factor w/ 5 levels "A","B","C","D",..: 1 1 3 1 1 2 3 1 1 1 ...
 $ residence_area_type             : Factor w/ 2 levels "Rural","Urban": 1 2 2 2 2 1 1 2 2 1 ...
 $ premium                         : num 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
 $ renewal                         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...


All the categorical variables are converted into factors and the continuous variables into numeric
to make the EDA easier to interpret.
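
As a minimal sketch of this conversion step, the columns can be converted in groups rather than one line per column. The data frame `toy` and its values are made up for illustration; only the column names are taken from this report, and the grouping vectors `categorical_cols` and `continuous_cols` are hypothetical helpers.

```r
# Toy data frame; column names follow the report, values are made up.
toy <- data.frame(
  Income         = c("90050", "156080", "145020"),
  Marital.Status = c(0, 1, 0),
  Veh_Owned      = c(3, 3, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper vectors naming the columns of each type.
categorical_cols <- c("Marital.Status", "Veh_Owned")
continuous_cols  <- c("Income")

# Convert each group of columns in one step.
toy[categorical_cols] <- lapply(toy[categorical_cols], as.factor)
toy[continuous_cols]  <- lapply(toy[continuous_cols], as.numeric)

str(toy)
```

Keeping the column names in two vectors makes it easy to see at a glance which variables are treated as categorical and which as continuous.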
#Renaming columns

Age was given in days, which is hard to interpret in the analysis, so it has been converted
to years.

premium$age_in_days = premium$age_in_days/365
names(premium)[3]<-"age_in_years"

#data understanding and removing invalid columns

Premium_req<-premium[,c(2:17)]
View(Premium_req)

> summary(Premium_req)
 perc_premium_paid_by_cash_credit  age_in_years      Income
 Min.   :0.0000                    Min.   : 21.01    Min.   :   24030
 1st Qu.:0.0340                    1st Qu.: 41.02    1st Qu.:  108010
 Median :0.1670                    Median : 51.03    Median :  166560
 Mean   :0.3143                    Mean   : 51.63    Mean   :  208847
 3rd Qu.:0.5380                    3rd Qu.: 62.02    3rd Qu.:  252090
 Max.   :1.0000                    Max.   :103.02    Max.   :90262600

 Count_3.6_months_late  Count_6.12_months_late  Count_more_than_12_months_late
 Min.   : 0.0000        Min.   : 0.00000        Min.   : 0.00000
 1st Qu.: 0.0000        1st Qu.: 0.00000        1st Qu.: 0.00000
 Median : 0.0000        Median : 0.00000        Median : 0.00000
 Mean   : 0.2484        Mean   : 0.07809        Mean   : 0.05994
 3rd Qu.: 0.0000        3rd Qu.: 0.00000        3rd Qu.: 0.00000
 Max.   :13.0000        Max.   :17.00000        Max.   :11.00000

 Marital.Status  Veh_Owned      No_of_dep      Accomodation   risk_score
 Min.   :1.000   Min.   :1.000  Min.   :1.000  Min.   :1.000  Min.   :91.90
 1st Qu.:1.000   1st Qu.:1.000  1st Qu.:2.000  1st Qu.:1.000  1st Qu.:98.83
 Median :1.000   Median :2.000  Median :3.000  Median :2.000  Median :99.18
 Mean   :1.499   Mean   :1.998  Mean   :2.503  Mean   :1.501  Mean   :99.07
 3rd Qu.:2.000   3rd Qu.:3.000  3rd Qu.:3.000  3rd Qu.:2.000  3rd Qu.:99.52
 Max.   :2.000   Max.   :3.000  Max.   :4.000  Max.   :2.000  Max.   :99.89

 no_of_premiums_paid  sourcing_channel  residence_area_type  premium
 Min.   : 2.00        Min.   :1.000     Min.   :1.000        Min.   : 1200
 1st Qu.: 7.00        1st Qu.:1.000     1st Qu.:1.000        1st Qu.: 5400
 Median :10.00        Median :1.000     Median :2.000        Median : 7500
 Mean   :10.86        Mean   :1.823     Mean   :1.603        Mean   :10925
 3rd Qu.:14.00        3rd Qu.:3.000     3rd Qu.:2.000        3rd Qu.:13800
 Max.   :60.00        Max.   :5.000     Max.   :2.000        Max.   :60000

 renewal
 Min.   :1.000
 1st Qu.:2.000
 Median :2.000
 Mean   :1.937
 3rd Qu.:2.000
 Max.   :2.000

#check missing values


sum(is.na(Premium_req))

There are no missing values in data.

# Checking null data


sapply(Premium_req,function(x) sum(is.na(x)))

# Checking # of unique values in each column


sapply(Premium_req,function(x) length(unique(x)))

From the two checks above, I conclude that there are no null values and no column that acts as a unique identifier.

3. Exploratory data analysis

Univariate analysis with histograms:


Here we can see that the distribution of the percentage of premium paid by cash is right skewed
(most values are close to 0, with a long tail toward 1), while the distribution of age is
approximately normal.
The late-premium count columns are concentrated at 0, and almost all values lie in the range
0-10. Because non-zero counts are so infrequent, these histograms provide little
information.

The distribution of the number of premiums paid is also right skewed. From the distribution of the
renewal column we can say that more customers renew than do not renew their insurance.
(Note: 1 → No and 2 → Yes.)
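
The 1/2 coding in the note is what `as.numeric()` returns for a two-level factor with levels "0" and "1", which is a likely explanation for why the plots and the summary show 1 and 2. A frequency table gives the renewal share more directly than a histogram. The sketch below uses a made-up vector, not the project data.

```r
# Toy renewal flags as stored in the data (0 = no, 1 = yes); values are made up.
renewal <- factor(c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1), levels = c(0, 1))

# as.numeric() on a factor returns the level index (1 or 2), not the label.
codes <- as.numeric(renewal)

# Counts and shares per original label.
counts <- table(renewal)
shares <- prop.table(counts)

print(range(codes))
print(shares)
```

On the 1/2 coding, a mean of 1.937 (as in the summary above) corresponds to a renewal share of about 93.7%.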

• The distribution of customer risk scores is left skewed. Most plotted values fall between
96 and 100 (the summary shows a minimum of 91.90), with the majority of customers in the
range 99-99.8. This suggests that the high renewal rate may be related to high risk
scores and high income.
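
Skew direction can be checked numerically instead of being read off a histogram. The sketch below uses one common moment-based definition of sample skewness (positive means a long right tail, negative a long left tail). The helper `skewness` and both toy vectors are illustrative assumptions, not part of the original analysis.

```r
# Moment-based sample skewness: positive = right skewed, negative = left skewed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy vectors (values made up): one with a long right tail, one with a long left tail.
right_tailed <- c(0, 0, 0.01, 0.02, 0.05, 0.1, 0.2, 0.9, 1)
left_tailed  <- c(92, 97, 98.8, 99, 99.1, 99.2, 99.4, 99.5, 99.8)

print(skewness(right_tailed))
print(skewness(left_tailed))
```

A quicker rule of thumb is to compare the mean with the median: a mean above the median usually indicates right skew, and a mean below it usually indicates left skew.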
Bivariate Analysis:

The box plots show that late premium payment varies with age: customers aged roughly
45 to 55 are the most likely to pay premiums late, possibly because of a midlife crisis.
We can also see some outliers in the box plot of late premiums after 6-12 months.
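
One way to put a number on an age pattern like this is to bin age and compare the average late-payment count per band. This is a minimal sketch on made-up data: the data frame `toy`, its values and the 10-year bands are assumptions for illustration, and only the column names come from this report.

```r
# Toy data, values made up; column names follow the report.
toy <- data.frame(
  age_in_years          = c(25, 33, 47, 49, 52, 54, 61, 70),
  Count_3.6_months_late = c(0, 0, 2, 1, 3, 0, 0, 1)
)

# Bin age into 10-year bands: [20,30), [30,40), ..., [70,80).
toy$age_band <- cut(toy$age_in_years, breaks = seq(20, 80, by = 10), right = FALSE)

# Mean number of 3-6 month late payments within each age band.
late_by_band <- tapply(toy$Count_3.6_months_late, toy$age_band, mean)

print(late_by_band)
```

The same pattern works for the 6-12 month and 12+ month columns by swapping the column passed to `tapply()`.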

• From the plot below we can infer that late premiums are more prevalent when the number of
dependents is higher: customers with 6 or more late payments tend to have more
dependents.
• From the plot above we can infer that the higher the income, the lower the chance of late
premium payment. Customers with income above 4e+07 have no late
premiums.

From the plot above, people with higher income appear more likely to renew than
people with lower income. (Note: 1 → No and 2 → Yes to renewal.)
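
A grouped summary is one way to back up the comparison read from the box plot. The sketch below compares median income by renewal status on made-up data (the data frame `toy` and its values are illustrative only). The median is used here rather than the mean because income contains extreme values, as the maximum in the summary shows.

```r
# Toy data, values made up; renewal stored as 0 = no, 1 = yes.
toy <- data.frame(
  Income  = c(60000, 90000, 120000, 150000, 200000, 250000, 9000000),
  renewal = factor(c(0, 0, 1, 0, 1, 1, 1), levels = c(0, 1))
)

# Median income within each renewal group.
income_by_renewal <- aggregate(Income ~ renewal, data = toy, FUN = median)

print(income_by_renewal)
```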
Appendix: Code in R

# Set working directory

setwd("D:/R_datascience/Capstone")

# read file

premium=read.csv("premium.csv",header=TRUE);

View(premium)

#Check structure of data

str(premium)

dim(premium)

names(premium)

premium$perc_premium_paid_by_cash_credit =
as.numeric(premium$perc_premium_paid_by_cash_credit)

premium$age_in_days = as.numeric(premium$age_in_days)

premium$Income = as.numeric(premium$Income)

premium$Count_3.6_months_late = as.numeric(premium$Count_3.6_months_late)

premium$Count_6.12_months_late = as.numeric(premium$Count_6.12_months_late)

premium$Count_more_than_12_months_late =
as.numeric(premium$Count_more_than_12_months_late)

premium$risk_score = as.numeric(premium$risk_score)

premium$no_of_premiums_paid = as.numeric(premium$no_of_premiums_paid)

premium$premium = as.numeric(premium$premium)

#Converting the categorical variables to factor from numeric

premium$Marital.Status = as.factor(premium$Marital.Status)

premium$Veh_Owned = as.factor(premium$Veh_Owned)

premium$No_of_dep = as.factor(premium$No_of_dep)

premium$Accomodation = as.factor(premium$Accomodation)

premium$renewal = as.factor(premium$renewal)
# Convert the text-valued residence area type into numeric (Urban = 1, Rural = 0).
# ifelse() works whether the column was read as character or as factor.

premium$residence_area_type <- ifelse(premium$residence_area_type == "Urban", 1, 0)

#Renaming columns

premium$age_in_days = premium$age_in_days/365

names(premium)[3]<-"age_in_years"

#data understanding and removing invalid columns

Premium_req<-premium[,c(2:17)]

View(Premium_req)

summary(Premium_req)

str(Premium_req)

#check missing values

sum(is.na(Premium_req))

# remove missing values

Premium_req<- na.omit(Premium_req)

Premium_req <- as.data.frame(Premium_req)

# Checking null data

sapply(Premium_req,function(x) sum(is.na(x)))

# Checking # of unique values in each column


sapply(Premium_req,function(x) length(unique(x)))

# Exploratory Data Analysis

#univariate

par(mfrow=c(3,4))

hist(Premium_req$perc_premium_paid_by_cash_credit, col = "blue", main = "Distribution of premium paid by cash")

hist(Premium_req$age_in_years, col = "blue", main = "Distribution of age")

hist(Premium_req$Income, col = "blue", main = "Distribution of income of customers")

hist(Premium_req$Count_3.6_months_late, col = "yellow", main = "Distribution of no. of times premium was paid 3-6 months late")

hist(Premium_req$Count_6.12_months_late, col = "yellow", main = "Distribution of no. of times premium was paid 6-12 months late")

hist(Premium_req$Count_more_than_12_months_late, col = "yellow", main = "Distribution of no. of times premium was paid 12+ months late")

hist(Premium_req$no_of_premiums_paid, col = "red", main = "Distribution of no. of premiums paid")

# hist() needs numeric input; as.numeric() on a factor gives level codes (1 = No, 2 = Yes)

hist(as.numeric(Premium_req$renewal), col = "red", main = "Distribution of renewal of premium by customer")

hist(Premium_req$risk_score, col = "red", main = "Distribution of risk score of customer")

hist(as.numeric(Premium_req$Marital.Status), col = "green", main = "Distribution of marital status of customer")

#Bivariate analysis

boxplot(Premium_req$age_in_years ~ Premium_req$Count_3.6_months_late, main = "Age vs late premium in 3-6 months")

boxplot(Premium_req$age_in_years ~ Premium_req$Count_6.12_months_late, main = "Age vs late premium after 6-12 months")

boxplot(Premium_req$age_in_years ~ Premium_req$Count_more_than_12_months_late, main = "Age vs late premium after 12 months")

boxplot(Premium_req$Income ~ Premium_req$risk_score, main = "Income vs risk score")

boxplot(Premium_req$Income ~ Premium_req$renewal, main = "Income vs renewal")

boxplot(Premium_req$Income ~ Premium_req$Count_more_than_12_months_late, main = "Income vs late premium after 12 months")

# No_of_dep is a factor; convert its labels back to numbers so boxplot() can summarise it

dep_count <- as.numeric(as.character(Premium_req$No_of_dep))

boxplot(dep_count ~ Premium_req$Count_6.12_months_late, main = "No. of dependents vs late premium in 6-12 months")

boxplot(dep_count ~ Premium_req$Count_3.6_months_late, main = "No. of dependents vs late premium in 3-6 months")
