
Text Mining for Insurance Claim Cost Prediction: A Case Study
PwC's New Powerful Service
Text mining is an emerging area of data mining.

Up to 80% of the data stored by organisations is in free-text form
(Feldman 2003, p. 481). The data contained in text fields holds huge
untapped value. The difference between regular data mining and text
mining is that in text mining the patterns are extracted from natural
language text rather than from structured databases of facts.

Text mining is a process that translates text into numeric form by
extracting patterns from natural language text, and therefore allows us
to incorporate textual information directly into predictive modelling.
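
As a minimal illustration of this translation step (a sketch using scikit-learn, not the commercial tooling used in the study; the narratives are invented):

```python
# Minimal sketch: turning free-text incident narratives into a numeric
# matrix of term counts. Example narratives are invented.
from sklearn.feature_extraction.text import CountVectorizer

narratives = [
    "claimant fractured left leg falling from ladder",
    "strained right wrist lifting heavy box from truck",
]

vectorizer = CountVectorizer()             # one column per discovered term
X = vectorizer.fit_transform(narratives)   # sparse numeric matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())                         # rows = narratives, cols = counts
```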

Text mining is a new area and many questions remain
unanswered.
For example, further work is needed to understand:

- How to best set text mining parameters such as synonym dictionaries, word stemming, allowing specific word combinations, etc.
- What the optimal process is for utilising textual information after the text mining has been performed for a given business case

This project is pioneering research into the benefits of text mining
and the optimal ways to use text mining output in predictive models.

Text analysis has huge potential for insurance, as shown by research
and industrial studies.

There has been recognition for some time now that data about incidents
contains information which allows for a proactive risk management
approach (Feyer and Williamson 1998).

Realising the potential value of the information resident in this textual
data, insurers are taking a growing interest in the application of new
text mining techniques (Feyer, Stout et al. 2001).

For example, text analysis of narrative fields about claims resulted in
beneficial claims management and fraud detection in the occupational
injury insurance domain (Stout 1998). In that example, the benefits
stemmed from information in the textual narrative data which was not
present in the existing coding system.

PwC Australia Case Study. Using Text Mining in Insurance.
Client: Major Australian Insurer
[Engagement process diagram: Client issue → Agree value drivers → Data design, collection & verification → Analysis, predictive modelling & testing → Implement, strategy formulation → Monitor & review; with a loop back to revisit data & assumptions.]

Client Issue:
Perceived inadequacies in the level of information captured by the current
injury coding system led to the need to assess the potential value that
using textual information and text mining facilities could add to the
organisation:
- To explore the possibilities and benefits of augmenting the existing accident coding system using free text
- To see if adding textual information would result in increased precision of claim cost prediction
- To suggest how text mining could be used for improvement in other areas of the business
- To assist in making a decision regarding investing in the commercial text mining software package that would suit the client's needs best.

Assessing the value of textual information for the client.

Our approach was to create a model identifying, at the time of the
incident report, whether the incident would result in a weekly claim
pay-out value within the top 10% by the end of the next quarter.

We assessed the model:
- in terms of the predictive power of textual information on its own
- in terms of textual information adding predictive power to other, non-textual predictors.

Data Description.

The data sets represented all claims reported between 30 September 2002
and 31 March 2004 which were still open at the time of the research:
an 18-month data history. This provided approximately 56,000 records.

The data comprised:
- features covering claimant demographics, claims payment information, and codings for various aspects of the incident (injury location, nature and mechanism)
- textual data: unstructured free-type text fields of about 200 characters each. These fields described the incident and the resulting injury.

The target variable for prediction was a binary indicator (yes/no) of
whether or not the injury report had resulted in a claim pay-out value
within the top 10 percent by the end of the next quarter.
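
A minimal sketch of how such a target could be derived (the column names and values are hypothetical, not the client's actual schema):

```python
# Minimal sketch: flag claims whose pay-out by the end of the next
# quarter falls in the top 10%. Column names and values are invented.
import pandas as pd

claims = pd.DataFrame({
    "claim_id": [1, 2, 3, 4, 5],
    "payout_next_quarter": [500.0, 12000.0, 800.0, 45000.0, 300.0],
})

threshold = claims["payout_next_quarter"].quantile(0.90)
claims["top10_flag"] = (claims["payout_next_quarter"] >= threshold).astype(int)
print(claims)
```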

Text Mining Process.

Stage 1. Does textual information


have predictive value? 1. Prepare
TextData
2. Discover
concepts
3. Reduce
concepts
(text mining)

About 8000 concepts were discovered Prepared


in the Discover concepts phase TransData

Filtered out concepts which had a


frequency of less than 50. After 6. Predictive
modelling
5. Derive domain 4. Select predictive
filtering 860 concepts remained. with concepts only
relevant concepts concepts

Selected the most important concepts


using decision trees and TreeNet.
Added domain expertise to create
enriched concepts 8. Predictive 9. Predictive
modelling modelling
Built predictive model using textual 7. Evaluate results
with features with concepts
information only. The model was only and features

correct in 75.7% cases!

10. Compare
results and
conclude

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 2. Discover concepts

[Process diagram: steps 1-10 as above]

Concepts are words or word combinations resident in the text.

The mining of TextData required not only considerable time and effort,
but also expertise in the insurance domain and in the software packages
used.

The mining process was characterised by iterative experimentation to
find optimal algorithm settings. These settings included both language
and mathematical weightings.
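
A minimal sketch of discovering words and word combinations as concepts (invented text; the study used a commercial text mining package):

```python
# Minimal sketch: "concepts" as words or word combinations, approximated
# with unigrams and bigrams. Example text is invented.
from sklearn.feature_extraction.text import CountVectorizer

text = ["strained lower back lifting heavy box"]
concepts = CountVectorizer(ngram_range=(1, 2)).fit(text)
print(sorted(concepts.vocabulary_))   # includes 'lower back', 'heavy box'
```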

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 3. Reduce the number of concepts

[Process diagram: steps 1-10 as above]

About 8,000 concepts were discovered in the "Discover concepts" phase.
It is difficult to make sense of so many concepts, and concepts with a
low frequency would not be relevant within our context. Therefore, at
step three, the researchers filtered out those concepts which had a
frequency of less than 50 in TextData. After filtering, 860 concepts
remained.
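
The filtering step can be sketched with a document-frequency cutoff (the study's cutoff was 50 on roughly 56,000 narratives; the tiny invented corpus below uses a cutoff of 2 so it stays runnable):

```python
# Minimal sketch of the frequency filter: discover terms, then drop
# rare ones. min_df=2 stands in for the study's cutoff of 50.
from sklearn.feature_extraction.text import CountVectorizer

narratives = [
    "fractured leg falling from ladder",
    "strained wrist lifting box",
    "fractured wrist falling from truck",
]

discovered = CountVectorizer().fit(narratives)
print(len(discovered.vocabulary_), "concepts discovered")

reduced = CountVectorizer(min_df=2).fit(narratives)   # frequency filter
print(len(reduced.vocabulary_), "concepts remain after filtering")
```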

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 4. Select predictive concepts

[Process diagram: steps 1-10 as above]

The researchers resolved the issue of concept predictability by using
TreeNet to identify the approximately 60 most predictive concepts out
of the 860 available.

Concept name    Concept importance
LEG             100
LACERATED       99.43
FRACTURE        92.56
STRESS          92.27
EYE             86.56
HERNIA          84.11
TRUCK           82.62
BURN            73.06
LADDER          58
...             ...
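
TreeNet is a commercial product; as a rough open-source stand-in, gradient boosting in scikit-learn exposes a comparable variable importance ranking (the synthetic data below stands in for the 860-concept matrix):

```python
# Minimal sketch: ranking concepts by importance with gradient boosting,
# an open-source analogue of TreeNet. Data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 20)).astype(float)   # 20 stand-in concepts
y = (X[:, 0] + X[:, 3] > 3).astype(int)                # driven by columns 0, 3

model = GradientBoostingClassifier().fit(X, y)
ranking = np.argsort(model.feature_importances_)[::-1]
print("most predictive concept columns:", ranking[:5])
```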

TreeNet Overview

A TreeNet model normally consists of several dozen to several hundred
small trees, each typically no larger than two to eight terminal nodes.
The model is similar in spirit to a long series expansion (such as a
Fourier or Taylor series): a sum of terms that becomes progressively
more accurate as the expansion continues. The expansion can be written as:

F(X) = F0 + β1·T1(X) + β2·T2(X) + ... + βM·TM(X)

where each Ti is a small tree. The first tree in the series contributes
a relatively large amount to the model, while subsequent trees contribute
successively smaller corrections. A model normally consists of 400 to 800
small trees, each typically no larger than four to eight terminal nodes.
The final model is a collection of weighted and summed trees.
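
The successive-correction behaviour can be sketched with scikit-learn's gradient boosting, an open analogue of TreeNet (synthetic data):

```python
# Minimal sketch: error falls as more small trees join the expansion.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0])

model = GradientBoostingRegressor(n_estimators=200, max_leaf_nodes=6)
model.fit(X, y)
mse = [np.mean((y - p) ** 2) for p in model.staged_predict(X)]
print(mse[0], mse[49], mse[199])   # progressively more accurate

```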
TreeNet vs boosting

TreeNet uses gradient boosting to achieve the benefit of boosting
(accuracy) without the drawback of a tendency to be fooled by bad data.

In classical boosting, each tree grown would often be a fully articulated
stand-alone model, with each boosted tree combined with the other trees
via weighted voting or averaging.

In contrast, each TreeNet component is a small tree. Trees are summed
together with small weights on each component.

TreeNet

The MART/TreeNet model is similar in spirit to a very long series
expansion (such as a Fourier or Taylor series): a sum of terms that
becomes progressively more accurate as the expansion continues.

The first tree in the series contributes a relatively large amount to
the model, while subsequent trees contribute successively smaller
corrections.

A model normally consists of 400 to 800 small trees, each typically no
larger than four to eight terminal nodes. The final model is a
collection of weighted and summed trees.

MART or TreeNet

In any predictive modelling situation:
- Y: target or response variable
- X: inputs or predictors
- F(X): values predicted by the model
- Loss function L(Y, F) measures errors between Y and F(X). Typical choices of L(Y, F) are:
  - Squared error: L(Y, F) = (Y - F(X))^2
  - Absolute error: L(Y, F) = |Y - F(X)|
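
A minimal numeric sketch of the two loss functions:

```python
# Minimal sketch: squared and absolute error on invented values.
import numpy as np

y = np.array([1.0, 2.0, 3.0])    # targets Y
f = np.array([1.5, 1.0, 3.0])    # model predictions F(X)

squared_error = (y - f) ** 2     # L(Y, F) = (Y - F(X))^2
absolute_error = np.abs(y - f)   # L(Y, F) = |Y - F(X)|
print(squared_error.sum(), absolute_error.sum())
```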

TreeNet Optimisation Strategy:
- Make an initial guess {F0(Xi)}, for example by assuming that all F0(Xi) are the same
- Compute the negative gradient as the vector of partial derivatives of L with respect to F(Xi) for i = 1, 2, ..., N
- The negative gradient gives us the direction of steepest descent
- Make a step in the steepest-descent direction
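
For squared-error loss this step has a familiar closed form, sketched below: the negative gradient at the current fit is proportional to the residuals.

```python
# Minimal sketch: for L = (Y - F)^2 the negative gradient with respect
# to F(Xi) is 2 * (Yi - F(Xi)): the residuals, up to a constant.
import numpy as np

y = np.array([1.0, 2.0, 3.0])
F0 = np.full(3, y.mean())            # initial guess: all F0(Xi) equal
negative_gradient = 2 * (y - F0)     # direction of steepest descent
print(negative_gradient)
```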

MART or TreeNet

Difficulty with optimisation:
There are only N points; therefore, direct optimisation over N free
parameters would result in dramatic overfitting. A direct step in the
negative gradient direction would amount to freely changing all N
parameters F(Xi). We will have to somehow limit the total number of
free parameters.

MART or TreeNet

Let's limit the number of free parameters to change down to a fixed
small number K.

MART or TreeNet

Naturally, we want the next optimisation step to be as close to the
free steepest-descent direction as possible. This means that we need to
find how to partition the N possibly distinct components of the negative
gradient into K mutually exclusive groups such that the within-group
variation of the components is as small as possible.

But this is exactly building a K-node regression tree with the target
being the components of the negative gradient.
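
A minimal sketch of this step (synthetic gradient values):

```python
# Minimal sketch: a K-leaf regression tree partitions the N gradient
# components into K groups with small within-group variation.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
gradient = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

tree = DecisionTreeRegressor(max_leaf_nodes=8)    # K = 8 groups
tree.fit(X, gradient)
print(tree.get_n_leaves(), "groups of gradient components")
```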

MART Algorithm, for a given estimate of the loss function, K and M:

1. Make an initial guess {F(Xi)} = {F0(Xi)}
2. FOR m = 1 TO M:
   - Compute the negative gradient gm by taking the derivative of the expected loss with respect to F(Xi), evaluated at Fm-1(Xi)
   - Fit a K-node regression tree to the components of the negative gradient; this will partition the observations into K mutually exclusive groups
   - Find the within-node optimal constant hm(Xi) by performing K univariate optimisations of the node contributions to the estimated loss (see the exact formula for hm(Xi) in Hastie, Tibshirani and Friedman)
   - Do the update: {Fm(Xi)} = {Fm-1(Xi)} + hm(Xi)

MART Algorithm. Example for the least-squares loss function (linear regression):

1. Initial guess: {F0(Xi)} = {mean(Yi)}
2. FOR m = 1 TO M:
   - The negative gradient gm is the vector of residuals {Yi - Fm-1(Xi)}
   - Fit a K-node regression tree to the current residuals; this partitions the observations into K mutually exclusive groups
   - For each given node: hm(Xi) = within-node mean of the residuals
   - Update: {Fm(Xi)} = {Fm-1(Xi)} + hm(Xi)
   END FOR
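
A minimal from-scratch sketch of this least-squares loop, using a scikit-learn tree for the K-node fit (synthetic data):

```python
# Minimal sketch of the least-squares MART loop above.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

K, M = 4, 50
F = np.full_like(y, y.mean())             # initial guess: mean(Yi)
for m in range(M):
    residuals = y - F                     # negative gradient for LS loss
    tree = DecisionTreeRegressor(max_leaf_nodes=K).fit(X, residuals)
    F = F + tree.predict(X)               # within-node means of residuals

print("training MSE:", np.mean((y - F) ** 2))
```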

TreeNet: a further guard against overfitting

It turns out that it is beneficial to slow down the learning rate by
introducing a shrinkage parameter v, 0 < v < 1, into the update step:

{Fm(Xi)} = {Fm-1(Xi)} + v·hm(Xi)

The parameters v and M are connected: for the same level of accuracy,
a small v requires a larger M. The best strategy appears to be to set v
to less than 0.1 and choose M by early stopping (Friedman, 2001).
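
Continuing the sketch from the previous slide, only the update step changes (self-contained version below, synthetic data again):

```python
# Minimal sketch: the same least-squares loop with shrinkage v.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)

K, M, v = 4, 500, 0.1                     # small v, larger M (Friedman 2001)
F = np.full_like(y, y.mean())
for m in range(M):
    residuals = y - F
    tree = DecisionTreeRegressor(max_leaf_nodes=K).fit(X, residuals)
    F = F + v * tree.predict(X)           # shrunken update step
```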

MART or TreeNet

Accuracy of MART (Hastie, Tibshirani and Friedman, 2001), on a spam vs
email classification problem:
- CART: 8.7% error rate
- MARS: 5.5% error rate
- MART: 4% error rate

TreeNet advantages:
- Ability to handle data without preprocessing, automatic handling of missing values, resistance to outliers in the predictors or the target variable, speed
- Automatic selection from thousands of candidate predictors
- Focuses on the data that is not easily predictable as the model evolves
- As additional trees are grown, less and less data needs to be processed; in many cases, TreeNet is able to train effectively on 20% of the data
- Resistance to overtraining

TreeNet disadvantages:
- Automatic selection from thousands of candidate predictors
- Interpretability of the prediction

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 5. Derive domain-relevant concepts

[Process diagram: steps 1-10 as above]

The researchers depended on insurance domain expertise for deriving
additional features at step five. This encompassed the grouping and
combining of concepts, so that the most predictive concepts were
combined with those similar in meaning (e.g., stress and anxiety,
laceration and abrasion) to increase frequencies.
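
A minimal sketch of such grouping (the synonym map is invented for illustration):

```python
# Minimal sketch: merging concepts with similar meanings to increase
# their frequencies. The synonym map is invented.
synonyms = {
    "anxiety": "stress",
    "abrasion": "laceration",
}

def enrich(concepts):
    """Map each raw concept to its enriched (grouped) concept."""
    return [synonyms.get(c, c) for c in concepts]

print(enrich(["stress", "anxiety", "laceration", "abrasion", "burn"]))
# ['stress', 'stress', 'laceration', 'laceration', 'burn']
```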

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 6. Discover and interpret any predictive potential of the textual
concepts only

[Process diagram: steps 1-10 as above]

Build a CART predictive model for claims cost, for interpretability of
results, using only the predictive concepts identified by TreeNet
together with the derived concepts.
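
An interpretable CART-style tree can be sketched with scikit-learn (synthetic concept indicators; the concept names are borrowed from the importance table for illustration only):

```python
# Minimal sketch: an interpretable decision tree over selected concepts.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 4)).astype(float)  # concept indicators
y = ((X[:, 0] == 1) & (X[:, 2] == 1)).astype(int)    # costly combination

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["FRACTURE", "EYE", "TRUCK", "BURN"]))
```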

Text Mining Process.
Stage 1. Does textual information have predictive value?

Step 7. Evaluate results

[Process diagram: steps 1-10 as above]

We evaluated the models based on the concepts alone by referring to
gains charts and model precision. The TreeNet model using concepts only
was 75.7% precise on test data.
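
A gains-style check can be sketched by sorting test cases by predicted score and measuring precision in the top decile (all values invented):

```python
# Minimal sketch: precision among the top-scored 10% of test cases.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)              # invented test labels
y_score = y_true * 0.5 + rng.uniform(0, 1, 1000)    # invented model scores

top = np.argsort(y_score)[::-1][:100]               # top 10% by score
print("precision in top decile:", y_true[top].mean())
```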

Text Mining Process.
Stage 2. Does textual information add value to existing injury codings?

[Process diagram: steps 1-10 as above]

We created models with demographic and injury codings information only,
and compared them to the models with added textual information.
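
A minimal sketch of such a comparison (synthetic features and concepts; AUC is used here as an illustrative metric):

```python
# Minimal sketch: model with codings only vs codings plus concepts.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 5))            # demographics + codings
concepts = rng.integers(0, 2, size=(500, 10))       # text-mined concepts
y = ((features[:, 0] > 0) & (concepts[:, 0] == 1)).astype(int)

base = cross_val_score(GradientBoostingClassifier(), features, y,
                       scoring="roc_auc").mean()
both = cross_val_score(GradientBoostingClassifier(),
                       np.hstack([features, concepts]), y,
                       scoring="roc_auc").mean()
print(f"codings only AUC {base:.3f}; with concepts AUC {both:.3f}")
```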

Textual information adds predictive power to the model:

Medical Benefits claim cost for sprains for the next 6 months (top 5%)

[Gains charts: model with no textual information included (left) compared to the model with textual information included (right).]

Some important textual concepts:
BOX, TRUCK, INJURY, STRAINNECK, SOFTTISSUEINJURY, STRAINED_RIGHTWRIST,
ANKLESPRAIN, GROUND, WRISTSTRAIN, STRAINEDSHOULDER

© 2004 PricewaterhouseCoopers. All rights reserved. PricewaterhouseCoopers refers to the network of member firms of
PricewaterhouseCoopers International Limited, each of which is a separate and independent legal entity.
