Академический Документы
Профессиональный Документы
Культура Документы
Louise Francis, FCAS, MAAA, Francis Analytics and Actuarial Data Mining
www.data-mines.com
Louise_francis@msn.com
Why Predictive Modeling?
Advanced methods
for dealing with messy
data now available
Classification
Target variable is
categorical
Prediction
Target variable is
numeric
A Casualty Actuarys Perspective
on Data Modeling
The Stone Age: 1914
Often ad-hoc
Simulation models and numerical methods for variability and uncertainty analysis
At end of 20st century, large consulting firms starts to build a data mining practice
Classical
Decision Trees
Neural Networks
Unsupervised learning
Clustering
Newer Methods
Ensemble
SVM
Deep learning
Text Mining
Predictive Modeling
Predictive
Modeling
Data
Mining/Machine
Learning
Classical GLMs
Classical Statistics: Regression
$10,000
$8,000
Severity
$6,000
$4,000
$2,000
$-
1990 1992 1994 1996 1998 2000 2002 2004
Year
Severity Fitted Y
Copyright StatSoft, Inc., 1984-2011. StatSoft, StatSoft logo, and STATISTICA are trademarks of StatSoft, Inc. 10
Linear Modeling Tools Widely Available: Excel
Analysis Toolpak
Yi i
Regression:
i X '
~ N (0, 2 )
GLM: Y h(i )
h() X '
~ exponential family
h is a link function
Estimating Parameters
>devage<-as.factor((AGE)
>claims.glm<-glm(Claims~devage, family=poisson)
>summary(claims.glm)
Call:
glm(formula = Claims ~ devage, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-10.250 -1.732 -0.500 0.507 10.626
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.73540 0.02825 167.622 < 2e-16 ***
devage2 -0.89595 0.05430 -16.500 < 2e-16 ***
devage3 -4.32994 0.29004 -14.929 < 2e-16 ***
devage4 -6.81484 1.00020 -6.813 9.53e-12 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 2838.65 on 36 degrees of freedom
Residual deviance: 708.72 on 33 degrees of freedom
AIC: 851.38
Data Complexities: Missing Data
Planned fraud
Staged accidents
Abuse
Opportunistic
Exaggerate claim
Both are referred to as questionable claims
1 .0
0 .8
w 1 = -1 0
w 1 = -5
w 1 = -1
w1=1
0 .6 w1=5
w1=10
0 .4
0 .2
0 .0
X
-1 .2 -0 .7 -0 .2 0 .3 0 .8
Assessing Results
Confusion Matrix
ROC Curve
Regression Trees
Minimize Chi-Square
Statistic
C&RT
Binary splits
Gini Index, MSE
Different Kinds of Decision Trees
60000.00
50000.00
Value Treenet Predicted
40000.00
30000.00
20000.00
10000.00
0.00
457
272
2135
0
0
0
0
81
575
860
0
365
0
3489
705
194
1040
1560
1275
Provider 2 Bill
Ensemble Prediction of IME Requested
0.90
0.80
0.70
Value Prob IME
0.60
0.50
0.40
0.30
450
200
275
2540
0
683
0
0
0
0
0
0
0
363
560
821
989
1195
100
1450
1805
11368
Provider 2 Bill
The Fraud Surrogates used as
Dependent Variables
Independent Medical Exam (IME) requested
Special Investigation Unit (SIU) referral
IME successful
SIU successful
DATA: Detailed Auto Injury Claim Database for Massachusetts
Accident Years (1995-1997)
Results for IME Requested
3/20/2015
42 Dependent Variable Problem:
Unsupervised Learning
Insurance companies frequently do not collect
information as to whether a claim is suspected of fraud
or abuse
Even when claims are referred for special investigation
Solution: unsupervised learning
Hierarchical clustering
K-Means clustering
Most frequent is k-means
92
90
88
Loan to Value
86
84
82
80
78
76
74
2001 2002 2003 2004 2005 2006 2007
Year
2500 250
2000 200
1500 150
# Subprime Loans
Avg Size of Loan
1000 100
500 50
0 0
2001 2002 2003 2004 2005 2006 2007
HMDA Data
LISC ZIP Foreclosure Needs Score
Subprime component
Foreclosure component
Disclosure component
http://www.housingpolicy.org/foreclosure-response.html
Normalized
Independent Variable Importance Importance
Denial Percent .027 100.0%
Mean Denial Score .027 99.9%
PctApprove .024 88.5%
ZipCodePopulation .020 72.6%
PctPropNot1-4Fam .019 69.5%
Median Rate Spread .017 61.6%
PInCom .016 60.5%
HouseholdsPerZipcode .015 56.1%
Mean LTV Ratio .014 52.7%
Results of Applying Clustering to HMDA
Data
Table III.5 Means On Variables[1]
Cluster
1 2 3