Regression
What?
Regression is the formulation of a functional relationship between a set of independent or explanatory
variables (Xs) and a dependent or response variable (Y).
Y = f(X1, X2, X3, ..., Xn)
Why?
Types of Regression
Y              Examples                                                                      Technique
Continuous     Sales Volume, Claim Amount, Number of Equipments, % Increase in Growth, etc.  Ordinary Least Squares Regression
Binary (1/0)   Buy/No Buy, Delinquent/Non Delinquent, Growth/No Growth, etc.                 Logistic Regression
Logistic Regression
Model equation
$P_i = \mathrm{Prob}(Y_i = 0) = \dfrac{e^{L_i}}{1 + e^{L_i}}$
Assumption
Parameters to be Estimated
Method of Estimation: Maximum Likelihood
Logistic Methodology
Initial Data Preparation. Data Treatment: Univariate Analysis, Data Hygiene Checks
[Figure: flow from Data to Logistic Model, with Obs-Perf Windows. Labels: Observation Window (Xs), Performance Window; timeline marks: Mar04, Mar05, May05, Oct05.]
Exclusion Criterion
Observations that need to be excluded from the population. Exclusions are necessary to weed out
data bias and to ensure model utility.
e.g. Accounts inactive at the observation point are excluded; accounts charged off
within the performance window are excluded.
e.g. Bad definition: 90+ days past due ever plus major derog (which includes Bankruptcies, Charge offs,
and Repossessions, but excludes Fraud) within the Performance Window. Goods are
defined as anything other than a Bad.
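The exclusion and Bad/Good rules above can be sketched as simple predicates. Every field name below (`active_at_obs`, `max_dpd_in_perf`, `major_derog`) is hypothetical, the records are made up, and reading "90+ ever + major derog" as "90+ days past due or a major derog" is an assumption rather than something the slide states.

```python
# Hypothetical account records; every field name here is illustrative.
accounts = [
    {"id": 1, "active_at_obs": True, "max_dpd_in_perf": 0, "major_derog": None},
    {"id": 2, "active_at_obs": False, "max_dpd_in_perf": 0, "major_derog": None},
    {"id": 3, "active_at_obs": True, "max_dpd_in_perf": 120, "major_derog": None},
    {"id": 4, "active_at_obs": True, "max_dpd_in_perf": 30, "major_derog": "Bankrupt"},
    {"id": 5, "active_at_obs": True, "max_dpd_in_perf": 0, "major_derog": "Fraud"},
]

# Major derogs that count towards Bad; Fraud is deliberately left out.
MAJOR_DEROG = {"Bankrupt", "Charge off", "Repossession"}

def is_excluded(acct):
    """Exclusion example from the text: inactive at the observation point."""
    return not acct["active_at_obs"]

def is_bad(acct):
    """Assumed reading: 90+ days past due OR a major derog in the window."""
    return acct["max_dpd_in_perf"] >= 90 or acct["major_derog"] in MAJOR_DEROG

modelling_population = [a for a in accounts if not is_excluded(a)]
flags = {a["id"]: ("Bad" if is_bad(a) else "Good") for a in modelling_population}
```

Account 5 stays Good here only because Fraud on its own does not trigger the Bad flag under this reading of the definition.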
[Figure: Data Merging. Client Data held in different tables on the server is combined into Merged Data, followed by Data Treatment (Hygiene Checks).]
PHASE I
Variable Creation
[Flowchart: Phase I, Derived Variables Identification, followed by the decision "Business and SME approved?" with YES and NO branches.]
Raw variables can be of a few types: Demographic, Product Related, Behavioral, etc.
New variables are derived from the raw variables populated in the dataset.
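As an illustration of deriving new variables from raw ones, the sketch below builds a utilisation ratio and a payment ratio. The raw field names (`credit_limit`, `open_to_buy`, `max_payment`, `avg_pay_bal`) and both derived definitions are assumptions chosen for the example; the slides do not say which derived variables were actually used.

```python
def derive_variables(raw):
    """Return a dict of illustrative derived variables for one account."""
    limit = raw["credit_limit"]
    otb = raw["open_to_buy"]
    avg_bal = raw["avg_pay_bal"]
    return {
        # Share of the credit limit currently in use.
        "utilisation": (limit - otb) / limit if limit > 0 else None,
        # Largest payment relative to the average balance.
        "payment_ratio": raw["max_payment"] / avg_bal if avg_bal > 0 else None,
    }

example = derive_variables(
    {"credit_limit": 5000, "open_to_buy": 1250, "max_payment": 400, "avg_pay_bal": 1600}
)
```

Returning `None` when a denominator is zero keeps the missing case explicit, so it can be handled during data treatment instead of raising an error.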
Derived Variables
PHASE II
[Flowchart: Phase II, Dummy Creation & Correlation. Steps: Derived Variables Identification, Fineclassing, Coarse Classing, a YES / NO decision, Dummy Creation, Dummy Correlation.]
Dummy Creation
The fineclassing and coarse classing procedure helps identify the dummies to be created.
Dummying is the process of assigning a binary (1/0) indicator to each group of attributes in each predictive
characteristic.
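One possible reading of the fineclassing, coarse classing, and dummy creation steps is sketched below. It assumes fine classes are equal-sized rank bins, that adjacent fine classes are merged when their bad rates are within a tolerance, and that one dummy is created per coarse group. The data, the tolerance, and all function names are illustrative.

```python
def fine_class(values, n_bins):
    """Assign each value to one of n_bins roughly equal-sized bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def bad_rates(bins, bads, n_bins):
    """Bad rate of each fine class."""
    totals = [0] * n_bins
    bad_counts = [0] * n_bins
    for b, y in zip(bins, bads):
        totals[b] += 1
        bad_counts[b] += y
    return [bc / t for bc, t in zip(bad_counts, totals)]

def coarse_class(rates, tolerance):
    """Merge adjacent fine classes whose bad rates differ by at most tolerance."""
    groups = [0]
    for k in range(1, len(rates)):
        similar = abs(rates[k] - rates[k - 1]) <= tolerance
        groups.append(groups[-1] if similar else groups[-1] + 1)
    return groups  # groups[k] is the coarse group of fine class k

def make_dummies(coarse_of_obs, n_groups):
    """One 0/1 indicator per coarse group for every observation."""
    return [[1 if g == k else 0 for k in range(n_groups)] for g in coarse_of_obs]

# Made-up characteristic (age) and bad flags (1 = Bad).
ages = [22, 25, 28, 31, 35, 38, 42, 47, 51, 56, 60, 65]
bads = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0]

fine = fine_class(ages, 4)
rates = bad_rates(fine, bads, 4)
groups = coarse_class(rates, tolerance=0.05)
coarse_of_obs = [groups[b] for b in fine]
dummy_rows = make_dummies(coarse_of_obs, groups[-1] + 1)
```

In a regression, one of the dummies for each characteristic is usually left out as the reference group, because the full set of indicators sums to one and would be perfectly collinear with the intercept.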
Logistic Procedure
PHASE III
Model Development
Multicollinearity Check
Significance of Variables
Hosmer Lemeshow Test
Concordance
KS & Rank Ordering
Divergence Index Testing
Clustering Checks
Model Validation
Business Validation
Score Calibration
What is Multicollinearity?
Multicollinearity is a phenomenon in which a linear relationship exists among a set of explanatory variables.
Why is Multicollinearity a problem?
Multicollinearity inflates the standard errors of the parameter estimates, making the estimates unreliable.
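A small sketch of one way to detect a linear relationship between two candidate variables: the Pearson correlation and the variance inflation factor (VIF). For the two-variable case the VIF reduces to 1 / (1 - r^2); with more variables r^2 is replaced by the R-squared from regressing one variable on all the others, which this sketch does not implement. The data and variable names are made up, and the slides do not say which diagnostic was actually used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_variables(xs, ys):
    """VIF when the model contains only these two explanatory variables."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)

# Made-up data: open-to-buy tracks the credit limit closely; age does not.
credit_limit = [1000, 2000, 3000, 4000, 5000]
open_to_buy = [900, 2100, 2900, 4200, 4800]
age = [45, 23, 36, 52, 29]

vif_limit_otb = vif_two_variables(credit_limit, open_to_buy)
vif_limit_age = vif_two_variables(credit_limit, age)
```

A VIF close to 1 indicates little linear overlap, while a large VIF signals that one of the two variables carries almost no independent information.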
Parameter            DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept             1     0.6010           0.1423           17.8279       <.0001
d1_cons_cd_grt_1      1     1.0016           0.1326           57.0378       <.0001
d3_max_cdlevel        1    -1.0768           0.2338           21.2164       <.0001
d1_Payment_method     1     1.6529           0.1449          130.1012       <.0001
d3_OTB_jun04          1     0.6993           0.1176           35.3416       <.0001
d2_crlimit_may04      1     0.3627           0.1156            9.8523       0.0017
d2_avg_pay_bal        1     0.4720           0.1084           18.9700       <.0001
d2_max_payment        1     0.2424           0.1110            4.7691       0.0290
d4_age                1     0.4141           0.1094           14.3331       0.0002
Chi-square value: for each explanatory variable, the Wald chi-square value indicates the level of
significance, i.e. the impact of the independent (explanatory) variable on the dependent variable.
The p-value cut-off should be decided in discussion with the business. Ideally the p-value is below 0.0001.
However, for a smaller population the cut-off could be relaxed to p-value < 0.05 or p-value < 0.1.
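The Wald chi-square for a single parameter is the squared ratio of the estimate to its standard error, compared against a chi-square distribution with 1 degree of freedom. The sketch below recomputes the d2_max_payment figures (estimate 0.2424, standard error 0.1110) from the parameter table; the function names are illustrative.

```python
import math

def wald_chi_square(estimate, std_error):
    """Wald statistic for one parameter: (estimate / standard error) squared."""
    return (estimate / std_error) ** 2

def p_value_1df(chi_square):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2.0))

wald = wald_chi_square(0.2424, 0.1110)  # d2_max_payment
p_value = p_value_1df(wald)
```

Up to rounding of the inputs, this gives the tabulated chi-square of 4.7691 and p-value of 0.0290, so this variable would pass a 0.05 cut-off but not a 0.0001 cut-off.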
Group   Total   Good = 1 Observed   Good = 1 Expected   Good = 0 Observed   Good = 0 Expected
1         924                 756              753.27                 168              170.73
2        1002                 918              920.21                  84               81.79
3        1058                 997             1002.64                  61               55.36
4         981                 947              945.00                  34               36.00
5         884                 859              860.25                  25               23.75
6         923                 905              904.36                  18               18.64
7         931                 921              919.35                  10               11.65
8         786                 778              779.30                   8                6.70
9         734                 731              729.17                   3                4.83
10        953                 950              948.44                   3                4.56
Chi-Square: 2.6543   DF: 8 (number of groups minus 2)   Pr > ChiSq: 0.9541
The discrepancies between the observed and expected numbers of observations in these groups are
summarized by the Pearson chi-square statistic, which is then compared to a chi-square distribution
with t degrees of freedom, where t is the number of groups minus 2.
For a robust model we must fail to reject the null hypothesis that the model fits the data. Hence, the higher the p-value, the better the model fit.
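The Hosmer-Lemeshow statistic can be recomputed from the observed and expected counts of the ten groups. The sketch below sums the Pearson chi-square contributions of the Good = 1 and Good = 0 cells and evaluates the upper-tail probability with a closed form that is valid only for an even number of degrees of freedom; the function names are illustrative.

```python
import math

good_obs = [756, 918, 997, 947, 859, 905, 921, 778, 731, 950]
good_exp = [753.27, 920.21, 1002.64, 945.00, 860.25,
            904.36, 919.35, 779.30, 729.17, 948.44]
bad_obs = [168, 84, 61, 34, 25, 18, 10, 8, 3, 3]
bad_exp = [170.73, 81.79, 55.36, 36.00, 23.75, 18.64, 11.65, 6.70, 4.83, 4.56]

def pearson_chi_square(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_sf_even_df(x, df):
    """Upper-tail chi-square probability; valid only when df is even."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

hl_stat = pearson_chi_square(good_obs, good_exp) + pearson_chi_square(bad_obs, bad_exp)
hl_df = len(good_obs) - 2  # number of groups minus 2
hl_p = chi_square_sf_even_df(hl_stat, hl_df)
```

Up to rounding of the expected counts, this gives the reported chi-square of 2.6543 and p-value of 0.9541.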
Percent Concordant: 79.0
Percent Discordant: 19.1
Percent Tied: 1.9
Pairs: 3627468
Concordance is used to assess how well a scorecard separates the good and bad accounts in the
development sample.
The higher the concordance, the larger the separation of scores between good and bad accounts.
The concordance ratio is a non-negative number which theoretically lies between 0 and 1.
Concordance Determination:
Among all pairs formed from one observation with value 0 and one with value 1 for the dependent variable,
concordance is the % of pairs in which the probability assigned to the observation with value 1 is greater
than that assigned to the observation with value 0.
The percentage of concordant pairs should be greater than 60.
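The pair-counting definition above can be written directly as a double loop. This is a brute-force sketch suited to small samples; the probabilities and outcomes below are made up and the function name is illustrative.

```python
def concordance(probs, outcomes):
    """Count concordant, discordant and tied pairs of (1, 0) observations."""
    ones = [p for p, y in zip(probs, outcomes) if y == 1]
    zeros = [p for p, y in zip(probs, outcomes) if y == 0]
    concordant = discordant = tied = 0
    for p1 in ones:
        for p0 in zeros:
            if p1 > p0:
                concordant += 1
            elif p1 < p0:
                discordant += 1
            else:
                tied += 1
    pairs = len(ones) * len(zeros)
    return {
        "pairs": pairs,
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "percent_concordant": 100.0 * concordant / pairs,
    }

# Made-up predicted probabilities and actual outcomes.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
outcomes = [1, 1, 0, 1, 0, 0]
result = concordance(probs, outcomes)
```

The number of pairs is the count of 1s times the count of 0s; with the 8762 Goods and 414 Bads reported in this document that product is 3,627,468.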
[Figure: Lorenz Curve. Series: Development and Random; both axes run from 0% to 100%.]
The Kolmogorov-Smirnov (KS) statistic is defined as the maximum absolute difference between the cumulative % of Goods and the
cumulative % of Bads.
The KS statistic should not be less than 20. The higher the KS, the better the model.
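A sketch of the KS calculation on grouped data. As illustrative input it reuses the observed Good and Bad counts of the ten Hosmer-Lemeshow groups in this document, ordered from the riskiest group to the safest; a KS computed on groups is a coarser approximation of one computed score by score, and the function name is an assumption.

```python
good = [756, 918, 997, 947, 859, 905, 921, 778, 731, 950]
bad = [168, 84, 61, 34, 25, 18, 10, 8, 3, 3]

def ks_statistic(good_counts, bad_counts):
    """Largest gap between cumulative % of Bads and cumulative % of Goods."""
    total_good, total_bad = sum(good_counts), sum(bad_counts)
    cum_good = cum_bad = 0
    largest_gap = 0.0
    for g, b in zip(good_counts, bad_counts):
        cum_good += g
        cum_bad += b
        gap = abs(cum_bad / total_bad - cum_good / total_good)
        largest_gap = max(largest_gap, gap)
    return 100.0 * largest_gap

ks = ks_statistic(good, bad)
```

The groups must be sorted by score before accumulating; otherwise the cumulative percentages do not trace the two distributions being compared.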
Group   Bad         Good
1       [missing]    915
2       [missing]    912
3       [missing]    910
4       13           905
5       19           898
6       30           888
7       30           888
8       61           856
9       78           840
10      167          750
Total   414         8762
ranking: SATISFACTORY
sat_rank: all
Models must rank order completely across the development as well as the validation samples.
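Rank ordering can be checked by confirming that the bad rate moves in one direction from one score group to the next. The sketch below tests for a strictly decreasing bad rate, using the observed Bad counts and group totals of the ten Hosmer-Lemeshow groups in this document as illustrative input; requiring strict rather than weak monotonicity is an assumption.

```python
bad = [168, 84, 61, 34, 25, 18, 10, 8, 3, 3]
total = [924, 1002, 1058, 981, 884, 923, 931, 786, 734, 953]

def rank_orders(bad_counts, totals):
    """True if the bad rate falls strictly from each group to the next."""
    rates = [b / t for b, t in zip(bad_counts, totals)]
    return all(earlier > later for earlier, later in zip(rates, rates[1:]))

development_ok = rank_orders(bad, total)
```

The same check would be repeated on the validation sample, since the requirement applies to both.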
Good    _FREQ_      ave      variance
(all)    41338   752.67     4070.44
0          856   654.55    10578.1225
1        40482   754.75     3725.8816
Divergence: 1.4038
t-value: -28.398
p-value: <0.0001
For a robust model we need to reject the null hypothesis that Goods and Bads have the same mean score. Hence, the lower the p-value, the better the model.
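The two unlabeled figures reported with the score means and variances (1.4038 and -28.398) can be reproduced with a common definition of the divergence index and an unequal-variance two-sample t statistic. That match suggests, but does not confirm, that these are the formulas behind the slide; the function names are illustrative.

```python
import math

def divergence(mean_good, var_good, mean_bad, var_bad):
    """Squared difference of means over the average of the two variances."""
    return (mean_good - mean_bad) ** 2 / ((var_good + var_bad) / 2.0)

def t_statistic(mean_a, var_a, n_a, mean_b, var_b, n_b):
    """Unequal-variance two-sample t statistic for mean_a - mean_b."""
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Score statistics: Good = 1 (n = 40482) and Good = 0 (n = 856).
div = divergence(754.75, 3725.8816, 654.55, 10578.1225)
t = t_statistic(654.55, 10578.1225, 856, 754.75, 3725.8816, 40482)
```

The negative sign of the t statistic only reflects the order of subtraction (Bads minus Goods); its magnitude is what drives the p-value.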
[Figure: score distribution of Good and Bad accounts. x-axis: SCORE, from 200 to 1000; y-axis: % of population, from 0.0 to 8.0.]
A good model should not have significant clustering of the population at any particular score; the
population must be well spread across the score range.
Ideally the clustering should be as low as possible. A rule of thumb is to keep the share of the population
at any single score within 5-6%.
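A clustering check can be as simple as finding the score that holds the largest share of the population and comparing that share with the rule of thumb. The scores below are made up and the function name is illustrative; a real check would run on the full scored population.

```python
from collections import Counter

def max_cluster_share(scores):
    """Return the most common score and its share of the population in %."""
    counts = Counter(scores)
    score, count = counts.most_common(1)[0]
    return score, 100.0 * count / len(scores)

# Made-up scores for ten accounts.
scores = [610, 640, 640, 655, 700, 700, 700, 720, 745, 780]
worst_score, worst_share = max_cluster_share(scores)
too_clustered = worst_share > 6.0  # upper end of the 5-6% rule of thumb
```

With only ten made-up accounts the share is necessarily large; the threshold is meaningful only on a population big enough for single scores to hold a few percent each.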
[Figure: Validation Re-run. Development Sample (n1 obs) and Validation Sample (n2 obs).]