
Model Lifecycle

Ajit Ghanekar
Model Life Cycle

Model Development → Model Validation → Model Assessment → Model Monitoring → (back to Model Development)
Model Development
Model Development – Process

1. Understanding of Business Pains and Available Data

2. Identification of Objective and Expected Outcome

3. Formulation of Modeling Approach and Data Requirement

4. Identification of Analysis Tool and I/O Requirement
Model Development – Difficulties

• Voluminous Data

• Missing Data Elements

• Lack of Data Insight

• Inter-Correlated Characteristics

• & Many More…


Model Development – SEMMA Methodology

Sample → Explore → Modify → Model → Assess


Sample – Rationale

• Manageable Data for Model Development

• Supposed to Represent the Population

• Enough to Develop the Model on a Sample

• Model Developed on the Sample is Valid for the Population

Sample – Techniques

Popular Sampling Techniques…

• Simple Random Sampling


▫ With Replacement (SRSWR)
▫ Without Replacement(SRSWOR)

• Stratified Sampling
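The three sampling techniques above can be sketched in Python with the standard library. This is a minimal illustration; the population, strata, and sample size are hypothetical.

```python
import random

population = list(range(1000))  # hypothetical population of unit IDs
random.seed(42)                 # fixed seed for reproducibility

# Simple Random Sampling With Replacement (SRSWR):
# each draw is independent, so a unit may appear more than once.
srswr = random.choices(population, k=100)

# Simple Random Sampling Without Replacement (SRSWOR):
# every unit appears at most once in the sample.
srswor = random.sample(population, k=100)

# Stratified sampling: sample each known class (stratum) separately,
# in proportion to its share of the population.
strata = {"A": list(range(0, 700)), "B": list(range(700, 1000))}
sample_size = 100
stratified = []
for label, members in strata.items():
    n = round(sample_size * len(members) / len(population))
    stratified.extend(random.sample(members, n))

print(len(srswr), len(srswor), len(stratified))
```

Proportional allocation draws 70 units from stratum A and 30 from stratum B; for a thinly represented stratum the thumb rules on the next slide suggest over-sampling instead.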
Sample – Thumb Rules

• Choose the sample size ‘N’ sufficiently large so that the sampling error is minimized.

• SRSWR is the default sampling method.

• If the population has known classes (categories), use stratified sampling.

• If a particular class is thinly represented, use an over-sampling technique for that class and adjust the inference accordingly.
Sample – Data Partitioning

• Avoids Over-fitting of the Model

• Allows Validating a Model

• Allows Comparison of Models

Sample – Data Partitioning

• Divide the Sample Randomly into Three Parts


• Suggested Division:

Data Type        Purpose         Suggested
Training Data    Build Model     60%
Validation Data  Validate Model  30%
Testing Data     Compare Model   10%
Explore – Rationale

• Provides Preliminary Insights into Data

• Preliminary Insights include…


▫ Causal Relationships
▫ Correlated characteristics
▫ Central Tendency
▫ Dispersions
▫ & Many More

Explore – Techniques

• Statistical Charts
▫ Histogram
▫ P-P Plot/ Q-Q Plot
▫ Box Chart

• Preliminary Data Analysis


▫ Mean/Median/Mode
▫ Symmetry/Kurtosis
▫ Variance
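The preliminary data analysis measures listed above are available in Python's standard `statistics` module. A minimal sketch, with a made-up characteristic:

```python
import statistics

values = [12, 15, 11, 19, 14, 15, 30, 13, 15, 16]  # hypothetical characteristic

mean = statistics.mean(values)           # central tendency
median = statistics.median(values)
mode = statistics.mode(values)
variance = statistics.pvariance(values)  # dispersion

# A quick symmetry check: when the mean sits well above the median,
# the characteristic is right-skewed (here, pulled up by the value 30).
print(mean, median, mode, variance)
```

Such a gap between mean and median is exactly the kind of preliminary insight that motivates a normalization step in the Modify phase.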
Modify – Rationale

• Imputation

• Standardization

• Normalization

• Data Reduction
Modify – Techniques

• Imputation
▫ Missing Data Analysis

• Standardization
▫ Standardize data

• Normalization
▫ Log Transform
▫ Logit Transform
▫ Probit Transform

• Data Reduction

▫ Principal Component Analysis


▫ Canonical Correlation
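Two of the Modify techniques, standardization and the log transform, can be sketched with the standard library. The values are hypothetical; the point is the shape of each transformation.

```python
import math
import statistics

values = [3.0, 7.0, 20.0, 55.0, 150.0]  # hypothetical right-skewed characteristic

# Standardization: rescale to zero mean and unit standard deviation,
# so characteristics measured on different scales become comparable.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
standardized = [(v - mu) / sigma for v in values]

# Log transform: a common normalization for right-skewed data;
# it compresses large values and spreads out small ones.
logged = [math.log(v) for v in values]

print([round(z, 2) for z in standardized])
print([round(x, 2) for x in logged])
```

The logit and probit transforms follow the same pattern but apply to proportions in (0, 1) rather than to positive magnitudes.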
Model – Rationale

• Establishes a causal relationship between independent characteristics and the target

• Can preserve the relationship in a precise and concise mathematical function

• Provides a unique measurement scale in the form of a weighted sum of characteristics, where the weights are data dependent

• The model may satisfy one of these objectives:

▫ Classification
▫ Prediction
▫ Forecasting
Model – Techniques

• For Classification
▫ Classification Trees
▫ Logistic Regression
▫ Neural Network

• For Prediction
▫ Regression Trees
▫ Linear Regression
▫ Neural Network

• Forecasting
▫ ARIMA Models

▫ Smoothing Techniques
 Exponential Smoothing
 Holt Winters Smoothing
 Moving Average Smoothing
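Of the smoothing techniques listed, simple exponential smoothing is the easiest to sketch: each smoothed value is a weighted average of the current observation and the previous smoothed value. The series and the smoothing constant below are hypothetical.

```python
# Simple exponential smoothing: s[0] = x[0], then
# s[t] = alpha * x[t] + (1 - alpha) * s[t-1].
def exponential_smoothing(series, alpha):
    smoothed = [series[0]]  # initialize with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

series = [10, 12, 13, 12, 15, 16, 18]  # hypothetical time series
result = exponential_smoothing(series, alpha=0.5)
print([round(s, 2) for s in result])
```

A larger alpha tracks recent observations more closely; Holt-Winters extends this idea with separate smoothing equations for trend and seasonality.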
Model Validation
Model Validation – Rationale

• Check for model Accuracy

• Check for Over-fitting of Model

• Check for Model Validity across Population

• Check for Predictability of Model


Model Validation - Process
Training Data:
1. Compute the predicted outcome based on the established decision rule
2. Compare the predicted outcome with the historical outcome
3. Measure gain over a random model
4. Measure efficiency of the model
5. Check for unconsumed information

Validation Data:
1. Compute the predicted outcome based on the established decision rule
2. Compare the predicted outcome with the historical outcome
3. Measure efficiency of the model
4. Measure gain over a random model


Model Validation - Techniques

• Checking Accuracy of Model


▫ Confusion Matrix
▫ Mean Squared Error (MSE)
• Checking Efficiency of Model
▫ R² and Adjusted R²
• Checking for Unconsumed Information
▫ Using Error Plots
• Gain over Random Model
▫ Lift Chart
Model Validation – Error Plots
Model Validation – Lift Chart
Model Validation – Confusion Matrix

                   Predicted Positive   Predicted Negative
Actual Positive    True Positive        False Negative
Actual Negative    False Positive       True Negative
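From the four cells of the confusion matrix, the model's accuracy (and related rates) follows directly. A minimal sketch with hypothetical counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fn = 80, 20   # actual positives: correctly / incorrectly classified
fp, tn = 10, 90   # actual negatives: incorrectly / correctly classified

accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of correct classifications
sensitivity = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)                # true negative rate

print(accuracy, sensitivity, specificity)
```

Reporting sensitivity and specificity alongside accuracy matters when one class is thinly represented, since accuracy alone can look high for a model that ignores the rare class.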


Model Assessment & Deployment
Model Assessment & Deployment

• Multiple Competing Models for Same problem


• Needs a Common Metric for Comparison
• Best Model is Considered the Champion Model
• Best Model is Used for Scoring on Current Data
• Model is Deployed as…
▫ Web Service
▫ PMML Code
▫ C / SAS / R Code
▫ ETL Job
Metric for Model Comparison

• Test Data is Used for Model Comparison


• Test Data is Scored using various Models
• The Following Metrics are Compared for All Models
▫ Lift Achieved /Net Gain
▫ Accuracy of Models
▫ Adjusted R²
• Best Model is Determined Based on the Above Metrics
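Adjusted R² is the comparison metric worth spelling out, because it penalizes raw R² for the number of predictors and so lets models of different sizes compete fairly. A minimal sketch with hypothetical scores:

```python
# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1),
# where n = number of observations and p = number of predictors.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical competing models scored on the same test data:
model_a = adjusted_r2(r2=0.82, n=100, p=5)   # fewer predictors
model_b = adjusted_r2(r2=0.84, n=100, p=20)  # more predictors
print(round(model_a, 4), round(model_b, 4))
```

Here model A wins the comparison despite its lower raw R², because model B pays a larger penalty for its extra predictors.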
Model Monitoring
Model Monitoring

• Model Performance is Not Static


• Model Performance is Constantly Changing
• Model Performance Depends on…
▫ Changing Population
▫ Changing Characteristics
• The Population is Always Changing
