Dimitris Bertsimas
Adam Mersereau
Geetanjali Mittal
June 2003
TABLE OF CONTENTS

1 INTRODUCTION
1.1
1.1.1 Panel Data
1.1.2
1.1.3 Unilever Database
1.2
1.3 Data Provided
2.3 Results
2.4 Conclusions ............................. 11
2.5 ......................................... 12
2.6 ......................................... 13
3 ........................................... 14
3.1 Data Details ............................ 15
3.2 ......................................... 16
3.2.1 Data Cleaning ......................... 17
3.2.2 Data Aggregation ...................... 17
3.3 Data Issues ............................. 18
3.3.1 ....................................... 18
3.3.2 ....................................... 19
3.3.3 ....................................... 20
3.4 Predictive Modeling ..................... 20
3.4.1 Choice of Models ...................... 20
3.4.2 ....................................... 22
3.5 ......................................... 25
3.5.1 ....................................... 25
3.5.2 ....................................... 26
Massachusetts Institute of Technology
4 CLUSTERING ANALYSIS ....................... 26
4.3.1 ....................................... 29
4.3.2 Details of Cluster 1 .................. 30
4.3.3 Details of Cluster 2 .................. 30
4.3.4 Details of Cluster 3 .................. 31
TABLE OF FIGURES

Figure 1: Graphic Representation of MVC
Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively ... 10
Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively ... 10
Figure 10: Lift Curves for Logistic Regression Models Based on Fall 2000 Survey Data ... 14
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network ... 21
Figure 19: Graphical Representation of Logistic Regression Model for Gorton Fillets ... 23
Figure 21: Comparative Lift Curves for Models with and without GQM Scores ... 24
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models ... 25
Figure 29: Comparative Lift Curves for Cluster based and Overall Models ... 33
INTRODUCTION
This report describes the data analysis project undertaken in collaboration with Unilever through the MIT Sloan School of Management Center for eBusiness and presents its conclusions. In this document we trace our interactions with Unilever, describe the data made available to us, present various analyses and results, and summarize the overall lessons learned in the course of the project.
Unilever has been a pioneer in mass-marketing, which focuses on widely broadcast advertising messages.
This marketing approach is at odds with new trends towards targeted marketing and CRM (Consumer
Relationship Management), and Unilever is interested in investigating how the new marketing philosophy
applies to the packaged consumer goods industry in general and to Unilever in particular. As Unilever sells
its products not directly to consumers but through a variety of retail channels, they have indirect contact
with the end consumers. Thus the application of CRM (defined by Unilever as "Consumer" Relationship Management, rather than the more standard "Customer" Relationship Management, to mark this distinction) is less obvious in their business. In particular, Unilever recognizes three clear obstacles to the application of CRM ideas in the packaged goods industry:
Data mining expertise and experience are new to packaged goods companies.
The packaged goods industry marketing efforts focus on brands rather than on consumers.
Unilever employs data mining in the area of CRM. This effort is partly undertaken in the Relationship Marketing Innovation Center (RMIC), a group that transcends Unilever's individual brands. Unilever's project with MIT has been part of these efforts.
Especially in light of the second bullet point above, the MIT team was asked to help evaluate the potential of data mining technology for consumer targeting in Unilever's business. Unilever was to make available to the MIT team representative samples of the data at Unilever's disposal, and the MIT team was to analyze this data and research new data mining methods for using it in a targeted marketing framework.
1.1
This section summarizes our understanding of the data Unilever has available for analysis, as well as Unilever's previous data mining efforts with this data. A primary data source is a Unilever database of consumers. The database contains information on individuals and households who have interacted with Unilever in some fashion in the past. The data includes demographic and geographic information as well as self-reported usage and survey information.
1.1.1 Models
Based on information available on a subset of Unilever consumers, two models were fit, the so-called Demographic and Golden Question models, for predicting whether a consumer is an MVC (Most Valuable Consumer) as measured by their dollar spend on Unilever's collection of brands. The concept of MVC and the two models are described in more detail in the next paragraph.
Survey responses
The database contains no transactional purchase information, although it does provide self-reported brand
usage data, model predictions, and contact history information. The accuracy of the self-reported brand
usage data varies by brand.
1.2
At a project kickoff meeting in Greenwich, CT in August, 2001, we discussed several research directions of
interest to both MIT and Unilever:
Investigate methods for predicting, characterizing, and clustering individual consumer usage of
products.
Develop dynamic logistic regression models, that is, prediction methodologies based on logistic regression that evolve in time.
Develop optimization-based logistic regression subset selection methodologies. Specifically, how can
we use such a methodology to design a questionnaire of maximum value and minimum length?
Although this list includes a number of items that may be of interest to Unilever in the future, the majority
of our efforts were focused on the first of these topics. This decision was largely guided by the data made
available to us, which includes no information on consumer profitability and contains limited time stamp
information. In subsequent sections of the report, we will revisit the data limitations and the role of the data
in guiding our analysis.
1.3
Data Provided
We received several data files from Unilever on a CD dated November 30, 2001. The files and our
understanding of them are as follows:
UNIFORM.TXT: A large data file sampled from the Unilever Database. The extraction was
performed via uniform random sampling from the most complete and reliable records from the
database.
STRATIFIED.TXT: A large data file sampled from the Unilever Database. This file is similar to
UNIFORM.TXT except in the method of sampling. Stratified sampling was used and stratification
was done with respect to the Golden Question model scores.
MIT LAYOUT #1 AXCIOM.XLS: A list of variable names and layout for the data in
UNIFORM.TXT and STRATIFIED.TXT.
LAYOUT DESCRIPTION #1.XLS: A list of variable names along with a brief description of
many of the variables.
MIT LAYOUT #2 INFOBASE.XLS: An enumeration of InfoBase data entries, which form some
of the variables in UNIFORM.TXT and STRATIFIED.TXT. Many of the variables described in MIT
LAYOUT #2 INFOBASE.XLS do not appear in the data files.
MASTER.XLS: A table matching brand ids to brand names, franchises, and product categories.
Individual Layout: This set of variables includes individual and household ID codes, individuals' names, and basic demographic variables for age and gender. Other demographic variables, such as Marital Status, Employment Status, Occupation Type, and Ethnic Code, have a significant number of missing values.
Household Layout: This set of variables includes data specific to a household (note that a
household may include multiple individuals), and is a result of the merging of the Unilever
Database with data from third-party sources.
Third-party data offers information on household vehicle ownership, the distribution of genders and ages in
the household, occupations, home ownership, credit card ownership, and membership in a number of
lifestyle clusters (e.g. traditionalist, home and garden). Most of these fields have at most 20% missing
data.
Demographic Model: This set of variables gives results of the Demographic logistic regression model for prediction of MVC. The most important field here is the "Model Score", which gives values between 0 and 1, the model's prediction. The field "Model Score Group" denotes the decile in which the model score falls. Deciles are defined according to the Demographic model results.
Golden Question Model: This set of variables gives results of the Golden Question logistic regression model for prediction of MVC. The most important field here is the "Model Score", which gives values between 0 and 1, the model's predictions. The field "Model Score Group" groups the observations into deciles according to the Golden Question model results. The Demographic and Golden Question model scores are positively correlated; we have measured a correlation of 0.6 on the UNIFORM.TXT data.
Block Group: This set of variables describes the block the household belongs to. A block is an
address-based segment of the population. Thus, there are likely multiple households per block.
These variables essentially provide demographic information about the geographical neighborhood of the
household. Variables describe the urban/rural breakdown, the ethnic breakdown, the distribution of home
valuation, employment breakdown, education level, etc.
Brand Usage Layout #01-#20: For each individual, there are 20 sets of brand usage variables.
Thus each individual is associated with at most 20 brands. The brands are a mixture of Unilever
and non-Unilever brands.
A total of 259 brands appear in the UNIFORM.TXT dataset, with 96 chosen by at least 100 individuals, 55
chosen by at least 1000 individuals, and 17 chosen by at least 10000 individuals. The average individual
reports interaction with 9.5 distinct brands. In our analysis we concentrated on the file UNIFORM.TXT
instead of the file STRATIFIED.TXT, because it seemed appropriate to use a representative sample of the
underlying data set.
Our initial efforts were towards developing methods for predicting usage of individual brands using
demographic variables and cross-purchase information as inputs. We were particularly interested in
methodological innovations for using these different sets of variables to make useful predictions.
At this stage of the project we chose to concentrate on the prediction of usage for individual brands due to
the following reasons:
Prediction of brand usage is of obvious use in targeted marketing
A lack of information on consumer profitability and limited time stamp information eliminated
several of the topics mentioned in section 1.2
The problem is of general interest to Unilever and other packaged goods companies that have
large amounts of data over a wide range of products.
2.1
Data Cleaning
In efforts to clean the data set for analysis, we first performed a brand aggregation using the Franchise and
Category information provided in the MASTER.XLS file. The reason for this was to eliminate the
distinction between very similar products.
For example, we judged that different flavors and versions of the same product should appear the same from the perspective of a company-level analysis. The new brand labels can in most cases be mapped one-to-one to the original notion of brands. After eliminating those brands with reported usage by fewer than 100 individuals, we were left with 76 unique brands for analysis in the UNIFORM.TXT data set. Figure 2 shows the ten brands with the highest consumer response and Figure 3 shows the distribution of reported usage among the 76 brands. Note that the patterns of reported usage in this data are not representative of the actual sales of these products.
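The aggregation and filtering step described above can be sketched as follows. The brand labels, the mapping dictionary, and the threshold of 2 are illustrative stand-ins; the report used the Franchise/Category columns of MASTER.XLS and a threshold of 100 individuals.

```python
from collections import Counter

# Hypothetical mapping of product variants to a franchise-level brand label,
# mirroring the Franchise/Category aggregation from MASTER.XLS.
franchise_of = {
    "Ragu Traditional": "Ragu PastaSauce",
    "Ragu Chunky": "Ragu PastaSauce",
    "Dove White Bar": "Dove BarSoap",
}

# Hypothetical raw usage records: (consumer_id, brand_label) pairs.
usage = [(1, "Ragu Traditional"), (1, "Dove White Bar"), (2, "Ragu Chunky")]

# Aggregate variants, then count usage records per aggregated brand.
counts = Counter()
for consumer, brand in usage:
    counts[franchise_of.get(brand, brand)] += 1

MIN_USERS = 2  # the report used a threshold of 100 individuals
kept = {b for b, n in counts.items() if n >= MIN_USERS}
print(kept)
```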
[Figure 2: Brands with the top ten consumer response counts, ranging from approximately 9,367 down to 4,302 responses. Figure 3: Distribution of reported usage among the 76 brands.]
For each of these 76 products, we developed a binary indicator indicating the reported usage for all the
consumers, with 1 representing a positive usage response and 0 representing no indication of usage.
In addition to the 76 brands, we focused on the following four demographic variables in our prediction efforts:
Region
Income Category
Household Size
Presence of Children
These variables were chosen because they included relatively few missing values and because they
generally represent important individual and household indicators which act as proxies for underlying
factors that influence purchase behavior.
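The binary usage indicators described earlier can be constructed as follows. The brand names and consumer IDs below are illustrative placeholders for the cleaned 46,307-consumer data set.

```python
# Hypothetical reported-usage sets per consumer; brand names are examples only.
brands = ["Caress BodyWash", "Classico PastaSauce", "Dove BarSoap"]
reported = {
    1: {"Caress BodyWash"},
    2: {"Dove BarSoap", "Classico PastaSauce"},
}

# One 0/1 indicator per brand per consumer: 1 = positive usage response,
# 0 = no indication of usage.
matrix = {c: [1 if b in used else 0 for b in brands]
          for c, used in reported.items()}
print(matrix)
```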
The resulting cleaned data thus contained four demographic variables and 76 binary reported usage variables for each of the 46,307 consumers. A representation of this data is as follows:

[Figure: Rows correspond to consumers 1 through 46,307; columns give the individual's demographic variables (Region, Income Category, Household Size, Child) followed by the binary brand usage variables.]
2.2
Logistic regression has a long history of use in targeted marketing for linking demographic variables and purchase behavior, while collaborative filtering is an approach that has developed with the rise of e-commerce for predicting a consumer's preferences based on the preferences of similar consumers. Our interests in this study were in investigating the relative merits of these two methodologies and of combining their results in various ways.
Logistic regression establishes a relationship between predictor variables and a response variable via the logistic function. In particular, we model a consumer's probability of response p in terms of a set of predictor variables x1, …, xn as follows:

    p = exp(Σi βi xi) / (1 + exp(Σi βi xi))
Logistic regression has often been used in marketing contexts, and has the advantages that the model is
interpretable and can be fit using efficient methods. It can be susceptible to overfitting, however, when
there are many predictor variables.
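Once a model's intercept and coefficients have been fit, scoring a consumer is a direct evaluation of the formula above. The coefficients below are illustrative placeholders, not the fitted Unilever models.

```python
import math

def logistic_probability(x, beta, intercept=0.0):
    """Response probability p = exp(z) / (1 + exp(z)) with
    z = intercept + sum_i beta_i * x_i."""
    z = intercept + sum(b * v for b, v in zip(beta, x))
    return math.exp(z) / (1.0 + math.exp(z))

# Score a consumer with binary predictors and hypothetical coefficients.
p = logistic_probability([1, 0, 1], beta=[1.61, 0.59, 0.51], intercept=-2.47)
```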
Collaborative filtering models an individual consumers response as a weighted average of the responses of
other consumers in the database, where the weighting is typically according to a similarity measure among
consumers. The approach is appropriate in applications where there are sufficiently large numbers of
products to allow computation of a useful similarity measure among consumers. The collaborative filtering
approach has proved useful particularly in internet recommender systems, such as the recommendation engines employed by Amazon and Netflix. Collaborative filtering is not as easily interpretable as logistic regression, but has the advantages that it is conceptually simple and is adept at handling many variables representing choices among a large number of possibilities.
In our implementation, we compute similarities among consumers based on reported usage information only. Since this is binary data, we require a suitable similarity measure. We have made use of the so-called Jaccard similarity. Given the usage vectors of two individual consumers, define a to be the number of products the two consumers have in common, b to be the number of products unique to consumer 1, and c to be the number of products unique to consumer 2. Then the Jaccard similarity is given by the ratio a/(a+b+c). The Jaccard measure thus takes into account products the two consumers have in common, but ignores items that neither has chosen. As our reported usage data is relatively sparse, there are likely to be many products chosen by neither. With the Jaccard similarity measure, these products will not inflate the similarity as they would with, say, a correlation measure. We also made use of a weighting scheme that weights rare products more heavily than common ones. This modification is based on the observation that selection of a rare product is more informative than selection of a common product. Such inverse frequency weighting is common in collaborative filtering systems.
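The Jaccard measure and the similarity-weighted prediction can be sketched as below. This is a minimal unweighted version; the inverse-frequency weighting mentioned above is omitted, and the vectors are illustrative.

```python
def jaccard(u, v):
    """Jaccard similarity of two binary usage vectors: a/(a+b+c), where
    a = products in common, b and c = products unique to each consumer.
    Products chosen by neither consumer are ignored."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    return a / (a + b + c) if a + b + c else 0.0

def collab_predict(target, others, product_idx):
    """Collaborative-filtering prediction: a similarity-weighted average of
    the other consumers' usage of the given product."""
    weights = [jaccard(target, o) for o in others]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * o[product_idx] for w, o in zip(weights, others)) / total

# Predict consumer [1,1,0]'s usage of product 2 from two other consumers.
pred = collab_predict([1, 1, 0], [[1, 1, 1], [1, 0, 0]], product_idx=2)
```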
Initial tests of these methods indicated that we might be able to produce more accurate predictions by using
the predictions of a logistic regression model and a collaborative filtering model as inputs to a third model.
We considered three methods for combining these models: a weighted average of the two results, another
logistic regression taking the two results as inputs, and an optimization model that computes a linear
discriminant in the space of the logistic and collaborative model outputs. Upon further testing, we decided
that the logistic regression method for combining models exhibits the best performance.
In the end, we examined several models for predicting individual product usage from the demographic and
other product usage data:
RAND: Predictions are random numbers between 0 and 1. This is less a prediction method than a baseline for comparison.
LOGIT: A logistic regression model using the four demographic variables as predictors.
COLLAB: The collaborative filtering model using reported usage data from other consumers.
COMB_LOGIT: A logistic regression model using the LOGIT and COLLAB model predictions as predictor variables.
FULL_LOGIT: A logistic regression model using the demographic variables as well as the usage variables for products other than the one being modeled. We implemented two versions of the FULL_LOGIT model: one using all available variables, and one using subset selection methods to choose an accurate model with only 5 variables.
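A combiner of the COMB_LOGIT kind can be sketched as a two-input logistic regression fit by gradient ascent on the log-likelihood. A library fit would normally be used in practice; the hand-rolled loop and the synthetic component-model scores below are for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_combiner(logit_scores, collab_scores, labels, lr=0.5, steps=2000):
    """Fit p = sigmoid(b0 + b1*logit + b2*collab) by gradient ascent on the
    log-likelihood; returns the coefficients (b0, b1, b2)."""
    b0 = b1 = b2 = 0.0
    n = len(labels)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for s1, s2, y in zip(logit_scores, collab_scores, labels):
            err = y - sigmoid(b0 + b1 * s1 + b2 * s2)  # gradient of log-lik.
            g0 += err
            g1 += err * s1
            g2 += err * s2
        b0 += lr * g0 / n
        b1 += lr * g1 / n
        b2 += lr * g2 / n
    return b0, b1, b2

# Illustrative component-model predictions and true usage labels.
b0, b1, b2 = fit_combiner([0.1, 0.8, 0.3, 0.9],
                          [0.2, 0.7, 0.1, 0.95],
                          [0, 1, 0, 1])
```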
2.3
Results
We present our results for the various methods in the form of lift curves. To compute a lift curve for a given product, we use one of the methods described above to assign each consumer in the validation set an estimated probability of usage. We then rank the consumers in order of their predicted usage and ask: "If we contacted the top M consumers in this list, how many actual users would we find?" Repeating this question for every choice of M yields the lift curve. Random predictions should give roughly a straight diagonal line, while effective prediction algorithms will result in lift curves as high as possible above this diagonal. Higher lift indicates better performance of a predictive model.
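The lift-curve computation just described can be sketched directly; the scores and labels below are illustrative.

```python
def lift_curve(scores, actual):
    """For each M, the number of actual users among the top-M consumers
    when ranked by predicted usage probability (descending)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, curve = 0, []
    for i in order:
        hits += actual[i]
        curve.append(hits)
    return curve

# Illustrative: 5 consumers, 2 actual users ranked highly by the model.
curve = lift_curve([0.9, 0.2, 0.7, 0.4, 0.1], [1, 0, 1, 0, 0])
print(curve)
```

A random ranking would, on average, trace the diagonal; the curve above rises steeply because the model ranks both actual users first.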
Figure 5: Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
Figure 5 includes lift curves for the models used to predict usage of the products Caress Body Wash and
Classico Pasta Sauce, noting that these are indicative examples of results obtained for several products.
The first set of lift curves illustrates the results for the COLLAB, LOGIT, and COMB_LOGIT methods. We observe that while the LOGIT model using the four demographic variables does better than random prediction, the COLLAB model using usage information of other products does significantly better. Combining the two models does only slightly better than the collaborative filtering approach.
The following set of lift curves adds the FULL_LOGIT method:
Figure 6: Improved Lift Curves for Caress Body Wash and Classico Pasta Sauce Respectively
The FULL_LOGIT model based on all the variables performs even better than the COMB_LOGIT model. Notably, the logistic regression model using only a few selected variables also exhibits surprisingly good performance. This observation motivated a closer look at the specific coefficients of the variables in these parsimonious models. The tables below give the coefficients for the two 5-variable models responsible for the lift curves above:
FULL_LOGIT Coefficients (Caress Body Wash)
  Intercept   -2.47
               1.61
               0.59
               0.51
               0.46
               0.45

FULL_LOGIT Coefficients (Classico Pasta Sauce)
  Intercept   -2.89
               1.92
               1.31
               0.52
              -2.11
              -2.53
2.4
Conclusions
Our analysis of the overall UNIFORM.TXT data set led us to some intriguing conclusions, and motivated a
closer look at the data set and its sources.
Methodologically, while combination models are interesting, logistic regression, perhaps with subset
selection, is a sufficiently powerful method for analyzing this data.
Our primary finding was that in this data set, reported brand usage variables are considerably more
powerful than the limited set of demographics we looked at. In particular, usage of a given product can be
predicted surprisingly accurately using usage data from a small number of closely related products.
These pronounced and powerful trends encouraged us to take a closer look at the underlying data. After
discussions with our counterparts at Unilever, it was revealed that much of the data we were working with
was aggregated from two consumer surveys. Indeed, one of the surveys asked questions regarding personal
washes, while the other did not. Also, one of the surveys included questions regarding pasta sauces, while
the other did not. Clearly, the differences between the two surveys largely explained the high correlation
we were observing among usages of similar products.
Thus, our most significant conclusion from this analysis was that our models were achieving impressive
results, but were likely modeling the data collection technique rather than the underlying phenomenon.
Unfortunately, such a model may generalize poorly to panel data or to real world situations. Such survey
data, in an aggregated format, may serve as a poor proxy for purchase behavior.
The results of this analysis motivated more work to understand the source of the data. Future efforts were focused on identifying portions of the data that were as uniform as possible, and performing prediction and clustering using relatively simple modeling techniques such as logistic regression.
2.5
The previous analysis motivated a closer look at the data. At this point Unilever provided us with copies of two questionnaires, the responses to which comprised a large portion of the data in the Unilever database extract. Using timestamps that indicated the dates of collection for the various responses, we were able to obtain a more detailed understanding of the data. This structure is indicated in the following figure:
[Figure 9: Structure of the data extract. Rows correspond to Consumer 1 through Consumer 46,307; columns give the demographic variables, the DM and GQM model scores, and the usage variables, with shaded blocks covering groups of 39 brands, 11 Unilever brands, and 70 brands.]
In figure 9, rows indicate data available for a single consumer, while columns indicate different variables in
the data extract. Some demographics and model scores are reported for each consumer. In the section
marked as Usage Variables, the shaded blocks indicate the presence of self-reported usage data. After
some investigation, we believe that roughly half the consumers had Spring 2000 survey responses and no
Fall 2000 survey responses, while the other half had Fall 2000 responses and no Spring 2000 responses. A
subset of consumers from both groups also had some additional brand usage responses, which we hereafter refer to as "Non-Survey" data. We were advised that this Non-Survey data largely represented coupon redemption responses.
In subsequent analyses, we concentrated on sets of brands and consumers whose usage data fell uniformly in one of these blocks. The individual surveys reported on a relatively small set of brands, while the Non-Survey data included a much larger set of brands. For this reason, we focused our efforts on the Non-Survey data. In what follows, we briefly discuss our limited modeling efforts on the Fall 2000 survey data, and then provide a lengthy discussion of an extensive analysis of the Non-Survey data.
2.6
While our efforts were focused on the Non-Survey data, we also performed an exploratory analysis of the
Fall 2000 survey data and associated demographic variables. We extracted a small sample of Fall 2000
data that included 2500 consumers.
The Fall 2000 survey data includes response data on a limited number of brands. The Unilever brands
represented are Suave, Lipton, Breyers, Wishbone, Dove, Lever2000, Caress, and Snuggle. This limited
amount of response information constrained the scope of analysis we could perform, and hence we tried the following three indicative tasks:
Predicting simultaneous Caress and Snuggle usage given all other available variables.
The third task was an attempt to use predictive modeling to identify cross-selling opportunities.
We used only the demographic variables described above: region of residence, income level, household size, and presence of children. We tried several modeling methodologies including nearest neighbor methods, logistic regression, discriminant analysis, classification trees, and neural networks.
These methodologies will be described in more detail in a subsequent section. We concluded that the
choice of modeling algorithm seemed to make insignificant difference in the quality of the results obtained.
The best-performing models predicted 100% non-usage, which gave a 32% misclassification rate for
Caress, and a 19% misclassification rate for Caress / Snuggle cross-sells. These results indicated that it is
difficult to make predictions given the limited number of variables.
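The all-non-usage baseline mentioned above misclassifies exactly the actual users, so its misclassification rate equals the positive rate in the data. The labels below are synthetic, constructed to match the 32% Caress usage rate quoted in the text.

```python
def misclassification_rate(predicted, actual):
    """Fraction of predictions that disagree with the true labels."""
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

# Illustrative labels with a 32% usage rate, as with Caress in the sample.
actual = [1] * 32 + [0] * 68
baseline = [0] * 100  # predict 100% non-usage
rate = misclassification_rate(baseline, actual)
print(rate)
```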
[The figure plots the number of responses (0-350) against the number of consumers targeted (0-1,500), showing lift curves and reference lines for Caress and for Caress & Snuggle.]
Figure 10: Lift Curves for Logistic Regression Models based on Fall 2000 Survey Data
On constraining the models to generate a reasonable number of usage predictions, the best models achieved a 35% misclassification rate for Caress and a 21% misclassification rate for Caress / Snuggle cross-sells. Figure 10 shows lift curves from the logistic regression models. We observe that the model making use of demographic variables only gave insignificant lift. The models based on both demographic and brand usage information achieved more lift. The most useful predictor variables seemed to be other brands of soap, namely Dove and Lever2000. While this may reflect real consumer usage patterns, it may also be due to the design of the survey, which included separate sections for soap and for other products. As with the other analyses in this report, the question remains as to whether these results are transferable to real-world usage patterns.
Here we report on a subset of the original UNIFORM.TXT data, which we refer to as the Non-Survey data. Our decision to analyze this section of the data was guided by its internally homogeneous composition and by the fact that it included information on a wide range of products.
Data Details: A detailed description of the NonSurvey data extraction and composition
highlighting some of the inherent features.
Data cleaning and aggregation: A detailed description of treatment of missing values or outliers
and transformation of some of the variables that we conducted.
Data credibility issues: A few of our observations suggesting the possibility of artificial data
structure and data bias issues.
Predictive modeling: An in-depth analysis to predict brand usage and explore the Most Valuable Consumer concept using various modeling techniques based on demographics as well as Golden Question Model and Demographic Model scores.
Cluster analysis: A description of efforts to segregate consumers into distinctive clusters and to use this information to enhance predictive efforts.
3.1
Data Details
The layout of the data provided by Unilever was explained above. The Non-Survey data has been extracted from the file UNIFORM.TXT, which contains information on 46,307 consumers. Each usage entry in UNIFORM.TXT bears a date stamp indicative of the time of data collection. A majority of the entries bear one of two time stamps: 15th May, 2000 and 15th November, 2000. Based on the quantity of usage data associated with these two dates and the fact that the data with these time stamps seems to correspond to the surveys provided to us, we assumed that usage entries with these time stamps correspond to the Spring and Fall survey data. The remaining data is what we analyze in this section; we refer to it as the Non-Survey data. As per information provided by Unilever, we believe the Non-Survey data represents product promotion coupon responses.
The Non-Survey data consists of 14,492 consumers. For each consumer we have demographic information and reported usage for seventy brands, some of which are non-Unilever brands. In addition, Golden Question Model and Demographic Model scores have been provided for each consumer. The diagram below is a graphical representation of the data layout and structure.
[Figure: Layout of the Non-Survey (coupon) data. Rows correspond to individual consumers; columns give the demographic variables, the GQM and DM scores, and the usage variables for 70 brands.]
A large majority of the consumers report usage of very few brands, which leads to the sparse nature of the Non-Survey dataset.
[Figures: a histogram of the number of consumers by number of reported responses, confirming that most consumers report few brands, and a breakdown of the brands by category: Body Items, Food Items, Detergents, Shampoo, Body Wash, and Bar Soap.]
3.2
In contrast to our initial efforts, in the analysis of the Non-Survey data we sought to incorporate a wide range of demographic and model score variables for a more comprehensive study. This necessitated numerous decisions regarding data set preparation, such as the choice of demographics and the treatment of missing values and outliers.
New England
Middle Atlantic
South Atlantic
Mountain
Pacific
This aggregation was necessary to reduce computational and time complexity. In addition, we believed that a region-wise approach would be more insightful.
3.3
Data Issues
Prior to extensive predictive modeling efforts, we examined the data to extract information that might be useful in subsequent study. We observed several instances suggesting inconsistencies and possible biases in the data. Some of these issues relate to the means and methods of data collection and interpretation, while others relate to possible artificial data structure induced by the aggregation of dissimilar data from multiple sources. Presented below are a few cases in point.
3.3.1 Conflicts in Computation of Household_Layout variables
Household_Layout variables include certain variables that supply gender-wise and age-wise information on the presence of household members. Examples include IB_males_0_2, IB_females_3_5, etc. (henceforth referred to as household-member variables). A positive response indicates the presence of a household member in that age and sex group. This information was not extracted from the consumer directly, but rather derived from a third-party data source. The accuracy of the data varied depending on its source. The following situations led us to doubt the accuracy of some of the information contained in these variables:
The data set also contains a variable called IB_house_size. The sum of unit responses across all the aforementioned household-member variables should not exceed the value indicated by IB_house_size. Yet we observed that there was no correlation between the aggregate obtained from the household-member variables and the house size variable. We tried various combinations, with the inclusion or exclusion of unknown gender type members, yet we failed to achieve a match between the two sets of variables.
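A consistency check of this kind can be sketched as below. Only two member fields are shown for brevity; the real layout contains one field per age-and-gender group, and the records here are synthetic.

```python
# Hypothetical records with household-member counts and the IB_house_size
# field; field names follow the layout described above.
records = [
    {"IB_males_0_2": 1, "IB_females_3_5": 1, "IB_house_size": 2},
    {"IB_males_0_2": 2, "IB_females_3_5": 2, "IB_house_size": 3},  # conflict
]

MEMBER_FIELDS = ["IB_males_0_2", "IB_females_3_5"]

def inconsistent(record):
    """Flag records where summed member counts exceed IB_house_size."""
    members = sum(record[f] for f in MEMBER_FIELDS)
    return members > record["IB_house_size"]

flags = [inconsistent(r) for r in records]
print(flags)
```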
Similarly, there was no reconciliation between the values represented by the variables …

[Figure: Reported response counts by brand (0 to 6,000): Dove BarSoap, Dove BodyWash, Caress BarSoap, Ragu PastaSauce, lvr2k Bodywash, Mealmk StirFry, and Suave BarSoap.]
3.4
Predictive Modeling
For the reasons listed above, our prediction efforts have been focused on predicting reported usage of individual brands. We also briefly looked at modeling MVC. The following summarizes our sequence of predictive analysis:
Fitting various algorithms to the data set to identify a common predictive modeling methodology that leads to the best results over a wide range of brands.
Conducting Most Valuable Consumer (MVC) based analysis to capture the information contained in GQM or DM scores for prediction of MVC.
Drawing inferences and conclusions from model results and suggesting means for obtaining improved results.
power from an architecture of interconnected computational units. They also allow predictive modeling of more than one variable in a single iteration. Although capable of high accuracy, ANNs suffer from the drawback that their models can be hard to interpret; they are thus more appropriate when predictive accuracy matters more than interpretability. They also require very large training data sets, are susceptible to overtraining, and are more computationally intensive than other algorithms.
Classification Trees: A classification tree classifies data based on a set of simple rules that can be organized in the form of a decision tree. While these models are not as intricate as ANNs, they are considerably more interpretable; decision trees can be easily translated into business strategies, so they have recently become more popular in business contexts. This technique is also applicable to data sets with missing values, thereby considerably reducing data cleaning efforts.
k-Nearest Neighbors: This methodology is in many ways similar to the collaborative filtering methodology discussed earlier. Usage predictions for a given variable are computed as a weighted average of the usage of other users. Its main advantage lies in the fact that it is effective when there are many response variables. However, this simplicity may come at the expense of predictive accuracy.
Logistic Regression: Logistic regression, discussed earlier, has long been used to model preferences in the marketing context. Its benefits include a record of success over a wide range of prediction problems, the interpretability of the models, and the speed of the fitting algorithms. The output of a logistic regression model is a set of posterior probabilities, for which we can vary the threshold according to the desired level of predictive aggressiveness.
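The lift curves used throughout this report can be computed directly from these posterior probabilities: consumers are ranked by predicted probability, and we measure what fraction of actual responders is captured in each top percentile. A minimal sketch (the function name and toy data are illustrative; the original analysis was carried out in SAS Enterprise Miner):

```python
def lift_curve(scores, labels, percentiles=(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)):
    """Fraction of all positive responders captured in the top p% of
    consumers, ranked by predicted probability (descending)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    points = []
    for p in percentiles:
        top_k = order[: int(len(order) * p / 100)]
        captured = sum(labels[i] for i in top_k)
        points.append((p, captured / total_pos))
    return points

# Toy example: model scores and actual brand usage for 10 consumers.
scores = [0.91, 0.85, 0.70, 0.64, 0.55, 0.42, 0.33, 0.21, 0.15, 0.08]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
curve = lift_curve(scores, labels, percentiles=(20, 50, 100))
# Captures half the responders in the top 20% of consumers.
```

A model with no lift would track the diagonal baseline (top p% captures p% of responders); the steeper the curve above the baseline, the more useful the model for targeting.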
Figure 16: Dove Body Wash: Logistic Regression, Tree, Neural Network
After fitting all the aforementioned algorithms to numerous brands and comparing the results, we chose logistic regression as the common model for all subsequent predictive analysis. We observed that
predictive results from logistic regression were at least as good as, or better than, the results obtained using other models for a wide range of products. Figure 16 illustrates the superior predictive performance of logistic regression compared to classification trees and k-nearest neighbors (appears as User in the figure) in the case of Dove body wash.
a) For a majority of brands, noticeable lift was observed. Typical lift curves over a wide range of products are similar to the Caress body wash lift curve shown above. This indicates that the demographic and model score variables contain considerable predictive power in this data set and can be used for making brand usage predictions.
b) The most important predictive variables emerged to be the Golden Question Model (GQM) and
Demographic Model (DM) scores. There are two possible explanations for this. The first is that
MVC may be an important summary statistic that captures brand usage. The second is that the
Golden Question Model takes a number of brand usage variables as inputs. Thus using the GQM
scores to further predict the same usage variables can lead to artificially inflated results that may
be misleading.
c) The following demographics are seen to have a significant presence in models for a wide range of brands: Age, Length of Residence, Gender, Marital Status, Household Members, and Region Code.
d) Brands for which the response rate was below 10% of the total consumer base led to poor predictive results, as in the case of Lipton Tea. This can be attributed to the insufficient number of data entries available for training and validating the predictive models.
[Figure: logistic regression model coefficients, including Intercept, Model Score DM, Gender, Age, and Home Renter]
Figure 21: Comparative Lift Curves for Models With and Without GQM Scores
24
Massachusetts Institute of Technology
3.5 MVC Based Analysis
As documented previously, Unilever devoted considerable effort to identifying the Most Valuable Consumers (MVC) at the brand and company-wide level. Given the Golden Question Model scores and Demographic Model scores for each consumer, we were keen to enhance our predictive efforts using this information. The exhaustive logistic regression analysis had already convinced us that GQM and DM scores could be highly instrumental in predicting brand usage. A two-pronged approach was followed in the MVC based analysis:
3.5.1 Stratified Prediction Models
Previous modeling attempts were focused on fitting a single model for each brand to the entire data set and
generating posterior probabilities using these models. Since GQM scores seemed to contain information on
MVC, we stratified the data set based on model scores into three categories. Firstly a separate training data
set was created. The partition was based on the consumer distribution such that each stratum contained roughly one third of the consumers. All consumers with a GQM score greater than 0.679 were categorized as high GQM
consumers. All consumers with a GQM score between 0.679 and 0.342 were categorized as medium GQM
consumers and the rest were categorized as low GQM consumers. Separate logistic regression models were
built for each of these three training data sets. Based on the strata cutoffs for the training set, validation data
set was also divided into three corresponding data sets. Models fit on the respective training data sets were
applied to the validation sets to compute posterior probabilities for each consumer in all the categories.
Thus we generated two sets of posterior probabilities for each consumer: one from the overall model and one from the stratified analysis. Lift charts were drawn for both predictive efforts to compare performance.
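The stratification step can be sketched as follows, using the tercile cutoffs of 0.679 and 0.342 quoted above; the function names and the treatment of scores exactly at a cutoff are our own assumptions:

```python
from collections import defaultdict

def gqm_stratum(score, high_cutoff=0.679, low_cutoff=0.342):
    """Assign a consumer to a GQM stratum by the tercile cutoffs
    described in the text (boundary handling is an assumption)."""
    if score > high_cutoff:
        return "high"
    if score >= low_cutoff:
        return "medium"
    return "low"

def stratify(consumers):
    """Split (consumer_id, gqm_score) pairs into the three strata.
    A separate model would then be fit to each stratum's records."""
    strata = defaultdict(list)
    for consumer_id, score in consumers:
        strata[gqm_stratum(score)].append(consumer_id)
    return dict(strata)

consumers = [("c1", 0.81), ("c2", 0.50), ("c3", 0.12), ("c4", 0.70)]
strata = stratify(consumers)
```

The same cutoffs derived from the training set are then applied to the validation set, so that each validation consumer is scored by the model of his or her stratum.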
Figure 22: Comparative Lift Curves for GQM Stratified and Overall Models
Figure 22 shows lift charts for GQM score-based stratified and overall models for Dove bar soap, which are representative of other brands as well. It is clear that both methodologies lead to similar results. We
noted that there was an insignificant difference among the three separate models created for the stratified data sets, and that these models were each close to the overall model as well. This claim was further substantiated by the details of the model coefficients in each case: for a given brand, the important variables in the overall model and the stratified models were the same, with only slight variation in the coefficients. Thus we concluded that stratification of the data set according to GQM score does not lead to improved results.
3.5.2 Predicting GQM Strata for Each Consumer
We have discussed the reasons that prevented us from thoroughly investigating the concept of MVC given the nature and type of data available to us. We found no direct indications of MVC in the Unilever data set, but rather GQM- and DM-based estimates of MVC. Nevertheless, we spent effort using GQM scores as a target for our predictive models. Instead of predicting exact GQM scores, we tried to predict, based on demographics, whether a consumer belonged to the high, medium, or low GQM stratum. Our intention was to generate a representative MVC model using Non-Survey data based on demographics only. Logistic regression was used to arrive at results. GQM strata definitions were kept the same as defined above. Each consumer was assigned a stratum number of one, two, or three depending on whether he or she belonged to the high, medium, or low GQM stratum. Predictive models were then computed to predict the stratum class for each consumer based on demographic variables only.
A very low degree of success was achieved in predicting the GQM score stratum to which each consumer belonged. For this reason, and given the unclear applicability of such a model, we did not find it appropriate to pursue this line of analysis further.
CLUSTERING ANALYSIS
4.1 Clustering Background
Clustering places objects into groups or clusters suggested by the data. The objects in each cluster tend to
be similar to each other in some sense, and objects in different clusters tend to be dissimilar. The
observations are divided into clusters so that every observation belongs to at most one cluster. Clustering
not only reveals inherent data characteristics by identifying points of similarity or dissimilarity in the data
set, but also aids in understanding data structure issues. If dissimilar data sets are aggregated to produce a bigger
data set, clustering of the aggregated set might reveal underlying data sets. One of the added advantages of
clustering analysis is that it can be applied to a data set with missing values as well.
Aside from data cleaning and data structure issues, clustering results can also be of interest by identifying
groups of consumers with similar traits, who may be targeted in a similar fashion. Additionally, we
explored the possibility of improved prediction by modeling the individual clusters separately and then
aggregating results.
Clustering can be performed according to various methods. For our analysis we chose Ward's Method, which is somewhat more sophisticated than the popular but simple k-means method. Ward's method is an iterative method that seeks to minimize the statistical spread of observations within a cluster. In this method, the distance between two clusters is the ANOVA sum of squares between the two clusters, summed over all the variables. At each iteration, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from the previous iteration. Clustering can be performed according to various measures of spread or distance among the data points. During our study we used the Least Squares method because it is fastest when dealing with large data sets.
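Ward's merge criterion described above can be sketched as follows. This is an illustrative implementation of the textbook formula for the increase in within-cluster sum of squares when merging two clusters, not the internals of the SAS Enterprise Miner implementation:

```python
def ward_distance(cluster_a, cluster_b):
    """Ward distance between two clusters of points: the increase in
    within-cluster sum of squares if they were merged, which equals
    (na * nb) / (na + nb) * ||mean_a - mean_b||^2."""
    na, nb = len(cluster_a), len(cluster_b)
    dims = len(cluster_a[0])
    mean_a = [sum(p[d] for p in cluster_a) / na for d in range(dims)]
    mean_b = [sum(p[d] for p in cluster_b) / nb for d in range(dims)]
    sq_dist = sum((mean_a[d] - mean_b[d]) ** 2 for d in range(dims))
    return na * nb / (na + nb) * sq_dist

# Two singleton clusters two units apart: merge cost = (1*1/2) * 4 = 2.0
cost = ward_distance([[0.0, 0.0]], [[2.0, 0.0]])
```

At each iteration of the algorithm, the pair of clusters with the smallest such distance is merged, so the within-cluster spread grows as slowly as possible.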
During the clustering analysis, SAS Enterprise Miner computes an Importance Value between 0 and 1 for each variable in the data set. This represents the worth of the given variable in the formation of clusters: as the data is split into clusters, the Importance Value of each variable indicates the extent to which that variable was influential in the splitting process. An Importance Value of 0 indicates that the variable was not used as a splitting criterion, while an Importance Value of 1 indicates that the variable had the highest worth in the splitting criteria.
One of the most important tools that helps us interpret individual clusters is the Input Mean Chart for each cluster. This allows a comparison of the variable means for selected clusters to the overall variable means. The input means are normalized using a scale transformation function:
y = (x - min(x)) / (max(x) - min(x))
For example, assume 5 input variables y1, ..., y5 and 3 clusters C1, C2, C3. Let the input mean for variable Yi in cluster Cj be represented by Mij. Then the normalized mean, or input mean, SMij becomes:
SMij = (Mij - min(Mi1, Mi2, Mi3)) / (max(Mi1, Mi2, Mi3) - min(Mi1, Mi2, Mi3))
The input means are normalized to fall in a range from 0 to 1. For each cluster, input means are ranked based on the magnitude of the difference between the input means for the selected cluster(s) and the overall input means. The variables with the highest spreads typically best characterize the selected cluster(s). Input means that are very close to the overall means are not very helpful in describing the unique attributes of consumers within the selected cluster(s).
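The normalization above can be sketched directly from the formula; the function name is our own, and mapping a constant row (where the max equals the min) to zeros is our own convention, not necessarily that of SAS Enterprise Miner:

```python
def normalized_input_means(M):
    """Scale each variable's per-cluster means to [0, 1]:
    SM_ij = (M_ij - min_j M_ij) / (max_j M_ij - min_j M_ij).
    M is a list of rows, one row of cluster means per input variable."""
    normalized = []
    for row in M:
        lo, hi = min(row), max(row)
        if hi == lo:
            # Variable identical across clusters: carries no contrast.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(m - lo) / (hi - lo) for m in row])
    return normalized

# Means of one input variable across three clusters C1, C2, C3.
sm = normalized_input_means([[2.0, 5.0, 8.0]])
```

After this transformation, every variable is on a common 0-to-1 scale, so spreads can be compared across variables when ranking which variables best characterize a cluster.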
4.2
We performed a clustering analysis to identify distinctive groups of consumers based on their brand usage
responses only. All the seventy brands were used as input variables for the clustering analysis. The
clustering analysis led to a few noteworthy insights and enhanced our understanding of the data. Analysis
was performed over the entire Non-Survey data set. We repeated the clustering several times for various randomly sampled subsets of the Non-Survey data to ensure the generality and accuracy of the results. The repeated runs of the clustering algorithm over these various subsets gave similar results, and seemed to indicate an obvious clustering into 3 groups. Results shown below are for the clustering of the entire Non-Survey data set.
% Data Points | Cluster Std Deviation | Cluster Description
38.0 | 0.246 | Frequent buyers of food items; infrequent buyers of soaps/washes
8.7 | 0.269 | Frequent buyers of detergents
53.3 | 0.194 | Frequent buyers of soaps/washes; infrequent buyers of food items
Percentage Data Points: percentage of data points in each cluster (total number of data points: 8,608).
Cluster Standard Deviation: average standard deviation of each point in the cluster from the cluster mean, which indicates the within-cluster spread of the data.
The cluster descriptions are based on the input means plots and importance values described previously. Another tool used in the process was the decision tree created by the clustering analysis, which is somewhat similar to a classification tree: a set of simple variable-based rules is generated that approximates the cluster boundaries. The following section sheds light on methods for deciphering the crucial features that distinguish the clusters, and discusses the similarity traits within each cluster.
4.3
[Figure: Importance Values of brand variables in the clustering analysis, on a scale from 0 to 1]
Based on the consumer attributes revealed by the clustering analysis, we conclude that the clusters found may be approximating the underlying data sets that were aggregated. We found it intriguing to
observe that the cluster of individuals with higher propensity for purchasing body hygiene products should
have a significantly lower tendency for purchasing food products and that the cluster of consumers with a
high propensity for purchasing food items should have a lower propensity for purchasing body hygiene
products. Also, the few outlying individuals who are neither frequent purchasers of personal wash nor food items have been clustered separately. Possibly, information on body hygiene brands like Dove and food items like Ragu Pasta Sauce was collected separately and aggregated into one data set, which would explain the clusters that we observe. As mentioned several times before, body soap and wash products
overwhelm the usage data. Therefore the clustering analysis also segregates frequent and infrequent buyers
of these items into separate clusters. Given the likelihood that the data available to us has been aggregated from various sources, and that differences in source may be guiding the clustering analysis, extrapolation of the results to true usage behavior may be inappropriate. This means that the clusters we
have generated may accurately reflect groups in the data but not be indicative of the true consumer usage
pattern.
4.5 Cluster-Wise Predictive Modeling
Regardless of the reason for the emergence of data clusters, they may potentially lead to improved prediction results. Here we investigate whether fitting a separate model for each cluster can lead to better results. If accuracy is improved, it suggests that separate clusters of consumers should preferably be modeled individually.
Once we identified the possibility that prior data aggregation was creating artificial structure, we explored a cluster-wise predictive methodology to obtain better results. The objective was to conclude whether agglomeration of data sets was desirable, or whether data for separate brands and product categories should be treated separately.
For this purpose, each data cluster was treated as a separate data set and logistic regression models were built for each. Predictive probabilities were thus generated for each data point according to the cluster it belonged to. Probabilities for each of these data entries were also generated through an overall model fitted to the entire agglomerated data set. Lift curves were generated for both analyses and compared.
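The cluster-wise fitting step can be sketched as follows. Here `fit_model` is a hypothetical stand-in for whatever fitting routine is used (logistic regression in our case); the grouping logic, not the model, is the point of the sketch:

```python
from collections import defaultdict

def fit_per_cluster(records, fit_model):
    """Group (cluster_id, features, label) records by cluster and fit
    one model per cluster, returning {cluster_id: model}."""
    by_cluster = defaultdict(list)
    for cluster_id, features, label in records:
        by_cluster[cluster_id].append((features, label))
    return {cid: fit_model(rows) for cid, rows in by_cluster.items()}

# Toy stand-in: the "model" is just the positive-response rate per cluster.
records = [
    (1, [0.2, 0.7], 1), (1, [0.1, 0.6], 0),
    (2, [0.9, 0.3], 1), (2, [0.8, 0.2], 1),
]
rate = fit_per_cluster(records, lambda rows: sum(y for _, y in rows) / len(rows))
```

To score a new consumer, one would first look up (or predict) the consumer's cluster and then apply that cluster's model, in contrast to applying a single overall model to everyone.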
Figure 29 shows comparative lift charts for the cluster-wise and overall analyses for Dove Bar Soap, which are typical of the lift curves observed for other brands too. This analysis shows that a cluster-wise predictive model is capable of giving significantly better results than a predictive model based on the entire agglomerated data set. This observation is in contrast to the weak results we obtained by separately modeling consumers in different GQM score strata.
Figure 29: Comparative Lift Curves for Cluster Based and Overall Models
The conclusion to be drawn from this analysis is that modeling small homogeneous groups in the data independently is a more useful exercise than fitting an overall model to an agglomerated data set. If our
clusters are reflecting underlying data sets from different product categories that have been agglomerated to
form the Unilever Database, then it is more advantageous for Unilever to analyze its various data sets
independently.
We conclude the report with a brief summary of what has been accomplished. Below we enumerate some overall conclusions and recommendations based on all our analyses.
A single extract of the Unilever database was made available to us for the purpose of developing modeling
methodologies useful for Unilever's business, to identify actionable insights from the data, and to evaluate
the data as an asset for targeted marketing. In addition to sizeable efforts trying to understand, clean, and
prepare the data, we focused on generating predictive models of individual brand usage because such
models are arguably the most valuable tool for targeted marketing. Our efforts and interactions with
Unilever representatives gave us a better understanding of the data which in turn led to more refined
analyses on subsets of the data. Finally, we generated clustering models to identify groupings in the data,
whether due to natural brand usage patterns or induced artificially through combination of various data sets.
We list a number of overall trends and conclusions that arise out of our analyses:
There is a need to understand the content of the Unilever database more fully. This includes
gaining greater understanding of the effect of missing data and assessment of the quality of some
of the third party information. Also, the Unilever database includes some information on dates
and usage quantities. The availability of such data could potentially lead to more interesting
analyses.
Logistic regression appears to be a suitable method for most of the prediction tasks undertaken. It
has the benefits of being well-studied for these types of data and of being interpretable. Its
performance was comparable to more complicated methods for most of the tasks we tried.
Overall, we observed that data quality issues seemed more important than the choice of modeling
methodology.
The Unilever data seems to be aggregated from several different data sources, many of which
seem to be incompletely understood and/or poorly documented. We note that the quality and
applicability of any data analysis depends critically on the quality of the underlying data and the
extent to which it is understood.
Our prediction efforts seem to indicate that in general, predictions of self-reported brand usage
based on MVC estimations and on usage of other products were somewhat effective. We believe
that in this data set there is a potential problem with using MVC estimates to predict usage of
certain individual brands, as these MVC estimates are sometimes based on the same data we are
trying to predict. Demographic variables seemed to have less predictive power.
Many of our prediction and clustering analyses seemed to be heavily influenced by data
aggregation effects. In many cases, these effects may have dominated any underlying consumer
usage effects. Due to this, our results may be representative of the data we are working with but
extrapolate poorly to actual usage situations.
A strategy of clustering consumers first, then fitting prediction models gave improved prediction
results. This suggests that data may best be analyzed in a cluster-wise manner. If clusters are
representative of underlying data sources, then these data sources may be better analyzed
individually rather than in an aggregated fashion.
While the Unilever database provides information on a large number of consumers, it is our
opinion that modeling and insight generation about consumer behavior is best performed on a
cleaner data set that is better understood and more uniformly gathered. Examples include the
panel data to which Unilever already has access and data gathered by retailers of Unilever brands.