
Introduction to Predictive Analytics & Datamining
The objective of this page is to give you some basic notions of predictive analytics so that you can understand the standard
terminology/vocabulary: dataset, variable, model, target, lift curve, model quality, scoring, ranking.

If these notions are already familiar to you, you can skip this page.

Define the data to analyse


The first example used here is a "fraud detection" mechanism. Thereafter we will extend the results obtained here to specific business cases:
CRM, customer acquisition, customer retention, ...

Let's assume that we want to analyse a database that contains some information about the financial income of the inhabitants of the United
States of America. This database is publicly available at the UCI repository (a local copy is available here). Here is an extract of this
database:

Key | Is taxable income amount above $50K? | age | education | marital stat | race | sex | country of birth | weeks worked in year
1 | 0 | 73 | High school graduate | Widowed | White | F | USA | 0
2 | 0 | 58 | Some college but no degree | Divorced | White | M | USA | 52
3 | 0 | 18 | 10th grade | Never married | Asian | F | Vietnam | 0
4 | 0 | 9 | Children | Never married | White | F | USA | 0
5 | 0 | 10 | Children | Never married | White | F | USA | 0
6 | 0 | 48 | Some college but no degree | Married-civilian | Indian | F | USA | 52
7 | 0 | 42 | Bachelors degree | Married-civilian | White | M | USA | 52
8 | 1 | 28 | High school graduate | Never married | White | F | USA | 30
9 | 0 | 47 | Some college but no degree | Married-civilian | White | F | USA | 52
10 | 0 | 34 | Some college but no degree | Married-civilian | White | M | USA | 52
11 | 0 | 8 | Children | Never married | White | F | USA | 0
12 | 0 | 51 | Bachelors degree | Married-civilian | White | M | USA | 52
13 | 0 | 46 | Some college but no degree | Married-civilian | White | F | Columbia | 52
14 | 1 | 26 | High school graduate | Divorced | White | F | USA | 52
15 | 0 | 13 | Bachelors degree | Never married | White | F | USA | 52
16 | 0 | 47 | Children | Never married | Black | F | USA | 0
17 | 0 | 39 | Bachelors degree | Never married | White | F | USA | 52
18 | 0 | 16 | 10th grade | Married-civilian | White | F | Mexico | 0
19 | 0 | 35 | 10th grade | Never married | White | F | USA | 0
20 | 0 | 55 | High school graduate | Married-civilian | White | M | USA | 49

In the datamining field, a table to analyse is named a "dataset". The columns of the dataset are named "variables".

We will explore the link between the column "Is taxable income amount above $50K?" and all the other columns of the dataset (age,
education level, race, ...). The column named "Is taxable income amount above $50K?" is the "column to explain" or "consequence column"
inside our dataset. The "column to explain" is named, in technical terms, the "Target Column". In our example, the "Target Column" contains
"1" if the income amount of an individual is above $50K, otherwise it contains "0". In other words, we want to construct a system that predicts
the value of the "Target Column" based on the values of the other columns. Such a system is called, in the datamining field, a "model".

In the example given here the "Target Column" contains either "0" or "1": we have a binary classification/prediction problem.

With TIM, you can solve:

 Binary Classification Problems: The "Target Column" contains either "0" or "1".
 N-ary Classification Problems: The "Target Column" contains several different values.
 Continuous Prediction Problems: The "Target Column" contains a continuous number (for example, we have to
predict an amount of euros or dollars).
The example described here is a very simple "fraud detection" example: we want to detect the individuals that are cheating when they say "I
do NOT have a big income thus I won't pay any taxes".

The small part of the population with a taxable income amount above $50K is also named the "target". In other words, the individuals that
have a "1" inside the "Target Column" are the "targets".

Our dataset contains another special column: the "primary key column" or, in other words, the "primary key". The "primary key" contains a
different value on each line of the dataset. Its purpose is to identify each line of our dataset in a unique way. The concept of a
"primary key" is well known in the database world. If you want to know more about this subject, I suggest that you ask your database
administrator. The "primary key column" in our dataset is named "Key".

The "Census-Income" dataset has been prepared in a way that TIM can use it. See this document for more information about the data
preparation step and how to construct a good dataset. Basically, the dataset now contains a target column.

The origin of this dataset is the American Census Bureau database. Each line represents a person. The prediction task to perform with TIM is
to guess if the income level of a person is above $50K. Thus we want to build a model that predicts if the "Target Column" contains "1".

To summarize the vocabulary:

 dataset (learning dataset): it's the table to analyse


 variables: the columns of the dataset
 target column: the column to predict
 targets: the individuals with a "Target Column" that contains "1" (in our case here: the individuals with an income level above
$50K)
 model (predictive model): a system that takes as input the columns of a dataset and guesses the value of the "Target
Column".
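To make these notions concrete, here is a minimal sketch (using open-source Python tools, not TIM) that loads a learning dataset, separates the target column from the other variables and fits a binary classification model. The file name "census_income.csv" and the column names "Key" and "Target" are assumptions made for the illustration.

```python
# Minimal sketch, not TIM: fit a binary classification model on the learning dataset.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("census_income.csv")                   # the learning dataset (assumed file name)

X = pd.get_dummies(df.drop(columns=["Key", "Target"]))  # the variables, with categorical columns one-hot encoded
y = df["Target"]                                        # the target column: 1 = income above $50K, 0 = otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)   # the "model"
print("hold-out accuracy:", model.score(X_test, y_test))
```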

Create a model with TIM and use it


The user-friendly interface of TIM allows you to easily specify the "Target column" and the "Primary key" column.

Once the model is built, TIM automatically applies it to all the individuals of the dataset. More precisely, you obtain a file that contains, for each
individual inside your dataset, the probability that he or she has an income level above $50K. This result file is very often named the
"customerList": in TIM it's a simple CSV file. Here is the customerList file opened with MS Excel:
What do we see here? Each row of the above Excel sheet represents a different individual. You can see (on the first row) that the predictive
model says that the individual with primary key "79661" has a 99.1% chance of having an income level above $50K. The guess of the
model is actually correct because individual "79661" is indeed inside the target (see column "D" named "Target"). You will notice that the rows of the
table are sorted from the highest probability to the lowest.

Here is some vocabulary:

 scoring: the action of computing the "probability to be a target" for all the rows of a dataset
 ranking: the action of sorting the rows from the highest score (or probability) to the lowest score and thereafter assigning an
increasing number to each row.
 customer list: the CSV file that contains the score/probability of every individual
 probability: the probability that the "Target Column" contains "1".
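Continuing the sketch above, scoring and ranking can be reproduced with a few lines of Python (again a sketch, not TIM): we compute the probability of being a target for every row, sort from the highest to the lowest probability, and write the result to a "customerList" CSV file. The variables df, X, y and model come from the previous sketch.

```python
# Minimal sketch, not TIM: scoring, ranking and writing the customerList.
import pandas as pd

customer_list = pd.DataFrame({
    "Key": df["Key"],
    "Probability": model.predict_proba(X)[:, 1],   # scoring: probability that the "Target Column" is "1"
    "Target": y,
})

# Ranking: sort from the highest to the lowest probability and number the rows.
customer_list = customer_list.sort_values("Probability", ascending=False).reset_index(drop=True)
customer_list["Rank"] = customer_list.index + 1

customer_list.to_csv("customerList.csv", index=False)   # the "customer list"
```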

Let's scroll down a little:


Let's have a look at the individual with primary key "18945" on line 256 of the table. This individual says "I have an income level below $50K"
(see the "Target" column) but the predictive model says "There is a 92.8% chance that individual "18945" has an income level above $50K". This
person is very suspicious. We have just created a simple fraud detection mechanism that detects suspicious individuals.
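As a small illustration (a sketch under the same assumptions as above, not the TIM workflow), the suspicious individuals can be extracted from the customerList with a simple filter: rows where the declared target is "0" but the predicted probability is high. The 0.9 cut-off is an arbitrary value chosen for the example.

```python
# Minimal sketch: a simple fraud-detection filter on the customerList.
import pandas as pd

customer_list = pd.read_csv("customerList.csv")          # produced by the previous sketch

suspicious = customer_list[(customer_list["Target"] == 0) &        # declares an income below $50K...
                           (customer_list["Probability"] > 0.9)]   # ...but the model strongly disagrees
print(suspicious.head(10))
```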

Predictive modelling for different business applications


In the previous section we illustrated a simple fraud detection mechanism. The same system can be applied to other types of targets.

1. Propensity Model - B2C - banking, telco, retail - cross-selling/up-selling

Let's assume that you are a statistician inside a bank. You have a dataset that describes all the bank's customers. Each row of the
dataset represents one customer of the bank. The columns (the variables) of the dataset are, for example:
o Age of the customer
o Does this customer have a loan?
o Does this customer possess a credit card?
o Does this customer have a life insurance?
o How many accounts does the customer have?
o ...

You can use TIM on this dataset to create a model that predicts if a customer is willing to have a credit card. The "Target Column"
to predict is "Does this customer possess a credit card?". TIM will generate very good leads: these are the individuals that have a
high probability of possessing a credit card and that do not yet possess one (see the analogy with the fraud detection mechanism
presented in the previous section).

On the same dataset, you can also use TIM to predict if a customer is willing to buy a life insurance. TIM will give you the
probability that a customer will buy a life insurance.

When you have several products to sell (in this example: a credit card and a life insurance), you can compute the probability of
purchase for each product and only present to the customer the product with the highest probability, as illustrated in the sketch below.
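Here is a minimal sketch of this "next best offer" logic (not TIM output): one probability column per product, and for each customer we keep the product with the highest probability of purchase. The column names "p_credit_card" and "p_life_insurance" and the toy values are assumptions for the illustration.

```python
# Minimal sketch: pick, for each customer, the product with the highest purchase probability.
import pandas as pd

scores = pd.DataFrame({
    "Key": [1, 2, 3],
    "p_credit_card": [0.82, 0.10, 0.47],      # probability of buying a credit card (toy values)
    "p_life_insurance": [0.35, 0.64, 0.52],   # probability of buying a life insurance (toy values)
})

product_columns = ["p_credit_card", "p_life_insurance"]
scores["best_offer"] = scores[product_columns].idxmax(axis=1)   # product to present to the customer
print(scores)
```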

2. Propensity Model - B2B

Let's assume that you are once again a statistician inside a bank. You have a dataset that describes all the companies in Europe or
America. Such datasets are directly available here, for example. Each row of the dataset represents one company. The columns (the
variables) of the dataset are, for example:
o Total asset
o Financial Debts
o Stocks
o ...

You can use TIM on this dataset to create a model that predicts if a company is willing to accept a leasing proposition from your
bank. The "Target Column" to predict is "Does this company have a leasing contract with us?". TIM will generate very good leads:
these are the companies that have a high probability of having a leasing contract and that do not yet possess one.

3. Probability of Default - B2B - banking, risk assessment

You can use TIM on the same dataset as in the previous point to create a model that predicts if a company will go bankrupt within 6
months.

All you have to do is to create a "Target Column" that contains "1" for all the companies that recently went bankrupt. TIM will
generate a model that tells you if a company has a high probability of going bankrupt. These kinds of models are a little bit more
tricky to construct because you must pay close attention to the time periods. See this document for more information about
"time periods".

4. Propensity Model - B2C - retail

Let's assume that you possess a dataset that describes all the customers in Belgium. Such a dataset is available here. Each row of
the dataset represents one individual. The columns (the variables) of the dataset are, for example:
o Name
o Address
o Age of the customer
o Income
o Number of children
o Married?
o Has a GPS system?
o ...

Let's now assume that a company X that sells cars comes to see you. They want to know the best leads to sell their cars to. They
have the names and addresses of the people that already have one of their cars. Based on the names and addresses that you received from
company X, you can add to your dataset a new column that is "1" if the individual has a car from company X and "0"
otherwise.

You can use TIM on this extended dataset to create a model that predicts if an individual is willing to buy a car from company X.
The "Target Column" to predict is "Does this customer have a car from company X?". TIM will generate very good leads: these
are the individuals that have a high probability of buying a car from company X and that do not yet possess one.

Example:

You can use TIM directly on your dataset to create a model that predicts if an individual is willing to buy a GPS
system. The "Target Column" to predict is "Has a GPS system?". TIM will generate very good leads: these are the
individuals that have a high probability of buying a GPS system and that do not yet possess one.

Note:

Is it better to simply sell the addresses of people that have responded YES to the question "Do you want/intend to buy a GPS?", or
is it better to build a predictive model on people that have responded YES to the question "Do you have a GPS?"?
The answer is: build a predictive model.

Why?

Even if I would like to have a Ferrari, I will never be rich enough to actually buy one. And even if I am rich enough to buy
one, I might buy a Ferrari next month or in two years. So the right question would be "Do you have enough money
to buy a GPS next month and the desire to buy it? ...and in one month will this desire be as strong as today?". It's
better to build a model on people that have responded YES to the question "Do you have a GPS?" because these
persons actually did have enough money to buy a GPS and they really bought it. What we are searching for
(our target) are the people that "look like" people that are currently buying GPS systems and NOT people that have
a vague desire to buy a GPS in a distant future. All the theoretical textbooks in the CRM field agree:
"you should never trust envy or desire but only current facts".

To summarize: if you directly use the question "Do you intend to...", you will most certainly get a high
response rate but a low conversion rate (they won't buy your product). If you use a predictive
technique, you will get a high conversion rate.

5. Churn Prediction - Telco, Banking

Let's assume that you are a statistician inside a telco operator, in the mobile phone department. Let's assume that you possess a
dataset that describes all your customers. Each row of the dataset represents one individual. The columns (the variables) of the
dataset are:
o Number of phone calls outside the internal network
o Total number of phone calls
o Number of SMS sent
o Number of SMS received
o ...

You can use TIM on this dataset to create a model that predicts if a customer will churn (i.e. change to another mobile phone
operator).

All you have to do is to create a "Target Column" that contains "1" for all the individuals that recently churned, as in the sketch
below. TIM will generate a model that tells you if an individual is likely to churn. These kinds of models are a little bit more tricky to
construct because you must pay close attention to the time periods. See this document for more information about "time periods".
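Here is a minimal sketch of how such a "Target Column" could be built (an illustration, not the TIM data preparation): every customer who churned during the last three months of the observation period gets a "1". The file name, the "churn_date" column and the reference date are assumptions.

```python
# Minimal sketch: build the "Target Column" for a churn model from a churn date.
import pandas as pd

customers = pd.read_csv("telco_customers.csv", parse_dates=["churn_date"])  # churn_date is NaT if still active

end_of_period = pd.Timestamp("2024-01-01")                  # assumed reference date
start_of_period = end_of_period - pd.DateOffset(months=3)   # 3-month observation window (assumed)

# Target = 1 for the individuals that recently churned, 0 otherwise.
customers["Target"] = customers["churn_date"].between(start_of_period, end_of_period).astype(int)
```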

6. Potential analysis - B2C - retail (advanced usage)

Let's assume that we have to decide in which geographical region to install a new agency for our company. We have to estimate
the potential of all the regions and choose the best one. One way to estimate this potential is to simply count the number of targets
inside each region. Is it a good method?

The answer is:


YES, but there is a better solution.

What's the best solution?

Let's assume that you are selling GPS systems. To estimate the "potential" of sales for each geographical region, you count the
number of people having a GPS system in each region. If this count is high, it means that the potential for selling a GPS in this
region is high. No luck: it was Christmas and many people received many gifts from their relatives (including GPS systems). So
your count (= count of people having a GPS) is "polluted". It means that there are some people that have a GPS but never
wanted one.

The best method to do a potential analysis is: "The potential of a region is the sum of the probability column of the people inside this
region."

Why is this method better? Let's go back to the "simple method" and compare it with the advanced predictive technique.
The simple method is equivalent to: "The potential of a region is the sum of the target column of the people inside this region."

The difference between the simple and the "advanced" technique is the column that is summed. Why replace the
column "Target" by the column "Probability"? Most of the time these two columns contain more or less the same thing: when the
probability is high, the target column contains "1"; when the probability is low, the target column contains "0".

However, there are still some differences between these two columns. The differences appear for people that have a GPS but that
never wanted one (it was a gift: see the example above). For these people, the target column contains "1" but the probability is low.
For these people, the target column contains the wrong value (this value will increase the potential of the region although it should not
be increased). In contrast, the probability column always contains the right value.
If the target size is very small (your dataset contains less than 1% of rows with "1" inside the "Target Column"), the count per
region will very often be "0", "1" or maybe "2". It's not a good idea to say "this geographical region is good because it has
more targets in it" because this number is actually very small and has no statistical meaning. The targets are too few to make a
stable, statistically meaningful decision. On the other hand, when you sum probabilities, you don't have these kinds of problems.

In this case the predictive model is used to "filter out" the wrong targets (some rows with target="1" should in reality be target="0"). You
should also always use a model when the number of targets is very low, in order to obtain stable, statistically meaningful decisions. A
minimal sketch of the two methods is given below.
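The two methods can be written in one line each; here is a minimal sketch (not TIM), assuming the customerList has been enriched with a "Region" column:

```python
# Minimal sketch: potential analysis per region, simple method vs probability method.
import pandas as pd

customer_list = pd.read_csv("customerList_with_region.csv")   # assumed file: customerList joined with a Region column

simple_potential   = customer_list.groupby("Region")["Target"].sum()        # simple method: count of targets
advanced_potential = customer_list.groupby("Region")["Probability"].sum()   # better method: sum of probabilities

print(advanced_potential.sort_values(ascending=False).head())
```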

... and much more! The domains of application of the TIM Binary Ranking System and the TIM Continuous Prediction System are without
limits.

A graphical explanation of the prediction process


Basically TIM creates a system (a model) that is able to guess if an individual is a target or not. This system (the predictive model) can be
represented by a blue dashed line on the following graphic (situation 1):

The graphic above represents a dataset that contains 35 rows and 3 columns. Each row is an individual. Each individual is represented
by either a circle or a triangle. The first two columns of the dataset are "age" and "education". The last column is the
"Target Column": it's the column "Is taxable income amount above $50K?". If an individual is a target, it is represented by a triangle,
otherwise it is represented by a circle. You can see that it's easy to isolate the targets (triangles) from the non-targets (circles): the
blue line (the model) is able to separate the two classes (target vs non-target) quite well.

The next situation (situation 2) is really bad: it's very difficult to separate the targets (triangles) from the non-targets (circles):
Let's now assume that we are building a "propensity to buy" model. Where are the good leads situated inside these graphics? Let's have a
look at the leads in the first, good situation (situation 1):

The leads generated in the first situation (situation 1) are very good because there is an easy way to separate the individuals that are
targets from the non-targets. The leads generated in the second situation (situation 2) are very poor.

When you build your predictive models, you must always control the quality of your models, otherwise there is a risk of producing leads that are
very bad. You obtain bad leads when the separation between the two classes (Target class vs Non-Target class) is difficult: there is no strong
"pattern" that allows you to easily recognize a Target. When you can easily separate the two classes (Target class vs Non-Target class), you
have a high model quality: the lift is high and close to the Irma lift. See the next section below for an explanation of the lift.

In the example above, the pattern that characterizes your leads is the following: "The leads are people with a high education and with an age
between 40 and 60." TIM gives you, directly inside the automatically generated reports, a complete analysis of the pattern of your leads. TIM is
the only datamining tool that offers you a characterization of your leads that is both complete and very easy to understand from a business
perspective.

The default parameters of TIM always produce nearly the best model you can get based on your dataset. However, you can change these
parameters to increase your model quality even further.
How to estimate the quality of a model?
The most common way to estimate the quality of a binary predictive model is the Lift Curve. Let's go back to the ranked list. This list is
reproduced here:

The individual with the primary key "18945" is a very good lead. Here is a graphical representation of the column "probability" inside the
CustomerList file:
The column "probability" inside the CustomerList file is sorted from highest probability to lowest probability so we see a pink curve that is
decreasing.

On the next graphic, the yellow line represents the performance of a random selection. This is the worst ranking you can do: the leads are
not computed at all; they are simply taken randomly.
On the next graphic, we use the CustomerList generated with TIM to select the leads (i.e. to select which individuals will be contacted
during the commercial campaign). The blue curve thus represents the quality of the ranking created with the model generated by TIM. The
blue curve is named the "Lift Curve" of the model (or, in short, the "Lift").
If we had a magic way to predict if an individual is a target, we would obtain the ranking represented by the green curve (assuming that the
individuals inside the Target represent 5% of the total number of individuals inside your dataset):
The green curve is also named the "Irma Curve". No predictive model can do better than the Irma Curve. The Irma Curve represents the best
ranking that you can have.

To summarize, on the lift graphics, we have:


If the blue curve (the Lift Curve of the model generated with TIM) is far above the Random-Choice curve (in yellow), it means that your
predictive model is very good. The Lift Curve is always below the Irma Curve (and above the Random-Choice curve). From the graphic above
we can see that, if we follow the TIM ranking, we only need to contact 10% of the population to "touch" 35% of the targets. In technical
terms, we have a lift of 3.5 (= 35/10): this means that we are doing 3.5 times better than the random choice. Typically, a ranking based on
rules of common sense gets a maximum lift of 2 or 3. If the contact cost is high, a "high" Lift Curve means substantial savings. A small sketch
of this lift computation is given below.
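Here is a minimal sketch of the lift computation (not the TIM report itself), using the customerList produced earlier: the lift at x% of the population is the percentage of targets reached in the top x% of the ranking divided by x%.

```python
# Minimal sketch: lift at 10% of the population, computed from the ranked customerList.
import numpy as np
import pandas as pd

customer_list = pd.read_csv("customerList.csv")
targets = customer_list.sort_values("Probability", ascending=False)["Target"].to_numpy()

pct_targets    = np.cumsum(targets) / targets.sum()              # cumulative % of targets reached
pct_population = np.arange(1, len(targets) + 1) / len(targets)   # cumulative % of the population contacted

idx = np.searchsorted(pct_population, 0.10)                      # point where 10% of the population is contacted
print("lift at 10% of the population:", round(pct_targets[idx] / 0.10, 2))
```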

Here are some examples to give you some insight into what a good lift is:
Here is the lift curve obtained on the Census-Income dataset:
What to do when you have a poor model/ranking?

There are several solutions. If you are not yet using TIM, you can try TIM on your problem. Another solution is, for example, to add new
columns to your dataset. These new columns can be:

 The ratio of two columns of your original dataset.

o Ratio variables/columns are very often very informative. If you are using TIM, you can work with thousands of ratio
variables without any problem (a small sketch showing how such ratio columns are built is given after this list).

 Completely new columns that you bought from an external data provider.

o There are several data providers available, each of them is specialized in B2C or B2B business.

 Completely new columns extracted from an analysis of the underlying network.


o For example, in a telco application, for a churn prediction analysis, each row of the dataset represents one individual.
The individuals are the nodes of a giant network that is based on phone calls. Typically, an arc between node A and
node B inside this network represents the relation "individual A phoned individual B". You can analyse the
network and create additional columns based on your analysis. Still for the same example (telco application, churn
prediction), it's interesting to add a column that counts how many churners there are around each individual.

 Columns discarded because of some old heuristic or business rule that is still active inside the
operational system.

o This may sound silly, but most dataminers were forced to remove potentially valuable information from their datasets
because their previous datamining tool was not able to handle a large number of columns. In a real business case
(in a banking application), this simple "trick" improved the lift from 2.5 to 4.
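As announced in the first bullet above, here is a minimal sketch of how ratio columns can be added to a dataset before re-building the model (the file name and the column names "financial_debts", "total_assets" and "stocks" are assumptions taken from the B2B example):

```python
# Minimal sketch: add ratio columns to a dataset before re-building the model.
import pandas as pd

companies = pd.read_csv("companies.csv")   # assumed B2B dataset

# Ratio variables are often more informative than the raw columns they are built from.
companies["debt_ratio"] = companies["financial_debts"] / companies["total_assets"]
companies["stock_ratio"] = companies["stocks"] / companies["total_assets"]

companies.to_csv("companies_with_ratios.csv", index=False)
```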

One last solution is to refine or slightly change the definition of your target group.
