Final Cheat Sheet Nadya

Excel Dashboards
Bullet graphs in Excel:
http://peltiertech.com/WordPress/vertical-bullet-graphs-in-excel/
Checkboxes and combo boxes:
http://peltiertech.com/Excel/Charts/ChartByControl.html
Data Cleaning and Descriptive Statistics:

The measures from a population are called parameters.
The measures from a sample are called statistics.
Types of sampling
Probabilistic sampling (assumes there is no order, no bias against any characteristic, and the records are all well mixed)
Simple random sampling with or without replacement (use the random number generator in Excel, or the function RANDBETWEEN(1, 1638), where 1638 is the size of the population)
Proportional random sampling
o Let's say you want to do proportional random sampling on books on South America
o By viewing the number of books for each country and their percentage of the total, you can figure out, for instance, that for Argentina the proportion is 578/3192, so your sample needs to have the same proportion
o For example, if your sample has 200 records, then 200 * (578/3192), about 36 of them, need to come from Argentina
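The proportional allocation above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and the "Other countries" bucket is a placeholder for the remaining books, since the notes only give the Argentina count and the total.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Return how many records to draw from each stratum (may be fractional)."""
    total = sum(stratum_sizes.values())
    return {name: sample_size * count / total
            for name, count in stratum_sizes.items()}

# 578 and 3192 come from the notes; "Other countries" is a placeholder
# bucket holding the remaining 3192 - 578 books.
books = {"Argentina": 578, "Other countries": 3192 - 578}
allocation = proportional_allocation(books, 200)

print(round(allocation["Argentina"], 1))  # 200 * 578 / 3192
```

In practice you would round each allocation to a whole number and check that the rounded counts still add up to the sample size.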
How to pick the samples, e.g. with flight delay data
Enter observation IDs for each record
Sort your file on delay
Note the number of delays vs. non-delays. Let's say the first 400,000 records are non-delays
RANDBETWEEN(1, 400000) to sample delay = 0. Keep the ratio of delays vs. non-delays in your sample the same as in the original data
RANDBETWEEN(400001, 500000) to sample delay = 1
Copy all of the resulting numbers and Paste Special --> Values (so they stop recalculating)
Then do a VLOOKUP to get the data from the other sheet
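The same ID-picking steps can be sketched in Python. The record counts (400,000 non-delays out of 500,000) come from the example above; the sample size of 1,000, the seed, and the function name are assumptions for illustration. Note that `random.sample` draws without replacement, whereas RANDBETWEEN can return the same ID twice.

```python
import random

def stratified_ids(n_non_delay, n_total, sample_size, seed=0):
    """Pick observation IDs so the sample keeps the delay / non-delay ratio.

    Assumes the file is sorted so IDs 1..n_non_delay are non-delays and
    IDs n_non_delay+1..n_total are delays.
    """
    rng = random.Random(seed)
    k_non_delay = round(sample_size * n_non_delay / n_total)
    k_delay = sample_size - k_non_delay
    non_delay_ids = rng.sample(range(1, n_non_delay + 1), k_non_delay)
    delay_ids = rng.sample(range(n_non_delay + 1, n_total + 1), k_delay)
    return non_delay_ids, delay_ids

non_delay_ids, delay_ids = stratified_ids(400_000, 500_000, 1_000)
print(len(non_delay_ids), len(delay_ids))
```

The resulting IDs play the role of the pasted values that the VLOOKUP step uses to pull the full records.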

Outliers
Anything more than 3 standard deviations away from the mean
Your range is mean - 3 sigma to mean + 3 sigma
Exclude any values outside of this range (e.g. use conditional formatting to highlight offending values, filter, etc.)
Keep deleting until there are no values highlighted (as you delete, Excel recalculates the mean and sigma and may highlight new values; that's OK, just keep deleting)
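The "keep deleting until nothing is highlighted" loop can be sketched as follows. This is a minimal illustration, assuming the sample standard deviation (as Excel's STDEV uses); the data values are made up.

```python
import statistics

def remove_outliers(values, n_sigma=3):
    """Repeatedly drop values outside mean +/- n_sigma * sigma.

    The mean and sigma are recomputed after every pass, mirroring what
    happens in the spreadsheet when highlighted rows are deleted.
    """
    kept = list(values)
    while len(kept) > 2:
        mean = statistics.mean(kept)
        sigma = statistics.stdev(kept)  # sample standard deviation
        inside = [v for v in kept if abs(v - mean) <= n_sigma * sigma]
        if len(inside) == len(kept):
            break  # nothing left to delete
        kept = inside
    return kept

# Made-up data: twenty ordinary values and one obvious outlier.
data = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10] * 2 + [1000]
cleaned = remove_outliers(data)
print(len(data), len(cleaned))
```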
Frequency Charts (e.g. frequency of house size from real estate data)
Insert --> Pivot Table
House size is the row field
Count of Price as the counted column (can be any other field)
Right click house size --> Group. Enter the bin size in "By"
Right click the resulting table --> Insert Chart --> Column Chart
Eyeball the data to determine outliers
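What the grouped pivot table computes is just a count per bin. A rough Python equivalent, with made-up house sizes and an assumed bin size of 500:

```python
from collections import Counter

def frequency_table(values, bin_size):
    """Count how many values fall in each bin [start, start + bin_size)."""
    counts = Counter((v // bin_size) * bin_size for v in values)
    return sorted(counts.items())

# Made-up house sizes in square feet.
house_sizes = [850, 1200, 1250, 1400, 1900, 2100, 2150, 5200]
table = frequency_table(house_sizes, 500)

for start, count in table:
    print(f"{start}-{start + 499}: {count}")
```

A bin that sits far from the others with very few records (here the one starting at 5000) is the kind of thing to eyeball as a possible outlier.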

Histograms: Data --> Data Analysis --> Histogram
Descriptive Stats: Data --> Data Analysis --> Descriptive Statistics
Box Plots: Add Ins --> Data Analysis --> XLMiner --> Charts --> Box Plot
Prediction and Classification Methods:
You will have three data sets - training, validation and test sets
Training set is what you build the model on
The validation set is used to check and compare the quality of models while you tune them
The test set is used to estimate the accuracy of the final model on data it has not seen
XLMiner - Classification - Naive Bayes
Select input variables and the output variable (flight status)
Next
Next
Check Summary Reports: Detailed Report, Score Validation Data Summary Report, Lift Charts
Go to Prior Class Probability
The 0.8 and 0.2 on top are the prior class probabilities
Copy data from the other spreadsheet, Paste Special --> Transpose
Use VLOOKUP on the conditional probabilities
Then PRODUCT(I43:I49, I41) (multiplies the conditional probabilities together and then by the overall probability of the class)
Then the Naive Bayes formula = prob of on time / (prob of on time + prob of delay)
In full:
(product of the probabilities of each condition given on time * overall on-time probability)
divided by
(product of the probabilities of each condition given on time * overall on-time probability + product of the probabilities of each condition given delayed * overall delayed probability)
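The Naive Bayes calculation above can be sketched in Python. The priors 0.8 and 0.2 are the ones mentioned in the notes (assumed here to be on time and delayed respectively); the conditional probabilities are made up, standing in for the values looked up with VLOOKUP.

```python
from math import prod

def naive_bayes_posterior(cond_probs_by_class, priors):
    """Posterior probability of each class given the observed conditions.

    cond_probs_by_class[c] holds P(condition | class c) for each condition;
    priors[c] is the overall (prior) probability of class c.
    """
    # Same as PRODUCT(conditional probabilities, prior) in the spreadsheet.
    scores = {c: prod(cond_probs_by_class[c]) * priors[c] for c in priors}
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}

priors = {"on_time": 0.8, "delayed": 0.2}
# Hypothetical conditional probabilities for two observed conditions.
cond_probs = {"on_time": [0.5, 0.2], "delayed": [0.5, 0.8]}

posterior = naive_bayes_posterior(cond_probs, priors)
print(posterior)
```

With these made-up numbers the evidence for "delayed" exactly offsets the higher prior for "on time", so the two posteriors come out equal.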
Multiple Regression
You are trying to find how a dependent variable is related to one or more independent variables.
You want to check:
whether the dependent variable has a linear relationship with the independent variable
whether the independent variables are indeed independent of each other
whether the relationship is continuous rather than discrete (e.g. 1 bedroom, 2 bedrooms, 3 bedrooms is discrete)
Regression equation: Y = Alpha + Beta*X + error
In other words: Dependent variable = constant + the contribution of an independent variable + something random
For example: House price = 3000 + 600 * sq ft + E
You can say Y hat = a + b*X <-- this is an estimate (you drop the random part)
<-- a is an estimate of Alpha and b is an estimate of Beta
<-- The error in your estimate is Y - Y hat
<-- If you square that, you get the error squared
If you add up the squared errors over all records (index them by i), you get the total squared error
Least squares regression minimizes this total --> you use derivatives (set them to zero)
To minimize the error, you need to find the a and b that minimize the sum of (Y_i - Y hat_i)^2.
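For a single independent variable, setting the derivatives of the total squared error to zero gives closed-form values for a and b. A minimal sketch, using made-up data generated from the example equation (price = 3000 + 600 * sq ft, with no random part) so the fit recovers those numbers:

```python
def fit_line(xs, ys):
    """Least squares estimates a (intercept) and b (slope) for Y hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def total_squared_error(xs, ys, a, b):
    """Sum over all records of (Y - Y hat)^2."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

sq_ft = [1000, 1500, 2000]
price = [3000 + 600 * x for x in sq_ft]  # made-up, noise-free data

a, b = fit_line(sq_ft, price)
sse = total_squared_error(sq_ft, price, a, b)
print(a, b, sse)
```

With real data the error term is not zero, so the fitted a and b are only estimates of Alpha and Beta and the total squared error stays positive.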
Data Partitioning:
Open the data set in Excel
Add Ins --> XLMiner --> Partition Data --> Standard Partition
Select all the variables and put them on the right
Partitioning by default is set to Automatic - 60% training, 40% validation
But you can also choose "use partition variable"
There, you say what kind of set you want each record in by putting in a variable: t (test), s (training), v (validation)
This generates the partitions, and you can use the hyperlinks at the top to switch between training and validation data
Naive Bayes
Partition Data
Click on Training Set
Multiple Regression (continued)
Before you run the regression, you want to make sure that there are no correlated variables (they are truly independent).
To find out, go to XLMiner - Charts - Matrix Plot - pick all the variables you are interested in (beds, baths, sq ft, price)
Price and sq ft have an almost linear relationship (look in lower right corner)
Lower left - discontinuous, and you also see that price and beds, and sq ft and beds, are also positively correlated...
So here you will pick sq ft, because more sq ft generally means more bedrooms and bathrooms
It's the continuous variable, whereas beds and baths are discrete
You also know that you have to separate them because beds and baths are discontinuous (lower right corner)
So let's say you can't decide which variable to use. Run the regression for all three variables independently (in Excel --> Data --> Data Analysis --> Regression. Check residual plots).

Since the beds vs. price residual plot is discontinuous, you can tell that beds is not a good variable to use.
Alpha is what you set (it's your tolerance for error, typically 0.05 or less), p is what you get.
Lower p --> stronger evidence that the variable matters.
p <= Alpha --> independent variable is significant
p > Alpha --> independent variable is not significant and can be removed from the regression
|t| > 2 (roughly) --> significant
Also, look at the adjusted R Square to see the explanatory power of the model. Lower adjusted R Square means less explanatory power.
Check the standard error and make sure it's low.
Look at Correlation (Excel --> Data --> Data Analysis --> Correlation)
Highlight the results and apply conditional formatting with a color scale (Home --> Conditional Formatting --> Color Scales). Take absolute values first
This is another way to detect multicollinearity (in addition to the Matrix Plot in XLMiner)
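The correlation check can be sketched in Python as an absolute-value correlation matrix plus a flag for strongly correlated pairs. The data and the 0.8 threshold are assumptions for illustration; the notes rely on the color scale rather than a fixed cut-off.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def abs_correlation_matrix(columns):
    """Absolute correlations between every pair of named columns."""
    names = list(columns)
    return {a: {b: abs(pearson(columns[a], columns[b])) for b in names}
            for a in names}

def flag_pairs(matrix, threshold=0.8):
    """Pairs of variables whose absolute correlation reaches the threshold."""
    names = list(matrix)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if matrix[a][b] >= threshold]

# Made-up columns for five houses.
columns = {
    "beds": [2, 3, 3, 4, 5],
    "sq_ft": [900, 1400, 1500, 2000, 2600],
    "age": [30, 5, 22, 12, 18],
}
matrix = abs_correlation_matrix(columns)
flagged = flag_pairs(matrix)
print(flagged)
```

Flagged pairs are candidates for multicollinearity: you would usually keep only one variable from each pair in the regression.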
What if I run multiple regression on all variables (using XLMiner)?
Fitted values will give you the predicted values
Check Unstandardized, Summary Report
You know there is a problem because the coefficient for bedrooms is negative. So adding a bedroom reduces your house value???
The good news is, XLMiner will determine the best variables for you.
XLMiner --> Multiple Linear Regression
At Step 2, click Best Subset. Backward elimination (it takes all the variables and eliminates the least significant first)
Look at the adjusted R Square - where is it tapering off? No more improvement between 5 and 6.
Also, look at Cp - you want the highest R Square with Cp close to the total number of predictors. Pick # 12 because R Square is higher and Cp is close to the total number of predictors.
Principle of parsimony - if you can do the job with two variables, don't use three.
In fact, if you include too many variables, you overfit the data - the model matches the training data almost perfectly but has little predictive power on new data.
From the XLMiner output, click Subset Selection, then choose a subset
It will automatically select the subset for you, but then you have to rerun the regression on just these variables.
The regression equation is under "Reg Model"
Prediction = constant + coeff. * input variable + coeff. * input variable + ...
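Applying the prediction equation is a constant plus a sum of coefficient times input. In this sketch the constant 3000 and the sq ft coefficient 600 are borrowed from the earlier house price example; the baths coefficient and the input values are hypothetical.

```python
def predict(constant, coefficients, inputs):
    """Prediction = constant + sum of (coefficient * input variable)."""
    return constant + sum(coefficients[name] * value
                          for name, value in inputs.items())

# "baths": 5000 is a hypothetical coefficient, not a value from the notes.
coefficients = {"sq_ft": 600, "baths": 5000}
predicted_price = predict(3000, coefficients, {"sq_ft": 1500, "baths": 2})
print(predicted_price)
```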
K Nearest Neighbor
when you get a new record, you compare it to existing records
you find the "distance" between this new record and the existing records
XLMiner - Classification - Classification Tree

Select input and output variables
Run
When you look at the output tree, values less than the split value go left and values greater go right
The number in between is the number of records in the existing set that fall into that category
e.g. someone with less than 100.5K and a CC average below 2.95 --> not worth offering a personal loan
Classification Errors and Costs
Misclassification - how many records are placed in the incorrect category on the test/validation data
Two kinds of errors:
Individual Misclassification Error - this is for each category by itself (you think that a mailing will generate business but it does not); usually associated with false positives or false negatives
Overall Misclassification Error - useful for evaluating the overall model
Behavior of the errors with cut-off probability values
If you provide a cut-off probability, then the classification algorithm will reclassify according to the cut-off. Typical default cut-off is 50%
The cut-off probability depends on misclassification cost and business context
Data Table (What-If Analysis) can be used to plot the behavior
For calculations and decision making on future records, typically the validation results are used.
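How the errors move with the cut-off can be sketched in Python, playing the role of the What-If Data Table. The scored records and the misclassification costs (1 for a false positive, 5 for a false negative) are made up for illustration.

```python
def confusion_counts(scored, cutoff=0.5):
    """scored: list of (probability_of_1, actual_class) pairs."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for prob, actual in scored:
        predicted = 1 if prob >= cutoff else 0
        if predicted == 1 and actual == 1:
            counts["tp"] += 1
        elif predicted == 1 and actual == 0:
            counts["fp"] += 1
        elif predicted == 0 and actual == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts

def overall_error(counts):
    """Share of all records placed in the wrong category."""
    return (counts["fp"] + counts["fn"]) / sum(counts.values())

def total_cost(counts, cost_fp, cost_fn):
    """Misclassification cost when the two kinds of error cost differently."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

# Made-up validation scores: (probability of class 1, actual class).
scored = [(0.9, 1), (0.7, 1), (0.6, 0), (0.4, 1), (0.3, 0), (0.1, 0)]

for cutoff in (0.25, 0.5, 0.75):
    c = confusion_counts(scored, cutoff)
    print(cutoff, c, round(overall_error(c), 3),
          total_cost(c, cost_fp=1, cost_fn=5))
```

In this toy data the overall error is the same at all three cut-offs, but the cost is not, which is why the choice of cut-off depends on misclassification cost and business context.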
Lift Chart (or gains chart) is a graphical way to see the effectiveness of the
classification model. If you do not use any classification and just send an offer to
everyone, then your response rate will be whatever is the underlying probability.
However, when you use a classification scheme, and then sort the target records
accordingly and send the offer, then your response rate should be much higher. The
ratio of the targeted response rate to the baseline response rate is the lift.
The Decile chart shows the same information, only in blocks of 10% of the records.
It lets you know when to stop targeting.
Let us recreate the Lift and Decile charts for the Universal Bank example
Sort the records in the validation score sheet in descending order of the classification probabilities
Create a new column on the left to number the cases serially from 1 to 1000
Create a column to count the cumulative number of 1s (successes) in the actual column
Complete the entries for all columns using appropriate formulas
Find the actual number of 1s and 0s in the validation data set (hint: this can easily be read from the classification confusion matrix) and compute the overall probability of 1 and of 0
In a new sheet, create a table that shows, for every 50 records, the number of successes expected from the overall probability and the number from your actual cumulative column in the validation score worksheet.
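The steps for recreating the lift table can be sketched in Python. The ten scored records and the step of 2 are made up (the worksheet uses 1000 records and steps of 50), and the sketch assumes at least one success so the baseline is not zero.

```python
def cumulative_lift(scored, step):
    """Build lift table rows: (cases, expected at random, actual cumulative, lift).

    scored: list of (probability_of_1, actual_class) pairs.
    """
    # Sort in descending order of the classification probability.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    prior = sum(actual for _, actual in ranked) / total  # overall prob of 1
    rows = []
    cumulative = 0
    for n, (_, actual) in enumerate(ranked, start=1):
        cumulative += actual  # cumulative number of 1s so far
        if n % step == 0 or n == total:
            expected = n * prior  # successes if n cases were picked at random
            rows.append((n, expected, cumulative, cumulative / expected))
    return rows

# Made-up validation scores: (probability of class 1, actual class).
scored = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1), (0.6, 0),
          (0.5, 0), (0.4, 0), (0.3, 1), (0.2, 0), (0.1, 0)]
rows = cumulative_lift(scored, step=2)

for cases, expected, actual, lift in rows:
    print(cases, round(expected, 2), actual, round(lift, 3))
```

The lift always ends at 1 once every record is included, because targeting everyone is the same as not using the model.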
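K Nearest Neighbor (continued)
Distance between the new record (in $H$3, $I$3) and an existing record (in A2, B2):
=SQRT(SUMXMY2($H$3, A2) + ($I$3-B2)^2)
You decide to use the k records with the smallest distance (k should be odd to avoid ties, but you set it)
See the Excel example in the MBAD 698 folder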
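The distance-then-vote idea can be sketched in Python with the standard library. The records and labels are made up; `math.dist` computes the same Euclidean distance as the spreadsheet formula, for any number of variables.

```python
from collections import Counter
from math import dist

def knn_classify(new_record, records, k=3):
    """Classify new_record by majority vote among its k nearest records.

    records: list of (feature_tuple, label) pairs; k should be odd.
    """
    nearest = sorted(records, key=lambda rec: dist(new_record, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up existing records: two features each, class label 0 or 1.
records = [((1.0, 1.0), 0), ((1.5, 2.0), 0), ((2.0, 1.0), 0),
           ((6.0, 5.0), 1), ((7.0, 7.0), 1), ((6.5, 6.0), 1)]

print(knn_classify((6.0, 6.0), records, k=3))
print(knn_classify((1.0, 1.5), records, k=3))
```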

To do this in XLMiner
Partition data first
On the training set, click inside the set
Add Ins --> XLMiner --> Classification --> K Nearest Neighbor
On Step 2, select "score on best k between 1 and specified value" (it will let you do 19 max)
Click on Prior Class Probabilities. Best k will be highlighted there.
Classification Tree
Partition Data

Classification Confusion Matrix

                 Predicted Class
Actual Class        1        0
     1             60       46
     0             10      884

Prior Prob:   Success 0.106   Failure 0.894

No of    Success Rate when   Number of Success     Decile
cases    chosen at Random    When Logit is used    Lift
    0          0                    0
   50          5.3                 46
  100         10.6                 75              7.075472
  150         15.9                 89
  200         21.2                 95              4.481132
  250         26.5                 97
  300         31.8                 99              3.113208
  350         37.1                101
  400         42.4                101              2.382075
  450         47.7                103
  500         53                  103              1.943396
  550         58.3                104
  600         63.6                104              1.63522
  650         68.9                104
  700         74.2                105              1.415094
  750         79.5                105
  800         84.8                105              1.238208
  850         90.1                106
  900         95.4                106              1.111111
  950        100.7                106
 1000        106                  106              1
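As a check on the numbers above, the error rates and the first lift value can be recomputed in Python. This assumes the confusion matrix reads as actual class 1: 60 predicted 1 and 46 predicted 0; actual class 0: 10 predicted 1 and 884 predicted 0, which is the reading consistent with the prior probabilities 0.106 and 0.894.

```python
# Confusion matrix as read from the notes (assumed layout):
# rows = actual class, columns = predicted class.
actual_1 = {"pred_1": 60, "pred_0": 46}
actual_0 = {"pred_1": 10, "pred_0": 884}

total = sum(actual_1.values()) + sum(actual_0.values())

# Overall misclassification error: all wrongly placed records / all records.
overall_error = (actual_1["pred_0"] + actual_0["pred_1"]) / total

# Individual misclassification errors, one per actual class.
error_class_1 = actual_1["pred_0"] / sum(actual_1.values())
error_class_0 = actual_0["pred_1"] / sum(actual_0.values())

# Overall (prior) probability of success.
prior_success = sum(actual_1.values()) / total

# Lift for the first 100 ranked cases: 75 successes found by the model
# vs. 100 * prior expected when choosing at random.
lift_100 = 75 / (100 * prior_success)

print(total, overall_error, round(error_class_1, 3),
      round(error_class_0, 4), prior_success, round(lift_100, 6))
```

The overall error looks small, but the error on class 1 is large: the model misses a sizeable share of the actual successes, which is the kind of gap the individual misclassification errors are meant to expose.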
