
Statistics

28 September 2016 03:41

Dealing with Skewed Data


When handling a skewed dependent variable, it is often useful to predict the logarithm of the dependent variable instead of the dependent variable itself -- this prevents
the small number of unusually large or small observations from having an undue influence on the sum of squared errors of predictive models.
However, when applying a predictive model fitted on the log of the dependent variable, remember to apply exp() to the predictions to get values back on the original scale.
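A minimal R sketch of this workflow, assuming a hypothetical data frame df with a skewed dependent variable y and one predictor x:
fitLog = lm(log(y) ~ x, data = df)        # model the log of the skewed dependent variable
predLog = predict(fitLog, newdata = df)   # predictions are on the log scale
pred = exp(predLog)                       # back-transform to the original scale of y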

Cross Sectional Data
Data collected at the same, or approximately the same, point in time.

Time Series Data
Data collected over several time periods.

Histogram
A frequency distribution of the data.

Box Plot
The box represents the observations between the 1st Quartile and the 3rd Quartile.
Inter-Quartile Range (IQR) = Q3 - Q1
Outliers:
a. Observation Value > Q3 + 1.5 * IQR
b. Observation Value < Q1 - 1.5 * IQR

Measures of Variability
Range = Highest - Lowest
Inter-Quartile Range = Q3 - Q1
Variance
Measure of variability of the data, based on the difference between each observation and the mean.
Variance = (Standard Deviation)^2
For a Sample:
s^2 = Σ(Xi - x̅)^2 / (n - 1)
s^2 -> sample variance; Xi -> sample observations; n -> sample size; x̅ -> sample mean
For a Population:
σ^2 = Σ(Xi - μ)^2 / N
σ^2 -> population variance; Xi -> population observations; N -> population size; μ -> population mean
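A quick check of these formulas in R on a hypothetical sample; note that R's var() and sd() use the sample (n - 1) denominator:
x = c(4, 8, 6, 5, 9)                      # a hypothetical sample
var(x)                                    # sample variance (n - 1 denominator)
sd(x)                                     # sample standard deviation = sqrt(var(x))
sum((x - mean(x))^2) / (length(x) - 1)    # the same value, from the formula above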
Coefficient of Variation
= (Standard Deviation / Mean) * 100%
Skewness
A measure of the asymmetry of a distribution about its mean; a positively skewed distribution has a long right tail. For handling a skewed dependent variable in regression, see "Dealing with Skewed Data" above.


The Z-Score
• It is a measure of the relative location of an observation in the dataset and helps us determine how far a particular value is from the mean.
• There is a Z-score associated with each value (observation) of the population/sample.
• It is often called the Standardized Value.
• It is interpreted as the number of standard deviations Xi is from the mean x̅.
• Any value with Z > 3 or Z < -3 is treated as an outlier.
Zi = (Xi - x̅)/s
Zi is the Z-score for Xi; x̅ is the sample mean; s is the sample standard deviation
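The same calculation in R on a hypothetical sample:
x = c(4, 8, 6, 5, 9)              # a hypothetical sample
z = (x - mean(x)) / sd(x)         # Z-score of each observation
scale(x)                          # built-in equivalent (returns a matrix with centring/scaling attributes)
x[abs(z) > 3]                     # values more than 3 standard deviations from the mean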

Chebyshev's Theorem:
At least (1 - 1/Z^2) of the data values must be within Z standard deviations of the mean, where Z > 1.
The theorem applies to all data sets irrespective of the shape of distribution of the data
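For example, with Z = 2 the theorem guarantees that at least 75% of the values lie within 2 standard deviations of the mean:
z = 2
1 - 1/z^2     # = 0.75, regardless of the shape of the distribution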

Normal Distribution
1. Mean = Median = Mode
2. Symmetric about mean
3. Standard Deviation determines how flat or wide the curve is
4. Probability for normal random variable is given by area under the curve
5. 68.3% of the values of a Normal Random Variable are within +/- 1 standard deviation of its mean
6. 95.4% of the values of a Normal Random Variable are within +/- 2 standard deviations of its mean
7. 99.7% of the values of a Normal Random Variable are within +/- 3 standard deviations of its mean
8. f(x) = e^(-(x - μ)^2 / (2σ^2)) / (σ√(2π))
Standard Normal Density Function
Normal distribution with a mean = 0 and standard deviation = 1.

Here effectively the Z score becomes the normal random variable.
f(z) = e^(-z^2/2) / √(2π)
To calculate probability in the Standard Normal Density function:
a. Calculate Z score of the desired value (converting the observation value to standard normal random variable)
b. Use the Z-table for cumulative probabilities for Standard Normal Dist to get the probability
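For instance, in R (illustrative numbers only), pnorm() returns the cumulative probability directly:
z = (70 - 60) / 8                 # step a: Z-score of 70 for X ~ Normal(mean = 60, sd = 8)
pnorm(z)                          # step b: cumulative probability from the standard normal (≈ 0.894)
pnorm(70, mean = 60, sd = 8)      # same answer without standardizing first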

Point Estimation
Population Statistic
Mean = μ
Standard Deviation = σ
Sample Statistic
Mean = x̅
Standard Deviation = s
Proportion Estimate = p̅ = (no. of selected observations)/(total no. of observations)
Sample Mean x̅ becomes the Point Estimator for Population Mean μ
Sample Standard Deviation s becomes the Point Estimator for Population Standard Deviation σ
Sampling Distribution
Consider selecting a simple random sample as an experiment, repeated several times
Each sample gives us a sample mean x̅.
As a result we will have a random variable x̅, which has a mean (expected value), a standard deviation and a probability distribution.
This is known as the sampling distribution of x̅.
Expected Value of x̅: E(x̅) = μ, the Population Mean
Standard Deviation of x̅ (σ = Population SD, n = Sample Size, N = Population Size):
Finite Population: SD(x̅) = √{(N - n)/(N - 1)} * (σ/√n)
Infinite Population: SD(x̅) = σ/√n
Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̅ can be approximated by a normal distribution as the sample size becomes large.
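A small simulation sketch of this in R, using an assumed and deliberately skewed population; it also checks E(x̅) ≈ μ and SD(x̅) ≈ σ/√n:
set.seed(1)
pop = rexp(100000)                               # a strongly skewed (exponential) "population"
xbar = replicate(5000, mean(sample(pop, 50)))    # 5000 sample means, each from a sample of size n = 50
hist(xbar)                                       # roughly bell-shaped despite the skewed population
c(mean(xbar), mean(pop))                         # E(x̅) ≈ μ
c(sd(xbar), sd(pop) / sqrt(50))                  # SD(x̅) ≈ σ/√n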
Sampling Types
1. Simple Random Sampling
2. Stratified Random Sampling:
Elements in Population first divided into groups called strata, based on certain attributes such as department, location, age, type etc.
After this, a Simple Random sample is taken from each stratum.
3. Cluster Sampling
The population is divided into smaller groups called clusters.
Each cluster should ideally be representative of the population
Samples are taken from each cluster
4. Systematic Sampling
For example, selecting one element from the first 100, another from the next 100, and so on.
5. Convenience Sampling
Non-Probability Sampling technique.
Elements are included in the sample without prespecified or known probabilities of being selected.
6. Judgement Sampling

Interval Estimation
The purpose of an interval estimate is to provide information about how close the point estimate, provided by the sample, is to the value of the Population Parameter.
It helps us estimate the value of the Population Mean μ, using the value of the sample mean x̅ and the sample size n.
Interval estimate of the Population Mean:
x̅ ± Margin of Error
Confidence Coefficient = (1 - α) = Confidence Level / 100
(e.g. with a Confidence Level of 95%, the Confidence Coefficient is 0.95 and α = 0.05)
Population SD σ is known:
x̅ ± Zα/2 (σ/√n)
Zα/2 is the Z value providing an area of α/2 in the upper/lower tail of the standard normal probability distribution.
Population SD σ is not known:
x̅ ± tα/2 (s/√n)
s is the Sample SD; tα/2 is the t value providing an area of α/2 in the upper tail of the t-Distribution with n - 1 degrees of freedom.
Margin of Error, E = Zα/2 (σ/√n)
Desired Sample Size, n = (Zα/2)^2 * σ^2 / E^2
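A sketch of these interval and sample-size formulas in R, using made-up sample statistics; qnorm() and qt() supply Zα/2 and tα/2:
xbar = 32; s = 6; n = 40; alpha = 0.05                         # hypothetical sample statistics
xbar + c(-1, 1) * qnorm(1 - alpha/2) * s / sqrt(n)             # σ (treated as) known: x̅ ± Zα/2 σ/√n
xbar + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * s / sqrt(n)    # σ unknown: x̅ ± tα/2 s/√n
E = 1.5                                                        # desired margin of error
ceiling((qnorm(1 - alpha/2) * s / E)^2)                        # required sample size n = (Zα/2)^2 σ^2 / E^2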

Hypothesis Testing
Type-1 & Type-2 Errors
(μ0 is the Hypothesized Value; μ is the Population Mean)

              H0 True         Ha True
Accept H0     Correct         Type-II Error
Reject H0     Type-I Error    Correct
Level of Significance - α
It is the probability of making a Type-I error when the null hypothesis is true as an equality
It is represented by α
The smaller the value of α, the lower the chance of making a Type-I error.
P-Value
A probability that provides a measure of evidence against the null hypothesis provided by the sample.
Smaller p-values indicate more evidence against H0
It is the probability of obtaining a test statistic at least as extreme as the one computed from the sample, assuming H0 is true.
One Tailed Test
Lower Tail Test: H0: μ >= μ0, Ha: μ < μ0
Upper Tail Test: H0: μ <= μ0, Ha: μ > μ0

Critical Value Approach (Lower Tail): take the required significance level α; from the Z-table, find the value of Z whose lower-tail area equals α and mark it as Z0. Reject H0 if Z <= Z0.

Lower Tail Test:
Population SD, σ is known
1. Calculate the Z statistic: Z = (x̅ - μ0)/(σ/√n)
2. Calculate the p-value (the lower-tail area) by looking up the Z table
3. Reject H0 if p-value <= α
Population SD, σ is unknown
1. Calculate the t-statistic: t = (x̅ - μ0)/(s/√n)    (s is the Sample SD; n is the Sample Size)
2. Degrees of freedom = n - 1
3. Find the p-value corresponding to t from the t-table
4. Reject H0 if p-value <= α
Upper Tail Test:
Population SD, σ is known
1. Calculate the Z statistic: Z = (x̅ - μ0)/(σ/√n)
2. Look up the cumulative probability for Z in the Z table; the p-value is 1 minus this value (the upper-tail area)
3. Reject H0 if p-value <= α
4. Reject H0 if Z >= Zα (critical value approach)
Population SD, σ is unknown
1. Calculate the t-statistic: t = (x̅ - μ0)/(s/√n)    (s is the Sample SD; n is the Sample Size)
2. Degrees of freedom = n - 1
3. Find the upper-tail p-value corresponding to t from the t-table
4. Reject H0 if p-value <= α
Two Tailed Test
H0: μ = μ0
Ha: μ != μ0
Population SD, σ is known
1. Compute the value of the test statistic Z = (x̅ - μ0)/(σ/√n)
2. If Z > 0, find the area to the right of Z in the upper tail
3. If Z < 0, find the area to the left of Z in the lower tail
4. Double the tail area obtained to get the p-value
5. Reject H0 if p-value <= α
6. Reject H0 if Z <= -Zα/2 or if Z >= Zα/2 (critical value approach)
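These one- and two-tailed tests (σ unknown) can be run in R with t.test(); the sample values below are purely illustrative:
x = c(12.1, 11.8, 12.6, 12.3, 11.9, 12.4, 12.0, 12.2)     # hypothetical sample
t.test(x, mu = 12, alternative = "two.sided")             # two-tailed test of H0: μ = 12
t.test(x, mu = 12, alternative = "greater")               # upper-tail test of H0: μ <= 12
t_stat = (mean(x) - 12) / (sd(x) / sqrt(length(x)))       # the t-statistic by hand
2 * pt(-abs(t_stat), df = length(x) - 1)                  # two-tailed p-value: double the tail area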

Regression Using ANOVA table


Null Hypothesis,
H0: β1 = β2 = β3 = β4 = 0 (i.e. none of the independent variables is related to the dependent variable)
Ha: at least one βi != 0
The F Statistic
Large values of the test statistic provide evidence against the null hypothesis.
The F test does not indicate which of the parameters βi is not equal to zero, only that at least one of them is linearly related to the response variable.
Calculated as F = MSR / MSE = (SSR/p) / (SSE/(n - p - 1)), i.e. the mean square due to regression divided by the mean square error, where p is the number of independent variables.

Linear Regression
Y = a + bX + e ---- (a -> Intercept; b-> slope of line; e -> error in prediction)

Baseline Model
Predicts the average value of the dependent variable regardless of the value of the independent variable. It is always a flat line and gives the maximum SSE.
Y = a ---- where a = avg(Yi)

SSE = Sum of Squared Errors = Σ ei^2


SST = Total Sum of Squares = the SSE of the baseline model (which always predicts the mean of the observed dependent variable).

R-Squared
R2 = 1 - (SSE/SST), where 0 <= SSE <= SST and SST > 0

0 <= R2 <= 1
R2 = 0 ----> SSE = SST, means that regression does not help. There is no improvement over baseline and Y & X are not very related
R2 = 1 ----> SSE = 0, means that the regression is a perfect fit and there are no errors
R squared is nice because it captures the value added from using a linear regression model over just predicting the average outcome for every data point.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Multiple Regression
Y = a + b1X1 + b2X2 + ……. + bnXn + e

Multiple linear regression allows you to use multiple variables at once to improve the model. The multiple linear regression model is similar to the one-variable regression model, but has a coefficient beta for each independent variable.
We predict the dependent variable y using the independent variables x1, x2, through xk, where k denotes the number of independent variables in our model. Beta 0 is, again, the coefficient for the intercept term, and beta 1, beta 2, through beta k are the coefficients for the independent variables. We use i to denote the data for a particular data point or observation.
The best model is selected in the same way as before: by minimizing the sum of squared errors, i.e. the error terms epsilon.

We can start by building a linear regression model that uses just the variable with the best R squared. Then we can add variables one at a time and look at the improvement in R squared (see the R sketch below). Note that the improvement is not equal to each added variable's one-variable R squared, since there are interactions between the independent variables.

Adding independent variables improves the R squared to almost double what it was with a single independent variable. But there are diminishing returns. The marginal
improvement from adding an additional variable decreases as we add more and more variables.
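A sketch of this variable-by-variable comparison in R, using the wine example whose output appears below:
modelA = lm(Price ~ AGST, data = wine)                  # single best variable
modelB = lm(Price ~ AGST + HarvestRain, data = wine)    # add one more variable
summary(modelA)$r.squared
summary(modelB)$r.squared                               # improvement is less than HarvestRain's one-variable R²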

Overfitting
Often not all variables should be used. This is because each additional variable used requires more data, and using more variables creates a more complicated model. Overly
complicated models often cause what's known as overfitting. This is when you have a higher R squared
on data used to create the model, but bad performance on unseen data.

Adjusted R-squared
This number adjusts the R-squared value to account for the number of independent variables used relative to the number of data points. Multiple R-squared will always
increase if you add more independent variables. But Adjusted R-squared will decrease if you add an independent variable that doesn't help the model. This is a good way to
determine if an additional variable should even be included in the model
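A sketch of this in R, using the two wine models whose summaries appear in this section (FrancePop is the variable that does not help):
fit4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data = wine)
fit5 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data = wine)
summary(fit5)$r.squared - summary(fit4)$r.squared           # multiple R² never decreases when a variable is added
summary(fit5)$adj.r.squared - summary(fit4)$adj.r.squared   # adjusted R² falls when FrancePop adds nothing useful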

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.45039886 10.18888394 -0.044 0.965202
AGST 0.60122388 0.10302027 5.836 0.0000127 ***
HarvestRain -0.00395812 0.00087511 -4.523 0.000233 ***
WinterRain 0.00104251 0.00053099 1.963 0.064416 .
Age 0.00058475 0.07900313 0.007 0.994172
FrancePop -0.00004953 0.00016668 -0.297 0.769578
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.8294, Adjusted R-squared: 0.7845
F-statistic: 18.47 on 5 and 19 DF, p-value: 1.044e-06

A coefficient of 0 means that the value of the independent variable does not change our prediction for the dependent variable. If a coefficient is not significantly different
from 0, then we should probably remove the variable from our model since it's not helping to predict the dependent variable.
Regression coefficients represent the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model
constant.
The standard error column gives a measure of how much the coefficient is likely to vary from the estimate value
The t value is the estimate divided by the standard error. It will be negative if the estimate is negative and positive if the estimate is positive. The larger the absolute value of
the t value, the more likely the coefficient is to be significant. So we want independent variables with a large absolute t-value.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model, because changes in the predictor's value are related to changes in the response variable. The p-value will be large if the absolute value of the t value is small, and small if the absolute value of the t value is large. We want independent variables with small values in this column.

Correlation & Multicollinearity


#Removing France Population from model (since it is not significant)#
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.4299802 1.7658975 -1.942 0.066311 .
AGST 0.6072093 0.0987022 6.152 5.2e-06 ***
HarvestRain -0.0039715 0.0008538 -4.652 0.000154 ***
WinterRain 0.0010755 0.0005073 2.120 0.046694 *
Age 0.0239308 0.0080969 2.956 0.007819 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.8286, Adjusted R-squared: 0.7943
F-statistic: 24.17 on 4 and 20 DF, p-value: 2.036e-07

After removing France Population, the variable Age becomes significant in the model. This is because FrancePop and Age were highly correlated with each other. Also note that this is a better model than the previous one, as the Adjusted R2 has increased.
Correlation
Correlation measures the linear relationship between two variables and is a number between -1 and +1.

A correlation of +1 means a perfect positive linear relationship.
A correlation of -1 means a perfect negative linear relationship.
In the middle of these two extremes is a correlation of 0, which means that there is no linear relationship between the two variables

Multicollinearity
refers to the situation when two independent variables are highly correlated.
A high correlation between an independent variable and the dependent variable is a good thing since we're trying to predict the dependent variable using the independent
variables.
Due to the possibility of multicollinearity, you should remove insignificant variables one at a time.
"Highly correlated" typically means a correlation greater than 0.7 or less than -0.7.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Important R - Codes
# Read in data
wine = read.csv("C:\\Users/Raktim/Documents/IIM-Trichy/ClassRoom/Term-5/MASDM/Analytics Edge/Wine Test/wine.csv")
str(wine)
summary(wine)
# Multi variable Regression
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
# Correlations
cor(wine$WinterRain, wine$Price)
cor(wine$Age, wine$FrancePop)
cor(wine)
# Make test set PREDICTIONs
predictTest = predict(model4, newdata=wineTest)
predictTest
# Compute R-squared
SSE = sum((wineTest$Price - predictTest)^2)
SST = sum((wineTest$Price - mean(wine$Price))^2)
Rsquared = 1 - SSE/SST
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Logistic Regression
Logistic regression predicts the probability of the outcome variable being true. The probability that the outcome variable is 0 is just 1 minus the probability that the outcome
variable is 1.
Logistic Response Function
P(y=1) = 1 / (1 + e^-(B0 + B1X1 + B2X2 + ... + BnXn))
Nonlinear transformation of Linear Regression equation to produce number between 0 and 1.
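A tiny numeric illustration in R with made-up coefficients (plogis() is R's built-in logistic function):
b = c(-1.5, 0.08, 0.13)            # hypothetical coefficients B0, B1, B2
x1 = 10; x2 = 3                    # hypothetical observation
logit = b[1] + b[2]*x1 + b[3]*x2   # the linear part
1 / (1 + exp(-logit))              # P(y = 1) via the logistic response function
plogis(logit)                      # built-in equivalent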

Positive Coefficient Values are predictive of Class 1


Negative Coefficient Values are predictive of Class 0

Odds
The Odds are the probability of 1 divided by the probability of 0
Odds = P(y=1)/P(y=0) => P(y=1)/(1-P(y=1))
The Odds are greater than 1 if 1 is more likely, and less than 1 if 0 is more likely.
The Odds are equal to 1 if the outcomes are equally likely.
Logit
If you substitute the Logistic Response Function for the probabilities in the Odds equation above, you can see that the Odds are equal to "e" raised to the power of the linear regression equation:
Odds = e^(B0 + B1X1 + B2X2 + ... + BnXn)
Log(Odds) = B0 + B1X1 + B2X2 + …. + BnXn
This is the Logit, looks exactly like the linear regression equation.
A positive beta value increases the Logit, which in turn increases the Odds of 1.
A negative beta value decreases the Logit, which in turn, decreases the Odds of 1.

In R
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
# family=binomial tells glm() to build a logistic regression model; glm stands for generalized linear model

Deviance Residuals:
    Min      1Q   Median      3Q     Max
-1.6818 -0.6250  -0.4767 -0.1496  2.1060
Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.80449    0.59745  -4.694 2.68e-06 ***
OfficeVisits   0.07995    0.03488   2.292  0.02191 *
Narcotics      0.12743    0.04650   2.740  0.00614 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 111.888 on 98 degrees of freedom
Residual deviance: 84.855 on 96 degrees of freedom
AIC: 90.855
Number of Fisher Scoring iterations: 5

Interpreting the OfficeVisits coefficient: for two people A and B who are otherwise identical, one additional office visit increases the Predicted Log Odds of A by 0.08 over the Predicted Log Odds of B:
Ln(OddsA) = Ln(OddsB) + 0.08
=> exp(Ln(OddsA)) = exp(Ln(OddsB) + 0.08)
=> OddsA = exp(0.08) * OddsB
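Following on from the fitted QualityLog model above, a short sketch of turning the coefficients into odds ratios and getting predicted probabilities:
exp(coef(QualityLog))                                   # odds ratios, e.g. exp(0.07995) ≈ 1.08 per extra office visit
predictTrain = predict(QualityLog, type="response")     # predicted probabilities P(y=1) on the training data
summary(predictTrain)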

AIC
Measure of Quality of the model. Similar to Adjusted R-square. Accounts for number of variables used compared to the number of observations. Can only be used to
compare between models on the same data set. The preferred model is the model with minimum AIC.

Thresholding :: Confusion Matrix


Often, we want to make an actual prediction. We can convert the probabilities to predictions using what's called a threshold value, t.
The threshold value, t, is often selected based on which errors are better. There are two types of errors that a model can make --ones where you predict 1, or poor care, but
the actual outcome is 0, and ones where you predict 0, or good care, but the actual outcome is 1.
If we pick a large threshold value t, then we will predict poor care rarely, since the probability of poor care has to be really large to be greater than the threshold. This means
that we will make more errors where we say good care, but it's actually poor care. This approach would detect the patients receiving the worst care and prioritize them for
intervention.
On the other hand, if the threshold value, t, is small, we predict poor care frequently, and we predict good care rarely. This means that we will make more errors where we
say poor care, but it's actually good care. This approach would detect all patients who might be receiving poor care.
Decision-makers often have a preference for one type of error over the other, which should influence the threshold value they pick.
If there's no preference between the errors, the right threshold to select is t = 0.5, since it just predicts the most likely outcome.

Confusion matrix or Classification Matrix


Predicted = 0 Predicted = 1
Actual = 0 True Negative (TN) False Positive (FP)
Actual = 1 False Negative (FN) True Positive (TP)
This compares the actual outcomes to the predicted outcomes. The rows are labelled with the actual outcome, and the columns are labelled with the predicted outcome.
Each entry of the table gives the number of data observations that fall into that category

Sensitivity
is equal to the true positives divided by the true positives plus the false negatives, and measures the percentage of actual poor care cases that we classify correctly. This is
often called the true positive rate.
= TP/(TP + FN)
Specificity
is equal to the true negatives divided by the true negatives plus the false positives, and measures the percentage of actual good care cases that we classify correctly. This is
often called the true negative rate.
= TN/(TN + FP)
Threshold, Specificity & Sensitivity
• A model with a higher threshold will have a lower sensitivity and a higher specificity.
• A model with a lower threshold will have a higher sensitivity and a lower specificity.
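A sketch of the confusion matrix and these two measures in R for the QualityLog model above, at an assumed threshold of t = 0.5 (any other threshold can be substituted):
predictTrain = predict(QualityLog, type="response")      # predicted probabilities (as above)
cm = table(qualityTrain$PoorCare, predictTrain > 0.5)    # confusion matrix at t = 0.5; rows = actual outcome
cm
cm[2, 2] / sum(cm[2, ])                                  # sensitivity = TP / (TP + FN)
cm[1, 1] / sum(cm[1, ])                                  # specificity = TN / (TN + FP)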

Selecting a Threshold :: ROC Curve


A Receiver Operating Characteristic curve, or ROC curve, can help you decide which value of the threshold is best.
The sensitivity, or true positive rate of the model, is shown on the y-axis, and the false positive rate, or 1 minus the specificity, is given on the x-axis. The curve shows how these two outcome measures vary with different threshold values.

The ROC curve always starts at the point (0, 0). This corresponds to a threshold value of 1. If you have a threshold of 1, you will not catch any poor care cases, or have a
sensitivity of 0. But you will correctly label all of the good care cases, meaning you have a false positive rate of 0.

The ROC curve always ends at the point (1,1), which corresponds to a threshold value of 0. If you have a threshold of 0, you'll catch all of the poor care cases, or have a
sensitivity of 1, but you'll label all of the good care cases as poor care cases too, meaning you have a false positive rate of 1. The threshold decreases as you move from (0,0)
to (1,1).

This helps you select a threshold value by visualizing the error that would be made in the process.
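One way to draw this curve in R is with the ROCR package (an assumption here: ROCR is not loaded in the code section above and must be installed separately), continuing with the QualityLog predictions:
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
ROCRperf = performance(ROCRpred, "tpr", "fpr")                        # true positive rate vs false positive rate
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, 0.1))    # threshold values printed along the curve
as.numeric(performance(ROCRpred, "auc")@y.values)                     # area under the ROC curve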
