
OLS Assumptions
No measurement error (biased estimates: inflated)
No specification error
Non-linear relationship modeled as linear (biased estimates)
Relevant Xs excluded (biased estimates)
Irrelevant Xs included (inflated standard errors)
Error term assumptions
Homoskedasticity (variance of error term is constant)
(if not, inflated/deflated standard errors)
No autocorrelation (residuals are not correlated)
(if not, inflated/deflated standard errors)
Residuals average out to 0 (this is built into OLS)
Covariance between residuals and ind var = 0
(if leave relevant var out, biased estimates)
Error terms are normally distributed

Parameter Estimates
α = true (population) intercept
a = estimated (sample) intercept
β = true (population) slope
b = estimated (sample) slope
ε = true (population) error
e = estimated (sample) residual
True regression equation (for the total population): $Y = \alpha + \beta X + \varepsilon$
Estimated regression equation (for the sample): $Y = a + bX + e$

Criteria used by OLS for fitting a line
Basic idea is to choose estimates b1 to bk to minimize the sum of the squared residuals (errors)

R2
% of the variance in Y that is explained by X
Measures goodness of fit
$R^2 = \dfrac{\text{Regression Sum of Squares (RSS)}}{\text{Total Sum of Squares (TSS)}} = \dfrac{\sum (\hat{y}_i - \bar{y})^2}{\sum (y_i - \bar{y})^2}$
Problems with R2:
1) not a measure of the magnitude of the relationship btwn X & Y
2) dependent on (vulnerable to) the std dev of X & Y
can't compare across samples
biased in small samples
3) addition of any variable will increase R2 (it never decreases)
include variables only because of theory

Variance (standard error) of Slope Estimate
Tells you how stable the line is
High variance (points more spread around the line): Bad, line is less stable
Low variance (points compacted around the line): Good, line is more stable
As sample size increases, variance usually decreases

Problems with the Slope (parameter) Estimate
1. Biased slope estimate: over an infinite number of samples, the estimate will not equal the true population value
2. Want efficient estimators: unbiased estimator with the least variance

Centering
How: subtract the mean of X from each X (same for Y)
Run OLS with the new values
Why: might make results easier to interpret
Consequences: slope (b) no change; intercept (a) changes
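
The centering recipe (subtract the mean of X from each X, same for Y, then run OLS with the new values) can be checked with a small sketch. The data are made up for illustration and `ols_fit` and `center` are hypothetical helper names; this is a one-X illustration under those assumptions, not a general regression routine.

```python
def ols_fit(x, y):
    """Simple (one-X) OLS; returns (intercept a, slope b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Made-up illustration data (an assumption, not from the notes).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_raw, b_raw = ols_fit(x, y)
a_ctr, b_ctr = ols_fit(center(x), center(y))

print("raw:      a =", a_raw, " b =", b_raw)
print("centered: a =", a_ctr, " b =", b_ctr)
```

The slope should match across the two fits, while the intercept moves (to zero when both X and Y are centered).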
Forcing the Intercept Through the Origin
If theory predicts that if X = 0, Y should = 0
Not a good idea:
1) changes the slope (strength of the relationship)
2) can't test H0: α = 0 (true intercept = 0)
3) won't work if you really have a curvilinear rel
4) maybe it doesn't make sense to talk about the line at all if X = 0
5) may have a bad sample; makes it appear sig.
6) if you force the line you deny the chance to see if there is something wrong with the model and if the model actually predicts an intercept of 0
7) costs of leaving a in are minor compared to taking it out
8) R2, slope, intercept all change; difficult to interpret

Standardized Estimates (Beta Weights)
Change in Y in standard deviation units brought about by one standard deviation unit change in X
Units are lost in the transformation (now in std dev units)
Hard to convey meaning to the reader (b/c std dev units)

Functional Transformations of Independent Variable
Used if there is a non-linear relationship btwn X & Y
Log (X)
X
X^2
To interpret the results, the new values would be plugged into the equation to get predicted y values

Confidence Intervals for β
Over all samples, 95% of the computed confidence intervals will cover the true β
$\hat{\beta}_i \pm t_{\alpha/2} \cdot \text{std error}(\hat{\beta}_i)$

Parameter Estimates & Degrees of Freedom
Degrees of Freedom: n-k-1
Parameters: k+1 (for the intercept)

Confidence Intervals for E(Y0)
Over all samples, 95% of the confidence intervals will cover the true Y value
Each Y value has its own confidence interval
Extrapolation is when you predict a value for Y with an X that is not actually in your sample

Standard Error of Regression (Std Dev of Residuals; Root Mean Sqd Error; Std error of Estimate)

Adjusted R2
$\text{Adj } R^2 = \left(R^2 - \dfrac{k}{n-1}\right)\left(\dfrac{n-1}{n-(k+1)}\right)$
Since adding variables inflates the R2, adjusted R2 takes the degrees of freedom into account to fix this
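
A minimal sketch of the adjusted R2 formula, (R2 - k/(n-1)) * ((n-1)/(n-(k+1))), is given here. The second function is the more common textbook form, 1 - (1 - R2)(n-1)/(n-k-1), added only as a cross-check; the R2, n, and k values are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 in the form used in these notes."""
    return (r2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))


def adjusted_r2_textbook(r2, n, k):
    """Algebraically equivalent textbook form, used here as a check."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical example: the same R2 = 0.40 with n = 50 cases,
# reported for models with 1, 5, and 10 regressors.
results = {k: adjusted_r2(0.40, 50, k) for k in (1, 5, 10)}
for k, value in results.items():
    print("k =", k, " adjusted R2 =", round(value, 4))
```

With R2 held fixed, the adjusted value falls as k grows, which is the degrees-of-freedom penalty the notes describe.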
Dummy Variable
Changes the intercept
The two groups start at different points, but have the
same slope
Interactions
Changes the slopes
The difference in slopes between the groups
Should calculate predicted Y values to see impact
Use the means to calculate the predicted values for the
variables used in the interaction (4 equations) but
include all the variables
Possible interpretations (ex: interaction of gender/feeling)
Among Republicans, women like Clinton even less
Among women, Republicans like Clinton even less
There is an additional effect of gender on feelings
towards Clinton
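
The "4 equations" idea for an interaction of two dummies can be sketched as follows. All coefficients, the `age` control, and its mean are hypothetical numbers chosen for illustration (they are not estimates from any data); the point is only how the four predicted values are built while holding the other variable at its mean.

```python
# Hypothetical coefficients for: feeling = a + b1*female + b2*republican
#                                + b3*(female*republican) + b4*age
coef = {
    "intercept": 60.0,
    "female": 5.0,
    "republican": -30.0,
    "female_x_republican": -8.0,
    "age": 0.1,
}
MEAN_AGE = 45.0  # hypothetical sample mean for the non-interacted variable


def predict(female, republican, age=MEAN_AGE):
    """Predicted Y for one group, other variables held at their means."""
    return (coef["intercept"]
            + coef["female"] * female
            + coef["republican"] * republican
            + coef["female_x_republican"] * female * republican
            + coef["age"] * age)


# Four equations: one per combination of the two dummies.
predictions = {(f, r): predict(f, r) for f in (0, 1) for r in (0, 1)}
for (f, r), y_hat in predictions.items():
    print("female =", f, "republican =", r, "predicted =", round(y_hat, 2))

# Gender gap within each party group; their difference is the interaction.
gap_non_rep = predictions[(1, 0)] - predictions[(0, 0)]
gap_rep = predictions[(1, 1)] - predictions[(0, 1)]
print("gap among non-Republicans:", round(gap_non_rep, 2))
print("gap among Republicans:", round(gap_rep, 2))
```

The difference between the two gender gaps equals the interaction coefficient, which is why the interaction is read as a difference in slopes between groups.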
Outliers (decision flowchart)
Outliers? No: stop. Yes: are they severe (e > 3 std dev) or moderate?
Analyze with and without outliers: sig difference in results?
No: report from whole dataset and add footnote (id'd some outliers but they did not make a difference)
Yes: look harder for explanation
Explained away (explain as anomalies)? Yes: delete, explain in footnote. No: report both

Standard error of regression $= \sqrt{\dfrac{\sum e_i^2}{n-(k+1)}}$
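
As a rough sketch, the standard error of regression is the square root of the sum of squared residuals divided by n-(k+1). The residuals below are made up, and `standard_error_of_regression` is a hypothetical helper name.

```python
import math


def standard_error_of_regression(residuals, k):
    """sqrt( sum(e_i^2) / (n - (k + 1)) ), reported in the units of Y."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - (k + 1)))


# Made-up residuals from a model with k = 1 independent variable.
residuals = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
ser = standard_error_of_regression(residuals, k=1)
print("standard error of regression:", round(ser, 4))
```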

t-statistics and One Tailed Tests


$t = \dfrac{b}{\text{std err of } b}$
One-tailed tests should be used if the researcher's theory suggests that the relationship between the two variables goes in a specific direction
P values
A p value is the lowest significance level at which a null
hypothesis can be rejected; they are connected with the
t statistic

Mean std. error

Measures the goodness of fit


Alternative to using the R2
Not dependent on the standard deviations of X or Y
Just one number for each equation (in Y units)
Can compare across samples

To determine if β ≠ 0, we would find the actual probability of obtaining a value of the test statistic (p value for the t statistic) as much as or greater than that obtained in the example; we can reject or fail to reject the null hypothesis on the basis of that number (p ≤ .05 needed to reject)
To determine if β > 0, a one-tailed test is needed; to do this, only one half of the area under the graph is analyzed (1/2 of the p value)
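
The "1/2 of the p value" rule can be sketched as below, assuming the software reports a two-tailed p value. The halving applies when t has the sign the theory predicts; returning 1 - p/2 when the sign is the opposite is an addition not stated in the notes. The function names and input numbers are hypothetical.

```python
def t_statistic(b, std_err_b):
    """t = b / (std err of b)."""
    return b / std_err_b


def one_tailed_p(two_tailed_p, t, expect_positive=True):
    """Convert a reported two-tailed p value into a one-tailed p value."""
    in_predicted_direction = (t > 0) == expect_positive
    half = two_tailed_p / 2
    return half if in_predicted_direction else 1 - half


# Hypothetical slope and standard error.
t = t_statistic(b=0.8, std_err_b=0.4)
print("t =", t)

# Hypothetical two-tailed p value as it might be reported by software.
p_one = one_tailed_p(two_tailed_p=0.08, t=t, expect_positive=True)
print("one-tailed p =", p_one)
```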

Standard Error of Parameter Estimate
$\text{std err of } b = \sqrt{\dfrac{\sum e_i^2 / \left(n-(k+1)\right)}{\sum (x_i - \bar{x})^2}}$

Finding the Standard Error
$\text{std err of } b = \dfrac{\text{standard error of regression}}{\sqrt{\sum (x_i - \bar{x})^2}}$

$b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$

Miscellaneous Info
Adjusted R2 Standard Error of Reg
Sample Size is total df from ANOVA table + 1
If std error of estimate is inflated, t will drop, p goes up, keep Ho when it should be rejected
If std error of estimate is deflated, t will go up, p goes down, reject Ho when it should be kept

Multicollinearity (MC)
X1 can be predicted if the values of X2 and X3 are known
Can't tell which variable is actually having an impact
It exists in degrees and the magnitude of it determines whether or not it is a problem
Inflates the standard errors: all look less significant
Diagnose it using VIF (variance inflation factor) scores: scores of 4 or 5 usually cut off point for problems; higher scores are problematic
High Aux R2 (> .75) indicates high MC: you can explain a lot of X1 with the other variables
What to do about it:
Get more data; pool data (but that is problematic)
Combine variables (ex: socioeconomic status combines education, income, and occupational prestige which alone tend to be highly correlated)
Drop one X
Don't do this: you only added it because it was theoretically important
If you drop a relevant right hand side (RHS) variable you get biased parameter estimates
It is acceptable to run 2 models:
1) with all variables
2) with some dropped variables, showing that it could be misestimating
You are giving full information (important)

Variance Inflation Factor (VIF)
$\text{VIF} = \dfrac{1}{1 - \text{aux } R^2}$
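
The link between the auxiliary R2 and the VIF, VIF = 1 / (1 - aux R2), can be tabulated with a short sketch. The aux R2 values are hypothetical; in practice each one would come from regressing one X on the other Xs.

```python
def vif(aux_r2):
    """Variance inflation factor from an auxiliary R2."""
    return 1.0 / (1.0 - aux_r2)


# Hypothetical auxiliary R2 values for illustration.
vif_table = {aux: vif(aux) for aux in (0.0, 0.5, 0.75, 0.8, 0.9)}
for aux, value in vif_table.items():
    print("aux R2 =", aux, " VIF =", round(value, 2))
```

An auxiliary R2 of .75 corresponds to a VIF of 4, which lines up with the two rules of thumb in the notes (aux R2 > .75, VIF of 4 or 5).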

How to deal with outliers
Diagnose them (SPSS can give you all of these numbers)
1. DFBETA measures if there is a change in a particular variable's coefficient when a particular case is removed; if the value is big, the case has a large impact
2. Cook's Distance (Cook's D) measures the influence of an observation on the model as a whole
Cook's D cutoff: $D_i > \dfrac{4}{n-(k+1)}$
3. look at abs value of standardized residuals; helps flag cases; |e| > 3 indicates it is pretty far off the line
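
The two numeric rules of thumb (Cook's D above 4/(n-(k+1)) and an absolute standardized residual above 3) can be applied to saved diagnostics as in this sketch. The standardized residuals and Cook's D values are made up, standing in for numbers saved from a statistics package; `flag_cases` is a hypothetical helper.

```python
def cooks_d_cutoff(n, k):
    """Rule-of-thumb cutoff: 4 / (n - (k + 1))."""
    return 4.0 / (n - (k + 1))


def flag_cases(std_residuals, cooks_d, k):
    """Return (case index, reasons) for every case that trips a rule."""
    n = len(std_residuals)
    cutoff = cooks_d_cutoff(n, k)
    flagged = []
    for i, (e, d) in enumerate(zip(std_residuals, cooks_d)):
        reasons = []
        if abs(e) > 3:
            reasons.append("|std residual| > 3")
        if d > cutoff:
            reasons.append("Cook's D > cutoff")
        if reasons:
            flagged.append((i, reasons))
    return flagged


# Made-up diagnostics for n = 10 cases from a model with k = 2 Xs.
std_residuals = [0.2, -0.5, 3.4, 1.1, -0.9, 0.3, -1.2, 0.8, -0.1, 0.6]
cooks_d = [0.01, 0.02, 0.90, 0.05, 0.03, 0.01, 0.60, 0.02, 0.01, 0.02]

flagged = flag_cases(std_residuals, cooks_d, k=2)
for index, reasons in flagged:
    print("case", index, ":", ", ".join(reasons))
```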

Missing Values
1. listwise deletion: if a value for any X is missing, entire case deleted (lose lots of data)
2. pairwise deletion (not a great idea, but not evil): uses info in pairs of variables to estimate slope coefficients
Problem: get different sample size for each variable, some variables have more/less info
3. substitute mean (mean substitution) NOT GOOD IDEA: substitute the mean value for missing ones; will bias over time
No positive value in doing this
No guarantee that the mean of missing values will be the same as for the ones you have values for; could be putting the wrong value in
4. predict the value that is missing **best option**: run regression with missing var. as dep variable; use actual info to make an educated guess

F-test
Testing to see if the variables are significant
Null hypothesis: none of the variables have a significant impact (b1 = b2 = b3 = ... = bk = 0)
Alt. Hypothesis: at least one variable has a significant impact
When H0 is false, Mean Reg Sum of Sqrs > Mean Err SS

F-test limitations
Does not tell you which is significant, just that something is
If have insignificant F, all b insignificant

Autocorrelation
Residuals are correlated; usually happens with time series data; can inflate/deflate standard errors
Diagnosing:
1. scatterplot: look for pattern in residuals
2. regress residuals on previous residuals
$e_i = \rho e_{i-1} + u_i$, looking for sig ρ
How much bias is indicated?
ρ     % bias induced
.0    0%
.2    3%
.5    8%
.8    19%
.9    29%
3. Durbin Watson Scores: H0: no autocorrelation (AC)
$\hat{d} = 2 - 2\hat{\rho}$
ρ     d
-1    4
0     2
1     0
d near 0 or near 4: AC (reject H0); d near 2: no AC (fail to reject H0)
Durbin Watson: save in SPSS, look up critical value
Autocorrelation: variance; Heteroskedasticity: variance, stable, std error

Heteroskedasticity
When variance of error term is not constant; violation of OLS assumptions: Var(ε) = constant
Predictive power of model not consistent across cases or variables
When to suspect it:
Pooled data
Learning curves with coding
Fatigue effects
Whenever think some portion of data will be better reported/predicted
Consequences
Inflated standard error (conclude there is less sig)
If small var(e) are located away from mean X, std error is too large, underconfident that b ≠ 0
Deflated standard error (conclude there is more sig)
If large var(e) are located away from mean X, reported std error will be too low
How to diagnose it:
Scatter plot
Goldfeld/Quandt
Order observations by suspect x
Throw out middle observations
Run 2 models using all original xs
$F = \dfrac{\text{Mean residual } SS_1}{\text{Mean residual } SS_2}$
Look up value on F chart to see if > than critical value
Limitation: only diagnoses at ends, not middle
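
The Goldfeld/Quandt steps (order observations by the suspect x, throw out the middle observations, run two models, take the ratio of the mean residual sums of squares) can be sketched for a one-X model. Several choices here are assumptions rather than anything stated in the notes: dropping the middle 20% of cases, putting the larger mean residual SS in the numerator, and the made-up data whose error spread grows with x. The resulting F still has to be compared with a critical value from an F table.

```python
def residual_ss(x, y):
    """Residual sum of squares from a simple (one-X) OLS fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt_f(x, y, drop_fraction=0.2):
    """Return (F ratio, df of each subsample) for a one-X model."""
    pairs = sorted(zip(x, y))              # order observations by suspect x
    n = len(pairs)
    n_drop = int(n * drop_fraction)        # middle observations thrown out
    n_each = (n - n_drop) // 2
    low, high = pairs[:n_each], pairs[n - n_each:]
    df = n_each - 2                        # n - (k + 1) with k = 1
    ms_low = residual_ss(*zip(*low)) / df  # mean residual SS, low-x model
    ms_high = residual_ss(*zip(*high)) / df
    # Assumption: larger mean residual SS goes on top so that F >= 1.
    return max(ms_low, ms_high) / min(ms_low, ms_high), df


# Made-up data: the error term alternates in sign and grows with x.
x = list(range(1, 21))
y = [1 + 2 * xi + 0.1 * xi * (-1) ** xi for xi in x]

f_ratio, df = goldfeld_quandt_f(x, y)
print("F =", round(f_ratio, 2), "with", df, "and", df, "degrees of freedom")
```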

Glejser
Save residuals
Run new regression with abs residual as DV and suspect x as only IV
If new parameter est sig, have hetero; can predict with residual
Limitations: can't check middle, only works for linear

White's
Save residuals
Regress the squared residuals on all IVs, their squares (but not dummies) & all potential interactions
$n \cdot R^2 \sim \chi^2$
df = # of regressors
Limitations: could be overkill; doesn't tell you where the problem is; since have so many variables, some could be randomly significant and you could diagnose when not really problem

WLS
Assumes that best info in data is in the observations with least variance in error terms
Weighs some observations more than others
Divide all by hetero variable
Pure hetero should not bias parameter estimates, but could be indication of measurement/specification problems and correcting for it could bias parameter estimates

Logit
OLS not for dichotomous DV
Predicted values > 1 & < 0, which are not options; actual values are only 0 & 1
Induces heteroskedasticity: residuals clustered in the middle, but no actual values there
Choice functions tend to be
Can't model probability as a straight line, if do, misspecify model and bias parameter estimate
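
The point that a straight line can predict values above 1 or below 0 for a dichotomous DV, while a logit curve cannot, can be seen in a small sketch. Both sets of coefficients are hypothetical and chosen only to make the contrast visible; they are not fitted to any data.

```python
import math


def linear_probability(x, a=-0.5, b=0.25):
    """Straight-line prediction (hypothetical coefficients)."""
    return a + b * x


def logistic_probability(x, a=-4.0, b=1.0):
    """Logit prediction: 1 / (1 + exp(-(a + b*x))), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


xs = [0, 2, 4, 6, 8]
linear = [linear_probability(x) for x in xs]
logit = [logistic_probability(x) for x in xs]

for x, p_lin, p_log in zip(xs, linear, logit):
    print("x =", x, " linear =", round(p_lin, 3), " logit =", round(p_log, 3))
```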