
OLS Assumptions

No measurement error (if violated: biased, inflated estimates)
No specification error:
  Non-linear relationship modeled as linear (biased estimates)
  Relevant X's excluded (biased estimates)
  Irrelevant X's included (inflated standard errors)
Error term assumptions:
  Homoskedasticity: the variance of the error term is constant
    (if not, inflated/deflated standard errors)
  No autocorrelation: the residuals are not correlated with each other
    (if not, inflated/deflated standard errors)
  Residuals average out to 0 (this is built into OLS)
  Covariance between the residuals and the independent variables = 0
    (if a relevant variable is left out, biased estimates)
  Error terms are normally distributed

Parameter Estimates

α = "true" (population) intercept; a = estimated (sample) intercept
β = "true" (population) slope; b = estimated (sample) slope
ε = "true" (population) error; e = estimated residual
The "true" regression equation is for the total population; the estimated regression equation is for the sample.

Criteria used by OLS for fitting a line
The basic idea is to choose estimates b1 to bk that minimize the sum of the squared residuals (errors).

Problems with the Slope (parameter) Estimate
1. Biased slope estimate: over an infinite number of samples, the estimate will not equal the true population value.
2. We want efficient estimators: the unbiased estimator with the least variance.

Variance (standard error) of the Slope Estimate
Tells you how stable the line is.
High variance (the X values are bunched together, little spread): bad → the line is less stable.
Low variance (the X values are more spread out): good → the line is more stable.
As sample size increases, the variance usually decreases (more stable line, smaller standard error).

Centering
How: subtract the mean of X from each X value (same for Y), then run OLS with the new values.
Why: it might make the results easier to interpret.
Consequences: the slope (b) does not change; the intercept (a) changes.
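A minimal sketch of centering, assuming Python with numpy and statsmodels is available; the data and coefficients below are made up for illustration:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(10, 2, size=100)          # hypothetical predictor
    y = 3 + 0.5 * x + rng.normal(0, 1, 100)  # hypothetical outcome

    # Original fit
    fit_raw = sm.OLS(y, sm.add_constant(x)).fit()

    # Centered fit: subtract the mean of X from each X value
    x_c = x - x.mean()
    fit_cen = sm.OLS(y, sm.add_constant(x_c)).fit()

    # Slope is unchanged; intercept changes (it becomes the predicted Y at the mean of X)
    print(fit_raw.params)   # [intercept, slope]
    print(fit_cen.params)   # same slope, different intercept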
R2
The % of the variance in Y that is "explained" by X; measures goodness of fit.

R2 = Regression Sum of Squares (RSS) / Total Sum of Squares (TSS) = Σ(ŷi − ȳ)2 / Σ(yi − ȳ)2

Problems with R2:
1) Not a measure of the magnitude of the relationship between X and Y.
2) Dependent on (vulnerable to) the standard deviations of X and Y: can't compare across samples; biased in small samples.
3) The addition of any variable will increase R2: include variables only because of theory.

Standardized Estimates (Beta Weights)
The change in Y, in standard deviation units, brought about by a one standard deviation unit change in X.
Units are lost in the transformation (everything is now in standard deviation units).
Hard to convey meaning to the reader (because of the standard deviation units).

Functional Transformations of the Independent Variable
Used if there is a non-linear relationship between X and Y, e.g. log(X), √X, X2.
To interpret the results, plug the transformed values into the equation to get predicted Y values.

Forcing the Intercept Through the Origin
Used if theory predicts that when X = 0, Y should = 0. Not a good idea:
1) It changes the slope (the strength of the relationship).
2) You can't test H0: α = 0.
3) It won't work if you really have a curvilinear relationship.
4) It may not make sense to talk about the line at all when X = 0.
5) You may have a bad sample; forcing the origin can make the relationship appear significant.
6) If you force the line, you deny yourself the chance to see whether something is wrong with the model and whether the model actually predicts an intercept of 0.
7) The costs of leaving α in are minor compared to taking it out.
8) R2, slope, and intercept all change, which makes the results difficult to interpret.
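A minimal sketch of point 1 above, showing that dropping the intercept changes the slope; this again assumes numpy and statsmodels and uses simulated data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x = rng.normal(5, 1, size=100)               # hypothetical predictor
    y = 4 + 0.8 * x + rng.normal(0, 1, 100)      # true intercept is clearly not 0

    with_const = sm.OLS(y, sm.add_constant(x)).fit()
    no_const = sm.OLS(y, x).fit()                # forcing the line through the origin

    print(with_const.params)                     # [intercept, slope]
    print(no_const.params)                       # slope changes when the intercept is dropped
    print(with_const.rsquared, no_const.rsquared)  # R2 is not comparable across the two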


Confidence Intervals for β
Over all samples, 95% of the computed confidence intervals will cover the true β.

β̂i ± (tα/2)(std error of β̂i)

Parameter Estimates & Degrees of Freedom
Degrees of freedom: n − k − 1
Parameters: k + 1 (the extra 1 is for the intercept)
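A sketch of the interval formula above, assuming numpy, scipy, and statsmodels; the data are simulated:

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.normal(size=60)
    y = 1 + 2 * x + rng.normal(size=60)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    b, se = fit.params[1], fit.bse[1]

    # 95% CI by hand: b-hat +/- t(alpha/2, n-k-1) * se(b-hat)
    t_crit = stats.t.ppf(0.975, df=fit.df_resid)
    print(b - t_crit * se, b + t_crit * se)
    print(fit.conf_int(alpha=0.05)[1])   # statsmodels gives the same interval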

Confidence Intervals for E(Y0)
Over all samples, 95% of the confidence intervals will cover the true Y value.
Each Y value has its own confidence interval.
Extrapolation is when you predict a value of Y using an X that is not actually in your sample.

t-statistics and One-Tailed Tests

t = b / (std error of b)

To determine whether β ≠ 0, find the actual probability of obtaining a value of the test statistic as large as or larger than the one obtained (the p value for the t statistic); we accept or reject the hypothesis on the basis of that number (.05 is the usual cutoff).
One-tailed tests should be used if the researcher's theory suggests that the relationship between the two variables goes in a specific direction.
To determine whether β > 0, a one-tailed test is needed; only one half of the area under the curve is used (1/2 of the two-tailed p value).
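A sketch of the t statistic and the one-tailed vs. two-tailed p values, under the same assumptions (numpy, scipy, statsmodels; simulated data):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(3)
    x = rng.normal(size=80)
    y = 0.4 * x + rng.normal(size=80)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    t = fit.params[1] / fit.bse[1]                  # t = b / se(b)

    p_two_tailed = 2 * stats.t.sf(abs(t), df=fit.df_resid)
    p_one_tailed = stats.t.sf(t, df=fit.df_resid)   # H1: beta > 0 (half the two-tailed p)
    print(t, p_two_tailed, p_one_tailed)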

Standard Error of the Regression (Std Dev of Residuals)
(also called Root Mean Squared Error or Standard Error of the Estimate)

SER = √( Σei2 / (n − (k+1)) ) = √(mean squared error)

Measures the goodness of fit; an alternative to using R2.
Not dependent on the standard deviations of X or Y.
Just one number for each equation (in Y units), so it can be compared across samples.

Adjusted R2

Adj R2 = ( R2 − k/(n−1) ) × ( (n−1) / (n−(k+1)) )

Since adding variables inflates R2, adjusted R2 takes the degrees of freedom into account to correct for this.

P values
A p value is the lowest significance level at which a null hypothesis can be rejected; p values are connected with the t statistic.
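A quick numeric check of the two formulas above against statsmodels' built-in values, with simulated data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    n, k = 100, 2
    X = rng.normal(size=(n, k))
    y = X @ np.array([1.0, -0.5]) + rng.normal(size=n)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    resid = fit.resid

    ser = np.sqrt(np.sum(resid**2) / (n - (k + 1)))          # standard error of the regression
    adj = (fit.rsquared - k/(n-1)) * ((n-1) / (n - (k+1)))   # adjusted R2, formula above

    print(ser, np.sqrt(fit.mse_resid))   # agree
    print(adj, fit.rsquared_adj)         # agree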
Dummy Variable
Changes the intercept.
The two groups start at different points, but have the same slope.
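A minimal sketch of a dummy variable shifting only the intercept, assuming numpy and statsmodels; the group variable and coefficients are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(5)
    x = rng.normal(size=120)
    group = rng.integers(0, 2, size=120)          # hypothetical 0/1 dummy
    y = 1 + 2 * group + 0.8 * x + rng.normal(size=120)

    X = sm.add_constant(np.column_stack([x, group]))
    fit = sm.OLS(y, X).fit()
    print(fit.params)   # [intercept, slope on x, dummy coefficient]
    # Group 0 intercept = params[0]; group 1 intercept = params[0] + params[2];
    # the slope on x is the same for both groups.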
Interactions
Change the slopes: the interaction coefficient is the difference in slopes between the groups.
You should calculate predicted Y values to see the impact; use the means of the other variables and calculate predicted values for the combinations of the variables in the interaction (4 equations).
Possible interpretations (example: an interaction of gender and party on feelings towards Clinton):
  Among Republicans, women like Clinton even less.
  Among women, Republicans like Clinton even less.
  There is an additional effect of gender on feelings towards Clinton.
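A minimal sketch of an interaction shifting the slope, assuming numpy and statsmodels with simulated data; the four predicted values mirror the "4 equations" idea above:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(6)
    x = rng.normal(size=150)
    g = rng.integers(0, 2, size=150)                    # hypothetical group dummy
    y = 1 + 0.5 * x + 1.5 * g + 1.0 * x * g + rng.normal(size=150)

    X = sm.add_constant(np.column_stack([x, g, x * g]))
    fit = sm.OLS(y, X).fit()
    b0, b_x, b_g, b_xg = fit.params
    # Slope for group 0 is b_x; slope for group 1 is b_x + b_xg
    # (the interaction is the difference in slopes).
    # Predicted values for the four combinations of group and a low/high x:
    for gi in (0, 1):
        for xi in (x.mean() - x.std(), x.mean() + x.std()):
            print(gi, round(xi, 2), b0 + b_x * xi + b_g * gi + b_xg * xi * gi)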
Multicollinearity (MC)
X1 can be predicted if the values of X2 and X3 are known.
You can't tell which variable is actually having an impact.
It exists in degrees, and its magnitude determines whether or not it is a problem.
It inflates the standard errors, so everything looks less significant than it really is.

Diagnose it with the Variance Inflation Factor (VIF):

VIF = 1 / (1 − auxiliary R2)

Scores of 4 or 5 are the usual cut-off point for problems; higher scores are more problematic.
A high auxiliary R2 (> .75) indicates high MC: you can explain a lot of X1 with the other variables.
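A sketch of the VIF diagnosis using statsmodels' variance_inflation_factor, with deliberately collinear simulated data:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(7)
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.3, size=200)      # deliberately collinear with x1
    x3 = rng.normal(size=200)

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    # VIF for each regressor (skip column 0, the constant)
    for i in range(1, X.shape[1]):
        print(i, variance_inflation_factor(X, i))  # x1 and x2 should show high VIFs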

What to do about it:
Get more data; pool data (but that is problematic).
Combine variables (example: socioeconomic status combines education, income, and occupational prestige, which on their own tend to be highly correlated).
Drop one X. Don't do this if you only added the variable because it was theoretically important: if you drop a relevant right-hand-side (RHS) variable you get biased parameter estimates.
It is acceptable to run 2 models:
1) with all the variables
2) with some variables dropped, showing that the model could be misestimating.
That way you are giving full information (important).

Finding the Standard Error of the Parameter Estimate

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)2

Variance of b = [ Σei2 / (n − (k+1)) ] / Σ(xi − x̄)2

Std error of b = (standard error of the regression) / √Σ(xi − x̄)2

Miscellaneous Info
Adjusted R2 ↑ goes with Standard Error of the Regression ↓.
Sample size is the total df from the ANOVA table + 1.
If the standard error of the estimate is inflated, t will drop and p goes up: you keep H0 when it should be rejected.
If the standard error of the estimate is deflated, t will go up and p goes down: you reject H0 when it should be kept.
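A numeric check of the slope and standard-error formulas above, computed by hand with numpy and compared to statsmodels; the data are simulated:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(8)
    x = rng.normal(size=50)
    y = 2 + 1.5 * x + rng.normal(size=50)
    n, k = len(x), 1

    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    a = y.mean() - b * x.mean()
    e = y - (a + b * x)

    var_b = (np.sum(e**2) / (n - (k + 1))) / np.sum((x - x.mean())**2)
    se_b = np.sqrt(var_b)

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(b, fit.params[1])    # same slope
    print(se_b, fit.bse[1])    # same standard error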

Outliers

How to deal with outliers → diagnose them:
1. DFBETA measures whether a particular coefficient changes when a particular case is removed; if the value is big, the case has a large impact.
2. Cook's Distance (Cook's D) measures the influence of an observation on the model as a whole; a common cut-off is Di > 4 / (n − (k+1)).
3. Look at the absolute value of the standardized residuals; this helps flag cases; |e| > 3 indicates a case that is pretty far off the line.
SPSS can give you all of these numbers.

Then decide what to do with them (a rough decision path):
Are they severe (|e| > 3 standard deviations from the line)?
If moderate: analyze the data with and without them; if there is a difference in the results, report both; if not, use the whole dataset and add a footnote (identified some outliers, but they did not make a difference).
If severe: can they be explained as anomalies? If yes, delete them and explain in a footnote; if not, look harder for an explanation. Once they are explained (or explained away), stop.
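A sketch of the three diagnostics using statsmodels' influence measures (cooks_distance, dfbetas, standardized residuals); the outlier is planted in simulated data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(9)
    x = rng.normal(size=100)
    y = 1 + 2 * x + rng.normal(size=100)
    y[0] += 15                                   # plant an artificial outlier

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    infl = fit.get_influence()

    cooks_d = infl.cooks_distance[0]             # Cook's D for every case
    dfbetas = infl.dfbetas                       # DFBETA (standardized) per case and coefficient
    std_resid = infl.resid_studentized_internal  # standardized residuals

    n, k = len(y), 1
    flag = (cooks_d > 4 / (n - (k + 1))) | (np.abs(std_resid) > 3)
    print(np.where(flag)[0])                     # case 0 should be flagged
    print(dfbetas[0])                            # its DFBETA values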
Missing Values
1. Listwise deletion: if a value for any X is missing, the entire case is deleted (you lose lots of data).
2. Pairwise deletion (not a great idea, but not evil): uses the information in pairs of variables to estimate the slope coefficients.
   Problem: you get a different sample size for each variable; some variables have more or less information.
3. Mean substitution: substitute the mean value for the missing ones. NOT A GOOD IDEA: it will bias the estimates.
   There is no positive value in doing this; there is no guarantee that the missing values would have had the same mean as the ones you do have values for, so you could be putting the wrong value in.
4. Predict the value that is missing (best option): run a regression with the variable that has missing values as the dependent variable, and use the actual information you have to make an educated guess.

F-test
Tests whether the variables are jointly significant.
Null hypothesis: none of the variables has a significant impact (b1 = b2 = b3 = ... = bk = 0).
Alternative hypothesis: at least one variable has a significant impact.
When H0 is false, the Mean Regression Sum of Squares > the Mean Error Sum of Squares.
Limitations: the F-test does not tell you which variable is significant, just that something is; if the F is insignificant, all the b's are insignificant.
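A sketch of option 4 (regression imputation), assuming numpy and statsmodels; the variable with missing values and its predictor are made up:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(10)
    x1 = rng.normal(size=100)
    x2 = 0.6 * x1 + rng.normal(scale=0.5, size=100)
    x2[::10] = np.nan                              # hypothetical missing values in x2

    obs = ~np.isnan(x2)
    # Regress the variable with missing values on the other predictor(s)...
    imp_fit = sm.OLS(x2[obs], sm.add_constant(x1[obs])).fit()
    # ...then fill the gaps with its predictions (an educated guess, per option 4)
    x2_filled = x2.copy()
    x2_filled[~obs] = imp_fit.predict(sm.add_constant(x1[~obs]))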
Autocorrelation
The residuals are correlated with each other; this usually happens with time-series data; it can inflate or deflate the standard errors.

Diagnosing it:
1. Scatterplot: look for a pattern in the residuals.
2. Regress the residuals on the previous residuals:

   ei = ρei−1 + μi → look for a significant ρ

   How much bias is indicated?
   |ρ|   % bias induced
   .0    0%
   .2    3%
   .5    8%
   .8    19%
   .9    29%

3. Durbin-Watson scores. H0 → no autocorrelation (AC).

   d = 2 − 2ρ
   ρ = 1  → d = 0  (positive AC: reject H0 near 0)
   ρ = 0  → d = 2  (no AC: fail to reject H0 near 2)
   ρ = −1 → d = 4  (negative AC: reject H0 near 4)

   Save the Durbin-Watson statistic in SPSS and look up the critical value.
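A sketch of diagnostics 2 and 3 using numpy and statsmodels (durbin_watson is statsmodels' built-in version); the AR(1) errors are simulated:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(11)
    x = rng.normal(size=200)
    e = np.zeros(200)
    for t in range(1, 200):                 # build AR(1) errors with rho = 0.8
        e[t] = 0.8 * e[t - 1] + rng.normal()
    y = 1 + 0.5 * x + e

    resid = sm.OLS(y, sm.add_constant(x)).fit().resid
    print(durbin_watson(resid))             # well below 2 → positive autocorrelation

    # Step 2 from the notes: regress residuals on lagged residuals, look for a significant rho
    lag_fit = sm.OLS(resid[1:], sm.add_constant(resid[:-1])).fit()
    print(lag_fit.params[1], lag_fit.pvalues[1])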

Heteroskedasticity
The variance of the error term is not constant; a violation of the OLS assumption Var(εi) = constant.

When to suspect it:
Pooled data.
Learning curves with coding.
Fatigue effects.
Whenever you think some portion of the data will be better reported or predicted.
Whenever the predictive power of the model is not consistent across cases or variables.

Consequences:
Inflated standard errors (you conclude there is less significance): if the cases with small var(e) are located away from the mean of X, the standard error is too large and you are underconfident that b ≠ 0.
Deflated standard errors (you conclude there is more significance): if the cases with large var(e) are located away from the mean of X, the reported standard error will be too low.

How to diagnose it:

Scatter plot of the residuals.

Goldfeld-Quandt
Order the observations by the suspect x.
Throw out the middle observations.
Run 2 models using all the original x's.
F = Mean residual SS1 / Mean residual SS2
Look up the value on an F chart to see if it is greater than the critical value.
Limitation: it only diagnoses heteroskedasticity at the ends, not in the middle.

Glejser
Save the residuals.
Run a new regression with the absolute residual as the DV and the suspect x as the only IV.
If the new parameter estimate is significant, you have heteroskedasticity and can predict it with the residual.
Limitations: it can't check the middle, and it only works for linear forms.

White's
Save the residuals.
Regress the squared residuals on all the IVs, their squares (but not the dummies), and all potential interactions.
n × R2 → χ2, with df = the number of regressors.
Limitations: it could be overkill; it doesn't tell you where the problem is; and since there are so many variables, some could be significant by chance, so you could diagnose a problem when there really isn't one.
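A sketch of two of the diagnostics above via statsmodels' het_goldfeldquandt and het_white, with a simulated error variance that grows with x1:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_goldfeldquandt, het_white

    rng = np.random.default_rng(12)
    x1 = rng.uniform(0, 10, size=200)
    x2 = rng.normal(size=200)
    y = 1 + 0.5 * x1 + 0.3 * x2 + rng.normal(scale=0.2 + 0.3 * x1)

    X = sm.add_constant(np.column_stack([x1, x2]))
    fit = sm.OLS(y, X).fit()

    # Goldfeld-Quandt: sorts by the suspect column, drops the middle, compares the two fits
    f_stat, p_val, _ = het_goldfeldquandt(y, X, idx=1, drop=0.2)
    print(f_stat, p_val)

    # White's test: regresses squared residuals on the IVs, squares, and cross-products
    lm_stat, lm_p, _, _ = het_white(fit.resid, X)
    print(lm_stat, lm_p)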

WLS
Assumes that the best information in the data is in the observations with the least variance in their error terms.
Weighs some observations more than others: divide everything by the square root of the variable driving the heteroskedasticity.
Pure heteroskedasticity should not bias the parameter estimates, but it could be an indication of measurement or specification problems, and correcting for it in that case could bias the parameter estimates.

Logit
OLS is not appropriate for a dichotomous DV.
OLS produces predicted values > 1 and < 0, which are not options; the actual values are only 0 and 1.
It induces heteroskedasticity: the residuals are clustered in the middle, but there are no actual values there.
Choice functions tend to be S-shaped; you can't model a probability as a straight line, and if you do, you misspecify the model and bias the parameter estimates.
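A minimal sketch contrasting a logit fit with a linear probability model, assuming numpy and statsmodels and simulated 0/1 data:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(13)
    x = rng.normal(size=300)
    p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))      # S-shaped (logistic) choice probability
    y = rng.binomial(1, p)                       # dichotomous DV: only 0s and 1s

    X = sm.add_constant(x)
    logit_fit = sm.Logit(y, X).fit(disp=0)
    lpm_fit = sm.OLS(y, X).fit()                 # linear probability model, for contrast

    print(logit_fit.params)
    # The OLS line can produce predictions outside [0, 1]:
    print(lpm_fit.predict(X).min(), lpm_fit.predict(X).max())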
