
OLS assumptions: 1) linearity (correct specification). 2) e_i ~ N(0, σ²). 3) Xs are not random (fixed, i.e. not estimated like unemployment) and there is no measurement error. 4) Xs are linearly independent (correlation between Xs is not 1 or -1). Precision of the regression increases with less covariance between regressors (if you regress one x on the other xs, R² is low). OLS: OLS gives the equation of the line that minimizes Σ(y_i − ŷ_i)². Result: β̂₁ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² = cov(x, y)/var(x) = r·(sd_y/sd_x), where r is the correlation between x and y, and β̂₀ = ȳ − β̂₁·x̄.
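A minimal numpy sketch of the closed-form estimates above, on made-up data, checking that cov(x, y)/var(x) matches r·(sd_y/sd_x):

```python
import numpy as np

# toy data (hypothetical): y roughly linear in x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=100)

# slope and intercept from the closed-form OLS solution
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# equivalent form: correlation times the ratio of standard deviations
r = np.corrcoef(x, y)[0, 1]
b1_alt = r * y.std(ddof=1) / x.std(ddof=1)

print(b0, b1, b1_alt)  # b1 and b1_alt agree
```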

Correlation between x and y is r = cov(x, y)/(sd_x · sd_y). MLE: assume y_i | x_i ~ N(β₀ + β₁x_i, σ²). Then the probability of seeing the observed data is the product of the probabilities for each point. You then find the parameter values that maximize this function (calculus). The two methods (OLS and MLE) give the same results if the errors in the regression are actually normally distributed. Categorical predictors: example: gender (1 if male, 0 if female). Model: IQ = β₀ + β₁·gender + β₂·SAT. If male, then IQ = (β₀ + β₁) + β₂·SAT; if female, IQ = β₀ + β₂·SAT. Interaction: add a gender·SAT term to ask whether SAT scores are a better predictor of IQ for males than for females, as in the sketch below.
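A small sketch of the dummy-variable and interaction models just described, on simulated data with statsmodels (all variable names and coefficient values are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
gender = rng.integers(0, 2, size=n)           # dummy: 1 = male, 0 = female
sat = rng.normal(1000, 150, size=n)
iq = 60 + 5 * gender + 0.04 * sat + rng.normal(scale=5, size=n)

# main-effects model: IQ = b0 + b1*gender + b2*SAT
X = sm.add_constant(np.column_stack([gender, sat]))
print(sm.OLS(iq, X).fit().params)             # [b0, b1, b2]

# interaction model: does the SAT slope differ by gender?
X_int = sm.add_constant(np.column_stack([gender, sat, gender * sat]))
fit = sm.OLS(iq, X_int).fit()
print(fit.params, fit.pvalues[-1])            # t-test on the gender*SAT coefficient
```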
Sampling distribution of everything OLS: e_i ~ N(0, σ²). Confidence intervals for coefficients: β̂₀ ± t_{d.f., α/2}·se(β̂₀) and β̂₁ ± t_{d.f., α/2}·se(β̂₁), where the se's are reported in regression output as standard errors. d.f., or degrees of freedom, is the number of observations (n) minus the number of estimated parameters. t-test for a mean: t = (x̄ − μ₀)/(s/√n). t-test for 2 means: t = (x̄₁ − x̄₂)/s_{x̄₁−x̄₂}, where s_{x̄₁−x̄₂} = √(s₁²/n₁ + s₂²/n₂).
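A quick sketch of the two-sample t statistic above (unequal-variance form), cross-checked against scipy on made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x1 = rng.normal(10.0, 2.0, size=40)
x2 = rng.normal(11.0, 3.0, size=50)

# standard error of the difference in means: sqrt(s1^2/n1 + s2^2/n2)
se = np.sqrt(x1.var(ddof=1) / len(x1) + x2.var(ddof=1) / len(x2))
t = (x1.mean() - x2.mean()) / se

t_scipy, p = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's t-test
print(t, t_scipy, p)  # the two t values agree
```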

ANOVA and sum of squares: SS_total = Σ(y_i − ȳ)². Think of it as the total dispersion or spread of y about its mean. SS_model = SS_regression = SS_explained = Σ(ŷ_i − ȳ)². Think of it as the total variation in y that is mopped up by the model; if it is large, the errors will tend to be small. SS_residuals = SS_errors = Σ(y_i − ŷ_i)². Think of it as the magnitude of the errors, or the discrepancy between the data and what the model predicts. SS_total = SS_err + SS_reg and d.f._total = d.f._err + d.f._reg. Total degrees of freedom is always n − 1; d.f._model is the number of estimated parameters (not including the intercept). The mean squared error, MSE = SS_error/d.f._error, estimates the error variance; its square root (aka σ̂) is the conditional sd of y. Marginal sd of y is the variation of all y values, whereas the conditional sd of y is the variation of y about the regression line at a given point. R² = SS_reg/SS_tot = 1 − SS_err/SS_tot. Think of R² as the proportion of total variation explained by the model; it is between 0 and 1. Note that R² always goes up when you add new variables; adjusted R² accounts for the number of variables you include. F and t tests: t-tests test whether a sample mean or coefficient is different from some value, or whether two sample means are different. F tests test whether a set of coefficients or means are all equal to 0; if any of the coefficients aren't 0, the test rejects the null. One-way F test: are coefficients within a single sample equal to 0 (e.g. treatments for cancer, testing whether at least one treatment is superior or inferior)? In this case, F(d.f._model, d.f._residual) = MS_model/MS_residual. If F is large, then we reject the null that all coefficients are equal to 0. Note that when there is only one parameter, F = t². Or, if we want to test whether a subset of regressors is equal to 0, F = [(RSS_residuals − USS_residuals)/q] / [USS_residuals/(n − k)], where q is the number of variables in the subset, USS is the residual sum of squares for the (unrestricted) model including the subset, RSS is the residual sum of squares for the (restricted) model not including the subset, and k is the number of variables in the unrestricted model.
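A numpy sketch (simulated data) of the sum-of-squares decomposition, R², and the subset F test described above; the regressor names and coefficients are invented for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)    # x3 is irrelevant

def fit(y, *xs):
    """Fitted values, residual SS, and parameter count for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta
    return yhat, np.sum((y - yhat) ** 2), X.shape[1]

yhat, ss_err, k_full = fit(y, x1, x2, x3)              # unrestricted model
ss_tot = np.sum((y - y.mean()) ** 2)
ss_reg = np.sum((yhat - y.mean()) ** 2)
r2 = ss_reg / ss_tot                                   # = 1 - ss_err / ss_tot

# subset (nested) F test: do x2 and x3 add anything beyond x1?  q = 2 restrictions
_, rss_restricted, _ = fit(y, x1)
q = 2
# denominator df = n minus number of parameters in the full model (incl. intercept)
F = ((rss_restricted - ss_err) / q) / (ss_err / (n - k_full))
p = stats.f.sf(F, q, n - k_full)
print(r2, F, p)
```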

Regression Problems and Diagnostics: Collinearity: the different variables in a regression are perfectly correlated. This violates our assumption that the variables are linearly independent of each other. Consequence: no unique regression solution. Similarly, multicollinearity is strong (but not perfect) correlation among variables. It reduces the precision of estimation (large standard errors, small t values, large confidence intervals). Test: regress one x variable on the other x variables; if R² is high, there is collinearity. Added-variable plot for x₁: y axis: residuals of y on x₂, x₃, .... x axis: residuals of x₁ on x₂, x₃, .... The slope of this plot should equal the coefficient on x₁ in the original model. Use these plots to find outliers or high-leverage points, and to determine the correct functional form for a particular variable (e.g. include a quadratic term). Individual points: Outlier: far from the data, usually a large residual. Leverage measures the distance of the explanatory variable x from its average. Leverage: h_i = 1/n + (x_i − x̄)²/Σ_j(x_j − x̄)². Large if greater than .5. High-leverage points usually have small residuals and are far from the average x value. Influential points are points that affect the regression line a lot. Measure: Cook's distance: D_i = Σ_j(ŷ_j − ŷ_j(i))²/(p·MSE), where p is the number of predictors and ŷ_j(i) is the prediction when that observation is removed. Big if greater than 1. Equivalently, D_i = [e_i²/(p·MSE)] · [h_i/(1 − h_i)²]: the first part is the residual, the second part is leverage.
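A numpy sketch (toy data, with one planted extreme x value) computing leverages from the hat matrix and Cook's distances from the residual/leverage form above:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x = rng.normal(size=n)
x[0] = 6.0                                    # plant a high-leverage point
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
p = X.shape[1]                                # parameters: intercept + slope
mse = np.sum(resid ** 2) / (n - p)

# leverage: diagonal of the hat matrix H = X (X'X)^-1 X'
# (for simple regression this equals 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2)
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Cook's distance via the residual * leverage decomposition
D = (resid ** 2 / (p * mse)) * (h / (1 - h) ** 2)
print(h[0], D[0])                             # the planted point has large leverage
```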

Q-Q or q-normal plot, pattern / diagnosis:
all but a few points fall on a line / outliers in the data;
left end of pattern is below the line, right end of pattern is above the line / long tails at both ends of the data distribution;
left end of pattern is above the line, right end of pattern is below the line / short tails at both ends of the data distribution;
curved pattern with slope increasing from left to right / data distribution is skewed to the right;
curved pattern with slope decreasing from left to right / data distribution is skewed to the left;
staircase pattern (plateaus and gaps) / data have been rounded or are discrete.
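A short sketch of drawing such a normal Q-Q plot with scipy and matplotlib, using simulated right-skewed data (which should show the increasing-slope curve listed above):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
data = rng.exponential(size=200)        # right-skewed sample

# normal Q-Q plot: sample quantiles against theoretical normal quantiles
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```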
