
Simple Linear Regression (SLR): used with two quantitative variables to examine the relationship between them.

1) Establish the independent & dependent variables.

2) Sample question: The investigator wanted to know if the improvement in exercise duration is related to the patient's disease history.

3) If given a scatterplot of the 2 variables, the independent is on the horizontal axis & the dependent is on the vertical axis.

4) α & β would be the parameters if you had the entire population to study, but instead we just have estimates of them that we make from the sample.

5) Things to look for:
a. Correlation coefficient: denoted R or Beta. R relates how well the line fits through the data, and Beta gives you if it's a + or - correlation.
b. When asked to give the COEFFICIENT OF DETERMINATION, give R-square in decimals; when asked for the interpretation, explain it as a percentage (i.e. interpretation: 58.7% of the variability in the (dep.variable) is accounted for in this model).
c. Equation of the Least Squares Line: y = α + βX
   i. α is the y-intercept & β is the slope
   ii. Y is the dependent variable & X is the independent variable

d. Give a PREDICTION, ex. "We predict this one person's (dep.variable) to be ____ after (indep.variable)." Figure out the prediction value by plugging a value for X & the estimates (α & β) from the output into the Equation of the Least Squares Line.

6) Give an ESTIMATION: "For people who did/had whatever (indep.variable), we estimate their MEAN (dep.variable) to be __."
a. The estimation value is found the same way as the prediction value.

7) INTERPRET Y-INTERCEPT: The y-intercept is the value of Y when X (indep.variable) is ZERO! Explain if it is appropriate or not to interpret the y-intercept. It IS appropriate to interpret WHEN X CAN be ZERO! When X is zero, the predicted Y is the y-intercept, α. Examples of interpreting the y-intercept:
a. If a person did not have whatever (indep.variable = 0), we predict this person's (dep.variable) to be α.
b. For people who did not have whatever (indep.variable = 0), we estimate their MEAN (dep.variable) to be α.

8) INTERPRET SLOPE: use examples: "As the (indep.variable) increases by 1, the predicted (dep.variable) increases/decreases by β" AND "As the (indep.variable) increases by 1, the estimated average (dep.variable) increases/decreases by β."

9) RESIDUAL ANALYSIS: checking the assumptions needed to perform SLR inferences:
1. Random errors are independent.
2. Random errors are normally distributed (skew/kurt test).
3. Random errors have constant variance.
a. If close to 1 = normal population
b. Use for kurtosis & skewness testing
c. The average error = 0

10) QUESTION OF INTEREST will ask: is there a significant linear relationship between y and x?

11) STATE THE HYPOTHESES: Ho: β = 0 (model not useful), there is no linear relationship between y & x. Ha: β ≠ 0 (model IS useful), a relationship exists between y & x.

12) TEST STATISTIC: F* or t* (t* x t* = F*), plus P-value & Conclusion, same as always (sig).

13) CONFIDENCE INTERVAL FOR SLOPE: "We're 95% confident that as X increases by 1, Y increases/decreases by between __ and __" (the values are in the same output table where t* is found). WATCH out for NEGATIVES: if there are any, put the smaller value FIRST in the interval.
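The SLR quantities above can be reproduced by hand. The following is a minimal sketch in plain Python, assuming a small made-up data set (the x and y values are hypothetical, not from the notes) and a t critical value read from a t table. It computes the intercept and slope, R-square, t* and F* for the slope, a prediction, and the confidence interval for the slope.

```python
import math

def fit_slr(x, y):
    """Least-squares fit of y = alpha + beta * x, plus the usual summary numbers."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    syy = sum((yi - y_bar) ** 2 for yi in y)
    beta = sxy / sxx                      # slope
    alpha = y_bar - beta * x_bar          # y-intercept
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = syy - sse                       # variability explained by the line
    mse = sse / (n - 2)
    se_beta = math.sqrt(mse / sxx)        # standard error of the slope
    return {
        "alpha": alpha,
        "beta": beta,
        "r_square": ssr / syy,            # coefficient of determination
        "t_star": beta / se_beta,         # tests Ho: beta = 0
        "f_star": ssr / mse,              # equals t_star squared in SLR
        "se_beta": se_beta,
    }

def predict(fit, x0):
    """Plug x0 into the least squares equation."""
    return fit["alpha"] + fit["beta"] * x0

def slope_interval(fit, t_crit):
    """Confidence interval for the slope; smaller value first."""
    margin = t_crit * fit["se_beta"]
    return (fit["beta"] - margin, fit["beta"] + margin)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = fit_slr(x, y)

T_CRIT_DF3 = 3.182  # two-sided 95% t critical value for n - 2 = 3 df (t table)
lower, upper = slope_interval(fit, T_CRIT_DF3)

print(f"y = {fit['alpha']:.2f} + {fit['beta']:.2f} X")
print(f"R-square = {fit['r_square']:.3f} ({fit['r_square']:.1%} of variability)")
print(f"t* = {fit['t_star']:.3f}, F* = {fit['f_star']:.3f}")
print(f"prediction at X = 3: {predict(fit, 3):.2f}")
print(f"95% CI for slope: {lower:.3f} to {upper:.3f}")
```

In this made-up example the interval for the slope contains 0, which agrees with t* being smaller than the critical value: the model would not be judged useful.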

CONFIDENCE INTERVALS AND PREDICTION INTERVALS FOR THE OUTCOME VARIABLE (FOR A SPECIFIC VALUE OF THE INDEPENDENT VARIABLE):

VALUE X   VALUE Y   pre_1    lmci_1   umci_1   lici_1    uici_1
1         40        59.130   45.491   72.770    15.569   102.692
1         90        59.130   45.491   72.770    15.569   102.692
3         30        32.319   23.001   41.636   -10.088    74.726
...       ...       ...      ...      ...      ...       ...

What the headings mean:

pre_1 = prediction (for the value of Y)
lmci_1 & umci_1 = lower & upper endpoints of the CI for the mean
lici_1 & uici_1 = lower & upper endpoints of the interval (PI) for an individual

**Suppose the researchers would like to predict the Y for an individual who has had ? for X years.
Single best prediction: __
Prediction interval: __ to __
Interpretation: For an individual who has had this ? for X years, we predict with 95% confidence that this person's Y will be between _ and _.

**Suppose the researchers would like to estimate what the average Y (here, percent improvement) is for all people having ? for X years.
Single best (mean) estimate: __
Confidence interval: __ to __
Interpretation: We are 95% confident that the mean percent improvement for all people having this disease for X years is between _ and _.
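To see why the interval for an individual is always wider than the interval for the mean, here is a minimal sketch in plain Python using the standard SLR interval formulas. The data and the t critical value are assumptions made for illustration (they are not the data behind the table above). The returned tuples play the roles of pre_1, (lmci_1, umci_1) and (lici_1, uici_1).

```python
import math

def intervals_at(x, y, x0, t_crit):
    """Prediction at x0 with the CI for the mean and the PI for an individual."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    sse = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))                   # residual standard error
    pre = alpha + beta * x0                        # role of pre_1
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(leverage)              # for the MEAN outcome
    se_indiv = s * math.sqrt(1 + leverage)         # for ONE individual
    mean_ci = (pre - t_crit * se_mean, pre + t_crit * se_mean)      # lmci, umci
    indiv_pi = (pre - t_crit * se_indiv, pre + t_crit * se_indiv)   # lici, uici
    return pre, mean_ci, indiv_pi

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
T_CRIT_DF3 = 3.182  # two-sided 95% t critical value for 3 df (t table)

pre, mean_ci, indiv_pi = intervals_at(x, y, x0=3, t_crit=T_CRIT_DF3)
print(f"single best prediction / estimate: {pre:.3f}")
print(f"95% CI for the mean:       {mean_ci[0]:.3f} to {mean_ci[1]:.3f}")
print(f"95% PI for an individual:  {indiv_pi[0]:.3f} to {indiv_pi[1]:.3f}")
```

The single best prediction and the single best mean estimate are the same number; only the width of the interval around it changes.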

MULTIPLE LINEAR REGRESSION (MLR): used to look for LINEAR relationships between 1 dependent variable (outcome) and MULTIPLE independent variables (explanatory).

1) CORRELATION COEFFICIENTS: To determine the correlation coefficients between the OUTCOME variable and EACH INDEPENDENT variable, find the OUTCOME variable in the output and look at all its Pearson Correlation values (denoted r). The r farthest from 0 is the strongest correlation and the r closest to 0 is the weakest. Make a statement about them...

2) MULTIPLE REGRESSION ANALYSIS: Will present:
- MODEL SUMMARY (only needed for the coefficient of determination, step 8)
- ANOVA output, from which we get the TEST STATISTIC F* and the P-value to use for the TEST OF SIGNIFICANCE OF THE MODEL.
- COEFFICIENTS table: the most significant explanatory variable has the LOWEST p-value, the least significant has the highest. Explanatory variables that need to be thrown out of the model will have p-values higher than 0.05.

3) TEST FOR SIGNIFICANCE OF THE MODEL: Ho: none of the independent variables is a significant predictor of Y. Ha: at least 1 of the independent variables is a significant predictor of Y. Use F* and sig (p-value) from the ANOVA output.

4) EQUATION OF THE MODEL WITH ALL THE VARIABLES: Y = α + β1X1 + β2X2 + ... + β5X5 (assign one X to each explanatory variable).

5) IF THE Ha IS TRUE & THERE'S AT LEAST 1 SIGNIFICANT INDEPENDENT VARIABLE, DO FURTHER ANALYSIS!!!

6) TEST FOR INDIVIDUAL PREDICTOR: if you want to test the predictors one by one, you can use the coefficients table included in the MULTIPLE REGRESSION ANALYSIS and pick out whichever variable is least significant (has the largest p-value). This takes long; the backward selection procedure is faster.

7) BACKWARD SELECTION PROCEDURE: will present 3 outputs (Model Summary, ANOVA, and Coefficients); each output will have at least 2 models.
a. Choose which model is best: the model that throws out the least significant independent variable is the best (from previous steps you know which one this is).
b. Whichever model you choose, BE CONSISTENT with where you get the values from.
c. Use a significance level of 0.05 to decide which variables to throw out (unless asked differently).
In the marked-up output, the red circle is α and the green circles are the β1, β2, β3, etc. for each variable (FOR THE EQUATION OF THE MODEL); the teal circle denotes that we chose that model because it had the most significant variables (see the x's and check marks).

8) COEFFICIENT OF DETERMINATION (assessing the fit of the model): "_ % of the variability in the Y or dependent variable is accounted for in this multiple linear regression model." (The % can be found in the model summary (red circle): .467 is 46.7%.)

9) You will be presented with a data set that looks kind of like:

math96   read96   scienc96   pre_1      lmci_1     umci_2     lici_1     uici_1
65.00    27.00    34.00      39.51267   25.18071   53.84464   11.90267   67.12268
70.00    49.00    45.00      48.72511   41.53448   55.91574   24.05504   73.39518
48.00    50.00    40.00      33.19912   24.74859   41.64965    8.13283   58.26541

10) PREDICTIONS: EXAMPLE: For a school that has 70% of its students scoring proficient in math in 1996, 45% of its students scoring proficient in science in 1996, and 49% of its students scoring proficient in reading in 1996, we predict it will have ____% of its students scoring proficient in writing in 1997.
** You can also plug the %s in this sentence into the EQUATION of the model in place of the Xs (watch that each one is paired with the right variable).

11) ESTIMATES: EXAMPLE: For schools that have 70% of their students scoring proficient in math in 1996, 45% of their students scoring proficient in science in 1996, and 49% of their students scoring proficient in reading in 1996, we estimate that these schools will have an average of ___% of their students scoring proficient in writing in 1997.

12) INTERPRETATION OF PARTIAL SLOPES: SAME AS IN SIMPLE LINEAR REGRESSION, JUST MAKE SURE TO USE THE RIGHT SLOPE VALUE WITH THE CORRECT VARIABLE AND WATCH FOR SIGNS!!!!

13) RESIDUAL ANALYSIS: can be used to check that the following assumptions are met:
1. Errors are independent.
2. Errors are normally distributed.
3. Errors have constant variance.

14) Prediction intervals for ONE individual (a single school): use the lici_1 & uici_1 columns of the data set above (highlighted in PURPLE).

15) Confidence intervals estimating the mean outcome for MANY (all schools with those values): use the lmci_1 & umci_2 columns of the data set above (highlighted in BLUE).
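Reading the prediction and both intervals off the saved data set comes down to finding the row whose explanatory values match the question. Here is a minimal sketch in plain Python, using only the three rows shown in the data set above. The helper name `find_row` is made up for this illustration, and the mapping of columns to interval types follows the heading definitions given in the SLR section.

```python
COLUMNS = ("math96", "read96", "scienc96",
           "pre_1", "lmci_1", "umci_2", "lici_1", "uici_1")

# The three rows shown in the notes, in the column order above.
ROWS = [
    (65.00, 27.00, 34.00, 39.51267, 25.18071, 53.84464, 11.90267, 67.12268),
    (70.00, 49.00, 45.00, 48.72511, 41.53448, 55.91574, 24.05504, 73.39518),
    (48.00, 50.00, 40.00, 33.19912, 24.74859, 41.64965, 8.13283, 58.26541),
]

def find_row(rows, math96, read96, scienc96):
    """Return the row (as a dict) whose explanatory values match the question."""
    for values in rows:
        row = dict(zip(COLUMNS, values))
        if (row["math96"], row["read96"], row["scienc96"]) == (math96, read96, scienc96):
            return row
    raise LookupError("no row with those explanatory values")

# Watch the pairing: the question lists math, science, reading,
# but the columns are ordered math, reading, science.
row = find_row(ROWS, math96=70, read96=49, scienc96=45)

prediction = row["pre_1"]                            # also the mean estimate
prediction_interval = (row["lici_1"], row["uici_1"])  # ONE school
mean_interval = (row["lmci_1"], row["umci_2"])        # mean of MANY schools

print(f"predicted % proficient in writing: {prediction}")
print(f"one school: between {prediction_interval[0]} and {prediction_interval[1]}")
print(f"mean of all such schools: between {mean_interval[0]} and {mean_interval[1]}")
```

Passing the values by name is a deliberate choice here: it guards against pairing a percentage with the wrong variable.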

LOGISTIC REGRESSION: used when the outcome (dependent variable) has a dichotomous answer (live or die, pass or fail, etc.)

Age Group   n    CHD Absent   CHD Present   Proportion of 1s   Odds
50-54       8    3            5             0.63               1.703
55-59       17   4            13            0.76               3.304
60-69       10   2            8             0.80               4.000

How to interpret this? See the ODDS column! For someone who is 60-69 years old, having CHD is 4 times as likely as not having it.
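The proportion and odds columns can be recomputed from the counts. Here is a minimal sketch in plain Python, using the CHD absent/present counts from the table above. Note that the exact ratios from the counts for the first two age groups (5/3 ≈ 1.667 and 13/4 = 3.25) differ slightly from the printed odds of 1.703 and 3.304; only the 60-69 row (8/2 = 4.000) matches exactly.

```python
# (CHD absent, CHD present) counts per age group, from the table.
age_groups = {
    "50-54": (3, 5),
    "55-59": (4, 13),
    "60-69": (2, 8),
}

def proportion_and_odds(absent, present):
    """Proportion of 1s and odds of CHD for one group."""
    n = absent + present
    proportion = present / n      # P(CHD = 1)
    odds = present / absent       # same as proportion / (1 - proportion)
    return proportion, odds

for group, (absent, present) in age_groups.items():
    proportion, odds = proportion_and_odds(absent, present)
    print(f"{group}: proportion = {proportion:.2f}, odds = {odds:.3f}")
```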

1) HYPOTHESIS TESTING: Ho: the independent variable is NOT a predictor of the dependent variable in this model. Ha: the independent variable IS a predictor of the dependent variable in this model.

2) TEST STATISTIC: Wald for the variable; P-VALUE same as always; CONCLUSION same as always.

3) MODEL FOR LOGIT: Z (predicted natural log of the odds) = β0 + β1X

4) β0 = B of the Constant, β1 = B of the variable, & Exp(B) is the ODDS RATIO.

5) MODEL FOR PROBABILITY OF DEP.VARIABLE: P = e^Z / (1 + e^Z) (e is the base of the natural log; use the e^x key on the calculator)

6) PREDICTION: FIRST, using the MODEL FOR LOGIT equation, plug in the value of the independent variable for X and solve for Z. SECOND, using the MODEL FOR PROBABILITY equation, plug in the value of Z you just figured out. Use this PREDICTION with your CUTOFF to classify your individuals into categories.

7) CLASSIFICATION: According to this model, if a probability of .50 or lower classifies someone as CHD=0, otherwise as CHD=1, how would a person 48 years old be classified? CHD = 1
(PARTIAL) SPSS SPREADSHEET WITH PREDICTED VALUES:
pre_1 = the predicted probability that the subject will have CHD
pgr_1 = the predicted group for the subject (1=CHD, 0=no CHD)

8) ODDS RATIO (see the green highlight in the SPSS output): you can say: "A 1-year increase in age multiplies the odds of CHD by 1.118."

9) FOR ALL MODELS! The 95% confidence interval for the odds ratio is found in the LOWER & UPPER columns at the right of the SPSS output. This means that we are 95% confident that the odds ratio is between LOWER and UPPER. NOTE: If a 95% confidence interval for the odds ratio contains 1, this means that there is not a statistically significant difference in the odds for the two groups, at the .05 level. WATCH THE INTERVAL: IF 1.00 IS INSIDE IT, THEN IT IS NOT STATISTICALLY SIGNIFICANT!
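The logit, probability, classification and odds-ratio steps chain together as in the minimal sketch below (plain Python). The slope is taken as ln(1.118) so that Exp(B) matches the odds ratio quoted in the notes. The constant of -5.3 and the interval bounds passed to the last function are hypothetical values chosen for illustration, since the SPSS output itself is not reproduced here.

```python
import math

B_AGE = math.log(1.118)   # B of the variable, so that Exp(B) = 1.118
B_CONSTANT = -5.3         # hypothetical B of the Constant (assumption)

def logit(age, b0=B_CONSTANT, b1=B_AGE):
    """MODEL FOR LOGIT: Z = b0 + b1 * X (predicted natural log of the odds)."""
    return b0 + b1 * age

def probability(z):
    """MODEL FOR PROBABILITY: e^Z / (1 + e^Z)."""
    return math.exp(z) / (1 + math.exp(z))

def classify(p, cutoff=0.50):
    """A probability of the cutoff or lower is group 0, otherwise group 1."""
    return 0 if p <= cutoff else 1

def odds_ratio_is_significant(lower, upper):
    """Not significant at the .05 level when the 95% CI contains 1."""
    return not (lower <= 1.0 <= upper)

z = logit(48)
p = probability(z)
print(f"Z = {z:.3f}, P(CHD) = {p:.3f}, predicted group = {classify(p)}")
print(f"odds ratio per year of age = {math.exp(B_AGE):.3f}")
print(odds_ratio_is_significant(1.05, 1.19))   # hypothetical bounds
print(odds_ratio_is_significant(0.90, 1.30))   # hypothetical bounds
```

With these assumed coefficients a 48-year-old lands just above the .50 cutoff and is classified as CHD = 1; a different constant would move that boundary.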

10) SPECIFICITY (top) (probability of a negative when the condition is actually absent) & SENSITIVITY (bottom) (probability of a positive when the condition is actually present) are found in the CLASSIFICATION Table under PERCENT CORRECT.
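Both percentages come straight from the four counts in a classification table. Here is a minimal sketch in plain Python; the counts are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def classification_summary(true_neg, false_pos, false_neg, true_pos):
    """Percent-correct figures from a 2x2 classification table.

    Top row (observed 0):    true_neg, false_pos
    Bottom row (observed 1): false_neg, true_pos
    """
    specificity = true_neg / (true_neg + false_pos)   # top row percent correct
    sensitivity = true_pos / (true_pos + false_neg)   # bottom row percent correct
    total = true_neg + false_pos + false_neg + true_pos
    overall = (true_neg + true_pos) / total
    return specificity, sensitivity, overall

# Hypothetical counts, for illustration only.
specificity, sensitivity, overall = classification_summary(
    true_neg=40, false_pos=10, false_neg=15, true_pos=35
)
print(f"specificity = {specificity:.1%}")
print(f"sensitivity = {sensitivity:.1%}")
print(f"overall percent correct = {overall:.1%}")
```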

TWO-WAY ANOVA (NO SIGNIFICANT INTERACTION) (used when evaluating 2 factors affecting 1 thing: 2 independent variables, 1 dependent variable)

**If the lines of the graphs titled EST. MARGINAL MEANS OF CHANGE are PARALLEL to each other, it means NO INTERACTION.

1) CHECK ASSUMPTIONS FOR 2-way ANOVA: 1. All groups are from normal populations. 2. Groups are independent. 3. No extreme outliers. 4. Homogeneity of variances (Levene's test p-value MORE than 0.05).

2) FIRST, TEST THE OVERALL MODEL: Ho: there is no main effect due to factor 1, there is no main effect due to factor 2, and there is no significant interaction between factor 1 and factor 2 on the mean Dependent Variable. Ha: at least one of these 3 effects is present on the mean Dependent Variable.

3) TEST STATISTIC: F* of the CORRECTED MODEL line, P-VALUE and CONCLUSION.

4) SECOND, IF Ha is TRUE, TEST THE INTERACTION TERM: Ho: Factor 1 & Factor 2 do not interact to affect the mean Dep.Variable. Ha: Factor 1 & Factor 2 do interact to affect the mean Dep.Variable. TEST STATISTIC is F* (of F1*F2), P-value & conclusion. (If the lines in the graphs were parallel, you could have anticipated that the INTERACTION was not going to be significant.)

5) THIRD, since the interaction term is NOT significant, TEST FOR MAIN EFFECTS:
a. 1st Main Effect (drug): Ho: there is NO main effect due to Factor 1 on the mean Dep.Var. Ha: there is a main effect due to Factor 1 on the mean Dep.Var. See TEST STATISTIC (F*), P-VALUE, & CONCLUSION.
b. IF the 1st Main Effect is significant (Ha is true), then How Much Difference? We are 95% confident that the mean change in hemoglobin levels for active agent drug (mentioned FIRST) users is somewhere between Lower Bound and Upper Bound more/less (POS OR NEG?) than the mean change in hemoglobin levels for non-active agent drug (mentioned SECOND) users.
c. 2nd Main Effect (Type of CA): establish Ho and Ha just like you did for the 1st main effect, but with type of CA instead of drug. TEST STAT is F*, P-VALUE & CONCLUSION.
d. IF the 2nd Main Effect is significant (Ha true), then How Much Difference? Answer each question Yes or No; give the p-value used to make the decision. If the means are different, state which group has the higher mean.
Is the mean change in hemoglobin levels significantly different for (WATCH WHICH IS BEING MENTIONED FIRST AND THEN LOOK IN THE CHART IN THAT ORDER!)
A. the cervical cancer and the prostate cancer patients?
B. the cervical cancer and the colorectal cancer patients?
C. the prostate cancer and the colorectal cancer patients?
We are 95% confident that (fill in the blanks with positive numbers; watch to put the SMALLEST one FIRST when using negatives) the mean change in hemoglobin levels for cervical cancer patients (1ST) is somewhere between LOWER BOUND and UPPER BOUND more/less (+ or -) than the mean change in hemoglobin levels for prostate cancer (2nd) patients (repeat with all other comparisons).
WATCH OUT FOR WHICH FACTOR IS BEING SAID FIRST TO THEN CHOOSE FROM THE OUTPUT CORRECTLY!!
SO in this case: CHANGE = DRUG + TYPE + DRUG*TYPE
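The F* values on the Corrected Model, main-effect and interaction lines can be illustrated with a hand computation. This is a minimal sketch in plain Python, assuming a balanced design (the same number of observations in every cell) and made-up hemoglobin-change values whose cell means are perfectly additive. With additive cell means the plotted lines are exactly parallel and the interaction F* is 0; real data would only be approximately parallel. P-values are not computed, since they need an F table or software.

```python
def mean(values):
    return sum(values) / len(values)

def two_way_anova(cells):
    """F* values for a balanced two-way design; cells[(f1, f2)] is a list of observations."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))            # observations per cell
    everything = [v for obs in cells.values() for v in obs]
    grand = mean(everything)
    cell_mean = {key: mean(obs) for key, obs in cells.items()}
    # In a balanced design a marginal mean is the mean of its cell means.
    a_mean = {a: mean([cell_mean[(a, b)] for b in b_levels]) for a in a_levels}
    b_mean = {b: mean([cell_mean[(a, b)] for a in a_levels]) for b in b_levels}

    ss_a = n * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ss_b = n * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ss_ab = n * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                    for a in a_levels for b in b_levels)
    ss_error = sum((v - cell_mean[key]) ** 2
                   for key, obs in cells.items() for v in obs)

    df_a, df_b = len(a_levels) - 1, len(b_levels) - 1
    df_ab = df_a * df_b
    df_error = len(everything) - len(a_levels) * len(b_levels)
    ms_error = ss_error / df_error
    return {
        "corrected model": ((ss_a + ss_b + ss_ab) / (df_a + df_b + df_ab)) / ms_error,
        "factor 1": (ss_a / df_a) / ms_error,
        "factor 2": (ss_b / df_b) / ms_error,
        "interaction": (ss_ab / df_ab) / ms_error,
    }

# Hypothetical data: (drug, type of cancer) -> changes in hemoglobin.
cells = {
    ("active", "cervical"): [6, 8],
    ("active", "prostate"): [7, 9],
    ("active", "colorectal"): [9, 11],
    ("placebo", "cervical"): [4, 6],
    ("placebo", "prostate"): [5, 7],
    ("placebo", "colorectal"): [7, 9],
}
f_stars = two_way_anova(cells)
for line, f_star in f_stars.items():
    print(f"F* ({line}) = {f_star:.3f}")
```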

TWO-WAY ANOVA WITH SIGNIFICANT INTERACTION


**If the lines of the graphs titled EST. MARGINAL MEANS OF CHANGE are NOT PARALLEL or CROSS OVER each other, it means THERE IS AN INTERACTION.

1) SET EVERYTHING UP JUST LIKE ABOVE, STEPS 1-4.

2) IF the interaction term is significant, you keep all the factors that make up the interaction: TIME = DRUG + CENTER + DRUG*CENTER

3) Since the interaction term is significant, we do not test for main effects (hierarchical principle). We discuss differences within the different groups. The significant interaction tells us that the difference in the mean times for the drug and placebo is different for the different centers. We can look at these differences within each center (fill in the blanks with positive numbers). EXAMPLE: For center 1, we are 95% confident that the true mean time until first bowel movement after colorectal surgery for active agent users is between ________ and _______ hours more/less than the true mean time until first bowel movement after colorectal surgery for placebo or non-active agent users. (REPEAT FOR EVERY OTHER CENTER)
**You will have SEVERAL of the PAIRWISE COMPARISON outputs, one for each level of each variable.
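Looking at the drug difference within each center, and turning a negative interval into a sentence with positive numbers, can be sketched as follows (plain Python). The times and the interval bounds are hypothetical; in practice the bounds come from the pairwise comparison output for that center.

```python
def mean(values):
    return sum(values) / len(values)

def drug_minus_placebo(times, center):
    """Difference in mean time (active agent first, placebo second) within one center."""
    return mean(times[(center, "drug")]) - mean(times[(center, "placebo")])

def describe_interval(lower, upper, units="hours"):
    """Write a 95% interval for (first - second) with positive numbers, smallest first."""
    if lower <= 0 <= upper:
        return "no significant difference (the interval contains 0)"
    direction = "more" if lower > 0 else "less"
    small, large = sorted((abs(lower), abs(upper)))
    return f"between {small} and {large} {units} {direction}"

# Hypothetical times until first bowel movement, in hours.
times = {
    ("center 1", "drug"): [60, 64],
    ("center 1", "placebo"): [80, 84],
    ("center 2", "drug"): [70, 74],
    ("center 2", "placebo"): [72, 76],
}

for center in ("center 1", "center 2"):
    print(center, "drug - placebo =", drug_minus_placebo(times, center))

# Hypothetical bounds for center 1: both negative, so the drug group is LESS.
print(describe_interval(-25.0, -15.0))
```

The drug difference is much larger in center 1 than in center 2 in this made-up data, which is what a significant interaction looks like: the effect of one factor depends on the level of the other.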

Significant difference? _____ If yes, which has the higher mean? __________

**WATCH OUT FOR WHICH FACTOR IS BEING ASKED FIRST TO SELECT PROPERLY FROM THE SPSS OUTPUT, and watch how you write the negative numbers!!!!