Running head: MULTIPLE REGRESSION ASSUMPTIONS

The Importance of Assumptions in Multiple Regression and How to Test Them
Meagan Keashly
University of Calgary

The Importance of Assumptions in Multiple Regression and How to Test Them
Multiple regression analyses can be useful in exploring the relationship between one dependent variable and several independent or predictor variables (Tabachnick & Fidell, 2007). It is a valuable tool for addressing a range of research questions as it assesses how well a set of variables can predict a certain outcome (Pallant, 2005).

However, in order for the tool to perform as intended, a number of assumptions should be met. This paper begins with a brief overview of multiple regression. Next, the importance of assumptions will be discussed. The remainder of the paper focuses on how to test the assumptions of multiple regression using SPSS, including testing for outliers, multicollinearity, independence of errors, normality, linearity and homoscedasticity.
Overview
There are three types of multiple regression: standard (i.e., all independent variables are entered into the regression equation at one time and each one is evaluated as if it had been the last independent variable entered), sequential (i.e., the independent variables are entered into the equation in the order that the researcher chooses) and statistical (i.e., the order of variable entry is based on statistical criteria) (Tabachnick & Fidell, 2007). Multiple regression is a useful tool for providing information about the model as a whole, the contribution of each of the independent variables that comprise the model, and/or whether adding another independent variable adds to the predictive power of the model above the variables already included (Pallant, 2005). It should be noted that while multiple regression analyses are useful for finding relationships among variables, they do not imply causation (Tabachnick & Fidell, 2007).

Multiple regression requires one continuous dependent variable and multiple independent or predictor variables, which are also usually continuous (Pallant, 2005). Selection of independent variables is important. According to Tabachnick and Fidell (2007), an optimal set of IVs is "the smallest, reliable, uncorrelated set that covers the waterfront with respect to the DV" (p. 122). It is vital to choose the most reliable independent variables possible and to try to ensure that unmeasured independent variables are not correlated with any of the independent variables being measured (Tabachnick & Fidell, 2007). Further, Pallant (2005) cautions researchers not to "just throw variables into a multiple regression and hope that, magically, answers will appear. You should have a sound theoretical or conceptual reason for the analysis and, in particular, the order of variables entering the equation. Don't use multiple regression as a fishing expedition" (p. 140).
Importance of Assumptions
A statistical assumption is "a prerequisite condition that must exist for a statistical tool to do its job" (Yang & Huck, 2010, p. 45). Assumptions are important because if they are violated, the tool has less power for the data set and may not work as intended (Yang & Huck, 2010). That is, if assumptions are not met, we may not be able to trust the results of the analyses (Osborne & Waters, 2002). Pallant (2005) notes that multiple regression can be a fussy statistical technique because it requires a number of data assumptions and is not very forgiving if they are violated. Violation of assumptions can produce misleading results, including increased Type I or Type II error, or under- or overestimates of effect size or significance (Chen, Ender, Mitchell, & Wells, 2003; Osborne & Waters, 2002). It is important that researchers test their data set for assumptions so that the data can be transformed if appropriate, or so that other, more appropriate techniques for data analysis can be considered (e.g., non-parametric techniques) (Osborne & Waters, 2002). "Knowledge and understanding of the situations when violations of assumptions lead to serious biases, and when they are of little consequence, are essential to meaningful data analysis" (Pedhazur, 1997, p. 33).

This paper includes two appendices that can be followed to assist with testing the assumptions of multiple regression using SPSS. Appendix A provides step-by-step instructions on how to run the Explore function and Appendix B provides step-by-step instructions on how to run a multiple regression. The following sections explain how to check the assumptions of multiple regression by examining the output produced in SPSS.
Sample Size
In multiple regression it is important to consider the number of independent variables when determining the number of cases necessary to run the analyses. The required sample size depends on a number of factors, including alpha level, number of predictors, desired power and expected effect sizes (Tabachnick & Fidell, 2007). There are no firm rules for determining sample size. Hindes (2012b) suggests that if assumptions are met, 5 to 10 subjects per independent variable are necessary. However, when assumptions are violated, this increases to 20 to 30 subjects per independent variable. She suggests that for stepwise multiple regression, 40 to 50 subjects per independent variable are needed. Green (1991) suggests two formulas for determining sample size, assuming a medium-sized relationship between the independent variables and the dependent variable (α = .05, β = .20): N > 50 + 8m (where m is the number of IVs) for testing the multiple correlation, and N > 104 + m for testing individual predictors (as cited in Tabachnick & Fidell, 2007).
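For researchers who prefer to apply these rules of thumb in code rather than by hand, the short Python sketch below implements Green's two formulas; the function name and the six-IV example are illustrative only and are not part of Green's or Tabachnick and Fidell's presentation.

def green_sample_size(m):
    # Green's (1991) rules of thumb, as cited in Tabachnick & Fidell (2007):
    # N > 50 + 8m for testing the multiple correlation,
    # N > 104 + m for testing individual predictors.
    n_correlation = 50 + 8 * m
    n_predictors = 104 + m
    return max(n_correlation, n_predictors)

# Example: with six independent variables, max(50 + 48, 104 + 6) = 110 cases.
print(green_sample_size(6))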

Green (1991) notes that the number of cases required increases if there is a skewed dependent variable, a small effect size or a large amount of expected measurement error. However, while ensuring that the sample size is not too small, it is also important that there are not too many cases. If the number of cases becomes very large, "almost any multiple correlation will depart significantly from zero, even one that predicts negligible variance in the DV" (Tabachnick & Fidell, 2007, p. 123).
Missing Data, Data Entry Errors and Outliers
Prior to checking the data for outliers, it is beneficial to check the data set for missing data or data entry errors. Double-checking data entry is important so that unnecessary errors can be avoided (Horn, 2008). Examining the Case Processing Summary and Extreme Values tables produced using the Explore function (explained in Appendix A) can also be helpful in identifying these issues.

Table 1: Case Processing Summary (SPSS output)
Explanation: Checking the N values in the Missing column on the Case Processing Summary will identify the number of missing cases. In the example above, there are no missing cases.

Table 2: Extreme Values (SPSS output)

Explanation: The Extreme Values table identifies the five highest and lowest values, along with their associated case numbers. In the above table, case number 240 has a value of 99.00 and case 234 has a value of .00. In both instances this is reflective of data entry error.
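The same checks can be run outside SPSS with a few lines of Python using pandas. The sketch below is only an illustration; the file name survey.csv and the variable name math_enjoyment are hypothetical placeholders for the researcher's own data set.

import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical data file

# Missing values per variable (mirrors the Missing column of the Case Processing Summary).
print(df.isna().sum())

# Five highest and five lowest values of one variable, with their case labels
# (mirrors the Extreme Values table).
print(df["math_enjoyment"].nlargest(5))
print(df["math_enjoyment"].nsmallest(5))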

Once the data have been checked for missing values or errors, they should be screened for outliers. Extreme cases can strongly impact the outcome of a multiple regression, so it is important that outliers are rescored, removed, or that the variable is transformed (Tabachnick & Fidell, 2007). If an outlier is included in a multiple regression analysis, the regression line is pulled toward the outlier. This results in a regression solution that is more accurate for the extreme case but less accurate for the rest of the cases in the data set (Tabachnick & Fidell, 2007). One method of checking for univariate outliers is to examine the boxplot for the variable, which is produced when using the Explore function in SPSS (see Appendix A). Circles on the boxplot indicate outliers, while asterisks indicate extreme outliers (Hindes, 2012a).

Figure 1: Boxplot to identify outliers
Explanation: On the above boxplot, scores from participants 135 and 186 were outliers for the math enjoyment variable being tested. The specific value for each of these outliers can be found by examining the Extreme Values table mentioned above. Notably, scores from participants 234 and 240 are classified as extreme (as can be seen by the presence of the asterisks). However, in this case these were due to data entry error.
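If a boxplot is not convenient, the same cut-offs can be approximated numerically. The sketch below assumes the usual boxplot convention of 1.5 and 3 interquartile ranges beyond the quartiles, which is how SPSS separates outliers (circles) from extreme values (asterisks); the function is a hypothetical illustration, not SPSS's own procedure.

import numpy as np

def boxplot_outliers(values):
    # Flag cases more than 1.5 IQRs beyond the quartiles (outliers, shown as circles)
    # and more than 3 IQRs beyond the quartiles (extreme values, shown as asterisks).
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
    extremes = x[(x < q1 - 3.0 * iqr) | (x > q3 + 3.0 * iqr)]
    return outliers, extremes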

Multivariate outliers can be checked by examining the Mahalanobis distance values located in the Residuals Statistics table produced through the Multiple Regression procedure in SPSS (see Appendix B). To determine whether there are outliers, it is necessary to find the critical chi-square value based on the number of independent variables (Pallant, 2005). According to Tabachnick and Fidell (2007), an alpha level of .001 should be used when examining Mahalanobis distance.

Table 3: Mahalanobis distance for a data set with two independent variables (SPSS output)

Table 4: Critical values for Mahalanobis distance. Reprinted from SPSS Survival Manual: A Step By Step Guide to Data Analysis Using SPSS for Windows (Version 12) (p. 151), by J. Pallant, 2005, Crows Nest, Australia: Allen & Unwin.
Explanation: Table 3 shows the output for a regression model with two independent variables. The Mahalanobis distance values range from .004 to 10.696. According to Table 4, the critical value of the Mahalanobis distance for two independent variables is 13.82. The maximum value in Table 3 is lower than this critical value, indicating that there are no multivariate outliers in this data set.
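The same check can be reproduced outside SPSS. The sketch below, a rough equivalent rather than SPSS's exact routine, computes each case's squared Mahalanobis distance from the centroid of the independent variables and compares it with the chi-square critical value at an alpha of .001, with degrees of freedom equal to the number of IVs; the array X of independent-variable scores is a hypothetical placeholder.

import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    # X holds one row per case and one column per independent variable.
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))     # sample covariance of the IVs
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distances
    critical = chi2.ppf(1 - alpha, df=X.shape[1])        # e.g., 13.82 for two IVs
    return d2, critical, np.where(d2 > critical)[0]      # flagged case indices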

Multicollinearity

Multicollinearity occurs when the independent variables are highly correlated with each other (Pallant, 2005). This decreases the ability of the multiple regression analysis to determine which variables in the model are important (Chen et al., 2003). To check for multicollinearity, look at the Correlations table. The independent variables should have at least a moderate correlation with the dependent variable (r > .3) but should not be correlated too highly with each other (r < .7) (Tabachnick & Fidell, 2007).

Table 5: Correlations table to test for multicollinearity. Reprinted from SPSS Survival Manual: A Step By Step Guide to Data Analysis Using SPSS for Windows (Version 12) (p. 147), by J. Pallant, 2005, Crows Nest, Australia: Allen & Unwin.
Explanation: In the above table, each independent variable is moderately correlated with the dependent variable (Total Mastery with Total Perceived Stress, -.612; Total PCOISS with Total Perceived Stress, -.581), but the independent variables are not too highly correlated with each other (.521).

Multicollinearity issues are not always evident when examining the Correlations table, so it may also be beneficial to look at the Tolerance and VIF (Variance Inflation Factor) values in the Coefficients table. Tolerance indicates how much of the variability of a given independent variable is not explained by the other independent variables in the regression model (Pallant, 2005). If the Tolerance value is small (< .10), this is an indication that there is a high multiple correlation with the other variables, and thus that multicollinearity is a possibility (Pallant, 2005). The VIF is the inverse of the Tolerance value, so VIF values larger than 10 may indicate multicollinearity (Pallant, 2005). If these thresholds are crossed, the researcher should consider removing one of the highly correlated independent variables from the model (Pallant, 2005).

Table 6: Coefficients table to test for multicollinearity. Reprinted from SPSS Survival Manual: A Step By Step Guide to Data Analysis Using SPSS for Windows (Version 12) (p. 148), by J. Pallant, 2005, Crows Nest, Australia: Allen & Unwin.
Explanation: In the table above, the Tolerance value is greater than .10 (.729) and the VIF is smaller than 10 (1.372). Therefore, the assumption is not violated and multicollinearity is not a concern.
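Both diagnostics can also be computed directly with pandas and statsmodels, as in the sketch below; the file name and the variable names (total_mastery, total_pcoiss, total_perceived_stress) are hypothetical stand-ins for the example shown in Tables 5 and 6.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("survey.csv")  # hypothetical data file
ivs = ["total_mastery", "total_pcoiss"]

# Correlations among the IVs and the DV (the Correlations table).
print(df[ivs + ["total_perceived_stress"]].corr())

# Tolerance and VIF for each IV (the collinearity columns of the Coefficients table).
X = sm.add_constant(df[ivs].dropna())
for i, name in enumerate(X.columns):
    if name != "const":
        vif = variance_inflation_factor(X.values, i)
        print(name, "VIF =", round(vif, 2), "Tolerance =", round(1 / vif, 2))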

Independence of Errors
Another assumption of multiple regression is that the prediction errors are independent of each other. This can be tested by looking at the Durbin-Watson statistic, which measures the autocorrelation of errors over the sequence of cases. The Durbin-Watson statistic can be found in the Model Summary table, which can be produced when running a Multiple Regression (see Appendix B). The values of the statistic range from zero to four. A value of two indicates no autocorrelation. A value substantially greater than two implies negative autocorrelation, while a value substantially less than two indicates positive autocorrelation (Hindes, 2012b). Negative autocorrelation results in estimates of error variance that are too large, and thus a loss of power. Positive autocorrelation results in estimates that are too small and consequently increases the Type I error rate (Tabachnick & Fidell, 2007).

Table 7: Durbin-Watson statistic (SPSS output)
Explanation: In the above table, the Durbin-Watson statistic is 2.028. This value would be considered acceptable because it is close to two.
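The statistic can also be calculated from the residuals of a fitted model, as in the sketch below using statsmodels; the file and variable names are the same hypothetical placeholders used earlier.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("survey.csv")  # hypothetical data file
X = sm.add_constant(df[["total_mastery", "total_pcoiss"]])
model = sm.OLS(df["total_perceived_stress"], X, missing="drop").fit()

# Values near 2 suggest independent errors; well below 2 suggests positive
# autocorrelation and well above 2 suggests negative autocorrelation.
print(round(durbin_watson(model.resid), 3))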

Normality, Linearity and Homoscedasticity
The assumptions of normality, linearity and homoscedasticity can be tested by examining the residual scatterplots produced using the Multiple Regression procedure in SPSS (see Appendix B for instructions). The assumptions are that the residuals are normally distributed around the predicted dependent variable scores, that there is a straight-line relationship between the residuals and the predicted dependent variable scores, and that the variance of the residuals around the predicted dependent variable scores is the same for all of the predicted scores (Tabachnick & Fidell, 2007). Each of these assumptions is discussed in more depth below.
Normality
The assumption of normality refers to the continuous variables in the multiple regression being normally distributed (Yeatts, 2008). As mentioned previously, the residuals scatterplot can be examined to test the assumption of normality. When the residuals are normally distributed around each predicted dependent variable score, the assumption of normality has been met (Tabachnick & Fidell, 2007).

Figure 2: Scatterplots showing (a) assumption of normality met, (b) failure of normality. Reprinted from Using Multivariate Statistics (p. 126), by B. G. Tabachnick and L. S. Fidell, 2007, Boston, MA: Allyn and Bacon.
Explanation: As is visible in (a), to meet the assumption of normality the residuals should cluster in the center of the scatterplot at each predicted score and then spread symmetrically from the center (Tabachnick & Fidell, 2007).

Another method of testing normality is to examine the skewness and kurtosis values of the variables in the Descriptives table produced using the Explore function. The closer the values fall to zero, the more normal the variable (Yeatts, 2008). In order to be considered close to normal, skewness and kurtosis values should fall between -2 and 2 (Hindes, 2012b).

Table 8: Descriptives table (SPSS output)
Explanation: In the Descriptives table above, the skewness and kurtosis values for math enjoyment fall outside of what would be considered close to normal. Therefore, it would be necessary to transform this variable.

If the variable is normally distributed after the transformation, it could then be substituted into the regression. Inverse, square root and logarithmic transformations are examples of transformations that can be completed in SPSS.
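A quick way to check these values, and to try the transformations just mentioned, is sketched below in Python; the variable name is hypothetical, and the +1 offsets are included only so that the logarithmic and inverse transformations remain defined when the raw scores include zero.

import numpy as np
import pandas as pd
from scipy.stats import skew, kurtosis

df = pd.read_csv("survey.csv")       # hypothetical data file
x = df["math_enjoyment"].dropna()    # hypothetical variable

# Sample-adjusted skewness and excess kurtosis (close to the values SPSS reports);
# values between -2 and 2 are treated here as approximately normal.
print(skew(x, bias=False), kurtosis(x, bias=False))

# Candidate transformations; re-check skewness and kurtosis after each.
sqrt_x = np.sqrt(x)        # square root transformation
log_x = np.log10(x + 1)    # logarithmic transformation
inv_x = 1 / (x + 1)        # inverse transformation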

Linearity
Linearity refers to a linear relationship between the variables. In assessing linearity, the residuals scatterplot can be examined to see whether there is a straight-line relationship between the residuals and the predicted dependent variable scores (Pallant, 2005). A linear relationship is evident when the scatterplot has an overall rectangular shape rather than one that is curved (Osborne & Waters, 2002). While linearity is important, it should be noted that nonlinearity does not invalidate a multiple regression analysis. However, it does reduce its power (Tabachnick & Fidell, 2007).

Figure 3: Example of curvilinear and linear relationships with standardized residuals by standardized predicted values. Reprinted from Four Assumptions of Multiple Regression that Researchers Should Always Test, by J. W. Osborne and E. Waters, 2002, Practical Assessment, Research and Evaluation, 8(2), 2.
Explanation: The plot on the right shows a linear relationship, as evidenced by the overall rectangular shape of the scatterplot. If nonlinearity of residuals is present, transforming the independent variables or the dependent variable so that there is a linear relationship between each independent variable and the dependent variable can usually make the residuals linear (Tabachnick & Fidell, 2007).

Homoscedasticity
Homoscedasticity refers to the dependent variable having an equal amount of variability for each value of the independent variables (Yeatts, 2008). When looking for homoscedasticity on a residual scatterplot, check to see whether the band that encompasses the residuals is approximately equal in width for all of the predicted dependent variable values (i.e., the standard deviations of the errors of prediction are about equal for all of the predicted scores for the dependent variable) (Tabachnick & Fidell, 2007). When the assumption of homoscedasticity is violated, the data are heteroscedastic. Heteroscedasticity can cause the standard errors for the estimates of the regression coefficients to be incorrect and therefore result in inaccurate tests of significance for the independent variables (Tabachnick & Fidell, 2007). Heteroscedasticity can occur due to measurement error or because an independent variable interacts with another variable that is outside of the regression equation (Tabachnick & Fidell, 2007). As with linearity, violation of the assumption of homoscedasticity weakens the multiple regression analysis rather than invalidating it (Tabachnick & Fidell, 2007).

Figure 4: Examples of homoscedasticity and heteroscedasticity. Reprinted from Four Assumptions of Multiple Regression that Researchers Should Always Test, by J. W. Osborne and E. Waters, 2002, Practical Assessment, Research and Evaluation, 8(2), 4.
Explanation: The above are scatterplots of the dependent variable's predicted values against the residuals. The graph on the left meets the assumption of homoscedasticity, as evidenced by the band encompassing the residuals being approximately equal in width across the predicted values of the dependent variable. If heteroscedasticity is evident, the data can be transformed using methods such as a square root transformation or weighted least squares (Yeatts, 2009).
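The residual scatterplot used throughout this section (standardized residuals plotted against standardized predicted values, SPSS's *ZRESID by *ZPRED) can be approximated outside SPSS with statsmodels and matplotlib. The sketch below simply z-scores the residuals and the predicted values, which is an approximation of the SPSS plot rather than its exact computation, and the file and variable names remain hypothetical.

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("survey.csv")  # hypothetical data file
X = sm.add_constant(df[["total_mastery", "total_pcoiss"]])
model = sm.OLS(df["total_perceived_stress"], X, missing="drop").fit()

# Approximate *ZRESID and *ZPRED as z-scores of the residuals and fitted values.
zresid = (model.resid - model.resid.mean()) / model.resid.std()
zpred = (model.fittedvalues - model.fittedvalues.mean()) / model.fittedvalues.std()

# Look for an even, rectangular band of points centred on zero: this suggests
# roughly normal residuals, no curvature (linearity) and constant spread
# (homoscedasticity).
plt.scatter(zpred, zresid, s=10)
plt.axhline(0, linewidth=1)
plt.xlabel("Standardized predicted value (*ZPRED)")
plt.ylabel("Standardized residual (*ZRESID)")
plt.show()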

Multiple regression is a useful tool for examining the relationship between multiple independent variables and one dependent variable. However, in order to help ensure that the tool works as intended and to maximize its power, it is important that researchers test for assumptions and address violations appropriately. There are many resources available to support researchers in the day-to-day use of SPSS for multiple regression analyses. One print resource researchers may find useful is the SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS for Windows (Version 12), written by Julie Pallant. Another helpful resource to consider is an online book entitled Regression with SPSS (written by Xiao Chen, Phil Ender, Michael Mitchell and Christine Wells), which can be retrieved from http://www.ats.ucla.edu/stat/spss/webbooks/reg/default.htm.

References

Chen, X., Ender, P., Mitchell, M., & Wells, C. (2003). Regression with SPSS. Retrieved from http://www.ats.ucla.edu/stat/spss/webbooks/reg/default.htm
Green, S. B. (1991). How many subjects does it take to do a regression analysis? Multivariate Behavioral Research, 26(3), 499-510.
Hindes, Y. L. (2012a). Explore [PowerPoint slides]. Retrieved from https://blackboard.ucalgary.ca/
Hindes, Y. L. (2012b). Multiple regression [PowerPoint slides]. Retrieved from https://blackboard.ucalgary.ca/
Horn, R. A. (2008). Multiple regression. Retrieved from http://oak.ucc.nau.edu/rh232/courses/EPS625/Handouts/Regression/Multiple%20Regression%20-%20Handout.pdf
Osborne, J. W., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research and Evaluation, 8(2), 1-5. Retrieved from www-psychology.concordia.ca/fac/kline/495/osborne.pdf
Pallant, J. (2005). SPSS survival manual: A step by step guide to data analysis using SPSS for Windows (Version 12). Crows Nest, Australia: Allen & Unwin.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York, NY: Harper Collins College Publishers.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics. Boston, MA: Allyn and Bacon.
Yang, H., & Huck, S. W. (2010). The importance of attending to underlying statistical assumptions. Newborn and Infant Nursing Reviews, 10(1), 44-49. doi:10.1053/j.nainr.2009.12.005
Yeatts, D. E. (2009). Multiple regression: Diagnostics and solutions [PowerPoint slides]. Retrieved from http://courses.unt.edu/yeatts/6200-Multivariate%20Stats/Lectures-Tests/Test%202/Week-11-diagnostics-solutions.pdf

Appendix A
Running the Explore Function in SPSS

On the top menu bar select Analyze → Descriptive Statistics → Explore.

Select the dependent variable and press the arrow to move it into the Dependent List. If desired, select a factor and press the arrow to move it into the Factor List.

Press Statistics. Click the boxes beside Descriptives and Outliers. Click Continue.

Press Plots. Under Boxplots, select Factor levels together. Under Descriptive, select Stem-and-leaf. Click Continue.

Press Options. Select Exclude cases listwise. Click Continue.

Press OK. Examine output.

Appendix B
Running a Multiple Regression in SPSS

On the top menu bar select Analyze → Regression → Linear.

Select the dependent variable and click the arrow to move it to the Dependent box. Select the independent variables and click the arrow to move them to the Independent box.

Click Statistics. Select the boxes marked Estimates, Confidence intervals, Model fit, Descriptives, Part and partial correlations, Collinearity diagnostics and Durbin-Watson. Press Continue.

Select Plots. Select *ZRESID and move it to the Y box, then select *ZPRED and move it to the X box. Click the box beside Normal probability plot. Press Continue.

Press Save. Select Mahalanobis and Cook's D. Press Continue.

Press Options. Click Exclude cases pairwise. Press Continue.

Press OK. Examine output.
