
Running head: Assumptions in Multiple Regression

Importance of Assumptions in Multiple Regression

Shawna Sjoquist

EDPS 607

University of Calgary

Introduction

The use of assumptions in multiple regression allows researchers to analyze data sets and draw reliable conclusions from the analysis. Assumptions can substantially influence the results that are reported, and consequently they affect the conclusions drawn from a multiple regression analysis. The validity of the assumptions themselves directly influences the validity of the conclusions drawn from the data analysis. A review of the literature indicates that the number of assumptions considered varies and is somewhat a matter of statistical opinion (Berry, 1993). While some statisticians identify a particular condition as a requirement of multiple regression, others may report that the same condition is an assumption of the analysis (Berry, 1993). For example, in order to draw conclusions from a multiple regression data set, the researcher may make assumptions regarding linearity, independence, homoscedasticity, and normality. In the discussion that follows, the reader will gain a statistical awareness of what is meant by the terms multiple regression and assumptions, be presented with four assumptions germane to multiple regression, and become familiar with methods of testing the identified assumptions.

Multiple Regression

In order to adequately discuss the importance of assumptions in multiple regression and the manner in which a researcher would test them, several definitions need to be established. First, the researcher requires a clear understanding of the definition of multiple regression. Multiple regression describes the process of predicting a dependent variable from a set of predictor variables (Stevens, 2009). In other words, the researcher is attempting to predict an effect in a dependent variable from a set of predictors that may in some way influence the result of the dependent variable.


For example, a researcher may be interested in predicting college SAT scores by considering standardized math test scores, standardized English test scores, general attitude toward education, and overall study habits. This example highlights an important factor inherent to multiple regression: it allows for a more realistic analysis that reflects the natural human condition (Multiple Regression, 2006). More specifically, multiple regression recognizes that the dependent variable is rarely influenced by a single isolated factor. In fact, the dependent variable, college SAT scores in the previous example, is often influenced by several factors, seen in the example as general attitude toward education, overall study habits, and standardized test scores. We know that the score one obtains on the SAT is likely the result of a variety of contributing factors rather than any one factor alone. In mathematical terms, multiple regression attempts to describe the relationship between a Y variable and a set of predictor X variables (Stevens, 2009). Essentially, we are attempting to use the independent variables we identified, the Xs, to predict what the dependent variable, the Y, will be (Multiple Regression, 2006). In order to analyze this relationship, several important assumptions are made.

Assumptions

Second, the researcher requires a clear understanding of the definition of statistical assumptions. As we have discussed, through multiple regression we attempt to use the independent X variables to predict the dependent Y variable and to calculate a regression equation (Multiple Regression, 2006). In order to perform the regression calculation, it is assumed that certain conditions are present. These conditions constitute our multiple regression assumptions. Technically speaking, multiple regression assumptions are the assumptions made about how predicted values of the dependent Y variable are produced from the values of the independent X variables (Aiken & West, 1991).
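To make this concrete, the standard multiple regression model can be written as follows; this is the textbook formulation consistent with Stevens (2009), not an equation reproduced from the original paper:

```latex
% Multiple regression model for observation i with k predictors:
% \beta_0 is the intercept, \beta_1, \dots, \beta_k are the partial
% regression coefficients, and \varepsilon_i is the error term.
Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i
```

Viewed this way, the assumptions discussed below are, formally, statements about the behavior of the error term and about the form of the relationship between the X variables and Y.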

When the researcher is able to verify that the assumptions have been satisfied, they are then able to analyze the data and produce estimates that are presumably reliable, or unbiased, and efficient (Aiken & West, 1991). Producing reliable estimates means that the estimates the researcher puts forward are consistent with the actual results, as opposed to being misleadingly higher or lower. Essentially, the researcher wants the estimates to be as close to the true values as possible and therefore to have a standard error that is as small as possible (Osborne & Waters, 2002). In other words, the researcher wants the estimated value to fall within a small window of values that encases the true value and can thereby be deemed acceptable. Though we want to ensure that the results we obtain are unbiased and efficient, it must also be acknowledged that regression analysis, on its own, is relatively robust (Osborne & Waters, 2002). This means the analysis can produce results that are reasonably unbiased and efficient even if one or more of the assumptions have minor violations (Berry, 1993). Regression analysis is not so robust, however, as to protect against a large violation of one or more of the assumptions (Osborne & Waters, 2002). A large violation is likely to produce faulty results (Berry, 1993). In other words, when the assumptions have not been met, the researcher cannot completely trust the results obtained from the regression analysis (Osborne & Waters, 2002). Moreover, the results may be susceptible to Type I error, Type II error, and/or an over- or under-estimation of significance (Osborne & Waters, 2002). Overall, "knowledge and understanding of the situations when violations of assumptions lead to serious biases, and when they are of little consequence, are essential to meaningful data analysis" (Pedhazur, 1997, p. 33). Though regression analysis is relatively robust, there are several assumptions of multiple regression that are not robust to violation (Osborne & Waters, 2002). These assumptions will be the focus of the following discussion.

Assumption of Linearity

In its simplest form, multiple regression requires that there is a linear relationship between the independent and dependent variables (Osborne & Waters, 2002). To understand the assumption of linearity, it is important to understand what may result should a non-linear relationship be found. Osborne and Waters (2002) offer that if a non-linear relationship exists between the independent and dependent variables, the multiple regression analysis is likely to under-estimate the true relationship between the variables. This under-estimation increases the risk of a Type II error for the variable in question and, because predictors share variance, increases the risk of Type I errors for other independent variables (Osborne & Waters, 2002). If we consider our previous example, in which the researcher seeks to predict college SAT scores from standardized math test scores, standardized English test scores, general attitude toward education, and overall study habits, a Type I error would mean that the researcher reports that overall study habits, for example, significantly predict college SAT scores when the true values suggest otherwise. There are several ways a researcher can check for violations of this assumption. To gain information regarding linearity, the researcher may examine the plot of studentized residuals against predicted values provided by a statistical software program (Osborne & Waters, 2002; Stevens, 2009). If the assumption of linearity is not violated, the researcher can expect to see a random scattering of standardized residuals around the horizontal line, as seen in the figure below left (Stevens, 2009). Generally speaking, observing any systematic patterns or clustering of the residuals suggests that there has been a violation of the assumption of linearity (Stevens, 2009). For example, the figure below right depicts a pattern and clusters that would suggest a violation of the assumption of linearity has occurred.
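As a minimal sketch of how such a check could be run outside a point-and-click package, the following Python code fits an ordinary least squares model with statsmodels and plots studentized residuals against predicted values. The data and column names (sat_score, math_score, and so on) are hypothetical stand-ins for the running example, not data from the original study:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Hypothetical data set mirroring the SAT example; values are simulated
# for illustration only.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "math_score": rng.normal(500, 100, n),
    "english_score": rng.normal(500, 100, n),
    "attitude": rng.normal(50, 10, n),
    "study_habits": rng.normal(50, 10, n),
})
df["sat_score"] = (0.4 * df["math_score"] + 0.3 * df["english_score"]
                   + 2.0 * df["study_habits"] + rng.normal(0, 50, n))

# Fit the multiple regression model.
X = sm.add_constant(df[["math_score", "english_score", "attitude", "study_habits"]])
model = sm.OLS(df["sat_score"], X).fit()

# Studentized residuals vs. predicted values: a random scatter around the
# horizontal zero line supports the assumption of linearity.
studentized = model.get_influence().resid_studentized_internal
plt.scatter(model.fittedvalues, studentized)
plt.axhline(0, color="grey")
plt.xlabel("Predicted values")
plt.ylabel("Studentized residuals")
plt.title("Residuals vs. predicted values (linearity check)")
plt.show()
```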

Additionally, statistical output can assist with checking the assumption of linearity. For example, the researcher can use the lack of fit test, as seen in the figure below, to either retain or reject the null hypothesis that a linear regression model is appropriate. Using the information provided in the figure, the researcher compares the p-value associated with the F statistic (here, F = 1.49) against the .05 criterion; a non-significant result means the null hypothesis is not rejected and the assumption of linearity is supported.
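The figure referenced above shows a lack-of-fit table from a statistical package. As a rough programmatic analogue, Ramsey's RESET test, a related but not identical test of whether a linear specification is adequate, is available in statsmodels; this sketch continues from the model object fitted in the previous example:

```python
from statsmodels.stats.diagnostic import linear_reset

# Ramsey RESET test: the null hypothesis is that the linear specification
# is adequate (no omitted non-linear terms). A non-significant result
# (p > .05) retains the null and supports the assumption of linearity.
reset = linear_reset(model, power=2, use_f=True)
print(f"F = {float(reset.fvalue):.2f}, p = {float(reset.pvalue):.3f}")
```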

Assumption of Independence of Errors

As multiple regression is concerned with predicting a dependent variable from a set of predictors (Stevens, 2009), the assumption of independence of errors is also important to the successful prediction of results. The assumption of independence of errors has to do with measurement and is of particular note when the researcher is attempting to model the relationships present in the population from which the research sample is drawn (Osborne & Waters, 2002).

More specifically, the effect sizes of the variables used in the multiple regression analysis can be over-estimated if a covariate is not reliably measured, because the full effect of the covariate is then not removed from the analysis (Osborne & Waters, 2002). Any causes that are measured unintentionally constitute error and will influence the results of multiple regression if not accounted for in an appropriate manner (Christensen & Bedrick, 1997). Multiple regression assumes that the error term is not correlated with any of the independent variables (Christensen & Bedrick, 1997). It also assumes that the errors, or residuals, are independent of one another (Stevens, 2009). Violations of this assumption can reflect omitted variables, reverse causation, and/or general measurement error (Christensen & Bedrick, 1997). More specifically, omission of a relevant variable that is correlated with other variables included in the regression will result in a violation of this assumption. Likewise, if the dependent variable has a causal effect on any of the independent variables, or if there is measurement error in the independent variables, a violation of the assumption will likely result (Christensen & Bedrick, 1997). There are several ways a researcher can check for violations of this assumption. Generally speaking, the researcher wants to ensure that the size of one residual does not influence the size of another (Jeng & Martin, 2006). Statistical output can assist the researcher in identifying violations of the assumption of independence of errors. For example, the researcher can use the Durbin-Watson statistic, presented in the figure below, to evaluate whether it is likely that one residual has affected the next (Jeng & Martin, 2006). The Durbin-Watson statistic ranges from 0 to 4. For the assumption to be satisfied, the statistic should be approximately 2, though the literature suggests that a value falling within the range of 1.50 to 2.50 is acceptable (Jeng & Martin, 2006).


For example, if we consider the data presented in the figure, we can see that the Durbin-Watson statistic falls within the acceptable range, and thus the assumption of independence of errors would be satisfied in this case.
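A brief sketch of the same check in Python, again continuing from the model fitted earlier; durbin_watson is part of statsmodels, and the 1.50-2.50 rule of thumb is the one cited above:

```python
from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic on the model residuals: values near 2 suggest
# uncorrelated residuals; roughly 1.50-2.50 is commonly treated as
# acceptable (Jeng & Martin, 2006).
dw = durbin_watson(model.resid)
print(f"Durbin-Watson = {dw:.2f}")
if 1.5 <= dw <= 2.5:
    print("Assumption of independence of errors appears satisfied.")
```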

Assumption of Homoscedasticity

In order to understand the assumption of homoscedasticity, the researcher must comprehend its definition. Homoscedasticity means that the variance of errors is the same across all levels of the independent variable (Osborne & Waters, 2002). In other words, homoscedasticity indicates that the dependent variable has the same amount of variability for each value of the independent variable (Aiken & West, 1991). Essentially, the researcher wants to see roughly the same amount of variance for all values examined in the analysis. If all of the random variables in the same vector, or sequence, have equal variance, the researcher would conclude that there is homogeneity of variance and that the sequence, or vector, is homoscedastic (Stevens, 2009). Violation of the assumption of homoscedasticity can lead to distorted findings, weakened analysis, and an increased possibility of Type I error (Osborne & Waters, 2002). A researcher can test for violations of the assumption of homoscedasticity in several ways. A visual review of the plot of standardized residuals by predicted values, which was also used to check the previously discussed assumption of linearity, can aid in checking the assumption of homoscedasticity. For the assumption of homoscedasticity to be met, the plot, as depicted in the figure below, should show residuals that are evenly distributed about the horizontal line (Osborne & Waters, 2002).


Should the researcher find that the residuals are not evenly distributed, the researcher could conclude that the data are more likely heteroscedastic than homoscedastic. There are also statistical methods useful for checking the assumption of homoscedasticity, or homogeneity of error variance. For instance, the researcher may use the Breusch-Pagan test, provided in many statistical software packages, to check for violations of this assumption (Osborne & Waters, 2002). The figure below displays the Breusch-Pagan test as provided in the statistical output.

Using this output, the researcher would either retain or reject the null hypothesis that the variance of the residuals is the same for all values of the independent variable. For example, in the figure provided, the researcher would note that the probability of the Breusch-Pagan statistic was p = .614, which is non-significant; the researcher would conclude that the null hypothesis is not rejected and that the assumption of homogeneity of error variance is satisfied. The results of this statistical output can be used to strengthen and support the conclusions regarding the assumption of homoscedasticity drawn from the scatter plot of residuals discussed previously.
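A minimal sketch of the Breusch-Pagan test in Python, continuing from the model fitted earlier; het_breuschpagan is the statsmodels implementation of this test:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test: the null hypothesis is constant residual variance
# (homoscedasticity). A non-significant p-value (p > .05) retains the
# null and supports the assumption.
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(model.resid, model.model.exog)
print(f"LM statistic = {lm_stat:.2f}, p = {lm_p:.3f}")
```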

Assumption of Normality

The fourth assumption that will be discussed is the assumption of normality. Multiple regression assumes that the variables used in the analysis are normally distributed (Hindes, 2012). Variables that are normally distributed form a relatively consistent bell-shaped curve, as will be discussed in the figures to follow (Stevens, 2009). The assumption of normality is influenced by the size of the sample used (Stevens, 2009). More specifically, if the researcher uses a sufficiently large sample, the assumption of normality is more likely to be met (Pfaffenberger & Dielman, 1991). The Central Limit Theorem states that "the sum of independent observations having any distribution whatsoever approaches a normal distribution as the number of observations increases" (Stevens, 2009, p. 221). Further, Bock (1975) offers that even distributions that stray from normality, due to outliers or kurtosis for example, will approximate normality when 50 or more observations are summed. Variables that do not follow a normal pattern of distribution are likely affected by substantial skew, kurtosis, and/or considerable outliers (Osborne & Waters, 2002). Given a general understanding of the assumption of normality, the researcher also needs to be equipped with the knowledge of how to check for violations of the assumption. The assumption of normality can be assessed through a visual inspection of data plots, skew, kurtosis, and/or P-P plots (Osborne & Waters, 2002; Stevens, 2009). The researcher may also gain information regarding normality by considering the presence of outliers. The researcher can identify outliers by inspecting a histogram or stem-and-leaf plot derived from the frequency distribution (Osborne & Waters, 2002). The figure below represents a histogram that follows a relatively normal pattern of distribution with the exception of the extreme outlier shown in red.



If we consider the value circled in red, this distribution would be considered skewed. Visual inspection of the extreme values table provided in statistical output can also help the researcher gain information regarding normality. The extreme values table presented in the figure below highlights the highest and lowest scores in the data set and can indicate the existence of outliers that might influence the presence or absence of skew and kurtosis. The presence of extreme values, such as those represented by the highlighted portions of the figure below left, has the potential to distort the data and create a distribution that is skewed. The information provided by the extreme values table is also supported by the box plot. By reviewing the box plot, as depicted in the figure below right, the researcher is able to see the extreme values, or outliers, also described by the extreme values table.



The researcher can also gain information about skew and kurtosis by referencing the descriptives table of the statistical output. In a perfectly normal distribution, both skew and kurtosis will be 0 (Stevens, 2009). The farther the reported values move away from 0, the farther the distribution departs from normality (Stevens, 2009).
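A short sketch of the same check on the residuals of the model fitted earlier, using scipy; note that scipy's kurtosis function defaults to excess kurtosis, so 0 corresponds to a normal distribution, matching the convention described above:

```python
from scipy.stats import kurtosis, skew

# Skew and excess kurtosis of the residuals; both are 0 for a perfectly
# normal distribution, and larger absolute values indicate departure
# from normality.
print(f"Skew = {skew(model.resid):.2f}")
print(f"Excess kurtosis = {kurtosis(model.resid):.2f}")
```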

Further, the researcher may wish to run the Kolmogorov-Smirnov test and the Shapiro-Wilk test, which are also designed to provide information regarding normality (Pfaffenberger & Dielman, 1991). By reviewing the tests of normality table provided by the resulting statistical analysis, the researcher can obtain information indicating the presence or absence of normality. For these tests the null hypothesis is that the data are normally distributed, so non-significant results (p values of .05 and greater) indicate that the data are consistent with a normal distribution, while significant results (p values less than .05) indicate that the data are not normally distributed.
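Both tests are available in scipy; the sketch below applies them to the residuals of the model fitted earlier. The Kolmogorov-Smirnov call plugs in the residuals' own mean and standard deviation, which is only a rough check (a Lilliefors-style correction would be needed for exact p-values):

```python
from scipy.stats import kstest, shapiro

resid = model.resid

# Shapiro-Wilk: the null hypothesis is normality, so p > .05 supports it.
w_stat, w_p = shapiro(resid)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p = {w_p:.3f}")

# Kolmogorov-Smirnov against a normal distribution parameterized by the
# residuals' own mean and standard deviation.
ks_stat, ks_p = kstest(resid, "norm", args=(resid.mean(), resid.std()))
print(f"K-S D = {ks_stat:.3f}, p = {ks_p:.3f}")
```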



Additionally, the researcher may review the P-P plot also provided by the statistical output. The P-P plot allows the researcher to see how closely the observed cumulative distribution of the residuals agrees with the theoretical normal distribution (Stevens, 2009). The distributions are considered equal if the data points follow the comparison line. The figure below depicts a normal P-P plot in which the standardized residuals follow the general path of the comparison line; it represents the type of P-P plot a researcher would hope to see in a regression analysis.
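A final sketch, producing a normal P-P plot of the residuals with statsmodels' ProbPlot helper; fit=True estimates the normal distribution's mean and standard deviation from the residuals themselves:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Normal P-P plot: observed cumulative probabilities of the residuals
# plotted against those expected under normality. Points hugging the
# 45-degree line indicate approximately normal residuals.
pp = sm.ProbPlot(model.resid, fit=True)
pp.ppplot(line="45")
plt.title("Normal P-P plot of regression residuals")
plt.show()
```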

Conclusion

The use of assumptions in multiple regression allows researchers to analyze data sets and draw reliable conclusions from the analysis. The discussion has provided evidence that the researcher's ability to check relevant assumptions has significant benefits for the research results. Though review of the literature has revealed that the number of assumptions considered varies and is somewhat a matter of statistical opinion (Berry, 1993), much of the literature agrees that the assumptions identified in this discussion are important to multiple regression. For example, it is generally agreed that multiple regression requires a linear relationship between the independent and dependent variables and a normal distribution. If the variance of errors is not the same across all levels of the independent variable, or if the errors are correlated with any of the independent variables, the results and pursuant conclusions of the multiple regression analysis will be significantly impacted.


When the researcher is able to verify that the relevant assumptions have been satisfied, the researcher is free to produce estimates that can be considered reliable, free from bias, and efficient.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.

Berry, W. D. (1993). Understanding regression assumptions. Newbury Park, CA: Sage.

Bock, R. D. (1975). Multivariate statistical methods in behavioral research. New York, NY: McGraw-Hill.

Christensen, R., & Bedrick, E. J. (1997). Testing the independence assumption in linear models. Journal of the American Statistical Association, 92(439), 1006-1016. Retrieved from http://www.jstor.org/stable/2965565

Hindes, Y. (2012, May). Multiple regression continued. Presented in EDPS 607 Research in Applied Psychology: Multivariate Analysis lecture, University of Calgary, Calgary, Alberta.

Jeng, J. Y., & Martin, A. (2006). Residuals in multiple regression analysis. Journal of Pharmaceutical Sciences, 74(10), 1053-1057. doi:10.1002/jps.2600741006

Multiple regression. (2006). In Encyclopedia of special education: A reference for the education of the handicapped and other exceptional children and adults. Retrieved from http://www.credoreference.com.ezproxy.lib.ucalgary.ca/entry/wileyse/multiple_regression

Osborne, J., & Waters, E. (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research & Evaluation, 8(2). Retrieved June 16, 2012, from http://PAREonline.net/getvn.asp?v=8&n=2

Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). Orlando, FL: Harcourt Brace.

Pfaffenberger, R. C., & Dielman, T. E. (1991). Testing normality of regression disturbances: A Monte Carlo study of the Filliben test. Computational Statistics & Data Analysis, 11(3). Retrieved from http://www.sciencedirect.com/science/article/pii/016794739190085G

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York, NY: Routledge.

