© All Rights Reserved

Просмотров: 137

assumptions_in_multiple_regression.pdf

© All Rights Reserved

- Multiple Regression
- Multiple Regression Review
- 2007-Hawkins-Ecology-richness-metabolic
- Quality Kitchen Meatloaf Mix
- Lecture 7 Regression and Correlation
- Detection Fraud With Fraud Triangle
- Prediction of Construction Project Performance using Regression Analysis and Artificial Neural Network
- Writing Tips for Economics Research Papers_P Nikolov 2010
- Bel et al_2018
- OLSinR Application
- Final Brochure Sujata Kar
- Linear Regression
- 1171924
- 31. Regression Analysis 1-1-11
- Exercise 1 Multiple Regression Model
- O2
- RMppt
- Multiple Regression
- 3.IJAFMRAPR20173
- 7+OLS+Assumptions

Вы находитесь на странице: 1из 14

Dianne L. Ballance

ID#00939966

University of Calgary

APSY 607

Statistical tests rely upon certain assumptions about the variables used in an analysis

(Osborne & Waters, 2002). Since Cohens 1968 seminal article, multiple regression has become

increasingly popular in both basic and applied research journals (Hoyt, Leierer, & Millington,

2006). It has been noted in the research that multiple regression (MR) is currently a major form

of data analysis in child and adolescent psychology (Jaccard, Guilamo-Ramos, Johansson, &

Bouris, 2006). Multiple regression examines the relationship between a single outcome measure

and several predictor or independent variables (Jaccard et al., 2006). The correct use of the

multiple regression model requires that several critical assumptions be satisfied in order to apply

the model and establish validity (Poole & OFarrell, 1971). Inferences and generalizations

about the theory are only valid if the assumptions in an analysis have been tested and fulfilled.

The use of MR has become common across a wide variety of social science disciplines

including applied psychology and education specifically in the search of interaction effects and

evaluating moderating effects of variables in theory development (Aguinis, Petersen, & Pierce,

1999; Mason & Perreault Jr., 1991; Shieh, 2010). Applications of MR in psychology often are

used to test a theory about causal influences on the outcome measure (Jaccard et al., 2006).

Multiple regression is attractive to researchers given its flexibility (Hoyt et al., 2006). MR can

be used to test hypothesis of linear associations among variables, to examine associations among

pairs of variables while controlling for potential confounds, and to test complex associations

among multiple variables (Hoyt et al., 2006).

The purpose of this tutorial is to clarify the primary assumptions and related statistical

issues, and provide a basic guide for conducting and interpreting tests of assumptions in the

multiple regression model of data analysis. It is assumed that the reader is familiar with the

basics of statistics and multiple regression which provide the framework for developing a deeper

understanding for analysing assumptions in MR. Treatment of assumption violations will not be

addressed within the scope of this paper.

Statistical Issues and Importance

Regression analyses are usually driven by a theoretical or conceptual model that can be

drawn in the form of a path diagram (Jaccard et al., 2006). The path diagram provides the model

for setting the regression and what statistics to examine (Jaccard et al., 2006). If one assumes

linear relations between variables, it provides a road map to a set of theoretically guided linear

equations that can be analyzed by multiple regression methods (Jaccard et al., 2006).

Multiple regression is widely used to estimate the size and significance of the effects of a

number of independent variables on a dependent variable (Neale, Eaves, Kendler, Heath, &

Kessler, 1994). Before a complete regression analysis can be performed, the assumptions

concerning the original data must be made (Sevier, 1957). Ignoring the regression assumptions

contribute to wrong validity estimates (Antonakis, & Deitz, 2011). When the assumptions are

not met, the results may result in Type I or Type II errors, or over- or under-estimation of

significance of effect size (Osborne & Waters, 2002). Meaningful data analysis relies on the

researchers understanding and testing of the assumptions and the consequences of violations.

The extant research suggests that few articles are reporting having tested the assumptions of the

statistical tests they rely on for drawing their conclusions (Antonakis & Dietz, 2011; Osborne &

Waters, 2002; Poole & OFarrell, 1971). Sevier (1957) raised this concern many years ago, and

it appears to continue to be an issue today. The result is that the rich literature in social sciences

and education may have questionable results, conclusions, and assertions (Osborne & Waters,

2002). The validation and reliability of theory and future research relies on diligence in meeting

assumptions of MR.

Assumption Analysis

The assumptions of MR that are identified as primary concern in the research include

linearity, independence of errors, homoscedasticity, normality, and collinearity. This section will

specifically define each assumption, review consequences of assumption failure, and address

how to test for each assumption, and the interpretation of results.

Linearity

Some researchers argue that this assumption is the most important, as it directly relates to

the bias of the results of the whole analysis (Keith, 2006). Linearity defines the dependent

variable as a linear function of the predictor (independent) variables (Darlington, 1968).

Multiple regression can accurately estimate the relationship between dependent and independent

variables when the relationship is linear in nature (Osborne & Waters, 2002). The chance of

non-linear relationships is high in the social sciences, therefore it is essential to examine analyses

for linearity (Osborne & Waters, 2002).

If linearity is violated all the estimates of the regression including regression coefficients,

standard errors, and tests of statistical significance may be biased (Keith, 2006). If the

relationship between the dependent and independent variables is not linear, the results of the

regression analysis will under- or over- estimate the true relationship and increase the risk of

Type I and Type II errors (Osborne & Waters, 2002). When bias occurs it is likely that it does

not reproduce the true population values (Keith, 2006). Violation of this assumption threatens

the meaning of the parameters estimated in the analysis (Keith, 2006).

the current analysis to assist in choosing the appropriate variables (Osborne & Waters, 2002).

Keith (2006) suggests that if you have reason to suspect a curvilinear relationship that you add a

curve component (variable2) to the regression equation to see if it increases the explained

variance. However, this approach may not be sufficient alone to detect non-linearity. More indepth examination of the residual plots and scatter plots available in most statistical software

packages will also indicate linear vs. curvilinear relationships (Keith, 2006; Osborne & Waters,

2002). Residual plots showing the standardized residuals vs. the predicted values and are very

useful in detecting violations in linearity (Stevens, 2009). The residuals magnify the departures

from linearity (Keith, 2006). If there is no departure from linearity you would expect to see a

random scatter about the horizontal line. Any systematic pattern or clustering of the residuals

suggests violation (Stevens, 2009). Figure 1 visually demonstrates both linear and curvilinear

relationships. In addition, detecting curvilinearity can be seen after utilizing the nonlinear

regression option in a statistical package (Osborne & Waters, 2002). The amount of shared

variance (R2) can be seen in an F-test. The F-test is designed to test if two population variances

are equal (by comparing the ratio of two variances). If the variances are equal, the ratio of the

variances will be one. A significant F value indicates a departure from linearity (Sevier, 1957).

Figure1. Scatterplots showing linear and curvilinear relationships with standardized residuals by predicted

values.

Independence of Errors

Independence of errors refers to the assumption that errors are independent of one

another, implying that subjects are responding independently (Stevens, 2009). The goal of

research is often to accurately model the real relationships in the population (Osborne &

Waters, 2002). In educational and social science research it is often difficult to measure

variables, which makes measurement error an area of particular concern (Osborne & Waters,

2002).

When independence of errors is violated standard scores and significance tests will not be

accurate and there is increased risk of Type I error (Keith, 2006; Stevens, 2009). When data are

not drawn independently from the population, the result is a risk of violating the assumption that

errors are independent (Keith, 2002). This means that violations of this assumption can

underestimate standard errors, and label variables as statistically significant when they are not

(Keith, 2006). In the case of MR, effect sizes of other variables can be over-estimated if the

covariate is not reliably measured (Osborne & Waters, 2002). Essentially what occurs is that the

full effect of the covariate is not removed (Osborne & Waters, 2002). Violation of this

assumption therefore threatens the interpretations of the analysis (Keith, 2006).

One way to diagnose violations of this assumption is through the graphing technique

called boxplots in most statistical software programs (Keith, 2006). The boxplots of residuals

show the median, high and low values, and possible outliers (Keith, 2006). Examining the

variability of the boxplots allows the researcher to explore violations to independence of errors

(Keith, 2006). Figure 2 shows a sample boxplot from the IBM SPSS Statistics software program

(SPSS) with variables at similar levels that meet the independence of errors assumption.

Homoscedasticity

The assumption of homoscedasticity refers to equal variance of errors across all levels of

the independent variables (Osborne & Waters, 2002). This means that researchers assume that

errors are spread out consistently between the variables (Keith, 2006). This is evident when the

variance around the regression line is the same for all values of the predictor variable.

When heteroscedasticity is marked it can lead to distortion of the findings and weaken the

overall analysis and statistical power of the analysis, which result in an increased possibility of

Type I error, erratic and untrustworthy F-test results, and erroneous conclusions (Aguinis et al.,

1999; Osborne & Waters, 2002). Therefore the incorrect estimates of the variance lead to the

statistical and inferential problems that may hinder theory development (Antonakis & Dietz,

2011). However, it is good to note that the regression is fairly robust to violation of this

assumption (Keith, 2006).

Homoscedasticity can be checked by visual examination of a plot of the standardized

residuals by the regression standardized predicted value (Osborne & Waters, 2002).

Specifically, statistical software scatterplots of residuals with independent variables are the

method for examining this assumption (Keith, 2006). Ideally, residuals are randomly scattered

around zero (the horizontal line) providing even distribution (Osborne & Waters, 2002).

Heteroscedasticity is indicated when the scatter is not even; fan and butterfly shapes are common

seen in scatterplots. When the deviation is substantial more formal tests for heteroscedasticity

should be performed, such as collapsing the predictive variables into equal categories and

comparing the variance of the residuals (Keith, 2006; Osborne & Waters, 2002). The rule of

thumb for this method is that the ratio of high to low variance less than ten is not problematic

(Keith, 2006). Bartletts and Hartleys tests have been identified in the research as flexible and

powerful tests to assess homoscedasicity (Aguinis et al., 1999; Sevier, 1957).

Figure 3. Homoscedasticity and heteroscedasticity examples

Collinearity

Collinearity (also called multicollinearity) refers to the assumption that the independent

variables are uncorrelated (Darlington, 1968; Keith, 2006). The researcher is able to interpret

regression coefficients as the effects of the independent variables on the dependent variables

when collinearity is low (Keith, 2006; Poole & OFarrell, 1971). This means that we can make

inferences about the causes and effects of variables reliably. Multicollinearity occurs when

several independent variables correlate at high levels with one another, or when one independent

variable is a near linear combination of other independent variables (Keith, 2006). The more

variables overlap (correlate) the less able researchers can separate the effects of variables. In

MR the independent variables are allowed to be correlated to some degree (Cohen, 1968;

Darlington, 1968; Hoyt et al., 2006; Neale et al., 1994). The regression is designed to allow for

this, and provides the proportions of the overlapping variance (Cohen, 2968). Ideally,

independent variables are more highly correlated with the dependent variables than with other

independent variables.

If this assumption is not satisfied, autocorrelation is present (Poole & OFarrell, 1971).

Multicollinearity can result in misleading and unusual results, inflated standard errors, reduced

power of the regression coefficients that create a need for larger sample sizes (Jaccard et al.,

2006; Keith, 2006). Interpretations and conclusions based on the size of the regression

coefficients, their standard errors, or associated t-tests may be misleading because of the

confounding effects of collinearity (Mason & Perreault Jr., 1991). The result is that the

researcher can underestimate the relevance of a predictor, the hypothesis testing of interaction

effects is hampered, and the power for detecting the moderation relationship is reduced because

of the intercorrelation of the predictor variables (Jaccard et al., 2006; Shieh, 2010).

One way to prevent multicollinearity is to combine overlapping variables in the analysis,

and avoid including multiple measures of the same construct in a regression (Keith, 2006).

Statistical software packages include collinearity diagnostics that measure the degree to which

each variable is independent of other independent variables. The effect of a given level of

collinearity can be evaluated in conjunction with the other factors of sample size, R2, and

magnitude of the coefficients (Mason & Perreault Jr., 1991). Widely used procedures examine

the correlation matrix of the predictor variables, computing the coefficients of determination, R2,

and measures of the eigenvalues of the data matrix including variance inflation factors (VIF)

(Mason & Perreault Jr., 1991). Tolerance measures the influence of one independent variable on

all other independent variables. Tolerance levels for correlations range from zero (no

independence) to one (completely independent) (Keith, 2006). The VIF is an index of the

10

amount that the variance of each regression coefficient is increased over that with uncorrelated

independent variables (Keith, 2006). When a predictor variable has a strong linear association

with other predictor variables, the associated VIF is large and is evidence of multicollinearity

(Shieh, 2010). The rule of thumb for a large VIF value is ten (Keith, 2006; Shieh, 2010). Small

values for tolerance and large VIF values show the presence of multicollinearity (Keith, 2006).

Table 1 is an example of low collinearity demonstrated by high tolerance and low VIF values

from the SPSS software.

Table 1. Collinearity statistics

Unstandardized

Standardized

Coefficients

Coefficients

Correlations

Collinearity Statistics

ZeroModel

1 (Constant)

hours of math

Std. Error

7.640

6.705

1.532

.211

.208

.754

Beta

Sig.

order

Partial

Part

Tolerance

VIF

1.139

.256

.418

7.271

.000

.429

.421

.414

.985

1.016

.123

.101

1.696

.091

.053

.107

.097

.916

1.092

.382

.118

1.971

.050

.141

.125

.112

.903

1.108

homework per

month

teacher math

support

peer math support

Normality

Multiple regression assumes that variables have normal distributions (Darlington, 1968;

Osborne & Waters, 2002). This means that errors are normally distributed, and that a plot of the

values of the residuals will approximate a normal curve (Keith, 2006). The assumption is based

on the shape of normal distribution and gives the researcher knowledge about what values to

expect (Keith, 2006). Once the sampling distribution of the mean is known, it is possible to

make predictions for a new sample (Keith, 2006).

11

When scores on variables are skewed, correlations with other measures will be

attenuated, and when the range of scores in the sample is restricted relative to the population

correlations with scores on other variables will be attenuated (Hoyt et al., 2006). Non-normally

distributed variables can distort relationships and significance tests (Osborne & Waters, 2002).

Outliers can influence both Type I and Type II errors and the overall accuracy of results

(Osborne & Waters, 2002).

The researcher can test this assumption through several pieces of information: visual

inspection of data plots, skew, kurtosis, and P-Plots (Osborne & Waters, 2002). Data cleaning

can also be important in checking this assumption through the identification of outliers.

Statistical software has tools designed for testing this assumption. Skewness and kurtosis can be

checked in the statistic tables, and values that are close to zero indicate normal distribution.

Normality can further be checked through histograms of the standardized residuals (Stevens,

2009). Histograms are bar graphs of the residuals with a superimposed normal curve that show

distribution. The normal curve is fitted to the data using the observed mean and standard

deviation as estimates, and computing the corresponding chi square (Sevier, 1957). Figure 4 is

an example of is histogram with normal distribution from the SPSS software. Q-plots, and Pplots are a more exacting methods to spot deviations from normality, and are relatively easy to

interpret as departures from a straight line (Keith, 2006). Figure 5 shows a P-Plot with normal

distribution from the SPSS software.

Figure 4. Histogram with normal distribution

12

Figure 5. Normal P-Plot

Conclusion

Multiple regression techniques give researchers flexibility to address a wide variety of

research questions (Hoyt et al., 2006). Since the analyses are based upon certain definite

conditions or assumptions, it is imperative that the assumptions be analyzed (Sevier, 1957). The

goal of this tutorial was to raise awareness and understanding of the importance of testing

assumptions in MR for understanding and conducting research. The primary assumptions

reviewed included linearity, independence of errors, homoscedasticity, collinearity, and

normality. When assumptions are violated accuracy and inferences from the analysis are

affected (Antonakis & Dietz, 2011). Statistical software packages allow researchers to test for

each assumption. Checking the assumptions carry significant benefits for the researcher, reduce

error, and increase reliability and validity of inferences. Consideration of the issues surrounding

the assumptions in multiple regression should improve the insights for researchers as they build

theories (Jaccard et al., 2006).

13

References

Aguinis, H., Petersen, S., & Pierce, C. (1999). Appraisal of the homogeneity of error variance

assumption and alternatives to multiple regression for estimating moderating effects of

categorical variables. Organizational Research Methods, 2, 315-339. doi:

10.1177/109442819924001

Antonakis, J., & Dietz, J. (2011). Looking for validity or testing it? The perils of stepwise

regression, extreme-score analysis, heteroscedasticity, and measurement error.

Personality and Individual Differences, 50, 409-415. doi:10.1016/j.paid.2010.09.014

Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,

70(6), 426-443.

Darlington, R. (1968). Multiple regression in psychological research and practice. Psychological

Bulletin, 69(3), 161-182.

Hoyt, W., Leierer, S., & Millington, M. (2006). Analysis and interpretation of findings using

multiple regression techniques. Rehabilitation Counseling Bulletin, 49(4), 223-233.

Jaccard, J., Guilamo-Ramos, V., Johansson, M., & Bouris, A. (2006). Multiple regression

analyses in clinical child and adolescent psychology. Journal of Clinical Child and

Adolescent Psychology, 35(3), 456-479.

Keith, T. (2006). Multiple regression and beyond. PEARSON Allyn & Bacon.

Mason, C., & Perreault Jr., W. (1991). Collinearity, power, and interpretation of multiple

regression analysis. Journal of Marketing Research, 28(3), 268-280. Retrieved from:

http://www.jstor.org/stable/3172863

14

Neale, M., Eaves, L., Kendler, K., Heath, A., & Kessler, R (1994). Multiple regression with data

collected from relatives: Testing assumptions of the model. Multivariate Behavioral

Research, 29(1), 33-61.

Osborne, J., & Waters, E. (2002). Four assumptions of multiple regression that researchers

should always test. Practical Assessment, Research & Evaluation, 8(2). Retrieved from:

http://PAREonline.net/getvn.asp?v=8&n=2

Poole, M., & OFarrell, P. (1971). The assumptions of the linear regression model. Transactions

of the Institute of British Geographers, 52, 145-158. Retrieved from:

http://www.jstor.org/stable/621706

Sevier, F. (1957). Testing assumptions underlying multiple regression. The Journal of

Experimental Education, 25(4), 323-330. Retrieved from:

http://www.jstor.org/stable/20154054

Shieh, G. (2010). On the misconception of multicollinearity in detection of moderating effects:

Multicollinearity is not always detrimental. Multivariate Behavioral Research, 45, 483507. doi: 10.1080/00273171.2010.483393

Stevens, J. P. (2009). Applied multivariate statistics for the social sciences (5th ed.). New York,

NY: Routledge.

- Multiple RegressionЗагружено:Khawaja Naveed Haider
- Multiple Regression ReviewЗагружено:akirank1
- 2007-Hawkins-Ecology-richness-metabolicЗагружено:juli
- Quality Kitchen Meatloaf MixЗагружено:tiger_71
- Lecture 7 Regression and CorrelationЗагружено:fa2heem
- Detection Fraud With Fraud TriangleЗагружено:Mariana Febriansyah
- Prediction of Construction Project Performance using Regression Analysis and Artificial Neural NetworkЗагружено:IJRASETPublications
- Writing Tips for Economics Research Papers_P Nikolov 2010Загружено:Arbee Lu
- Bel et al_2018Загружено:Alfredo Steven
- OLSinR ApplicationЗагружено:Nataniel Lopes Barros
- Final Brochure Sujata KarЗагружено:MarkWeber
- Linear RegressionЗагружено:vada_so
- 1171924Загружено:fabriciolafebre
- 31. Regression Analysis 1-1-11Загружено:Ani Krishna
- Exercise 1 Multiple Regression ModelЗагружено:Carlos Villa Alonso
- O2Загружено:Yash Maurya
- RMpptЗагружено:Rohit Padalkar
- Multiple RegressionЗагружено:ColleenMaimaiChimee
- 3.IJAFMRAPR20173Загружено:TJPRC Publications
- 7+OLS+AssumptionsЗагружено:Saitej
- L2-2014-FS-02-QMЗагружено:Heidi HC
- Linear ModelsЗагружено:jbsimha3629
- Fundamentals of Statistics - UploadЗагружено:ipconfigearth
- ChemistryЗагружено:SelvaKumar Raju
- Multivariate - Dependence MethodsЗагружено:Arvind Yadav
- 10.Availability and Analysis of ICT Indices in Asia&PacificЗагружено:ridwansurono
- The Effect of Industry and Firm Characteristics on New Firms Survival: (Case Study: The Electrical and Electronic Industries in The Mazandaran Province, Iran)Загружено:TI Journals Publishing
- Investigation of Multi Linear Regression Methods on Estimation of Free Vibration Analysis of Laminated Composite Shallow ShellsЗагружено:IJAERS JOURNAL
- MultiColl TheoryЗагружено:Munnah Bhai
- Epping Wilder (2011)Загружено:Rizaldi Maulana

- Mangerial AccountingЗагружено:Hira Mustafa Shah
- VocabЗагружено:Hira Mustafa Shah
- Statement of PurposeЗагружено:Hira Mustafa Shah
- Exercise 1.docxЗагружено:Hira Mustafa Shah
- Leadership AssignmentЗагружено:Hira Mustafa Shah
- Final Brm ProЗагружено:Hira Mustafa Shah
- English LearningЗагружено:Hira Mustafa Shah
- HiraЗагружено:Hira Mustafa Shah
- BrmЗагружено:Hira Mustafa Shah
- BRM iv dvЗагружено:Hira Mustafa Shah
- Latest AssignmentЗагружено:Hira Mustafa Shah
- DocumentЗагружено:Hira Mustafa Shah
- Consumer PsychologyЗагружено:Hira Mustafa Shah
- MFЗагружено:Hira Mustafa Shah
- Cover PageЗагружено:Hira Mustafa Shah
- Strategic Management PlanЗагружено:Hira Mustafa Shah
- Assignment 2Загружено:Hira Mustafa Shah
- Assignment 4Загружено:Hira Mustafa Shah
- jjjjЗагружено:Hira Mustafa Shah
- DocumentЗагружено:Hira Mustafa Shah
- ist assign.docxЗагружено:Hira Mustafa Shah
- Cost FinalЗагружено:Hira Mustafa Shah
- MarketingЗагружено:Hira Mustafa Shah
- 2nd Assign BRM.docxЗагружено:Hira Mustafa Shah
- 12-1Загружено:Hira Mustafa Shah
- Business FinanceЗагружено:Hira Mustafa Shah

- ETC1010_2010_S1 Exam Solutions All TopicsЗагружено:Mohammad Rashman
- S3AЗагружено:Andreas Menon
- How to forecast using ARIMAЗагружено:Naba Kr Medhi
- ivregEXЗагружено:lZbrunoZl
- Muthen LCAЗагружено:YonathanRacing
- Chapt 8 -9 ExercisesЗагружено:husaini98
- STAT1201 MSE 2013c SolutionsЗагружено:Devon Hamley
- Spring 12 QMB3200 FSU Exam 3 Practice Problems SolutionsЗагружено:AlekHartzog
- Psy 103Загружено:HaidarAli
- malhotra19.pptЗагружено:Sunia Tariq
- TЗагружено:taruna
- OM Test 1-Sample (Undergraduate)Загружено:Robin T
- Dynamic Panel Data ModelsЗагружено:zaheer khan
- Fix ExpertЗагружено:Ulfia Dwi Novita
- Q'vive PMP Formulas PMBOK5 v2.pdfЗагружено:jsrt234956
- Addressing Moderated Mediation Preacher Rucker Hayes 2007Загружено:Payal Anand
- Unit-9Загружено:swingbike
- Materi 4-Teknik SamplingЗагружено:Budi Santoso
- MKKG WORDЗагружено:Muvida Savitri
- Index ModelsЗагружено:Nurdin Hadihawrun
- Data Processing & Analysis in b.r.Загружено:popat vishal
- MultiColl TheoryЗагружено:Munnah Bhai
- Chapter6_Test of Hypotheses_ANSWER KEYЗагружено:emmanuel auguis
- How to Choose the Right TestЗагружено:ldlewis
- Frequency Analysis in hydrologyЗагружено:suman59
- pROC.pdfЗагружено:Luis Luengo Machuca
- A Critical Look at the Use of PLS-SEM in MIS QuarterlyЗагружено:Liliya Iskhakova
- Logistic RegressionЗагружено:Fithria Siti Hanifah Fazeh
- Assignment 21 SolutionsЗагружено:Ivan Paredes
- chap04.pptЗагружено:Habib