
Measures of Relationship

Chapter 5 of the textbook introduced you to the two most widely used measures of
relationship: the Pearson product-moment correlation and the Spearman rank-order
correlation. We will be covering these statistics in this section, as well as other
measures of relationship among variables.

What is a Relationship?
Correlation coefficients are measures of the degree of relationship between two or
more variables. When we talk about a relationship, we are talking about the manner in
which the variables tend to vary together. For example, if one variable tends to
increase at the same time that another variable increases, we would say there is a
positive relationship between the two variables. If one variable tends to decrease as
another variable increases, we would say that there is a negative relationship between
the two variables. It is also possible that the variables might be unrelated to one
another, so that there is no predictable change in one variable based on knowing about
changes in the other variable.
As a child grows from an infant into a toddler into a young child, both the child's
height and weight tend to change. Those changes are not always tightly locked to one
another, but they do tend to occur together. So if we took a sample of children from a
few weeks old to 3 years old and measured the height and weight of each child, we
would likely see a positive relationship between the two.
A relationship between two variables does not necessarily mean that one variable
causes the other. When we see a relationship, there are three possible causal
interpretations. If we label the variables A and B, A could cause B, B could cause A,
or some third variable (we will call it C) could cause both A and B. With the
relationship between height and weight in children, it is likely that the general growth
of children, which increases both height and weight, accounts for the observed
correlation. It is very foolish to assume that the presence of a correlation implies a
causal relationship between the two variables. There is an extended discussion of this
issue in Chapter 7 of the text.

Scatter Plots and Linear Relationships


A helpful way to visualize a relationship between two variables is to construct a
scatter plot, which you were briefly introduced to in our discussion of graphical
techniques. A scatter plot represents each set of paired scores on a two-dimensional
graph, in which the dimensions are defined by the variables. For example, if we
wanted to create a scatter plot of our sample of 100 children for the variables of height
and weight, we would start by drawing the X and Y axes, labeling one height and the
other weight, and marking off the scales so that the range on these axes is sufficient to
handle the range of scores in our sample. Let's suppose that our first child is 27 inches
tall and 21 pounds. We would find the point on the weight axis that represents 21
pounds and the point on the height axis that represents 27 inches. Where these two
points cross, we would put a dot that represents the combination of height and weight
for that child, as shown in the figure below.

We then continue the process for all of the other children in our sample, which might
produce the scatter plot illustrated below.
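
If you want to experiment with this yourself, a minimal Python sketch might look like the one below. It assumes the matplotlib package is installed; the height and weight values are made up for illustration, except for the first child (27 inches, 21 pounds) from the example above.

```python
# A minimal sketch of building the height/weight scatter plot described above.
# Only the first pair (27, 21) comes from the example in the text; the rest
# are hypothetical values.
import matplotlib.pyplot as plt

height_in = [27, 24, 30, 33, 29, 35, 31, 26, 36, 28]   # heights in inches
weight_lb = [21, 17, 24, 28, 23, 31, 26, 19, 33, 22]   # weights in pounds

plt.scatter(height_in, weight_lb)
plt.xlabel("Height (inches)")
plt.ylabel("Weight (pounds)")
plt.title("Scatter plot of height versus weight")
plt.show()
```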

It is always a good idea to produce scatter plots for the correlations that you compute
as part of your research. Most will look like the scatter plot above, suggesting a linear
relationship. Others will show a distribution that is less organized and more scattered,
suggesting a weak relationship between the variables. But on rare occasions, a scatter
plot will indicate a relationship that is not a simple linear relationship, but rather
shows a complex relationship that changes at different points in the scatter plot. The
scatter plot below illustrates a nonlinear relationship, in which Y increases
as X increases, but only up to a point; after that point, the relationship reverses
direction. Using a simple correlation coefficient for such a situation would be a
mistake, because the correlation cannot capture accurately the nature of a nonlinear
relationship.

Pearson Product-Moment Correlation

The Pearson product-moment correlation was devised by Karl Pearson in 1895, and it
is still the most widely used correlation coefficient. The history behind
the mathematical development of this index is fascinating. Those interested in that
history can click on the link. But you need not know that history to understand how
the Pearson correlation works.
The Pearson product-moment correlation is an index of the degree of linear
relationship between two variables that are both measured on at least an interval scale
of measurement. The index is structured so that a correlation of 0.00 means that there
is no linear relationship, a correlation of +1.00 means that there is a perfect positive
relationship, and a correlation of -1.00 means that there is a perfect negative
relationship. As you move from zero to either end of this scale, the strength of the
relationship increases. You can think of the strength of a linear relationship as how
tightly the data points in a scatter plot cluster around a straight line. In a perfect
relationship, either negative or positive, the points all fall on a single straight line. We
will see examples of that later. The symbol for the Pearson correlation is a
lowercase r, which is often subscripted with the two variables. For example, r_xy would
stand for the correlation between the variables X and Y.

The Pearson product-moment correlation was originally defined in terms of Z-scores.
In fact, you can compute the product-moment correlation as the average cross-product
of the Z-scores, as shown in the first equation below. But that equation is difficult
to use for computation. The more commonly used equation now is the second
equation below. Although this equation looks much more complicated and looks like
it would be much more difficult to compute, in fact, this second equation is by far the
easier of the two to use if you are doing the computations with nothing but a
calculator.
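
In standard notation, the Z-score (definitional) form and the raw-score (computational) form are usually written as:

$$r_{xy} = \frac{\sum Z_X Z_Y}{N}$$

$$r_{xy} = \frac{N\sum XY - \left(\sum X\right)\left(\sum Y\right)}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

where N is the number of pairs of scores and the Z-scores are computed using N (rather than N - 1) in the standard deviation.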

You can learn how to compute the Pearson product-moment correlation either by hand
or using SPSS for Windows by clicking on one of the buttons below. Use the browser's
return arrow key to return to this page.
Spearman Rank-Order Correlation

The Spearman rank-order correlation provides an index of the degree of linear
relationship between two variables that are both measured on at least an ordinal scale
of measurement. If one of the variables is on an ordinal scale and the other is on an
interval or ratio scale, it is always possible to convert the interval or ratio scale to an
ordinal scale. That process is discussed in the section showing you how to compute
this correlation by hand.

The Spearman correlation has the same range as the Pearson correlation, and the
numbers mean the same thing. A zero correlation means that there is no relationship,
whereas correlations of +1.00 and -1.00 mean that there are perfect positive and
negative relationships, respectively. The formula for computing this correlation is
shown below. Traditionally, the lowercase r with a subscript s is used to designate the
Spearman correlation (i.e., r_s). The one term in the formula that is not familiar to you
is d, which is equal to the difference in the ranks for the two variables. This is
explained in more detail in the section that covers the manual computation of the
Spearman rank-order correlation.
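
In standard notation, the Spearman formula is usually written as:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2 - 1)}$$

where n is the number of pairs of ranks.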

The Phi Coefficient

The Phi coefficient is an index of the degree of relationship between two variables
that are measured on a nominal scale. Because variables measured on a nominal scale
are simply classified by type, rather than measured in the more general sense, there is
no such thing as a linear relationship. Nevertheless, it is possible to see if there is a
relationship. For example, suppose you want to study the relationship between
religious background and occupation. You have a classification system for religion
that includes Catholic, Protestant, Muslim, Other, and Agnostic/Atheist. You have also
developed a classification for occupations that includes Unskilled Laborer, Skilled
Laborer, Clerical, Middle Manager, Small Business Owner, and Professional/Upper
Management. You want to see if the distribution of religious preferences differs by
occupation, which is just another way of saying that there is a relationship between
these two variables.
The Phi Coefficient is not used nearly as often as the Pearson and Spearman
correlations. Therefore, we will not be devoting space here to the computational
procedures. However, interested students can consult advanced statistics textbooks for
the details. You can compute Phi easily as one of the options in
the crosstabs procedure in SPSS for Windows. Click on the button below to see how.
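For reference, in the simplest case of two dichotomous variables arranged in a 2 x 2 table with cell frequencies a, b, c, and d, Phi is usually written as:

$$\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$$

which, for a 2 x 2 table, is equivalent to $\sqrt{\chi^2 / N}$.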
Advanced Correlational Techniques

Correlational techniques are immensely flexible and can be extended dramatically to
solve various kinds of statistical problems. Covering the details of these advanced
correlational techniques is beyond the scope of this text and website. However, we
have included brief discussions of several advanced correlational techniques on
the Student Resource Website, including multidimensional scaling, path analysis,
taxonomic search techniques, and statistical analysis of neuroimages.
Nonlinear Correlational Procedures

The vast majority of correlational techniques used in psychology are linear
correlations. However, there are times when one can expect to find nonlinear
relationships and would like to apply statistical procedures to capture such complex
relationships. This topic is far too complex to cover here. The interested student will
want to consult advanced statistical textbooks that specialize in regression analyses.
There are two words of caution that we want to state about using such nonlinear
correlational procedures. Although it is relatively easy to do the computations using
modern statistical software, you should not use these procedures unless you actually
understand them and their pitfalls. It is easy to misuse the techniques and to be fooled
into believing things that are not true from a naive analysis of the output of computer
programs.
The second word of caution is that there should be a strong theoretical reason to
expect a nonlinear relationship if you are going to use nonlinear correlational
procedures. Many psychophysiological processes are by their nature nonlinear, so
using nonlinear correlations in studying those processes makes complete sense. But
for most psychological processes, there is no good theoretical reason to expect a
nonlinear relationship.

Looking for Relationships in the Data

When there are two series of data, there are a number of statistical measures that can be used to capture how the series move together over time.

Correlations and Covariances

The two most widely used measures of how two variables move together (or do not) are the correlation and the covariance. For two data series, X (X1, X2, ...) and Y (Y1, Y2, ...), the covariance provides a measure of the degree to which they move together and is estimated by taking the product of the deviations from the mean for each variable in each period.
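In standard notation, this estimate is usually written as:

$$\sigma_{XY} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$

(some texts divide by n rather than n - 1).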

The sign on the covariance indicates the type of relationship the two variables have. A positive sign indicates that they move together and a negative sign that they move in opposite directions. Although the covariance increases with the strength of the relationship, it is still relatively difficult to draw judgments on the strength of the relationship between two variables by looking at the covariance, because it is not standardized.

The correlation is the standardized measure of the relationship between two variables. It can be computed from the covariance:
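
$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \, \sigma_Y}$$

where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y.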

The correlation can never be greater than one or less than negative one. A correlation close to zero indicates that the two variables are unrelated. A positive correlation indicates that the two variables move together, and the relationship is stronger as the correlation gets closer to one. A negative correlation indicates the two variables move in opposite directions, and that relationship gets stronger as the correlation gets closer to negative one. Two variables that are perfectly positively correlated (ρXY = 1) essentially move in perfect proportion in the same direction, whereas two variables that are perfectly negatively correlated move in perfect proportion in opposite directions.
Regressions

A simple regression is an extension of the correlation/covariance concept. It attempts to explain one variable, the dependent variable, using the other variable, the independent variable.

Scatter Plots and Regression Lines

Keeping with statistical tradition, let Y be the dependent variable and X be the independent variable. If the two variables are plotted against each other with each pair of observations representing a point on the graph, you have a scatter plot, with Y on the vertical axis and X on the horizontal axis. Figure A1.3 illustrates a scatter plot.

Figure A1.3: Scatter Plot of Y versus X

In a regression, we attempt to fit a straight line through the points that best fits the data. In its simplest form, this is accomplished by finding a line that minimizes the sum of the squared deviations of the points from the line. Consequently, it is called an ordinary least squares (OLS) regression. When such a line is fit, two parameters emerge: one is the point at which the line cuts through the Y axis, called the intercept of the regression, and the other is the slope of the regression line:

Y = a + bX
The slope (b) of the regression measures both the direction and the magnitude of the relationship between the dependent variable (Y) and the independent variable (X). When the two variables are positively correlated, the slope will also be positive, whereas when the two variables are negatively correlated, the slope will be negative. The magnitude of the slope of the regression can be read as follows: for every unit increase in the independent variable (X), the dependent variable (Y) will change by b (the slope).
Estimating Regression Parameters

Although there are statistical packages that allow us to input data and get the regression parameters as output, it is worth looking at how they are estimated in the first place. The slope of the regression line is a logical extension of the covariance concept introduced in the last section. In fact, the slope is estimated using the covariance:
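
In standard notation:

$$b = \frac{\sigma_{XY}}{\sigma_X^2}$$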

The intercept (a) of the regression can be read in a number of ways. One interpretation is that it is the value that Y will have when X is zero. Another is more straightforward and is based on how it is calculated: it is the difference between the average value of Y and the slope-adjusted value of X.
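In standard notation:

$$a = \bar{Y} - b\,\bar{X}$$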

Regression parameters are always estimated with some error or statistical noise, partly because the relationship between the variables is not perfect and partly because we estimate them from samples of data. This noise is captured in a couple of statistics. One is the R² of the regression, which measures the proportion of the variability in the dependent variable (Y) that is explained by the independent variable (X). It is also a direct function of the correlation between the variables:
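
For a simple regression, this is usually written as:

$$R^2 = \rho_{XY}^2$$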

An R² value close to one indicates a strong relationship between the two variables, though the relationship may be either positive or negative. Another measure of noise in a regression is the standard error, which measures the spread around each of the two parameters estimated, the intercept and the slope. Each parameter has an associated standard error, which is calculated from the data:
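Under the usual OLS assumptions, these standard errors are typically written as:

$$SE_b = \frac{s_e}{\sqrt{\sum (X_i - \bar{X})^2}}, \qquad SE_a = s_e\sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}}$$

where $s_e$ is the standard error of the regression (the standard deviation of the residuals).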

If we make the additional assumption that the intercept and slope estimates are normally distributed, the parameter estimate and the standard error can be combined to get a t statistic that measures whether the relationship is statistically significant.

t statistic for intercept = a / SEa
t statistic for slope = b / SEb

For samples with more than 120 observations, a t statistic greater than 1.96 indicates that the variable is significantly different from zero with 95% certainty, whereas a t statistic greater than 2.33 indicates the same with 99% certainty. For smaller samples, the t statistic has to be larger to have statistical significance. [1]
Using Regressions

Although regressions mirror correlation coefficients and covariances in showing the strength of the relationship between two variables, they also serve another useful purpose. The regression equation described in the last section can be used to estimate predicted values for the dependent variable, based on assumed or actual values for the independent variable. In other words, for any given X, we can estimate what Y should be:

Y = a + b(X)

How good are these predictions? That will depend entirely on the strength of the relationship measured in the regression. When the independent variable explains a high proportion of the variation in the dependent variable (R² is high), the predictions will be precise. When the R² is low, the predictions will have a much wider range.
From Simple to Multiple Regressions

The regression that measures the relationship between two variables becomes a multiple regression when it is extended to include more than one independent variable (X1, X2, X3, X4, ...) in trying to explain the dependent variable Y. Although the graphical presentation becomes more difficult, the multiple regression yields output that is an extension of the simple regression.

Y = a + bX1 + cX2 + dX3 + eX4

The R² still measures the strength of the relationship, but an additional R² statistic called the adjusted R² is computed to counter the bias that causes R² to keep increasing as more independent variables are added to the regression. If there are k independent variables in the regression, the adjusted R² is computed as follows:
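
In standard notation:

$$\text{Adjusted } R^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}$$

where n is the number of observations.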

Multiple regressions are powerful tools that allow us to examine the determinants of any variable.
Regression Assumptions and Constraints

Both the simple and multiple regressions described in this section also assume linear relationships between the dependent and independent variables. If the relationship is not linear, we have two choices. One is to transform the variables by taking the square, square root, or natural log (for example) of the values and hope that the relationship between the transformed variables is more linear. The other is to run nonlinear regressions that attempt to fit a curve (rather than a straight line) through the data.
There are implicit statistical assumptions behind every multiple regression that we ignore at our own peril. For the coefficients on the individual independent variables to make sense, the independent variables need to be uncorrelated with each other, a condition that is often difficult to meet. When independent variables are correlated with each other, the statistical hazard that is created is called multicollinearity. In its presence, the coefficients on independent variables can take on unexpected signs (positive instead of negative, for instance) and unpredictable values. There are simple diagnostic statistics that allow us to measure how far the data may be deviating from our ideal.
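
One common diagnostic is the variance inflation factor (VIF). A minimal sketch of computing it, assuming Python with the pandas and statsmodels packages and using made-up data for three hypothetical independent variables, might look like this:

```python
# A minimal sketch: checking multicollinearity with variance inflation factors (VIFs).
# The three independent variables below are made-up data for illustration only.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "X1": [1.2, 2.3, 3.1, 4.8, 5.0, 6.7, 7.4, 8.1],
    "X2": [2.0, 2.1, 3.9, 4.2, 5.5, 6.0, 6.8, 7.9],
    "X3": [7.1, 6.3, 5.8, 4.4, 3.2, 2.9, 2.0, 1.5],
})
X = sm.add_constant(X)  # add the intercept column, as in a regression design matrix

# A VIF well above roughly 5-10 suggests a variable is highly correlated
# with the other independent variables.
for i, name in enumerate(X.columns):
    if name == "const":
        continue  # the VIF of the intercept column is not meaningful
    print(name, round(variance_inflation_factor(X.values, i), 2))
```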

Correlation Types
Correlation is a measure of association between two variables. The variables are not designated as
dependent or independent. The two most popular correlation coefficients are: Spearman's correlation
coefficient rho and Pearson's product-moment correlation coefficient.
When calculating a correlation coefficient for ordinal data, select Spearman's technique. For interval or
ratio-type data, use Pearson's technique.
The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a
perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero
means there is no relationship between the two variables. When there is a negative correlation between
two variables, as the value of one variable increases, the value of the other variable decreases, and vice
versa. In other words, for a negative correlation, the variables work opposite each other. When there is a
positive correlation between two variables, as the value of one variable increases, the value of the other
variable also increases. The variables move together.
The standard error of a correlation coefficient is used to determine the confidence intervals around a true
correlation of zero. If your correlation coefficient falls outside of this range, then it is significantly different
than zero. The standard error can be calculated for interval or ratio-type data (i.e., only for Pearson's
product-moment correlation).
The significance (probability) of the correlation coefficient is determined from the t-statistic. The probability
of the t-statistic indicates whether the observed correlation coefficient occurred by chance if the true
correlation is zero. In other words, it asks if the correlation is significantly different than zero. When the t-statistic is calculated for Spearman's rank-difference correlation coefficient, there must be at least 30
cases before the t-distribution can be used to determine the probability. If there are fewer than 30 cases,
you must refer to a special table to find the probability of the correlation coefficient.
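For Pearson's correlation, that t-statistic is typically computed as:

$$t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

with n - 2 degrees of freedom, where n is the number of pairs.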
Example
A company wanted to know if there is a significant relationship between the total number of salespeople
and the total number of sales. They collect data for five months.
Variable 1    Variable 2
207           6907
180           5991
220           6810
205           6553
190           6190
--------------------------------
Correlation coefficient = .921
Standard error of the coefficient = .068
t-test for the significance of the coefficient = 4.100
Degrees of freedom = 3
Two-tailed probability = .0263
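
As an illustrative cross-check (not part of the original analysis), these figures can be reproduced with a few lines of Python, assuming SciPy is installed:

```python
# A quick cross-check of the example above using SciPy.
from scipy import stats

salespeople = [207, 180, 220, 205, 190]        # Variable 1
sales = [6907, 5991, 6810, 6553, 6190]         # Variable 2

r, p = stats.pearsonr(salespeople, sales)
print(round(r, 3))   # approximately .921
print(round(p, 4))   # approximately .0263 (two-tailed)
```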
Another Example
Respondents to a survey were asked to judge the quality of a product on a four-point Likert scale
(excellent, good, fair, poor). They were also asked to judge the reputation of the company that made the
product on a three-point scale (good, fair, poor). Is there a significant relationship between respondents'
perceptions of the company and their perceptions of the quality of the product?
Since both variables are ordinal, Spearman's method is chosen. The first variable is the rating for the
quality of the product. Responses are coded as 4=excellent, 3=good, 2=fair, and 1=poor. The second
variable is the perceived reputation of the company and is coded 3=good, 2=fair, and 1=poor.
Variable 1    Variable 2
-------------------------------------------
Correlation coefficient rho = .830
t-test for the significance of the coefficient = 3.332
Number of data pairs = 7
Probability must be determined from a table because of the small sample size.

Regression
Simple regression is used to examine the relationship between one dependent and one independent
variable. After performing an analysis, the regression statistics can be used to predict the dependent

variable when the independent variable is known. Regression goes beyond correlation by adding
prediction capabilities.
People use regression on an intuitive level every day. In business, a well-dressed man is thought to be
financially successful. A mother knows that more sugar in her children's diet results in higher energy
levels. The ease of waking up in the morning often depends on how late you went to bed the night before.
Quantitative regression adds precision by developing a mathematical formula that can be used for
predictive purposes.
For example, a medical researcher might want to use body weight (independent variable) to predict the
most appropriate dose for a new drug (dependent variable). The purpose of running the regression is to
find a formula that fits the relationship between the two variables. Then you can use that formula to
predict values for the dependent variable when only the independent variable is known. A doctor could
prescribe the proper dose based on a person's body weight.
The regression line (known as the least squares line) is a plot of the expected value of the dependent
variable for all values of the independent variable. Technically, it is the line that "minimizes the squared
residuals". The regression line is the one that best fits the data on a scatterplot.
Using the regression equation, the dependent variable may be predicted from the independent variable.
The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is the
point on the y axis where the regression line would intercept the y axis. The slope and y intercept are
incorporated into the regression equation. The intercept is usually called the constant, and the slope is
referred to as the coefficient. Since the regression model is usually not a perfect predictor, there is also an
error term in the equation.
In the regression equation, y is always the dependent variable and x is always the independent variable.
Here are three equivalent ways to mathematically describe a linear regression model.
y = intercept + (slope x) + error
y = constant + (coefficient x) + error
y = a + bx + e
The significance of the slope of the regression line is determined from the t-statistic. It is the probability
that the observed slope occurred by chance if the true slope is zero. Some
researchers prefer to report the F-ratio instead of the t-statistic. The F-ratio is equal to the t-statistic
squared.
The t-statistic for the significance of the slope is essentially a test to determine if the regression model
(equation) is usable. If the slope is significantly different than zero, then we can use the regression model
to predict the dependent variable for any value of the independent variable.

On the other hand, take an example where the slope is zero. It has no prediction ability because for every
value of the independent variable, the prediction for the dependent variable would be the same. Knowing
the value of the independent variable would not improve our ability to predict the dependent variable.
Thus, if the slope is not significantly different than zero, don't use the model to make predictions.
The coefficient of determination (r-squared) is the square of the correlation coefficient. Its value may vary
from zero to one. It has the advantage over the correlation coefficient in that it may be interpreted directly
as the proportion of variance in the dependent variable that can be accounted for by the regression
equation. For example, an r-squared value of .49 means that 49% of the variance in the dependent
variable can be explained by the regression equation. The other 51% is unexplained.
The standard error of the estimate for regression measures the amount of variability in the points around
the regression line. It is the standard deviation of the data points as they are distributed around the
regression line. The standard error of the estimate can be used to develop confidence intervals around a
prediction.
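For a simple regression, the standard error of the estimate is typically computed as:

$$s_{est} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$$

where $\hat{y}_i$ is the value predicted by the regression line.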
Example
A company wants to know if there is a significant relationship between its advertising expenditures and its
sales volume. The independent variable is advertising budget and the dependent variable is sales
volume. A lag time of one month will be used because sales are expected to lag behind actual advertising
expenditures. Data was collected for a six month period. All figures are in thousands of dollars. Is there a
significant relationship between advertising budget and sales volume?
Indep. Var.    Depen. Var.
4.2            27.1
6.1            30.4
3.9            25.0
5.7            29.7
7.3            40.1
5.9            28.8
--------------------------------------------------
Model: y = 9.873 + (3.682 x) + error
Standard error of the estimate = 2.637
t-test for the significance of the slope = 3.961
Degrees of freedom = 4
Two-tailed probability = .0149
r-squared = .807

You might make a statement in a report like this: A simple linear regression was performed on six months
of data to determine if there was a significant relationship between advertising expenditures and sales
volume. The t-statistic for the slope was significant at the .05 critical alpha level, t(4)=3.96, p=.015. Thus,
we reject the null hypothesis and conclude that there was a positive significant relationship between
advertising expenditures and sales volume. Furthermore, 80.7% of the variability in sales volume could be
explained by advertising expenditures.
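
For readers working outside of a dedicated statistics package, a minimal Python sketch of the same kind of analysis (assuming SciPy is installed) might look like this:

```python
# A minimal sketch reproducing the advertising/sales regression above.
from scipy import stats

advertising = [4.2, 6.1, 3.9, 5.7, 7.3, 5.9]    # independent variable (thousands of $)
sales = [27.1, 30.4, 25.0, 29.7, 40.1, 28.8]    # dependent variable (thousands of $)

result = stats.linregress(advertising, sales)
print("intercept (a):", round(result.intercept, 3))   # about 9.87
print("slope (b):", round(result.slope, 3))           # about 3.68
print("r-squared:", round(result.rvalue ** 2, 3))
print("two-tailed p for the slope:", round(result.pvalue, 4))

# Use the fitted equation to predict sales for a hypothetical budget of 5.0
predicted = result.intercept + result.slope * 5.0
print("predicted sales volume:", round(predicted, 1))
```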
