Advanced Statistics

Paolo Coletti, A.Y. 2010/11, Free University of Bolzano Bozen

Table of Contents

1. Statistical inference
   1.1 Population and sampling
2. Data organization
   2.1 Variables' measure
   2.2 SPSS
   2.3 Data description
3. Statistical tests
   3.1 Example
   3.2 Null and alternative hypothesis
   3.3 Type I and type II error
   3.4 Significance
   3.5 Accept and reject
   3.6 Tails and critical regions
   3.7 Parametric and non-parametric test
   3.8 Prerequisites
4. Tests
   4.1 Student's t test for one variable
   4.2 Student's t test for two populations
   4.3 Student's t test for paired data
   4.4 F test
   4.5 One-way analysis of variance (ANOVA)
   4.6 Jarque-Bera test
   4.7 Kolmogorov-Smirnov test
   4.8 Sign test
   4.9 Mann-Whitney (Wilcoxon rank sum) test
   4.10 Wilcoxon signed rank test
   4.11 Kruskal-Wallis test
   4.12 Pearson's correlation coefficient
   4.13 Spearman's rank correlation coefficient
   4.14 Multinomial experiment
5. Which test to use?
6. Regression model
   6.1 The least squares approach
   6.2 Statistical inference
   6.3 Multivariate and non-linear regression model
   6.4 Multivariate statistical inference
   6.5 Qualitative independent variables
   6.6 Qualitative dependent variable
   6.7 Problems of regression models

1. Statistical inference
Statistics is the science of data. This involves collecting, classifying, summarizing, organizing, analyzing, and interpreting numerical information. A population is a set of units (usually people, objects, transactions, etc.) that we are interested in studying. A sample is a subset of units of a population, whose elements are called cases or, when dealing with people, subjects. A statistical inference is an estimate, prediction, or some other generalization about a population based on information contained in a sample.
For example, we may introduce a variable $X$ which models the temperature at midday in January. Clearly this is a random variable, since the temperature fluctuates randomly day by day and, moreover, the temperatures of future days cannot even be determined now. However, from this random variable we have data, measurements done in the past. In statistics people deal with observations or, in other words, realizations $X_1, X_2, \dots, X_n$ of a random variable $X$. That is, each $X_i$ is a random variable that has the same probability distribution as its originating random variable $X$. It characterizes the $i$-th performance of the stochastic experiment determined by the random variable $X$. Given this information, we want to characterize the distribution of $X$ or some of its characteristics, like the expected value. In the simplest cases, we can even establish the shape of the distribution via theoretical considerations and then try to estimate its parameters from the data.
In other words, statistical inference concerns the problem of inferring properties of an unknown distribution from data generated by that distribution. The most common type of inference involves approximating the unknown distribution by choosing a distribution from a restricted family of distributions. Generally the restricted family of distributions is specified parametrically. For the temperature example we can assume that $X$ is normally distributed with a known variance $\sigma^2$ and an expected value to be determined. Among all normal distributions with this variance we want to find the one which is the most likely candidate for having produced the finite sequence $X_1, X_2, \dots, X_n$ of temperatures observed in the past days.
When making inference about the parameters of a distribution, people deal with statistics or estimates. Any function $T(X_1, X_2, \dots, X_n)$ of the observations is called a statistic. For example, the sample mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
is a common statistic, typically used to estimate the expected value. The sample variance
$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
is another useful estimate. Being a function of random variables, a statistic is a random variable itself. Consequently we may, and will, talk about its distribution.
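The two statistics just mentioned can be sketched in a few lines of Python. This is only an illustration: the temperature values are invented, the function names are ours, and the sample variance is assumed to use the usual n - 1 denominator.

```python
from math import fsum

def sample_mean(observations):
    """Sample mean: the sum of the observations divided by their number."""
    return fsum(observations) / len(observations)

def sample_variance(observations):
    """Sample variance with the n - 1 denominator (an assumption of this sketch)."""
    mean = sample_mean(observations)
    return fsum((x - mean) ** 2 for x in observations) / (len(observations) - 1)

# Hypothetical midday temperatures, not data from the text.
temperatures = [1.5, -0.5, 2.0, 0.0, 3.0]

mean_temperature = sample_mean(temperatures)
variance_temperature = sample_variance(temperatures)
print(mean_temperature, variance_temperature)
```

A different sample of five days would give different values of both statistics, which is exactly what is meant by saying that a statistic is itself a random variable.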

1.1 Population and sampling
A statistical research can analyze data from the entire population or only from a sample. The population is the set of all objects for which we want to infer information or relations. When data on the entire population are available, the data set is complete and statistical research simply describes the situation, without pursuing any other objective and without using any statistical test. When data are instead available only on a sample, a subset of the population, statistical research analyzes whether information and relations found on the sample can be extended to the entire population from which the sample comes, or whether they are valid only for that particular sample choice.
Therefore, sample choice is a very important and delicate issue in statistical research. Many statistical methods let us extend results (estimates or test results) found on the sample to the population, provided that the sample is a random sample, i.e. a sample whose elements are randomly extracted from the population without any influence from the researcher, from previously taken sample elements or from other factors. Building such a sample, however, is a difficult task since a perfectly random selection is almost a utopia. For example, any random sampling on people will necessarily include people who are unwilling to give information, who have disappeared, and who lie; these people cannot be excluded nor replaced with others, because otherwise the sample would not be random anymore. A previously random sample with excluded elements can skew the estimates: in our example, problematic people are typically old and with low education, thus unbalancing our sample in favor of young and educated subjects.
A common strategy to build a sample which behaves like a random sample is stratified sampling. With this method, the sample is chosen respecting the proportions of the variables which are believed to be important for the analysis and able to influence its results. For example, if we analyze people we should take care to build a sample which reflects the sex proportions, the age, education and income distributions, the residence (towns, suburbs, countryside) proportions, etc. In this way, the sample will reflect the population exactly, at least as far as the considered variables are concerned. Whenever a person is not available for answering, we replace him with another one with the same values of these variables. Obviously these variables must be chosen with care and with a look at previous studies on the same topic, balancing their number: too few variables will create a badly stratified sample, while too many will make sample creation and people's substitution very difficult.
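As a rough illustration of respecting the proportions, the following Python sketch allocates a fixed sample size among strata proportionally to their size in the population. The residence counts are hypothetical, the function name is ours, and the largest-remainder rounding is just one reasonable choice, not something prescribed by the text.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size among strata proportionally to their population size."""
    total = sum(stratum_sizes.values())
    # Ideal, possibly fractional, number of cases for each stratum.
    quotas = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    # Start from the integer part of each quota.
    allocation = {name: int(quota) for name, quota in quotas.items()}
    # Hand out the cases still missing to the strata with the largest remainders.
    leftover = sample_size - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda name: quotas[name] - allocation[name], reverse=True)
    for name in by_remainder[:leftover]:
        allocation[name] += 1
    return allocation

# Hypothetical population counts by residence.
population = {"towns": 5200, "suburbs": 3100, "countryside": 1700}
allocation = proportional_allocation(population, 200)
print(allocation)
```

With several stratification variables at once (sex, age, education, ...) the strata become the combinations of their categories, which is why too many variables quickly make the sample hard to build.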
Another aspect is the sample size. Obviously, the larger the sample the better. However, this relation is not direct, i.e. doubling the sample size does not yield doubly better results. The relation in many statistical tests goes approximately like $\sqrt{n}$, which means that we need to quadruple the sample size to get doubly better results. In any case, it is much more important to have a random or well stratified sample rather than a numerous sample. Quality is much better than quantity.
A common mistake related to sample size is supposing that it should be proportional to the population. This is, at least for all the tests analyzed in this book, false: for large populations, test results depend only on the absolute size of the sample and not on its proportion. Thus, a population of 1000 with a sample of 20 does not yield better results compared to a population of 5000 with a sample of 20.
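The square-root relation can be made concrete with the standard error of a sample mean, which is the standard deviation divided by the square root of n. The sketch below is an illustration under the assumption of a hypothetical standard deviation of 10; note that the population size does not appear anywhere in the formula.

```python
from math import sqrt

def standard_error(standard_deviation, n):
    """Standard error of the sample mean for a sample of n cases."""
    return standard_deviation / sqrt(n)

standard_deviation = 10.0  # hypothetical value, chosen only for illustration

for n in (100, 200, 400):
    print(n, round(standard_error(standard_deviation, n), 3))

# Quadrupling the sample size halves the standard error,
# while doubling it only reduces the error by a factor sqrt(2).
ratio_quadruple = standard_error(standard_deviation, 100) / standard_error(standard_deviation, 400)
ratio_double = standard_error(standard_deviation, 100) / standard_error(standard_deviation, 200)
```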


2. Data organization
2.1 Variables' measure
In a statistical research we face basically three types of variables:

• scale variables are fully numerical variables with an intrinsic mathematical meaning. For example, a temperature or a length are scale variables since they are numeric and any mathematical operation on these variables makes sense. Also a count is a numerical variable, even though it has restrictions (it cannot be negative and is an integer), because it makes sense to perform mathematical operations on it. However, numerical codes such as phone numbers or identification codes are not scale variables even though they seem numeric, since no mathematical operation makes sense on them and the number is used only as a code;
• nominal variables represent categories such as sex, nationality, degree course, plant's type. These variables divide the population into groups. Variables such as identification number are nominal since they divide the sample into categories, even though each case is a single category;
• ordinal variables are midway between nominal and scale variables. They represent categories which do not have a mathematical meaning (even though many times categories are identified by numbers, such as in a questionnaire's answers) but these categories have an ordinal meaning, i.e. they can be put in order. Typical examples are questionnaire answers such as "very bad", "bad", "good", "very good", or sometimes time issues such as "first year", "second year", "third year".

Ordinal and nominal variables, often referred to as categorical, are used in SPSS in two ways: as variables by themselves, such as in multinomial experiments (see section 4.14) and, more often, as a way to split the sample into groups to perform tests on two or more populations, such as Student's t test for two populations (see section 4.2), ANOVA (see section 4.5), Mann-Whitney (see section 4.9) and Kruskal-Wallis (see section 4.11).
2.1.1 Grouping
It is also a common procedure to degrade scale variables to ordinal variables, arbitrarily fixing intervals or bins and grouping the cases into their appropriate bin. For example, an age variable expressed in years can be degraded to an ordinal variable dividing the subjects into "young", up to 25, "adult", from 26 to 50, "old", from 51 to 70, "very old", 71 and over. The new variables that we obtain are suited for different statistical tests, which opens up more possibilities. However, any grouping procedure reduces the information that we have, introducing arbitrary decisions in the data and possible biases. For example, if our sample has a very large count of people of age 26, the previous arbitrary choice of 25 as the limit for the "young" group has put many people, who are more similar to 25 year old people than to 50 year old people, into the "adult" group.
SPSS: Transform → Recode into Different Variables
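Outside SPSS, the same recoding can be sketched in Python as follows. The bins are the ones of the example above; the function name and the list of ages are hypothetical.

```python
from collections import Counter

def age_group(age):
    """Recode an age in years into the ordinal categories of the example."""
    if age <= 25:
        return "young"
    if age <= 50:
        return "adult"
    if age <= 70:
        return "old"
    return "very old"

# Hypothetical ages with many 26 year old subjects.
ages = [24, 25, 26, 26, 26, 26, 49, 50, 51, 70, 71]
groups = [age_group(age) for age in ages]
group_counts = Counter(groups)
print(group_counts)
```

The four subjects aged 26 end up in the same group as the 50 year old one, which is the arbitrary effect of the cut point discussed above.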

2.2 SPSS
SPSS means Statistical Package for Social Sciences and it is a program to organize statistical data and perform statistical research. SPSS organizes data in a sheet called Data View, which is a database table, more or less like Excel's tables. Each case is represented by a horizontal line and is very often identified by the first variable, which is an ID number. Variables instead use vertical columns. Unlike Excel and like database tables, the SPSS data table is extremely well structured and each variable has a lot of features. These features are found in the Variable View sheet:

• Name: feel free to use any meaningful name, but without special characters and without spaces. When data have many variables it is a good idea to indicate names as v_ followed by a number (it will be possible to indicate a human readable name later).
• Type: numeric is the most common type. String should be used only for completely free text, while categorical variables should be numeric with a number corresponding to each category (it will be possible to indicate a human readable name later); a common mistake is using a string variable for a categorical variable, with the consequence that SPSS will refuse to perform certain operations with that variable.
• Width and decimals.
• Label: this is the variable's label, which will appear in charts and tables instead of the variable's name.
• Values: this feature represents the association between values and categories. It is used for categorical variables, which, as said before, should use a number for each category. In this field value labels can be assigned, and in charts and tables these labels will appear instead of the numbers. Obviously scale variables should not receive value labels.
• Missing: whenever a variable's value is unknown for a certain case, a special numeric code should be used, traditionally a negative number (if the variable has only positive numbers) or the largest possible number such as 9999. If this number is inserted here among the missing values, SPSS will simply ignore that case whenever that variable is involved in any operation. It is also possible, in Data View, to clear the cell completely, and SPSS will indicate it with a dot, which is a system-missing number (same effect as a missing value).
• Measure: the variable's measure must be carefully indicated, since it will have implications on which operations may be done on the variable.
SPSS has four basic menus:

• Transform: this menu lets us build new variables or modify existing ones, usually working on a case by case basis, thus performing only horizontal operations. Very useful commands are:
  o compute, which builds a new variable, typically scale, using mathematical operations;
  o recode, which builds a new variable, typically categorical, using recoding;
• Data: this menu lets us rearrange the data in a more global way. Very useful commands are:
  o split, which splits the file using a nominal or ordinal variable in such a way as to be able to analyze it automatically in groups;
  o select, which lets us temporarily filter out some undesired cases;
  o weight, which lets us weight the cases using a variable whenever each case represents several cases with the same data (all the statistics will use a new sample size based on the weights);
• Analyze: this menu is the core of SPSS, with all the statistical tests and models;
• Graphs: this is the menu to create charts.

2.3 Data description
SPSS offers a variety of numerical and graphical tools to quickly describe data. The choice of the tool depends on the variable's measure:

SPSS: Analyze → Descriptive Statistics → Frequencies
Frequencies are indicated as a description for a single categorical variable, while for a scale variable the frequency table becomes too long and full of single cases. However, it is always a good idea to start any statistical research with frequencies for every variable, including scale ones, to spot data entry mistakes, which are very common in statistical data.

SPSS: Graphs → Chart Builder → Pie/Polar
A pie chart is indicated as a graph for a single nominal or ordinal variable.

SPSS: Graphs → Chart Builder → Bar
Bar charts are indicated as a graph for a single categorical variable. Using colors and three-dimensionality they also work for two or even three nominal and ordinal variables.

SPSS: Analyze → Descriptive Statistics → Descriptives
Descriptive statistics (mean, median, standard deviation, minimum, maximum, range, skewness, kurtosis) are indicated as a description for a single scale variable and usually do not make sense for categorical variables.

SPSS: Graphs → Chart Builder → Histogram
A histogram is indicated as a graph for a single scale variable. Variable values are grouped into bins for the variable's representation. The choice of binning influences the histogram.

SPSS: Graphs → Chart Builder → Boxplot
A boxplot is indicated as a graph for a single scale variable. The central line represents the median and the box represents the central 50% of the variable's distribution on the sample. Boxplots may also be used to compare the values of a scale variable by groups of a categorical variable.

SPSS: Analyze → Descriptive Statistics → Crosstabs
A contingency table (see section 4.14.2) is indicated as a description for two categorical variables.

SPSS: Analyze → Compare Means → Means
Means comparison is a way to compare the means of a scale variable for groups of a categorical variable, usually followed by Student's t test or ANOVA (see sections 4.2 and 4.5).

SPSS: Analyze → Correlate → Bivariate
Bivariate correlation (see sections 4.12 and 4.13) is a description for the linear relation between two scale variables.

SPSS: Graphs → Chart Builder → Scatter/Dot
A scatter plot is indicated as a graph for two scale variables.


3. Statistical tests
Statistical tests are inference tools which are able to tell us the probability with which results obtained on the sample can be extended to the population.
Every statistical test has these features:

• the null hypothesis H0 and its contradictory hypothesis H1. It is very important that these hypotheses are built without looking at the sample;
• a sample of observations $X_1, X_2, \dots, X_n$ and a population, to which we want to extend information and relations found on the sample;
• prerequisites, special assumptions which are necessary to perform the test. Among these assumptions there is always, even though we will not repeat it every time, that data must come from a random sample;
• the statistic $T(X_1, X_2, \dots, X_n)$, a function calculated on the data, whose value determines the result of the test;
• a statistic's distribution from which we can obtain the test's significance. When using statistical computer programs, significance is automatically provided by the program next to the statistic's value;
• significance, also called p-value, from which we can deduce whether to accept or reject the null hypothesis.

3.1 Example
In order to show all the elements of a statistical test, we run through a very simple example and we will, later, analyze the theoretical aspects of all the test's steps.
We want to study the age of Internet users. Age is a random variable for which we do not have any idea of the distribution nor of its parameters. However, we make the hypothesis that age is a continuous random variable with an expected value. We want to check whether the expected value is 35 years or not. We formulate the test's hypotheses:

H0: $E[\text{age}] = 35$
H1: $E[\text{age}] \neq 35$

Of this random variable the only thing we know are the observations on a random sample of
100 users, which are: 25;26;27;28;29;30;31;30;33;34;35;36;37;38;30;30;41;42;43;44;45;
46;47;48;49;50;51;52;20;54;55;56;57;20;20;20;30;31;32;33;34;35;36;37;38;39;40;41;
42;43;44;45;46;47;48;49;50;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;
38;39;40;35;36;37;35;36;37;35;36;37;35;36;37;35;36;37;35;36;37;35;36;37;35.
Now we calculate the age average on the sample, $\overline{\text{age}} = 36.2$, which is an estimate of the expected value. We compare this result with the 35 of the H0 hypothesis and we find a difference of +1.2. At this point, we ask ourselves whether this difference is large enough, implying that the expected value is not 35 and thus H0 must be rejected, or is small and can be caused by an unlucky choice of the sample, and therefore H0 must be accepted.
In a statistical research this conclusion cannot be drawn from a subjective decision on whether the difference is large or small. It is taken using formal arguments and therefore we must rely on this statistic function:
$$\frac{\overline{\text{age}} - \text{hypothesized expected value}}{\sqrt{\text{sample variance} / n}}$$
It is noteworthy to look at this statistic's numerator. When the average of age is very close to the hypothesized expected value, the statistic will be close to 0. On the other hand, when the two quantities are very different, compared to the sample standard deviation, the statistic is very large. The statistic's value is also influenced by the sample's number of elements $n$: the larger the sample, the larger the statistic.
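Summing up, considering that in our case the sample standard deviation is 8.57, the statistic is +1.40. The situation is therefore

[Figure: axis of the statistic's values, with a central zone "H0 probably true" around 0, a zone "H0 probably false" on each side, and the observed value +1.40 marked.]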
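The calculation can be reproduced with a short Python sketch. The ages are the 100 observations listed above as they appear in this text; the sample standard deviation is assumed to use the n - 1 denominator, and small rounding differences with respect to the quoted 8.57 are to be expected.

```python
from math import sqrt
from statistics import mean, stdev

# The 100 observations of the example.
ages = (
    [25, 26, 27, 28, 29, 30, 31, 30, 33, 34, 35, 36, 37, 38, 30, 30, 41, 42, 43, 44, 45]
    + [46, 47, 48, 49, 50, 51, 52, 20, 54, 55, 56, 57, 20, 20, 20]
    + [30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
    + [42, 43, 44, 45, 46, 47, 48, 49, 50, 20]
    + [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]
    + [38, 39, 40]
    + [35, 36, 37] * 7
    + [35]
)

hypothesized_value = 35
n = len(ages)
sample_average = mean(ages)
sample_sd = stdev(ages)  # n - 1 denominator, an assumption of this sketch

# Statistic: distance from the hypothesized value, measured in standard errors.
statistic = (sample_average - hypothesized_value) / (sample_sd / sqrt(n))
print(n, sample_average, round(sample_sd, 2), round(statistic, 2))
```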

At this point we ask ourselves where exactly the point is which separates the "H0 true" zone from the "H0 false" zone. To find it out, we calculate the probability of obtaining an even worse result than the one we have got now. The meaning of "worse" in this situation is "worse for H0", therefore any result larger than +1.40 or smaller than -1.40. We use the central limit theorem, which guarantees us that, if $n$ is large enough and if the hypothesized expected value is the real expected value of the distribution (i.e. H0 is true), our statistic has a standard normal distribution. In fact, the only reason why we have built this statistic instead of using directly the difference at the numerator is because we know the statistic's distribution. Therefore we know that the probability of getting a value larger than +1.40 or smaller than -1.40 is 16% (see footnote 1). This value is called significance or p-value.

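[Figure: standard normal density with the two tails, on the left of -1.4 and on the right of +1.4, highlighted.]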
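As an alternative to tables or Excel, the 16% can be checked with Python's standard library. This is a minimal sketch that takes the statistic's value 1.40 from the text and assumes, as above, a standard normal distribution under H0.

```python
from statistics import NormalDist

statistic = 1.40
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Area on the left of -1.40; by symmetry the area on the right of +1.40 is the same.
left_tail = standard_normal.cdf(-abs(statistic))
two_tailed_significance = 2 * left_tail

print(round(left_tail, 3), round(two_tailed_significance, 3))
```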

If significance is large it means that, supposing H0 to be true and taking another random sample, the probability of obtaining a worse result is large and therefore the result that we have obtained can be considered to be really close to 0, something which pushes us to accept the idea that H0 is true. When, instead, significance is small, it means that if we suppose that H0 is true we have a small probability of getting such a bad result, something which pushes us to believe that H0 is false. In the example's situation we have a significance of 16%, which usually is considered large (the chosen cut point is typically 5%) and therefore we accept H0.

Footnote 1: This value can be calculated through normal distribution tables or using the English Microsoft Excel function NORMDIST(-1.4;0;1;TRUE), which gives the area under the normal distribution on the left of -1.4, equal to 8%. The area on the right of +1.4 is obviously the same.
A slightly different method, which yields the same result, is fixing the cut point a priori, let's say 5%, and finding the corresponding critical value after which the statistic is in the rejection region. In our case, considering two areas of 2.5% on the left and on the right side, the critical value for a standard normal distribution is 1.96 (see footnote 2).

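[Figure: standard normal density with the two 2.5% tails, on the left of -1.96 and on the right of +1.96, highlighted.]

At this point the situation is

[Figure: axis of the statistic's values, with the zone "H0 probably true" between -1.96 and +1.96, a zone "H0 probably false" on each side, and the observed value +1.40 inside the central zone.]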
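The critical value method can be sketched in the same way. The 5% cut point and the statistic 1.40 come from the text; the variable names are ours.

```python
from statistics import NormalDist

significance_level = 0.05
statistic = 1.40

# Two-tailed test: 2.5% on each side, so we need the 97.5% quantile.
critical_value = NormalDist().inv_cdf(1 - significance_level / 2)

# The statistic is in the rejection region only if it lies beyond the critical values.
in_rejection_region = abs(statistic) > critical_value
print(round(critical_value, 2), in_rejection_region)
```

Since 1.40 lies between -1.96 and +1.96, the statistic is outside the rejection region and H0 is accepted, in agreement with the first method.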

The first method gives us an immediate and straightforward answer and in fact is the one typically used by computer programs. The second method instead is more suited for one-tailed tests and is easier to apply if a computer is not available.
An example of a one-tailed test is the situation when we want to check whether the expected value of the age is smaller or larger than 35. We write the hypotheses in this way:

H0: $E[\text{age}] \geq 35$
H1: $E[\text{age}] < 35$

In this case, the difference of +1.2 between the sample average and 35, since it is positive, leads us to strongly believe that H0 is true. In fact, now the situation of the statistic is different from before, i.e.

[Figure: axis of the statistic's values, with the zone "H0 probably false" on the far left and the zone "H0 probably true" covering the rest of the axis, including 0 and the observed value +1.40.]

In fact here we do not have any doubt, since the statistic's value falls right in the middle of the "H0 true" area.

Footnote 2: This value can be calculated through normal distribution tables or using the English Microsoft Excel function NORMINV(2.5%;0;1), which gives the critical value -1.96 for which the area under the normal distribution on the left of it is 2.5%. Due to the symmetry of the distribution, the critical value on the right is obviously +1.96.
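Writing however the hypotheses in this way:

H0: $E[\text{age}] \leq 35$
H1: $E[\text{age}] > 35$

the situation of the statistic is

[Figure: axis of the statistic's values, with the zone "H0 probably true" on the left and around 0, the zone "H0 probably false" on the far right, and the observed value +1.40 marked.]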
and here we have the same problem of determining whether +1.40 is close to 0 or far away from it. As usual, to determine it we have two methods. The first one calculates the probability of getting a worse result, where "worse" means "worse for H0". In this situation, however, a worse result is larger than +1.40, while results smaller than +1.40 are in favor of H0. The statistic is always distributed like a standard normal, under the hypothesis that H0 is true,

[Figure: standard normal density with the right tail, beyond +1.4, highlighted.]

and the area, thus the significance, is 8%. Using the second method the critical value is not 1.96 anymore, but 1.64 (see footnote 3). The critical region on the right is larger than before, since now the 5% is all concentrated on the right part.
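[Figure: axis of the statistic's values, with the zone "H0 probably true" on the left of +1.64, the zone "H0 probably false" on its right, and the observed value +1.40 inside the "H0 probably true" zone.]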
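Both one-tailed methods can be checked with a short sketch for the hypotheses H0: E[age] <= 35 against H1: E[age] > 35. The statistic 1.40 and the 5% level come from the text; a standard normal distribution under H0 is assumed, as above.

```python
from statistics import NormalDist

standard_normal = NormalDist()
statistic = 1.40
significance_level = 0.05

# First method: only results larger than the observed one are "worse for H0".
right_tail_significance = 1 - standard_normal.cdf(statistic)

# Second method: the whole 5% sits in the right tail.
right_critical_value = standard_normal.inv_cdf(1 - significance_level)

reject = statistic > right_critical_value
print(round(right_tail_significance, 3), round(right_critical_value, 2), reject)
```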

Footnote 3: This value can be calculated through normal distribution tables or using the English Microsoft Excel function NORMINV(5%;0;1), which gives the critical value -1.64 for which the area under the normal distribution on the left of it is 5%. Due to the symmetry of the distribution, the critical value on the right is obviously +1.64.


3.2 Null and alternative hypothesis
The heart of a statistical test is the null hypothesis H0, which represents the information that we are officially trying to extend from the sample to the population (see footnote 4). It is important that the null hypothesis gives us additional information, since we need to suppose it to be true and use its information to know the statistic's distribution. If, in the previous example, the null hypothesis had not given us the additional information that the real expected value is 35, we could not have used the fact that the statistic function is normally distributed. Therefore, the null hypothesis must always contain an equality, while strict inequalities are reserved for H1. When the test is one-tailed, we write the null hypothesis in the form of a non-strict inequality such as $E[\text{age}] \geq 35$ for practical purposes, but theoretically we should write the equality $E[\text{age}] = 35$ and simply not take into account the $E[\text{age}] > 35$ possibility.
For example, usable hypotheses are $E[X] = 35$ or "the distribution of $X$ is exponential" or even "$X$ and $Y$ are independent". On the other hand, hypotheses such as $E[X] \neq 35$ or "the distribution of $X$ is not exponential" are not acceptable. Also "$X$ and $Y$ are dependent" is not acceptable, since it does not provide us with any information on how they are dependent.
Together with the null hypothesis we always write the alternative hypothesis H1, which is the logical contradiction of the null hypothesis.

3.3 Type I and type II error
Once the statistic is calculated we must take a decision: accept H0 or reject H0. When H0 is rejected, we face one of the two following situations:

• the null hypothesis is really false and we rejected it: very good;
• the null hypothesis is really true and we rejected it: we committed a type I error.

If we accept H0, we face one of the two following situations:

• the null hypothesis is really false and we accepted it: we committed a type II error;
• the null hypothesis is really true and we accepted it: very good.

There are two different types of errors that we may commit when taking a decision after a statistical test, and it would be wonderful if we could reduce at the same time the probability of committing both errors. Unfortunately, the only method to reduce the probability of committing both errors is taking a large sample, ideally the entire population. This is clearly not feasible in many situations where gathering data is very expensive.
There is a method to reduce the probability of committing a type I error: rejecting only in the situations where H0 is evidently false. In this way a type I error will be very rare, since we are rejecting in very few situations. Unfortunately, if we reject with parsimony, we will accept very often and this means committing a lot of type II errors. The same thing happens if, vice versa, we reject too much: we will commit very few type II errors but many type I errors.
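The trade-off can be illustrated numerically. The sketch below computes the probability of a type II error for a two-tailed test on an expected value, assuming a normally distributed statistic and a known standard deviation; the "true" expected value of 37 is a purely hypothetical choice, while 35, 8.57 and 100 are the numbers of the example in section 3.1.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_probability(hypothesized, true_value, standard_deviation, n, significance_level):
    """Probability of accepting H0 when the expected value is really true_value."""
    standard_normal = NormalDist()
    critical_value = standard_normal.inv_cdf(1 - significance_level / 2)
    # When H0 is false the statistic is still normal, but centered on this shift.
    shift = (true_value - hypothesized) / (standard_deviation / sqrt(n))
    # We accept H0 when the statistic falls between the two critical values.
    return standard_normal.cdf(critical_value - shift) - standard_normal.cdf(-critical_value - shift)

beta_at_5_percent = type_ii_probability(35, 37, 8.57, 100, 0.05)
beta_at_1_percent = type_ii_probability(35, 37, 8.57, 100, 0.01)
beta_larger_sample = type_ii_probability(35, 37, 8.57, 400, 0.05)

print(round(beta_at_5_percent, 3), round(beta_at_1_percent, 3), round(beta_larger_sample, 3))
```

Under these assumptions, lowering the type I error probability from 5% to 1% raises the type II error probability, while enlarging the sample lowers it without touching the 5%.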
Footnote 4: As we will see later, it is instead H1 the information that we will be able to extend to the population, while, unfortunately, it is never possible to extend H0.

Thus, we must decide which error is the more severe one and try to concentrate on reducing the probability of committing it. Every statistical research concentrates on type I errors, trying to keep the probability of committing them under a significance level, usually 5% or 1%. Using an example drawn from a juridical situation:

H0: suspect deserves 0 years of prison (suspect is innocent)
H1: suspect deserves > 0 years of prison (suspect is guilty)

In this case, a type I error means convicting an innocent person, while a type II error means acquitting a guilty one. It is common belief that in this case a type I error should be avoided at all costs, while a type II error is acceptable.
The reason why statistical tests concentrate their attention on avoiding type I errors derives from the historical development of science, which takes as correct the current theories (H0) and tries to minimize the error of destroying, by mistake, a well established theory in favor of new theories (H1). It is therefore a conservative approach. For example:

H0: heart pumps blood
H1: heart does not pump blood

A type I error in this case would be a disaster, since it would mean rejecting the correct hypothesis that blood is pumped by the heart, giving us no other clue since H1 carries only a negative information.

3.4 Significance
Significance, or p-value, is the probability of committing a type I error if we decide to reject H0 whenever the statistic is at least as extreme as the value observed on our sample. This probability is calculated assuming that H0 is true and comparing the value of the statistic that we calculate on our sample's data with the statistic's distribution. A small significance means that if we reject we have only a small probability of committing a mistake, and therefore we will reject. A large significance means that if we reject we are facing a large probability of committing a mistake, and therefore we will accept H0.
Another equivalent definition for the significance is the probability of obtaining, taking another random sample, an equal or worse statistic's value under the hypothesis that H0 is true. A small significance means that the statistic's value is really bad and therefore we will reject H0. A large significance means that the statistic's value is much better than what we expected and therefore we will accept H0.
Since we try to minimize type I errors, we will fix a very small significance level under which the null hypothesis is rejected, usually 5% or 1%. In this way, the probability of a type I error is low and when we reject we are almost sure that H0 is really false.
Confidence is equal to 100% minus the significance.

3.5 Accept and reject
At the end of the statistical test we must decide whether to accept or reject:

• if significance is above the significance level (usually 5% or 1%), we accept H0;
• if significance is below the significance level, we reject H0.

It is very important to underline that when we reject we are almost sure that H0 is false, since we are keeping type I errors under a small significance level. However, when we accept we may not say that H0 is true, since we do not have any estimate of type II errors. Therefore, rejecting is a sure thing, while accepting is a "no answer", and from it we are not allowed to draw any conclusion.
This approach is called falsification, since we are only able to falsify H0 and never to prove it. If we need to prove that H0 is true, we must rewrite the hypotheses, putting the information we want to extend to the population into the H1 hypothesis instead, perform the test again and hope to reject.
Another important effect that we must underline is that of the sample size. When the sample size is extremely small, data are almost random and the probability of committing a type I error is very large. Therefore significance is very large and, using the traditional small significance levels, we will accept. A statistical test with few data thus automatically accepts everything, since it does not have enough data to prove that H0 is false. Again, accepting must never imply that H0 is true.
3.5.1 Paradox
Using the falsification approach we can, through a smart choice of null hypotheses, accept two contradictory null hypotheses. Using the sample of the previous example and formulating the hypotheses

H0: E[age] = 35
H1: E[age] ≠ 35

we accept E[age] = 35 with a significance level of 5%. Using instead these hypotheses

H0: E[age] = 36
H1: E[age] ≠ 36

we accept E[age] = 36 with a significance level of 5%. We have thus accepted two hypotheses which say different and contradictory things. This is only an apparent paradox, since accepting does not mean that they are true but only that they might be true. Therefore, for the population from which our sample is extracted, the expected value might be 35 or 36 (or many other close values, such as 35.3, 36.5, 37, etc.). This is due to the relatively small size of the sample; if we increase the sample size, the interval of values for which we accept shrinks.

3.6 Tails and critical regions
Statistical tests where the null hypothesis contains an equality and the alternative hypothesis an inequality (≠) are two-tailed tests. Statistical tests where the null hypothesis contains a non-strict inequality and the alternative hypothesis a strict inequality are one-tailed tests, such as

H0: E[age] ≤ 35
H1: E[age] > 35

The name of these tests comes from the number of critical regions. A critical region is an area such that the null hypothesis is rejected when the statistic's value falls in it, according to the second method that we have seen in example 3.1. The number of critical regions, which usually lie far away from the center of the distribution and are therefore called tails, determines the name of the test: two-tailed or one-tailed.
[Figure: two-tailed test, with a critical region in each tail, beyond −C and +C]

[Figure: one-tailed test with critical region on the right, beyond +C]

[Figure: one-tailed test with critical region on the left]

The point where the critical region starts is called the critical value and is usually calculated from tables of the statistic's distribution. In the two-tailed test the two regions are always symmetric, while for a one-tailed test we face the problem of determining on which side the rejection region lies.
In order to find where the critical region is in one-tailed tests, we see what happens if we have an extremely large positive value of the statistic. If such a value (which, being very large, is surely in the right tail) is not in favor of the null hypothesis, it means that the right tail is not in favor of the null hypothesis and therefore it is the rejection region. Otherwise, if this extremely large value of the statistic is in favor of the null hypothesis, the right tail is not a rejection region and the critical region is on the left. For example, we consider example 3.1

H0: E[age] ≤ 35
H1: E[age] > 35

and we use the same statistic

$$\frac{\text{average of age} - 35}{\sqrt{\text{sample variance}/n}}$$

where n is the sample size. When this statistic's value is positive and extremely large, it means that the average of age is much larger than the hypothesized expected value, and this is a clear indication that the real expected value is much larger than 35. This contradicts the null hypothesis, which says that the expected value must be smaller than or equal to 35. Therefore a positive value of the statistic, in the right tail, contradicts the null hypothesis, and this means that the right tail is a critical region.

[Figure: one-tailed test with the critical region in the right tail; +1.41 is marked on the axis]
Considering instead the hypotheses

H0: E[age] ≥ 35
H1: E[age] < 35,

when the statistic's value is positive and extremely large, it means that the average of age is much larger than the hypothesized expected value, and this is a clear indication that the real expected value is much larger than 35. This is exactly what the null hypothesis says. Therefore a positive value of the statistic, in the right tail, is in favor of the null hypothesis, and this means that the right tail is not a critical region. Therefore the critical region is on the left.

[Figure: one-tailed test with the critical region in the left tail; −1.41 is marked on the axis]
Some important features of critical values:

• decreasing the significance level moves the critical value away from 0. This is evident if we consider that, by decreasing the significance level, we are even more afraid of type I errors and therefore we reject with much more care, thus reducing the rejection zone;
• the critical value of a one-tailed test is always closer to 0 than the critical value of the corresponding two-tailed test. This is because the critical tail of a one-tailed test must contain the whole probability that in a two-tailed test is split between two regions, and therefore the zone must be larger;
• for each two-tailed test there are two corresponding one-tailed tests. One of them has the statistic's value on the side opposite to the rejection region, and therefore for this one we always accept. This is the reason why using the significance method to decide whether to accept or reject can be misleading for one-tailed tests, since it is not evident whether the test has an obvious accept verdict or not.
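The first two features can be checked numerically. The following is a minimal sketch in Python, assuming a statistic with a standard normal distribution (the large-sample case); the helper name `critical_values` is hypothetical and introduced only for illustration.

```python
from statistics import NormalDist

def critical_values(alpha):
    """Return (two-tailed, one-tailed) positive critical values of N(0; 1)."""
    z = NormalDist()  # assumption: the statistic is standard normal
    two_tailed = z.inv_cdf(1 - alpha / 2)  # alpha split between the two tails
    one_tailed = z.inv_cdf(1 - alpha)      # whole alpha in the right tail
    return two_tailed, one_tailed

for alpha in (0.05, 0.01):
    two, one = critical_values(alpha)
    print(f"level {alpha:.0%}: two-tailed ±{two:.3f}, one-tailed {one:.3f}")
```

Under this assumption the one-tailed value is closer to 0 than the two-tailed one at the same level, and both move away from 0 when the level goes from 5% to 1%.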

3.7 Parametric and non-parametric tests
There are parametric and non-parametric statistical tests. A parametric test implies that the distribution in question is known up to one or several parameters. For example, it is believed that many natural phenomena are normally distributed. Estimating $\mu$ and $\sigma$ of the phenomenon is a parametric statistical problem, because the shape of the distribution, a normal one, is known up to these two parameters. On the other hand, non-parametric tests do not rely on any underlying assumption about the probability distribution of the sampled population. For example, we may deal with a continuous distribution without specifying its shape.
Non-parametric tests are also appropriate when the data are non-numerical in nature but can be ranked, thus becoming rank tests. For example, when taste-testing foods we can say that we like product A better than product B, and B better than C, but we cannot obtain exact quantitative values for the respective measurements. Other examples are tests where the statistic is not calculated on the sample's values but on the relative positions of the values in their set.

3.8 Prerequisites
Each test, especially parametric ones, may have prerequisites which are necessary for the statistic to be distributed in a known way (and thus for us to calculate its significance).
A typical prerequisite for many parametric tests is that the sample comes from a certain distribution. To verify it:

• if data are not individual measures but averages of many data, the central limit theorem guarantees that they are approximately normally distributed;
• if data are measures of a natural phenomenon, they are often affected by random errors which are normally distributed;
• we can hypothesize that data come from a certain distribution if we have theoretical reasons to do so;
• we can plot the histogram of the data to get a hint of the original population's distribution, if the sample size is large enough;
• we can perform specific statistical tests to check the population's distribution, such as the Kolmogorov-Smirnov or Jarque-Bera tests for normality.

Every test has as a prerequisite that the sample is a random sample, even though we will not indicate it.

4. Tests
4.1 Student's t test for one variable
Prerequisites: variable normally distributed (if sample variance is used).
H0: expected value = m
Statistic:

$$\frac{\text{sample average} - m}{\sqrt{\text{population or sample variance}/n}}$$

where n is the sample size.
Statistic's distribution: Student's t with n − 1 degrees of freedom; when n > 30, standard normal.
SPSS: Analyze → Compare Means → One-Sample T Test

[Portrait: William "Student" Gosset (1876–1937)]

Student's t test is the one we have already seen in the example, in its large-sample version. It is a test which involves a single random variable and checks whether its expected value is m or not.
For example, taking m = 32 and a sample of 10 elements: 25; 26; 27; 28; 29; 30; 30; 31; 33; 34

H0: E[X] = 32
H1: E[X] ≠ 32

The sample average is 29.3 and the sample standard deviation is 2.91. The statistic is therefore (29.3 − 32)/(2.91/√10) = −2.94 and its significance is [5] 1.7%. H0 is rejected since 1.7% is below the significance level; this means that, extracting another sample of 10 elements from a distribution with an expected value equal to 32, we have a very small probability of getting such bad results. We can thus say that the expected value is not 32.
As we can easily see, Student's t test for one variable is exactly the test version of the confidence interval for the average.
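The calculation in this example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the helper names `t_pdf` and `t_right_tail` are hypothetical, and the tail area of Student's t distribution is obtained by numerical integration (Simpson's rule) rather than from a table or from Excel.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the density on [0, t] with Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

data = [25, 26, 27, 28, 29, 30, 30, 31, 33, 34]
m = 32  # hypothesized expected value
n = len(data)
average = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (divides by n - 1)
statistic = (average - m) / (std_dev / math.sqrt(n))
significance = 2 * t_right_tail(abs(statistic), n - 1)  # sum of the two tails

print(f"average={average:.1f} sd={std_dev:.2f} "
      f"statistic={statistic:.2f} significance={significance:.1%}")
```

The two tails are summed because the alternative hypothesis is two-sided; for a one-tailed test only one tail would be used.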

4.2 Student's t test for two populations
Prerequisites: two populations A and B; the variable must be normally distributed on both populations
H0: expected value on population A = expected value on population B
Statistic:

$$\frac{\text{sample A average} - \text{sample B average}}{\sqrt{\dfrac{(n_A - 1)\,\text{variance}_A + (n_B - 1)\,\text{variance}_B}{n - 2}\left(\dfrac{1}{n_A} + \dfrac{1}{n_B}\right)}}$$

where variance_A and variance_B are the sample or population variances on A and B, $n_A$ and $n_B$ are the two sample sizes and $n = n_A + n_B$.
Statistic's distribution: Student's t with n − 2 degrees of freedom; when n > 31, standard normal.
SPSS: Analyze → Compare Means → Means
SPSS: Analyze → Compare Means → Independent-Samples T Test

[5] Significance can be calculated in two ways. (1) Using Student's t distribution table. (2) Using the English Microsoft Excel function TDIST(2.94;9;2), which gives the sum of the areas of the two tails, the one on the left of −2.94 and the one on the right of +2.94.

This test is used whenever we have two populations and one variable measured on both, and we want to check whether the expected value of the variable differs between the populations.
For example, we want to test

H0: E[height] for male ≤ E[height] for female
H1: E[height] for male > E[height] for female

We take a sample of 10 males (180; 175; 160; 180; 175; 165; 185; 180; 185; 190) and 8 females (170; 175; 160; 160; 175; 165; 165; 180). We suppose that male and female heights are normally distributed with the same variance. The males' sample average is 177.5, while for females it is 168.75. The statistic's value is 2.18. Since it is a one-tailed test, we draw the graph to have a clear idea of where the statistic falls.

[Figure: distribution of the statistic, with the area "H0 probably true" and the statistic's value +2.18 marked]

If the statistic were extremely large, this would strongly contradict H0, and therefore the rejection region is on the right.

[Figure: distribution of the statistic, with "H0 probably true" to the left of the critical value +1.75 and "H0 probably false" to its right, where the statistic's value +2.18 falls]

The critical value for the one-tailed test is [6] 1.75 and therefore we reject. Using instead the significance method, after having checked that the statistic does not fall in the "H0 probably true" area, we get [7] a significance of 2.2% and therefore we reject, meaning that the male population has an expected height significantly larger than the female population.
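The statistic and the one-tailed significance of the height example can be sketched in Python as follows. The equal-variance (pooled) formula is the one assumed in the example; the helper names `t_pdf` and `t_right_tail` are hypothetical, and the tail area is computed by numerical integration instead of being read from a table.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """P(T > t) for t >= 0, integrating the density on [0, t] with Simpson's rule."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

males = [180, 175, 160, 180, 175, 165, 185, 180, 185, 190]
females = [170, 175, 160, 160, 175, 165, 165, 180]
n_a, n_b = len(males), len(females)

# Pooled variance: weighted average of the two sample variances.
pooled = ((n_a - 1) * statistics.variance(males)
          + (n_b - 1) * statistics.variance(females)) / (n_a + n_b - 2)
statistic = (statistics.mean(males) - statistics.mean(females)) / math.sqrt(
    pooled * (1 / n_a + 1 / n_b))
significance = t_right_tail(statistic, n_a + n_b - 2)  # right tail only

print(f"statistic={statistic:.2f} one-tailed significance={significance:.1%}")
```

Only the right tail is used here because the rejection region of this one-tailed test is on the right.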

[6] The critical value can be calculated in two ways. (1) Using the English Microsoft Excel function TINV(5%;16), which gives the critical value for the two-tailed test, with the probability split into 2.5% and 2.5%. For a one-tailed test the probability must be doubled, TINV(10%;16), since in this way it is split into 5% and 5%. (2) Using Student's t distribution table.

[7] Significance can be calculated in four ways. (1) Using one of the statistical t tests (Zweistichproben t-Test) in the Data Analysis ToolPak in Microsoft Excel, choosing among known variances (in this case the populations' variances have to be indicated explicitly), equal and unknown, or different and unknown (in these latter two cases the populations' variances are estimated automatically by Excel from the sample data), which gives the statistic's value and its significance. (2) Using the English Microsoft Excel function TTEST, which gives the significance directly from the data, choosing type=2 if we suppose equal variances or type=3 if we suppose different variances. (3) Using the English Microsoft Excel function TDIST(2.18;16;1), which gives the area of one of the two tails. (4) Using Student's t distribution table.

4.3 Student's t test for paired data
Prerequisites: two variables X and Y on the same population; X − Y must be normally distributed
H0: E[X] = E[Y], which means E[X − Y] = 0
The test can also be performed with null hypothesis H0: E[X − Y] = m
Statistic: we use X − Y as variable and we perform Student's t test for one variable
Statistic's distribution: same as Student's t test for one variable
SPSS: Analyze → Compare Means → Paired-Samples T Test
This test is used whenever we have a single population and two variables measured on it, and we want to check whether the expected values of these two variables are different.
For example, we want to test whether the population's income in a country has changed. We take the income of a sample of 10 people and then we take the income of the same 10 subjects the next year:

Income 2010    Income 2011    Difference
(thousands)    (thousands)    2010 − 2011
20             21             −1
23             23             0
34             36             −2
53             50             +3
43             40             +3
45             44             +1
36             12             +24
76             80             −4
44             45             −1
12             15             −3

Two things are very important here. The subjects must be exactly the same; no replacement is possible. When calculating the difference the sign is important, so it is a good idea to write clearly what is subtracted from what, especially for one-tailed tests.
Hypotheses are:

H0: E[income for 2010] − E[income for 2011] = 0
H1: E[income for 2010] − E[income for 2011] ≠ 0

The sample average of the difference is 2.0 and the sample standard deviation is 8.07. The statistic is 0.78 with [8] a significance of 45.3%. H0 is thus accepted. This does not mean that income has remained the same, but simply that our data are not able to prove that it has changed.
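The reduction of the paired test to a one-variable test on the differences can be sketched in Python. This is only an illustration of the arithmetic of the income example; the variable names are chosen here and are not part of the original text.

```python
import math
import statistics

income_2010 = [20, 23, 34, 53, 43, 45, 36, 76, 44, 12]
income_2011 = [21, 23, 36, 50, 40, 44, 12, 80, 45, 15]

# Same subjects in the same order: subtract 2011 from 2010, keeping the sign.
differences = [a - b for a, b in zip(income_2010, income_2011)]

n = len(differences)
average = statistics.mean(differences)
std_dev = statistics.stdev(differences)  # sample standard deviation
# One-variable t statistic on the differences, with hypothesized value 0.
statistic = (average - 0) / (std_dev / math.sqrt(n))

print(f"average={average:.1f} sd={std_dev:.2f} statistic={statistic:.2f}")
```

The statistic is then compared with Student's t distribution with n − 1 = 9 degrees of freedom, exactly as in the one-variable test.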

[8] Significance can be calculated in four ways. (1) With the formula of Student's t test for one variable, using m = 0. (2) Using the English Microsoft Excel function TTEST, which gives the significance directly from the data, choosing type=1. (3) Using the statistical t test (Zweistichproben t-Test bei abhängigen Stichproben) in the Data Analysis ToolPak in Microsoft Excel, which gives the statistic's value and its significance. (4) Using Student's t distribution table.

4.4 F test
Prerequisites: two populations A and B; the variable must be normally distributed on both populations
H0: Var on population A = Var on population B
Statistic: sample A variance / sample B variance
Statistic's distribution: Fisher's F distribution with $n_A - 1$ and $n_B - 1$ degrees of freedom, where $n_A$ and $n_B$ are the two sample sizes

[Portraits: George Waddel Snedecor (1881–1974); Ronald Fisher (1890–1962)]

The name of this test was coined by Snedecor in honor of Fisher. It compares the variances of two populations. It is interesting to note that, unlike in all the other tests, the statistic's best value for H0 is 1 and not 0. Since the F distribution is only positive and not symmetric, special care must be taken with the statistic's position when calculating the significance, since it can be misleading. In particular, the opposing value of the statistic is not the opposite but the reciprocal.
For example, supposing that height for male and female is normally distributed, we test

H0: Var[height] for male = Var[height] for female
H1: Var[height] for male ≠ Var[height] for female.

We use the previous sample and we get a sample variance of 84.7 for male and 55.4 for female. The statistic is thus 84.7/55.4 = 1.53. Degrees of freedom are 9 and 7. The two critical values are [9] 4.82 and 0.24, and therefore we accept H0. Using the significance method, after having checked that the statistic is on the right of 1, we get an area of 29% for the right part and therefore significance is 58%.

[Figure: F distribution with critical regions to the left of 0.24 and to the right of 4.82; the values 1 and 1.53 lie between them]
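The statistic of this example is simple to recompute. The following Python sketch, offered only as an illustration, calculates the two sample variances and their ratio for the height data used earlier, together with the reciprocal that plays the role of the "opposing" value.

```python
import statistics

males = [180, 175, 160, 180, 175, 165, 185, 180, 185, 190]
females = [170, 175, 160, 160, 175, 165, 165, 180]

var_males = statistics.variance(males)      # sample variance (divides by n - 1)
var_females = statistics.variance(females)

statistic = var_males / var_females         # best value for H0 is 1, not 0
opposing = 1 / statistic                    # the reciprocal, not the opposite
degrees = (len(males) - 1, len(females) - 1)

print(f"variances {var_males:.1f} and {var_females:.1f}, "
      f"statistic {statistic:.2f}, opposing value {opposing:.2f}, df {degrees}")
```

Swapping the two groups would give the reciprocal statistic with the degrees of freedom inverted, which is why the left critical value is obtained from the table with the degrees of freedom exchanged.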

[9] Critical values or significance can be calculated in different ways. (1) Using the statistical F test (Zwei-Stichproben F-Test) in the Data Analysis ToolPak in Microsoft Excel, which gives the statistic's value and its significance. (2) Using the English Microsoft Excel function FTEST, which gives the significance directly from the data. This method can be misleading when the statistic is on the left of 1. (3) Using the English Microsoft Excel function FDIST(1.53;9;7), which gives the area of the right tail. (4) Using the English Microsoft Excel functions FINV(2.5%;9;7) and 1/FINV(2.5%;7;9) to get the two critical values. Pay attention to the inverted degrees of freedom in the second calculation. (5) Using the F distribution table, which however usually provides only the critical values.

4.5 One-way analysis of variance (ANOVA)
Prerequisites: k populations; the variable is normally distributed on every population with the same variance
H0: expected value of the variable is the same on all populations
Statistic: VarianceBetween / VarianceWithin
Statistic's distribution: Fisher's F distribution with degrees of freedom equal to k − 1 and n − k, where n is the total number of data
SPSS: Analyze → Compare Means → Means
SPSS: Analyze → Compare Means → One-Way ANOVA
This test is the equivalent of Student's t test for two unpaired populations when the populations are more than two. We note that if even only one population has an expected value different from the others, the test rejects. Therefore, a rejection guarantees that the populations do not all have the same expected value, but it does not tell us which populations are different and how. The optimal statistic's value for H0 is 0 and, since the F distribution has only positive values, this test has only the right tail.
For example, we have heights for young (180; 170; 150; 160; 170), adults (170; 160; 165) and old (155; 160; 160; 165; 175; 165) and we want to check

H0: E[height] for young = E[height] for adults = E[height] for old
H1: at least one of the E[height] is different from the others

We suppose heights are normally distributed with the same variance. From the data we get a sample average of 166 for young, 165 for adults and 163.3 for old. Now we ask ourselves whether these differences are large enough to say that there are differences among the populations' expected values or not.
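The origins of the analysis of variance lie in splitting the sample's variance in the following way. We denote by k the number of groups, by $x_{ij}$ the j-th value of group i, by $n_i$ the size of group i, by $\bar{x}_i$ its average, by $\bar{x}$ the overall average and by $n = n_1 + \dots + n_k$ the total number of data:

$$\text{variance} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x})^2 = \frac{1}{n}\sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2 + \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2$$

We now define the sample's variance between groups as a measure of the variations between the averages of different groups

$$\text{variance between} = \frac{1}{n}\sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2$$

and the sample's variance within groups as a measure of the variations among values of the same group

$$\text{variance within} = \frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2$$

The idea behind the test is to compare these two measures: if the variance between is much larger than the variance within, it means that at least one population is significantly different from the others, while if the variance between is not large compared to the variance within, it means that variations due to a change of population have the same size as variations due to other effects and can thus be considered negligible. Simplifying the 1/n, the statistic is

$$\frac{\text{variance between}/(k-1)}{\text{variance within}/(n-k)} = \frac{\sum_{i=1}^{k} n_i(\bar{x}_i-\bar{x})^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij}-\bar{x}_i)^2/(n-k)}$$

which is distributed as a Fisher's F distribution with k − 1 and n − k degrees of freedom. The rejection region is clearly on the right, since that area is the one where VarianceBetween is much larger than VarianceWithin.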
Going back to our example, the statistic's value is [11] 0.136 with 2 and 11 degrees of freedom and a significance of 87.4%, and therefore we accept.
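The numbers of the height example can be recomputed with the following Python sketch. It is an illustration under stated assumptions: the statistic is the usual ratio of the between-groups and within-groups sums of squares, each divided by its degrees of freedom, and the significance uses a closed form of the right-tail area of the F distribution that holds only when the first degrees of freedom are exactly 2, as in this example.

```python
import statistics

groups = {
    "young": [180, 170, 150, 160, 170],
    "adults": [170, 160, 165],
    "old": [155, 160, 160, 165, 175, 165],
}
values = [x for g in groups.values() for x in g]
n, k = len(values), len(groups)
overall = statistics.mean(values)

# Sums of squares between and within groups (a common 1/n factor would cancel).
between = sum(len(g) * (statistics.mean(g) - overall) ** 2 for g in groups.values())
within = sum((x - statistics.mean(g)) ** 2 for g in groups.values() for x in g)

statistic = (between / (k - 1)) / (within / (n - k))

# Right-tail area of F(2, d) is (1 + 2 f / d) ** (-d / 2); valid only for k - 1 == 2.
assert k - 1 == 2
significance = (1 + 2 * statistic / (n - k)) ** (-(n - k) / 2)

print(f"statistic={statistic:.3f} df=({k - 1}, {n - k}) significance={significance:.1%}")
```

For a different number of groups the closed form no longer applies and the tail area would have to come from an F table or from statistical software.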

[11] Significance can be calculated in different ways. (1) Using the one-way ANOVA (ANOVA: Einfaktorielle Varianzanalyse) in the Data Analysis ToolPak in Microsoft Excel, which gives the statistic's value and its significance. (2) Using the English Microsoft Excel function FDIST(0.136;2;11), which gives the area of the right tail. (3) Using the F distribution table, which usually provides the right-side critical values.

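4.6 Jarque-Bera test
Prerequisites: none.
H0: variable follows a normal distribution
Statistic:

$$\frac{n}{6}\left(\text{sample skewness}^2 + \frac{\text{sample kurtosis}^2}{4}\right)$$

where n is the sample size.
Statistic's distribution: Jarque-Bera distribution. When n > 2000, chi-square distribution with 2 degrees of freedom.

[Portraits: Carlos Jarque; Anil Bera]

This test checks whether a variable is distributed, on the population, according to a normal distribution. It uses the fact that a normal distribution always has a skewness and a kurtosis of 0. Its statistic is clearly equal to 0 if the sample's data have a skewness and kurtosis of 0, and increases if these measures are different from 0. The statistic is multiplied by n, meaning that if we have many data they must display very small skewness and kurtosis to get a low statistic's value.
Sample's skewness and sample's kurtosis are calculated as

$$\text{skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^3}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{3/2}}, \qquad \text{kurtosis} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^4}{\left(\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2\right)^{2}} - 3.$$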
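As an illustration, the statistic can be computed with a few lines of Python. The function name `jarque_bera` is hypothetical, the data are reused from the one-variable t test example only to have some numbers to work on, and the chi-square approximation of the significance is shown for completeness even though, as noted above, it is reliable only for very large samples.

```python
import math

def jarque_bera(data):
    """Return the Jarque-Bera statistic and its chi-square(2) approximate significance."""
    n = len(data)
    average = sum(data) / n
    m2 = sum((x - average) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - average) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - average) ** 4 for x in data) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    statistic = n / 6 * (skewness ** 2 + kurtosis ** 2 / 4)
    # Right-tail area of a chi-square with 2 degrees of freedom is exp(-x / 2).
    return statistic, math.exp(-statistic / 2)

sample = [25, 26, 27, 28, 29, 30, 30, 31, 33, 34]
statistic, approx_significance = jarque_bera(sample)
print(f"statistic={statistic:.3f} approximate significance={approx_significance:.1%}")
```

With only 10 data the approximate significance should be read as a rough indication at most.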

4.7 Kolmogorov-Smirnov test
Prerequisites: none.
H0: variable follows a known distribution
Statistic:

$$\sup_x \left|\frac{\text{number of sample data} \le x}{n} - F(x)\right|$$

where n is the sample size and F(x) is the cumulative distribution of the known r.v.
Statistic's distribution: Kolmogorov distribution
SPSS: Analyze → Nonparametric Tests → One Sample

[Portraits: Andrey Kolmogorov (1903–1987); Nikolai Smirnov (1900–1966)]

This is a rank test which checks whether a variable is distributed, on the population, according to a known distribution specified by the researcher. For each x, the test calculates the difference between the percentage of sample data not larger than this x and the probability of getting a value not larger than x from the known distribution. Clearly, if the sample's data are distributed according to the known distribution, these differences are very small for every x, since the percentage of smaller values closely reflects the probability of finding smaller values. The statistic is defined as the maximum, over all the x, of these differences.
For example, we want to check whether the data 3; 4; 5; 8; 9; 10; 11; 11; 13; 14 come from a N(9; 25) distribution. For x = 1, |(number of sample data ≤ 1)/10 − P(N(9;25) ≤ 1)| = |0 − 0.05| = 0.05; for x = 2, |(number of sample data ≤ 2)/10 − P(N(9;25) ≤ 2)| = |0 − 0.08| = 0.08; for x = 3, |(number of sample data ≤ 3)/10 − P(N(9;25) ≤ 3)| = |0.1 − 0.12| = 0.02; for x = 4, |(number of sample data ≤ 4)/10 − P(N(9;25) ≤ 4)| = |0.2 − 0.16| = 0.04, and so on. Obviously, this calculation is not done only for integer values but for all values, and doing it manually is, in many cases, a very hard task. In this case, the maximum is 0.16, obtained for a value of x immediately after 14. Its significance is much larger than 5% and therefore we accept.
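Since the manual calculation is tedious, a small Python sketch may help. It assumes that N(9; 25) denotes a normal distribution with expected value 9 and variance 25 (standard deviation 5); the function name `ks_statistic` is hypothetical. The supremum can only be reached at a data point, just before or just after the jump of the sample's percentage, so it is enough to check those two values for each datum.

```python
from statistics import NormalDist

def ks_statistic(data, cdf):
    """Largest distance between the sample's cumulative percentage and cdf."""
    xs = sorted(data)
    n = len(xs)
    largest = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Just before x the sample percentage is i/n, just after it is (i+1)/n.
        largest = max(largest, abs(i / n - f), abs((i + 1) / n - f))
    return largest

data = [3, 4, 5, 8, 9, 10, 11, 11, 13, 14]
known = NormalDist(mu=9, sigma=5)  # assumption: variance 25, i.e. sigma = 5
statistic = ks_statistic(data, known.cdf)
print(f"Kolmogorov-Smirnov statistic = {statistic:.2f}")
```

The significance of the statistic would then be read from a table of the Kolmogorov distribution or obtained from statistical software.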

4.8 Sign test
Prerequisites: continuous distribution.
H0: median is m
Statistic: S, the number of outcomes on the left or on the right of m
Statistic's distribution: B(n; 50%); for n > 10, $\dfrac{S - 0.5n}{\sqrt{0.25n}} \sim N(0; 1)$
SPSS: Analyze → Nonparametric Tests → One Sample
The sign test is a rank test which tests the central tendency of a probability distribution. It is used to decide whether the population median equals the hypothesized value or not.
Consider the example in which 8 independent observations of a random variable X having a continuous distribution are 0.78, 0.51, 3.79, 0.23, 0.77, 0.98, 0.96, 0.89. We have to decide whether the distribution median is equal to 1.00. We formulate the two hypotheses:

H0: median = 1.00
H1: median ≠ 1.00

If the null hypothesis is true, we expect approximately half of the measurements to fall on each side of the hypothesized median. If the alternative is true, there will be significantly more than half on one of the sides. Thus, our test statistic will be either $S_-$ or $S_+$. These two quantities denote the number of observations falling below and above 1.00. Since X was assumed to have a continuous distribution, P(X = 1.00) = 0. In other words, every observation falls either below or above 1.00, never hitting this value itself. Consequently, $S_- + S_+ = 8$. In practice it can happen that an observation is exactly 1.00. In this situation, since this observation is strongly in favor of the H0 hypothesis, we will consider it to belong to $S_-$ when $S_+$ is larger and to $S_+$ when $S_-$ is larger.
Note that this choice of test statistic does not require exact values of the observations. In fact, it is enough to know whether each observation is larger or smaller than 1.00. On the contrary, the corresponding small-sample parametric test (which is Student's t test for one variable) requires exact values in order to calculate the sample's average and variance.
Now we take $S_-$ and consider the significance of this test. This is the probability (assuming that H0 is true) of observing a value of the test statistic that is at least as contradictory to the null hypothesis, and thus as supportive of the alternative hypothesis, as the one actually computed from the sample data. In our case $S_- = 7$. There are two more contradictory outcomes of the experiment: $S_- = 8$, the case in which all observations have fallen on the same side of the hypothesized median, and $S_- = 0$. And there is a result which is as contradictory as the one we have, $S_- = 1$. Thus significance equals $P(S_- = 7) + P(S_- = 8) + P(S_- = 1) + P(S_- = 0)$.
Note that $S_-$ has a binomial distribution B(8; 0.5). Indeed, if we suppose that H0 is correct, having an outcome on the left of 1.00 is an event with probability 50%. And having $S_-$ outcomes on the left of 1.00 out of a total of 8 independent observations is a binomial with p = 50% and n = 8. Therefore, remembering that

$$P(B(n; p) = k) = \frac{n!}{k!\,(n-k)!}\,p^k (1-p)^{n-k}$$

we can calculate [12] $P(S_- = 7) + P(S_- = 8) = 0.035$. Remembering that the binomial distribution in the particular case of p = 50% is symmetric, and therefore $P(S_- = 1) + P(S_- = 0) = P(S_- = 7) + P(S_- = 8)$, we get that significance is 7%. Setting a significance level of 5%, we accept the null hypothesis, meaning that our data are not able to support the hypothesis that the median is not 1.00.
The corresponding one-tailed test is used to decide whether the distribution median equals the hypothesized value or falls below (or exceeds) it. Referring to the set of data considered above, the corresponding two mutually exclusive hypotheses read, for example:

H0: median ≥ 1.00
H1: median < 1.00

As test statistic we choose $S_-$, the number of observations falling below 1.00. In order to find out where the rejection region is, we note that when our statistic is huge the observations falling below 1 are more numerous than the ones exceeding 1, and this is in favor of the alternative hypothesis. Thus the zone on the right is the rejection region, while the zone on the left, where $S_-$ is small, is not a rejection region. Because $S_- = 7$, there is only one outcome more contradictory to H0, which is $S_- = 8$. Thus the significance equals $P(S_- = 7) + P(S_- = 8)$. The random variable $S_-$ always has a binomial distribution whose probability of success is 1/2, and we conclude that the significance is P(B(8; 50%) = 7) + P(B(8; 50%) = 8) = 3.5%.

[Figure: distribution of the statistic, with "H0 probably true" on the left and in the center and "H0 probably false" in the right tail]

Thus, when H0 is true, the probability of facing an outcome as contradictory to H0 as the one actually observed, or more contradictory, equals 3.5%. Consequently, the sample data suggest that if we reject H0 we may be wrong in only 3.5% of the cases.
Note that, compared with the two-tailed test, the probability of a type I error is now half as large although the sample information remains the same. This is not surprising, because the one-tailed test starts from a more precise guess: it starts with the implicit hypothesis that the median can never be larger than the hypothesized value.
If we make the other one-tailed test instead:

H0: median ≤ 1.00
H1: median > 1.00,

and we take as statistic $S_-$, the number of observations falling below 1.00, in order to find out where the rejection region is we note that when our statistic is huge the observations falling below 1 are more numerous than the ones exceeding 1, and this is in favor of the null hypothesis. Therefore larger values of the statistic are all in favor of H0, and the rejection region is now for small values of the statistic.

[Figure: distribution of the statistic, with "H0 probably false" in the left tail and "H0 probably true" in the center and on the right]

Without even calculating the significance, it is evident that we must accept H0. In any case, the worse cases are $S_- = 6$, $S_- = 5$, $S_- = 4$, $S_- = 3$, $S_- = 2$, $S_- = 1$ and $S_- = 0$. Therefore the significance is $P(S_- \le 7) = P(B(8; 50\%) \le 7) = 0.996$.

[12] These quantities can be calculated much more easily in two different ways: (1) using binomial distribution cumulative tables, which give directly P(B(n; p) ≤ k), and in our case P(B(8;50%)=7) + P(B(8;50%)=8) = 100% − P(B(8;50%) ≤ 6); (2) using the English Microsoft Excel function 100%−BINOMDIST(6;8;50%;TRUE), which gives 100% − P(B(8;50%) ≤ 6).
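The three significances of the sign test on these 8 observations (two-tailed, and the two one-tailed versions) can be recomputed with the following Python sketch, given only as an illustration; the helper name `binomial_pmf` and the variable names are chosen here.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """P(B(n; p) = k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

data = [0.78, 0.51, 3.79, 0.23, 0.77, 0.98, 0.96, 0.89]
hypothesized_median = 1.00
n = len(data)
below = sum(x < hypothesized_median for x in data)  # observations below 1.00

# H1: median < 1.00, rejection region for large counts below the median.
right_tail = sum(binomial_pmf(k, n) for k in range(below, n + 1))
# H1: median > 1.00, rejection region for small counts below the median.
left_tail = sum(binomial_pmf(k, n) for k in range(0, below + 1))
# H1: median != 1.00, both tails; B(n; 50%) is symmetric.
two_tailed = min(1.0, 2 * min(right_tail, left_tail))

print(f"below={below} two-tailed={two_tailed:.1%} "
      f"right-tail={right_tail:.1%} left-tail={left_tail:.1%}")
```

Doubling the smaller tail is legitimate here only because the binomial distribution with probability 50% is symmetric.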
Recall that the normal distribution provides a good approximation of the binomial distribution when the sample size n is large (usually n > 10). Thus, using the central limit theorem, we may use N(0.5n; 0.25n) to approximate the distribution of our statistic. Using standardization

$$\frac{S - 0.5n}{\sqrt{0.25n}} \sim N(0; 1),$$

where S is our statistic, either the number of observations below or the number above the hypothesized median. Due to technical reasons [13] a correction of 0.5 is applied to the formula

$$\frac{S + 0.5 - 0.5n}{\sqrt{0.25n}} \sim N(0; 1).$$

For example, we have a sample of 30 elements with 18 elements on the left of 2.00 and 12 elements on the right of 2.00 and we want to test

H0: median ≥ 2.00
H1: median < 2.00.

[13] A technical problem which arises whenever we try to approximate a discrete distribution (B(n; 50%) in our case) with a continuous one (N(0.5n; 0.25n) in our case). A discrete probability distribution does not have any probability for non-integer values, while a continuous one does. Therefore we have to decide what to do with the values between 12 and 13, where the binomial distribution does not exist while the normal distribution has a considerable probability. We take a compromise, taking for the normal approximation all the values up to 12.5. Therefore we add 0.5 in the previous formula. It is always an addition whenever we are on the left tail, while it is clearly a subtraction whenever we are on the right tail and thus consider values larger than the statistic:

$$\frac{S - 0.5 - 0.5n}{\sqrt{0.25n}} \sim N(0; 1).$$

We take as statistic $S_+$, the number of elements on the right of 2.00, which is 12. Since it is a one-tailed test we have to see where the rejection region is. Supposing a very large value for the statistic, i.e. $S_+ = 30$, this means that the median is probably much larger than the hypothesized value, and this is in favor of H0. Therefore the rejection region is not for large values of the statistic but on the other side, the left one. Values equally or more contradictory to H0 are thus $S_+ \le 12$. The exact calculation yields P(B(30; 50%) ≤ 12) = 18.07%, while using the approximate calculation [14] we have

$$P\left(N(0;1) \le \frac{12 + 0.5 - 0.5\cdot 30}{\sqrt{0.25\cdot 30}}\right) = P\big(N(0;1) \le -0.9129\big) = 18.06\%.$$

In both cases we accept, meaning that our sample data are not able to prove that H0 is wrong.
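The comparison between the exact binomial calculation and the normal approximation with the 0.5 correction can be sketched in Python as follows; it is an illustration of the arithmetic of this example, with variable names chosen here.

```python
from math import comb, sqrt
from statistics import NormalDist

n = 30        # sample size
s_plus = 12   # elements on the right of the hypothesized median 2.00

# Exact: P(B(30; 50%) <= 12), summing the binomial probabilities.
exact = sum(comb(n, k) for k in range(s_plus + 1)) / 2 ** n

# Approximation: N(0.5 n; 0.25 n) with +0.5 because we are on the left tail.
z = (s_plus + 0.5 - 0.5 * n) / sqrt(0.25 * n)
approximate = NormalDist().cdf(z)

print(f"exact={exact:.4f} z={z:.4f} approximate={approximate:.4f}")
```

The two results agree to about three decimal places, which shows how good the approximation already is with 30 data.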

4.9 Mann-Whitney (Wilcoxon rank sum) test
Prerequisites: the two probability distributions are continuous
H0: position of distribution for population A = position of distribution for population B
Statistic: sum of ranks of the smaller group
Statistic's distribution: Wilcoxon rank sum table or, when the sample is large and tables are not available, a standardized version of the statistic distributed approximately as N(0; 1)
Alternative statistic: sum of ranks of the smaller group minus $n_1(n_1 + 1)/2$, where $n_1$ is the size of the smaller group
Alternative statistic's distribution: Mann-Whitney table or, when the sample is large and tables are not available, a standardized version of the statistic distributed approximately as N(0; 1)
SPSS: Analyze → Nonparametric Tests → Independent Samples
Suppose two independent random samples are to be used to compare two populations, and we are unwilling to make assumptions about the form of the underlying population probability distributions (and therefore we cannot perform Student's t test for two populations), or we may be unable to obtain exact values of the sample measurements. If the data can be ranked in order of magnitude, the Mann-Whitney test (also called Wilcoxon rank sum test) can be used to test the hypothesis that the probability distributions associated with the two populations are identical.
For example, suppose six economists who work for the government and seven who work for universities are randomly selected, and each one is asked to predict next year's inflation. The objective of the study is to compare the government economists' predictions to those of the university economists. Assume the government economists have given: 3.1, 4.8, 2.3, 5.6, 0.0, 2.9. The university economists have suggested instead the following values: 4.4, 5.8, 3.9, 8.7, 6.3, 10.5, 10.8. That is, there is a random variable G equal to next year's inflation as given by a governmental economist. Asking governmental economists about their predictions, we observe independent outcomes, $g_1, \dots, g_6$, of G. Likewise, there is another random variable U equal to next year's inflation as given by a university economist. Approaching university economists concerning their forecast of the inflation rate, we observe independent outcomes, $u_1, \dots, u_7$, of this random variable. We have to decide whether G and U have the same distribution or not, basing our decision only on the sample observations, which are the only information we have.

H0: the probability distribution corresponding to the government economists' predictions of the inflation rate is in the same position as the university economists' one
H1: the probability distribution corresponding to the government economists' predictions of the inflation rate is in a different position from the university economists' one

[14] The probability of a normal distribution can be calculated in two ways: (1) looking into a standard normal distribution table; (2) using the English Microsoft Excel function NORMDIST(−2.5/SQRT(0.25*30);0;1;TRUE).

To solve this problem, we first rank all available sample observations, from the smallest (a rank of 1) to the largest (a rank of 13): rank 1: 0.0, rank 2: 2.3, rank 3: 2.9, rank 4: 3.1, rank 5: 3.9, rank 6: 4.4, rank 7: 4.8, rank 8: 5.6, rank 9: 5.8, rank 10: 6.3, rank 11: 8.7, rank 12: 10.5, rank 13: 10.8. The test statistic for the Mann-Whitney test is based on the totals of the ranks for each of the two samples, that is, on rank sums. If the two rank sums are nearly equal, the implication is that there is no evidence that the probability distributions from which the samples were drawn are different. On the other hand, when the two rank sums differ substantially, it suggests that the two samples may have come from different distributions. We denote the rank sum for governmental economists by $T_g$ and that for university economists by $T_u$. Then $T_g = 4 + 7 + 2 + 8 + 1 + 3 = 25$ and $T_u = 6 + 9 + 5 + 11 + 10 + 12 + 13 = 66$. The sum of $T_g$ and $T_u$ will always equal $n(n+1)/2$, that is the sum of all integers from 1 through $n$, where $n = n_g + n_u$. In the particular case at hand, $n_g = 6$, $n_u = 7$, $n = 13$, and $T_g + T_u = 13(13+1)/2 = 91$. Since $T_g + T_u$ is fixed, a small value for $T_g$ implies a large value for $T_u$ (and vice versa) and a large difference between $T_g$ and $T_u$. Therefore, the smaller the value of one of the rank sums, the greater the evidence to indicate that the samples were selected from different distributions. However, when comparing these two values, we must also take into account the fact that a rank sum may be small simply because the corresponding sample size is small; in our case, $T_g$ may be smaller because the governmental sample has fewer subjects. The test statistic is either of the two rank sums. Critical values for this statistic are given in appropriate Wilcoxon rank sum tables. We take $T_g$ and, looking at the table for $n_g = 6$ and $n_u = 7$, we get, for a significance level of 5%, critical values of 28 and 56.
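[Figure: rejection regions (H0 probably false) to the left of 28 and to the right of 56; the observed statistic 25 falls in the left rejection region.]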

Since our statistic is in the critical region, we reject, meaning that our data confirm that the two distributions are different.
Note that the assumptions necessary for the validity of the Mann-Whitney test do not specify the shape of the probability distribution. However, the distributions are assumed to be continuous so that the probability of tied measurements is zero and, consequently, each measurement can be assigned a unique rank. In practice, however, rounding of continuous measurements may sometimes produce ties. As long as the number of ties is small relative to the sample sizes, the Mann-Whitney test procedure is applicable. On the other hand, the test is not recommended to compare discrete distributions for which many ties are expected. Ties may be treated in the following way: assign tied measurements the average of the ranks they would receive if they were unequal. For example, if the third-ranked and fourth-ranked measurements are tied, we assign to each one a rank of $(3+4)/2 = 3.5$. If the third-ranked, fourth-ranked and fifth-ranked measurements are tied, we assign to each one a rank of $(3+4+5)/3 = 4$.

Returning to our example, we may formulate the question more exactly: is it true that the university economists' predictions tend to be higher than the predictions of the governmental economists? In other words, is the density of $X_u$ shifted to the right with respect to the density of $X_g$? Conceptually this shift equals the systematic component in the difference between the predictions of a generic university economist and a generic government economist. That is:

H0: the probability distribution of the governmental economists' predictions of the inflation rate is in the same position as, or shifted to the right with respect to, that of the university economists
H1: the probability distribution of the governmental economists' predictions of the inflation rate is shifted to the left with respect to that of the university economists

We have to find out the rejection region. We take $T_g$ as statistic and suppose that its value is very large. This means that governmental economists make predictions with larger ranks and thus with higher values than university economists. This is strongly in favor of H0 and therefore the rejection region is on the other side, the left one. Critical values are different and they are, for a significance level of 5%, 30 and 54. The statistic falls in the rejection region and thus our data confirm that governmental predictions are shifted to the left.
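[Figure: one-tailed rejection region (H0 probably false) to the left of 30; H0 probably true between 30 and 54 and to the right of 54; the observed statistic 25 falls in the rejection region.]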

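When a sample size, $n_g$ or $n_u$, is larger than 10, tables do not provide us with critical values anymore. In these cases the statistic's distribution can be approximated with a normal distribution:

$$\frac{T_g - \dfrac{n_g (n_g + n_u + 1)}{2}}{\sqrt{\dfrac{n_g\, n_u\, (n_g + n_u + 1)}{12}}} \sim N(0; 1).$$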
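The rank sums of the economists example can be checked with a short script. This is only a sketch: the helper `average_ranks` is a hypothetical name introduced here, and the split of the 13 forecasts between the two groups is deduced from the ranks listed in the text (governmental ranks 4, 7, 2, 8, 1, 3). The normal approximation is computed purely as an illustration, since both samples are smaller than 10.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2  # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


# Forecasts recovered from the ranks given in the text (an assumption of this sketch).
government = [3.1, 4.8, 2.3, 5.6, 0.0, 2.9]
university = [4.4, 5.8, 3.9, 8.7, 6.3, 10.5, 10.8]

ranks = average_ranks(government + university)
n_g, n_u = len(government), len(university)
n = n_g + n_u
t_g = sum(ranks[:n_g])  # rank sum of governmental economists
t_u = sum(ranks[n_g:])  # rank sum of university economists

# Normal approximation of the statistic (illustrative only for such small samples).
z = (t_g - n_g * (n + 1) / 2) / math.sqrt(n_g * n_u * (n + 1) / 12)
print(t_g, t_u, round(z, 3))
```

The two rank sums add up to $n(n+1)/2$, which is a quick sanity check on any hand calculation.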

4.10 Wilcoxon signed rank test
Prerequisites: the difference is a random variable having a continuous probability distribution.
H0: position of distribution for variable A = position of distribution for variable B
Statistic: sum of ranks of differences, $T$
Statistic distribution: Wilcoxon signed rank table or

$$\frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \sim N(0; 1)$$

when the sample is large and tables are not available
SPSS: Analyze → Nonparametric Tests → Related Samples

Frank Wilcoxon (1892-1965)

Rank tests can also be employed to compare two probability distributions when a paired difference design is used. For example, consumer preferences for two competing products are often compared by analyzing the responses in a random sample of consumers who are asked to rate both products. Thus, the ratings have been paired on each consumer. Consider for example a situation when 10 students have been asked to compare the teaching ability of two professors, say $A$ and $B$. Each of the students grades the teaching ability on a scale from 1 to 10, with higher grades implying better teaching. The results of the experiment are as follows:
student   x_A   x_B   d = x_A - x_B   |d|   sign of d   rank of |d|
1         6     4      2              2     +           5
2         8     5      3              3     +           7.5
3         4     5     -1              1     -           2
4         9     8      1              1     +           2
5         4     1      3              3     +           7.5
6         7     9     -2              2     -           5
7         6     2      4              4     +           9
8         5     3      2              2     +           5
9         6     7     -1              1     -           2
10        8     2      6              6     +           10

Here $x_A$ and $x_B$ are the grades assigned by each student to professor $A$ and $B$. Since this is a paired difference experiment, we analyze the differences between the measurements. Examining the differences allows removing a possible common causality behind these ratings. In fact, the fourth and the sixth students seem to have given higher ratings than the other students to both professors.
This rank test requires that we calculate the ranks of the absolute values of the differences between the measurements. Since there are ties, the tied absolute differences are assigned the average of the ranks they would receive if they were unequal but successive measurements. For example, the absolute value 3 appears two times. If these were unequal measurements, their ranks would have been 8 and 7. Thus the rank for 3 equals $(7+8)/2 = 7.5$. In the same way, the rank for 2 equals $(4+5+6)/3 = 5$, and the rank for 1 is $(1+2+3)/3 = 2$. After the absolute differences are ranked, the sum of the ranks of the positive differences of the original measurements, $T_+$, and the sum of the ranks of the negative differences, $T_-$, are computed. In our case: $T_+ = 5 + 7.5 + 2 + 7.5 + 9 + 5 + 10 = 46$ and $T_- = 2 + 5 + 2 = 9$. Now we are ready to test the nonparametric hypotheses:

H0: the probability distribution of the ratings for professor $A$ is in the same position as the one for professor $B$
H1: the probability distribution of the ratings for professor $A$ is in a different position from the one for professor $B$

As the test statistic we use either $T_+$ or $T_-$. The larger the difference between $T_+$ and $T_-$, the greater the evidence to indicate that the two probability distributions differ in location. Note that also for this test the sum $T_+ + T_-$ is fixed and equal to $n(n+1)/2$. The left critical value is tabulated, while the right critical value can be found by symmetry. In our case, we take for example $T_-$, which is 9. The left critical value, for a significance level of 5%, is 8. The other critical value is $n(n+1)/2 - 8 = 55 - 8 = 47$.
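[Figure: central region (H0 probably true) centred at 27.5, with rejection regions (H0 probably false) on both sides; the observed value 46 lies just to the left of the right critical value 47.]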

As it can be seen in the schema, this test is perfectly symmetric and when one $T$ falls into the central region, the other automatically does the same. Vice versa, when one $T$ falls into a rejection region, the other falls into the other rejection region. In our example we accept and therefore our data are not able to prove that the two distributions are different.
Obviously, also for this test we have one-tailed versions. This is performed in the usual way, taking care to choose one statistic and decide which the rejection region for that statistic is.
Since we have assumed that the distribution of a difference is continuous, there should not be differences which are exactly 0. However, in practice, they may occur due to rounding: in such cases, we must decide whether to assign their rank to $T_+$ or to $T_-$. For the two-tailed test there is no solution. Since a difference of 0 is in favor of the H0 hypothesis, assigning it to either statistic can unbalance the situation and push in favor of H1. Moreover, a difference of 0 is strongly in favor of H0, but it would have the smallest rank. So, the two-tailed test cannot be performed at all if we have any 0 difference. However, the one-tailed test can be performed. For example:
H0: the probability distribution of the ratings for professor $A$ is in the same position as, or shifted to the left with respect to, the one for professor $B$
H1: the probability distribution of the ratings for professor $A$ is shifted to the right with respect to the one for professor $B$
A difference of 0 is in favor of the H0 hypothesis, which includes also all the negative differences. Therefore, the rank of any 0 difference is assigned, with these hypotheses, to $T_-$.
When $n > 25$ statistic tables are not available anymore. The statistic's distribution can be approximated with:

$$\frac{T - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}} \sim N(0; 1),$$

where it is better to take as statistic the smaller between $T_+$ and $T_-$, since usually standard normal distribution tables provide the area on the left.
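The signed rank computation for the two professors can be reproduced with the sketch below. The helper `average_ranks` and the variable names are illustrative choices, not part of the original notes; the grades are the ones in the table of the example, with differences taken as first professor minus second professor.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


grades_first = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]
grades_second = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]

# Paired differences, student by student.
diffs = [a - b for a, b in zip(grades_first, grades_second)]

# Ranks are computed on the absolute differences, ties averaged.
ranks = average_ranks([abs(d) for d in diffs])

t_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)   # ranks of positive differences
t_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)  # ranks of negative differences
n = len(diffs)
print(ranks, t_plus, t_minus)
```

Zero differences are simply left out of both sums here; as discussed in the text, how to treat them depends on the hypotheses being tested.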

4.11 Kruskal-Wallis test
Prerequisites: there are 5 or more measurements in each sample; the probability distributions from which the samples are drawn are continuous
H0: position of distribution of populations is the same
Statistic: $H$
Statistic distribution: chi-square distribution with $k - 1$ degrees of freedom
SPSS: Analyze → Nonparametric Tests → Independent Samples

William Henry Kruskal (1919-2005)
Wilson Allen Wallis (1912-1998)

The Kruskal-Wallis test is the Mann-Whitney test when more than two populations are involved. Its corresponding parametric test is the Analysis of Variance.
For example, a health administrator wants to compare the unoccupied bed space for three hospitals. She randomly selects 10 different days from the records of each hospital and lists the number of unoccupied beds for each day. Just as with two independent samples, we base our comparison on the rank sums for these three sets of data. Ties are treated as in the Mann-Whitney test by assigning the average value of the ranks to each of the tied observations:
Hospital 1        Hospital 2        Hospital 3
Beds   Rank       Beds   Rank       Beds   Rank
6      5          34     25         13     9.5
38     27         28     19         35     26
3      2          42     30         19     15
17     13         13     9.5        4      3
11     8          40     29         29     20
30     21         31     22         0      1
15     11         9      7          7      6
16     12         32     23         33     24
25     17         39     28         18     14
5      4          27     18         24     16
Rank sum 120      Rank sum 210.5    Rank sum 134.5

We test

H0: the probability distributions of the number of unoccupied beds have the same position for all three hospitals
H1: at least one of the hospitals has a probability distribution in a different position with respect to the others.

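The test statistic, called $H$, is

$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} n_j \left(\bar{R}_j - \bar{R}\right)^2,$$

where $k$ denotes the number of distributions involved, $n_j$ is the number of measurements available for the $j$-th distribution, $T_j$ is the corresponding rank sum, $\bar{R}_j = T_j / n_j$ is the mean rank for population $j$ and $\bar{R} = (T_1 + \dots + T_k)/n = (n+1)/2$ (remembering that the sum of ranks is fixed, as for the Mann-Whitney and Wilcoxon tests) is the mean rank for the whole population. As it can be seen from the formula, this statistic measures the extent to which the $k$ mean ranks differ with respect to the average rank. Note that the $H$ statistic is always non-negative. It takes on the value zero if and only if all samples have the same mean rank, that is $\bar{R}_j = \bar{R}$ for all $j$. This statistic becomes increasingly large as the distance between a sample mean rank $\bar{R}_j$ and the mean rank for the whole population grows.
However, the formula that is used for practical calculations is an easier one$^{15}$:

$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{T_j^2}{n_j} - 3(n+1).$$

In our case $k = 3$, $n_1 = n_2 = n_3 = 10$ and $n = 30$. $H$ is

$$H = \frac{12}{30 \cdot 31} \left( \frac{120^2}{10} + \frac{210.5^2}{10} + \frac{134.5^2}{10} \right) - 3 \cdot 31 = 6.097.$$

15 Expanding the square and using $\sum_j T_j = n(n+1)/2$ gives $\sum_j n_j (\bar{R}_j - \bar{R})^2 = \sum_j T_j^2/n_j - n(n+1)^2/4$; multiplying by $12/(n(n+1))$, the last term becomes $3(n+1)$.
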
The statistic's distribution is, under the hypothesis that the null hypothesis is true, approximately a chi-square distribution with $k - 1$ degrees of freedom. This approximation is adequate as long as each of the $k$ sample sizes is at least 5. The chi-square distribution has only one tail on the right and thus the rejection region for the test is located in the right tail. In our case $k = 3$, so we are dealing with a chi-square distribution with 2 degrees of freedom. Using the significance method, we find$^{16}$ a significance of 4.74%, which means that we reject. Using the critical region method with 5% significance level, we get a critical value of 5.99.
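[Figure: H0 probably true to the left of the critical value 5.99, H0 probably false to its right; the observed statistic 6.097 falls in the rejection region.]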
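The hospital example can be recomputed from the raw bed counts with the sketch below. It uses the practical formula for $H$; the function and variable names are illustrative, and the right-tail probability uses the closed form $e^{-H/2}$, which holds only for a chi-square distribution with 2 degrees of freedom.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


hospitals = [
    [6, 38, 3, 17, 11, 30, 15, 16, 25, 5],
    [34, 28, 42, 13, 40, 31, 9, 32, 39, 27],
    [13, 35, 19, 4, 29, 0, 7, 33, 18, 24],
]

pooled = [beds for sample in hospitals for beds in sample]
ranks = average_ranks(pooled)
n = len(pooled)

# Rank sum of each hospital, taken from the pooled ranking.
rank_sums = []
start = 0
for sample in hospitals:
    rank_sums.append(sum(ranks[start:start + len(sample)]))
    start += len(sample)

h_stat = (12 / (n * (n + 1))
          * sum(t ** 2 / len(s) for t, s in zip(rank_sums, hospitals))
          - 3 * (n + 1))

# Right-tail area of a chi-square with exactly 2 degrees of freedom.
significance = math.exp(-h_stat / 2)
print(rank_sums, round(h_stat, 3), round(100 * significance, 2))
```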

4.12 Pearson's correlation coefficient
Prerequisites: coupled data
H0: $\mathrm{Corr}(X, Y) = 0$
Statistic: $r \sqrt{\dfrac{n-2}{1-r^2}}$
Statistic distribution: Student's t with $n - 2$ degrees of freedom
SPSS: Analyze → Correlate → Bivariate

Karl Pearson (1857-1936)

Consider two random variables, $X$ and $Y$, of which we have only $n$ couples of outcomes, $(x_i; y_i)$. It is important that the outcomes that we have are in couples, since we are interested in estimating the correlation between the two variables. We use as estimator the Pearson's correlation coefficient, which is defined, through the introduction of the $SS$ (sum of squares) quantity, as

$$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\, SS_{yy}}},$$

$$SS_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - n \bar{x} \bar{y}, \qquad SS_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - n \bar{x}^2, \qquad SS_{yy} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - n \bar{y}^2.$$

As it can be seen from the formulas, $SS$ quantities have two equivalent definitions, of which the latter is easier to use in practical calculations while the former is more useful for theoretical considerations. In particular, we can immediately observe from the first definition that $SS_{xx}$ and $SS_{yy}$ are sums of squares, hence strictly positive unless all outcomes are equal, and therefore the square root and the denominator are well defined. In the particular case when all the $x_i$ or all the $y_i$ have the same value, the corresponding $SS$ quantity becomes 0 and the Pearson correlation coefficient is no longer defined. This is a very rare case and corresponds to the situation when there are only constant outcomes for random variable $X$ or $Y$; clearly, from constant outcomes we cannot estimate anything concerning the behavior of random variables.
$SS_{xx}/(n-1)$ is the estimation of the variance of random variable $X$, while $SS_{yy}/(n-1)$ is the estimation of the variance of random variable $Y$ and $SS_{xy}/(n-1)$ is the estimation for $\mathrm{Cov}(X, Y)$. Since the correlation is exactly $\mathrm{Corr}(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$, Pearson's correlation coefficient is the estimation for the correlation.

16 Significance can be calculated in two different ways. (1) Using the English Microsoft Excel function CHIDIST(6.097;2), which gives us the area of the right tail. (2) Using a chi-square distribution table, which usually provides critical values for different significance levels.
The sign of $r$ is determined only by the sign of $SS_{xy}$. It can moreover be easily demonstrated$^{17}$ that $|SS_{xy}| \le \sqrt{SS_{xx}\, SS_{yy}}$ and therefore the value of $r$ must lie between $-1$ and $1$, independently of how large or small the numbers $x_i$ and $y_i$ are. In other words, $r$ is a scaleless variable. A value of $r$ near or equal to zero is interpreted as little or no correlation between $X$ and $Y$. In contrast, the closer $r$ comes to $1$ or $-1$, the stronger is the correlation of these variables. Positive values of $r$ imply a positive correlation between $X$ and $Y$. That is, if one increases, the other one increases as well. Negative values of $r$ imply a negative correlation. In fact, $X$ and $Y$ move in opposite directions: when $X$ increases, $Y$ decreases and vice versa. In sum, this coefficient of correlation reveals whether there is a common tendency in the moves of $X$ and $Y$.
We have a test to check whether $\mathrm{Corr}(X, Y)$ is different from 0, meaning that there is a linear relation between random variables $X$ and $Y$. This test uses the fact that the statistic

$$t = r \sqrt{\frac{n-2}{1-r^2}}$$

is distributed like a Student's t distribution with $n - 2$ degrees of freedom. We recall that independence implies zero correlation but not vice versa: therefore, when the correlation is different from 0, we are sure that the two random variables are dependent.
For example, suppose we have these 11 couples of data

x:  2   3   4   3   5   6   7    3   1   3   4
y:  5   5   7   5   7   7   14   5   3   1   12

With $\bar{x} = 41/11 \approx 3.7$ and $\bar{y} = 71/11 \approx 6.5$ we get $SS_{xx} = 4 + 9 + 16 + 9 + 25 + 36 + 49 + 9 + 1 + 9 + 16 - 11 \bar{x}^2 = 30.18$, $SS_{xy} = 10 + 15 + 28 + 15 + 35 + 42 + 98 + 15 + 3 + 3 + 48 - 11 \bar{x} \bar{y} = 47.36$ and $SS_{yy} = 138.73$. Therefore $r$ is $47.36 / \sqrt{30.18 \cdot 138.73} = 0.732$ with 11 couples of data and the value of our statistic is $0.732 \sqrt{9 / (1 - 0.732^2)} = 3.223$, with a significance, for the two-tailed test, of 1.04%. Therefore, taking a significance level of 5%, 11 couples of data with Pearson's correlation coefficient of 0.732 are enough to prove that the correlation is different from 0 and therefore the two variables are not independent.
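The sums of squares, the coefficient and the test statistic of this example can be verified with the following sketch, which uses the computational form of the $SS$ quantities. Variable names are illustrative; the significance is not computed because it requires the Student's t distribution, which is not available in the Python standard library.

```python
import math

x = [2, 3, 4, 3, 5, 6, 7, 3, 1, 3, 4]
y = [5, 5, 7, 5, 7, 7, 14, 5, 3, 1, 12]
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Computational form: sum of products minus n times the product of the means.
ss_xx = sum(a * a for a in x) - n * mean_x ** 2
ss_yy = sum(b * b for b in y) - n * mean_y ** 2
ss_xy = sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y

r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Test statistic, to be compared with a Student's t with n - 2 degrees of freedom.
t = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(ss_xx, 2), round(ss_yy, 2), round(ss_xy, 2), round(r, 3), round(t, 3))
```

Note that the means must be kept unrounded in the calculation: using 3.7 and 6.5 instead of 41/11 and 71/11 changes the sums of squares noticeably.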

17 This fact is obtained by applying the Cauchy-Schwarz inequality, $|\langle u, v \rangle| \le \|u\| \, \|v\|$, to the vectors $u$ and $v$ with components $u_i = x_i - \bar{x}$ and $v_i = y_i - \bar{y}$.


4.13 Spearman's rank correlation coefficient
Prerequisites: coupled ranked data or coupled data from continuous distributions
H0: ranks are uncorrelated
Statistic: Spearman's rank correlation coefficient
Statistic distribution: Spearman table
SPSS: Analyze → Correlate → Bivariate

Charles Spearman (1863-1945)

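The Spearman's rank correlation coefficient is the non-parametric version of the Pearson's correlation coefficient.
Taking the same data of the previous example,

x:  2   3   4   3   5   6   7    3   1   3   4
y:  5   5   7   5   7   7   14   5   3   1   12

this time instead of taking the values, we assign ranks. It is important that ranks be assigned independently for $x$ and $y$, yet maintaining the coupled position of the data:

u (rank of x):  2     4.5   7.5   4.5   9   10   11   4.5   1   4.5   7.5
v (rank of y):  4.5   4.5   8     4.5   8   8    11   4.5   2   1     10

The Spearman's rank correlation coefficient, $r_s$, is calculated exactly as Pearson's correlation coefficient:

$$r_s = \frac{SS_{uv}}{\sqrt{SS_{uu}\, SS_{vv}}},$$

where, exactly as for Pearson's correlation coefficient,

$$SS_{uv} = \sum_i (u_i - \bar{u})(v_i - \bar{v}), \qquad SS_{uu} = \sum_i (u_i - \bar{u})^2, \qquad SS_{vv} = \sum_i (v_i - \bar{v})^2.$$

The value of $r_s$ always falls between $-1$ and $1$, with $+1$ indicating perfect positive correlation and $-1$ perfect negative correlation. The closer $r_s$ falls to $1$ or $-1$, the greater the correlation between the ranks. Conversely, the nearer $r_s$ is to 0, the less the correlation.
For Spearman's rank correlation we have, however, additional information since the values used in the calculation must be integer numbers between 1 and $n$. Therefore, through mathematical calculations, we can derive$^{18}$ an alternative formula valid only when there are no tied ranks:

$$r_s = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)},$$

where $d_i = u_i - v_i$ is the difference between the rank of the $i$-th measurement in the first set and the rank of the $i$-th measurement in the second set. We can see that if all ranks are identical, that is, $u_i = v_i$ for every $i$, then $r_s = 1$. We must take care to remember that this formula is valid only when there are no tied ranks.

18 Starting from the consideration that, without ties, both sets of ranks are the integers $1, \dots, n$ in some order, we can obtain a simplification: $SS_{uu} = SS_{vv} = \sum_i i^2 - n \left(\frac{n+1}{2}\right)^2 = \frac{n(n^2-1)}{12}$. Moreover, since $\bar{u} = \bar{v} = (n+1)/2$, each difference can be written as $d_i = (u_i - \bar{u}) - (v_i - \bar{v})$.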
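Returning to our example, we see that $\sum_i d_i^2 = (-2.5)^2 + 0^2 + (-0.5)^2 + 0^2 + 1^2 + 2^2 + 0^2 + 0^2 + (-1)^2 + 3.5^2 + (-2.5)^2 = 31$; consequently, $r_s = 1 - \dfrac{6 \cdot 31}{11(11^2 - 1)} = 0.859$ (since these data contain tied ranks, the shortcut formula is used here only as an approximation). The fact that $r_s$ is close to 1 indicates that the two rankings tend to agree, but the agreement is not perfect.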
If the sets of ranks are formed by values taken by independent realizations of random variables $X$ and $Y$, the Spearman's rank correlation coefficient may be used for testing whether the value of $\mathrm{Corr}(X, Y)$ is different from 0. The statistic is the coefficient itself. In the previous example, with $n = 11$ and a significance level of 5%, we have a critical value of 0.623,

[Figure: rejection regions (H0 probably false) to the left of -0.623 and to the right of 0.623, with H0 probably true in between; the observed value 0.859 falls in the right rejection region.]

and therefore we reject, meaning that the ranks are correlated and there is a relation between the order of the two variables.
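The sketch below recomputes the ranks of the 11 couples and evaluates Spearman's coefficient in both ways: with the general definition (Pearson's formula applied to the ranks) and with the shortcut based on $\sum d_i^2$. The helper names are illustrative. Because the data contain ties, the two results are expected to differ slightly; the shortcut reproduces the 0.859 of the text.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


x = [2, 3, 4, 3, 5, 6, 7, 3, 1, 3, 4]
y = [5, 5, 7, 5, 7, 7, 14, 5, 3, 1, 12]
n = len(x)

u = average_ranks(x)  # ranks assigned independently to x ...
v = average_ranks(y)  # ... and to y, keeping the couples aligned

# General definition: Pearson's coefficient computed on the ranks.
mean_u = sum(u) / n
mean_v = sum(v) / n
ss_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
ss_uu = sum((a - mean_u) ** 2 for a in u)
ss_vv = sum((b - mean_v) ** 2 for b in v)
r_s_general = ss_uv / math.sqrt(ss_uu * ss_vv)

# Shortcut formula, strictly valid only without tied ranks.
sum_d2 = sum((a - b) ** 2 for a, b in zip(u, v))
r_s_shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(sum_d2, round(r_s_general, 3), round(r_s_shortcut, 3))
```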
Spearman's rank correlation coefficient can be used, as every other rank test, in all the situations where effective measures are not available and only ranks are provided. Suppose ten new car models are evaluated by two consumer magazines and each magazine ranks the braking system of the cars from 1 (best) to 10 (worst). We want to determine whether the magazines' ranks are related. If they are, we may conclude that these rankings contain useful information about the braking system. Otherwise, if the rankings given by the two magazines are not related, we should not regard these rankings as containing useful information since they are contradictory and we do not know which one to use. Let the ranks given by the two magazines be as follows:
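[Table: for each car model, from 1 to 10, the rank given by magazine 1 and the rank given by magazine 2.]

In this case data are already ranked and the coefficient can be calculated directly.

18 (continued) Expanding the square of $d_i = (u_i - \bar{u}) - (v_i - \bar{v})$ we see that $\sum_i d_i^2 = SS_{uu} + SS_{vv} - 2\, SS_{uv}$. Finally, taking into account that $SS_{uu} = SS_{vv} = \frac{n(n^2-1)}{12}$, we obtain $SS_{uv} = \frac{n(n^2-1)}{12} - \frac{1}{2} \sum_i d_i^2$. Consequently, $r_s = \dfrac{SS_{uv}}{\sqrt{SS_{uu}\, SS_{vv}}} = 1 - \dfrac{6 \sum_i d_i^2}{n(n^2-1)}$.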

4.14 Multinomial experiment
Many business analyses consist of enumerating the number of occurrences of some event. For example, we may count the number of consumers who choose each of three brands of coffee, or the number of sales made by each of five automobile salespeople during a month. When there is a single scale to classify data, as in all examples above, we have a one-dimensional classification. In some cases we may collect the count data characterizing several factors. For example, we may be interested in investigating whether the color of the automobile purchased is related to the sex of the buyer. In this case we are dealing with a two-dimensional classification. The corresponding data constitute a contingency table. Count data are traditionally analyzed using tables.
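4.14.1 One-dimensional classification
Prerequisites: predicted counts $E_i \ge 5$ for all $i$
H0: $p_i = p_{i,0}$ for all $i$ or, equivalently, the expected value of each count $n_i$ is $E_i = n\, p_{i,0}$ for all $i$
Statistic: table's chi-square, $\chi^2$
Statistic distribution: chi-square with $k - 1$ degrees of freedom
SPSS: Analyze → Nonparametric tests → One Sample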
The properties of the one-dimensional multinomial experiment are as follows:

• the experiment consists of $n$ identical trials;
• the trials are independent;
• there are $k$ possible outcomes to each trial;
• the probabilities of the $k$ outcomes, denoted by $p_1, p_2, \dots, p_k$, remain the same from trial to trial, where $p_1 + p_2 + \dots + p_k = 1$ (therefore there is no other possible outcome outside the ones we are considering);
• the random variables of interest are the counts $n_1, n_2, \dots, n_k$ in each of the $k$ cells.

For example, suppose a large supermarket chain conducts a consumer preference survey by recording the brand of bread purchased by customers in its stores. Assume the chain carries three brands of bread, A, B and C. The brand preferences of a random sample of 150 consumers are observed, and the resulting count data are as follows: A: 61, B: 53, C: 36. Do these data indicate that a preference exists for any of these brands?
Our consumer preference survey satisfies the properties of a multinomial experiment. The experiment consists in randomly sampling $n = 150$ buyers from a large population of consumers containing an unknown proportion $p_1$ who prefer brand A, a proportion $p_2$ who prefer brand B, and a proportion $p_3$ who prefer the store brand, C. Approaching a buyer concerning his preference, we perform a single trial that can result in one of three outcomes: the consumer prefers brand A, B or C. Probabilities of these outcomes are $p_1$, $p_2$, and $p_3$, respectively. The preference of any single consumer in the sample does not affect the preference of another. Consequently, the trials are independent. The recorded data are the numbers of buyers in each of the consumer preference categories. Thus, the consumer preference survey satisfies the five properties of a multinomial experiment.
Note that we may talk about the proportions $p_1$, $p_2$, $p_3$ as probabilities because in a population consisting in total of $N$ agents, $N p_1$ prefer brand A, $N p_2$ opt for brand B, and $N p_3$ for brand C. Consequently, the probability of randomly choosing a customer who buys A, B, C will be correspondingly $N p_1 / N = p_1$, $N p_2 / N = p_2$ and $N p_3 / N = p_3$. That is why we may talk about $p_i$ as a proportion as well as a probability. The three probabilities $p_1$, $p_2$, $p_3$ are unknown and we want to use the survey data to make inferences about their size.
The general form for a test of a hypothesis concerning multinomial probabilities is as follows:

H0: $p_1 = p_{1,0}$, $p_2 = p_{2,0}$, ..., $p_k = p_{k,0}$, where $p_{1,0}, p_{2,0}, \dots, p_{k,0}$ represent the hypothesized values of the multinomial probabilities ($p_{1,0} = p_{2,0} = p_{3,0} = 1/3$ in the above example with three types of bread)
H1: at least one of the multinomial probabilities does not equal its hypothesized value; in other words, there is an $i$ such that the corresponding actual probability $p_i$ does not coincide with its hypothesized value $p_{i,0}$, that is $p_i \ne p_{i,0}$.
We build a table of observed counts and a table of predicted counts

Observed counts:
A    B    C
61   53   36

Predicted counts under the H0 hypothesis:
A    B    C
50   50   50

The test statistic is the table's chi-square, a measure calculated as

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - E_i)^2}{E_i},$$

where $n_i$ are called observed counts, $E_i = n\, p_{i,0}$ are the predicted counts and $n = n_1 + \dots + n_k$ is the total sample size. This statistic is distributed as a chi-square distribution with $k - 1$ degrees of freedom. Observing the chi-square statistic, it is evident that when the observed numbers are very different from the predicted counts, $E_i$, the value of the statistic is very large, while when the observed numbers coincide with the predicted ones the statistic is zero. Therefore, the rejection region is only on the right. This test works only if the predicted counts are all $E_i \ge 5$, while it is not important that the observed ones be at least 5.
In our particular example, $\chi^2 = \dfrac{(61-50)^2}{50} + \dfrac{(53-50)^2}{50} + \dfrac{(36-50)^2}{50} = 6.52$. Since here $k = 3$, we are dealing with a chi-square distribution with 2 degrees of freedom. The statistic's significance is$^{19}$ 3.84% and therefore we reject, meaning that consumers' preferences are not uniform and that there is at least one type of bread that has a probability different from 1/3. If we want to use the critical regions method, the critical value for 5% is 5.99 and therefore 6.52 is in the rejection region, which for this test is always on the right.

19 Significance can be calculated in different ways: (1) using the English Microsoft Excel function CHIDIST(6.52;2), which gives us the probability of the right tail of the chi-square distribution; (2) looking into chi-square tables, which usually provide critical values for different significance levels; (3) using the English Microsoft Excel function CHIINV(5%;2), which gives us the critical value corresponding to 5% significance level; (4) the test can be performed also using the English Microsoft Excel function CHITEST which, given the observed table and the predicted table, gives us directly the significance.
As another example, using the same data we want to check whether the breads' probabilities follow a 40%, 40%, 20% distribution:

H0: $p_1 = 40\%$ and $p_2 = 40\%$ and $p_3 = 20\%$
H1: $p_1 \ne 40\%$ or $p_2 \ne 40\%$ or $p_3 \ne 20\%$.

The observed and predicted tables are

Observed counts:
A    B    C
61   53   36

Predicted counts under the H0 hypothesis:
A    B    C
60   60   30

and $\chi^2 = \dfrac{(61-60)^2}{60} + \dfrac{(53-60)^2}{60} + \dfrac{(36-30)^2}{30} = 2.03$ with a significance of 36.2%. Therefore we accept, meaning that our sample is not able to prove that consumers' preference is not 40%, 40%, 20%.
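Both bread examples can be recomputed with the sketch below. The function name `chi_square_statistic` is an illustrative choice; the significance uses the closed form $e^{-\chi^2/2}$, which is the right-tail area only for a chi-square distribution with 2 degrees of freedom (three categories).

```python
import math


def chi_square_statistic(observed, probabilities):
    """Sum of (observed - predicted)^2 / predicted over all categories."""
    n = sum(observed)
    predicted = [n * p for p in probabilities]
    if min(predicted) < 5:
        raise ValueError("every predicted count must be at least 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, predicted))


observed = [61, 53, 36]  # brands A, B, C

# First hypothesis: uniform preferences.
chi2_uniform = chi_square_statistic(observed, [1 / 3, 1 / 3, 1 / 3])
sig_uniform = math.exp(-chi2_uniform / 2)  # right tail, 2 degrees of freedom only

# Second hypothesis: 40%, 40%, 20%.
chi2_404020 = chi_square_statistic(observed, [0.4, 0.4, 0.2])
sig_404020 = math.exp(-chi2_404020 / 2)

print(round(chi2_uniform, 2), round(100 * sig_uniform, 2))
print(round(chi2_404020, 2), round(100 * sig_404020, 1))
```

Each squared difference is divided by its own predicted count, so in the second hypothesis the term for brand C is divided by 30 and not by 60.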
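4.14.2 Two-dimensional contingency table
Prerequisites: predicted counts $E_{i,j} \ge 5$ for all $i$, $j$
H0: classifications are independent
Statistic: table's chi-square, $\chi^2$
Statistic distribution: chi-square with (number of rows $- 1$)·(number of columns $- 1$) degrees of freedom
SPSS: Analyze → Descriptive Statistics → Crosstabs → Statistics → Chi-square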
Suppose, for example, that an automobile magazine is interested in determining the relationship between the size and manufacturer of newly purchased automobiles. One thousand recent buyers of cars made in Germany are randomly sampled, and each purchase is classified with respect to the size (small, intermediate, and large) and manufacturer of the automobile (Volkswagen, BMW, Opel, Mercedes). The data are summarized in the two-way table:

Size\Manufacturer   VW    BMW   Opel   Mercedes   Totals
Small               157   65    181    10         413
Intermediate        126   82    142    46         396
Large               58    45    60     28         191
Totals              341   192   383    84         1000

This table is called a contingency table.
SPSS: Analyze → Descriptive Statistics → Crosstabs
It presents multinomial count data classified in two dimensions, namely automobile size and manufacturer. Each count is indicated with $n_{i,j}$, where the first index refers to the row, the size, and the second index to the column, the manufacturer. We also indicate with $r_i$ the row totals and with $c_j$ the column totals, and these quantities are called marginal counts. The sample size is $n$ and coincides with the grand total, in our case 1000.

Size\Manufacturer   VW        BMW       Opel      Mercedes   Totals
Small               n_{1,1}   n_{1,2}   n_{1,3}   n_{1,4}    r_1
Intermediate        n_{2,1}   n_{2,2}   n_{2,3}   n_{2,4}    r_2
Large               n_{3,1}   n_{3,2}   n_{3,3}   n_{3,4}    r_3
Totals              c_1       c_2       c_3       c_4        n

This is a multinomial experiment with a total of $n = 1000$ trials, $3 \cdot 4 = 12$ cells or possible outcomes, and probabilities $p_{i,j}$ for the cells. If the 1000 recent buyers are randomly chosen, the trials are considered independent and the probabilities are viewed as remaining constant from trial to trial. We also define the marginal probabilities for rows and columns as $p_{i,\cdot} = \sum_j p_{i,j}$ and $p_{\cdot,j} = \sum_i p_{i,j}$.
In a two-dimensional classification experiment usually we are interested in checking whether one variable can influence the other. It may be helpful to calculate the row percentages, $n_{i,j} / r_i$, and the column percentages, $n_{i,j} / c_j$, as follows:

Row percentages:
Size\Manufacturer   VW      BMW     Opel    Mercedes   Totals
Small               38.0%   15.7%   43.8%   2.4%       100.0%
Intermediate        31.8%   20.7%   35.9%   11.6%      100.0%
Large               30.4%   23.6%   31.4%   14.7%      100.0%

Column percentages:
Size\Manufacturer   VW       BMW      Opel     Mercedes
Small               46.0%    33.9%    47.3%    11.9%
Intermediate        37.0%    42.7%    37.1%    54.8%
Large               17.0%    23.4%    15.7%    33.3%
Totals              100.0%   100.0%   100.0%   100.0%

Using row percentages we can show, for example, that among all small cars only 2.4% are produced by Mercedes compared to Opel, which has 43.8% of the market. Using column percentages instead we see that among all Mercedes cars 11.9% are small compared to 33.3% of large ones.
SPSS: Analyze → Descriptive Statistics → Crosstabs → Cells
Therefore, in a two-dimensional classification experiment we are not interested in whether the observed counts follow a predetermined distribution, since they are also influenced by the marginal counts (which depend on our choice of sample). We instead test whether the two classifications, manufacturer and size in our example, are independent.

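H0: row variable and column variable are independent, i.e. $p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for all $i$, $j$
H1: row variable and column variable are dependent, i.e. there is a couple $i$, $j$ for which $p_{i,j} \ne p_{i,\cdot}\, p_{\cdot,j}$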
That is, if we know which size of car a buyer will choose, does this information give us a clue about the manufacturer of the car that is going to be bought? In a probabilistic sense we know that independence of events $A$ and $B$ implies $P(A \cap B) = P(A)\, P(B)$. Similarly, in the contingency table analysis, if the two classifications are independent, the probability that an item is classified in any particular cell of the table is the product of the corresponding marginal probabilities. Thus, under the hypothesis of independence, we must have $p_{1,1} = p_{1,\cdot}\, p_{\cdot,1}$, $p_{1,2} = p_{1,\cdot}\, p_{\cdot,2}$ and so forth. To test the hypothesis of independence, we use the same reasoning as in the one-dimensional tests. First we calculate the predicted count in each cell assuming that the null hypothesis of independence is true, multiplying $n$ by the cell predicted probability: $E_{i,j} = n \cdot \dfrac{r_i}{n} \cdot \dfrac{c_j}{n} = \dfrac{r_i\, c_j}{n}$. In our example
Size\Manufacturer   VW             BMW            Opel           Mercedes      Totals
Small               413·341/1000   413·192/1000   413·383/1000   413·84/1000   413
Intermediate        396·341/1000   396·192/1000   396·383/1000   396·84/1000   396
Large               191·341/1000   191·192/1000   191·383/1000   191·84/1000   191
Totals              341            192            383            84            1000

Size\Manufacturer   VW      BMW     Opel    Mercedes   Totals
Small               140.8   79.3    158.2   34.7       413
Intermediate        135.0   76.0    151.7   33.3       396
Large               65.1    36.7    73.2    16.0       191
Totals              341     192     383     84         1000

As it can be seen, marginal counts have remained the same.
We use the chi-square statistic to compare the observed and predicted counts in each cell of the contingency table:

$$\chi^2 = \sum_i \sum_j \frac{(n_{i,j} - E_{i,j})^2}{E_{i,j}} = \frac{(157 - 140.8)^2}{140.8} + \frac{(65 - 79.3)^2}{79.3} + \dots + \frac{(28 - 16.0)^2}{16.0} = 45.81.$$

Degrees of freedom are $(3-1)(4-1) = 6$, significance is$^{20}$ 0.000003%: therefore we reject the hypothesis of independence and we conclude that the size and manufacturer of a car selected by a purchaser are dependent events. Using instead the critical regions method, for a significance level of 5% we get a critical value of 12.59 and therefore the statistic value falls into the rejection region.
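The predicted counts, the statistic and its significance for the car example can be checked with the sketch below. The function `chi2_right_tail_even_df` is a hypothetical helper written for this illustration: it uses the closed-form right-tail area of the chi-square distribution that exists when the degrees of freedom are even, which is the case here (6).

```python
import math


def chi2_right_tail_even_df(x, df):
    """Right-tail area of a chi-square distribution; valid only for even df."""
    if df % 2 != 0 or df < 2:
        raise ValueError("this closed form needs an even, positive df")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j  # builds (x/2)^j / j!
        total += term
    return math.exp(-half) * total


observed = [
    [157, 65, 181, 10],   # Small
    [126, 82, 142, 46],   # Intermediate
    [58, 45, 60, 28],     # Large
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Predicted count of a cell under independence: row total * column total / n.
predicted = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, pred_row in zip(observed, predicted)
    for o, e in zip(obs_row, pred_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
significance = chi2_right_tail_even_df(chi2, df)
print([[round(e, 1) for e in row] for row in predicted])
print(round(chi2, 2), df, significance)
```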

20 Significance can be found and the test can be performed in different ways: (1) using the English Microsoft Excel function CHIDIST(45.81;6), which gives us the probability of the right tail of the chi-square distribution; (2) looking into chi-square tables, which usually provide critical values for different significance levels; (3) using the English Microsoft Excel function CHIINV(5%;6), which gives us the critical value corresponding to 5% significance level; (4) the test can be performed also using the English Microsoft Excel function CHITEST which, given the observed table and the predicted table (which must be manually built), gives us directly the significance.


5. Which test to use?
While for some data situations it is evident which statistical test to use, such as for example multinomial experiments or when facing a single distribution, when having to compare two distributions there are several tests that may be used, according to what we want to check.
The Mann-Whitney test checks whether two distributions have the same position or not. We use it when we have two sets of data and we are simply interested in testing whether they come from distributions in the same position or not. Student's t test for two populations is a test for the same situation, but it tests whether the expected values of the distributions are the same and does not test their position.
The Wilcoxon signed rank test checks whether two distributions of paired data have the same position. It can be used only when data are paired and it analyses the difference case by case (intra-case, inside the same case). The same holds for Student's t test for paired data, which analyses the difference between expected values of the two samples on a case-by-case basis.
The Spearman rank correlation coefficient checks whether two distributions of paired data have the same order or a perfectly reverse order. It does not matter whether the data come from similar distributions or not, the important thing is that they are in order. It can be used only when data are paired and when we are simply interested in the order. Pearson's correlation coefficient applies to the same situation, but checks the effective data values and not the order.
There are however many cases where all the tests can be performed. Some theoretical examples:

• if data are in perfect reverse order (1, 2, 3, 4, 5 and 5, 4, 3, 2, 1), Spearman is equal to $-1$ (H0 rejected, therefore orders are related) indicating that the order is reversed, while the Mann-Whitney test and the Wilcoxon signed rank test accept H0 indicating that data may be in the same position;
• if data are perfectly shifted (1, 2, 3, 4, 5 and 3, 4, 5, 6, 7), the Wilcoxon signed rank test and the Mann-Whitney test reject H0 indicating that data do not have the same position, while Spearman is equal to $1$ (H0 rejected, therefore orders are related) indicating that the order is the same.
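The two theoretical examples can be explored numerically with the sketch below, which computes Spearman's coefficient (as Pearson's formula applied to ranks) and the two Mann-Whitney rank sums for each pair of data sets. Helper names are illustrative, and no critical values are looked up: the script only shows that reversed data give equal rank sums with a coefficient of $-1$, while shifted data give unequal rank sums with a coefficient of $1$.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Pearson's coefficient computed on the ranks of x and y."""
    u, v = average_ranks(x), average_ranks(y)
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    ss_uv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    ss_uu = sum((a - mu) ** 2 for a in u)
    ss_vv = sum((b - mv) ** 2 for b in v)
    return ss_uv / math.sqrt(ss_uu * ss_vv)


def rank_sums(x, y):
    """Mann-Whitney rank sums of the two samples in the pooled ranking."""
    ranks = average_ranks(list(x) + list(y))
    return sum(ranks[:len(x)]), sum(ranks[len(x):])


first = [1, 2, 3, 4, 5]
reversed_data = [5, 4, 3, 2, 1]
shifted_data = [3, 4, 5, 6, 7]

rs_reversed = spearman(first, reversed_data)
sums_reversed = rank_sums(first, reversed_data)
rs_shifted = spearman(first, shifted_data)
sums_shifted = rank_sums(first, shifted_data)
print(rs_reversed, sums_reversed)
print(rs_shifted, sums_shifted)
```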

Some more practical examples for non-parametric tests, which are however valid also for the corresponding parametric tests provided their prerequisites are satisfied.

• Given two students and their exam grades on the same 6 economics subjects, which one will you hire for a position in a bank?
Data are paired so we can use all three tests. However, we are not interested in whether the two students have the same order or not, but simply in whether their two distributions have the same position or not (and, if we want to do also one-tailed tests, whether one of the two students has exam grades shifted to the right). For example, suppose that grades are 30 29 28 27 26 25 and 25 26 27 28 29 30: in this case for us the two students are equivalent, but Spearman is $-1$ (H0 rejected, therefore orders are related) indicating that the order is reversed, while Mann-Whitney and Wilcoxon tests both accept indicating that data may come from the same position. Suppose instead that grades are 30 29 28 27 26 25 and 25 24 23 22 21 20: in this case the first student is evidently the best. Spearman is $1$ (H0 rejected, therefore orders are related) indicating that the order is the same, while Mann-Whitney and Wilcoxon tests both reject indicating that data come from different positions. So the best choice here is the Mann-Whitney test, followed by the Wilcoxon signed rank test since the differences exam by exam are not important. Spearman is not good here since we are not interested in the order.
Given two subjects and the grades given to the same 6 students, how can you test whether subject B has marks which have been inflated?
Data are paired so we can use all three tests. However, we are not interested in whether the two exams have the same order or not (because it can happen that a student is good in a subject but bad in another, due to personal preferences), but we are interested in whether their distributions have the same position. For example, suppose that grades are 30, 29, 28, 27, 26, 25 and 25, 26, 27, 28, 29, 30: in this case, even though the same student got different grades, for each student who got a high grade in A and a low one in B there is another who compensates with a low grade in A and a high one in B (and this is not an indication that grades have been inflated, but simply that students good in A are not good in B, due to personal preferences). So subject B has not been inflated. Spearman is -1 (H0 rejected, therefore orders are related) indicating that marks are in the reverse order, a piece of information which is totally useless here, while Mann-Whitney and Wilcoxon tests do not reject, indicating that marks may come from the same distribution. Suppose instead that grades are 25, 24, 23, 22, 21, 20 and 30, 29, 28, 27, 26, 25: in this case exams' grades are evidently inflated. Spearman is 1 (H0 rejected, therefore orders are related) indicating that marks are in the same order, which is useless information, while Mann-Whitney and Wilcoxon tests both reject, indicating that the marks' distributions do not have the same position. So the best choice here is Mann-Whitney test, followed by Wilcoxon signed rank test, since the differences student by student are not important. Spearman is not good here. If, on the other hand, we want to concentrate the attention on the grades' inflation student by student, Wilcoxon signed rank test is the best choice, followed by Mann-Whitney test.
Given two subjects and the grades given to the same 6 students, how can you test whether grades are consistent?
Data are paired so we can use all three tests. Consistent here means that good students in one subject are also good in the other. So in this case we are interested in discovering whether grades have the same order or not. For example, suppose that grades are 30, 29, 28, 27, 26, 25 and 25, 26, 27, 28, 29, 30: in this case it is clear that good students in the first subject are bad in the second. Spearman is -1 (H0 rejected, therefore orders are related) indicating that the order is reversed, while Mann-Whitney and Wilcoxon tests do not reject, indicating that marks may come from the same distribution, which is useless information in this case. Suppose instead that grades are 25, 24, 23, 22, 21, 20 and 30, 29, 28, 27, 26, 25: in this case, even though the second exam's grades are evidently inflated, at least good students in the first exam are still the best ones in the second. Spearman is 1 (H0 rejected, therefore orders are related) indicating that the order is the same, while Mann-Whitney and Wilcoxon tests both reject, indicating that data have different positions. Therefore Spearman is the best choice, and Mann-Whitney and Wilcoxon tests are not appropriate.
Given two subjects and the grades given to 4 students for subject A and to 8 students (including the previous 4) for subject B, how can you test whether subject B has marks which have been inflated?
Data are paired only for the first 4. So using Wilcoxon signed rank test implies taking very few subjects, and looking at the table with 4 cases we see that we must always accept. Therefore Mann-Whitney test with 4/8 subjects is the only test possible here.


6. Regression model
An important consideration in merchandising a product is the amount of money spent on advertising. Suppose you want to model the monthly sales revenue of a store as a function of the monthly advertising expenditure. First, you have to decide whether an exact relationship exists between these two variables, that is, whether it is possible to state the exact monthly revenue if the amount spent on advertising is known. We are going to study a situation when this is not possible. There are several reasons. First, sales depend on many variables other than advertising expenditure: time of year, the state of the general economy, inventory, and price structure. These variables can be included, along with the monthly advertising expenditure, in a model, but then it is still unlikely that we would be able to predict the monthly sales exactly. This happens due to random phenomena that cannot be predicted with certainty. For example, people may stop buying microwave appliances because of new findings concerning the harmful effects of electromagnetic radiation.
If we were to construct a model that hypothesized an exact relationship between variables, it would be called a deterministic model. For example, if we believe that $y$, the monthly sales revenue, will be exactly 5 times $x$, the monthly advertising expenditure, we write $y = 5x$. This deterministic relationship implies that $y$ can always be determined when $x$ is known. There is no allowance for error in this prediction. If, on the other hand, we believe that there will be unexplained variation in monthly sales, perhaps caused by important but not included variables or by random phenomena, we discard the deterministic model and use a model that accounts for this random error. This probabilistic model includes both a deterministic component and a random error component. For example, if we hypothesized that the sales $y$ is related to advertising expenditure $x$ by $y = 5x + \text{random error}$, we are hypothesizing a probabilistic relationship between $x$ and $y$.
In general the deterministic component may be any function of several variables. The simplest probabilistic model $y = \beta_0 + \beta_1 x + \varepsilon$ employs a linear function $\beta_0 + \beta_1 x$ of one independent variable as its deterministic component. Here $y$ is called the dependent variable, $x$ is the independent or predictor variable and $\varepsilon$ is the random error term. The latter is supposed to have zero mean and finite variance, i.e. to randomly fluctuate around a null value. Then $E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$, implying that the expected value of $y$ follows a straight line $\beta_0 + \beta_1 x$. The Greek symbols $\beta_0$ and $\beta_1$ are the model's parameters. They are not known and we have to estimate them from the available data.
The most common uses of a probabilistic model for making inferences can be divided into two categories. The first is the use of a probabilistic model for estimating the value of $y$ for a specific value of $x$ which is in the set of our data. The second use of the model usually entails predicting a value for $y$ corresponding to a new (that is, which is not in the set of data we are dealing with) value of $x$.

6.1 The least squares approach
We start with choosing a mathematical model $y = \beta_0 + \beta_1 x + \varepsilon$. Given $n$ coupled observations $(x_1, y_1), \dots, (x_n, y_n)$, we now want to find estimates for the parameters $\beta_0, \beta_1$ which fit this set of data the best.
Plotting the above couples in a Cartesian plane, we obtain the scatterplot corresponding to this data set. It is very unusual that all points belong to the same straight line. If it does not seem to be possible to have a single straight line passing through all of the couples, we may try to look for a straight line which deviates the least from them. As a measure for the deviation we may consider the sum of squared distances between the observed couples and the couples predicted by the line, $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. This is the essence of the least squares approach, which tries to find estimates for the parameters $\beta_0, \beta_1$ for which this quantity is the minimum possible value (see footnote 21).

Estimates are $\hat\beta_1 = SS_{xy} / SS_{xx}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ (remembering from section 4.12 on page 32 the definitions of $SS_{xy}$ and $SS_{xx}$). The straight line given by the linear function $\hat{y} = \hat\beta_0 + \hat\beta_1 x$ is called the least squares line or regression line. The value $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ can be considered as a prediction or estimate for $y_i$.
SPSS: Analyze → Regression → Linear
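The estimates of the least squares line can be computed directly from the sums of squares. The following is a minimal sketch, assuming the usual definitions SS_xy = sum of (x_i - mean x)(y_i - mean y) and SS_xx = sum of (x_i - mean x)^2; the function name and the five data points are hypothetical and only serve as illustration.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1), the intercept and slope of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    if ss_xx == 0:
        # All x values identical: the slope cannot be estimated.
        raise ValueError("all x values are identical")
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: monthly advertising expenditure and sales revenue.
advertising = [1, 2, 3, 4, 5]
revenue = [1, 1, 2, 2, 4]

b0, b1 = least_squares_line(advertising, revenue)
print(f"regression line: y = {b0:.2f} + {b1:.2f} x")
```

Here SS_xy = 7 and SS_xx = 10, so the slope is 0.7 and the intercept is 2 - 0.7 * 3 = -0.1.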
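Since the numerators in the expressions for $\hat\beta_1$ and Pearson's correlation coefficient $r$ (see section 4.12 on page 32) are identical, we see that $\hat\beta_1 = 0$ if and only if $r = 0$ and that $\hat\beta_1$ has the same sign as $r$.

Note that, dividing by $SS_{xx}$, we are assuming that $SS_{xx} \neq 0$. In fact, $SS_{xx}$ is equal to zero only when all the x values are identical, a case where it is clearly impossible to estimate $y$ based on $x$.

21 In formal terms, given $n \geq 2$ couples $(x_i, y_i)$, we consider the following function of two arguments $f(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$. This is, in fact, a sum of squared deviations of actually observed values $y_i$ from the quantities assigned to points $x_i$ by the linear function $\beta_0 + \beta_1 x$. We want to find a couple of values $\hat\beta_0$ and $\hat\beta_1$ such that this quantity has the minimum possible value (note that $f$ is never negative and it is 0 only when all the $(x_i, y_i)$ lie on a line). In order to find this minimum we differentiate the function $f$ with respect to $\beta_0$ and $\beta_1$:
$\frac{\partial f}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)$ and $\frac{\partial f}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i)$.
Equating the above partial derivatives to zero, we obtain
$\sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$ and $\sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$.
The first equation implies that $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$. Substituting this expression in the second equation, we get $\sum_{i=1}^{n} x_i (y_i - \bar{y}) - \hat\beta_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}) = 0$, and, remembering from section 4.12 on page 29 the definitions of $SS_{xy}$ and $SS_{xx}$, $\hat\beta_1 = SS_{xy} / SS_{xx}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.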
The simplest way to measure the quality of the linear model is to evaluate the contribution of $x$ in predicting $y$. We define the sum of squares of errors
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,
which is a measure of how close to 0 our errors are. However, this quantity depends strongly on the scale we are using: if we divide all our numbers by 10, this quantity would be reduced by a factor of 100! In order to have a scale-invariant measure we introduce
$R^2 = 1 - \frac{SSE}{SS_{yy}}$
which belongs to [0; 1] and is called coefficient of determination. It is interpreted as the proportion of the total sample variability around $\bar{y}$ that has been explained by the linear relationship between $x$ and $y$. The difference $SS_{yy} - SSE$ shows how much the total variability (when $x$ is not involved at all) has been reduced by using the best possible (in the least squares sense) linear approximation. Dividing by $SS_{yy}$, we get the proportion of this reduction as measured against $SS_{yy}$. Obviously, the larger $R^2$ is, the better the linear approximation is. Indeed, a larger $R^2$ implies a smaller $SSE$, in other words a smaller deviation of the predictions $\hat{y}_i$ from the actually observed $y_i$. For example, $R^2 = 0.6$ means that the sum of squares of deviations of predicted values from actually observed ones has been reduced by 60% by using the least squares linear predictions $\hat{y}_i$ instead of $\bar{y}$.
It can be easily demonstrated (see footnote 22) that the coefficient of determination is the square of Pearson's correlation coefficient,
$R^2 = r^2$.
This implies also that $\hat\beta_1$, $r$ and $R^2$ are either all 0 or all different from 0.
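A short numerical check can make the two properties of the coefficient of determination concrete: it does not change with the scale of the data, and in the simple linear model it coincides with the square of Pearson's correlation coefficient. This is a minimal sketch on hypothetical data; the function name `fit_and_r2` is illustrative.

```python
def fit_and_r2(xs, ys):
    """Return (SSE, coefficient of determination, Pearson r) for the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r = ss_xy / (ss_xx * ss_yy) ** 0.5
    return sse, 1 - sse / ss_yy, r


xs = [1, 2, 3, 4, 5]          # hypothetical data
ys = [1, 1, 2, 2, 4]

sse, r2, r = fit_and_r2(xs, ys)
# Same data with every number divided by 10.
sse_scaled, r2_scaled, _ = fit_and_r2([x / 10 for x in xs], [y / 10 for y in ys])

print(f"SSE = {sse:.4f}, scaled SSE = {sse_scaled:.6f}")
print(f"R^2 = {r2:.4f}, scaled R^2 = {r2_scaled:.4f}, r^2 = {r * r:.4f}")
```

The sum of squares of errors shrinks by a factor of 100 after rescaling, while the coefficient of determination stays the same.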

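22 The expression for $\hat\beta_0$ implies that $y_i - \hat{y}_i = (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x})$, so that $SSE = SS_{yy} - 2 \hat\beta_1 SS_{xy} + \hat\beta_1^2 SS_{xx}$. Inserting here $\hat\beta_1 = SS_{xy} / SS_{xx}$, we get $SSE = SS_{yy} - SS_{xy}^2 / SS_{xx}$. Hence, $R^2 = 1 - \frac{SSE}{SS_{yy}} = \frac{SS_{xy}^2}{SS_{xx} SS_{yy}} = r^2$.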

[Figure: two scatterplots, one labelled R² = 0.9799 and one labelled R² = 0.0044]

6.2 Statistical inference
So far all calculations have been done without any hypothesis concerning the nature of the dependence between the variables in question. Now we want to turn to the distributions used to evaluate the quality of the least squares estimates obtained above. This calls for more assumptions about the structure of the data:

whenever we consider a value $x_i$ we have the following relation $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. The values $\beta_0$ and $\beta_1$ are deterministic and unknown to us.
$\varepsilon_i$ are independent (meaning that $\varepsilon_i$ is independent from $\varepsilon_j$ for $i \neq j$) observations of a random variable $\varepsilon$, normally distributed as $N(0; \sigma^2)$, with zero expected value and a fixed variance $\sigma^2$ (the same for all $i$).

The values $\hat\varepsilon_i = y_i - \hat{y}_i$ are called the residuals. They are the estimates for the outcomes of random variable $\varepsilon$. It is interesting to note that residuals have the following feature
$\sum_{i=1}^{n} \hat\varepsilon_i = 0$
which automatically implies that any $\hat\varepsilon_i$ can be written as a function of the others. Therefore the $\hat\varepsilon_i$ are not independent.

Prerequisites: $\varepsilon_i$ are independent observations of a random variable $\varepsilon \sim N(0; \sigma^2)$
H0: $\beta_1 = 0$
Statistic: $t = \hat\beta_1 \Big/ \sqrt{\frac{SSE}{(n-2)\, SS_{xx}}}$
Statistic distribution: Student's t with $n - 2$ degrees of freedom. For $n > 31$ standard normal distribution.
SPSS: Analyze → Regression → Linear

Arguing about the usefulness of the simple linear regression model, we want to test whether $\beta_1$, of which we only have the estimate $\hat\beta_1$, is equal to zero or not. Indeed, when $\beta_1 = 0$, the deterministic part of the model does not change as $x$ varies and therefore the model is totally useless since $y$ does not depend on $x$. Therefore, we test:

H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$.

The statistic we use is
$t = \hat\beta_1 \Big/ \sqrt{\frac{SSE}{(n-2)\, SS_{xx}}}$
which has the Student's t distribution with $n - 2$ degrees of freedom. When we reject the null hypothesis, we have significant evidence that the slope of the regression line is not zero and therefore that there is a deterministic influence of the independent variable over the dependent variable. On the other hand, when we accept the null hypothesis we are not able to prove that $\beta_1$ is different from zero and we cannot argue whether there is an influence or not.
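The usefulness test for the slope can be sketched numerically. The code below assumes the usual form of the statistic, the estimated slope divided by its standard error sqrt(SSE / ((n - 2) SS_xx)); the data are hypothetical, and the critical value 3.182 is the two-tailed 5% value of Student's t with 3 degrees of freedom taken from standard tables.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (t, degrees of freedom) for testing H0: slope = 0."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = math.sqrt(sse / ((n - 2) * ss_xx))
    return b1 / standard_error, n - 2


xs = [1, 2, 3, 4, 5]          # hypothetical data
ys = [1, 1, 2, 2, 4]

t, df = slope_t_statistic(xs, ys)

# Two-tailed 5% critical value for 3 degrees of freedom (standard t table).
T_CRITICAL_3DF = 3.182
reject_h0 = abs(t) > T_CRITICAL_3DF

print(f"t = {t:.3f} with {df} degrees of freedom, reject H0: {reject_h0}")
```

With these five points the statistic is about 3.66, beyond the critical value, so H0 is rejected at the 5% level.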

6.3 Multivariate and non-linear regression model
We may also assume that the dependent variable $y$ is a function of $k$ independent arguments. For example, we may try to model the dependence of the annual revenue of a firm running supermarkets not only on the advertising expenditure $x_1$, but also on the money $x_2$ invested in the infrastructure of its shops. Trying to recover a relationship between $y$ and $x_i$, $i = 1, 2, \dots, k$, we can assume that it takes a linear form $y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \varepsilon$. Searching for the best values for such coefficients, we can attempt to use again the least squares approach (see footnote 23) to estimate the $k + 1$ coefficients.
SPSS: Analyze → Regression → Linear

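23 Introducing $f(\beta_0, \beta_1, \dots, \beta_k) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{1,i} - \dots - \beta_k x_{k,i})^2$ and looking for a minimum point $(\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k)$ of this function of $k + 1$ parameters $\beta_0, \beta_1, \dots, \beta_k$. As in the case of the simple linear model, the $k + 1$ estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ are obtained as a unique solution to a system of $k + 1$ linear equations $\frac{\partial f}{\partial \beta_j}(\hat\beta_0, \dots, \hat\beta_k) = 0$, $j = 0, 1, \dots, k$.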

The quality of this approximation may be assessed by looking at the multiple coefficient of determination, defined always as
$R^2 = 1 - \frac{SSE}{SS_{yy}}$.
As before, it is interpreted as the ratio of the explained variability to the total variability. Hence the larger this value is, the better this linear function fits the set of data in question. For multivariate regression models there is no alternative formula and no easy relation with Pearson's correlation coefficients. However, it is still often called $R^2$.
The coefficient $\beta_0$ has the same meaning as before: it is the predicted value of $y$ when all the $x_i$ are equal to 0. Instead, each coefficient $\beta_i$ is the increment of $y$ when the corresponding $x_i$ increments by 1 unit and at the same time all the other $x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_k$ maintain the same value. Thus each $\beta_i$ for $i \geq 1$ can be seen as the effect of its independent variable on the dependent variable.
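The least squares estimates of a multivariate model are the solution of a system of linear equations. The following is a minimal sketch that builds this system (the so-called normal equations) and solves it by Gaussian elimination; function names and data are hypothetical, and the dependent variable is generated without error so that the expected coefficients are known.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]      # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                       # back substitution
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def multiple_least_squares(rows, ys):
    """rows[i] holds the values of the k independent variables for case i."""
    design = [[1.0] + list(row) for row in rows]         # leading 1 for the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical cases (x1, x2); y is built exactly as 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

betas = multiple_least_squares(rows, ys)
print([round(b, 6) for b in betas])
```

Increasing x1 by one unit while keeping x2 fixed changes the prediction by the second estimated coefficient, which is the interpretation given above.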
In the previous models we have always looked for a linear relation between dependent and independent variables. There are cases, however, where theoretical reasons or the scatterplot itself suggest a non-linear relation. For example, the relation may be polynomial $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_k x^k$ or logarithmic $y = \beta_0 + \beta_1 \ln x$ or exponential $y = \beta_0 e^{\beta_1 x}$ or any complex combination of functions. Clearly, the more complex the function, the more coefficients need to be estimated. Estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_k$ may be obtained by the least squares approach as well.
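A logarithmic model is still linear in its coefficients, so it can be fitted with the simple least squares formulas after replacing x with ln(x). The following is a minimal sketch under that assumption; the x values are hypothetical and the y values are generated, without error, from a logarithmic curve with intercept 5.9685 and slope 0.9857.

```python
import math


def fit_line(us, ys):
    """Least squares intercept and slope of y on u."""
    n = len(us)
    u_bar, y_bar = sum(us) / n, sum(ys) / n
    ss_uy = sum((u - u_bar) * (y - y_bar) for u, y in zip(us, ys))
    ss_uu = sum((u - u_bar) ** 2 for u in us)
    b1 = ss_uy / ss_uu
    return y_bar - b1 * u_bar, b1


xs = [0.1, 0.2, 0.4, 0.8]                                 # hypothetical x values
ys = [5.9685 + 0.9857 * math.log(x) for x in xs]          # exact logarithmic curve

# Transform the independent variable, then fit a straight line in ln(x).
b0, b1 = fit_line([math.log(x) for x in xs], ys)
print(f"y = {b1:.4f} ln(x) + {b0:.4f}")
```

The same transformation trick does not apply directly to every non-linear form; models that are not linear in their coefficients need a dedicated non-linear procedure.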
[Figure: two scatterplots with fitted curves. Quadratic model. Logarithmic model: y = 0.9857 ln(x) + 5.9685, d = 0.8566]

SPSS: Analyze → Regression → Curve Estimation / Nonlinear

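6.4 Multivariate statistical inference

Prerequisites: $\varepsilon_i$ are independent observations of a random variable $\varepsilon \sim N(0; \sigma^2)$
H0: $\beta_i = 0$ for all $i \geq 1$
Statistic: $F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$
Statistic distribution: Fisher's F with $k$ and $n - k - 1$ degrees of freedom.
SPSS: Analyze → Regression → Curve Estimation / Linear / Nonlinear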


Since the dependent variable now is written as a linear function of $k$ independent variables $x_1, x_2, \dots, x_k$, the model is also termed a general linear model. Now it is explicitly postulated that the unknown deterministic part is a linear function with unknown coefficients. The value of $\beta_i$ determines the contribution of the independent variable $x_i$ and $\beta_0$ is the intercept. We define exactly as in the previous case the residuals $\hat\varepsilon_i = y_i - \hat{y}_i$ and we make the same assumptions as before. Now we can test the usefulness of the model

H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$
H1: there is at least an $i > 0$ for which $\beta_i \neq 0$.

The statistic we use is
$F = \frac{R^2 / k}{(1 - R^2) / (n - k - 1)}$
which is distributed as a Fisher's F with $k$ and $n - k - 1$ degrees of freedom with only a rejection region on the right. Note that when we reject we know that at least one coefficient is not zero but we do not know which one.
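As a small worked example, the statistic of the multivariate usefulness test can be computed from the coefficient of determination alone. This sketch assumes the standard form F = (R²/k) / ((1 - R²)/(n - k - 1)); the function name and the numbers are hypothetical.

```python
def usefulness_f(r2, n, k):
    """Return the F statistic and its degrees of freedom (k, n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("not enough observations for this number of variables")
    f_value = (r2 / k) / ((1 - r2) / (n - k - 1))
    return f_value, (k, n - k - 1)


# Hypothetical model: 23 observations, 2 independent variables, R^2 = 0.6.
f_value, dof = usefulness_f(0.6, n=23, k=2)
print(f"F = {f_value:.2f} with {dof[0]} and {dof[1]} degrees of freedom")
```

The value must then be compared with the right-tail critical value of Fisher's F for the same degrees of freedom.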

6.5 Qualitative independent variables
People distinguish between two types of data: qualitative and quantitative ones. Quantitative data are recorded on a meaningful numerical scale, whereas qualitative data are measured on a non-numerical or categorical scale. Thus, the Gross Domestic Product, the number of sold items, the kilowatt-hours of electricity used per day are all examples of quantitative variables. On the other hand, the gender, race, job title, and style of packing are all examples of qualitative variables. The possible values of a qualitative independent variable are referred to as categories. For example, the style of packing might have three possible categories: A, B, and C. Even if we designate these values arbitrarily as 1, 2, and 3, the numbers still represent categories and the variable is still qualitative.
Let us look at regression models with qualitative independent variables. Suppose we want to estimate the mean operating cost per kilometer of cars as a function of the car's manufacturer. Let there be three manufacturers of interest, which we identify as A, B and C. Then the automobile manufacturer is a single qualitative independent variable with three categories, A, B and C. Note that, as always with a qualitative independent variable, we cannot attach a quantitative measure to a given category. Even if we were to call the manufacturers 1, 2, and 3, the numbers would simply be identifiers of the manufacturers and would have no meaningful quantitative interpretation. Our objective is to write a single equation to predict the cost per kilometer based on the car's brand. This can be done as follows:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$, where
$x_1 = 1$ if the car is manufactured by B,
$x_1 = 0$ if the car is not manufactured by B;
$x_2 = 1$ if the car is manufactured by C,
$x_2 = 0$ if the car is not manufactured by C.
The variables $x_1$ and $x_2$ are not meaningful independent variables as they are in the case of the model with quantitative independent variables. Instead, they are dummy variables that make the model work.


To understand the meanings of the $\beta$ coefficients, let $x_1 = x_2 = 0$. This condition means that the car is manufactured by A (neither B nor C is manufacturing; hence it must be A). Then the model becomes
$y = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \varepsilon = \beta_0 + \varepsilon$.
Taking the expected value, $E(y) = \beta_0$ for cars manufactured by A. Therefore $\beta_0$ is the expected value for the cost of cars manufactured by A. Now suppose we want to represent the mean cost per kilometer for manufacturer B. Then we should let $x_1 = 1$ and $x_2 = 0$:
$y = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \varepsilon = \beta_0 + \beta_1 + \varepsilon$.
We have that $\beta_1 = E(y \mid \text{produced by B}) - E(y \mid \text{produced by A})$. Therefore this coefficient is the expected difference of cost when switching from A to B. Similarly, $\beta_2 = E(y \mid \text{produced by C}) - E(y \mid \text{produced by A})$.
Note that we are able to describe three categories of the qualitative variable with only two dummy variables. This is because the base level (manufacturer A, in this case) is accounted for by the intercept $\beta_0$. In general therefore for each qualitative variable we require a number of dummy variables equal to the number of categories minus 1.
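The interpretation of the coefficients of a dummy-variable model can be verified on a tiny data set. In the following sketch the cost figures are hypothetical; the coefficients are set from the group means as described above, and the code then checks that they satisfy the least squares conditions (the residuals, and the residuals multiplied by each dummy variable, sum to zero).

```python
# Hypothetical operating costs per kilometer for three manufacturers.
costs = {
    "A": [0.30, 0.34, 0.32],
    "B": [0.40, 0.44],
    "C": [0.25, 0.27, 0.29],
}


def dummies(manufacturer):
    """Two dummy variables for three categories; A is the base level (0, 0)."""
    return (1 if manufacturer == "B" else 0, 1 if manufacturer == "C" else 0)


def mean(values):
    return sum(values) / len(values)


beta0 = mean(costs["A"])                 # expected cost for A
beta1 = mean(costs["B"]) - beta0         # expected difference B - A
beta2 = mean(costs["C"]) - beta0         # expected difference C - A

# Residuals of every case together with its dummy variables.
cases = []
for manufacturer, values in costs.items():
    x1, x2 = dummies(manufacturer)
    for y in values:
        cases.append((x1, x2, y - (beta0 + beta1 * x1 + beta2 * x2)))

# Least squares conditions for the model with intercept, x1 and x2.
normal_equations = (
    sum(e for _, _, e in cases),
    sum(x1 * e for x1, _, e in cases),
    sum(x2 * e for _, x2, e in cases),
)
print(beta0, beta1, beta2)
print(normal_equations)
```

All three sums vanish up to rounding error, so the group means are indeed the least squares solution for this model.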
Since a model with dummy variables is a multivariate regression model, everything we said concerning usefulness tests also applies here. Moreover, it can be freely mixed with quantitative linear and non-linear models' components.
SPSS: dummy variables are handled automatically if variables are nominal or ordinal

6.6 Qualitative dependent variable
It is also possible to build a regression model with a qualitative dependent variable, provided that it has only two categories, arbitrarily indicated with 0 and 1. The difficulty of this model is the fact that the right side of the model, the term $f(x_1, x_2, \dots, x_k)$, provides continuous values while the left side has only two possible values. For example a linear regression model would yield silly results, larger than 1 or smaller than 0, such as

[Figure: scatterplot of 0/1 data with the fitted line y = 0.0494x + 0.1986, R² = 0.3682; x axis from 0 to 20, y axis from 0 to 1]

Therefore instead of using $f(x_1, x_2, \dots, x_k)$ as estimation function, we use its logistic transformation (the inverse of the logit function)
$\frac{1}{1 + e^{-f(x_1, x_2, \dots, x_k)}}$.
The function on the right side now goes from 0 to 1. Its shape is much better suited for interpolating values which are always 0 or 1:
[Figure: the same 0/1 data with an S-shaped fitted curve; x axis from 0 to 20]

Values between 0 and 1 can be interpreted as the probability for the dependent variable to take value 1.
SPSS: Analyze → Regression → Binary Logistic
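The difference between the two estimation functions can be seen numerically. In this minimal sketch the straight line uses the coefficients printed in the example figure (y = 0.0494x + 0.1986), while the coefficients inside the S-shaped function are hypothetical and chosen only for illustration.

```python
import math


def linear_prediction(x):
    """Straight line from the example figure: it is not bounded."""
    return 0.0494 * x + 0.1986


def logistic(f):
    """Map any real value f into the interval (0, 1)."""
    return 1 / (1 + math.exp(-f))


def logistic_prediction(x):
    """Hypothetical model: f(x) = -4 + 0.4 x passed through the logistic function."""
    return logistic(-4 + 0.4 * x)


for x in (0, 5, 10, 15, 20):
    print(x, round(linear_prediction(x), 4), round(logistic_prediction(x), 4))
```

At x = 20 the straight line predicts a value above 1, which cannot be read as a probability, while the transformed prediction stays strictly between 0 and 1 for every x.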

6.7 Problems of regression models
6.7.1 Number of observations
In order to work correctly, the least squares approach needs a number of observations that is at least equal to the number of parameters it needs to estimate. If this condition is not satisfied, the system of linear equations does not have a unique solution and thus the parameters' estimates cannot be determined. In practice however, the number of observations must be much larger than the number of parameters: as an empirical rule, observations should be at least 10 times the number of parameters used by the model. This is because whenever we add a parameter to the model we never get a smaller $R^2$; this seems to indicate that the new model is better, but it is only more complex, i.e. it is not simplifying reality but simply adapting itself to the data. If we build a regression model to understand reality and not to simulate it, keeping the model simple must be our priority.
6.7.2 Multicollinearity
Multicollinearity exists in a multivariate regression model when one or more of the independent variables used in regression depend in a deterministic way on each other. In this case the corresponding independent variables contribute redundant information. For example, suppose we want to construct a model to predict the fuel cost of a truck as a function of its load $x_1$ and the power $x_2$ of its engine. Clearly, these two variables are dependent since usually a powerful truck carries huge loads. Although both $x_1$ and $x_2$ contribute information for the prediction of fuel cost, the contributions are tautological and in the model the $\beta$ coefficients will not reflect the effect of each independent variable.
A simple way to detect multicollinearity is to calculate Pearson's correlation coefficient between each pair of independent variables in the model. When a calculated value differs significantly from zero, the variables in question are related and a multicollinearity problem exists.
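The pairwise check described above can be sketched as follows. The truck data are hypothetical, and the cutoff of 0.8 used to flag a pair is only an illustrative rule of thumb, not a formal significance test.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson's correlation coefficient between two lists of values."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / math.sqrt(ss_xx * ss_yy)


# Hypothetical independent variables of a fuel cost model.
trucks = {
    "load": [2, 4, 6, 8, 10],
    "power": [110, 190, 310, 390, 510],
    "age": [3, 1, 4, 1, 5],
}

THRESHOLD = 0.8   # illustrative cutoff, an assumption of this sketch

correlations = {
    pair: pearson(trucks[pair[0]], trucks[pair[1]])
    for pair in combinations(trucks, 2)
}
suspect = [pair for pair, r in correlations.items() if abs(r) > THRESHOLD]

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("possible multicollinearity:", suspect)
```

Only load and power are flagged here, matching the truck example in the text.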
6.7.3 Dependent errors and Durbin-Watson test
As we have seen in section 6.2, residuals are dependent. Residuals are the estimates of the outcomes of random variable $\varepsilon$, which, in order to perform usefulness tests, must all be independent. Since their estimates are dependent, we may legitimately argue that the independence hypothesis may not hold. In fact, there are many practical situations where it does not hold, in particular when the observed cases are taken at different times, since the cyclical component of a time series may result in deviations from the secular trend that tend to cluster alternately on the positive and negative sides of the trend. For example, if our dependent variable is the monthly Gross Domestic Product of a country, its cases are taken at different times and their values may be influenced by cyclical fluctuations which, not predicted by the deterministic side of the model, end up influencing the errors, which are therefore no longer independent.

Prerequisites: $\varepsilon_t \sim N(0; \sigma^2)$
H0: $\varepsilon_t$ are independent
Statistic: $DW = \frac{\sum_{t=2}^{n} (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^{n} \hat\varepsilon_t^2}$
Statistic distribution: Durbin-Watson's table
SPSS: Analyze → Regression → Linear → Statistics

Supposing now that the observations are taken at different times, and thus using $t$ as case index, we want to test

H0: $\varepsilon_t$ and $\varepsilon_{t-1}$ are not autocorrelated for every $t$
H1: $\varepsilon_t$ and $\varepsilon_{t-1}$ are autocorrelated for at least one $t$

The statistic is
$DW = \frac{\sum_{t=2}^{n} (\hat\varepsilon_t - \hat\varepsilon_{t-1})^2}{\sum_{t=1}^{n} \hat\varepsilon_t^2}$
and takes values from 0 to 4. Value 0 corresponds to a situation where all the $\hat\varepsilon_t$ are constant, and thus to a perfect correlation. On the other hand, when $\hat\varepsilon_t = -\hat\varepsilon_{t-1}$ we have $DW \approx 4$, which therefore corresponds to a perfect negative correlation.

A value of 2 is the value of no correlation, where H0 is accepted. Critical values are found in Durbin-Watson's table.
For example, if we get a Durbin-Watson statistic value of 3.8 for $n = 15$ and $k = 3$ (multivariate model with 3 independent variables) we get a critical value of 0.814. This means that the right critical value is 4 - 0.814 = 3.186 and therefore we reject. This means that errors are autocorrelated.
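[Figure: axis from 0 to 4. H0 probably false below 0.814; H0 probably true between 0.814 and 4 - 0.814; H0 probably false above 4 - 0.814. The observed value 3.8 falls in the right rejection region]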
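The behaviour of the Durbin-Watson statistic at its extremes, and the decision of the example above, can be reproduced with a few lines. This is a minimal sketch assuming the standard definition of the statistic (sum of squared successive differences of the residuals divided by the sum of squared residuals); the two residual sequences are illustrative, not the output of an actual regression, and the decision rule uses a single critical value as in the example.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def decide(dw, critical):
    """Reject H0 when the statistic is below critical or above 4 - critical."""
    if dw < critical or dw > 4 - critical:
        return "reject H0"
    return "accept H0"


constant = [1.0] * 15                            # perfectly correlated sequence
alternating = [(-1) ** t for t in range(15)]     # sign flips at every step

print(durbin_watson(constant))                   # exactly 0
print(durbin_watson(alternating))                # 4 * 14 / 15, close to 4
print(decide(3.8, 0.814))                        # the example with n = 15, k = 3
```

For a finite sequence the alternating case gives 4(n - 1)/n rather than exactly 4, which approaches 4 as n grows.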

If we manage to prove that the autocorrelation is not zero, then automatically $\varepsilon_t$ and $\varepsilon_{t-1}$ are correlated and thus (since independence implies zero correlation) they cannot be independent, and the usefulness test cannot be performed since its hypotheses do not hold true.
On the other hand, if the autocorrelation is zero, we cannot directly deduce that $\varepsilon_t$ and $\varepsilon_{t-1}$ are independent (since zero correlation does not imply independence). However, when the assumption $\varepsilon_t \sim N(0; \sigma^2)$ holds (with the errors jointly normal), zero correlation does imply independence. Therefore, if the null hypothesis is accepted, we can hope that the errors are independent.
In any case, even with dependent errors, the regression model continues to work and the determination coefficient continues to have its meaning. The only thing that cannot be performed is the usefulness test.

6.7.4 Heteroskedasticity
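Let the errors in a simple linear regression model have a varying variance, that is, even though all of them are independent and normally distributed with zero expected value, they come from random variables with a different variance $\sigma_i^2$. Usually this happens whenever the data come from observations which vary a lot in size, as is the case when our cases are firms of different sizes, or when data come from aggregated values, such as averages or sums.
In the case when we are able to know a priori the values $\sigma_i$, we may build a new model for which the errors' variances are the same. If we divide the model, written for each $i$, by $\sigma_i$, we get
$\frac{y_i}{\sigma_i} = \beta_0 \frac{1}{\sigma_i} + \beta_1 \frac{x_i}{\sigma_i} + \frac{\varepsilon_i}{\sigma_i}$
and calling $y'_i = \frac{y_i}{\sigma_i}$, $x'_{1,i} = \frac{1}{\sigma_i}$, $x'_{2,i} = \frac{x_i}{\sigma_i}$, and $\varepsilon'_i = \frac{\varepsilon_i}{\sigma_i}$ it becomes
$y'_i = \beta_0 x'_{1,i} + \beta_1 x'_{2,i} + \varepsilon'_i$
which is a particular multivariate regression model with two independent variables, $x'_1$ and $x'_2$, and no intercept. If we calculate the variance of $\varepsilon'_i$ we get that they are all the same:
$\mathrm{Var}(\varepsilon'_i) = \mathrm{Var}\left(\frac{\varepsilon_i}{\sigma_i}\right) = \frac{1}{\sigma_i^2} \mathrm{Var}(\varepsilon_i) = 1$.
Therefore we can use the new model, which does not present the heteroskedasticity problem, to estimate $y'_i$ and then multiply by $\sigma_i$ to get the $y_i$ estimations.
Since in practice the value of the standard deviation of errors is unknown, it is common to divide the regression model by a scale quantity which represents an estimate of the standard deviation, for example the size of the firm (number of employees or total budget).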
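The rescaling described above can be sketched in a few lines. The code divides every case by its (assumed known) standard deviation and fits the resulting two-variable model without intercept by solving its two least squares equations directly; names and data are hypothetical, and the dependent variable is generated without error so that the expected coefficients are known.

```python
def fit_scaled_model(xs, ys, sigmas):
    """Least squares for y/s = b0 * (1/s) + b1 * (x/s) + error, with no intercept."""
    u = [1 / s for s in sigmas]                   # first transformed variable, 1/sigma
    v = [x / s for x, s in zip(xs, sigmas)]       # second transformed variable, x/sigma
    w = [y / s for y, s in zip(ys, sigmas)]       # transformed dependent variable

    suu = sum(a * a for a in u)
    svv = sum(b * b for b in v)
    suv = sum(a * b for a, b in zip(u, v))
    suw = sum(a * c for a, c in zip(u, w))
    svw = sum(b * c for b, c in zip(v, w))

    # Solve  b0*suu + b1*suv = suw  and  b0*suv + b1*svv = svw  (Cramer's rule).
    det = suu * svv - suv * suv
    b0 = (suw * svv - suv * svw) / det
    b1 = (suu * svw - suv * suw) / det
    return b0, b1


xs = [1, 2, 3, 4, 5]                 # hypothetical cases
sigmas = [1, 1, 2, 2, 4]             # assumed known error standard deviations
ys = [2 + 3 * x for x in xs]         # exact relation, so b0 = 2 and b1 = 3

b0, b1 = fit_scaled_model(xs, ys, sigmas)
print(round(b0, 6), round(b1, 6))
```

The coefficients of the rescaled model are the same as those of the original one, so they keep their interpretation as intercept and slope.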
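A typical case where the scaling factor is known is aggregated data with heteroskedasticity. Let $y_{ij}$ be the observation on the $j$-th case in the $i$-th group, and consider the following regression
$y_{ij} = \beta_0 + \beta_1 x_{ij} + \varepsilon_{ij}$.
If we do not have the cases' single values but only aggregate observations on each group are available, then these expressions are summed over cases, that is
$Y_i = n_i \beta_0 + \beta_1 X_i + E_i$.
Here $Y_i = \sum_{j=1}^{n_i} y_{ij}$, $X_i = \sum_{j=1}^{n_i} x_{ij}$, $E_i = \sum_{j=1}^{n_i} \varepsilon_{ij}$ and $n_i$ is the number of cases in group $i$. If the original errors $\varepsilon_{ij}$, those referred to cases, satisfy our assumptions, i.e. are independent identically distributed with expected value 0 and variance $\sigma^2$, then the errors $E_i$ we use in the model are still independent with expected value 0 but with variance $n_i \sigma^2$ since
$\mathrm{Var}(E_i) = \mathrm{Var}\left(\sum_{j=1}^{n_i} \varepsilon_{ij}\right) = \sum_{j=1}^{n_i} \mathrm{Var}(\varepsilon_{ij}) = n_i \sigma^2$.
This means that the errors in our aggregated model are now heteroskedastic. However, we know their variances $n_i \sigma^2$. Hence, we have to divide the model by $\sigma \sqrt{n_i}$ and perform the ordinary least squares estimates on the transformed equation. In practice, in this case we do not know the value of $\sigma$ but we may simply divide by $\sqrt{n_i}$, a value that we know since $n_i$ is the number of cases in each group (for example, the number of firms in every sector). We will get a model with constant errors' variance $\sigma^2$.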
In any case, even with heteroskedastic errors, the regression model continues to work and the determination coefficient continues to have its meaning. The only thing that cannot be done is the usefulness test.
