
Tests of Significance for 2 × 2 Contingency Tables

Author(s): F. Yates
Source: Journal of the Royal Statistical Society. Series A (General), Vol. 147, No. 3 (1984), pp.
426-463
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2981577
Accessed: 26-08-2014 23:40 UTC


This content downloaded from 137.151.37.0 on Tue, 26 Aug 2014 23:40:14 UTC
All use subject to JSTOR Terms and Conditions
J. R. Statist. Soc. A (1984), 147, Part 3, pp. 426-463

Tests of Significance for 2 × 2 Contingency Tables

By F. YATES
Rothamsted Experimental Station

[Read before the Royal Statistical Society on Wednesday, March 21st, 1984, the President, Professor P. Armitage, in the Chair]

SUMMARY
Fisher's exact test, and the approximation to it by the continuity-corrected $\chi^2$ test, have repeatedly been attacked over the past 40 years, recently with the support of extensive computer exercises. The present paper argues, on common-sense grounds, supported by simple examples, that these attacks are misconceived, and are mainly due to uncritical acceptance of the Neyman-Pearson approach to tests of significance, the use of nominal levels, and refusal to accept the arguments for conditioning on the margins.
Two-sided tests have also added to the confusion; it is argued that the best definition of a two-sided probability is twice the observed one-tail probability.
Keywords: ANCILLARY STATISTICS; BINOMIAL PROBABILITIES; CONDITIONING; CONSERVATIVE TEST; CONTINGENCY TABLES; CONTINGENCY TEST; CONTINUITY CORRECTION; FISHER'S EXACT TEST; GOODNESS OF FIT; NEYMAN-PEARSON THEORY; ONE-TAIL PROBABILITIES; QUALITY CONTROL; SIGNIFICANCE TESTS; TWO-SIDED TESTS; YATES'S CORRECTION

1. INTRODUCTION
Tests of significance for evidence of association from data in 2 × 2 contingency tables have long been a matter of dispute. Ever since its introduction the legitimacy of Fisher's exact test has been under attack, mainly on the ground that it is too "conservative", i.e. that it gives fewer significant results than are justified by the evidence provided by the data, except for tables both margins of which are determined in advance.
These disputes are attributable to the fact that Neyman and Pearson, in the development of their theory of tests of significance, took it as axiomatic (or as Pearson preferred to call it, as a practical requirement) that the level of significance must be equal to the frequency with which the hypothesis is rejected in repeated sampling of any fixed population allowed by hypothesis. That this was indeed the basis of the earliest ideas on tests of significance is unquestionably true. These had their genesis in the concept of probability associated with games of chance, later extended, by the normal theory of errors, to errors in estimates based on continuous variables. The latter, of course, requires knowledge of the standard deviation or its estimation from the available data. It was Gosset's introduction of the t-test, which made due allowance for errors of estimation of $\sigma$ from sparse data, that led to Fisher's recognition that conditioning on ancillary statistics (i.e. statistics that provide information on the accuracy of the estimated quantity but do not themselves contain any information on this quantity) is a fundamental and valuable extension of the theory of tests of significance.
The t-test was acceptable to the Neyman-Pearson school because it did not transgress the frequency requirements of repeated sampling. The marginal totals of a contingency table have a function similar to $s^2$ in that they provide no information, additional to that provided by the body of the table, on lack of proportionality, but do provide information on the accuracy of estimates of association. However, like the Behrens-Fisher test, though for somewhat different reasons, the frequency requirements of repeated sampling are not satisfied.

Present address: Rothamsted Experimental Station, Harpenden, Herts AL5 2JQ.

© 1984 Royal Statistical Society 0035-9238/84/147426 $2.00

The frequency property, when it holds, is undoubtedly an easy way of explaining what is meant by tests of significance; indeed Fisher lent support to this mode of explanation in much of his early writing. Any numerate person, for example, is aware of the probabilities associated with simple games of chance, such as tossing a coin or spinning a roulette wheel, and anyone who has any experience of errors of measurement can fully appreciate the implications of the normal theory of errors. Because the frequency requirement is not violated he is not likely to dispute the t-test, though he may feel some unease if the number of degrees of freedom for error is very small. He is much more disposed, however, to doubt the need for conditioning in tests of 2 × 2 contingency tables, particularly as he is often anxious to use the results of such tests to prove that some association is indicated, and is consequently the more ready to believe that the exact test is "conservative".
It is this mistaken belief that has prompted me to write this paper. The points at issue are illustrated by simple examples, mostly based on very small numbers, partly because of ease of presentation, but also because it is here that the contradictions between the two theories are most evident. As the size of a sample is increased the discrepancies between the different approaches are steadily reduced, though it should be remembered that these discrepancies are primarily dependent on the smallest expectation of any cell, not on the total number of observations in the table.
In addition to conditioning, the consequences of using nominal levels of significance, such as 5 and 1 per cent, are also discussed; this practice is a further defect of the Neyman-Pearson theory which undoubtedly adds to the general confusion. Two-sided tests are a further source of disagreement.
Fisher's exact test is closely related to the $\chi^2$ test, which is itself a conditional test; indeed the continuity-corrected $\chi^2$ gives close approximations to the exact test, except for tables with very small expectations. A brief history of the development of these tests, and some comments on some recent papers criticising the exact test, are included in the present paper.
A more mathematical matter that has entered into the disputes is the question of whether the marginal values of a table really contain no additional information on the existence of association, and therefore qualify as ancillary statistics. This is a more technical issue which is relegated to a short appendix.
2. NOTATION
In what follows the numbers in a 2 × 2 table are represented by the symbols of Table 1, where, in general, $m_1 \le m_2$, $n_1 \le n_2$ and $q_1 = 1 - p_1$, etc. With given margins, $a$, which is then the cell with the smallest expectation, can assume integral values of 0 to $m_1$ if $m_1 \le n_1$, or 0 to $n_1$ if $m_1 > n_1$. The expectation $e$ of $a$, when there is no association, equals $m_1 n_1/N$. In some contexts $p_1$ and $p_2$ can be regarded as estimates of binomial probabilities $P_1$ and $P_2$. If $P_1 = P_2 = P$ say, a combined estimate of $P$ is given by $p$.

TABLE 1
Notation

          B1     B2     Total
A1        a      b      n1       p1 = a/n1
A2        c      d      n2       p2 = c/n2
Total     m1     m2     N        p  = m1/N

To save space numerical values for particular tables are given in the text in the form (a, b; c, d). Occasionally the form {$m_1$, $m_2$; $n_1$, $n_2$; $N$} is used for marginal values. In the text, also, the word "table" is used in two senses, (i) a table with particular numerical values for a, b, c, d, (ii) the family of tables, $m_1 + 1$ or $n_1 + 1$ in number, which contain values of a, b, c, d conforming to the given numerical marginal totals. It will be obvious from the context which sense is implied.
If $m_1 \le n_1$ and $p' = n_1/N = e/m_1$ the distribution of $a$ will tend to the binomial distribution $(p' + q')^{m_1}$ as $N \to \infty$ with $m_1$ and $e$ fixed; and similarly, with $m_1$ and $n_1$ interchanged, if $m_1 > n_1$.
3. EARLY HISTORY
In 1900 Karl Pearson introduced the $\chi^2$ test for goodness of fit. The test has proved to be of great utility in many contexts, but unfortunately Pearson did not recognize that in addition to deducting one degree of freedom for the number in the sample an additional degree of freedom must be deducted for each additional parameter estimated from the data. In testing for association in contingency tables the expectations of the cell values are estimated from the marginal totals, and the number of degrees of freedom for an r × s table is therefore (r − 1)(s − 1), not rs − 1. This error is particularly serious in 2 × 2 tables, for which $\chi^2$ with one degree of freedom must be used, not three degrees of freedom as Pearson thought.
Udny Yule was also very concerned with contingency tables, and introduced a test for association in 2 × 2 tables in his textbook, Introduction to the Theory of Statistics, first published in 1911, using the large-sample estimate $\sqrt{pq/n}$ for the standard error of a proportion $p$. This gives an estimate of the standard error of the observed difference, $p_1 - p_2$, of the two probabilities, of
$$\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}.$$
He also noted that if there is no difference $p_1$ and $p_2$ can both be replaced by their combined estimate $p = m_1/N$. With this combined estimate Yule's test is equivalent to Pearson's $\chi^2$ test with one degree of freedom. The most convenient formula for this is
$$\chi^2 = \frac{(ad - bc)^2 N}{m_1 m_2 n_1 n_2}. \tag{1}$$
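As an illustrative check of the equivalence just described, the following sketch computes formula (1) and the square of Yule's deviate with the combined estimate $p = m_1/N$ for a single table. The function names and the example table (7, 3; 2, 8) are assumptions introduced here for illustration only; they do not appear in the paper.

```python
from math import isclose, sqrt


def chi2_formula_1(a, b, c, d):
    """Formula (1): chi-squared with 1 df for the table (a, b; c, d)."""
    n1, n2 = a + b, c + d          # row totals
    m1, m2 = a + c, b + d          # column totals
    N = n1 + n2
    return (a * d - b * c) ** 2 * N / (m1 * m2 * n1 * n2)


def yule_deviate_pooled(a, b, c, d):
    """(p1 - p2) divided by its standard error, using the combined estimate p."""
    n1, n2 = a + b, c + d
    N = n1 + n2
    p1, p2 = a / n1, c / n2
    p = (a + c) / N                # combined estimate m1 / N
    q = 1 - p
    return (p1 - p2) / sqrt(p * q / n1 + p * q / n2)


# Hypothetical example table, chosen only for illustration.
example = (7, 3, 2, 8)
chi2 = chi2_formula_1(*example)
z = yule_deviate_pooled(*example)
print(round(chi2, 4), round(z * z, 4))
```

The two printed values agree, which is the sense in which the two tests are equivalent when the combined estimate is used.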
Yule did not mention the $\chi^2$ test in his textbook, but he evidently soon became aware of the discrepancy between his test and the $\chi^2$ test with three degrees of freedom, as he drew attention to it in Greenwood and Yule (1915), and shortly afterwards constructed 350 2 × 2 tables and 100 4 × 4 tables by mechanical devices designed to give independent distributions, and compared the $\chi^2$ distributions so obtained with those given by theory, but did not immediately publish his results.
The next event of importance was the publication by R. A. Fisher of his 1922 paper, in which he drew attention to Pearson's error. Although Yule was not fully satisfied with Fisher's proof he then simultaneously published the results of his sampling investigation, which, as was to be expected, confirmed Fisher's results. Pearson, as was his wont, did not immediately admit to any error, and a considerable controversy arose, but the correctness of Fisher's conclusions ultimately came to be generally accepted.
The $\chi^2$ test is of course approximate and will not hold exactly when the expectations of the separate cells of a distribution or contingency table are small. In Statistical Methods for Research Workers (1925) Fisher advanced a rule of thumb that the expected number in any one cell should not be less than 5. This rule may in fact be adequate, indeed conservative, for tests involving more than one degree of freedom. It is more suspect for tests involving only a single degree of freedom. Such tests are special in that there are two separate tails, which should be kept distinct. A $\chi^2$ test with 1 df is in fact equivalent, if the appropriate sign is attached to $\sqrt{\chi^2}$, to a test of a normal deviate with unit standard deviation.

If the exact distribution relevant to any particular problem is known the accuracy of the $\chi^2$ test (or any other approximate method) can be investigated by comparing its performance with that given by the exact distribution over a range of typical examples. In 1933 I became interested in such an investigation. The exact form of a binomial distribution with given $p$ was of course well known, but not that of a 2 × 2 table with given marginal totals. This was suggested to me by Fisher, and depends on the restriction that only sets of values conforming to both pairs of observed marginal totals are included in evaluating the probabilities, a restriction which is in fact also implicit in the $\chi^2$ test, as the expectations of the cell values are calculated from the marginal totals. Although then unpublished, the exact form must have been known to Fisher for some years, as is indicated by a cryptic passage in an earlier paper (Fisher, 1926): "an exact discussion would show that [for tables with 35 entries] the average value of $\chi^2$ should exceed unity by one part in 34".
The results of this investigation, reported in my 1934 paper, showed that the approximations given by $\chi^2$ to both binomial and 2 × 2 exact probabilities, particularly when the parent distributions are approximately symmetrical, are greatly improved by deducting 1/2 from the observed deviations from expectations when calculating $\chi^2$. This I termed the continuity correction. Formula (1) above then becomes
$$\chi_c^2 = \frac{(|ad - bc| - \tfrac{1}{2}N)^2 N}{m_1 m_2 n_1 n_2}.$$
If, however, the parent distribution, as for example a binomial distribution with $p$ differing greatly from 0.5, is markedly asymmetrical, the one-tail probability given by $\chi_c$, the square root of $\chi^2$ corrected for continuity, will necessarily deviate somewhat from the true probability, because the normal distribution to which it is referred is symmetrical. I therefore produced a small table, covering 2 × 2 tables and binomial and Poisson distributions, of the $\chi_c$ values for the 2.5 per cent and 0.5 per cent significance levels, corresponding to the 1.96 and 2.58 values for the normal distribution. This was later included in Statistical Tables (Fisher and Yates, 1938). Fisher also added sections on the continuity correction and the exact test to the 5th edition (1934) of Statistical Methods for Research Workers, Sections 21.01, 21.02. He also (1935) stated his reason for believing that the exact test should always be used, whether or not the margins are determined in advance.
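The effect of the continuity correction can be illustrated numerically. The sketch below compares the exact one-tail probability for one table with the normal one-tail probabilities obtained from the uncorrected and corrected $\chi$. The function names and the example table (2, 8; 7, 3) are assumptions made for this illustration; the normal tail is computed from the standard-library `erfc`.

```python
from math import comb, erfc, sqrt


def exact_lower_tail(a, b, c, d):
    """Exact probability of a value of a as small as or smaller than that observed."""
    n1, n2, m1 = a + b, c + d, a + c
    N = n1 + n2
    lowest = max(0, m1 - n2)       # smallest value a can take with these margins
    ways = sum(comb(n1, k) * comb(n2, m1 - k) for k in range(lowest, a + 1))
    return ways / comb(N, m1)


def normal_upper_tail(z):
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * erfc(z / sqrt(2))


def chi_one_tail(a, b, c, d, corrected):
    """One-tail probability from chi (the square root of chi-squared)."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    N = n1 + n2
    deviation = abs(a * d - b * c)
    if corrected:
        deviation = max(deviation - N / 2, 0.0)   # continuity correction
    chi = sqrt(deviation ** 2 * N / (m1 * m2 * n1 * n2))
    return normal_upper_tail(chi)


# Hypothetical example table with a below its expectation (4.5).
example = (2, 8, 7, 3)
exact = exact_lower_tail(*example)
uncorrected = chi_one_tail(*example, corrected=False)
corrected = chi_one_tail(*example, corrected=True)
print(exact, uncorrected, corrected)
```

For this table the corrected value lies much closer to the exact probability than the uncorrected one, in line with the behaviour described for roughly symmetrical parent distributions.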
One might have thought that this would settle the matter, particularly as the $\chi^2$ test had come to be recognized as the appropriate test when the expectations in all four cells of the table are reasonably large. However, in 1945 Barnard put forward a test which he claimed was more powerful than Fisher's exact test (Barnard, 1945). Taking the table to be generated by samples of $n_1$ and $n_2$ from two binomial distributions with probabilities $P_1$ and $P_2$, he argued that if $P_1 = P_2 = p$ and $n_1 = n_2 = 3$, for example, the probability of getting the table (3, 0; 0, 3) is $p^3 q^3$, which has the value 1/64 when $p = 0.5$, and is less than this for all other values of $p$, as opposed to a probability of 1/20 if both margins are regarded as fixed.
At first sight, Barnard's argument seems to make good sense. It is certainly true that if we take repeated pairs of samples from two binomials each with $p = 0.5$, 1 in 64 pairs on the average will give the table (3, 0; 0, 3). This, however, is equivalent to taking a sample of 6 from a single binomial, and dividing it at random into two triplets. From the binomial distribution $(1/2 + 1/2)^6$ the probability of getting values of 3, 3 in the $m_1$, $m_2$ margin is 20/64. Thus the combined probability of getting the table (3, 0; 0, 3) is 20/64 × 1/20 = 1/64. The crucial question, therefore, is whether the factor 20/64 should be included in the calculation of the significance probability.
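The factorisation in this argument can be checked with exact fractions. The following is a minimal sketch; the variable names are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 2)
q = 1 - p

# Unconditional probability of the table (3, 0; 0, 3): three successes in the
# first sample of 3 and none in the second sample of 3.
p_table_unconditional = p ** 3 * q ** 3

# Probability that the m1, m2 margin is 3, 3 in a single binomial sample of 6.
p_margin = comb(6, 3) * p ** 3 * q ** 3

# Probability of (3, 0; 0, 3) given that margin: one of the 20 ways of dividing
# the six observations at random into two triplets.
p_table_given_margin = Fraction(1, comb(6, 3))

print(p_margin, p_table_given_margin, p_margin * p_table_given_margin)
```

The product of the margin probability (20/64) and the conditional probability (1/20) reproduces the unconditional value 1/64; the dispute concerns only whether the first factor belongs in the significance probability.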
Barnard later (1947) elaborated his proposal into what he termed the CSM test. A somewhat similar proposal had also been made by E. B. Wilson (1941). Both proposals were soon abandoned, however, and Fisher was able to write in his discussion of the problem in Statistical Methods and Scientific Inference (1956):
"Professor Barnard has since then frankly avowed [(1949)] that further reflection has led him to the same conclusion [that only samples conforming to the observed marginal totals rank for inclusion] as Yates and Fisher, as indeed Wilson with equal generosity had done earlier."
That this conclusion is still not accepted in many quarters, however, is very evident from numerous recent publications. The simple numerical examples in the following sections will, it is hoped, throw further light on the points at issue, and illustrate the ways in which tests of significance, correctly applied, can be of help in the interpretation of 2 × 2 data.

4. COMPARATIVE TRIALS
If, say, we wish to test whether inoculation with a new serum reduces the risk of contracting some infectious disease, a group of $N$ individuals may be chosen for the test and $n_1$ of them selected at random for inoculation, leaving the remainder $n_2$ uninoculated. This determines the $n_1$, $n_2$ margin. Moreover if none of the $N$ individuals were inoculated a given number $m_1$ (unknown to the experimenter) would be fated to contract the disease. If the inoculation has no effect this will not be changed by the experiment. Subject to this condition, therefore, the $m_1$, $m_2$ margin is also determined. The statistical problem, if there is an apparent beneficial effect of inoculation, is the evaluation of the probability that the observed or a greater apparent effect can be attributed to chance causes resulting from the random assignment of the inoculation treatment; and conversely if there is an apparent deleterious effect the evaluation of the probability of getting a negative effect of this or greater magnitude by chance.
Given the marginal values, and random selection for inoculation, the probability of the occurrence of any particular set of cell values (a, b; c, d) when inoculation has no effect can be shown by combinatorial analysis to be
$$\frac{m_1!\, m_2!\, n_1!\, n_2!}{a!\, b!\, c!\, d!\, N!}.$$
This gives Fisher's exact distribution. If $m_1 \le n_1$ there will be $m_1 + 1$ terms, with values of $a$ from 0 to $m_1$; if $m_1 > n_1$ there will be $n_1 + 1$ terms. If $a_0$ is the observed value, summation of the probabilities from the lower or upper tail gives the probability of getting a value of $a \le$ or $\ge a_0$. This provides an exact test of the significance of an apparent association. With given values of $n_1$ and $n_2$ there will be a set of such distributions, $N + 1$ in number, the relevant distribution being determined by the observed values of $m_1$, $m_2$. The figures in brackets in Table 2 show the 11 distributions obtained when $n_1 = n_2 = 5$.
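The exact distribution and its tail sums can be sketched directly from the factorial expression. In the code below the function names are assumptions introduced for illustration; exact fractions are used so that the probabilities can be compared with the values quoted in the text.

```python
from fractions import Fraction
from math import factorial


def table_probability(a, b, c, d):
    """Probability of the table (a, b; c, d) given its margins (factorial formula)."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    N = n1 + n2
    numerator = factorial(m1) * factorial(m2) * factorial(n1) * factorial(n2)
    denominator = (factorial(a) * factorial(b) * factorial(c)
                   * factorial(d) * factorial(N))
    return Fraction(numerator, denominator)


def exact_distribution(n1, n2, m1):
    """Probabilities of every admissible value of a for the given margins."""
    lowest, highest = max(0, m1 - n2), min(m1, n1)
    return {a: table_probability(a, n1 - a, m1 - a, n2 - m1 + a)
            for a in range(lowest, highest + 1)}


def tail_probabilities(a0, n1, n2, m1):
    """Probabilities of a <= a0 and of a >= a0 under the exact distribution."""
    dist = exact_distribution(n1, n2, m1)
    lower = sum(prob for a, prob in dist.items() if a <= a0)
    upper = sum(prob for a, prob in dist.items() if a >= a0)
    return lower, upper


# The N + 1 = 11 distributions for n1 = n2 = 5, one for each value of m1.
distributions = {m1: exact_distribution(5, 5, m1) for m1 in range(11)}
print(len(distributions), float(distributions[5][0]))
```

With $n_1 = n_2 = 5$ and $m_1 = 5$, the extreme table (0, 5; 5, 0) has probability 1/252, about 0.004.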
It should be noted that Barnard did not, in his 1947 paper, discuss comparative trials, but uses the term, misleadingly in my opinion, for samples from two binomials.

5. SAMPLES FROM TWO BINOMIALS
In a comparative trial the individuals included are not necessarily chosen at random from a defined larger population; they may merely be selected as suitable experimental material. In many 2 × 2 tables, however, the data are in fact, or can be regarded as, samples from two defined large populations, in which case the two lines of the table constitute samples of $n_1$ and $n_2$ from two binomial distributions. If $P_1$ and $P_2$ are the binomial probabilities and there is no association, so that $P_1 = P_2 = P$ say, a combined estimate $p = m_1/N$ from the $m_1$, $m_2$ margin provides a sufficient estimate of $P$. Conditioning on this estimate, i.e. regarding the $m_1$, $m_2$ margin as "fixed", then gives Fisher's exact distribution. If, however, we do not impose this conditioning, and instead consider all combinations of the possible samples of $n_1$ and $n_2$ that can arise from the two binomial distributions, their associated probabilities, ranked in order of the values of $p_1 - p_2$, or in some other plausible manner, will provide a basis for an alternative test of whether $P_1$ differs significantly from $P_2$. This was the basis of Barnard's "more powerful" CSM test.
The following specific example may help to clarify thinking on this matter. Table 2 sets out the relative frequencies (probabilities × 1024) of the 36 possible outcomes with $n_1 = n_2 = 5$ and $p = 1/2$. In the table the outcomes are classified by the differences between $p_1$ and $p_2$ and by the values of the marginal totals $m_1$, $m_2$.

TABLE 2
[Table body not recoverable from the source: relative frequencies (probabilities × 1024) of the 36 possible outcomes with $n_1 = n_2 = 5$ and $p = 1/2$, classified by $p_1 - p_2$ and by the marginal totals $m_1$, $m_2$, with conditional probabilities in brackets.]

The table enables the probability of obtaining samples with any given characteristics, with or without conditioning on the $m_1$, $m_2$ margin, to be calculated. For samples in which $p_1 - p_2 \ge 0.6$, for example, the probability without conditioning is (10 + 25 + 10 + 5 + 5 + 1)/1024 = 0.055, or directly by the marginal totals (45 + 10 + 1)/1024. These marginal totals are in fact 1024 times the relative frequencies, shown in the right-hand margin, of the binomial $(1/2 + 1/2)^{10}$. Thus if we were betting on $p_1 - p_2 \ge 0.6$ without knowledge of the $m_1$, $m_2$ margin the fair betting odds would be (1024 − 56) : 56 or 17 : 1 approximately.
If, however, we were informed, by the person taking the samples, of the marginal totals $m_1$, $m_2$ for each particular sample, the conditional probabilities (shown in brackets) would enable us to make much more discriminating bets. We would, if we were wise, only bet when marginal totals of 5, 5, 7, 3 or 3, 7 occurred. For 5, 5 the probability of a successful bet is (25 + 1)/252 = 0.103, and for 7, 3 and 3, 7 is 10/120 = 0.083. For margins 6, 4 and 4, 6 the probability of success is only 5/210 = 0.024 and for the remainder is zero.
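The unconditional and conditional probabilities quoted above can be reproduced by enumerating the 36 outcomes. The sketch below assumes, as in the text, $n_1 = n_2 = 5$ and $p = 1/2$, so that each outcome (a, c) has relative frequency $\binom{5}{a}\binom{5}{c}$ out of 1024; the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb

ROW_SIZE = 5  # n1 = n2 = 5

# Relative frequency (probability x 1024) of each outcome (a, c) when p = 1/2.
frequency = {(a, c): comb(ROW_SIZE, a) * comb(ROW_SIZE, c)
             for a in range(ROW_SIZE + 1) for c in range(ROW_SIZE + 1)}
total = sum(frequency.values())


def bet_succeeds(a, c):
    """p1 - p2 >= 0.6 is the same as a - c >= 3 when both samples are of 5."""
    return a - c >= 3


# Probability without conditioning on the m1, m2 margin.
unconditional = Fraction(
    sum(f for (a, c), f in frequency.items() if bet_succeeds(a, c)), total)

# Probability conditional on each value of m1 = a + c.
wins, margin_total = defaultdict(int), defaultdict(int)
for (a, c), f in frequency.items():
    margin_total[a + c] += f
    if bet_succeeds(a, c):
        wins[a + c] += f
conditional = {m1: Fraction(wins[m1], margin_total[m1]) for m1 in margin_total}

print(float(unconditional))
for m1 in sorted(conditional):
    print(m1, float(conditional[m1]))
```

Only the margins with $m_1$ from 3 to 7 give a non-zero conditional probability, which is why a bettor who sees the margin has the advantage.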
Gambling of this type can easily be performed with two packs of cards. If, after shuffling, 5 cards are dealt from each pack, and if the numbers of red cards are at issue, this is equivalent, apart from the fact that the packs constitute finite populations, to independent samples of 5 from two binomial distributions each with $p = 1/2$. If the sample cards are laid on the table face downwards all we can say about the probability that there are three, four, or five more red cards in the sample from pack A than in that from pack B is that its overall value is 0.055, and this should govern the betting. If, however, the 10 sample cards are shuffled and then displayed face upwards, the total numbers of reds and blacks are immediately apparent, although we still do not know to which pack each individual card belongs. If one of the opponents is aware of the value of this information, and takes cognisance of it, while the other pins his faith on the overall probability, the latter is clearly likely to find himself considerably out of pocket.
To obtain more precise probabilities account must be taken of the fact that the samples are from finite populations of 52 cards. In such samples the probabilities of getting 0, 1, ..., 5 red cards are approximately (1, 6, 13, 13, 6, 1)/40, instead of the binomial probabilities of (1, 5, 10, 10, 5, 1)/32. Substitution of these new values in the diagonal margins of the square of values of Table 2, with corresponding adjustments to the values in the body of the square, gives an overall probability of 0.047 of an excess of 3 or more red cards in pack A, i.e. fair betting odds of 20 : 1 instead of 17 : 1; if the margins are revealed and bets are placed only for margins 5, 5, 7, 3, 3, 7 the average gain per bet at odds of 20 : 1 will be 70 per cent of the stake, compared with 68 per cent at odds of 17 : 1 for true binomial sampling.
The conditional probabilities differ somewhat from those given by the exact distribution for random sampling from two infinite populations with the same $p$. The last two values for the 5, 5 margin, for example, are 0.087 and 0.0024 instead of 0.099 and 0.0040. This may at first sight seem surprising, but a little consideration will show that generation of a table in this manner from two finite populations is not equivalent to the random allocation of treatments adopted for a comparative trial.
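The card-pack figures can be reproduced as follows. This sketch uses the approximate probabilities (1, 6, 13, 13, 6, 1)/40 quoted in the text, and compares them with exact hypergeometric values under the assumption of a standard 52-card pack containing 26 red cards; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

# Probabilities of 0..5 red cards among 5 cards dealt from one pack.
approximate = [Fraction(k, 40) for k in (1, 6, 13, 13, 6, 1)]     # as quoted
binomial = [Fraction(k, 32) for k in (1, 5, 10, 10, 5, 1)]        # p = 1/2
# Assumption: a standard pack of 52 cards, 26 of them red.
hypergeometric = [Fraction(comb(26, k) * comb(26, 5 - k), comb(52, 5))
                  for k in range(6)]


def excess_probability(row):
    """Probability that pack A shows at least 3 more red cards than pack B."""
    return sum(row[a] * row[c]
               for a in range(6) for c in range(6) if a - c >= 3)


def conditional_given_margin(row, a, m1):
    """Probability of a red cards in sample A given m1 red cards in all."""
    total = sum(row[k] * row[m1 - k] for k in range(6) if 0 <= m1 - k <= 5)
    return row[a] * row[m1 - a] / total


print(float(excess_probability(approximate)), float(excess_probability(binomial)))
print(float(conditional_given_margin(approximate, 4, 5)),
      float(conditional_given_margin(approximate, 5, 5)))
print(float(conditional_given_margin(binomial, 4, 5)),
      float(conditional_given_margin(binomial, 5, 5)))
```

The approximate values differ from the exact hypergeometric ones by less than 0.001 in each term, so either row leads to the same rounded overall probability.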
The frequencies in Table 2 relate to $p = 1/2$. Those for $p = 1/4$, for example, can be obtained by multiplying the values in the successive columns by 1, 3, 9, 27 .... This gives a total over the whole table of $4^{10}$ and an overall probability that $p_2 - p_1 \ge 0.6$ of 0.031 instead of 0.055. The conditional probabilities, based on knowledge of $m_1$, $m_2$, are, however, unaltered. The lower overall probability is due to differences in the frequency of occurrence of the different values of $m_1$, $m_2$; 3, 7 for example, will occur 81 times as frequently as 7, 3, and the five central $m_1$, $m_2$, which are the only ones in which $p_2 - p_1$ can possibly be $\ge 0.6$, will only occur in 55 per cent of all samples instead of 89 per cent.
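That the overall probability changes with $p$ while the conditional probabilities do not can be checked by direct enumeration. The sketch below assumes two independent binomial samples of 5 with a common probability $p$; the function names are illustrative.

```python
from fractions import Fraction
from math import comb

ROW_SIZE = 5  # n1 = n2 = 5


def outcome_probability(a, c, p):
    """Probability of a and c successes in two independent samples of 5."""
    q = 1 - p
    return (comb(ROW_SIZE, a) * comb(ROW_SIZE, c)
            * p ** (a + c) * q ** (2 * ROW_SIZE - a - c))


def overall_probability(p):
    """Probability that p2 - p1 >= 0.6, i.e. c - a >= 3, without conditioning."""
    return sum(outcome_probability(a, c, p)
               for a in range(6) for c in range(6) if c - a >= 3)


def margin_probability(m1, p):
    """Probability that the total number of successes a + c equals m1."""
    return sum(outcome_probability(a, m1 - a, p)
               for a in range(6) if 0 <= m1 - a <= 5)


def conditional_probability(m1, p):
    """Probability that c - a >= 3 given a + c = m1."""
    wins = sum(outcome_probability(a, m1 - a, p)
               for a in range(6) if 0 <= m1 - a <= 5 and (m1 - a) - a >= 3)
    return wins / margin_probability(m1, p)


half, quarter = Fraction(1, 2), Fraction(1, 4)
print(float(overall_probability(half)), float(overall_probability(quarter)))
print(margin_probability(3, quarter) / margin_probability(7, quarter))
```

Because the factor $p^{m_1} q^{N - m_1}$ is common to every outcome with the same margin, it cancels in the conditional probabilities, which therefore do not depend on $p$.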
From the above it will be seen that knowledge of the $m_1$, $m_2$ margin merely provides a measure of the sensitivity of the observed sample to departures from the null hypothesis. Although sometimes disputed (see the Appendix), it seems to me obvious, as it did to Fisher, that the margins of a 2 × 2 table, however generated, provide virtually no information on the existence of association. In samples from two binomials, for example, absence of association implies that $P_1 = P_2$: if $n_1 = n_2$, differences between $p_1$ and $p_2$ of a given magnitude but opposite signs occur with equal frequency, as is shown by Table 2; this does not hold if $n_1 \ne n_2$, but the mean value of $p_1 - p_2$ for given $m_1$, $m_2$ is still zero. $m_1$, $m_2$ are therefore ancillary statistics, in the Fisherian sense, and define a "recognizable subset" (Statistical Methods and Scientific Inference, pp. 32, 109). It is the probabilities of occurrence in the relevant subset that provide the correct basis for tests of significance. In other words, we must condition on the margins, whatever the origin of the table. Whether no, one or two margins are "fixed" in advance is irrelevant.
It is still sometimes represented (e.g. Kempthorne, 1979; Upton, 1982) that although conditioning on the margins is justified by the necessity for randomization in comparative trials such as the inoculation experiment described in Section 4, this does not apply to samples from two binomials, for which "more powerful" unconditional tests are available. This line of reasoning is fallacious. If, for example, the subjects for the inoculation trial of Section 4 had been obtained by selecting a random sample of 10 individuals from some larger population and then assigning these individuals at random to the inoculated and non-inoculated groups, we might alternatively have combined the two steps by taking samples of 5 individuals each from the population, which is notionally equivalent to dividing the population into two populations and taking a sample from each. The difference between a trial of this type and one on a haphazard collection of individuals is that (subject to the qualification that any extensive use of inoculation is likely to reduce the subsequent risk rate) any results that emerge relate to all the individuals in the parent population. Tests of significance are unaffected.
If the differences between two separate populations are being investigated the notional division above has a real existence, and the actual differences replace those produced by the imposed treatments. Statistically, therefore, the two situations are equivalent and the same tests of significance must be used.
With discontinuous data, subdivision of the possible outcomes into subsets does, of course, inevitably reduce the significance level of the more extreme outcomes, because only the probabilities of outcomes belonging to the relevant subset will enter into the calculation of the significance. It is this fact, I think, and the urge to find "more powerful" tests, regardless of their relevance, that gives rise to the fatal attraction of unconditional tests for discontinuous data. In continuous data conditional tests have long been accepted without question, at least provided that a significance level $P$ is attained with frequency $P$ in repeated sampling. In testing for significance of a linear regression, for example, the formula $V(b) = \sigma^2 / S(x - \bar{x})^2$ is used if the variance $\sigma^2$ of $y$ is known, or indeed if it is estimated from the observations, whether the values of $x$ are preassigned or random, provided only that any unknown parameters in the distribution of $x$ are unrelated to the parameters of interest.
6. CHANGES IN MARGINAL TOTALS DUE TO TREATMENT EFFECTS
Part of the reluctance to accept the fact that for the purpose of the test of significance in a comparative trial the $m_1$, $m_2$ margin must be taken as known, possibly stems from the knowledge that if the treatment does have an effect $m_1$ and $m_2$ will certainly be changed.
Consider a specific example. Suppose we are confident that an inoculation treatment provides sure protection against a certain disease and wish to demonstrate this by doing a trial on 10 subjects, 5 of which are to be inoculated, the other 5 not. The success of such a demonstration depends critically on the number of subjects "at risk", i.e. those who will contract the disease if uninoculated. Table 3 shows the distribution of significant and non-significant results for differing numbers at risk when the inoculation is completely successful. The table is constructed as follows. If all 10 subjects are at risk a table with the values (0, 5; 5, 0) will always be obtained, giving from Table 2 a significance level of 0.004. If only 8 are at risk then before inoculation $m_1$, $m_2$ will have the values 8, 2. The chances of obtaining initial cell values for inoculated, uninoculated, of (3, 2; 5, 0), (4, 1; 4, 1), (5, 0; 3, 2) are consequently, from Table 2, 0.222, 0.556, 0.222. On completion of the experiment the first (inoculated) row of each of these tables will become 0, 5, giving significance levels, again from Table 2, of 0.004, 0.024, 0.083 respectively.

TABLE 3
Percentages of significant results in trials with 10 subjects when inoculation gives certain protection against infection

             Significance                   Number at risk*
c, d         level (P)        10     9     8     7     6     5     4     3    <3
5, 0           0.004         100    50    22     8     2   0.4     -     -     -
4, 1           0.024           -    50    56    42    24    10     2     -     -
3, 2           0.083           -     -    22    42    48    40    24     8     -
2, 3 etc.    ≥ 0.222           -     -     -     8    26    50    74    92   100

* Those who will become infected if not inoculated.

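The construction of Table 3 can be sketched as follows. The code assumes, as in the text, that inoculation gives complete protection, so that the final table is always (0, 5; c, 5 − c), where c is the number of uninoculated subjects at risk; the function names are illustrative.

```python
from math import comb


def significance_level(c, n1=5, n2=5):
    """Exact one-tail probability for the final table (0, n1; c, n2 - c)."""
    return comb(n2, c) / comb(n1 + n2, c)


def outcome_percentages(at_risk, n1=5, n2=5):
    """Percentage of random allocations placing c at-risk subjects among the
    n2 uninoculated, for each admissible c."""
    N = n1 + n2
    lowest, highest = max(0, at_risk - n1), min(at_risk, n2)
    return {c: 100 * comb(at_risk, c) * comb(N - at_risk, n2 - c) / comb(N, n2)
            for c in range(lowest, highest + 1)}


for at_risk in (10, 9, 8, 7, 6, 5, 4, 3):
    column = outcome_percentages(at_risk)
    print(at_risk, {c: (round(pct, 1), round(significance_level(c), 3))
                    for c, pct in sorted(column.items(), reverse=True)})
```

Each column of the table is the hypergeometric distribution of c for the given number at risk, and each row label is the exact significance level attached to that value of c.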
This example illustrates another point. If all 10 subjects are at risk, and one of those inoculated contracts the disease, the table (1, 4; 5, 0) will be obtained. This has the same significance level, 0.024, as (0, 5; 4, 1), the second of the two outcomes above when 9 subjects are at risk, but the interpretation is different: here the claim that the inoculation is always successful is definitely disproved, whereas the latter result merely indicates that at least one of the subjects was not at risk.
A statistician reporting on a 2 × 2 table, therefore, should not regard determination of a formal significance level as his sole duty. The two extreme outcomes in an inoculation trial, (0, 5; 0, 5) and (5, 0; 5, 0), for example, both give $P = 1.0$; the former merely indicates that all or most of the test subjects were not at risk, and that further trials should be made on more suitable material; the latter that inoculation is clearly not very effective.
The latter result does not, of course, imply that inoculation provides no protection. Upper limits to $P_1$ at various significance levels ($P$) can be obtained from the limits of expectation for the $p$ of a binomial distribution; for $a = 0$, $n_1 = 5$ and $P = 0.1, 0.025, 0.005$ the upper limits for $p$ are 0.37, 0.52, 0.65 (Statistical Tables, Table VIII.1).
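The quoted limits can be reproduced under one natural reading of "limits of expectation", which is an assumption made here rather than something stated in the text: for $a = 0$ the upper limit is the value of $p$ at which the probability of observing no successes in $n_1$ trials, $(1 - p)^{n_1}$, falls to the chosen level $P$. The function name is illustrative.

```python
def upper_limit_zero_successes(n, level):
    """Largest p for which P(no successes in n trials) = (1 - p)**n >= level."""
    return 1 - level ** (1 / n)


for level in (0.1, 0.025, 0.005):
    print(level, round(upper_limit_zero_successes(5, level), 2))
```

Under this assumption the three limits round to the tabulated values.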
7. EFFECT OF INCREASING THE NUMBER OF CONTROLS
Many 2 x 2 comparative trialsconsistof the comparisonof a new treatment againstsome
standardor no treatment. In suchcases additionalcontrolscan oftenbe includedat littleextra
cost. If so therewill be a usefulgainin the sensitivity
of significance
testsand in theaccuracy
ofestimates ofthedifference betweenPi and P2-
As an example consider the effect on tests of significance of doubling the number of uninoculated subjects in the test of the last section. We now have n1 = 5, n2 = 10, giving 66 possible outcomes of the test. The percentages of significant results with varying numbers of subjects at risk are shown in Table 4. These are calculated in the same manner as those of Table 3. To facilitate comparison between the two tables some significance levels have been grouped together in Table 4.

TABLE 4
Effect of increasing the number of controls: percentages of significant results in trials
with 5 inoculated and 10 uninoculated subjects when inoculation gives certain protection

             Significance                     Number at risk
c            level (P)     > 12    12    11    10     9     8     7     6     5   < 5

10, 9, 8     < 0.01         100    74    41    17     5     1     -     -     -     -
7            0.019            -    26    44    40    24     9     2     -     -     -
6            0.042            -     -    15    35    42    33    16     4     -     -
5            0.084            -     -     -     8    25    39    39    25     8     -
< 5          ≥ 0.154          -     -     -     -     4    18    43    71    92   100

As is to be expected the addition of extra controls substantially increases the sensitivity of the test. If 80 per cent of the subjects are at risk, for example, then with additional controls 74 per cent of all tests attain a significance level < 0.01, and the remaining 26 per cent a level < 0.025, whereas without additional controls the corresponding figures are 22 per cent and 56 per cent, the remaining 22 per cent being non-significant.
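The entries of Table 4 can be checked with the following Python sketch. It rests on an assumption about how the table was calculated: that the subjects at risk fall at random among the 15 subjects, that every uninoculated subject at risk is infected, and that no inoculated subject is infected. The function names are illustrative only.

```python
from math import comb

N1, N2 = 5, 10  # inoculated and uninoculated subjects, as in the text


def p_value(c: int) -> float:
    """Exact one-tail P for the table (0, 5; c, 10 - c): all c infections fall on controls."""
    return comb(N2, c) / comb(N1 + N2, c)


def distribution_of_c(at_risk: int) -> dict:
    """Chance of each c when `at_risk` of the 15 subjects are at risk (assumed placed at random)."""
    total = comb(N1 + N2, at_risk)
    low, high = max(0, at_risk - N1), min(N2, at_risk)
    return {c: comb(N2, c) * comb(N1, at_risk - c) / total for c in range(low, high + 1)}


for at_risk in range(12, 4, -1):
    # percentages for each possible c, to be compared with one column of Table 4
    row = {c: round(100 * pr) for c, pr in distribution_of_c(at_risk).items()}
    print(at_risk, row)
```

With 12 subjects at risk (80 per cent of 15) the sketch gives 74 per cent for c = 8, 9 or 10 and 26 per cent for c = 7, in agreement with the column of Table 4 headed 12.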
8. FISHER'S TEA-TASTING EXPERIMENT
The example in the last section illustrates the insight that can be gained, when planning experiments on quantal data, by studying the performance of tests of significance under various circumstances, using postulated real effects. A further example which I found intriguing is an experiment described by Fisher in the Design of Experiments. Its object was to test a lady's claim that she could tell, when drinking tea, whether the milk had been poured before or after the tea. As designed by Fisher, the experiment provides a classic and somewhat rare example of the generation of a 2 × 2 table in which both margins are determined in advance.
In this experiment the lady was offered eight cups of tea, and was asked to decide which of these had the milk added first, and which last, having been informed there were in fact four of each kind. Suppose the lady has definite discriminating ability, but is sometimes in doubt as to the correct verdict. If there are no doubtful cases the outcome (4, 0; 0, 4) will result, giving a significance probability of 1/70. Uncertainty on one cup only will give the same result, as it will be assigned to the group with only three cups. The same will happen if there are two uncertainties both belonging to the same group. But if these belong one to each group, there will be a 50 per cent chance of wrong assignment. This will give the outcome (3, 1; 1, 3), and a probability of 17/70 of getting this or a better result. The unbracketed values in the "no rejects" columns of Table 5 summarize these results.
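The two probabilities quoted above follow from the hypergeometric distribution with both margins fixed at four. A minimal Python sketch (the function name is illustrative only):

```python
from fractions import Fraction
from math import comb


def tail_probability(correct: int, cups_per_kind: int = 4) -> Fraction:
    """P(at least `correct` milk-first cups rightly identified), both margins fixed."""
    total = comb(2 * cups_per_kind, cups_per_kind)
    ways = sum(
        comb(cups_per_kind, k) * comb(cups_per_kind, cups_per_kind - k)
        for k in range(correct, cups_per_kind + 1)
    )
    return Fraction(ways, total)


print(tail_probability(4))  # outcome (4, 0; 0, 4)
print(tail_probability(3))  # outcome (3, 1; 1, 3) or better
```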
If the lady is not informed in advance that there are four cups of each kind the m1, m2 margin is not determined, and she will consequently no longer be able to make correct assignments with certainty for the 1, 0 and 2, 0 distributions; there is also an additional possible pair of outcomes in the 1, 1 case. These cases are shown in square brackets. They indicate the advantage to the subject of the information on the constitution of what is submitted for test.
If, however, the lady is permitted to declare her uncertainties, and these are omitted from the assessment of significance, we obtain the results shown in the last two columns of Table 5. These generate a nicely graded set of probabilities which give a fairer assessment of the subject's true power of discrimination, not marred by the potential chance failure in the 1, 1 case, but giving due credit to correct clear-cut judgements.
9. NOMINAL LEVELS OF SIGNIFICANCE
A contributory cause of confusion that affects discontinuous data is the use of conventional nominal levels of significance such as 5 and 1 per cent. This was partly engendered by the use of the nominal significance probability for the argument in tables of t and the normal distribution. This did tend to encourage practical workers to think that if an experiment gives a non-significant result at the chosen level not only is the existence of a real effect not established, but that there is in fact no effect. This mode of thought was further encouraged by the mathematical symbolism adopted by the Neyman-Pearson school: H0: θ = 0, H1: θ ≠ 0, or the even more absurd, if θ can be negative, H0: θ = 0, H1: θ > 0.

TABLE 5
[Tea-tasting experiment: significance probabilities for the "no rejects" case and with declared uncertainties omitted. The body of the table is illegible in this copy.]

In quantitative experiments the practice of ornamenting tables by one, two or three stars to denote 5, 1 and 0.1 per cent significance is a convenient way of drawing attention to the more outstanding effects, though this does not obviate the need to report standard errors. With discontinuous data, however, the use of nominal levels can be seriously misleading. The chance of getting 8 or more heads in 10 tosses of a coin, for example, is 0.055, and that for 9 or more heads is 0.011, as can easily be calculated from the binomial (1/2 + 1/2)^10. Here no conditioning (except on the number in the sample) is involved, but provided the coin is unbiased only 1.1 per cent of all samples on average will be declared significant at a nominal level of 5 per cent. The actual significance probability attained should therefore always be given when reporting on discontinuous data.
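The coin-tossing figures are easily checked. The sketch below sums the upper tail of the binomial (1/2 + 1/2)^10; the function name is illustrative only.

```python
from math import comb


def upper_tail(k: int, n: int = 10) -> float:
    """P(k or more heads in n tosses of an unbiased coin)."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n


print(round(upper_tail(8), 3))  # 8 or more heads
print(round(upper_tail(9), 3))  # 9 or more heads
```

Since 8 or more heads gives a probability just above 0.05, a nominal 5 per cent test can only reject on 9 or more heads, which is why the attained level is 1.1 per cent.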
Concentration on single-tail nominal levels of 2.5 and 0.5 per cent is a defect in my 1934 paper, which reflects the current thinking of that time. It may be noted, however, that although Fisher was himself in large part responsible for the widespread use of nominal levels, he always gave actual significance levels when discussing discontinuous examples.
10. QUALITY CONTROL
Tests based on 2 × 2 tables are sometimes required for quality control. An early example is provided by Pearson (1947). This was for testing the performance of batches of small armour-piercing anti-tank shot. To test any particular batch a random sample of 12 shot from the batch and a similar sample of standard shot were fired at a test plate. This procedure was adopted because of unavoidable variations between different test plates and the limited number of shot that could be fired at any one plate. The batch was rejected if its performance was significantly inferior at some chosen nominal level of significance to that of the standard.
Pearson actually recommended use of the χ² test without correction for continuity, on the grounds that this provided a reasonable approximation to Barnard's unconditional CSM test. (See comments on Table 8, Section 12, for discussion on this point.) What he overlooked was that extreme values of the m1, m2 margin in either direction will always give non-significant results, whether or not the batch is defective. Such extreme values will occur more frequently if both p1 and p2 are near 1, or both are near 0. Variation between the test plates affects p1 and p2 jointly, thereby increasing these frequencies. Such results should be labelled "No verdict".
What the practical man requires to know, therefore, is the percentages of batches with varying degrees of defect which are likely to be passed, rejected, or returned for further testing. The procedure for determining these percentages may be illustrated, without involving excessive arithmetic, for n1 = n2 = 5. The results are set out in Table 6. p1 and p2 are the assumed probabilities of penetration of a shot from the standard and batch samples respectively. Values of 2/3, 1/2 and 1/3 were taken for p1 and values giving odds ratios of 1:1, 4:1 and infinity (i.e. complete failure) for p2.

TABLE 6
Quality control: percentages of batches on which the test gives no verdict, and
percentages of batches (other than those on which there is no verdict) which are
rejected (n1 = n2 = 5)

                        p2               (a) No verdict (%)     (b) Rejection (% excluding (a))
Odds ratio   p1:   2/3   1/2   1/3       2/3   1/2   1/3        2/3    1/2    1/3

1 : 1              2/3   1/2   1/3        30    11    30         6.3    6.1    6.3
4 : 1              1/3   1/5   1/9         9    25    61        32.8   32.9   34.0
∞                    0     0     0        21    50    79       100    100    100

p1 is the probability of penetration by the standard shot, p2 that for a shot from the batch. The
assumed values of p2 are those given by the odds ratio, p1q2/q1p2.

The results for p1 = p2 = 1/2 can be deduced directly from Table 2. As the table shows, the available reject levels are necessarily markedly different for alternating m1, m2 values. Those chosen were 0.083, 0.024, 0.103, 0.024, 0.083 for m1 = 7 to 3. The relative frequencies of different m1, m2 are given by the bottom line of the table. Excluding outcomes for which m1 = 0, 1, 2 or 8, 9, 10, therefore, the probability of obtaining a significant result when p1 = p2 = p = 1/2 is
(120 × 0.083 + 210 × 0.024 + ...)/(120 + 210 + ...) = (10 + 5 + ...)/912 = 56/912 = 0.061.
This is merely a weighted mean of the significance levels forced upon us by the discontinuity of the data. The weights depend on the value of p, but the weighted mean is little changed by changes in p. The weights for p = 1/3, for example, are given by the coefficients of the binomial (1/3 + 2/3)^10, and can be obtained by multiplying the values in the bottom line of Table 2 by 1, 2, 4, 8, ..., 1024, and similarly, in reverse order, for p = 2/3. This gives a probability of rejection of 0.063 for both p = 1/3 and 2/3.
The proportion of excluded outcomes, for which a re-test will be required, is, however, much more dependent on the value of p. For p = 1/2 the proportion will be 2(1 + 10 + 45)/1024 = 0.109, whereas for p = 1/3 it will be (1 + 20 + 180 + 11520 + 5120 + 1024)/3^10 = 0.303.
These values are exhibited, in percentage form, in the first line of Table 6. They tell us what may be expected if the batch being tested is equal to the standard. To see how effective the test is in detecting batches that are sub-standard we must ascertain what happens when p2 < p1. This can be done by constructing tables similar to Table 2 with new relative frequencies.
For p1 = 1/2, p2 = 1/5 (odds ratio 4:1), for example, the NW borderline of the square is unchanged, but the NE borderline must be replaced by the coefficients of the binomial (1/5 + 4/5)^5, i.e. by 1, 20, 160, 640, 1280, 1024. A new square of products is then formed, and the columns are summed to give the m1, m2 frequencies, which also serve as divisors for calculating the hypergeometric probabilities.
The results obtained from these further tables are shown in the second line of Table 6. The third line of Table 6 is easily obtained, as complete failure of a test batch can only give tables (a, b; 0, 5) where a and b have the binomial distribution (p1 + q1)^5. Tables with a = 3, 4 or 5 will be significant.
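The procedure can be sketched in Python as follows. The set REJECT is a reconstruction, not something stated explicitly in the text: it lists the outcomes (a, c), the numbers of penetrations by standard and batch shot, that correspond to the reject levels 0.083, 0.024, 0.103, 0.024, 0.083 quoted for m1 = 7 to 3. All names are illustrative only.

```python
from fractions import Fraction as F
from math import comb

N = 5  # shots in each sample (n1 = n2 = 5)
# Assumed rejection region: (a, c) pairs matching the quoted reject levels.
REJECT = {(5, 2), (5, 1), (5, 0), (4, 1), (4, 0), (3, 0)}
NO_VERDICT_M1 = {0, 1, 2, 8, 9, 10}  # margins excluded in the text


def binom(k, p):
    """Binomial probability of k penetrations out of N with penetration probability p."""
    return comb(N, k) * p ** k * (1 - p) ** (N - k)


def table6_cell(p1, p2):
    """Return (P(no verdict), P(rejection | a verdict is given))."""
    no_verdict = rejected = F(0)
    for a in range(N + 1):
        for c in range(N + 1):
            pr = binom(a, p1) * binom(c, p2)
            if a + c in NO_VERDICT_M1:
                no_verdict += pr
            elif (a, c) in REJECT:
                rejected += pr
    return no_verdict, rejected / (1 - no_verdict)


for p1, p2 in [(F(1, 2), F(1, 2)), (F(1, 3), F(1, 3)), (F(1, 2), F(1, 5)), (F(1, 2), F(0))]:
    nv, rej = table6_cell(p1, p2)
    print(p1, p2, round(100 * float(nv)), round(100 * float(rej), 1))
```

Exact fractions are used so that the results can be compared directly with the hand arithmetic above, for example 56/912 for p1 = p2 = 1/2.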
It is obvious from Table 6, as indeed is to be expected, that samples as small as 5 give only very rough tests. Taking an odds ratio of 4:1 as representing a serious degree of defect, only one third of such batches will be rejected, and this at a cost of rejecting 6 per cent of the batches which are up to standard.
The amount of re-testing required for various p1 shows clearly that p1 = 1/2 is the value to aim at, if, as is to be expected, the majority of batches are up to standard. The differences in the values of this part of the table are a reflection of variations in the expected marginal totals with different values of p1 and p2. Note also that the actual value of p1 for any particular test is not under complete control: it depends on the test plate actually used. If it was, and if the other variables could be similarly controlled, there would be no need to include a standard sample in each test.
The above example is only intended as an illustration of method. The vital point that emerges is the importance of recognising that some samples do not give any worthwhile information on the point at issue, and that the proportion of such samples is substantially increased by variation in p1. These uninformative samples always give a non-significant P, but can be identified by their marginal values.
It would be interesting to see how the test performs with larger n1 and n2. The same procedure can be followed, but the arithmetic is tedious on a desk or pocket calculator. It would, however, be a simple matter to program a computer to do all or at least the more onerous parts of the calculations.
11. RECENT CRITICISMS OF THE EXACT TEST AND THE CONTINUITY
CORRECTION
Failure to recognize the force of the arguments for conditioning outlined above, and evaluation of the performance of tests at nominal levels of significance, has resulted in numerous papers criticizing Fisher's exact test and the continuity correction, and many alternative tests have been devised. Upton (1982) examines no less than 22 tests for comparative trials, and gives references to 53 papers, 25 of them dated from 1970 onwards. This is by no means a complete bibliography of papers, even in English and American journals, on this subject.
Recently computers have been called into play to investigate more fully the performance of rival tests. Extensive tables have been published which, taken at their face value, may well deceive uncritical readers.
There is no need here to attempt any general review. A brief discussion of a paper by Berkson (1978) and four rejoinders to it by Barnard, Basu, Corsten and de Kroon, and Kempthorne (1979), together with Upton's paper, will serve to illustrate the present confusion of thought, and is particularly relevant because the papers by Berkson and Kempthorne are already being taken as guides to up-to-date thinking on the subject: Upton leans heavily on them in his preamble, and they were cited by Fienberg without adverse comment in a lecture series R. A. Fisher: An Appreciation (1980).

12. BERKSON'S "DISPRAISE"


Berkson considers three tests, which he denotes by TN, TC and TE. TN is what he terms the normal test. This he specifies in the same manner as did Yule, not specifically telling his readers (though he was clearly aware of it) that TN is the same as the χ² test without the correction for continuity, the customary formula for which is much more convenient for computation. TC is defined as the same test with "Yates' correction". TE is the exact test. At the end of the paper he forcefully concludes that, "at least for a comparative trial with n1 = n2, TN is preferable to TE [and by implication to TC] and TE should not be used".
How did he reach this conclusion? "Following the ideas of the Neyman-Pearson theory of tests of significance" and adopting the two-binomial model, he determined the frequencies α_e with which significant verdicts would be given by TN, TC and TE at nominal significance levels α = 0.05 and 0.01 (single tail) in repeated sampling of tables with n1 = n2 when p1 = p2. The table giving his results covers values of p = 0.1 (x 0.1) 0.9 and values of n = 5, 10, 20, 50, 100, 200 (the last for p = 0.5 only). The production of this table, and an associated table of the power of the tests, involved Berkson and his associates in a considerable computer exercise.
Table 7 gives an extract of his table for TN and TE for α = 0.05 and p = 0.5 and 0.2, 0.8.

TABLE 7
Berkson's α_e for α = 0.05

                 p = 0.5              p = 0.2, 0.8
n1, n2        TN        TE           TN        TE

5           0.0547    0.0107       0.0218    0.0023
10          0.0579    0.0211       0.0455    0.0150
20          0.0421    0.0213       0.0513    0.0226
50          0.0449    0.0287       0.0502    0.0288
100         0.0518    0.0384       0.0497    0.0350
200         0.0494    0.0400          -         -

(His values for TC are, as is to be expected, for the most part the same as those for TE.) The values for n1 = n2 = 5 when p = 0.5 can be verified from Table 2. For TE, only the tables in the last two lines attain significance at the 0.05 level. Their combined unconditional probability is (5 + 5 + 1)/1024 = 0.0107. For TN the three tables in the next line also attain significance, giving a combined unconditional probability of 56/1024 = 0.0547. Similar calculations with the frequencies in the successive columns of Table 2 multiplied by 1, 4, 4^2, etc. and a divisor of 5^10 give the values for p = 0.2, 0.8.
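The first line of Table 7 can be verified by direct enumeration, as in the following Python sketch. It assumes the two-binomial model with n1 = n2 = n, a one-tail test in the direction a > c (a and c being the numbers of successes in the two samples), and the uncorrected χ² referred to the normal distribution for TN. All function names are illustrative only and are not Berkson's.

```python
from math import comb, erfc, sqrt


def exact_p(a, c, n):
    """TE: one-tail hypergeometric P for the table (a, n - a; c, n - c)."""
    m1 = a + c
    total = comb(2 * n, m1)
    return sum(comb(n, x) * comb(n, m1 - x) for x in range(a, min(n, m1) + 1)) / total


def normal_p(a, c, n):
    """TN: one-tail P from the uncorrected chi-squared, i.e. z = (a - c) * sqrt(2n / (m1 * m2))."""
    m1 = a + c
    m2 = 2 * n - m1
    if m1 == 0 or m2 == 0:
        return 1.0  # no information in the table
    z = (a - c) * sqrt(2 * n / (m1 * m2))
    return 0.5 * erfc(z / sqrt(2))


def alpha_e(n, p, p_value, alpha=0.05):
    """Unconditional frequency of significant verdicts when p1 = p2 = p."""
    return sum(
        comb(n, a) * comb(n, c) * p ** (a + c) * (1 - p) ** (2 * n - a - c)
        for a in range(n + 1)
        for c in range(n + 1)
        if p_value(a, c, n) <= alpha
    )


for p in (0.5, 0.2):
    print(p, round(alpha_e(5, p, normal_p), 4), round(alpha_e(5, p, exact_p), 4))
```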
The fact that in the full table the values of α_e given by TN are close to the nominal α, except for small n, and p differing considerably from 0.5, convinced Berkson that TE was extremely conservative and that TN was the right test to use. The table is, however, irrelevant in any practical sense. The probabilities given by all three tests are in fact conditional, TC and TN because they are both based on χ² with 1 df. The justification for TC is that it provides a close approximation to the exact conditional test TE. Omission of the continuity correction from TN results in much higher values for χ², particularly for tables with one or more small marginal values, and TN consequently greatly exaggerates the conditional significance. It also exaggerates the unconditional significance, at least for p = 0.5, as is shown in Table 8. For the table (4, 1; 1, 4), for example, PE = 0.103, PC = 0.103, PN = 0.029, whereas the unconditional probability is 0.055.

TABLE 8
Values of PE, PC, PN, and unconditional P for n1 = n2 = 5

                                                                    Unconditional P
Table                          p1 - p2    PE       PC       PN       p = 0.5   p = 0.2, 0.8

(4, 1; 1, 4)                     0.6     0.103    0.103    0.029     0.0547     0.0218
(5, 0; 2, 3), (3, 2; 0, 5)       0.6     0.083    0.084    0.019     0.0303     0.0192
(5, 0; 1, 4), (4, 1; 0, 5)       0.8     0.024    0.026    0.0049    0.0107     0.0023
(5, 0; 0, 5)                     1.0     0.0040   0.0057   0.0008    0.0010     0.0001

Having established to his own satisfaction that TN is the correct test, Berkson gives an example of the contrasting performance of TN and TE in a hypothetical clinical trial (he even specifies it as "double blind") in which 30 out of 35 patients are cured by a new treatment and 24 out of 35 are cured by the currently used treatment. This gives PN = 0.0438, PE = 0.0767. Berkson suggests that the scientist concerned might reasonably consider TE to be "destructive" rather than "conservative".
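The three one-tail probabilities can be computed for any table (a, b; c, d) with the following Python sketch. It assumes that the observed deviation lies in the direction being tested, and the function names are illustrative only.

```python
from math import comb, erfc, sqrt


def one_tail_exact(a, b, c, d):
    """PE: hypergeometric probability of a or more in the first cell, margins fixed."""
    n1, n2, m1 = a + b, c + d, a + c
    total = comb(n1 + n2, m1)
    return sum(comb(n1, x) * comb(n2, m1 - x) for x in range(a, min(n1, m1) + 1)) / total


def one_tail_chi2(a, b, c, d, corrected=False):
    """PN (uncorrected) or PC (continuity-corrected): half the chi-squared probability, 1 df."""
    n1, n2, m1, m2 = a + b, c + d, a + c, b + d
    N = n1 + n2
    dev = abs(a * d - b * c) - (N / 2 if corrected else 0)
    z = max(dev, 0) * sqrt(N / (n1 * n2 * m1 * m2))
    return 0.5 * erfc(z / sqrt(2))


# Berkson's hypothetical trial: 30 of 35 cured against 24 of 35
print(round(one_tail_chi2(30, 5, 24, 11), 4), round(one_tail_exact(30, 5, 24, 11), 4))
# The table (4, 1; 1, 4) of Table 8
print(round(one_tail_exact(4, 1, 1, 4), 3),
      round(one_tail_chi2(4, 1, 1, 4, corrected=True), 3),
      round(one_tail_chi2(4, 1, 1, 4), 3))
```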
To be fair, Berkson, after referring to various authorities in support of his arguments in favour of TN, does quote from Fisher's 1935 paper, and after some discussion concludes: "If the significance P is taken to represent, not the frequencies of errors of the first kind, but a measure of the subjective credibility of the null hypothesis, objectified by equating it to fair betting odds, then the exact test with randomizing is the correct test." He then, however, continues: "But what investigating scientist would decide which of two drugs is the more effective by tossing a coin? Perhaps this is a crucial case in reference to the question as to whether statistics is concerned with decision or inference." This misses the point. The coin tossing is only used to eliminate selection bias. Nor would a scientist expect decisions to be taken solely on the basis of a single trial of this size. The significance level of 1 in 13 given by PE is by no means negligible evidence in favour of the new treatment.
For good measure, Berkson also questions Fisher's contention that the marginal values of a 2 × 2 table contain no information on lack of proportionality, citing various authorities, and even, in a supplementary paper (1978), advancing a most remarkable "proof" of his own!
13. REACTIONS TO BERKSON'S DISPRAISE
Berkson's attacks on the exact test elicited four replies. I need only comment briefly. They do little to clarify the real issues; indeed for the most part they only add further confusion.
The most remarkable is that by Kempthorne. He starts by defining three "origins": I, a double dichotomy (only N determined by the observer); II, two binomials (n1 and n2 determined); III, a comparative trial (n1 and n2 determined, with random assignment between them). After lengthy discussion and much rhetorical abuse of Fisher's arguments in Statistical Methods and Scientific Inference, and regret at Barnard's disavowal of the CSM test, he concludes that Berkson's study indicates that TN (though perhaps not as good as the CSM test) is appropriate for Origin II data, and "does what is sought very nicely and easily". He bases this conclusion on the results of Berkson's computations, summarized in Table 7 above. If he had worked out the CSM values for n1 = n2 = 5 (not a difficult task, and included in my Table 8) he might have realized that the impression created by Berkson's table is misleading.
Kempthorne does recognize that for Origin III only the exact TE test is appropriate, though on the somewhat weak ground that the individuals in the trial are not selected at random from a larger population. They can of course be so selected, and then assigned at random to the two treatments; if so, any results emerging from the trial will be relevant to the population from which the selection is made.
Even more surprisingly, "in the lack of additional knowledge" he opts for the exact test for Origin I data. Surely, if his arguments on Origin II were accepted, they would apply with equal or greater force to Origin I. Also it is scarcely a help to "readers with lack of time to read the whole of the material" to reproduce Berkson's definition of TN in the last paragraph of his conclusion, instead of telling them that it is χ² without the correction for continuity. Was he unaware of this?
Basu's paper contains a much briefer but equally rambling discussion of the problem. His final advice is to "act like a Bayesian", adding: "Data interpretation is not a scientific method. There cannot be a mindless weighing of evidence. Can I be truly objective unless I am completely ignorant of the subject?!"
Corsten and de Kroon's short paper is much more sensible. Accepting, without argument, that conditioning is appropriate, they conclude that "comparison of TN and TE at the honest basis of conditioning on k [the m1, m2 margin in my notation] discredits TN completely; the value of α_e is irrelevant in this context." They continue, however, with a section headed "Unconditional Testing" beginning: "Berkson's preference may still exist for those who reject conditional considerations at all in this problem." To cater for this preference they state: "It is customary advice to replace this [TE] by . . . the (unconditional) test statistic T1." T1 is in fact the same as that which Pearson used when correcting for continuity, and differs from TC only in the substitution of the factor N - 1 for N in the formula for χ². Their italics indicate that they think that use of a normal approximation in this way makes the test unconditional! I suspect that Berkson, and indeed Kempthorne, suffered from the same delusion.
Barnard's paper is mainly concerned with what he terms "test procedures", and is somewhat peripheral to Berkson's paper, but he very firmly recommends that only the exact test should be used by individual experimenters when reporting the results of their experiments.
14. UPTON'S PAPER
Upton's main object was to examine the performance of the many tests that have been proposed for 2 × 2 comparative trials. The definition he adopted for such trials was that of Barnard (1947), i.e. tables with one fixed margin. The main part of his paper is devoted to a description of 22 alternative tests, and a comparison of the performance of 17 of them in repeated unconditional sampling of two binomials with a common p and nominal α = 0.05 (apparently two-tail), over the whole range, 0 to 1, of p. In addition an attempt is made to assess "the overall accuracy" of the 17 tests, using several criteria. These criteria were evaluated on an assorted collection of no less than 20 tables, both for α = 0.05 and α = 0.01; only the results for α = 0.05 are reported. All this must have involved an immense amount of work, only made possible by modern computing aids.
In his summary Upton states: "Amongst other results it is shown that the exact test of Fisher, and the corresponding Yates correction to Pearson's χ² test, give tests which are both very conservative and inappropriate. The uncorrected χ² test performs well. On both empirical and theoretical grounds, the preferred test is the scaled version (N - 1)/N χ²." Essentially his procedure is that adopted by Berkson; he could in fact have presented his results in tabular form, as in Table 7, and his summary echoes that of Berkson. My criticisms of Berkson therefore apply equally to Upton. His suggested modification of the uncorrected χ² by the factor (N - 1)/N is trivial and lacks any sound theoretical basis. True, following Kempthorne, he does state in his recommendations at the end of the paper that "if the set of data being analysed cannot be regarded as a random sample from the population(s) of interest, as for example occurs in the self-selecting medical trial", only the exact test or the continuity-corrected χ² is appropriate, which is a slight advance on Berkson. Berkson did, however, confine his attention to one-tail tests, whereas Upton claims it is "more natural" to use two-tail tests, and adopts a prevalent but misleading procedure (described below) for deriving their associated probabilities. This has introduced further irregularities into his results.
Although there are numerous references to previous literature, it seems that Upton has made only a superficial study of many of the papers he cites, or has relied on comments on them by other authors. For example, his description of "Yates' correction to χ²" begins: "Yates (1934) observed that, as N increases, the hypergeometric distribution is increasingly well approximated by the normal distribution." I made no such observation. Nor is the statement, in the form he gives it, true; if n1 and p are held fixed, but n2, and therefore N, tend to infinity, the hypergeometric distribution tends to a binomial with n1 + 1 terms only. It was Pearson, in his 1947 paper, who applied a continuity correction to the hypergeometric normal approximation, {(N - 1)/N}χ². My continuity correction was applied directly to χ². Actually, though Pearson did not realize it, my correction performs on average better than his. Upton's further statement, based on a paragraph in Pearson's 1947 paper, that my correction "had been in common use since at least 1921", is incorrect, and results from a misinterpretation of Pearson's actual remarks.
The introductory sections suffer similarly. Had Upton studied Fisher's Statistical Methods and Scientific Inference, instead of accepting Kempthorne's "analysis" of it, he might have had doubts about the relevance of his investigation.

15. TWO-SIDED TESTS


With normally distributed continuous data the customary tabulation of t has encouraged statisticians to think in terms of two-sided tests. As the normal distribution and the associated t distributions are symmetrical this raises no problems, though it should be remembered that if an experiment shows a significant difference between two treatments at P = 0.04, say, and B has emerged as superior to A, this is equivalent to the statement that B is significantly better than A at the P = 0.02 level; also, if fiducial limits x ± t_.05 s_m are assigned to a mean m of which x is an estimate, the fiducial probability that m is below the lower limit is 0.025, not 0.05, and similarly it is 0.025 that m is above the upper limit.
If a continuous error distribution is symmetrical about the null value equal deviations in either direction will have equal one-tail probabilities; if the error distribution is not symmetrical these probabilities will be unequal. However any continuous distribution with a single maximum can be transformed into a normal distribution. Moreover in any one set of results the information on departures from the null hypothesis relates only to departures in the observed direction. Consequently the rule for determining the two-sided probability, if this is required, should be to double the observed one-tail probability. This is invariant under transformation, whereas basing two-sided probabilities on equal but opposite deviations is not.
Transformation of data to normal or approximately normal form is of course a well-known device for determining significance probabilities and fiducial limits. A classic example is provided by the correlation coefficient. As Fisher showed (see, for example, Statistical Methods for Research Workers, Section 35) the transformation
z = ½{log_e(1 + r) - log_e(1 - r)}
gives a distribution which is closely approximated by a normal distribution with variance 1/(n' - 3), where n' is the number of pairs of observations on which r is based. If, for example, we wish to assess the significance of the difference of an observed r from a theoretical expected value r0 of +0.75 (z0 = +0.97) the one-tail probability will be given directly by reference to a table of the standard normal integral with x = (z - 0.97)√(n' - 3).
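The z transformation is easily programmed. In the Python sketch below the observed r = 0.5 and n' = 30 are purely hypothetical values chosen for illustration; only r0 = 0.75 comes from the text, and the function names are illustrative only.

```python
from math import atanh, erfc, sqrt


def fisher_z(r: float) -> float:
    """z = 0.5 * (log(1 + r) - log(1 - r)), which is the inverse hyperbolic tangent of r."""
    return atanh(r)


def one_tail_p(r: float, r0: float, n_pairs: int) -> float:
    """One-tail probability of a deviation at least as large as that observed, in its own direction."""
    x = (fisher_z(r) - fisher_z(r0)) * sqrt(n_pairs - 3)
    return 0.5 * erfc(abs(x) / sqrt(2))


print(round(fisher_z(0.75), 2))   # z0 for r0 = +0.75
print(one_tail_p(0.5, 0.75, 30))  # hypothetical r and n'
```

In accordance with the rule given above, the two-sided probability, if required, would be obtained by doubling the value returned by one_tail_p.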
With discontinuous distributions there is a further problem. In a 2 × 2 table with n1 = n2 there will be pairs of points representing the integral divisions on the two tails which are equidistant from the expected value, e. These will have equal hypergeometric probabilities, as is shown by Table 2. If n1 ≠ n2, but 2e is integral, there will still be pairs of points equidistant from e, but also some points on the longer tail that are unpaired; the hypergeometric distribution will then be asymmetric, and the associated probabilities will be unequal. If 2e is not integral there will be no equidistant pairs. This last contingency may be termed mismatch.
Table 9 gives examples of the two-sided probabilities for the more extreme values of a, the observed value in the cell with the smallest expectation, obtained by (i) doubling the one-tail probability of the value actually observed, and by (ii) taking the sum of the one-tail probability of the observed value and the one-tail probability of the value on the opposite tail for which the deviation is equal to that of the observed value, or if there is mismatch (2e not integral) the value with the next greater deviation.

TABLE 9
Two-sided probabilities for four tables with given margins: contrasts between
(i) twice the exact one-tail probability; (ii) the sum of the exact probabilities of all
values with deviations greater than or equal to that of the observed deviation,
regardless of sign; (iii) the probability given by the continuity-corrected χ²

               Table A              Table B                 Table C                 Table D
a          (i), (ii)  (iii)     (i)    (ii)   (iii)     (i)    (ii)   (iii)     (i)    (ii)   (iii)

0            0.008    0.013    0.012  0.009  0.019     0.023  0.036  0.041     0.021  0.016  0.039
1            0.092    0.096    0.123  0.095  0.128     0.160  0.170  0.174     0.151  0.102  0.166
6            0.092    0.096    0.067  0.040  0.070     0.181  0.170  0.174     0.191  0.171  0.185
7            0.008    0.013    0.005  0.003  0.008     0.049  0.036  0.041     0.053  0.037  0.045
8              -        -        -      -      -       0.010  0.005  0.007     0.011  0.005  0.007

Exp'n (e)        3.5                 3.325                   3.5                     3.544

Table        a   b  20           a   b  19               a   b  20               a   b  20
             c   d  20           c   d  21               c   d  60               c   d  59
             7  33  40           7  33  40              14  66  80              14  65  79

In tables A and C 2e is integral, in tables B and D it is not. Table A is symmetric and the underlying distribution in table B is nearly so; tables C and D are markedly asymmetric. Consequently both methods give the same results in table A, but in table B all the probabilities given by method (ii) are one quarter to one third less than those of method (i). In table C the asymmetry of method (i) is obliterated by method (ii) except for unmatched extremes on the longer tail. This in itself seems unreasonable, as if a = 0 is observed, for example, its smaller probability should be regarded as giving stronger evidence for a departure from the null hypothesis than would the occurrence of a = 7.
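The three columns of Table 9 can be recomputed with the Python sketch below. The margins are those given at the foot of the table; the function names are illustrative only, and the output may differ from the printed table by a unit in the last place.

```python
from math import comb, erfc, sqrt


def pmf(x, n1, n2, m1):
    """Hypergeometric probability of x in cell a, given row totals n1, n2 and column total m1."""
    return comb(n1, x) * comb(n2, m1 - x) / comb(n1 + n2, m1)


def support(n1, n2, m1):
    return range(max(0, m1 - n2), min(n1, m1) + 1)


def method_i(a, n1, n2, m1):
    """(i) Twice the exact one-tail probability in the observed direction."""
    N = n1 + n2
    below = a * N < n1 * m1  # is a below its expectation e = n1 * m1 / N?
    tail = sum(pmf(x, n1, n2, m1) for x in support(n1, n2, m1)
               if (x <= a if below else x >= a))
    return 2 * tail


def method_ii(a, n1, n2, m1):
    """(ii) Sum over all values whose deviation from e is at least the observed one."""
    N = n1 + n2
    dev = abs(a * N - n1 * m1)  # N * |a - e|, kept in integers so that ties are exact
    return sum(pmf(x, n1, n2, m1) for x in support(n1, n2, m1)
               if abs(x * N - n1 * m1) >= dev)


def method_iii(a, n1, n2, m1):
    """(iii) Two-sided probability from the continuity-corrected chi-squared, 1 df."""
    N = n1 + n2
    m2 = N - m1
    dev = max(abs(a - n1 * m1 / N) - 0.5, 0.0)
    chi2 = dev ** 2 * N ** 3 / (n1 * n2 * m1 * m2)
    return erfc(sqrt(chi2 / 2))


margins = {"A": (20, 20, 7), "B": (19, 21, 7), "C": (20, 60, 14), "D": (20, 59, 14)}
for name, (n1, n2, m1) in margins.items():
    for a in (0, 1, 6, 7, 8):
        if a in support(n1, n2, m1):
            print(name, a, *(round(f(a, n1, n2, m1), 3)
                             for f in (method_i, method_ii, method_iii)))
```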
Table D differs from table C only in the rejection of a single observation from cell d (which in table C must contain at least 46 observations). This increases the expectation of a slightly; it therefore seems reasonable to expect that the probabilities for a = 0 and 1 will be slightly decreased and those for a = 6, 7 and 8 will be slightly increased, as is indeed the case for method (i). For method (ii), however, the changes for a = 6, 7 and 8 are trivial, but the decreases for a = 0 and 1 are large. The practical worker, confronted with this fact, might well conclude that however "natural" method (ii) appears to be at first sight it stands condemned on common-sense grounds.
What were Fisher's views on this matter? So far as I know they were never expressed in print, but his reply to a letter by D. J. Finney, a copy of which came my way by chance after drafting the above argument, is, I think, of sufficient interest to reproduce here. I am most grateful to Professor J. H. Bennett of Adelaide University and to Professor Finney for permission to quote this correspondence.
Finney's query arose from Fisher's letter to Science (1941), in which Fisher gave the one-tailed test (treated, not treated) for Wilson's example (5, 1; 1, 5). Finney noticed that Wilson's original statement of the problem really required a two-sided test, and that although this presented no difficulty in Wilson's example the solution was not clear for an asymmetrical table, for example (5, 3; 1, 5). As he wrote (May 28th, 1946): "How is he to test the null hypothesis that A and B are equally harmful, while considering deviations from equality in either direction? Simply to double the total probability for (5, 3; 1, 5) and (6, 2; 0, 6) scarcely seems appropriate, as it does not correspond to any discrete subdivision of cases at the other tail such as (1, 7; 5, 1) and (0, 8; 6, 0). Nor does there appear to me any obvious reason for calculating the probabilities for the two most extreme configurations at the other tail (keeping marginal totals unaltered) and adding their total to the appropriate probability for the tail at which the observations occur.
"Am I missing something very simple here? I cannot remember having seen this problem discussed, and should be grateful for your views."
To this Fisher replied (May 31st): "My dear Finney, Thanks for your letter. It is a good problem, but I believe I can defend the simple solution of doubling the total probability, not because it corresponds to any discrete subdivision of cases of the other tail, but because it corresponds with halving the probability, supposedly chosen in advance, with which the one observed is to be compared. That is to say, one may decide in advance that if the probability is less than one in forty in either direction then we shall consider if [that?], pending further investigation, the viruses are not pathologically equivalent.
"How does this strike you?"
16. USE OF χ_c² IN TWO-SIDED TESTS
A χ² test with 1 df is essentially a two-sided test. To obtain the one-tail probability, the probability obtained by reference to a χ² table must be halved, but its value is dependent on the deviation actually observed, regardless of whether the deviations on the two tails match or not. There will therefore be no underestimation of P in two-sided tests due to mismatch. As, however, equal deviations in opposite directions give equal χ² values, differences in significance due to asymmetry will be obliterated. This is apparent in table C of Table 9, where for example the values in column (iii) for a = 0 and 7 are both 0.041, whereas those in column (i) are 0.023 and 0.049 respectively. The effect of mismatch on column (ii) of table D, however, is eliminated in column (iii); the difference between the values 0.039 and 0.045 for a = 0 and 7 is solely due to the greater deviation from expectation for a = 0.
Table B, which differs trivially from table A, also illustrates the serious distortion of the column (ii) values due to mismatch: for a = 6, for example, the correct value 0.067 is reduced to 0.040, whereas the continuity-corrected χ² gives the value 0.070.
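The three two-sided probabilities compared in Table 9 can be computed directly for any single table. The following is a minimal sketch, not the author's own procedure: the function names are hypothetical, the labels (i), (ii) and (iii) follow the usage in the text (twice the exact one-tail probability; the sum over all deviations at least as large as that observed, regardless of sign; the continuity-corrected χ²), and Finney's asymmetrical table (5, 3; 1, 5) is used as input because Table 9 itself is not reproduced here.

```python
from math import comb, erfc, sqrt


def hypergeom_pmf(a, n1, n2, m1):
    """Probability of top-left cell a when all four margins are fixed."""
    return comb(n1, a) * comb(n2, m1 - a) / comb(n1 + n2, m1)


def two_sided_probabilities(a_obs, n1, n2, m1):
    """Return (p_i, p_ii, p_iii) for the table with top-left cell a_obs.

    n1, n2 are the row totals and m1 is the first column total
    (hypothetical argument names chosen for this sketch).
    """
    N = n1 + n2
    m2 = N - m1
    support = range(max(0, m1 - n2), min(n1, m1) + 1)
    pmf = {a: hypergeom_pmf(a, n1, n2, m1) for a in support}
    expected = n1 * m1 / N
    dev = a_obs - expected

    # (i) twice the exact one-tail probability on the observed side
    if dev >= 0:
        one_tail = sum(p for a, p in pmf.items() if a >= a_obs)
    else:
        one_tail = sum(p for a, p in pmf.items() if a <= a_obs)
    p_i = min(1.0, 2.0 * one_tail)

    # (ii) all deviations at least as large as the observed one, either sign
    tol = 1e-9  # guards against floating-point ties
    p_ii = sum(p for a, p in pmf.items() if abs(a - expected) >= abs(dev) - tol)

    # (iii) continuity-corrected chi-squared with 1 df, via the normal integral
    corrected = max(0.0, abs(dev) - 0.5)
    chi2_c = N**3 * corrected**2 / (n1 * n2 * m1 * m2)
    p_iii = erfc(sqrt(chi2_c) / sqrt(2.0))
    return p_i, p_ii, p_iii


# Finney's table (5, 3; 1, 5): row totals 8 and 6, first column total 6
p_i, p_ii, p_iii = two_sided_probabilities(5, 8, 6, 6)
print(f"(i) {p_i:.4f}  (ii) {p_ii:.4f}  (iii) {p_iii:.4f}")
```

For this table (i) is 728/3003 and (ii) is 413/3003, so the second definition gives a markedly smaller probability, while (iii) stays close to (i).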
Comparisons of columns (i) and (iii) of Table 9 give an indication of the accuracy of the χ_c² approximations in tables with small values of e. The largest discrepancies are of course those due to asymmetry, such as those in tables C and D. It is perhaps worth noting that linear interpolation in Table VIII of Statistical Tables, using χ_c as argument, gives considerably improved approximations. (The procedure is there illustrated in Example 5.) The approximations to twice the one-tail probabilities for table C, a = 0, 7, 8, for example, are 0.026, 0.049, 0.009, which agree well with 0.023, 0.049, 0.010.
The above approximations are based on the tabular values for the corresponding limiting contingency distributions, which will have margins {14, 42; 14, 42; 56}. Similar adjustments, using the tabular values for the corresponding binomial distribution (3/4 + 1/4)^14, give the approximations 0.016, 0.053, 0.011.
Now that computers are available it would be a relatively simple matter to provide a table for
adjusting χ_c values which is more detailed and easier to use. If corrections to χ_c are tabulated, with χ_c as argument, and a table of the normal probability integral is available, almost exact significance probabilities could rapidly be obtained with even the most primitive pocket calculators.
17. RECENT ATTACKS ON χ_c
Adoption of what I consider to be an inappropriate definition of two-sided tests ((ii) above) has resulted in numerous warnings against the use of the continuity-corrected χ² for such tests. A remarkable investigation was made by Haber (1980). Using a specially written computer program, he compared the performance of five different tests on all 2 x 2 tables, some 150 000 in number, for which N < 100, e > 1, and for which what he termed "the exact exceedance probability" (i.e. definition (ii), here denoted by PH) has a value between 0.001 and 0.1. The five tests considered were: the uncorrected χ², the continuity-corrected χ², two tests based on "Cochran's principle" (though Cochran himself did not support its use for two-sided tests), and a test proposed by Mantel. When 2e is an integer tests 3 and 5 are equivalent to the continuity-corrected χ², and test 4 is nearly so. (For a specification of these latter tests see Haber's paper.)
Haber tabulated his results in 60 groups, covering values of PH in the ranges 0.001-0.01 and 0.01-0.1 and grouped according to values of N and e. Results for the mean of R, Rmin and Rmax were reported, where for each test R = PA/PH, PA being the probability given by the test. Table 10 shows a typical panel of his table, that for PH (0.01-0.1), 3 ≤ e < 5, 40 ≤ N < 60. This includes contributions from 2924 tables.

TABLE 10
An example of Haber's results. R is the ratio of the test probability to Haber's "exact exceedance probability"

Test                                       Mean R   Rmin   Rmax

1. Uncorrected χ²                           0.64    0.39   1.03
2. Continuity-corrected χ²                  1.56    1.05   2.77
3. "Cochran's principle", first test        1.03    0.75   1.50
4. "Cochran's principle", second test       1.00    0.74   1.39
5. A test proposed by Mantel                1.13    0.80   1.56

Taken at their face value, these results indicate that not only does the uncorrected χ² considerably underestimate the true significance probability, but also that the continuity-corrected χ² seriously overestimates it. The other three tests also exhibit wide variations, though the mean values of R are reasonably close to 1.
This, however, gives an entirely false picture, as Table 9 shows. For the continuity-corrected χ² the values of Haber's R are given by the ratios of the probabilities in columns (iii) and (ii). The major differences from unity are due to mismatch. For tables B and D the four R in the range 0.01-0.1 of PH have values 1.35, 1.75, 2.43, 1.22, mean 1.69, whereas the two pairs in tables A and C have values 1.04 and 1.14. As the great majority of the 2924 tables in Table 10 (as in other parts of Haber's table) are subject to mismatch, the conformity of the four values for tables B and D with his reported results is not surprising.
If, however, the two-sided probability is defined as twice the one-tail probability of the observed value, the appropriate comparison for assessing the average accuracy of χ² is that between columns (i) and (iii), not (ii) and (iii). The averages for columns (i), (ii) and (iii) of the probabilities for which the column (i) value is in the range 0.01-0.1 are, for tables A and C, 0.053, 0.052, 0.056, and for tables B and D, 0.033, 0.021, 0.036. This shows, as is confirmed by Table VIII of Statistical Tables, that the average bias of P(χ²), if definition (i) is adopted, is small, even for small e. This, of course, does not imply that errors in P(χ²) or P(χ_c) are always negligible. They can be relatively large for tables with asymmetric distributions, as tables C and D show.

Had Haber segregated tables with mismatch in the presentation of his results, he would have produced a much more informative table, and one which is more relevant to practical requirements. In most planned comparative trials n1 = n2 or n2 is a small multiple, λ, of n1. If n1 = n2 there is never mismatch; if λ = 2 or 3, approximately one third or one half, respectively, of the resultant tables will be free of mismatch.
More fundamentally, if Haber had made comparisons of P(χ²) with definition (i) as well as (ii) of the exact probability, and if he had subdivided his results according to the degree of asymmetry, the real causes of discrepancies in P(χ²) would have been much more apparent. In a later paper (Haber, 1982) he does mention that a two-sided probability can be defined in several ways, but again adopts definition (ii) without discussion, and without even revealing what the alternatives are.
18. CONCLUSIONS
The following seem to be the most important conclusions that should be drawn from the above discussion:
1. In spite of the frequently expressed view that Fisher's exact test, based on conditioning on the marginal values, is too "conservative", it appears to be the only rational test, whether both, one, or neither of the margins are determined in advance. The marginal values determine the sensitivity of the test.
2. Unconditional tests, based on the two-binomial model, appeal because they are "more powerful" than the exact test, but stand condemned both by the general arguments for conditioning, and also because the random assignment of treatments in comparative trials leads to the exact test, in spite of only one margin being determined in advance.
3. The examples given in Table 9 confirm that the continuity-corrected χ² gives close approximations to the exact test, except when the underlying exact hypergeometric distribution is markedly skew. Condemnation of the continuity correction on the ground that it gives a test that is too conservative is merely the result of failure to recognize that the χ² test, like the exact test, is a conditional test.
4. Use of nominal levels of significance such as 5 and 1 per cent is a further source of confusion; the actual levels attained should always be given when analysing discontinuous data.
5. In general, one-tail probabilities should be used, but if a two-sided probability is required the best convention to adopt is to double the observed one-tail probability, as the χ² test does automatically. The common convention of taking the sum of the probabilities of all deviations greater than or equal to the observed deviation, regardless of sign, has no realistic justification.
6. In reporting on comparative trials and comparisons between different populations the responsibility of the statistician does not end with the evaluation of the significance probability. He should also comment on the actual p values, and their likely implications.
7. In planning quality control tests using 2 x 2 contrasts, the probabilities of acceptance, rejection and any required retesting should be calculated for various postulated levels of defect in a batch.

ACKNOWLEDGEMENTS
I should like to put on record my thanks to David Cox and George Barnard for helpful suggestions, most of which have been incorporated, to Donald Preece for help in various ways and to Dawn Johnson for her splendid work in the preparation of the final typescript.
REFERENCES
Barnard, G. A. (1945) A new test for 2 x 2 tables. Nature, 156, 177.
Barnard, G. A. (1947) Significance tests for 2 x 2 tables. Biometrika, 34, 123-138.
Barnard, G. A. (1949) Statistical inference. J. R. Statist. Soc. B, 11, 115-139.
Barnard, G. A. (1979) In contradiction to J. Berkson's dispraise: conditional tests can be more efficient. J. Statist. Planning and Inference, 3, 181-187.
Basu, D. (1979) Discussion of Joseph Berkson's paper "In dispraise of the exact test". J. Statist. Planning and Inference, 3, 189-192.

Berkson, J. (1978) In dispraise of the exact test. J. Statist. Planning and Inference, 2, 27-42.
Berkson, J. (1978) Do the marginal totals of the 2 x 2 table contain relevant information respecting the table proportions? J. Statist. Planning and Inference, 2, 43-44.
Corsten, L. C. A. and de Kroon, J. P. M. (1979) Comment on J. Berkson's paper "In dispraise of the exact test". J. Statist. Planning and Inference, 3, 193-197.
Cox, D. R. (1970) The Analysis of Binary Data. London: Methuen.
Fienberg, S. E. (1980) Fisher's contributions to the analysis of categorical data. In R. A. Fisher: An Appreciation (S. E. Fienberg and D. V. Hinkley, eds), pp. ?5-84. Berlin: Springer-Verlag.
Fisher, R. A. (1922) On the interpretation of χ² from contingency tables, and the calculation of P. J. R. Statist. Soc., 85, 87-94.
Fisher, R. A. (1925) Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd (5th edition, 1934).
Fisher, R. A. (1926) Bayes' theorem and the fourfold table. Eugen. Rev., 18, 32-33.
Fisher, R. A. (1935) The logic of inductive inference. J. R. Statist. Soc., 98, 39-54.
Fisher, R. A. (1935) The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R. A. (1941) The interpretation of experimental four-fold tables. Science, 94, 210-211.
Fisher, R. A. (1956) Statistical Methods and Scientific Inference. Edinburgh: Oliver and Boyd.
Fisher, R. A. and Yates, F. (1938) Statistical Tables for Biological, Agricultural and Medical Research. Edinburgh: Oliver and Boyd (6th edition 1963).
Greenwood, M. and Yule, G. U. (1915) The statistics of anti-cholera and anti-typhoid inoculations, and the interpretation of such statistics in general. Proc. R. Soc. Med. (Epidemiology), 8, 113-190.
Haber, M. (1980) A comparison of some continuity corrections for the χ² test on 2 x 2 tables. J. Amer. Statist. Assoc., 75, 510-515.
Haber, M. (1982) The continuity correction and statistical testing. Int. Statist. Rev., 50, 135-144.
Kempthorne, O. (1979) In dispraise of the exact test: reactions. J. Statist. Planning and Inference, 3, 199-213.
Pearson, E. S. (1947) The choice of statistical tests illustrated on the interpretation of data classed in a 2 x 2 table. Biometrika, 34, 139-167.
Pearson, K. (1900) On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Phil. Mag. (5), 50, 157-175.
Plackett, R. L. (1977) The marginal totals of a 2 x 2 table. Biometrika, 64, 37-42.
Sprott, D. A. (1975) Marginal and conditional sufficiency. Biometrika, 62, 599-605.
Upton, G. J. G. (1982) A comparison of alternative tests for the 2 x 2 comparative trial. J. R. Statist. Soc. A, 145, 86-105.
Wilson, E. B. (1941) The controlled experiment and the four-fold table. Science, 93, 557-560.
Yates, F. (1934) Contingency tables involving small numbers and the χ² test. J. R. Statist. Soc. Suppl., 1, 217-235.
Yates, F. (1939) An apparent inconsistency arising from tests of significance based on fiducial distributions of unknown parameters. Proc. Camb. Phil. Soc., 35, 579-591.
Yule, G. U. (1911) An Introduction to the Theory of Statistics. London: Griffin.

APPENDIX
Justification for Regarding the Margins as Ancillary
Fisher introduced his argument for the exact test with the statement: "If it be admitted that these marginal frequencies by themselves supply no information on the point at issue, namely, as to the proportionality of the frequencies in the body of the table, we may recognize the information they supply as wholly ancillary;". The form of this statement is, I think, unfortunate. Certainly it has stimulated others to attempt to demonstrate that the margins do contain some information on proportionality. Had Fisher phrased his statement differently, by saying "If it be admitted that these marginal frequencies supply no information, additional to that contained in the body of the table, . . .", possibly mentioning that this follows from the fact that p̂1 and p̂2 are sufficient statistics for p1 and p2, his grounds for treating the margins as ancillary statistics would have been clearer.
That the margins of a 2 x 2 table by themselves do not, except in extreme cases and in repeated sampling, contain any information on proportionality, is certainly true. In the analogous case in which quantal observations are replaced by quantitative measurements, however, the situation is somewhat different. Such measurements can be arranged in tabular form, analogous to that of a 2 x 2 table, as in Table 11. Assuming that the observations are normally distributed with the same variance about means μ1 and μ2, a test of significance of x̄1 - x̄2 is provided by the ordinary t-test, and the fiducial distribution of μ1 - μ2 is similarly available. Suppose, however, that the measurements cannot be assigned to groups A1 and A2, as might conceivably happen if the identity of the objects had been concealed when being measured and the record of the code used was then lost. If μ1 - μ2 is large compared with its standard error the histogram of all the measurements will exhibit two separate distributions, thus permitting full reconstitution of the data. Note, however, that if n1 = n2 we cannot say which distribution appertains to A1. If the distributions overlap, but with clear peaks, significance is still not in doubt, but an unbiased estimate of μ1 - μ2 and its standard error would require a curve-fitting exercise. If there are not two clear peaks there will be little information on μ1 - μ2 or on the significance of the difference. The fact that in certain cases there is quite definite information from the margin in no way invalidates the ancillarity argument on which the t-test and the associated fiducial distribution are based. This rests on the sufficiency of the estimates from the full data of the means and variances, and their independence. (See, for example, Yates, 1939.)

TABLE 11
The analogy between a 2 x 2 table and n1 and n2 values of a quantitative variate x relating to two samples A1 and A2

A1    n1 values of x1    x̄1
A2    n2 values of x2    x̄2

      N values of x      x̄

Turning now to 2 x 2 contingency tables, it is clear that separation of the marginal line into its A1 and A2 components is not in general possible. However, if p1 = 1 and p2 = 0 the table (n1, 0; 0, n2) will always be obtained, giving m1 = n1 in the bottom margin, whereas if p1 = p2 = p then m1 will have the binomial distribution (p + q)^N in successive samples. Thus if N is large and the difference between m1 and n1 is very small we might regard this as somewhat shaky evidence that p1 and p2 differ substantially. This, however, is an unprofitable speculation, as when n1 and n2 are large the values in the body of the table provide accurate information on p1 and p2. Note also that, as in the quantitative case, if n1 = n2 we cannot tell from the margin which of p1 and p2 is likely to be the greater.
This vestigial source of information from the margins may account for the results obtained by Plackett (1977), using the likelihood approach. Here I need only quote from his conclusions:
"Fisher did not say that the marginal frequencies supply no information, but he argued as if this were the case. The following remarks seem to confirm the intuitive view that the likelihood function provides little information about N.
. . .
(d) The procedures of inference used here are known to be asymptotically best in many problems. Their application has been inconclusive."
Sprott (1975) also tackled this problem, and took as an example a set of matched pairs, one of each pair being selected at random for treatment A1, the other being given treatment A2. Any such pair must be one of four types: (a), (1, 0; 0, 1); (b), (1, 0; 1, 0); (c), (0, 1; 0, 1); (d), (0, 1; 1, 0). Only (a) and (d), which both have margins {1, 1; 1, 1; 2}, give any information on the difference between A1 and A2. The obvious procedure under most circumstances is therefore to reject (b) and (c) pairs and include only (a) and (d) pairs in the analysis, as Cox (1970), to whom Sprott refers, recommends.
As Sprott was concerned only with the margins he could not adopt this course. Instead he divided the pairs into two groups, (a) and (d), which he termed discordant, and (b) and (c), which he termed concordant. If, for any one pair, the binomial probabilities, in my notation, are p1 and p2, the probabilities of (a), (b), (c) and (d) are p1q2, p1p2, q1q2, q1p2 respectively. The null hypothesis that p1 = p2 = p then gives the probability of a discordant pair as 2pq, and of a concordant pair as p² + q². Since 2pq ≤ 1/2 the probability that all n pairs are discordant is ≤ 1/2^n,
e.g. for 10 such pairs P < 0.001. From this he concluded that "it appears that some sort of information can on occasion be present in the marginal totals of a contingency table". All he is really saying is that if all pairs are discordant, one of the treatments is likely to be almost always a failure, and the other almost always a success. But even if p1 = 0.1 and p2 = 0.9, for example, with no variation between pairs, there is only a 1 in 7 chance that all of 10 pairs will be discordant. Nor can we tell from the margins which treatment is a success.
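The two figures quoted for ten matched pairs are easy to check. The sketch below is only an arithmetic illustration; the function name is hypothetical, and it assumes, as in the text, independent pairs with the same p1 and p2 for every pair, a discordant pair having probability p1q2 + q1p2.

```python
def prob_all_discordant(p1, p2, n_pairs):
    """Probability that every one of n_pairs matched pairs is discordant."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    p_discordant = p1 * q2 + q1 * p2  # types (a) and (d) in the text
    return p_discordant ** n_pairs


# Null hypothesis p1 = p2 = p: 2pq is largest at p = 1/2, giving the bound 1/2^n
null_bound = prob_all_discordant(0.5, 0.5, 10)

# The alternative considered in the text: p1 = 0.1, p2 = 0.9
alternative = prob_all_discordant(0.1, 0.9, 10)

print(f"null bound  = {null_bound:.6f}")
print(f"alternative = {alternative:.4f} (about 1 in {1 / alternative:.1f})")
```

Under the null hypothesis the bound is 1/1024, below 0.001, while under the strongly differing alternative the chance is 0.82^10, which lies between 1 in 8 and 1 in 7.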

DISCUSSION OF DR YATES'S PAPER


Professor G. A. Barnard (Retired): As Dr Yates points out, arguments about 2 x 2 tables have now gone on for 70 years, so perhaps it would be too much to hope to forestall a centenary, though his paper should go far towards reducing the audience at any such celebration. We must be grateful to him for this, and for emphasizing that there is much more to the interpretation of data, even as simple as this, than simple significance testing, so-called.
Dr Yates has dealt so exhaustively with randomized comparative trials that any residual controversy must now be concerned with the two binomial case, and I shall confine my remarks to this. We need to remember that p1 and p2 serve to parameterize this case fully, so that we cannot hope to make a fully adequate summary of the message in the data if we confine ourselves to significance testing, and that in relation to one parameter only rather than two. I stress this point because the Neymannian and the Bayesian approaches to these problems share the feature that we appear to be able to demand inferences of a particular kind, for example about a parameter chosen by us as the "parameter of interest", irrespective of other so-called "nuisance parameters". Bayesians succeed in doing this by adding untestable assumptions to the data, while Neymannians introduce arbitrary "principles" such as "similarity" or "unbiasedness" of a test which sometimes, as with the t test, happen to produce sensible answers, but at other times, as with the 2 x 2 table, produce absurdities (Barnard, 1982a). A Fisherian approaches data in the hope that it may throw light on questions of interest, but recognising that a given data set may not allow us to provide unambiguous answers to all the questions we would wish to ask.
In the present case we want to say something about the "difference" between p1 and p2, without reference to any complementary parameter. We must first find two parameters, θ and φ, such that θ represents "difference", while φ represents the complementary parameter which is to be neglected. The two parameters must be range-independent, and θ should reverse its sign on interchange of p1 and p2. The simple difference p1 - p2 cannot be range-independent of any complementary parameter, and the simplest (and perhaps, essentially the only) parameters satisfying these conditions are
$$\theta = \tfrac{1}{2}\{\ln(p_1/q_1) - \ln(p_2/q_2)\} \quad\text{and}\quad \phi = \tfrac{1}{2}\{\ln(p_1/q_1) + \ln(p_2/q_2)\},$$
the semi-difference of log odds, and the semi-sum. We then enquire whether the 2 x 2 table data allow us to infer something about θ without reference to φ.
In parametric cases such as this, the kind of information provided by the data is discoverable from the likelihood function which, for the table (3, 0; 0, 3) is, writing λ for e^θ and ν for e^φ,
$$L(\lambda,\nu) = \nu^3\lambda^6/\{\lambda+\nu+\lambda\nu^2+\nu\lambda^2\}^3,$$
which can be factorised into L1(λ) L2(λ, ν), where
$$L_1(\lambda) = \lambda^6/\{1+9\lambda^2+9\lambda^4+\lambda^6\}$$
and
$$L_2(\lambda,\nu) = \nu^3\{1+9\lambda^2+9\lambda^4+\lambda^6\}/\{\lambda+\nu+\lambda\nu^2+\nu\lambda^2\}^3.$$
If the second factor involved φ only, we could immediately infer the possibility of making fully efficient inferences about θ without regard to φ. As it is, we recognize L2 as the likelihood function provided by knowledge of the marginal totals {3, 3; 3, 3; 6}, while L1 is the further likelihood provided by knowledge of the contents (3, 0; 0, 3) of the table, given the marginal totals. We can imagine ourselves being informed of the data, first by being told the marginal totals and then, knowing these, being told the contents of the table. If we are prepared to neglect such information about θ as is provided by the marginal totals, then inferences about θ, irrespective of φ, are possible on the basis of the conditional distribution.
Should we be prepared to neglect the information in the marginal totals? The null value of θ is 0, and a representative alternative might be taken as 1, corresponding to an odds ratio p1q2/p2q1 of about 7.4. The likelihood ratio for θ = 1 against θ = 0, from the conditional distribution, is L1(e^1)/L1(e^0) = 8.3846, while assuming the most likely value, 0, for φ, the likelihood ratio from the distribution of the marginal totals is L2(e^1, e^0)/L2(e^0, e^0) = 1.1652. Measuring the amount of information by the logarithm of the likelihood ratios, there is almost 14 times as much information in the conditional distribution as in that of the marginal totals. And, of course, to use the information in the marginal totals requires some guess concerning the value of
φ, either in the form of a guessed distribution for φ, or a guess of a specific value. It is as if we have a sample of 15 observations, 14 of them reliable, the remaining one being affected by an unknown additional error. It would seem reasonable to ignore the doubtful observation and base an inference on the 14 reliable observations. Correspondingly, it seems reasonable to neglect such information as may be in the marginal totals, mixed up with the parameter φ, and base ourselves only on the conditional distribution. If we are estimating θ we should use the likelihood L1, while if we are only testing θ = 0 we obtain the conditional P value, 1/20. In doing this we should recognize that we are not using all the information in the data, the loss of information arising from our requirement of making an inference irrespective of φ.
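The factorisation of the likelihood for the table (3, 0; 0, 3) and the two likelihood ratios quoted above can be checked numerically. The following is a small sketch using the formulas as given in the text, with lam standing for λ = e^θ and nu for ν = e^φ; the function names are illustrative only.

```python
from math import exp, log


def full_likelihood(lam, nu):
    """L(lam, nu) for the table (3, 0; 0, 3)."""
    return nu**3 * lam**6 / (lam + nu + lam * nu**2 + nu * lam**2) ** 3


def conditional_likelihood(lam):
    """L1: likelihood from the table contents given the margins {3, 3; 3, 3; 6}."""
    return lam**6 / (1 + 9 * lam**2 + 9 * lam**4 + lam**6)


def marginal_likelihood(lam, nu):
    """L2: likelihood from the marginal totals alone."""
    poly = 1 + 9 * lam**2 + 9 * lam**4 + lam**6
    return nu**3 * poly / (lam + nu + lam * nu**2 + nu * lam**2) ** 3


# theta = 1 against theta = 0, with phi fixed at 0 (so nu = 1)
lam_alt, lam_null, nu0 = exp(1.0), exp(0.0), exp(0.0)
ratio_conditional = conditional_likelihood(lam_alt) / conditional_likelihood(lam_null)
ratio_marginal = marginal_likelihood(lam_alt, nu0) / marginal_likelihood(lam_null, nu0)
information_ratio = log(ratio_conditional) / log(ratio_marginal)

print(f"conditional likelihood ratio = {ratio_conditional:.4f}")
print(f"marginal likelihood ratio    = {ratio_marginal:.4f}")
print(f"ratio of log likelihood ratios = {information_ratio:.2f}")
```

The value L1(1) is the conditional P value 1/20 mentioned in the text, and the ratio of the logarithms comes out a little under 14.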
The case chosen, (3, 0; 0, 3) is one where the marginal totals are very small. Were they still smaller, we should be less justified in neglecting the information in the margins; but in most real cases the margins will be larger, and the relative amount of information neglected will be even smaller than here. And we may note, since we will typically be interested not only in the magnitude of θ but also in its sign, that the margins supply information only about |θ|, L2 being invariant under the change λ → 1/λ.
As Dr Yates makes clear, the fact that the P value obtained from the conditional distribution is so much less than the maximum possible frequency of rejection of the null hypothesis when true is mainly due to the discreteness of the distribution used. If we adopt Anscombe's suggestion for such cases, counting only half the probability of the observed value towards the P, we would get P = 1/40, much nearer to the maximum rejection frequency of 1/64. Personally, I think Anscombe's suggestion useful if a whole collection of P values is to be reviewed; but the cogency of a particular single inference is better measured by the usual P value.
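The three quantities in this comparison can be reproduced with exact fractions. This is a sketch under stated assumptions: the helper names are hypothetical, the "usual" P is the upper-tail conditional probability, Anscombe's version counts only half the probability of the observed table, and the maximum rejection frequency is taken to be the largest unconditional probability p^3 q^3 of observing (3, 0; 0, 3) when p1 = p2 = p, searched over a grid of p values.

```python
from fractions import Fraction
from math import comb


def conditional_pmf(a, n1, n2, m1):
    """Exact hypergeometric probability of top-left cell a, margins fixed."""
    return Fraction(comb(n1, a) * comb(n2, m1 - a), comb(n1 + n2, m1))


def upper_tail(a_obs, n1, n2, m1, half_weight_observed=False):
    """Upper-tail P; optionally count only half of the observed probability."""
    top = min(n1, m1)
    beyond = sum(conditional_pmf(a, n1, n2, m1) for a in range(a_obs + 1, top + 1))
    observed = conditional_pmf(a_obs, n1, n2, m1)
    return beyond + (observed / 2 if half_weight_observed else observed)


usual_p = upper_tail(3, 3, 3, 3)
halved_p = upper_tail(3, 3, 3, 3, half_weight_observed=True)

# Unconditional probability of (3, 0; 0, 3) when p1 = p2 = p, maximised on a grid
grid = [Fraction(k, 1000) for k in range(1001)]
max_rejection = max(p**3 * (1 - p) ** 3 for p in grid)

print(usual_p, halved_p, max_rejection)
```

The grid contains p = 1/2, where p^3 q^3 attains its maximum, so the search is exact for this case.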
I must resist the temptation to discuss the problem of estimating θ (Barnard, 1982b). But I hope I may be forgiven for reminiscing a little, since it may help if I sketch the thought process which led me to abandon the argument which I had found acceptable forty years ago. In a private letter following our first exchange in Nature, Fisher raised a case where we are interested in the probability p that a breed of flowering plant will have purple rather than white flowers. Modifying his example, let us suppose we wish to test the hypothesis p = 1/2 and for this purpose we have four specimens for each of which the probability that it will fail to flower is 1/4, regardless of potential colour. On cultivation all four specimens give purple flowers. The conditional one-sided P is 1/16. Should we multiply this by 81/256, the probability of getting all four specimens to flower? The argument I had used suggested that we should, making the result highly significant, beyond the 2 per cent level. But what if someone else had discovered how to get his plants to flower every time? He would, surely, justifiably complain if he, getting the same result, had it judged non-significant at 5 per cent, just because of his skill in horticulture.
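The arithmetic of the flowering example is short enough to verify exactly. The sketch below simply restates the figures in the text with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

n_specimens = 4
p_purple_null = Fraction(1, 2)   # hypothesis under test
p_flowers = Fraction(3, 4)       # each specimen fails to flower with probability 1/4

# Conditional one-sided P: all four flowering specimens are purple
conditional_p = p_purple_null ** n_specimens

# Probability that all four specimens flower at all
p_all_flower = p_flowers ** n_specimens

# The product that the abandoned unconditional argument would have reported
unconditional_p = conditional_p * p_all_flower

print(conditional_p, p_all_flower, unconditional_p, float(unconditional_p))
```

The conditional value 1/16 exceeds 5 per cent, whereas the product 81/4096 falls just below 2 per cent, which is the contrast the horticulturist would complain about.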
When first faced with this argument I countered that with the 2 x 2 table fluctuation in the margins is inherent in the model, whereas fluctuation in the number flowering is not inherent in the flowers case. And the number flowering is wholly independent of colour, while in the 2 x 2 case we do have the margins dependent partly on θ. After long meditation I was forced to agree with Fisher that we must in statistical inference, separate informative from non-informative samples, regardless of whether the variation in informativeness is, or is not, potentially under our control. So, in 1949, I was led to write that I now thought Fisher was "right after all", an acknowledgement which led Fisher to remark to my friend Harold Ruben, "Barnard is the only statistician who has ever admitted he was wrong". This comment should be borne in mind when reading Dr Yates's remark concerning the wont of Karl Pearson.
Dr Yates refers to the comments on Berkson's paper by Basu and Kempthorne. Since they both refer to my work, I may perhaps add that Basu's discussion is predicated on the assumption that the possibility that p1 should be less than p2 is excluded. I find it impossible to imagine circumstances in which we could be testing p1 = p2 with this possibility ruled out. So far as Kempthorne's comments are concerned, I am grateful to him for implying that my original ideas were not altogether stupid. But I still think they were wrong. There was a private pun in my labelling the suggested procedure CSM: it referred also to the Company Sergeant Major in my Home Guard unit at the time, my relations with whom were not altogether cordial. I still feel that the test, like the man, is best forgotten.
It is a great pleasure to move the vote of thanks.

Professor D. R. Cox (Imperial College, London): Discussion of tests for the 2 x 2 table can be described as a saga, a story with deep implications. I can understand those who would prefer to approach the matter from a different and Bayesian viewpoint; understand, although largely not agree with. I can understand those who would prefer more emphasis on estimation; understand and largely agree with, as I suspect would Dr Yates himself. But given the formulation in terms of a significance test of a null hypothesis, it is indeed odd that so much confusion reigns. Fifty years after his pioneering paper, Dr Yates has returned to the subject with admirable vigour and enthusiasm. Let us be optimistic and hope that he has squashed once and for all various misconceptions.
Any tradition that seconders of votes of thanks concentrate on disagreements with the paper is one I am unwilling to follow. For I accept three main theses of the paper, that the test should be conditional, that concentration on achieving preassigned magic levels like 0.05 rather than calculating p-values is misguided, and that by and large the power comparisons reported in the literature are irrelevant or worse. Nevertheless points remain for discussion, in particular so as to understand what to do in more complicated cases for which the single 2 x 2 table is a prototype.
On the relatively minor issue of the two-sided test I agree that in some sense doubling the one-sided area is the appealing thing to do, preferable to summing over possibilities with equal or larger values of χ². The requirement of a hypothetical operational interpretation seems to demand something different, however; one suggestion (Cox and Hinkley, 1974, p. 106) is intended to be close to doubling the one-sided area and to have such an interpretation, but is rather contorted. Fisher's argument recorded in the paper I have not yet found enlightening, partly because it seems quite strongly tied to achieving preassigned levels. What does Dr Yates think of the argument?
One way of ameliorating the effect of discreteness useful in extreme cases is by approximate conditioning, i.e. by carefully assembling conditional distributions given ancillary values close to that observed. At a practical level I doubt whether that would ever be a good idea in the present instance.
The precise nature of and justification for the conditioning in general raises difficult issues. If we have a number of 2 x 2 tables with the same values of n1, n2, p1, p2 it is clearly possible from the other margins to estimate consistently both
$$(n_1 p_1 + n_2 p_2)(n_1 + n_2)^{-1} \quad\text{and}\quad (p_1 - p_2).$$
If the different tables have different probabilities p_{i1}, p_{i2} for the ith table, with constant logit difference δ, unconditional maximum likelihood can be bad with a large number of small tables, such as arise in case-control studies in epidemiology. A preferred technique is conditional maximum likelihood. Note however that if the difference of probabilities is of interest a different approach is needed; the difference would certainly be appropriate if, for example, it could be shown to be more stable under replication than the logistic difference, even though the latter is in many ways the more natural measure. Approximate inference for η = p1 - p2 is possible from one or several tables provided one is reasonably careful. Whether some form of approximate conditional inference can be achieved is not clear. It is possibly relevant that pairs of orthogonal parameters are
$$\delta = \log\{p_1(1-p_1)^{-1}\} - \log\{p_2(1-p_2)^{-1}\}, \qquad (n_1 p_1 + n_2 p_2)(n_1 + n_2)^{-1};$$
$$\eta = p_1 - p_2, \qquad n_1\log\{p_1(1-p_1)^{-1}\} + n_2\log\{p_2(1-p_2)^{-1}\}.$$
When δ is of interest the conditioning statistic is the estimate of the associated parameter.
On a point of terminology, perhaps the standard test should be called the RSM test (remain
with the same margins). It would thus take complete dominance not just over all other NCO's
but over everyone else in sight.
It gives me very great pleasure to congratulate Dr Yates and to second the vote of thanks.
The vote of thanks was passed by acclamation.

Dr G. J. G. Upton (University of Essex): May I first congratulate Dr Yates on his "golden"
paper, and apologize for not being in entire agreement! I have some comments on most of his
examples.
In the example on the effect of a serum, Dr Yates argues that, in the absence of an inoculation
effect, the number of individuals in the sample who were fated to contract the disease is fixed
whatever the allocation. However, as Dr Yates states, the number $m_1$ is not known in advance. If
the N individuals were drawn at random from some larger population then we can argue that,
with respect to that population, the number $m_1$ is not fixed but is an observation on a random
variable. On the other hand, if the experiment is unrepeatable then we can argue that the N
individuals themselves constitute the population, in which case it is logical to condition upon $m_1$
and the exact test (and hence the Yates-corrected chi-squared test) are then appropriate.
I have belatedly realised that the tea-taster situation is a rather poor example of the case of
two fixed margins, since the margins are not only fixed but also matched with $m_1 = n_1$, etc.
A common and more realistic example, which does not have this extra constraint, is that of a
firm which has $m_1$ vacancies which are to be filled from the $n_1$ male applicants and the $n_2$ female
applicants.
The example involving playing cards serves only to increase one's confusion. Our bet is not
whether the sample $p_1 - p_2$ value exceeds 0.6, but whether the population $p_1 - p_2$ value exceeds
zero. I feel that we must look further than the actual sample in hand.
The object of my 1982 paper was to see how various tests performed when they were used
as "black boxes" by someone who was not an expert statistician. I agree with Dr Yates that it
is best to quote exceedance probabilities; nevertheless the tests are usually used at nominal 5
per cent significance levels. Given this use it seemed sensible to recommend a test which globally
has a 5 per cent Type I error. Incidentally, a properly corrected (using HCF/2) normal
approximation gives 0.057, 0.042, 0.0119 and 0.0022 for the four situations discussed in Table 8.
Finally, a plea for sanity of notation. Please can we stop using $\chi^2$ as a symbol for a quantity
having a discrete distribution!

Dr I. D. Hill (Clinical Research Centre): All discontinuous tests are necessarily conservative,
if we insist on taking arbitrary dividing lines and reporting P < 0.05 or whatever. Fifty years ago,
when we had to look values up in tables, this was a sensible thing to do. In these electronic days
it is no longer sensible; we can report an exact value, such as P = 0.032 (using the exact test, of
course), and all the argument about conservative tests then vanishes in a puff of blue smoke. I
therefore entirely agree with Dr Yates, where the test is one-sided.
However, one-sided tests should not often be used, and I wish that he had given rather more
attention to two-sided tests, where a generally agreed procedure is needed. I find it intriguing that
Section 16 of the paper ends with Fisher's question to Finney: "How does this strike you?" Well,
how did it strike him? I should love to know the answer.
Dr Yates in Table 9 considers two possibilities (in addition to $\chi^2$) but neither of them seems to
me to be reasonable. What we want is the probability of the observed value, or an equally extreme
or more extreme one, in either direction and the question is: what do we take as more extreme?
In 1965, M. C. Pike and I published an algorithm for Fisher's exact test, and needed two-tailed
as well as one-tailed answers (Hill and Pike, 1965), and we could not agree on how to do it. My
argument was that the second tail should include all terms such that the sum of their probabilities
does not exceed the probability in the observed tail. I still regard that as the "right" solution,
being the equivalent of what we do with continuous distributions. The value is always less than
or equal to doubling the one tail. Pike argued that the degree of dependence is measured by the
cross-ratio, and that the second tail should therefore include all terms with an inverse cross-ratio
equal to, or more extreme than, the observed one. In the end our algorithm included both and
gave the user the choice.
The Pike method has the unusual feature, compared with all the others, that the second tail
always includes at least one term. Thus if one extreme value has a large probability, nothing can
ever show two-tailed significance in the other tail no matter how small its own tail may be.
There is yet another possibility that is sometimes suggested. This is to take all values whose
probability ordinates are no greater than the observed one. It has the advantage that it can
immediately be extended from the 2 × 2 table to the m × n table, where other definitions of
"more extreme" are not at all obvious, but it does not correspond to what is done for continuous
distributions.
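The competing definitions of the second tail are easier to compare side by side. The following is a minimal sketch, not the Hill and Pike (1965) algorithm itself: it implements the rule of matching the second tail to the observed tail, the probability-ordinate rule, and simple doubling, and leaves out Pike's cross-ratio rule. The function names and dictionary keys are illustrative assumptions, and the margins are those of the table 2, 17, 5, 16 used later in this contribution.

```python
from fractions import Fraction
from math import comb


def cell_distribution(n1, n2, m1):
    """Exact distribution of cell a given row totals n1, n2 and first column total m1."""
    total = comb(n1 + n2, m1)
    lo, hi = max(0, m1 - n2), min(n1, m1)
    return {a: Fraction(comb(n1, a) * comb(n2, m1 - a), total) for a in range(lo, hi + 1)}


def two_tailed_candidates(pmf, a_obs):
    """Three of the two-tailed definitions mentioned in the discussion."""
    lower = sum(p for a, p in pmf.items() if a <= a_obs)
    upper = sum(p for a, p in pmf.items() if a >= a_obs)
    # The observed tail is taken to be the smaller one-sided probability.
    if lower <= upper:
        observed_tail = lower
        far_side = sorted((a for a in pmf if a > a_obs), reverse=True)
    else:
        observed_tail = upper
        far_side = sorted(a for a in pmf if a < a_obs)

    # Second tail grows inwards from the far extreme while its total
    # stays no larger than the observed tail.
    second_tail = Fraction(0)
    for a in far_side:
        if second_tail + pmf[a] > observed_tail:
            break
        second_tail += pmf[a]

    # All outcomes whose probability is no greater than that of the observed one.
    by_ordinate = sum(p for p in pmf.values() if p <= pmf[a_obs])

    return {
        "matched_tail": observed_tail + second_tail,
        "ordinate": by_ordinate,
        "doubled": 2 * observed_tail,  # deliberately not capped at 1
    }


pmf = cell_distribution(19, 21, 7)  # margins of the table 2, 17, 5, 16
result = two_tailed_candidates(pmf, a_obs=2)
for name, value in result.items():
    print(f"{name:13s} {float(value):.4f}")
```

Exact fractions are used so that ties between ordinates are detected reliably. For this particular table the first two rules happen to agree, and both fall below the doubled value.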
The program I currently use asks the user for the a, b, c and d values and, if they are 2, 17, 5
and 16 for example (to take a particular realisation of Yates' Table 9B), it replies:

0 19 7 14 0.0062 0.0062 1.0000
1 18 6 15 0.0553 0.0615 0.9938
2 17 5 16 0.1866 0.2482 0.9385 Observed table
3 16 4 17 0.3111 0.5593 0.7518
4 15 3 18 0.2765 0.8358 0.4407
5 14 2 19 0.1310 0.9667 0.1642
6 13 1 20 0.0306 0.9973 0.0333
7 12 0 21 0.0027 1.0000 0.0027

which gives the individual probabilities, and the cumulative probabilities in each direction. The
user can then choose whichever procedure is preferred.
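A listing of this kind can be reproduced from the hypergeometric distribution with fixed margins. The sketch below is an assumption about how such a program might be organised, not Dr Hill's actual code; the function name is hypothetical.

```python
from math import comb


def exact_test_listing(a, b, c, d):
    """Rows (a, b, c, d, probability, cumulative from top, cumulative from bottom)
    for every table sharing the margins of the observed one."""
    n1, n2, m1 = a + b, c + d, a + c
    total = comb(n1 + n2, m1)
    cells = range(max(0, m1 - n2), min(n1, m1) + 1)
    probs = [comb(n1, x) * comb(n2, m1 - x) / total for x in cells]
    rows = []
    for i, x in enumerate(cells):
        rows.append(
            (x, n1 - x, m1 - x, n2 - m1 + x, probs[i], sum(probs[: i + 1]), sum(probs[i:]))
        )
    return rows


rows = exact_test_listing(2, 17, 5, 16)
for x, y, z, w, p, down, up in rows:
    flag = "  Observed table" if x == 2 else ""
    print(f"{x:2d} {y:2d} {z:2d} {w:2d}  {p:.4f}  {down:.4f}  {up:.4f}{flag}")
```

The two cumulative columns are the one-sided probabilities in each direction, so any of the two-tailed rules can be read off from them.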
One final question: suppose someone collects data until the number in cell a reaches a
particular value, and then stops. All the information is now in the marginal totals. N will follow a
negative binomial distribution. What is the correct test for independence?

Professor M. S. Bartlett (Retired): When Dr Yates reviews a problem discussed in 1934, I am
among a small group who might celebrate our own little quinquagenary. I have not always seen
eye to eye with him, but we seem broadly in agreement with his defence of fixed margins for
contingency tables. The exact test makes use of what in my 1937 paper I called "conditional
likelihood", when I examined the various sufficiency properties leading to valid tests. However, I
regard the "frequency requirement of repeated sampling" as including conditional inferences;
and, while I subscribe to Fisher's search for the most relevant "reference set", I would reject
any sets that do not provide a valid sampling frame. (Such a requirement permits inclusion of
the t-test, but not the Behrens-Fisher test.)
I do not rule out on principle the possible value of inexact tests based on inequalities, and in
the present context such tests include Barnard's CSM test. But a crucial further point is what
happens when $p_1 \neq p_2$; this is very simply illustrated in the extreme case of small $p_1$, $p_2$ (or
alternatively $q_1$, $q_2$). Then a, c and $m_1$ become Poisson variables with true means $\lambda_1$, $\lambda_2$ and
$\lambda_1 + \lambda_2$, say; the ratio $a/m_1$ is a binomial variable, with information $m_1/(PQ)$ on $P = \lambda_1/(\lambda_1 + \lambda_2)$,
varying with $m_1$. For general probabilities the situation is more complex, but if $\psi = p_1 q_2/(p_2 q_1)$
the joint probability of a, c is

${}^{n_1}C_a \; {}^{n_2}C_c \; \psi^a \, [\, p_2^{m_1} q_1^{n_1} q_2^{n_2 - m_1} \,]$,

where the factor in square brackets is constant for constant $m_1$ (and $n_1$, $n_2$); hence the
conditional probability, given $m_1$, depends only on $\psi$, as noted by Fisher in 1935, though
somewhat obscurely. Because of this, it would be surprising if the advantageous properties of the
conditional probability do not extend to the general $p_1$, $p_2$ case.
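The step from the two binomial distributions to the factorised form can be written out explicitly. This is a sketch using only the symbols already introduced, with the assumptions $q_i = 1 - p_i$ and $c = m_1 - a$; the summation index $s$ is introduced here for illustration.

```latex
\begin{align*}
P(a, c) &= \binom{n_1}{a} p_1^{a} q_1^{n_1 - a} \, \binom{n_2}{c} p_2^{c} q_2^{n_2 - c} \\
        &= \binom{n_1}{a} \binom{n_2}{c}
           \left( \frac{p_1 q_2}{p_2 q_1} \right)^{a}
           \left[ p_2^{m_1} q_1^{n_1} q_2^{n_2 - m_1} \right],
           \qquad c = m_1 - a, \\
P(a \mid m_1) &= \frac{\binom{n_1}{a} \binom{n_2}{m_1 - a} \psi^{a}}
                      {\sum_{s} \binom{n_1}{s} \binom{n_2}{m_1 - s} \psi^{s}} .
\end{align*}
```

The bracketed factor cancels between numerator and denominator on conditioning, which is why only $\psi$ survives in the last line.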
Finally, while it is useful to review the justification of any standard test occasionally, this
includes reviewing all the necessary conditions. The dangers of neglecting hidden classifications
can be highlighted by Fisher's original example on convictions of "like-sex twins of criminals",
classified by zygosity, but apparently not by sex. (There seems some ambiguity about this; in his
1935 R.S.S. paper Fisher does refer to twin brothers, implying all males, yet in the 1958
edition of Statistical Methods he refers explicitly to brothers or sisters.) If anyone is unfamiliar
with the dangers of hidden classifications, I recommend Simpson (1951).

Professor M. Aitkin and Mr J. P. Hinde: Dr Yates's first sentence in the second paragraph of
his appendix should be expressed in the opposite form: the margins of a 2 × 2 table do in general
contain information about the odds-ratio parameter $\theta$ though in large samples this information
may be negligible. The marginal totals are not ancillary because their distribution depends on $\theta$.
Fisher's statement in Dr Yates's first paragraph is therefore correctly qualified. Conditioning
is widely adopted because it seems otherwise impossible to draw conclusions about $\theta$ in the
absence of knowledge of some nuisance parameter $\phi$ (e.g. the odds product); the 2 × 2 table being
a two-parameter problem.
A new solution to this problem, and to nuisance parameter problems in general, has been
proposed in Hinde and Aitkin (1984). The approach can be stated succinctly: given a likelihood
function $L(\theta, \phi)$, we define the canonical likelihoods $C_1(\theta)$ and $C_2(\phi)$ for $\theta$ and $\phi$ by

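$\lambda C_1(\theta) = \int L(\theta, \phi)\, C_2(\phi)\, d\phi$,

$\lambda C_2(\phi) = \int L(\theta, \phi)\, C_1(\theta)\, d\theta$.

For the two-binomials example with $n_1$ and $n_2$ fixed, if $\theta = \{p_1/(1 - p_1)\}/\{p_2/(1 - p_2)\}$ and
$\phi = p_1 p_2/\{(1 - p_1)(1 - p_2)\}$ then with observed successes $r_1$ and $r_2$

$L(\theta, \phi) = \phi^{(r_1 + r_2)/2}\, \theta^{(r_1 - r_2)/2}\, \{1 + \sqrt{\theta\phi}\}^{-n_1}\, \{1 + \sqrt{\phi/\theta}\}^{-n_2}$.

The conditional likelihood for $\theta$ given $r_1 + r_2$ is

$C(\theta) = \binom{n_1}{r_1} \binom{n_2}{r_2} \theta^{r_1} \Big/ \sum_s \binom{n_1}{s} \binom{n_2}{r_1 + r_2 - s} \theta^{s}$,

but $L(\theta, \phi)/C(\theta)$ still depends on $\theta$, as Professor Barnard has also remarked. $r_1 + r_2$ is not
ancillary, unlike the case of the ratio of Poisson means.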
The canonical likelihoods are evaluated in general by numerical integration over a finite grid:
in this simple two-parameter case they are given by the principal eigenvectors of the matrix of
values of the likelihood function evaluated at the grid-points. For the table $n_1 = 10$, $n_2 = 8$,
$r_1 = 2$, $r_2 = 6$ the canonical likelihood for $\theta$ is shifted away from $\theta = 0$ compared to the
conditional likelihood: the relative canonical likelihood at $\theta = 1$ (compared to that at the
maximum likelihood estimate $\theta = 1/12$) is 0.0518 (using 15-point grids for $\log(\theta)$ and $\log(\phi)$)
while the relative conditional likelihood is 0.0685. The direct likelihood interpretation in either
case is that the evidence against $\theta = 1$ is strong. The Fisher exact probability is 0.0288 for this
table, and the exact probabilities of the more extreme tables with $r_1 = 1$ and $r_1 = 0$ are 0.0018 and
0.00002, giving a one-sided significance probability of 0.0306. If the significance probability is
doubled for the two-sided test, the evidence against $\theta = 1$ is not strong.
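The conditional-likelihood figures quoted for this table can be checked directly. The sketch below evaluates the conditional likelihood of the odds ratio and locates its maximum by a crude search over a fine grid of log odds ratios; the grid is an assumption made here for illustration and is not the 15-point grid of the text, so the relative likelihood should only be expected to come out close to the quoted 0.0685. The canonical likelihood is not attempted.

```python
from math import comb, exp


def conditional_likelihood(theta, n1, n2, r1, r2):
    """Probability of r1 successes in the first sample given r1 + r2 in total,
    viewed as a function of the odds ratio theta."""
    total = r1 + r2
    lo, hi = max(0, total - n2), min(n1, total)
    denom = sum(comb(n1, s) * comb(n2, total - s) * theta ** s for s in range(lo, hi + 1))
    return comb(n1, r1) * comb(n2, r2) * theta ** r1 / denom


n1, n2, r1, r2 = 10, 8, 2, 6

# At theta = 1 the conditional likelihood is the Fisher exact probability of the table.
fisher_point = conditional_likelihood(1.0, n1, n2, r1, r2)

# One-sided probability: the observed table plus those with smaller r1, same margins.
one_sided = sum(
    conditional_likelihood(1.0, n1, n2, s, r1 + r2 - s) for s in range(0, r1 + 1)
)

# Illustrative grid for log(theta) from -6 to 2 in steps of 0.001 (an assumption).
grid = [exp(-6 + 0.001 * k) for k in range(8001)]
c_max = max(conditional_likelihood(t, n1, n2, r1, r2) for t in grid)
relative_at_one = fisher_point / c_max

print(f"exact probability of the table : {fisher_point:.4f}")
print(f"one-sided probability          : {one_sided:.4f}")
print(f"relative conditional likelihood: {relative_at_one:.4f}")
```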
For the extreme case $n_1 = n_2 = 2$, $r_1 = 2$, $r_2 = 0$, the relative conditional likelihood at $\theta = 1$ is
1/6, the same as the Fisher exact probability of the table. There is not even weak evidence against
$\theta = 1$. The canonical likelihood, efficiently using the information in the marginal totals, gives a
relative likelihood at $\theta = 1$ of 0.026, strong evidence against $\theta = 1$. This illustrates Barnard's
argument: the relative likelihood $L(p_1, p_2)/L(1, 0)$ is 1/16 at $p_1 = p_2 = p = 1/2$, and less at any other
common value of p: the relative likelihood when correctly averaged over the nuisance parameter
must be less than 1/16 at $\theta = 1$.
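The two figures 1/6 and 1/16 for the extreme case follow from short calculations, sketched here from the conditional likelihood and the two-binomial likelihood with $n_1 = n_2 = 2$, $r_1 = 2$, $r_2 = 0$. The supremum over $\theta$ is taken as the reference value, which is an assumption about how "relative" is meant when the maximum is attained only in the limit $\theta \to \infty$.

```latex
\begin{align*}
C(\theta) &= \frac{\binom{2}{2}\binom{2}{0}\,\theta^{2}}
                  {\binom{2}{0}\binom{2}{2} + \binom{2}{1}\binom{2}{1}\,\theta
                   + \binom{2}{2}\binom{2}{0}\,\theta^{2}}
           = \frac{\theta^{2}}{1 + 4\theta + \theta^{2}}, \\
\frac{C(1)}{\sup_{\theta} C(\theta)} &= \frac{1/6}{1} = \frac{1}{6}, \\
\frac{L(p, p)}{L(1, 0)} &= \frac{p^{2}(1 - p)^{2}}{1} \le \frac{1}{16},
   \qquad \text{with equality at } p = \tfrac{1}{2}.
\end{align*}
```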
As noted above, this procedure applies quite generally to nuisance parameter models: it gives
an improvement over the marginal likelihood for $\sigma$ in the $N(\mu, \sigma^2)$ model as well.

Dr D. M. Grove (University of Birmingham): Dr Yates has expressed concern about the
number of recent papers which question the wisdom of conditioning on the observed margins of
a 2 × 2 table. I conducted a haphazard survey of statistical textbooks, and I can reassure him that
in those books the idea of conditioning was accepted without question. Every author, explicitly or
implicitly, presented the $\chi^2$ test as a convenient approximation to an ideal represented by the
exact test. But note that in no case was the distinction made between 2 × 2 and larger two-way
tables. Dr Yates' paper is not concerned with larger tables, but I think that his argument for the
irrelevance of the sampling rule, which he gives towards the end of Section 5, could be applied to
larger tables. Moreover, it makes no mention of the parameters.
Leaving parameters out of the argument seems to me to obscure the fact that the 2 × 2 case is
a special one. We can express the amount of association in a single odds ratio (as Professor
Barnard, Professor Cox and Mr Hinde have all done), and we can follow Barnard or Plackett
(1977) in arguing that there is very little information, if any, in the margins about that odds ratio.
In a larger table it is well-known that the conditional distribution can still be written as a function
of cross-product ratios (odds ratios) of individual cell probabilities. However, in tables with
ordered margins there are many interesting hypotheses which cannot be expressed in terms of such
cross-product ratios.
Take, for example, a single multinomial model covering the entire table. If we define $\tau$ to be
the population version of Kendall's rank correlation coefficient, then $\tau > 0$ is a natural expression
of the idea of a positive association between the ordered rows and the ordered columns. A natural
test of independence against the alternative of positive association could be based on the empirical
Kendall rank correlation coefficient, as Professor Armitage (1955) and others have discussed.
It is not difficult to find combinations of probabilities and margins such that although the
Kendall population $\tau$ is strictly positive the empirical $\tau$ converges to a negative value in the
conditional reference set, that is, letting the marginal totals tend to infinity in fixed ratios. The
same result holds for other plausible test statistics such as those proposed by Clayton (1974).
The corresponding tests are inconsistent. My conclusion is that the argument for conditioning
needs to be carefully formulated in such a way as to avoid this kind of inconsistency.

Mr Graham Jagger (Life Science Research Ltd): I would like to make two points.
The first concerns the meaning of "more extreme" in the application of Fisher's test. Consider
the table (2, 3; 4, 21). This has expectations (1, 4; 5, 20). The tables (3, 2; 3, 22), ..., (5, 0; 1, 24)
are clearly more extreme than the original in the sense that they depart more and more from
expectation. The probability of (2, 3; 4, 21) or more extreme tables is, then, 0.254.
There is, however, another series of tables, (2, 3; 4, 21), ..., (0, 5; 6, 19), giving a probability
of 0.959 for that observed or less extreme. Note that this second case is the one achieved by
progressively decrementing the smallest cell in the table, a procedure more or less strongly
implied by the authors of most modern statistical texts. Indeed, the manufacturer of my
Hewlett-Packard programmable pocket calculator supplies a program that does just that.
It cannot be stressed too strongly that the process of deriving more extreme tables is not
formally equivalent to the progressive decrementation of the smallest cell; which procedure is, in
general, incorrect.
My second point is concerned with the calculation of the two-sided
probability. Doubling the single-sided probability of 0.254 derived from my first example yields,
using Yates' method (i), a two-sided probability of 0.508. On the face of it this is not unreasonable
but difficulties rapidly arise.
Consider the table (2, 3; 4, 5). This and more extreme tables (as defined above) yields a
single-sided probability of 0.657 so method (i), that is, doubling, is not an appropriate way of
finding the two-sided probability. On the other hand, adding up the probabilities of the tables
at the other end of the tail which are as, or more, extreme, gives a two-sided probability of 1.
This is Yates's method (ii), and, whatever its shortcomings, always produces a more or less sensible
result. I do not understand the "common-sense grounds" upon which Dr Yates condemns this
method, especially since method (i) is an even less agreeable alternative.
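The figures in these two examples can be reproduced from the hypergeometric distribution with fixed margins. In the sketch below, "more extreme" is taken to mean departing at least as far from the expected value of the top-left cell, which is one reading of the definition given above; the function names are illustrative.

```python
from math import comb


def cell_probabilities(a, b, c, d):
    """Hypergeometric probabilities of the top-left cell with all margins fixed."""
    n1, n2, m1 = a + b, c + d, a + c
    total = comb(n1 + n2, m1)
    lo, hi = max(0, m1 - n2), min(n1, m1)
    return {x: comb(n1, x) * comb(n2, m1 - x) / total for x in range(lo, hi + 1)}


def tails(a, b, c, d):
    """One-sided tail in the direction of departure from expectation,
    followed by the doubled value and the sum over both tails."""
    pmf = cell_probabilities(a, b, c, d)
    expected = (a + b) * (a + c) / (a + b + c + d)
    deviation = abs(a - expected)
    if a >= expected:
        one_sided = sum(p for x, p in pmf.items() if x >= a)
    else:
        one_sided = sum(p for x, p in pmf.items() if x <= a)
    doubled = 2 * one_sided  # method (i); may exceed 1
    # Method (ii): tables departing from expectation at least as far, either side.
    both = sum(p for x, p in pmf.items() if abs(x - expected) >= deviation - 1e-12)
    return one_sided, doubled, both


first = tails(2, 3, 4, 21)
second = tails(2, 3, 4, 5)

# Tail obtained by decrementing the observed cell of the first table down to zero.
decrement_tail = sum(p for x, p in cell_probabilities(2, 3, 4, 21).items() if x <= 2)

print("table (2,3; 4,21):", [round(v, 3) for v in first])
print("table (2,3; 4,5) :", [round(v, 3) for v in second])
print("decrementing tail:", round(decrement_tail, 3))
```

For the second table the doubled value exceeds 1, which is the difficulty raised against method (i).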
Perhaps Dr Yates could, in his reply, add to his already fascinating paper with further
discussion of this question of two-sided tests.

Dr H. D. Patterson (AFRC Unit of Statistics, Edinburgh): I would like to add my
congratulations to Dr Yates for a clear and illuminating paper. No question appears to remain
unanswered on the single 2 × 2 table but I wonder whether Dr Yates has any advice for us on (a)
the combination of several 2 × 2 tables and (b) the use of conditional arguments in other
statistical activities, such as the analysis of treatments × places tables of agricultural yields.

The following contributions were received in writing after the meeting.


Dr R. S. Cormack (Northwick Park Hospital and Clinical Research Centre): We are all
indebted to Frank Yates for his very clear and cogent restatement of the case for Fisher's model;
let us hope that both margins are now well and truly anchored. But there remains the question
of the second tail. The consensus of opinion among some distinguished scientists is that Irwin's
rule, and related methods of defining the 2nd tail, are not useful. However admirable they may be
in theory these methods are not consistently plausible in real life. Yates has found delightful
examples of the anomalies that can arise with the Irwin rule: how can the probability of seeing a
particular face of the die only once be greater for 31 throws than for 30? Certainly the doubling
rule makes more sense here and is consistently plausible in the critical regions.
Unfortunately, this distinguished assembly is clearly not agreed on the theoretical justification
for doubling when the tails are asymmetric. Thus the most pressing need now would seem to be
for a way of completing the exact tests which not only gives plausible results, like the doubling
rule, but also has a convincing theoretical base. Meanwhile support for the doubling rule must be
faute de mieux.

Dr S. E. Fienberg (Carnegie-Mellon University, Pittsburgh, PA): By casting his defense of
Fisher's exact test for 2 × 2 contingency tables in a Fisherian perspective, Dr Yates fails to cite
other support for his conclusions. Lehmann (1959, pp. 140-146) notes a strong Neyman-Pearson
justification for the two-binomial and independence problems, namely that it is the uniformly
most powerful unbiased level $\alpha$ test (with randomization to achieve the nominal level) for equality
of the p's. Cox and Hinkley (1974, pp. 134-140) state another version of this result, based only
upon achievable levels of significance, and they describe the Fisher test as uniformly most
powerful similar. Despite all of this, I would make a case for the use of the uncorrected $\chi^2$ statistic and
its likelihood ratio counterpart.
Except for cases where $n_1$ and $n_2$ (and by symmetry $m_1$ and $m_2$) are quite small, there is not
much difference between inferences based on the exact test, and those based on the $\chi^2$ test. (For
these very small sample cases, I too would recommend, and do use in practice, the exact test.)
Referring the uncorrected $X^2$ test statistic to the $\chi^2$ distribution gives a test which for
modest-sized samples achieves nominal levels under a multinomial or two-binomial sampling model.
Despite what Dr Yates says, the $\chi^2$ test in this sense is not a conditional test and the
continuity-corrected $\chi^2$ test is conservative. Moreover, the $X^2$ statistic for 2 × 2 tables is a very special case
of a test statistic used for a much broader class of problems, in larger two-way and in multi-way
tables, representable in terms of loglinear models. The shaky Fisherian arguments about lack of
information in the margins to be used for conditioning do not really carry over to these more
general settings. For tests involving multiple degrees of freedom performed for loglinear models,
we often wish to partition the $\chi^2$ statistic or the likelihood-ratio statistic into
one-degree-of-freedom components. Using the standard statistics without correction and the corresponding $\chi^2$
reference distributions seems preferable here. For these reasons, I recommend using the
uncorrected $\chi^2$ test in the 2 × 2 table, except when sample sizes are too small.
The controversy is over the relevance of the exact test, not over whether the
continuity-corrected $\chi^2$ gives close approximations to the exact test (Fienberg, 1980). The crux of Dr Yates's
defense of the exact test is Fisher's argument for regarding the margins as ancillary. Since Yates
admits that this argument is only approximately true at best, it is not surprising that there is still
debate over whether the conditional distribution approach is appropriate.
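The difference between the corrected and uncorrected statistics is easy to see numerically. The sketch below uses the standard closed forms for a 2 × 2 table and, as an illustrative choice made here, the table 2, 17, 5, 16 from Dr Hill's contribution, whose exact one-sided probability is 0.2482. The one-sided approximations are obtained by halving the upper tail of the 1 d.f. reference distribution; function names are assumptions.

```python
from math import erfc, sqrt


def chi_squared_2x2(a, b, c, d, corrected):
    """Pearson statistic for a 2 x 2 table, optionally with the continuity correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if corrected:
        diff = max(0.0, diff - n / 2)  # continuity correction
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def upper_tail_1df(x2):
    """P(chi-squared on 1 d.f. exceeds x2), from the normal tail."""
    return erfc(sqrt(x2 / 2))


plain = chi_squared_2x2(2, 17, 5, 16, corrected=False)
with_correction = chi_squared_2x2(2, 17, 5, 16, corrected=True)

one_sided_plain = upper_tail_1df(plain) / 2
one_sided_corrected = upper_tail_1df(with_correction) / 2

print(f"uncorrected: statistic {plain:.4f}, one-sided P {one_sided_plain:.4f}")
print(f"corrected  : statistic {with_correction:.4f}, one-sided P {one_sided_corrected:.4f}")
```

For this table the corrected value lies close to the exact conditional probability and the uncorrected value well below it, which bears on the approximation question only; whether the conditional probability is the right target is the point in dispute.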

Professor D. J. Finney (University of Edinburgh): We should be grateful to Dr Yates for his
characteristically realistic account of matters that in recent years others have tended to make
increasingly obscure. I personally was glad that he has rescued, from correspondence that I had
forgotten, an opinion from R. A. Fisher who seems otherwise to have preserved a remarkable
reticence on this subject.
Since Dr Yates did not set out to discuss the roles of one-tail and two-tail tests, my remarks
are peripheral to his theme; I want to suggest that greater attention ought to be given to the
appropriate choice. I have difficulty in finding examples of situations in which a one-tail test
makes sense. Textbooks and common practice seem often to imply that, if I am interested only
in a deviation in one direction, I should employ a one-tail test. Is a restriction of interest sufficient,
even though declared in advance of data acquisition? I believe the correct condition to be "I know
that any apparent deviation in the other direction, however large, must be due to chance". For
example, in a recent legal battle in an Edinburgh court, one party argued vigorously that cancer
death rates had been increased by the adding of fluoride to public water supplies, and insisted
that a test of significance should look only at deviations in this direction. They had no interest in
indications of a beneficial effect. Yet if a deviation in the direction of reduced mortality among
those exposed to fluoride had been several times its standard error, could it conceivably have been
automatically rejected as due to chance? (In actuality, when properly calculated, there was a
negligibly small deviation in that direction.) The complications of $\chi^2$ help to emphasize this
issue, but of course the logic is equally relevant to other significance tests.

Professor M. J. R. Healy (London School of Hygiene and Tropical Medicine): A common
thread running through much of the discussion has been the use of the log odds-ratio or logit
difference to measure the amount of association in the 2 × 2 table. As Dr Hill has commented,
this provides an alternative definition of extremeness of departure from the null situation, and
hence a method of rendering the exact test two-tailed. The probabilities for Tables C and D of
Table 9 of the paper are

C D
0 0.0115 0.0107
1 0.0849 0.0810

6 0.1705 0.1713
7 0.1047 0.1023
8 0.0164 0.0160
To me the main reason against Dr Yates' prescription of doubling the one-tailed probability is
simply that the resulting number is not the probability of any definable event.
The lesson I draw from the paper is one Dr Yates has taught me long ago, that a mere
significance test is seldom a satisfactory summary of a body of data. The sensible investigator
should always seek a quantitative estimate of the amount of association in the table, an interval
estimate for preference, and this the log odds-ratio readily provides. From another viewpoint, it
gives the necessary peg on which a Bayesian argument can be hung, namely a quantity to which
a prior distribution can be ascribed.
A further lesson is that the analysis of discrete data by classical methods is much more tricky
than that of continuous measurements. It is a pity that the difference between two proportions
is so often chosen as the first example of a statistical technique to be prescribed in the elementary
textbooks.

Professor F. D. K. Liddell (McGill University, Canada): Consider samples $n_1$ and $n_2$ from two
distinct populations, all N subjected to the same treatment. There can be no doubt (see Section 5
of the paper) that the value of a might take any value, say $r_1$, from 0 to $n_1$, and c any value, say
$r_2$, from 0 to $n_2$. Thus, there are two degrees of freedom in the system, with a total of
$(n_1 + 1)(n_2 + 1)$ possibilities, instead of one, and that restricted to at most $(n_1 + 1)$ possibilities.
For given $\pi$, it is well known that $P(r_1, r_2 \mid n_1, n_2, \pi) = H(r_1 \mid R, n_1, n_2) \times \mathrm{Bi}(R \mid N, \pi)$, where
$R = r_1 + r_2$: the two terms are the hypergeometric probability of obtaining $r_1$ had R been fixed,
and the binomial probability of obtaining R out of N with true probability $\pi$. Ranking the
probabilities, for every possible combination of $r_1$ and $r_2$, in order of the values of
$(r_1/n_1 - r_2/n_2)$, or of any plausible statistic that compares $p_1$ and $p_2$, could lead to a fully
unconditional test. It is common practice, when a test procedure depends on an unknown
population parameter, to replace that parameter with its sample estimate. Here, replacing the
unknown $\pi$ by its ML estimate $\hat{p} = m_1/N$ still permits the estimation of the probability of
$(r_1/n_1 - r_2/n_2)$ for all $r_1$ and $r_2$. This process has used one df for estimating $\hat{p}$, leaving the other
for comparing $p_1$ and $p_2$; however, it has not fixed the $m_1$, $m_2$ margin, and does not appear to
involve any illogical step. The test is not strictly unconditional for the estimate of $\pi$ does take
account of $m_1$. Nevertheless, there seems no need to resort to a severely restricted truly
conditional test.
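The factorisation quoted above, and the kind of plug-in calculation proposed, can be sketched as follows. This is an illustration under stated assumptions rather than the 1976 test itself: the tail is taken one-sided (large values of the difference in proportions), the example counts are arbitrary, and all function names are hypothetical.

```python
from math import comb, isclose


def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def hypergeom_pmf(r1, total, n1, n2):
    """Probability of r1 successes in the first sample given total successes overall."""
    return comb(n1, r1) * comb(n2, total - r1) / comb(n1 + n2, total)


def plug_in_p_value(r1, r2, n1, n2):
    """Upper-tail probability of r1/n1 - r2/n2 over all (n1 + 1)(n2 + 1) outcomes,
    with the common probability replaced by its estimate (r1 + r2)/N."""
    pi_hat = (r1 + r2) / (n1 + n2)
    observed = r1 / n1 - r2 / n2
    return sum(
        binom_pmf(x, n1, pi_hat) * binom_pmf(y, n2, pi_hat)
        for x in range(n1 + 1)
        for y in range(n2 + 1)
        if x / n1 - y / n2 >= observed - 1e-12
    )


# Check the factorisation for one arbitrary outcome and value of pi.
n1, n2, r1, r2, pi = 10, 8, 2, 6, 0.3
joint = binom_pmf(r1, n1, pi) * binom_pmf(r2, n2, pi)
factored = hypergeom_pmf(r1, r1 + r2, n1, n2) * binom_pmf(r1 + r2, n1 + n2, pi)

p_large_difference = plug_in_p_value(6, 2, 8, 10)
p_small_difference = plug_in_p_value(4, 4, 8, 10)
print(joint, factored, p_large_difference, p_small_difference)
```

The sum runs over both margins' worth of outcomes, which is what distinguishes this calculation from the conditional test.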
Where is the logical failure in this proposal? In other words, if the CDF of $(r_1/n_1 - r_2/n_2)$
can be determined on the basis of $\hat{p}$, why should it not be used for a test, even if the only
conditioning is the use of $m_1$ in obtaining $\hat{p}$, and even if the test is more liberal than the (fully)
conditional test? The test on these lines I put forward in 1976 did have to be defended, in 1980,
but against criticisms of quite different types.

Professor Nathan Mantel (The American University, Bethesda, MD, USA): The erroneous
faulting of the Fisher exact test and its continuity-corrected chi square approximation is a curious
aberration which just will not go away. Seeming faults with those procedures keep getting
discovered, rediscovered, and published by new accessions to the statistical profession, and even by
well-established professionals.
Perhaps because it is Yates who has now taken up the defense, things will improve, but I doubt
it. Other statisticians who also know better are forced to keep quiet inasmuch as no claim of
making a new or positive contribution can be made by defenders of the exact test.
Where I would disagree with Yates is in his recommendation for using twice the one-tail
probability as a measure of the two-tail probability. He bases this recommendation on the
unseemly behaviour he sees for other procedures in an instance where he has been able to shift
the cell expectation slightly away from a multiple of 0.5. But whether this is a sound enough
basis seems dubious to me. In any case, for tables so large that exact enumeration is unfeasible,
Yates would have to make some other recommendation for getting two-tail probabilities.
Because of Yates' emphasis on single-tail testing, he avoids having to define the test statistic
relative to which the probability is being sought. The single-tail probability is constant whether our
test statistic is the tail probability itself, the absolute deviation from expectation, or the
probability of the specific outcome observed. Each of these could give rise to a different value for
the two-tail probability, another factor of variation being whether we use exact two-tail
probabilities, or where suitable, some chi-square-based approximation to that two-tail probability.
Among the two-tail probability methods that Yates rejects in favour of twice the one-tail
probability is an exact two-tail probability, the test statistic being presumably, as I see it, the
absolute deviation from expectation. In his discussion of the work of Haber, Yates notes a "test
proposed by Mantel". For the examples Yates gives in Table 9, had Yates applied that
chi-square-based test he would have found it to give remarkably close approximations to the exact two-tail
probabilities which he had gotten. Still, Yates now rejects such exact two-tail probabilities,
though these are the ones customarily used when feasible.

Dr J. A. Nelder (Rothamsted Experimental Station): I have recently completed an annual
task of interviewing candidates for the government statistician grade. We usually ask them about
$\chi^2$. Everyone has heard of it, and they all know the "observed-minus-expected" formula. But
almost no one knows on what assumption the expected values are expected, and nobody knew
this year what $\chi^2$ measures. They would not know a "proportional" table with a zero $\chi^2$ if it
got up and bit them. I very much hope that in teaching we can move from significance testing
towards estimation, with the odds ratio (or some function of it) to measure what is happening in
a 2 × 2 table, with the conditional variance to supply an asymptotic s.e. or the exact conditional
distribution to supply more exact limits. We can then move on to consider the analysis of sets of
2 × 2 tables in terms of consistency of odds ratios, and so on. A disadvantage of $\chi^2$ is that it
lumps together the extremes of large positive and large negative log-odds ratios in the same tail,
something that we should not always want to do. However $\chi^2$ also has a left-hand tail, when the
log-odds ratio is close to zero. Anthony Edwards has pointed out that $\chi^2$ near zero, the case of
"suspiciously close agreement", is extreme in the sense of casting doubt on the variance
assumption. Perhaps some data have been rejected, or some other cause of underdispersion is
operating. Fisher's re-analysis of Mendel's data is relevant here.

Dr R. L. Plackett (Retired): I regret that a prior engagement prevents me from attending this
golden jubilee, and voicing my congratulations to Frank Yates in person. The apparent simplicity
of 2 × 2 contingency tables is deceptive, and conceals several important matters of principle. Two
are considered here.
(a) Conditioning on the margins. The combination of numerical examples and statistical
intuition in this paper is welcome, not least because any failure to recognize the force of the
arguments for conditioning could be explained by the fact that Fisher and Yates always regarded
the need to condition as obvious and presumably therefore not requiring any justification. A
consequence of the property that the marginal frequencies provide virtually no information is
that inferences about the cross-product ratio $\psi$ should be based on the conditional likelihood
function. Fisher (1935) derived an upper fiducial limit on this basis, but he took a different view
about estimation: "If we want an estimate of $\psi$ we have no choice but to take the actual ratio of
the products of the frequencies observed in opposite corners of the table". I would be interested
to know why the unconditional maximum likelihood estimate is so firmly recommended.
(b) Definition of the significance level. In his 1900 paper on chi-squared, Karl Pearson gave
actual levels of significance. The use of nominal levels originated when Fisher (1925, Chap. IV),
"owing to copyright restrictions", prepared a new table of chi-squared "in a form which
experience has shown to be more convenient". Fisher's argument for defining a two-sided
probability as twice the one-tail probability is based on the practice of using nominal levels, which
is now criticized as being defective. Another definition was introduced by Neyman and Pearson,
who arranged events in order of decreasing probability and calculated the total probability in the
tail. A modified version is described by Lancaster (1969, Chap. 3) and Anscombe (1981,
Chap. 12). This is the median probability, which is half the probability of the observed event plus
the sum of probabilities for all events less probable, with suitable multiples for equally probable
events. I prefer this definition because the expected value of the median probability is 1/2 under the
hypothesis being tested, just as it is for a continuous distribution. In the case of 2 × 2 tables, the
median probability is well approximated without using a continuity correction.
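
The median probability defined in (b) can be sketched for a single 2 × 2 table as follows. This is an illustration only, not code from the discussion: the function names and the margin arguments r1, r2, c1 (row totals and first-column total) are hypothetical, the null distribution is taken to be the exact conditional (hypergeometric) one, and "suitable multiples for equally probable events" is read here as giving every tied event, the observed one included, a weight of one half.

```python
from fractions import Fraction
from math import comb

def conditional_pmf(r1, r2, c1):
    """Exact null distribution of the top-left cell a, given all margins."""
    total = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return {a: Fraction(comb(r1, a) * comb(r2, c1 - a), total)
            for a in range(lo, hi + 1)}

def median_probability(a_obs, r1, r2, c1):
    """Half the probability of the observed event (and of any ties)
    plus the full probability of every less probable event."""
    pmf = conditional_pmf(r1, r2, c1)
    p_obs = pmf[a_obs]
    less = sum(p for p in pmf.values() if p < p_obs)
    tied = sum(p for p in pmf.values() if p == p_obs)  # includes a_obs itself
    return less + tied / 2

def expected_median_probability(r1, r2, c1):
    """Expectation of the median probability under the null hypothesis."""
    pmf = conditional_pmf(r1, r2, c1)
    return sum(p * median_probability(a, r1, r2, c1) for a, p in pmf.items())
```

Under the half-weight reading of ties assumed above, the null expectation comes out at exactly 1/2 for any set of margins, which is the property claimed for this definition. Exact rational arithmetic is used so that ties between equally probable events are detected reliably.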

The author responded briefly at the meeting; he later replied, in writing, as follows:
I should first like to thank all those who contributed to the discussion, or sent in written
comments. As was to be expected many diverse points were raised. I will only comment on them
briefly; to do so fully would take more space than is available here.
It is gratifying to find that the need for conditioning on the margins appears to have been
accepted by most of the discussants. I am very grateful that Professor Barnard, in proposing the
vote of thanks, reiterated his "disavowal" of the CSM test, first made in 1949, and for the account
of the correspondence with Fisher that led to this conclusion, particularly as his original paper
and the associated paper by Pearson (1947) on quality control continue to be quoted as
authoritative by the Neyman-Pearson school.
Section 10 of my paper criticizes Pearson's proposals, but only gives a numerical example for
n₁ = n₂ = 5. After the paper went to the press I wrote a computer program in time to present
the results for n₁ = n₂ = 12 (Pearson's values) to the meeting. I reproduce them here in the same
format as Table 6, which should be referred to for the full headings.
Odds ratio    (a) No verdict (%)          (b) Rejection (% excluding (a))
              p₁:  2/3    1/2    1/3      p₁:  2/3    1/2    1/3
1:1                2.0    0.0    2.0           2.7    3.2    2.7
4:1                0.0    1.0   17.6          42.4   36.0   28.1
∞                  1.9   19.4   63.2         100    100    100

It is instructive to compare these values with those of Table 6. Note that the best value to aim at
for p₁ is likely to be somewhat greater than 1/2.
The only issue on which Barnard differs from me is on the theoretical justification for
conditioning. He maintains that in the absence of knowledge of the body of the table there is some
information on the log-odds ratio in the margins, though this is a trivially small fraction of that
from the body of the table except in very small samples, and is affected by the additional
unknown error in p. Does this affect the argument? I maintained it did not, as p₁ and p₂ are
sufficient statistics, and p is merely their weighted mean. Professor Plackett's first query also
relates to this issue.
Professor Aitkin and Mr Hinde argue more strongly on the same lines, and have proposed a new
solution based on what they term canonical likelihoods. The results they give appear to indicate
that their method gives probabilities similar to Barnard's unconditional probabilities. I leave it
to others to discuss their proposal when a full account is published.
For reviewing a collection of P values on the same issue Barnard recommends Anscombe's
suggestion of counting only half of the probability of the observed value. Professor Plackett
also favours this definition of P for discontinuous distributions. This is more or less equivalent to
omitting the continuity correction to χ², which, contrary to what I first thought (1934), is
necessary when using Fisher's combination-of-probabilities test. I discussed this matter at length
in a paper (1955b) on the subject, which was essentially an appendix to a paper (1955a) on the use
of maximum likelihood. Dr Patterson may be interested in these papers.
Apart from some remarks on what he termed "the relatively minor issue" of two-sided tests,
which are considered below, Professor Cox mainly discussed problems of estimation that arise
when summarising the results from a number of 2 × 2 tables. I will not comment in detail here,
as my own paper dealt only with tests of significance for individual 2 × 2 tables. I agree with him
that differences in log-odds are not necessarily the most appropriate measure of the difference
between two probabilities in all circumstances. He instanced as an alternative p₁ − p₂. Another
alternative is suggested by the inoculation trial described in Section 6. If an estimate of the
protection given to those at risk were required, instead of a demonstration that inoculation is
always effective, such an estimate would be given by P_I = 1 − p₁/p₂, where P_I is the proportion
of those at risk who are protected by inoculation. The numbers in such a trial would of course
have to be considerably greater than the 10 of the demonstration trial to give any reliable estimate.
Dr Nelder, also, makes a plea for greater emphasis on estimation. With the general tenor of his
remarks I entirely agree. He does, however, like many of the contributors, somewhat uncritically
support the use of the odds ratio in all circumstances. In his remarks on χ² the disadvantage that
he mentions (that χ² lumps together large deviations occurring on the opposite tails) is a
common fault in its use, but is simply avoided by working with ±√χ², the deviations in opposite
directions being distinguished by the attached sign. These are equivalent to deviates of the

standard normal distribution: values of χ² near zero are equivalent to suspiciously small normal
deviates.
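
The signed square root of χ² mentioned here can be sketched as below. The function name is hypothetical, and the statistic shown is the ordinary uncorrected χ² for a table with rows (a, b) and (c, d); the continuity correction is deliberately left out to keep the illustration short.

```python
from math import sqrt

def signed_root_chi2(a, b, c, d):
    """Signed square root of the uncorrected chi-squared for the table (a, b; c, d).

    The sign is that of ad - bc, so deviations in opposite directions
    are kept apart instead of being lumped into one tail.
    """
    n = a + b + c + d
    r1, r2 = a + b, c + d          # row totals
    c1, c2 = a + c, b + d          # column totals
    if min(r1, r2, c1, c2) == 0:
        raise ValueError("a zero margin leaves the statistic undefined")
    return (a * d - b * c) * sqrt(n / (r1 * r2 * c1 * c2))
```

Squaring the result recovers N(ad − bc)²/(r₁r₂c₁c₂), the usual χ² on one degree of freedom, so the value can be referred to the standard normal distribution with its sign attached.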
Dr Nelder's experience of interviewing candidates reminds me of a similar experience of mine
when acting some long time ago as external examiner at a university. In one of the questions I set
I asked the students to calculate the value of χ² in a 2 × 2 table, and to comment on the meaning
of the result. When the internal examiner sent me the papers he told me that he had reduced the
maximum mark for this question, as it seemed easier than all the others. I was somewhat surprised,
but the explanation was soon forthcoming. None of the students had made any meaningful
comment on the values (mostly correct) that they had obtained. I therefore restored the maximum.
Presumably these partial failures were due to a defect in the teaching.
Professor Bartlett's contribution puzzles me. After stating that he agrees with conditioning on
the observed margins he says he would not rule out "on principle" the CSM test. Thus, to take the
(3,0; 0,3) case, he is saying that he agrees with Fisher's exact probability of 1/20 but does not
rule out Barnard's probability of < 1/64, which Barnard himself later repudiated. The whole
burden of my paper was that it should be repudiated. Nor is this merely a matter of theoretical
controversy that does not affect the user of statistics. To take Berkson's example of a clinical
trial (Section 12), Kempthorne rightly supported the exact test, but had the experimenter
maintained that the subjects of the trial were in fact a random sample of a much larger population of
sufferers from the disease, would Kempthorne have been entitled to use the CSM test? My answer
is an emphatic No. The flaw in Bartlett's argument, I think, lies in his statement that Fisher was
searching for a relevant "reference set", a Neymanian concept quite different from Fisher's
"relevant sub-set".
Bartlett's warning about "hidden classifications" is of course valid, except that if they are
really hidden there is nothing we can do about them when analysing observational data, except
to express the hope that they make no material difference. In experimental work, choice of some
appropriate system of randomization neutralises the effect of classifications that are unknown and
of those whose effects are judged to be likely to be too small to be worth elimination by statistical
analysis.
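
The two probabilities quoted for the (3,0; 0,3) case in the reply to Professor Bartlett can be reproduced with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, and the 1/64 figure is interpreted here as the largest value of p³(1 − p)³ over a common success probability p, which is one plausible reading of the unconditional calculation and is not spelled out in the reply itself.

```python
from fractions import Fraction
from math import comb

def conditional_probability(a, r1, r2, c1):
    """Probability of top-left cell a with all margins fixed (hypergeometric)."""
    return Fraction(comb(r1, a) * comb(r2, c1 - a), comb(r1 + r2, c1))

def unconditional_probability(p):
    """Probability of 3 successes out of 3 in one row and 0 out of 3 in the
    other, when both rows share the same (unknown) success probability p."""
    return p**3 * (1 - p)**3

# Conditioning on margins (3, 3; 3, 3): only one arrangement is as extreme.
fisher = conditional_probability(3, 3, 3, 3)

# Without conditioning the answer depends on the nuisance parameter p;
# scan an exact grid to find where it is largest.
grid = [Fraction(i, 1000) for i in range(1001)]
worst_p = max(grid, key=unconditional_probability)
largest_unconditional = unconditional_probability(worst_p)
```

The conditional value is 1/20 whatever p may be, whereas the unconditional value never exceeds 1/64 and is smaller for every p other than 1/2.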
Professor Fienberg is also ambivalent in his choice of tests. If, as he states, he recommends and
himself uses the exact test for very small samples, why change the logical basis of the test by
switching to the uncorrected χ² for larger samples? His real aim, apparently, is to achieve
satisfactory nominal levels. The uncorrected χ² certainly gives better approximations to these than
does the corrected χ², as Berkson's investigation shows (Table 9). The differences are, of course,
greatest in very small samples.
Professor Liddell falls into the trap of taking as known the estimate of p given by the sample.
This is analogous to using the normal distribution instead of the t distribution for testing a mean
x̄, given an estimate s² of the true variance σ². Would he subscribe to this? I presume that he has
not yet seen Professor Barnard's contribution; I hope that this will give him food for further
thought.
Dr Upton has based his refutation of conditioning on Pearson's 1947 argument of repeatability.
This argument is unconvincing. Comparative trials do not require that the experimental units are
selected at random from some defined population. They depend for their validity on random
allocation of the units to the treatments. Any such experiment can be repeated, but will in general
necessarily be repeated on different units, with fresh randomization. If the units in one or more
trials are themselves a random sample from some larger population then any conclusion emerging
from the trials can be applied to the population, but tests of significance on the individual
experiments are unaffected, as I emphasized at the end of Section 6.
Upton also appears to have misunderstood the playing-card example (Section 5). Both players
know that each pack contains 26 red cards. The test is that p₁ − p₂ > 0.6. The example was
chosen merely to illustrate that knowledge of the margins contributes information on the
probability of getting particular outcomes, and therefore to the assessment of the evidence against
the null hypothesis p₁ = p₂. He concludes that we must look further than the actual sample in
hand; but this misses the whole point of the arguments for conditioning.
I am glad that he agrees that it is best to quote "exceedance levels" (one-tail, two-sided?),
but if so why does he pander to users of "black boxes", of which I suppose HCF/2 is an example,
without even warning them in his paper of the errors of their ways? And why should χ² not be
used for a value which is to be referred to a χ² distribution, just as t and F are used for values

which are to be referred to t and F distributions?
I am glad that Dr Grove, in his survey of statistical text-books, found that conditioning was
accepted without question. I have seen at least one in the last year in which the contrary is true,
but perhaps friends only draw my attention to what they know will shock me. I agree with him
that the 2 × 2 case is special, but to discuss larger tables would be out of place here. It may,
however, be worth mentioning that condensation of parts or the whole of larger tables to form a
2 × 2 table is sometimes useful for making quick tests of association before embarking on more
formal analysis.
Regarding Mr Jagger's query on two-sided probabilities, obviously the direction of the
summation of the probabilities must be the one that gives the lower total. If both directions give
totals greater than 0.5 the cell can be regarded as the "central" cell, not belonging to either tail,
with a two-sided probability of 1.
To sum up, my view of the function of a test of significance of a single 2 × 2 table is that
it should provide a correct measure of the probability P of getting the same or a more extreme
set of results by chance, on the assumption that there is no difference in the probabilities, p₁ and
p₂, of the two lines of the table; and that if by probability we mean the concept applied to tosses
of a coin or spins of a roulette wheel the only correct measure of P is that given by conditioning
on the margins.
Because of the discontinuous nature of the distributions, the combination-of-probabilities test
cannot be used on a set of P's, as I mentioned above. But even with adjusted P's (e.g. by omission
of the continuity correction) their combination is inefficient because there is no measure of the
amounts of information contained in them. For estimates involving several 2 × 2 tables containing
small numbers, likelihood must be used. If the numbers in the tables are not too small,
normal-theory approximations may suffice.
Finally, some comments on my recommendation that if a two-sided probability is required
this should be obtained by doubling the one-tail probability actually observed. This seems the
natural thing to do with any continuous distribution, whether symmetrical or not, as it is the
probability of the departure from the null value which measures its significance, and, as I pointed
out in the paper, this is invariant under transformation. Moreover, when estimation is involved,
the direction of the deviation of the parameter from any assumed null value is usually vital.
Therefore, in general, one should think in terms of one-tail probabilities, as was implied in the
first paragraph of Section 4. This is so whether we are dealing with continuous or discontinuous
distributions. It is unfortunate that the customary two-sided tabulations of the normal and t
distributions have had the opposite effect.
This conclusion is relevant to Professor Finney's contribution. I would question whether
we are ever really in a position to state "I know that any apparent deviation in the other direction,
however large, is due to chance". On the contrary, it is our especial duty to draw attention to any
evidence contradicting our hopes or beliefs. But it is the one-tail probability that provides the
correct measure of the strength of this evidence. Only when tests are being made to check that,
for example, two strains of a virus are not materially different, as in Wilson's query to Fisher,
or that a coin or die used for gambling is unbiased, can use of a two-sided probability really be
justified.
In discontinuous distributions there is the further problem that, except in symmetrical
distributions, there will rarely be any value on the opposite tail which has the same one-tail
probability as that on the observed tail. The Cox-Hinkley rule (originally, I believe, suggested
by Irwin) is then to take the value having the next lower probability to that observed and add this
to the observed probability. This seems more sensible than taking the value with the same or
next greater deviation (method (ii) of Table 9), which is that widely adopted. But why not take
the next higher probability, or better still a weighted mean of the two? Once this latter
possibility is considered it is obvious that weights could be chosen so as to give a probability
equal to that observed. And then we are back at the doubling rule.
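
The two rules contrasted in this paragraph can be put side by side in a short sketch. It is an illustration under stated assumptions, not a reproduction of Tables 9 and 10: the function names and the margin arguments r1, r2, c1 are hypothetical, and next_lower_rule follows the rule only as it is worded above (the opposite-tail probability that is the largest one not exceeding the observed one-tail probability is added to it), which may differ in detail from the formulation given by Cox and Hinkley.

```python
from fractions import Fraction
from math import comb

def conditional_pmf(r1, r2, c1):
    """Exact null distribution of the top-left cell a, given all margins."""
    total = comb(r1 + r2, c1)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    return {a: Fraction(comb(r1, a) * comb(r2, c1 - a), total)
            for a in range(lo, hi + 1)}

def tails(a_obs, r1, r2, c1):
    """Return the pmf with the lower and upper one-tail probabilities."""
    pmf = conditional_pmf(r1, r2, c1)
    lower = sum(p for a, p in pmf.items() if a <= a_obs)
    upper = sum(p for a, p in pmf.items() if a >= a_obs)
    return pmf, lower, upper

def doubling_rule(a_obs, r1, r2, c1):
    """Twice the smaller one-tail probability, with a ceiling of 1."""
    _, lower, upper = tails(a_obs, r1, r2, c1)
    return min(Fraction(1), 2 * min(lower, upper))

def next_lower_rule(a_obs, r1, r2, c1):
    """Observed one-tail probability plus the largest opposite-tail
    probability that does not exceed it (zero if there is none)."""
    pmf, lower, upper = tails(a_obs, r1, r2, c1)
    if upper <= lower:   # observed value lies in the upper tail
        observed = upper
        opposite = [sum(p for a, p in pmf.items() if a <= k)
                    for k in pmf if k < a_obs]
    else:                # observed value lies in the lower tail
        observed = lower
        opposite = [sum(p for a, p in pmf.items() if a >= k)
                    for k in pmf if k > a_obs]
    eligible = [q for q in opposite if q <= observed]
    return observed + max(eligible, default=Fraction(0))
```

For a table with equal row totals the conditional distribution is symmetrical and the two functions agree; with unequal row totals the second can never exceed the first and is usually smaller, for example with the made-up margins r1 = 10, r2 = 20, c1 = 6 and a = 5.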
The consequences of using the deviations rule were displayed in Tables 9 and 10 and
commented on in the text. With the Cox-Hinkley rule (CH) there will almost always be mismatch,
except in symmetrical distributions. Consequently the CH probability will almost always be less
than double the observed one-tail probability, and can never be greater, as Dr Hill observed. On
the other hand the asymmetry of the probabilities will not be obliterated. In Table 9, for example,
the CH probabilities for tables C and D are almost identical with one another, and (by chance)

with column (ii) of table D. But as in the deviations method there are sudden large changes in
particular probabilities with slight changes in the data. For the four tables with n₂ = 59 (table D),
60 (table C), 61, 62, the CH probabilities for a = 1 are 0.102, 0.104, 0.107, 0.170 respectively,
compared with the values given by doubling of 0.151, 0.160, 0.169, 0.178. Surely these latter
show a more reasonable progression?
Of the discussants, Dr Cormack was the only one who openly condemned the Cox-Hinkley
rule, and even he had reservations on the complete validity of the doubling rule. However, Fisher,
in his letter to Finney, has opportunely given posthumous support to the doubling rule; his use
of nominal levels leads to the same conclusion, by a different route, as that argued above.
Professor Cox has also given the doubling rule qualified support, "as the appealing thing to do".
Professor Healy, in private correspondence, dismissed Fisher's suggestion, because of its use of
nominal levels, as "pure Neyman-Pearson". In his written contribution he objected that the result
of doubling "is not the probability of any definable event". But nor is a one-tail probability, as
larger deviations in the same direction are included. There seems no compelling reason for using a
strict > rule for the opposite tail, whether our measure is the deviations themselves, their
cumulative probability, or the log-odds-ratio. Healy favours this last as an all-purpose measure of
association, and recommends its use for two-sided tests, in spite of the strong and fully justified
objection by Dr Hill, given in his account of his dispute with Dr Pike. I also doubt whether the
log-odds-ratio always, or indeed generally, provides the best way of summarising the information
in a single small table. A report of the values of p₁ and p₂, accompanied by a test of significance,
is frequently preferable. The assessment of the efficiency of inoculation, mentioned above,
provides an example in which the log-odds-ratio would merely confuse. Nor would I like to give
any encouragement to the Bayesians.
I was disappointed that Dr Mantel was not convinced by my criticism of Haber's results,
particularly in view of his long-standing support of Fisher's exact test. Perhaps the further
comments on the problem that I have made here will serve to convince him that any of the rules
that have been proposed, other than that of doubling, lead to irregular changes in the two-sided
probability associated with an observed one-tail probability which is itself little changed by slight
changes in the data. Mantel calls such irregularity "unseemly behaviour"; I would say that it
indicates that there is likely to be something wrong with our reasoning.
I am not disputing that Mantel's test gives approximations to Haber's exact exceedance
probabilities which are better than those given by χ² or χ_c²; I am merely saying that the test is
aimed at the wrong target. Surely Mantel's admission that the same observed one-tail probability
can give rise to different two-sided probabilities, according to the rule used, is an indication that
something is wrong with some or all of the rules.

REFERENCES IN THE DISCUSSION


Anscombe, F. J. (1981) Computing in Statistical Science through APL. New York: Springer-Verlag.
Armitage, P. (1955) Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386.
Barnard, G. A. (1982a) Conditionality vs similarity in the analysis of 2 × 2 tables. In Statistics and Probability:
Essays in Honor of C. R. Rao (G. Kalianpur et al., eds), pp. 59-65. Amsterdam: North-Holland.
- (1982b) Letter to the Editors on 2 × 2 Tables. Appl. Statist., 31, 304-305.
Bartlett, M. S. (1937) Properties of sufficiency and statistical tests. Proc. Roy. Soc. A, 160, 268-282.
Clayton, D. G. (1974) Some odds ratio statistics for the analysis of ordered categorical data. Biometrika, 61,
525-531.
Cox, D. R. and Hinkley, D. V. (1974) Theoretical Statistics. London: Chapman and Hall.
Fienberg, S. E. (1980) Fisher's contributions to the analysis of categorical data. In R. A. Fisher: An
Appreciation (S. E. Fienberg and D. V. Hinkley, eds), pp. 75-84.
Hill, I. D. and Pike, M. C. (1965) Algorithm 4: TWOBYTWO. Computer Bull., 9, 56-63. (Reprinted in
Computer J. (1979) 22, 87-88; Addenda in Computer J. (1966) 9, 212; and (1967) 9, 416.)
Hinde, J. P. and Aitkin, M. A. (1984) Nuisance parameters, canonical likelihoods and direct likelihood inference.
Research paper No. 4, Centre for Applied Statistics, University of Lancaster.
Lancaster, H. O. (1969) The Chi-squared Distribution. New York: Wiley.
Lehmann, E. L. (1959) Testing Statistical Hypotheses. New York: Wiley.
Liddell, D. (1976) Practical tests of 2 × 2 contingency tables. Statistician, 25, 295-304.
- (1980) Practical tests for comparative trials: a rejoinder to N. L. Johnson. Statistician, 29, 205-207.
Simpson, E. H. (1951) The interpretation of interaction in contingency tables. J. R. Statist. Soc. B, 13,
238-241.

Yates, F. (1955a) The use of transformations and maximum likelihood in the analysis of quantal experiments
involving two treatments. Biometrika, 42, 382-403.
- (1955b) A note on the application of the combination of probabilities test to a set of 2 × 2 tables.
Biometrika, 42, 404-411.

As a result of the ballot held during the meeting the following were elected Fellows of the
Society.
Cheesbrough, Anne       Griffin, Thomas James         Shariff, Nazneen
Dagpunar, John S.       Hall, Peter Gavin             Somchiwong, Malinee
Dunn, Richard           Jones, Michael Christopher    Streeter, Marion Jane
Emes, Gerald R.         Mukherjee, Dipak              Taylor, Wayne A.
Ghezzo, Ruben H.        Rogers, John Wistar
