
Beta-Binomial Anova for Proportions

Author(s): Martin J. Crowder


Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 27, No. 1
(1978), pp. 34-37
Published by: Wiley for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2346223 .
Accessed: 12/12/2014 09:05


This content downloaded from 152.74.16.139 on Fri, 12 Dec 2014 09:05:13 AM


All use subject to JSTOR Terms and Conditions
Appl. Statist. (1978), 27, No. 1, pp. 34-37

Beta-binomial Anova for Proportions
By MARTIN J. CROWDER
Surrey University, Guildford, Britain
[Received January 1977. Revised May 1977]
SUMMARY
A method is proposed for the regression analysis of proportions based on the beta-binomial distribution.
Keywords: ANOVA FOR PROPORTIONS; BETA-BINOMIAL DISTRIBUTION
1. INTRODUCTION
THE problem considered here arose from research conducted by microbiologist Dr P. Whitney of Surrey University. A batch of tiny seeds is brushed onto a plate covered with a certain extract at a given dilution. The numbers of germinated and ungerminated seeds are subsequently counted. A considerable amount of data has thus been generated, with 4 types of seed, 3 different extracts, several serial dilutions and upwards of 5 replicates for many combinations. It is clear, however, from inspection of the data that there is heterogeneity of proportions between replicates, this observation being supported (in fact insisted upon) by the experimenter; $\chi^2$ tests on the appropriate $2 \times m$ tables give some confirmation, though the frequencies are often small. Such a situation, with variation of proportions between replicates, cannot be uncommon in other contexts, but the problem of analysing the variation within and between data sets does not seem to have been tackled before. The analogous situation for continuous data is the standard nested mixed model where the different treatment groups represent the fixed effects and replications within treatments represent the random effects.
Various approximate methods might be used. Suppose that there are $m_i$ observed proportions in the $i$th data set ($i = 1, \ldots, k$), and let $m_+$ denote $\sum_i m_i$. A $2 \times m_+$ table yields a $\chi^2$ with $(m_+ - 1)$ d.f. of which a component with $k - 1$ d.f. can be assigned to the contrast between sets, and such partitioning of $\chi^2$ can be extended to the case of cross-classified proportions. But the method does not cater for variation of expected proportions within cells. This is also true of Logit Analysis. Another method would be analysis of variance applied to the angular transforms of the observed proportions. This allows for within-cell variation, but the variance stabilizing property of the transform depends on each $n$ being large.
A standard model is the beta-binomial distribution (BBD) in which the expected proportions are beta-distributed. Chatfield and Goodhart (1970) discuss the use of BBD in connection with consumer purchasing; they fit data by matching the mean and zero count. Another application is given by Griffiths (1973) whose title is self-explanatory. However, in these papers, and others cited in them, the purpose is to fit a single set of data. Here suggestions are made for using BBD as an error distribution for regression. The approach has several advantages:
(i) it is based on a model which is exactly realizable and contains parameters capable of meaningful interpretation;
(ii) it is flexible in that the particular assumptions made about the parameters, representing (a) the type of within-cell heterogeneity, and (b) the form of the regression relationship, can be varied to suit the data;
(iii) for one-way Anova a single subroutine (Fortran IV, about 60 instructions, supplied on request) suffices to compute the various likelihoods, in conjunction with any standard routine for function minimization; more general linear models are likewise simply applied.
2. ONE-WAY ANOVA
Consider first a single set of proportions, with data $\{(r_j, n_j): j = 1, \ldots, m\}$. The $j$th count $r_j$ is binomial $(n_j, p_j)$ conditionally on $p_j$, and the $p_j$'s are assumed to be i.i.d. beta $(\gamma, \delta)$. Thus $p_j$ has mean $\pi = \gamma/(\gamma + \delta)$ and variance $\pi(1 - \pi)/(\gamma + \delta + 1) = \sigma^2 \pi(1 - \pi)$, say. Since $\gamma$ and $\delta$ are positive for a beta distribution, $\sigma^2 < 1$. In fact for the present application we envisage only $\gamma > 1$ and $\delta > 1$, so that the beta density is zero at 0 and 1, and has a mode at $(\gamma - 1)/(\gamma + \delta - 2)$; in this case $\sigma^2 < 1/3$.
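The mapping between $(\gamma, \delta)$ and $(\pi, \sigma^2)$ can be checked numerically. The following is a minimal Python sketch; the function names and the values $\gamma = 2$, $\delta = 3$ are illustrative assumptions, not taken from the paper.

```python
def mean_sigma2_from_beta(gamma, delta):
    """Map beta(gamma, delta) to (pi, sigma^2) as defined in the text."""
    c = gamma + delta
    pi = gamma / c
    sigma2 = 1.0 / (c + 1.0)
    return pi, sigma2


def beta_from_mean_sigma2(pi, sigma2):
    """Inverse map: recover (gamma, delta) from (pi, sigma^2), 0 < sigma^2 < 1."""
    c = 1.0 / sigma2 - 1.0
    return c * pi, c * (1.0 - pi)


def beta_variance(gamma, delta):
    """Standard variance of a beta(gamma, delta) variable."""
    c = gamma + delta
    return gamma * delta / (c * c * (c + 1.0))


gamma, delta = 2.0, 3.0  # arbitrary illustrative values with gamma, delta > 1
pi, sigma2 = mean_sigma2_from_beta(gamma, delta)
print(pi, sigma2)                         # mean and sigma^2
print(sigma2 * pi * (1.0 - pi))           # variance written as sigma^2 * pi * (1 - pi)
print(beta_variance(gamma, delta))        # the same variance from the standard formula
print(beta_from_mean_sigma2(pi, sigma2))  # round trip back to (gamma, delta)
```

Working in $(\pi, \sigma^2)$ rather than $(\gamma, \delta)$ separates the location of the proportions from their between-replicate spread, which is what makes a homogeneity-of-variance assumption easy to state.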
Now suppose that there are $k$ sets of proportions, $\{(r_{ij}, n_{ij}): i = 1, \ldots, k;\ j = 1, \ldots, m_i\}$, $r_{ij}$ being the number of successes in $n_{ij}$ (fixed) trials. The parameters for the $i$th data set are $(\gamma_i, \delta_i)$, or equivalently $(\pi_i, \sigma_i^2)$. By analogy with Anova for interval-level data an assumption will be made concerning homogeneity of variance between data sets. We will take $\sigma_i^2 = \sigma^2$ here for illustration, and because it turns out to be reasonable for the seed data. The number of parameters is thus reduced from $2k$ to $k + 1$, and the log-likelihood is

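$$
l(\pi, c) = \sum_{i,j} \Bigl[ \ln \binom{n_{ij}}{r_{ij}} + \sum_s \ln(c\pi_i + r_{ij} - s) + {\sum_s}' \ln\{c(1 - \pi_i) + n_{ij} - r_{ij} - s\} - {\sum_s}'' \ln(c + n_{ij} - s) \Bigr],
$$

where the summation $\sum_s$ is over $1 \le s \le r_{ij}$ and is omitted altogether if $r_{ij} = 0$, ${\sum_s}'$ is over $1 \le s \le n_{ij} - r_{ij}$ and is omitted if $r_{ij} = n_{ij}$, and ${\sum_s}''$ is over $1 \le s \le n_{ij}$; $c = \gamma + \delta = \sigma^{-2} - 1$ is used for algebraic convenience.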
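The log-likelihood can be transcribed term by term. The following Python sketch is offered as an illustration only (the paper's own routine was a Fortran IV subroutine that is not reproduced here); the function name `bb_loglik` and the data layout, a list of $k$ lists of `(r, n)` pairs, are assumptions made for this example.

```python
import math


def bb_loglik(pi, c, data):
    """Beta-binomial log-likelihood l(pi, c) with a common c.

    pi   : list of k expected proportions, each strictly inside (0, 1)
    c    : gamma + delta = 1 / sigma^2 - 1, must be positive
    data : list of k lists of (r, n) pairs, one list per data set
    """
    total = 0.0
    for pi_i, group in zip(pi, data):
        for r, n in group:
            total += math.log(math.comb(n, r))
            # Sum over 1 <= s <= r; the range is empty when r = 0.
            total += sum(math.log(c * pi_i + r - s) for s in range(1, r + 1))
            # Sum over 1 <= s <= n - r; the range is empty when r = n.
            total += sum(math.log(c * (1.0 - pi_i) + n - r - s)
                         for s in range(1, n - r + 1))
            # Sum over 1 <= s <= n.
            total -= sum(math.log(c + n - s) for s in range(1, n + 1))
    return total


# Sanity check: for a single observation the implied probabilities
# over r = 0, ..., n should sum to one.
n, pi0, c0 = 6, 0.3, 4.0
probs = [math.exp(bb_loglik([pi0], c0, [[(r, n)]])) for r in range(n + 1)]
print(sum(probs))
```

The empty `range` objects reproduce the convention that a sum is omitted when $r_{ij} = 0$ or $r_{ij} = n_{ij}$, so no special cases are needed.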
Experience with fitting the model to the seed data has drawn attention to the following points:
(i) If $r_{q+} = 0$, so that there are no successes in the $q$th set, the value $\hat\pi_q = 0$ maximizes $l$ irrespective of the values of the other parameters, and the contribution to $l$ by the $q$th set is then zero and may be omitted. However, in this case, it can be seen that $\partial l/\partial \pi_q \ne 0$ at $\hat\pi_q$, and so the asymptotic approximations are invalid for inference involving $\pi_q$. Similarly, for the other extreme case $r_{q+} = n_{q+}$, when $\hat\pi_q = 1$.
(ii) The value $c = \infty$ ($\sigma^2 = 0$) always satisfies $\partial l/\partial c = 0$, and then the unique solution to $\partial l/\partial \pi_i = 0$ is $\hat\pi_i = r_{i+}/n_{i+}$. It has been found in some cases, where the within-set variation of sample proportions is small, that the iterative maximization process fails to converge to a finite $c$. (Such behaviour can also be predicted analytically.) This is taken as an indication that in such cases it is inappropriate to allow heterogeneity of proportions.
(iii) Regarding the existence of solutions to the likelihood equations, it is found that $\partial l/\partial \pi_i$ is positive near $\pi_i = 0$, and negative near $\pi_i = 1$. Also $\partial^2 l/\partial \pi^2$ is negative definite. Thus, for given $c$, $\hat\pi$ exists and is unique, and belongs to $(0, 1)$.
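Point (iii) suggests one simple way to organise the maximization: for each fixed $c$ every $\pi_i$ can be found by a one-dimensional search, and $c$ is then chosen to maximize the resulting profile. The Python sketch below follows that idea using golden-section search in place of the paper's (unspecified) minimization routine; the function names, search bounds and tolerances are all assumptions, and the profile in $\ln c$ is assumed to be unimodal over the search interval. In the situation of point (ii) the search would simply run to the upper bound on $c$.

```python
import math

# Table 1 data as (r, n) pairs for dilutions 1/1, 1/25 and 1/625.
TABLE1 = [
    [(2, 43), (9, 51), (5, 44), (16, 71), (2, 24), (0, 7)],
    [(17, 19), (43, 56), (79, 87), (50, 55), (9, 10)],
    [(11, 13), (47, 62), (90, 104), (46, 51), (9, 11)],
]


def set_loglik(pi_i, c, group):
    """Contribution of one data set to l(pi, c)."""
    total = 0.0
    for r, n in group:
        total += math.log(math.comb(n, r))
        total += sum(math.log(c * pi_i + r - s) for s in range(1, r + 1))
        total += sum(math.log(c * (1.0 - pi_i) + n - r - s)
                     for s in range(1, n - r + 1))
        total -= sum(math.log(c + n - s) for s in range(1, n + 1))
    return total


def golden_max(f, lo, hi, tol):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    x1, x2 = b - g * (b - a), a + g * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:   # maximum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + g * (b - a)
            f2 = f(x2)
        else:         # maximum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - g * (b - a)
            f1 = f(x1)
    return (a + b) / 2.0


def best_pi(c, data):
    """For fixed c, maximize each set's contribution over pi_i in (0, 1)."""
    return [golden_max(lambda p: set_loglik(p, c, grp), 1e-6, 1.0 - 1e-6, 1e-9)
            for grp in data]


def profile_loglik(log_c, data):
    """Log-likelihood maximized over the pi_i at c = exp(log_c)."""
    c = math.exp(log_c)
    return sum(set_loglik(p, c, grp) for p, grp in zip(best_pi(c, data), data))


def fit_common_c(data, log_c_lo=0.0, log_c_hi=12.0):
    """Maximize the profile over ln c on an assumed search interval."""
    log_c = golden_max(lambda u: profile_loglik(u, data), log_c_lo, log_c_hi, 1e-6)
    c = math.exp(log_c)
    return best_pi(c, data), c, profile_loglik(log_c, data)


pi_hat, c_hat, ll_hat = fit_common_c(TABLE1)
print(pi_hat, c_hat, ll_hat)
```

Searching over $\ln c$ rather than $c$ keeps the step sizes sensible across several orders of magnitude. A set with $r_{q+} = 0$ or $r_{q+} = n_{q+}$, as in point (i), would push its $\pi_q$ to the edge of the search interval and is better dropped beforehand.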
Table 1 contains data for the seed Orobanche cernua cultivated in three dilutions of a bean root extract. The mean proportions for the three sets are 0.142, 0.872 and 0.842, and the overall mean is 0.614. There does appear to be heterogeneity of proportions within sets, particularly for the first dilution, though the corresponding Pearson $\chi^2$ values are only 9.91 (5 d.f.), 7.30 (4 d.f.), 5.15 (4 d.f.); the sum is 22.36 which, with 13 d.f., is just borderline at the 5 per cent significance level. Another way of testing for such heterogeneity would be to compare the maximized likelihood under the BBD model with that under the straight binomial model, the latter being a special case ($\sigma = 0$) of the former. However, this method would require computation of m.l.e.'s for the BBD model, perhaps only to find that it is not needed.
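The mean proportions and Pearson $\chi^2$ values quoted above can be recomputed from the Table 1 counts. The sketch below is illustrative; the function name `pearson_chi2` and the dictionary layout are assumptions, and the statistic is the usual one for a $2 \times m$ table with a pooled proportion.

```python
# Table 1 data as (r, n) pairs, keyed by dilution.
TABLE1 = {
    "1/1": [(2, 43), (9, 51), (5, 44), (16, 71), (2, 24), (0, 7)],
    "1/25": [(17, 19), (43, 56), (79, 87), (50, 55), (9, 10)],
    "1/625": [(11, 13), (47, 62), (90, 104), (46, 51), (9, 11)],
}


def pooled_proportion(group):
    """Total successes divided by total trials for one data set."""
    return sum(r for r, _ in group) / sum(n for _, n in group)


def pearson_chi2(group):
    """Pearson chi-square for homogeneity of proportions in a 2 x m table."""
    p = pooled_proportion(group)
    return sum((r - n * p) ** 2 / (n * p * (1.0 - p)) for r, n in group)


chi2 = {name: pearson_chi2(group) for name, group in TABLE1.items()}
for name, group in TABLE1.items():
    # dilution, mean proportion, chi-square, degrees of freedom (m - 1)
    print(name, round(pooled_proportion(group), 3), round(chi2[name], 2), len(group) - 1)
print("sum", round(sum(chi2.values()), 2))
```

Each term combines the success and failure cells of one replicate, since $(r - np)^2/(np) + (r - np)^2/\{n(1 - p)\} = (r - np)^2/\{np(1 - p)\}$.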
The maximized log-likelihoods for the BBD model with various numbers of fitted parameters are given in Table 2; $G^2$ denotes twice the difference between the corresponding log-likelihood and the one above it. Using the standard large sample approximation based here on the magnitude of the $n_{i+}$'s, $G^2 = 0.324$ is clearly non-significant as $\chi^2_2$. Thus the particular form of assumption made about homogeneity of variance seems tenable. The m.l.e.'s are $\hat\pi = (0.132, 0.871, 0.839)$, $\hat c = 78.424$ and $\hat\sigma = 0.112$, with estimated standard errors 0.027, 0.028, 0.032 ($\hat\pi$) and 0.0525 ($\hat\sigma$). The difference, $G^2 = 42.534$, is highly significant as $\chi^2_2$, confirming the obvious differences in $\pi$ between data sets.
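The $G^2$ values follow directly from the reported log-likelihoods, and $\hat\sigma$ from $\hat c$ via $c = \sigma^{-2} - 1$. The following sketch reproduces that arithmetic. Each comparison drops two parameters, so 2 d.f. is used for the tail probabilities, for which the $\chi^2$ upper tail has the closed form $\exp(-x/2)$. The variable and function names are assumptions.

```python
import math

# Maximized log-likelihoods reported for the one-way data.
loglik_separate_sigma2 = -34.829  # pi_1, pi_2, pi_3 and one sigma^2 per set
loglik_common_sigma2 = -34.991    # pi_1, pi_2, pi_3 and a common sigma^2
loglik_common_pi = -56.258        # common pi and common sigma^2


def g2(loglik_larger_model, loglik_smaller_model):
    """Twice the difference between two maximized log-likelihoods."""
    return 2.0 * (loglik_larger_model - loglik_smaller_model)


def chi2_sf_2df(x):
    """Upper tail probability of a chi-square variable with 2 d.f."""
    return math.exp(-x / 2.0)


g2_sigma = g2(loglik_separate_sigma2, loglik_common_sigma2)
g2_pi = g2(loglik_common_sigma2, loglik_common_pi)
print(g2_sigma, chi2_sf_2df(g2_sigma))  # homogeneity of sigma^2, P about 0.85
print(g2_pi, chi2_sf_2df(g2_pi))        # equality of the pi_i

c_hat = 78.424
sigma_hat = math.sqrt(1.0 / (c_hat + 1.0))  # from c = 1 / sigma^2 - 1
print(sigma_hat)
```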


TABLE 1
Data for O. cernua seed in bean root extract

   Dilution 1/1        Dilution 1/25       Dilution 1/625

   r    n   r/n        r    n   r/n        r    n   r/n

   2   43  0.05       17   19  0.89       11   13  0.85
   9   51  0.18       43   56  0.77       47   62  0.76
   5   44  0.11       79   87  0.91       90  104  0.87
  16   71  0.23       50   55  0.91       46   51  0.90
   2   24  0.08        9   10  0.90        9   11  0.82
   0    7  0.00

TABLE 2
Log-likelihoods for one-way data

Fitted parameters                                            Log-likelihood    $G^2$

$\pi_1, \pi_2, \pi_3, \sigma_1^2, \sigma_2^2, \sigma_3^2$    -34.829
$\pi_1, \pi_2, \pi_3, \sigma^2$                              -34.991           0.324
$\pi, \sigma^2$                                              -56.258           42.534

3. DEVELOPMENTS
In order to generalize the analysis it is only necessary to regard the subscript $i$ in $(r_{ij}, n_{ij})$ as defining the set of conditions under which those data were generated, keeping $j$ for replicates where there are more than one for each $i$. Thus $i$ may represent a combination of factor levels, or a set of covariates, or both. There now arises the possibility of relating the $\pi_i$'s to the factors and covariates using a regression equation.
In Table 3 some more seed data are given as a $2 \times 2$ factorial layout. There are two types of seed, O. aegyptiaca 75 and O. aegyptiaca 73, and two root extracts, bean and cucumber, the dilution being 1/125 throughout. Applying first the one-way Anova described above to these four data sets we find log-likelihoods -53.667 for the BBD model with 8 parameters (i.e. a $(\pi, \sigma)$ pair for each data set), -53.767 with 5 parameters ($\pi_1, \pi_2, \pi_3, \pi_4$ and a common $\sigma$), and -64.516 with 2 parameters (common $\pi$ and $\sigma$). Thus homogeneity of $\sigma^2$ is supported ($G^2 = 0.2$, 3 d.f.) and differences between $\pi$ values are highly significant ($G^2 = 21.498$, 3 d.f.).

TABLE 3
Data for seeds O. aegyptiaca 75 and 73, bean and cucumber root extracts

          O. aegyptiaca 75                        O. aegyptiaca 73

   Bean               Cucumber             Bean               Cucumber

   r    n   r/n       r    n   r/n         r    n   r/n       r    n   r/n

  10   39  0.26       5    6  0.83         8   16  0.50       3   12  0.25
  23   62  0.37      53   74  0.72        10   30  0.33      22   41  0.54
  23   81  0.28      55   72  0.76         8   28  0.29      15   30  0.50
  26   51  0.51      32   51  0.63        23   45  0.51      32   51  0.63
  17   39  0.44      46   79  0.58         0    4  0.00       3    7  0.43
                     10   13  0.77

The $\pi$'s may now be regressed on dummy variables representing the $2 \times 2$ factorial structure. A logit model has been tried, in which $\ln\{\pi_i/(1 - \pi_i)\}$ is a linear combination of the factorial effects. In Table 4a the maximized log-likelihoods are given, and Table 4b contains the corresponding values set out as in a standard Anova table. The "extra sum of squares" convention has been used here, so that each main effect is adjusted for the other, and the interaction is adjusted for all main effects; thus, e.g. for seeds, $G^2 = 2 \times (57.196 - 55.832) = 2.728$. (The notation $G^2$ is borrowed from Bishop et al. (1975); Table 4.4-8 there may be compared with Tables 2 and 4b here.) It appears that the most important effect is Extracts,
but with the suspicion of a non-zero interaction term so that the effects may not be additive on the logit scale. Table 4a shows that when bean is replaced by cucumber logit($\pi$) is increased more for O. aegyptiaca 75 than for O. aegyptiaca 73.
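The additivity statement can be illustrated from the fitted proportions reported in Table 4a. In the sketch below the labelling of $\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$ as (seed, extract) with seed 1 = O. aegyptiaca 75 and extract 1 = bean is an assumption, chosen because it agrees with the "Seeds only" and "Extracts only" rows and with the raw proportions in Table 3. The function names are illustrative.

```python
import math


def logit(p):
    """Log-odds ln{p / (1 - p)}."""
    return math.log(p / (1.0 - p))


# Fitted proportions from Table 4a, keyed by (seed, extract).
main_effects_fit = {
    ("75", "bean"): 0.405, ("75", "cucumber"): 0.651,
    ("73", "bean"): 0.326, ("73", "cucumber"): 0.570,
}
full_model_fit = {
    ("75", "bean"): 0.368, ("75", "cucumber"): 0.685,
    ("73", "bean"): 0.391, ("73", "cucumber"): 0.519,
}


def extract_effect(fit, seed):
    """Change in logit(pi) when bean is replaced by cucumber for one seed type."""
    return logit(fit[(seed, "cucumber")]) - logit(fit[(seed, "bean")])


# Under the main-effects model the two changes agree (additivity on the logit scale).
print(extract_effect(main_effects_fit, "75"), extract_effect(main_effects_fit, "73"))
# Under the full model they differ, which is what the interaction term captures.
print(extract_effect(full_model_fit, "75"), extract_effect(full_model_fit, "73"))
```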

TABLE 4a
Logit log-likelihoods

                                                 M.l.e.'s
Factorial effects       Number of                                                    Log-
included                parameters    $\pi_{11}$  $\pi_{12}$  $\pi_{21}$  $\pi_{22}$    likelihood

None                    2             0.494       0.494       0.494       0.494         -64.516
Seeds only              3             0.538       0.538       0.435       0.435         -63.553
Extracts only           3             0.376       0.620       0.376       0.620         -57.196
Seeds + extracts        4             0.405       0.651       0.326       0.570         -55.832
  (main effects)
Full model with         5             0.368       0.685       0.391       0.519         -53.767
  interaction

TABLE 4b
$\chi^2$ values for effects

Source            d.f.    $G^2$

Main effects      2       17.368 (P < 0.001)
  Seeds           1        2.728 (P > 0.05)
  Extracts        1       15.442 (P < 0.001)
Interaction       1        4.130 (P < 0.05)
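The entries of Table 4b can be recomputed from the log-likelihoods in Table 4a under the "extra sum of squares" convention. The sketch below does this with closed-form $\chi^2$ tail probabilities for 1 and 2 d.f.; the dictionary keys and function names are assumptions made for the example.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability of chi-square for df = 1 or 2 (closed forms)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


# Maximized log-likelihoods from Table 4a.
loglik = {
    "none": -64.516,
    "seeds": -63.553,
    "extracts": -57.196,
    "main": -55.832,
    "full": -53.767,
}

# (larger model, smaller model, d.f.); each main effect is adjusted for the other.
comparisons = {
    "Main effects": ("main", "none", 2),
    "Seeds": ("main", "extracts", 1),
    "Extracts": ("main", "seeds", 1),
    "Interaction": ("full", "main", 1),
}

table_4b = {}
for source, (larger, smaller, df) in comparisons.items():
    g2 = 2.0 * (loglik[larger] - loglik[smaller])
    table_4b[source] = (df, g2, chi2_sf(g2, df))
    print(source, df, round(g2, 3), table_4b[source][2])
```

The seeds effect is measured against the extracts-only fit, and the extracts effect against the seeds-only fit, which is why the two single-d.f. values do not add up to the 2-d.f. main-effects value.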

It is clear that the approach is generally applicable and that the assumptions made, namely (i) homogeneity of $\sigma^2$ (as used for the seed data) or of $\sigma^2 \pi(1 - \pi)$, or perhaps of some other quantity, and (ii) the form of regression of $\pi_i$ on the explanatory variables, whether logit or some other scale, can be varied as appropriate to the data. Further generalization to multivariate data, where one has multinomial rather than binomial observations, may be accomplished using the Dirichlet distribution in place of the beta; a comprehensive account of the resulting compound multinomial distribution may be found in Mosimann (1962).
ACKNOWLEDGEMENTS
I thank the referees and the Editor for helpful comments and for references to previous applications of the BBD.
REFERENCES
BISHOP, Y. M. M., FIENBERG, S. E. and HOLLAND, P. W. (1975). Discrete Multivariate Analysis. M.I.T. Press.
CHATFIELD, C. and GOODHART, G. J. (1970). The beta-binomial model for consumer purchasing behaviour. Appl. Statist., 19, 240-250.
GRIFFITHS, D. A. (1973). Maximum likelihood estimation for the beta-binomial distribution and an application to the household distribution of the total number of cases of a disease. Biometrics, 29, 637-648.
MOSIMANN, J. E. (1962). On the compound multinomial distribution, the multivariate $\beta$-distribution, and correlation among proportions. Biometrika, 49, 65-82.
