
A Score Test for Zero Inflation in a Poisson Distribution

Author(s): Jan van den Broek


Source: Biometrics, Vol. 51, No. 2 (Jun., 1995), pp. 738-743
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2532959
Accessed: 09-04-2015 03:40 UTC


BIOMETRICS 51, 738-743
June 1995

A Score Test for Zero Inflation in a Poisson Distribution


Jan van den Broek
Center for Biostatistics, University of Utrecht,
Yalelaan 7, 3584 CL Utrecht, The Netherlands

SUMMARY

When analyzing Poisson-count data sometimes a lot of zeros are observed. When there are too many zeros a zero-inflated Poisson distribution can be used. A score test is presented to test whether the number of zeros is too large for a Poisson distribution to fit the data well.

1. Introduction
Johnson, Kotz, and Kemp (1992, pp. 312-318) discuss a simple way of modifying a discrete distribution to handle extra zeros. An extra proportion of zeros, $\omega$, is added to the proportion of zeros from the original discrete distribution, $f(0)$, while decreasing the remaining proportions in an appropriate way:

$$
\begin{aligned}
P(Y_i = 0) &= \omega + (1 - \omega) f(0) \\
P(Y_i = y_i) &= (1 - \omega) f(y_i) \qquad (y_i > 0)
\end{aligned}
\tag{1}
$$

They state that it is possible to take $\omega$ less than zero, provided that:

$$
\omega \ge \frac{-f(0)}{1 - f(0)},
$$

with equality for left truncation.
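As an illustration of (1) with a Poisson $f$, the modified probabilities can be sketched in a few lines. This is only a sketch: the function name `zip_pmf` and its argument names are chosen here for illustration and do not come from the paper.

```python
import math

def zip_pmf(y, lam, omega):
    """P(Y = y) for a Poisson(lam) modified by an extra proportion omega of zeros, as in (1)."""
    f0 = math.exp(-lam)  # f(0) of the unmodified Poisson
    # omega may be negative (fewer zeros than Poisson) as long as P(Y = 0) stays >= 0.
    if not (-f0 / (1.0 - f0) <= omega <= 1.0):
        raise ValueError("omega outside the admissible range")
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return omega + (1.0 - omega) * poisson
    return (1.0 - omega) * poisson

# Probabilities still sum to one after the modification.
total = sum(zip_pmf(y, 2.0, 0.3) for y in range(60))
```

At the lower bound for `omega` the probability of a zero becomes 0, which is the left-truncated case mentioned above.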
Farewell and Sprott (1988) discuss an inflated binomial as a mixture model for count data. They also point out the two-population interpretation of this model: in one population one observes only zeros, while in the other one observes counts from a discrete distribution.
As an example of such an interpretation consider a population which consists of two groups: one of people who are not at risk of developing a certain disease and one of people who are at risk and may develop the disease several times. Of course such a model should be plausible in a given situation.
Another example is discussed by Lambert (1992). Manufacturing equipment may be in two states: a perfect state in which the machine produces no defects and an imperfect state in which the machine produces a number of mistakes according to a Poisson distribution. She discusses maximum likelihood estimation and testing in the zero-inflated Poisson regression using:

$$
\ln(\lambda) = X\beta \quad \text{(with } \lambda \text{ the mean of the Poisson distribution)}
$$

$$
\ln\left(\frac{\omega}{1 - \omega}\right) = G\gamma
$$

for covariate matrices $X$ and $G$. Two cases are considered: $\lambda$ and $\omega$ functionally not related and $\lambda$ and $\omega$ functionally related. She also proves the asymptotic normality of the distribution of the parameter estimates and shows that the likelihood ratio statistic is asymptotically distributed as a $\chi^2$ with appropriate degrees of freedom.
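To make the two link functions concrete, the log likelihood of such a zero-inflated Poisson regression can be sketched as below for the case where $\lambda$ and $\omega$ are not functionally related. The function and argument names are illustrative assumptions; this is not Lambert's fitting procedure, only the objective that such a procedure would maximise.

```python
import math

def zip_regression_loglik(beta, gamma, X, G, y):
    """Log likelihood with ln(lambda_i) = x_i beta and logit(omega_i) = g_i gamma."""
    total = 0.0
    for x_i, g_i, y_i in zip(X, G, y):
        lam = math.exp(sum(b * v for b, v in zip(beta, x_i)))
        eta = sum(c * v for c, v in zip(gamma, g_i))
        omega = 1.0 / (1.0 + math.exp(-eta))  # inverse of the logit link
        if y_i == 0:
            # A zero comes either from the "perfect" state or from the Poisson part.
            total += math.log(omega + (1.0 - omega) * math.exp(-lam))
        else:
            total += (math.log(1.0 - omega) - lam
                      + y_i * math.log(lam) - math.lgamma(y_i + 1))
    return total
```

With a strongly negative linear predictor for $\omega$ the value approaches the ordinary Poisson log likelihood, which is the null model considered in the next section.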
These examples assume that the zero-inflated distribution is appropriate or, to put it differently, that the population considered consists of two subpopulations as described above. This is not always obvious. One would like to see if there is some evidence from the observed data to support such an assumption. To achieve this for the zero-inflated Poisson distribution a score test is proposed in the next section.

Key words: Poisson; Score test; Zero inflation.

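2. A Score Test
A score test for $\omega = 0$ in the inflated Poisson has the advantage that one need not fit the inflated Poisson but just a Poisson, which is the distribution under the null hypothesis, instead. Using (1) the density for the inflated Poisson is:

$$
\begin{aligned}
P(Y_i = 0) &= \omega + (1 - \omega) e^{-\lambda_i} \\
P(Y_i = y_i) &= (1 - \omega) \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!} \qquad (y_i > 0).
\end{aligned}
$$

Using $\theta = \omega / (1 - \omega)$ constant ($-f(0) \le \theta < \infty$), the log likelihood can be written as:

$$
l(\lambda, \theta; y) = \sum_i \left\{ -\log(1 + \theta) + 1_{(y_i = 0)} \log(\theta + e^{-\lambda_i}) + 1_{(y_i > 0)} \left[ -\lambda_i + y_i \log(\lambda_i) - \log(y_i!) \right] \right\}, \tag{2}
$$

where $1_{(\text{condition})}$ takes the value 1 if the condition is true and 0 otherwise. From this, taking $\ln(\lambda_i) = x_i \beta$, the score function $U(\beta, \theta)$ and the expected information $J(\beta, \theta)$ can be calculated. The score statistic for testing $\theta = 0$ is then:

$$
S(\hat\beta) = S(\hat\beta, 0) = U^T(\hat\beta, 0) [J(\hat\beta, 0)]^{-1} U(\hat\beta, 0)
= \frac{\left( \sum_i \left\{ \dfrac{1_{(y_i = 0)} - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}} \right\} \right)^2}{\sum_i \left\{ \dfrac{1 - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}} \right\} - \hat\lambda^T X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \hat\lambda} \tag{3}
$$

where $\hat\beta$ and $\hat\lambda_i$ are the estimates of $\beta$ and $\lambda_i$ under the null hypothesis. (See the Appendix for details and, for instance, Cox and Hinkley (1974), pp. 321-325, for a discussion of the score test.)
If the model contains a constant then $\hat\lambda^T X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \hat\lambda = \sum_i \hat\lambda_i = n\bar y$, the latter equality being true due to the estimating equations under the null hypothesis (see Appendix).
If one writes $\hat p_{0i} = P(Y_i = 0) = e^{-\hat\lambda_i}$, then the score statistic can be written as:

$$
S(\hat\beta) = \frac{\left( \sum_{i=1}^{n} \left\{ \dfrac{1_{(y_i = 0)} - \hat p_{0i}}{\hat p_{0i}} \right\} \right)^2}{\sum_{i=1}^{n} \left\{ \dfrac{1 - \hat p_{0i}}{\hat p_{0i}} \right\} - n\bar y} \tag{4}
$$

Under the null hypothesis this statistic will have an asymptotic chi-squared distribution with 1 degree of freedom.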
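The point that only the ordinary Poisson fit is needed can be illustrated with a small pure-Python sketch of (3). Everything here is an assumption made for illustration: the helper names, the plain Newton-Raphson fit without step control, and the requirement that column 0 of `X` is the constant 1. It is not the author's software.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def poisson_means(X, beta):
    return [math.exp(sum(b * v for b, v in zip(beta, row))) for row in X]

def weighted_cross_product(X, lam):
    """X^T diag(lam) X as a list of lists."""
    p = len(X[0])
    return [[sum(l * row[r] * row[s] for row, l in zip(X, lam)) for s in range(p)]
            for r in range(p)]

def fit_null_poisson(X, y, iterations=25):
    """Newton-Raphson for ln(lambda_i) = x_i beta (assumes column 0 of X is 1)."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(iterations):
        lam = poisson_means(X, beta)
        score = [sum((yi - l) * row[r] for row, yi, l in zip(X, y, lam))
                 for r in range(p)]
        step = solve(weighted_cross_product(X, lam), score)
        beta = [b + d for b, d in zip(beta, step)]
    return beta

def score_statistic(X, y):
    """Score statistic (3) for theta = 0, using only the Poisson fit."""
    lam = poisson_means(X, fit_null_poisson(X, y))
    p0 = [math.exp(-l) for l in lam]
    p = len(X[0])
    u = sum(((1.0 if yi == 0 else 0.0) - q) / q for yi, q in zip(y, p0))
    xt_lam = [sum(l * row[r] for row, l in zip(X, lam)) for r in range(p)]
    info_solution = solve(weighted_cross_product(X, lam), xt_lam)
    quad = sum(a * b for a, b in zip(xt_lam, info_solution))
    return u ** 2 / (sum((1.0 - q) / q for q in p0) - quad)
```

The result would be compared with the upper percentage points of a chi-squared distribution with 1 degree of freedom, for instance 3.84 at the 5% level. For ill-conditioned data a fit with step halving, or an established GLM routine, would be the safer choice.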
The term $\hat\lambda^T X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \hat\lambda$ can be read as $E(X^T Y)[\mathrm{var}(X^T Y)]^{-1} E(X^T Y)$. This can be interpreted as an "F value"-like statistic as it relates to $X^T E(Y)$. If $X^T E(Y)$ departs much from zero, the second term in the denominator of (3) will have substantial influence. The statistic $S(\hat\beta)$ can be seen as a goodness-of-fit statistic. It looks at the fit of the zeros but also accounts for the means of the fitted Poisson distribution. This is reasonable, because the question of too many zeros is not only answered by looking at the number of zeros but also by comparing this number with the mean of the observations. A moderate number of zeros and a high mean of the observations can indicate that the number of zeros is too high. If, instead of the Poisson distribution, the binomial distribution $B(n_i, p_i)$ is used, the same statistic is obtained except for the term $E(X^T Y)[\mathrm{var}(X^T Y)]^{-1} E(X^T Y)$, which will have values according to the binomial distribution. $\hat p_{0i}$ then becomes $(1 - \hat p_i)^{n_i}$. However, under the same conditions under which the binomial can be approximated by the Poisson (small $p_i$, $n_i p_i$ constant, and $n_i$ large), the statistic (4) will be an approximation of the same statistic obtained with the binomial distribution.
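The approximation of the zero probability can be made explicit with a standard expansion. This is a sketch added for illustration, assuming $0 < p_i < 1$ and writing $\lambda_i = n_i p_i$:

$$
(1 - p_i)^{n_i} = \exp\{ n_i \log(1 - p_i) \} = \exp\left\{ -n_i p_i - \tfrac{1}{2} n_i p_i^2 - \cdots \right\} \approx e^{-n_i p_i} = e^{-\lambda_i}
$$

when $n_i p_i$ is held constant and $p_i$ is small, since the correction terms are then of order $\lambda_i p_i$.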
3. The case of no covariates

Consider the case where there are $n$ observations, among them $n_0$ zeros, and no covariates. The score statistic for testing whether the Poisson distribution fits the number of zeros well is, in this case:

$$
S(\hat\beta_1) = \frac{(n_0 - n \hat p_0)^2}{n \hat p_0 (1 - \hat p_0) - n \bar y \hat p_0^2} \tag{5}
$$

where $\hat p_0 = e^{-\bar y}$.

Table 1
Percentile points of the statistic $S(\hat\beta_1)$ based on 5,000 samples of size $n$ from a Poisson distribution with mean $\lambda$ and the same points of a $\chi^2(1)$ distribution

                              Percentile points of a $\chi^2(1)$
                          P.7 = 1.07   P.8 = 1.64   P.9 = 2.71   P.95 = 3.84   P.99 = 6.63
n = 100   $\lambda$ = .5     1.13         1.66         2.73         3.86          6.77
          $\lambda$ = 1      1.11         1.68         2.75         3.97          6.87
n = 200   $\lambda$ = .5     1.13         1.37         2.67         3.73          6.52
          $\lambda$ = 1      1.02         1.58         2.56         3.68          6.61


In order to see if the chi-square approximation is appropriate a simulation study was carried out. From a Poisson distribution with mean .5, 5,000 samples were taken once with sample size 100 and once with sample size 200. The same was done with a mean of 1. These small values for $\lambda$ are chosen, because if there are a lot of zeros the mean of the Poisson distribution under the null hypothesis is low. For every sample the score statistic was calculated and afterwards percentile points were obtained. These are to be compared with the percentile points of a chi-square distribution with one degree of freedom (Table 1). The $\chi^2(1)$ approximation for $S(\hat\beta_1)$ looks very reasonable. A reasonably high mean and hardly any zeros gives problems with the approximate chi-squared distribution unless the sample size is large. In this case however, there might be no need for a test on the fit of the zeros.
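A simulation of this kind can be sketched as follows. The random number generator, the seed, and the function names are assumptions made for this sketch, so the percentile points it produces will differ somewhat from those in Table 1.

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-lam)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def score_no_covariates(y):
    """Statistic (5); undefined when every observation is zero (ybar = 0)."""
    n = len(y)
    n0 = sum(1 for v in y if v == 0)
    ybar = sum(y) / n
    p0 = math.exp(-ybar)
    return (n0 - n * p0) ** 2 / (n * p0 * (1.0 - p0) - n * ybar * p0 ** 2)

def simulate_percentiles(lam, n, replicates=5000,
                         probs=(0.7, 0.8, 0.9, 0.95, 0.99), seed=1):
    """Empirical percentile points of (5) under the Poisson null hypothesis."""
    rng = random.Random(seed)
    stats = sorted(
        score_no_covariates([poisson_draw(lam, rng) for _ in range(n)])
        for _ in range(replicates)
    )
    return {p: stats[min(int(p * replicates), replicates - 1)] for p in probs}

# Example call mirroring one row of Table 1 (mean .5, sample size 100).
points = simulate_percentiles(0.5, 100, replicates=1000)
```

The returned dictionary maps each probability to an empirical percentile point, to be set against 1.07, 1.64, 2.71, 3.84 and 6.63 for the $\chi^2(1)$ distribution.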
Cochran (1954) proposed a statistic for comparing the observed and expected frequencies of a single outcome from a Poisson distribution. If one uses this statistic to compare the observed and expected zero-frequencies, the score statistic (5) is obtained.
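Another statistic for looking at the number of zeros in the case of no covariates was proposed by Rao and Chakravarti (1956):

$$
\frac{\left( n_0 - n \left( \dfrac{n - 1}{n} \right)^{n\bar y} \right)^2}{n \left( \dfrac{n - 1}{n} \right)^{n\bar y} \left[ 1 - n \left( \dfrac{n - 1}{n} \right)^{n\bar y} \right] + n(n - 1) \left( \dfrac{n - 2}{n} \right)^{n\bar y}}
$$
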
This statistic is obtained by conditioning on the sum of the observations. In a simulation study El-Shaarawi (1985) compared the above two statistics together with the likelihood ratio statistic. This simulation study, using sample sizes of 15 and 50 and a mean of 5, showed that the likelihood ratio statistic gives the closest estimate of the true significance level but has much lower power than the other two. The significance levels of the score statistic and the statistic of Rao and Chakravarti are closer to the true level for a sample size of 50 than for a sample size of 15. El-Shaarawi concludes that the score statistic and the one of Rao and Chakravarti are preferable to the likelihood ratio statistic because of the higher power. The score statistic has the advantage of being easier to compute. Besides this, it can be used in the case where there are covariates.
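The conditioning argument behind the statistic of Rao and Chakravarti can be checked numerically. Given the sum of $n$ independent Poisson counts with a common mean, each of the counted events falls into any of the $n$ observations with probability $1/n$, which yields the mean and variance of the number of zeros used below. The sketch is an illustration under that standard result; the squared, standardised form in `conditional_statistic` is an assumption about how the statistic is scaled, and all names are hypothetical.

```python
import itertools

def conditional_zero_moments(n, total):
    """Mean and variance of the number of zeros among n counts, given their sum."""
    q1 = ((n - 1) / n) ** total  # P(one given observation is zero | total)
    q2 = ((n - 2) / n) ** total  # P(two given observations are both zero | total)
    mean = n * q1
    variance = n * q1 * (1.0 - n * q1) + n * (n - 1) * q2
    return mean, variance

def brute_force_zero_moments(n, total):
    """Same moments by enumerating all n**total allocations (tiny cases only)."""
    counts = [n - len(set(cells))
              for cells in itertools.product(range(n), repeat=total)]
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return mean, variance

def conditional_statistic(y):
    """Squared standardised distance between observed and expected zeros."""
    n, total = len(y), sum(y)
    n0 = sum(1 for v in y if v == 0)
    mean, variance = conditional_zero_moments(n, total)
    return (n0 - mean) ** 2 / variance
```

The brute-force helper is only there to confirm the closed-form moments on very small cases; it is not needed to compute the statistic.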
4. An example
Of 98 human immunodeficiency virus (HIV)-infected men, attending the department of internal medicine at the Utrecht University Hospital, the number of times they had a urinary tract infection (number of episodes) was recorded (Hoepelman et al., 1992). Besides this, the immune status of every patient was determined by measuring the CD4+ cell count. Table 2 shows that a lot of patients did not have a urinary tract infection.

Table 2
Frequencies of the number of episodes

Number of episodes     0     1     2     3
Frequencies           81     9     7     1


To assess whether there are too many zeros for the data to have arisen from a Poisson distribution, the score statistic (4) can be calculated with

$$
\hat p_{0i} = \exp\left( -e^{\hat\beta_0 + \hat\beta_1 \text{CD4+}_i} \right).
$$

The outcome of the score statistic is 5.96 (whereas the score statistic without using the covariate CD4+ has an outcome of 15.35, illustrating the importance of the use of covariates), giving evidence that too many zeros are observed for the Poisson to fit the data well.
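The value without the covariate can be retraced from the frequencies in Table 2 with the no-covariate form (5). The following is only a worked check with rounded intermediate values:

$$
n = 98, \quad n_0 = 81, \quad n\bar y = 9 \cdot 1 + 7 \cdot 2 + 1 \cdot 3 = 26, \quad \bar y \approx 0.2653, \quad \hat p_0 = e^{-\bar y} \approx 0.7670
$$

$$
S = \frac{(81 - 98 \times 0.7670)^2}{98 \times 0.7670 \times 0.2330 - 26 \times 0.7670^2} \approx \frac{34.07}{2.22} \approx 15.3
$$

which agrees with the reported 15.35 up to rounding.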
An alternative distribution for the Poisson then is the inflated Poisson. As pointed out, the model then used is that the population can be thought of as consisting of two parts: a proportion $\omega$ consisting of patients not being at risk of developing a urinary tract infection and another part consisting of patients who are at risk of developing a urinary tract infection.
The covariate can be used to model the mean number of episodes of the patients being at risk:

$$
\ln(\lambda) = \beta_0 + \beta_1 \text{CD4+}.
$$

It can also be used to model the "probability" of not being at risk:

$$
\ln\left( \frac{\omega}{1 - \omega} \right) = \alpha_0 + \alpha_1 \text{CD4+}.
$$


If one fits an inflated Poisson with both, the log-likelihood is -53.21 on 94 degrees of freedom. Lambert (1992) discusses the fitting procedure for which iterative methods are needed. The likelihood ratio statistic for testing $\beta_1 = 0$ has an outcome of .101, indicating that $\ln(\lambda)$ can be modeled as a constant: $\ln(\lambda) = \beta_0$. Using this and the same model for the "probability" of not being at risk as above, the log-likelihood is -53.26 on 95 degrees of freedom. The results of this fit are in Table 3.
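The likelihood ratio comparison of the two fits can be retraced from the reported log-likelihoods. The sketch below assumes one degree of freedom for the difference and uses the rounded values from the text, so it gives 0.10 rather than the .101 obtained from the unrounded fits; the function name is illustrative.

```python
import math

def likelihood_ratio_test(loglik_full, loglik_reduced):
    """Return the LR statistic and its chi-squared(1) upper-tail probability."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    # For one degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
    return statistic, p_value

# Full model: ln(lambda) and the logit of omega both depend on CD4+.
# Reduced model: ln(lambda) constant.
statistic, p_value = likelihood_ratio_test(-53.21, -53.26)
```

A tail probability far above conventional significance levels is what supports dropping CD4+ from the model for $\ln(\lambda)$.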
Table 3
Estimation results

                                              Asymptotic correlations between estimates
Parameter     Estimates   Standard errors     $\alpha_0$    $\alpha_1$    $\beta_0$
$\alpha_0$      -.487          .699               1
$\alpha_1$       .007          .003              -.66           1
$\beta_0$       -.094          .317               .64          -.23           1


To see if the model can be simplified any further the likelihood ratio statistic for testing $\alpha_1 = 0$ was calculated. The outcome is 12.19, so there is a relation between the "probability" of not being at risk and the CD4+ cell count. Roughly: the odds of not being at risk for a patient who has a CD4+ cell count of 100 higher than another patient, are about twice as high.
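The factor of about two follows from the estimate of $\alpha_1$ in Table 3. A one-line check, assuming the rounded value $\hat\alpha_1 = .007$:

$$
\frac{\text{odds}(\text{CD4+} + 100)}{\text{odds}(\text{CD4+})} = \exp(100 \, \hat\alpha_1) = \exp(0.7) \approx 2.01
$$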
It might be possible that there is a relation between the follow-up time and the number of observed episodes. The model was refitted with ln(follow-up time) as an additional covariate. This gave no improvement (a likelihood ratio statistic of .56).
As El-Shaarawi (1985) points out, rejecting the hypothesis that the Poisson distribution fits the number of zeros well does not imply that the inflated Poisson is the appropriate model. There might be other distributions that fit the data well. For instance the negative binomial can be a good candidate. Fitting the negative binomial with mean $\mu$, variance $\mu(1 + \sigma\mu)$ and a log-link, in this example, gives a log-likelihood of -55.67 on 95 degrees of freedom. Table 4 shows some summary statistics of the Pearson residuals, defined as $[Y_i - E(Y_i)] / \sqrt{\mathrm{var}(Y_i)}$ with $Y_i$ the number of episodes of patient $i$, for the negative binomial and the inflated Poisson. Inspection of these residuals shows that they are, in absolute sense, more often smaller for the inflated Poisson. (See Table 4).
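For reference, the Pearson residuals of the two models can be sketched as below. The moments of the inflated Poisson, $E(Y_i) = (1 - \omega_i)\lambda_i$ and $\mathrm{var}(Y_i) = (1 - \omega_i)\lambda_i(1 + \omega_i\lambda_i)$, follow from (1); the function names and the use of fitted values as plain arguments are assumptions made for this illustration.

```python
import math

def pearson_residual_zip(y, lam, omega):
    """[y - E(Y)] / sqrt(var(Y)) for the inflated Poisson."""
    mean = (1.0 - omega) * lam
    variance = (1.0 - omega) * lam * (1.0 + omega * lam)
    return (y - mean) / math.sqrt(variance)

def pearson_residual_negbin(y, mu, sigma):
    """[y - E(Y)] / sqrt(var(Y)) for a negative binomial with variance mu(1 + sigma mu)."""
    return (y - mu) / math.sqrt(mu * (1.0 + sigma * mu))
```

Every patient without an episode has a negative residual under either model, so with 81 zeros among 98 patients both quartiles of the residuals are necessarily negative.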
Table 4
Summary statistics of the Pearson residuals

                     First quartile    Mean    Third quartile    Range
Negative binomial        -.513        -.010        -.096          6.42
Inflated Poisson         -.544         .003        -.042          5.39

ACKNOWLEDGEMENTS

I would like to thank Jim Lindsey, Byron J. T. Morgan, and an Associate Editor for their helpful comments and suggestions and Andy Hoepelman for permission to use the data.
RESUME
When analysing count data, a large number of zeros is sometimes observed. In such a situation a modified Poisson distribution can be used. We present a score test for deciding whether the number of zeros observed is too large for an unmodified Poisson distribution to be compatible with the data.
REFERENCES

Cochran, W. G. (1954). Some methods for strengthening the common $\chi^2$ tests. Biometrics 10, 417-451.
Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics. London: Chapman and Hall.
El-Shaarawi, A. H. (1985). Some goodness-of-fit methods for the Poisson plus added zeros distribution. Applied and Environmental Microbiology 49, 1304-1306.
Farewell, V. T. and Sprott, D. A. (1988). The use of a mixture model in the analysis of count data. Biometrics 44, 1191-1194.
Hoepelman, A. I. M., Van Buren, M., Van den Broek, J., and Borleffs, J. C. C. (1992). Bacteriuria in men infected with HIV-1 is related to their immune status (CD4+ cell count). AIDS 6, 179-184.
Johnson, N. L., Kotz, S., and Kemp, A. W. (1992). Univariate Discrete Distributions, second edition. New York: John Wiley & Sons, Inc.
Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34, 1-14.
Rao, C. R. and Chakravarti, I. M. (1956). Some small sample tests of significance for a Poisson distribution. Biometrics 12, 264-282.

Received December 1993; revised December 1994; accepted February 1995.

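APPENDIX

The model under the null hypothesis has link function $\ln(\lambda) = X\beta$ with $\beta$ a $p \times 1$ vector. The model matrix includes a constant. Then differentiating the log likelihood (2) with respect to $\beta$ and $\theta$ gives:

$$
\frac{\partial l}{\partial \beta_r} = \sum_i \left\{ -1_{(y_i = 0)} \frac{e^{-\lambda_i}}{\theta + e^{-\lambda_i}} \lambda_i x_{ir} + 1_{(y_i > 0)} (y_i - \lambda_i) x_{ir} \right\}, \qquad r = 1, \ldots, p \tag{6}
$$

$$
\frac{\partial l}{\partial \theta} = \sum_i \left\{ \frac{1_{(y_i = 0)}}{\theta + e^{-\lambda_i}} - \frac{1}{1 + \theta} \right\}. \tag{7}
$$

Under the null hypothesis, $\theta = 0$, with $\hat\beta$ and $\hat\lambda_i$ maximum likelihood estimates under the null hypothesis, (6) becomes:

$$
\sum_i \left\{ -1_{(y_i = 0)} \hat\lambda_i x_{ir} + 1_{(y_i > 0)} (y_i - \hat\lambda_i) x_{ir} \right\} = \sum_i (y_i - \hat\lambda_i) x_{ir} = 0, \qquad r = 1, \ldots, p \tag{8}
$$

and (7) then is $\sum_i \{ (1_{(y_i = 0)} - e^{-\hat\lambda_i}) / e^{-\hat\lambda_i} \}$, so

$$
U^T(\hat\beta, 0) = \left( 0, \ldots, 0, \; \sum_i \frac{1_{(y_i = 0)} - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}} \right). \tag{9}
$$

The second derivatives are given by: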

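$$
\frac{\partial^2 l}{\partial \beta_r \partial \beta_s} = \sum_i \left\{ -1_{(y_i = 0)} \frac{e^{-\lambda_i} [(1 - \lambda_i)\theta + e^{-\lambda_i}]}{(\theta + e^{-\lambda_i})^2} \lambda_i x_{ir} x_{is} - 1_{(y_i > 0)} \lambda_i x_{ir} x_{is} \right\}, \qquad r, s = 1, \ldots, p
$$

$$
\frac{\partial^2 l}{\partial \beta_r \partial \theta} = \sum_i 1_{(y_i = 0)} \frac{e^{-\lambda_i}}{(\theta + e^{-\lambda_i})^2} \lambda_i x_{ir}, \qquad r = 1, \ldots, p
$$

$$
\frac{\partial^2 l}{\partial \theta^2} = \sum_i \left\{ -\frac{1_{(y_i = 0)}}{(\theta + e^{-\lambda_i})^2} + \frac{1}{(1 + \theta)^2} \right\}.
$$

Using

$$
E[1_{(y_i = 0)}] = P(y_i = 0) = \frac{\theta + e^{-\lambda_i}}{1 + \theta} \quad \text{and} \quad E[1_{(y_i > 0)}] = P(y_i > 0) = \frac{1 - e^{-\lambda_i}}{1 + \theta},
$$

it can be seen that the expected information matrix, $J(\beta, \theta)$, has entries:

$$
J_{r,s} = -E\left( \frac{\partial^2 l}{\partial \beta_r \partial \beta_s} \right) = \sum_i \frac{\theta + e^{-\lambda_i} - \lambda_i \theta e^{-\lambda_i}}{(1 + \theta)[\theta + e^{-\lambda_i}]} \lambda_i x_{ir} x_{is}, \qquad r, s = 1, \ldots, p
$$

$$
J_{r,p+1} = -E\left( \frac{\partial^2 l}{\partial \beta_r \partial \theta} \right) = -\sum_i \frac{e^{-\lambda_i}}{(1 + \theta)[\theta + e^{-\lambda_i}]} \lambda_i x_{ir}, \qquad r = 1, \ldots, p
$$

$$
J_{p+1,p+1} = -E\left( \frac{\partial^2 l}{\partial \theta^2} \right) = \sum_i \frac{1 - e^{-\lambda_i}}{(1 + \theta)^2 [\theta + e^{-\lambda_i}]}
$$

and so $J(\hat\beta, 0)$ has entries

$$
J_{r,s} = \sum_i \hat\lambda_i x_{ir} x_{is}, \quad r, s = 1, \ldots, p; \qquad J_{r,p+1} = -\sum_i \hat\lambda_i x_{ir}, \quad r = 1, \ldots, p; \qquad J_{p+1,p+1} = \sum_i \frac{1 - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}}.
$$

Partition $J(\hat\beta, 0)$ as

$$
\begin{bmatrix} J_{11} & J_{12} \\ J_{21} & J_{22} \end{bmatrix}
\quad \text{where} \quad
J_{11} = X^T \mathrm{diag}(\hat\lambda) X; \quad J_{12} = -X^T \hat\lambda; \quad J_{21} = -\hat\lambda^T X; \quad J_{22} = \sum_i \frac{1 - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}}.
$$

Now denote the inverse of $J(\hat\beta, 0)$ as $C$, which can be partitioned as

$$
\begin{bmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{bmatrix}.
$$

Due to the structure of $U(\hat\beta, 0)$ only $C_{22}$ is needed.

$$
C_{22} = \left( J_{22} - J_{21} J_{11}^{-1} J_{12} \right)^{-1} = \left( \sum_i \frac{1 - e^{-\hat\lambda_i}}{e^{-\hat\lambda_i}} - \hat\lambda^T X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \hat\lambda \right)^{-1}.
$$

Using this with (9), equation (3) follows.

Since the model contains a constant, $\hat\lambda$ can be written as $\hat\lambda = \mathrm{diag}(\hat\lambda) X e_p$, where $e_p$ is a $(p \times 1)$ vector having a 1 as first element and having other elements equal zero. So

$$
\hat\lambda^T X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \hat\lambda = e_p^T X^T \mathrm{diag}(\hat\lambda) X [X^T \mathrm{diag}(\hat\lambda) X]^{-1} X^T \mathrm{diag}(\hat\lambda) X e_p = 1^T \hat\lambda.
$$

From (8) this can be seen to equal $n\bar y$.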
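The last identity can be checked numerically for a model with a constant and one covariate, where the $2 \times 2$ inverse is available in closed form. The sketch below is an illustration with hypothetical names; note that the identity itself holds for any positive $\lambda$, while the step to $n\bar y$ additionally needs the estimating equations (8).

```python
def quadratic_form(x, lam):
    """lam^T X [X^T diag(lam) X]^{-1} X^T lam for X with columns (1, x)."""
    a = sum(lam)                                  # sum of lam_i
    b = sum(l * v for l, v in zip(lam, x))        # sum of lam_i x_i
    c = sum(l * v * v for l, v in zip(lam, x))    # sum of lam_i x_i^2
    det = a * c - b * b                           # determinant of X^T diag(lam) X
    # X^T lam = (a, b); the inverse of [[a, b], [b, c]] is [[c, -b], [-b, a]] / det.
    w0 = (c * a - b * b) / det
    w1 = (-b * a + a * b) / det
    return a * w0 + b * w1

x = [0.1, 0.5, 0.9, 1.4]
lam = [0.3, 1.2, 0.7, 2.0]
difference = quadratic_form(x, lam) - sum(lam)
```

Here `difference` is zero up to rounding error, because $X^T \hat\lambda$ is the first column of $X^T \mathrm{diag}(\hat\lambda) X$ whenever the model matrix contains a constant.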
