Article in Conservation Biology · March 2005

DOI: 10.1111/j.1523-1739.2005.00453.x

Statistical Estimation and Model Selection of
Species-Accumulation Functions
ELOÍSA DÍAZ-FRANCÉS∗ AND JORGE SOBERÓN†‡
∗Department of Probability and Statistics, Center of Research in Mathematics (CIMAT), A.P. 402, Guanajuato, Gto. 36000, Mexico
†Department of Ecology of Biodiversity, Institute of Ecology, National University of Mexico, and National Commission on
Biodiversity (CONABIO) Insurgentes-Periférico 4903, Tlalpan, 14010, Mexico, D.F.
Abstract: Soberón and Llorente (1993) proposed pure-birth stochastic processes as theoretical models for
species-accumulation curves, and these processes have frequently been used to describe the progress of biological
inventories. We describe, in algorithmic form, an alternative statistical analysis based on a likelihood approach
(Díaz-Francés & Gorostiza 2002) that provides mathematical rigor to the ideas in Soberón and Llorente
(1993) and improves the estimation of the models by incorporating the facts that the variance of the error
is not constant and that the observations are correlated. Additionally, we used the likelihood ratios between
candidate models as an objective procedure for model selection, allowing comparison between the goodness
of fit of various models. The software for these statistical methods can now be downloaded off the Internet. We
used two examples of butterfly data sets to illustrate the use of the methods and the software.
Key Words: model comparison, pure-birth process

Estimación Estadística y Selección de Modelos de Funciones de Acumulación de Especies
Resumen: Soberón y Llorente (1993) propusieron procesos estocásticos de nacimientos puros como
modelos teóricos para curvas de acumulación de especies, y estos procesos han sido usados frecuentemente
para describir el progreso de inventarios biológicos. Describimos, en forma algorítmica, un análisis
estadístico alternativo basado en un método probabilístico (Díaz-Francés & Gorostiza 2002) que aporta
rigor matemático a las ideas en Soberón y Llorente (1993) y mejora la estimación de los modelos mediante
la incorporación del hecho que la varianza del error no es constante y que las observaciones están
correlacionadas. Adicionalmente, utilizamos las proporciones probabilísticas entre modelos candidatos
como un procedimiento objetivo para la selección de modelos, que permite la comparación de la calidad
del ajuste de varios modelos. El software para estos métodos estadísticos puede ser descargado de la
Internet. Utilizamos dos ejemplos con conjuntos de datos de mariposas para ilustrar el uso de los
métodos y el software.
Palabras Clave: comparación de modelos, proceso de nacimientos puros
‡Address correspondence to J. Soberón, email jsoberon@xolo.conabio.gob.mx
Paper submitted October 3, 2003; revised manuscript accepted July 1, 2004.
Conservation Biology, Pages 569–573, Volume 19, No. 2, April 2005

Introduction

The problem of mathematically describing the process of discovery of new species in an inventory
has become pressing, given the rate of disappearance of habitats worldwide. Assessment of the degree
of completeness of the exploration process and extrapolations to estimate the number of remaining
uncollected species can be obtained from such mathematical descriptions. There are many ways of
attacking this problem (reviewed in Colwell & Coddington 1994). A popular method is based on the
idea of fitting equations such as the Michaelis-Menten or the von Bertalanffy models to
effort-species data sets (accumulated units of collecting effort versus accumulated
number of observed species; see Clench 1979; Lamas et al. 1991).

Soberón and Llorente (1993, henceforth Soberón and Llorente) provide a theoretical justification
for using such parametric models by applying the ideas of the pure-birth stochastic process theory.
They assume that the accumulated number of different species are the states of a pure-birth process,
where Y(t) represents the state (accumulated number of species) after t units of effort, and that
the corresponding species-accumulation function can be modeled as the mean function S(t; θ) of this
process, where θ is the vector of parameters and t is usually given in time or effort units. They
allow biological interpretation of three specific families of birth rates (also called collecting
functions), assuming that the probability of finding a new species depends on the size of the list
and the time already spent in the field. From these birth rates, in each case, they derived the mean
functions of the process Y(t) (Table 1) for θ = (a, b). Soberón and Llorente fitted a selected
S(t; θ) to the observed Y(t) by using nonlinear regression procedures that assume uncorrelated
errors with constant variance. The authors acknowledge that this procedure is far from ideal and ask
for statistical improvements. They also note the difficulty of the procedure and the strong need for
development of objective procedures for model selection.

Table 1. Species-accumulation functions considered in Soberón and Llorente (1993).∗

Model        S(t; a, b)                                      Asymptote
Exponential  (a/b)(1 − e^(−bt))                              a/b
Clench       at/(1 + bt)                                     a/b
Logarithmic  (1 − e^(−b))^(−1) log(1 + (1 − e^(−b)) at)      none

∗ The effort is represented by t, and a and b are the parameters to be fitted.

Díaz-Francés and Gorostiza (2002) (henceforth Díaz-Francés and Gorostiza) comment on some
mathematical inconsistencies they found in Soberón and Llorente. They propose a statistical analysis
that incorporates corrections to these errors, in addition to presenting an objective procedure for
model comparison and selection for a given data set. Díaz-Francés and Gorostiza show that the
functions S(t; θ) obtained by Soberón and Llorente are generally not means of pure-birth processes
but that they can be well approximated by the mean function of a suitable nonhomogeneous pure-birth
process B(t), which is close to the observed process Y(t). Díaz-Francés and Gorostiza suggest the
use of the species-accumulation function S(t; θ) proposed by Soberón and Llorente and an upper bound
U(t; θ) for the (unknown) variance of B(t) to make statistical inferences about θ. They propose the
use of likelihood-based statistical methods to create a nonlinear regression model with mean S(t; θ)
and normal errors, possibly correlated and such that the correlation between errors decreases with
time. Statistical inference about the model parameters is greatly improved by using this upper bound
for the variance and by considering an additional parameter ρ that accounts for possible correlation
between the observations. A likelihood function is then obtained for each model. As a result,
relevant models under consideration can be compared for a given data set through their corresponding
likelihood ratio and a useful, objective, and quantitative tool for model comparison and selection
is provided.

Our purpose here is to give, for the first time in the biological literature, an outline of the
alternative statistical method proposed by Díaz-Francés and Gorostiza and to introduce its
implementation, now available in freeware. The work of Soberón and Llorente has been referenced
frequently in faunistic and floristic studies, and at least two of the three families of
species-accumulation functions presented there are used widely. Therefore, it is important to
provide better estimation methods for these models. These estimation methods can now be fitted to
data sets through user-friendly software available free on the Internet at
http://cimat.mx/info.php?m=1&ind=5. This software uses the likelihood ratios between candidate
models among the pure-birth models proposed by Soberón and Llorente and Díaz-Francés and Gorostiza
to select the best model for a given data set. Further comparisons with other, different parametric
stochastic models can be made through corresponding likelihood ratios. A by-product of this freeware
is a likelihood plot of the total number of different species (TNS) for the models that reach an
asymptote, which gives information on how plausible different values of TNS are for the given data.
We used this software to fit models from Table 1 to two butterfly data sets.

Methods

A species-accumulation function is a nondecreasing curve that represents the expected accumulated
number of different species encountered within a certain geographical area as a function of a
measure of the effort (usually time or person-hour units) to collect them. The observed
effort-species data set is a collection of pairs $\{t_i, Y(t_i)\}$, $i = 1, \ldots, n$, where the
$t_i$ are the successive effort units and $Y(t_i)$ is the accumulated number of different species
seen up to $t_i$; consequently, the observed data are always nondecreasing. The exact time of
appearance of a new species is not recorded. The available information is only that
$Y(t_i) - Y(t_{i-1})$ new species appeared within the time interval $t_i - t_{i-1}$ and that
frequently the increments $Y(t_i) - Y(t_{i-1})$ are >1.

The first two models proposed by Soberón and Llorente (exponential and Clench, Table 1) describe
families of species-accumulation functions that are bounded (i.e., they are useful to describe
situations in which TNS, which
is directly related to the asymptote of the curve, will be registered eventually). In other words,
the area is small, the taxa are well known, or both, or the collectors accumulate experience, which
increases the plausibility of detecting new species as more time is spent in the field. In contrast,
the logarithmic model is unbounded and is useful to describe situations in which the area is large,
the taxa are poorly known, or both.

To use the freeware that implements the method proposed by Díaz-Francés and Gorostiza, it is
necessary to select a function S(t; θ) from Table 1, where θ = (a, b), or to propose a valid
function S(t; θ) with biological meaning. The data can be input from a spreadsheet or a text file.
To obtain estimates for the parameter θ, the log-likelihood function for θ under the proposed model
by Díaz-Francés and Gorostiza is maximized. The proposed estimates for θ are the values of the
parameters that make the observed data set most probable. These are called maximum likelihood
estimates because they maximize the log-likelihood function. An upper bound for the variance of the
associated pure-birth process is also used in this estimation process. The variance of the process
is not constant, and it usually cannot be calculated easily. Incorporating the proposed upper bound
for the variance, however, greatly improves the inferences about the model parameters. Also, as a
by-product, the parameter ρ that describes the strength of the correlation between adjacent
observations is also estimated. In the approach presented here, any positive correlation between
pairs of observations (if present) is inversely related to the distance in time or effort units
between them.

The estimated TNS has to be relevant to the observed species-accumulation data; therefore, it must
be an integer equal to or larger than the total number of observed species $Y(t_n)$. The estimation
methods must be conditioned on this fact.

If the selected S(t; θ) is bounded (exponential or Clench), the model can be reparameterized in
terms of TNS and a likelihood plot can be obtained that gives information about the plausibility of
different values of TNS for the observed data. Also, likelihood intervals can be obtained by drawing
a horizontal line at a certain height in the plot of the likelihood function; thus, any value within
a given likelihood interval is more likely to be the true TNS than any other value outside it.
Usually likelihood intervals of height 0.01 to 0.15 are used for interpretation purposes. The lower
endpoint of the 0.01 interval is often considered the smallest likely value for TNS, whereas the
upper endpoint of this interval is considered the largest plausible value for TNS in the light of
the observed data. Under certain conditions, likelihood intervals of 0.15 achieve an approximate 95%
confidence level. The maximum likelihood estimate of TNS is included in all the likelihood
intervals.

Different species-accumulation functions $S_G(t; \theta_G)$ and $S_H(t; \theta_H)$ for the same data
set, $\{t_i, Y(t_i)\}$ ($i = 1, \ldots, n$), can be compared by calculating their corresponding
likelihood ratio,

$$L_G(\hat{\theta}_G) / L_H(\hat{\theta}_H) = x,$$

where the likelihood functions are evaluated at their corresponding maximum likelihood estimates
(MLE), $\hat{\theta}_G$ and $\hat{\theta}_H$, for models G and H, respectively. For example, if
x = 3, the observed data set is three times more probable under model G than under model H, so model
G is preferred over H for this specific data set. The value of x indicates how many times more (or
less) plausible model G is for the observed data than model H; that is, the likelihood ratio
provides a continuous plausibility rating scale to assess different models (Fisher 1973). Any
departure from 1 (the value for which both models fit equally well) is evidence in favor of one of
the two models. The issue is then how strong the evidence has to be to prefer a given model. This
may vary depending on the consequences of selecting a model. The recommendation is that if one model
is to have priority over others, the evidence should be strong, as it is in the case for La Calera
butterflies in the examples that follow. In contrast, for those cases where the evidence is not
strong enough to clearly favor a given model, additional data must be collected to acquire more
information for model selection. If at least one of the models considered describes the data set
adequately from statistical and biological points of view, the selection of a model can be based on
this quantitative approach. This process must always be complemented with diagnostic checks to
assess the goodness of fit of the “best” model to the data, in order to verify that the assumptions
of the stochastic model are reasonable and that they hold at least approximately for the data.
Additionally, external biological reasoning should provide support to the model under consideration
in light of the observed data, in the sense that the model should give reasonable answers to the
questions being asked and its features and assumptions should describe the biological
characteristics of the data reasonably well. The important point is that the model should be
adequate for the purpose at hand.

Examples

La Calera and Atoyac Butterfly Examples

La Calera, a tropical semideciduous rain forest in Jalisco, Mexico, is near the biosphere reserve of
Manantlán. As a part of a butterfly inventory of Jalisco, collections were made in La Calera. After
53 days of work, 240 different species had been observed (Vargas et al. 1996). The observed
species-accumulation data and the estimated species-accumulation curves S(t; θ) under the three
different models given in Table 1 were fitted, using the freeware program, to the La Calera data
(Fig. 1 & Table 2). The best-fitting model was the exponential model, for which
the observed data set was 767 times more probable than under the next best-fitting model (Clench
model). Also, the exponential model was more than 50 million times as probable as the logarithmic
model for this data set. This is a case where there is overwhelming evidence that favors the
exponential model over the other two models. The software also allows displaying a likelihood plot
of the total number of species under the exponential model. In this case, although the estimate of
the TNS was 247, any value within the 1% likelihood interval (240, 257) is a plausible value given
the data set. Values outside this interval can be overlooked because they are highly implausible for
the observed sample.

Figure 1. Fit of three models to the La Calera butterflies data set (open circles, accumulated
species number; continuous line, fit to the exponential model; broken line, fit to logarithmic
model; dotted line, fit to Clench model).

Table 2. Estimated parameters for models given in Table 1 for La Calera butterflies.∗

Model        a       b      ρ      TNS  LR        1/LR
Exponential  15.092  0.061  0.653  247  1         1
Clench       14.990  0.042  0.601  357  0.0013    767
Logarithmic  10.986  0.005  0.688  -    1.97e-08  50,709,864

∗ Abbreviations: TNS, total number of species; LR, likelihood ratio. The a and b are fitted
parameters, and ρ is the correlation of errors.

Vargas et al. (1994) reported results of 3 years of sampling butterflies over a large transect at
Atoyac, from 300 to 2500 m above sea level (asl). This range covers semideciduous rain forest to
pine forest. After 152 time-effort units (person-days), 342 different species were observed. The
three models were fitted to this data set (Fig. 2 & Table 3). The best-fitting model was the
exponential, which was only 2.02 times more probable than the second-best-fitting model, Clench, and
2.34 times more probable than the logarithmic model. Although the data set slightly favors the
exponential model, this is a case in which the evidence in favor is clearly not strong. Nevertheless
the three models predicted contrasting long-term behaviors. The TNS under the exponential model was
361 species in contrast to the TNS predicted under the Clench model of 509, and the logarithmic was
unbounded. To be able to discriminate sharply between the models, additional data would need to be
collected.

Figure 2. Fit of three models to the Atoyac butterflies data set (open circles, accumulated species
number; continuous line, fit to the exponential model; broken line, fit to logarithmic model; dotted
line, fit to Clench model).

Discussion

Besides the models presented in Table 1, other equations for S(t; θ) could be used for a
species-accumulation function. The only requirements for the pure-birth model of Díaz-Francés and
Gorostiza are that they must be positive, nondecreasing, and differentiable such that S(0) = 0 and
that the derivative S'(t; θ) tends monotonically to 0 as t tends to infinity. The variance of the
corresponding approximating pure-birth process, as described in Díaz-Francés and Gorostiza, is
generally not constant, and its behavior should be taken into account to improve the estimation
procedure.

Ideally, effort units should be based on accumulated number of individuals because the carriers of
taxonomic information are individuals (Gotelli & Colwell 2001). The pure-birth models we propose
assume that the effort is represented as a continuous quantity, with the unit selected such that the
probability of observing one new
species within a small effort interval is much larger than that of not seeing a new species or that
of observing more than one new species in that interval. Also, the conditions throughout the
observation period should be constant or at least should not change significantly, and the same
capturing procedures should be used consistently.

Table 3. Estimated parameters for models given in Table 1 for Atoyac butterflies.∗

Model        a      b      ρ      TNS  LR      1/LR
Exponential  6.159  0.017  0.713  361  1       1
Clench       6.543  0.013  0.637  509  0.4939  2.02
Logarithmic  7.273  0.006  0.573  -    0.4268  2.34

∗ Abbreviations: TNS, total number of species; LR, likelihood ratio. The a and b are fitted
parameters, and ρ is the correlation of errors.

Considering observed individuals as effort units changes the setting to discrete time models. It is
possible to consider observed individuals as effort units in a continuous time model similar to a
pure-birth process by assuming that the individuals appear (or are collected) approximately
uniformly in time. This is consistent with describing the appearance of individuals (independently
of their species) as a Poisson process in continuous time, where the parameter of the process is the
average number of individuals observed in a unit of time, and replacing the pure-birth process with
a process that increases by one whenever a new species is observed in the underlying Poisson
process.

This setting may not be reasonable when individuals appear in a nonhomogeneous fashion (e.g., when
more are seen at different seasons). If special care is taken to ensure the “uniformity” of
appearance of the individuals in time, the models in Díaz-Francés and Gorostiza could represent such
data sets reasonably well. Nevertheless, a modified model based on observed individuals as effort
units should be analyzed in more detail in future work.

The likelihood-ratio procedure per se is not enough to find a good model for the data. At least one
of the models entertained must adequately fit the data set. There is no need, however, to test all
reasonable models against known standards such as complete or nearly complete inventories for a wide
variety of taxa and localities (Colwell & Coddington 1994). The only requirement is to have a
reasonable probabilistic model in terms of a well-defined parameter, so that model can be compared
with other models with the same characteristics. The selected best model, among those considered,
should always be subject to standard diagnostic tests, as in any nonlinear regression procedure, to
assess the fit of this model, in its own right, to the observed data set (see Bates & Watts 1988:
section 3.7; Seber & Wild 1989: sections 4.6 & 5.5). These diagnostic procedures check that data do
not flagrantly contradict any of the model assumptions.

Our results describe an appropriate application of pure-birth-process theory to improve the
estimation of species-accumulation functions. The estimated values of parameters with the method of
Díaz-Francés and Gorostiza are numerically different from those presented in Soberón and Llorente
because they are based on methods with different assumptions. The statistical analysis proposed by
Díaz-Francés and Gorostiza gives mathematical rigor to the ideas presented in Soberón and Llorente,
and the freeware provides a useful and practical method for estimating, comparing, and selecting
different species-accumulation functions for a given data set.

Acknowledgments

This work was partially supported by the Consejo Nacional de Ciencia y Tecnología (CONACYT) of
Mexico grants 32156-E and 37130-E. We thank L. Gorostiza and D.A. Sprott for reading a preliminary
version of the paper and for useful comments. We also thank J. Golubov for several comments that
helped clarify the text and J. Llorente, who kindly provided us with many butterfly data sets. We
thank J. Ramón Domínguez and the Centro de Investigación en Matemáticas (CIMAT) software development
team for producing the species-accumulation freeware. Finally, we thank R.K. Colwell and the
referees for constructive comments that helped to improve this work.

Literature Cited

Bates, D. M., and D. G. Watts. 1988. Nonlinear regression analysis and its applications. John Wiley & Sons, New York.
Clench, H. 1979. How to make regional lists of butterflies. Some thoughts. Journal of the Lepidopterists' Society 33:216–231.
Colwell, R. K., and J. A. Coddington. 1994. Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society of London, B. 345:101–118.
Díaz-Francés, E., and L. G. Gorostiza. 2002. Inference and model comparison for species accumulation functions using approximating pure-birth processes. Journal of Agricultural, Biological, and Environmental Statistics 7:29–43.
Fisher, R. A. 1973. Statistical methods and scientific inference. Hafner Publishing, New York.
Gotelli, N. J., and R. K. Colwell. 2001. Quantifying biodiversity: procedures and pitfalls in the measurement and comparison of species richness. Ecology Letters 4:379–391.
Lamas, G., R. K. Robbins, and D. J. Harvey. 1991. A preliminary survey of the butterfly fauna of Pakitza, Parque Nacional del Manú, Perú, with an estimate of its species richness. Publicaciones del Museo de Historia Natural, Universidad de San Marcos, Perú 40:1–19.
Seber, G. A. F., and C. J. Wild. 1989. Nonlinear regression. John Wiley & Sons, New York.
Soberón, J., and J. Llorente. 1993. The use of species accumulation functions for the prediction of species richness. Conservation Biology 7:480–488.
Vargas, I., J. Llorente, and A. Luis. 1994. Listado lepidopterofaunístico de la Sierra de Atoyac de Alvarez en el estado de Guerrero: notas acerca de su distribución local y estacional (Rhopalocera: Papilionoidea). Folia Entomologica Mexicana 86:41–178 (in Spanish).
Vargas, I., A. Luis, and J. Llorente. 1996. Butterflies of the state of Jalisco, Mexico. Journal of the Lepidopterists' Society 50:97–138.
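
As an illustration of the three species-accumulation functions listed in Table 1, the following Python sketch evaluates each S(t; a, b) and the asymptote a/b of the two bounded models. It is not the CIMAT freeware and performs no likelihood estimation; the function names and the `LA_CALERA` dictionary are illustrative assumptions, and the parameter values are the rounded estimates reported in Table 2.

```python
import math


def exponential(t, a, b):
    """Exponential model from Table 1: S(t) = (a/b) * (1 - exp(-b*t))."""
    return (a / b) * (1.0 - math.exp(-b * t))


def clench(t, a, b):
    """Clench model from Table 1: S(t) = a*t / (1 + b*t)."""
    return a * t / (1.0 + b * t)


def logarithmic(t, a, b):
    """Logarithmic model from Table 1 (unbounded, so it has no asymptote)."""
    z = 1.0 - math.exp(-b)
    return math.log(1.0 + z * a * t) / z


def asymptote(a, b):
    """Asymptote a/b shared by the two bounded models (exponential, Clench)."""
    return a / b


# Rounded estimates (a, b) copied from Table 2 (La Calera butterflies).
# The dictionary name and layout are assumptions made for this sketch.
LA_CALERA = {
    "exponential": (15.092, 0.061),
    "clench": (14.990, 0.042),
    "logarithmic": (10.986, 0.005),
}

# Rounding a/b reproduces the TNS column of Table 2 for the bounded models.
tns_exponential = round(asymptote(*LA_CALERA["exponential"]))
tns_clench = round(asymptote(*LA_CALERA["clench"]))

if __name__ == "__main__":
    models = {"exponential": exponential, "clench": clench, "logarithmic": logarithmic}
    for name, func in models.items():
        a, b = LA_CALERA[name]
        # Expected accumulated species after the 53 days of effort at La Calera.
        print(f"{name:12s} S(53) = {func(53, a, b):.1f}")
    print("TNS (exponential, Clench):", tns_exponential, tns_clench)
```

The text describes the TNS estimate as coming from a likelihood reparameterized in terms of TNS and restricted to integers no smaller than the observed count, so agreement between a rounded a/b and the tabulated TNS should not be assumed for other data sets; for Atoyac (Table 3) the rounded parameters give values close to, but not equal to, the tabulated TNS.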
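
The likelihood-ratio comparison and the likelihood intervals described in Methods can also be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the LR values are those reported in Table 3, and the TNS profile is made up purely to show how an interval is read off a relative-likelihood plot. The last line computes -2 ln(0.15), which is about 3.79 and therefore close to the 3.84 cutoff of a chi-square distribution with one degree of freedom; this is consistent with the statement that 0.15 likelihood intervals achieve an approximate 95% confidence level under certain conditions.

```python
import math


def evidence_against(relative_likelihoods):
    """For each model, how many times more probable the data are under the best model."""
    best = max(relative_likelihoods.values())
    return {model: best / lr for model, lr in relative_likelihoods.items()}


def likelihood_interval(candidates, relative_likelihood, level):
    """Smallest and largest candidate whose relative likelihood is at least `level`."""
    inside = [c for c, r in zip(candidates, relative_likelihood) if r >= level]
    return min(inside), max(inside)


# LR column of Table 3 (Atoyac butterflies), relative to the exponential model.
ATOYAC_LR = {"exponential": 1.0, "clench": 0.4939, "logarithmic": 0.4268}
ratios = evidence_against(ATOYAC_LR)

# Made-up relative-likelihood profile for TNS, for illustration only;
# these numbers are NOT taken from the paper or from the freeware.
toy_tns = [240, 245, 250, 255, 260]
toy_profile = [0.20, 1.00, 0.60, 0.05, 0.001]
interval_15 = likelihood_interval(toy_tns, toy_profile, 0.15)
interval_01 = likelihood_interval(toy_tns, toy_profile, 0.01)

# Drop in -2 * log(relative likelihood) at the 0.15 level.
deviance_at_15 = -2.0 * math.log(0.15)

if __name__ == "__main__":
    for model, ratio in ratios.items():
        print(f"{model:12s} 1/LR = {ratio:.2f}")
    print("0.15 interval:", interval_15, " 0.01 interval:", interval_01)
    print(f"-2 ln(0.15) = {deviance_at_15:.3f}")
```

Applied to the Table 3 values, the ratios reproduce the 1/LR column (2.02 and 2.34), which is the weak-evidence case discussed for Atoyac.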