
IEEE TRANSACTIONS ON RELIABILITY, VOL. 56, NO. 2, JUNE 2007

A Comprehensive Empirical Study of Count Models for Software Fault Prediction

Kehan Gao, Member, IEEE, and Taghi M. Khoshgoftaar, Member, IEEE

Abstract—Count models, such as the Poisson regression model, and the negative binomial regression model, can be used to obtain software fault predictions. With the aid of such predictions, the development team can improve the quality of operational software. The zero-inflated, and hurdle count models may be more appropriate when, for a given software system, the number of modules with faults is very small. Related literature lacks quantitative guidance regarding the application of count models for software quality prediction. This study presents a comprehensive empirical investigation of eight count models in the context of software fault prediction. It includes comparative hypothesis testing, model selection, and performance evaluation for the count models with respect to different criteria.

The case study presented is that of a full-scale industrial software system. It is observed that the information obtained from hypothesis testing, and model selection techniques was not consistent with the predictive performances of the count models. Moreover, the comparative analysis based on one criterion did not match that of another criterion. However, with respect to a given criterion, the performance of a count model is consistent for both the fit, and test data sets. This ensures that, if a fitted model is considered good based on a given criterion, then the model will yield a good prediction based on the same criterion. The relative performances of the eight models are evaluated based on a one-way ANOVA model, and Tukey's multiple comparison technique. The comparative study is useful in selecting the best count model for estimating the quality of a given software system.

Index Terms—ANOVA, count models, hypothesis testing, information criteria, Pearson's chi-square, software metrics, software quality, Tukey's multiple comparison.

Manuscript received January 2003; revised September 2004, November 2004, and January 2005; accepted June 2005. This work was supported in part by NSF Grant CCR-9970893. Associate Editor: M. Vouk.
K. Gao is with the Department of Mathematics and Computer Science, Eastern Connecticut State University, Willimantic, CT 06226 USA.
T. M. Khoshgoftaar is with the Empirical Software Engineering Laboratory, Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL 33431 USA (e-mail: taghi@cse.fau.edu).
Digital Object Identifier 10.1109/TR.2007.896761

NOTATION

$i$  module (observation) identifier
$n$  number of modules (observations)
$x_i$  vector of independent variables for module $i$
$\tilde{x}_i$  vector $[1, x_i]$
$y_i$  dependent variable for module $i$
$\hat{y}_i$  predicted value of the dependent variable for module $i$
$f(\cdot)$  probability density function
$\hat{\Pr}(\cdot)$  predicted probability
$E(\cdot)$  expected value
$\mathrm{Var}(\cdot)$  variance
$\mu_i$  mean value of the dependent variable for module $i$
$\alpha$, $\beta$, $\gamma$  parameter and/or parameter vectors to be estimated
$\mathcal{L}$  log-likelihood function
$\varphi_i$  probability of a module being perfect
$\tau$  the selected threshold value
$LR$  likelihood ratio statistic
$\chi^2$  chi-square distribution
$m_i$  ratio of predicted probabilities of the two models
$\bar{m}$  mean of $m_i$
$s_m$  standard deviation of $m_i$
$V$  Vuong's statistic
$J$  the maximum value that the dependent variable takes
$\bar{p}_j$  observed frequency (i.e., the fraction of the sample with $y = j$)
$\hat{p}_j$  predicted frequency
$\alpha$  s-significance level
$z_\alpha$  z-score in the standard normal table
$C$  number of mutually exclusive cells into which the range of $y$ is divided
SS  sum of squares
MS  mean squares
df  degree of freedom
F  F statistic

ACRONYM (The singular and plural of an acronym are always spelled the same.)

AAE  average absolute error
AIC  Akaike information criterion
ANOVA  analysis of variance
ARE  average relative error
BIC  Bayesian information criterion
CAIC  consistent Akaike information criterion
EM  expectation-maximization
IC  information criterion
LR  likelihood ratio
HNB1  hurdle model with threshold 1, and negative binomial distribution
HNB2  hurdle model with threshold 2, and negative binomial distribution
HP1  hurdle model with threshold 1, and Poisson distribution
HP2  hurdle model with threshold 2, and Poisson distribution
MLE  maximum likelihood estimation
MSD  minimum s-significant difference
MSE  mean squared error
NBRM  negative binomial regression model
PDF  probability density function
PRM  Poisson regression model
ZIP  zero-inflated Poisson
ZINB  zero-inflated negative binomial

I. INTRODUCTION

GIVEN the goal of delivering a software product that has minimal corrective maintenance, software quality modeling techniques are useful tools for applying timely software quality improvement efforts. For example, by predicting the number of faults in program modules, a software fault prediction model can direct the software quality assurance team in targeting the most faulty modules first.

The relationship between software complexity metrics, and the occurrence of faults in program modules has been used by software quality prediction models, such as case-based reasoning [1], regression trees [2], [3], fuzzy logic [4], and multiple linear regression [5]. Typically, a software quality model of a given software system is calibrated using software metrics, and fault data collected from a previous system release or similar projects. The model can then be applied to predict the software quality of a release currently under development, or of similar projects.

Software fault prediction based on count model techniques [6] is attractive because a specific count model can be chosen such that it best represents the fault occurrence process of the given software system. In addition, count models have a unique feature in that they can provide the probability that a given number of faults will occur in any given program module. Count models can also be used for software quality classification of program modules, i.e., fault-prone, and not fault-prone [7], [8]. We feel that the application of count models in software engineering has been very limited [6], [9].

Poisson regression is the basis of the various count models which are derived from it. It has a statistical characteristic called equidispersion, which implies the equality of the mean to the variance of the dependent variable (a non-negative discrete integer). In the software fault prediction problem, it is often observed that the distribution of the number of faults (dependent variable) is such that its variance exceeds its mean value. Such a scenario is referred to as overdispersion. One way of accounting for overdispersion is introducing an unobserved heterogeneity term (gamma distributed), yielding the negative binomial regression model (NBRM).

Other variations of count models are the zero-inflated, and hurdle models, which are used when there is an excess of zeros for the dependent variable. The proportion of faulty modules of a high assurance software system (such as telecommunications) is usually very small. Under such scenarios, the zero-inflated, and hurdle count models may be more appropriate. The zero-inflated model assumes that the population of the software modules consists of two groups, i.e., perfect modules, and non-perfect modules. For the perfect modules, no faults occur; while for the non-perfect modules, the number of faults follows some standard distribution, such as Poisson, or negative binomial. If the non-perfect group is assumed to follow a Poisson distribution, then a zero-inflated Poisson (ZIP) model is obtained. If the non-perfect group is assumed to follow a negative binomial distribution, then a zero-inflated negative binomial (ZINB) model is obtained.

The hurdle model also consists of two parts. However, instead of having perfect, and non-perfect groups, it divides the modules into lower, and higher count groups, based on a binary distribution. The dependent variable of these groups is assumed to follow a separate distribution process. In the case of hurdle count models, two factors may affect predictive quality: the dependent variable's threshold² value that is used to form the two groups, and the specific distribution each group is assumed to follow. We investigated these two factors to study their impact on the prediction performances of hurdle models. For the case study presented, we chose two threshold values for the number of faults: 1, and 2.

The distributions chosen for the count groups of the hurdle models are the Poisson, and negative binomial distributions. Hence, we have four kinds of hurdle models: 1) hurdle model with threshold 1, and Poisson distribution (HP1); 2) hurdle model with threshold 2, and Poisson distribution (HP2); 3) hurdle model with threshold 1, and negative binomial distribution (HNB1); and 4) hurdle model with threshold 2, and negative binomial distribution (HNB2). Previous works related to hurdle models were focused on applying HP1, and HNB1 in econometrics [10]–[12]. The application of hurdle regression techniques with different threshold values to software quality estimation modeling is a contribution of this study. To our knowledge, this is the first study that has investigated hurdle models for software quality estimation.

In related literature, we did not find any comprehensive empirical study that provides clear guidance for applying existing count models for software quality estimation. In one of the very few studies, Evanco [9] applied a Poisson Regression Model (PRM) to determine the fault locality, and fault correction effort during unit, system, and acceptance testing of a software project. However, no reasoning was provided regarding the appropriateness of the PRM for the software data set. Moreover, the study did not use any evaluation criteria to assess the quality of the fitted count model.

²In our study, the dependent variable is the number of faults in a software module.

Our preliminary efforts investigated the benefits of applying ZIP models for software fault prediction [6]. This study presents a comprehensive empirical investigation of eight different count models for software fault prediction. Six of the count models are commonly used in non-software engineering fields, such as econometrics [11], [13]. The other two, i.e., HP2, and HNB2, are investigated for the first time in this study. A statistically-based comparative analysis, including several model selection techniques, is presented to provide definitive guidance for applying count models for software fault prediction.

The case study presented is of software metrics, and fault data collected from two Windows-based embedded software systems configured for wireless telecommunications. The eight count models built for the system were compared with each other. A one-way ANOVA model, and Tukey's multiple comparison technique were used in our comparative analysis.

The remainder of this paper presents details of the count models in Section II, model selection tests & techniques in Section III, a case study in Section IV, and conclusions in Section V.

II. COUNT MODELING TECHNIQUES

Count models are generally a variation of the commonly used Poisson regression model. Another commonly used count model is the negative binomial regression model, which is a derivative of the PRM. Other count models have been derived by combining different distributions, for example, finite-mixture models [13]. In a finite-mixture model, a random variable is postulated as a draw from a super-population that is an additive mixture of a group of distinct populations, each of which has its own unique distribution, such as Poisson, or negative binomial. Zero-inflated count models, such as the zero-inflated Poisson, and zero-inflated negative binomial models, are specific cases of finite-mixture models. Hurdle models, including hurdle Poisson, and hurdle negative binomial models, are also special cases of finite-mixture models.

In addition to the finite-mixture models discussed in this paper, we investigated several other mixture models. However, it was observed that they were either unsuitable for the embedded software system being studied, or yielded similar results as compared to some of the count models presented. For example, upon investigating the Poisson-inverse Gaussian model [14], it was observed that its prediction results were very similar to those of the NBRM. Therefore, the modeling results of other mixture models are not presented in the paper.

A. Poisson Regression Model

The Poisson regression model is derived from the Poisson distribution by allowing the expected value of the dependent variable to be a function associated with the independent variables. Let $(y_i, x_i)$ be an observation in a data set, such that $y_i$, and $x_i$ are the dependent variable, and vector of independent variables for the $i$-th observation, respectively. Given $x_i$, assume $y_i$ is Poisson distributed with the probability density function (PDF) of

$$f(y_i \mid x_i) = \frac{e^{-\mu_i}\,\mu_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots \qquad (1)$$

where $\mu_i$ is the mean value of the dependent variable $y_i$. The expected value, and the variance of $y_i$ are given by

$$E(y_i \mid x_i) = \mathrm{Var}(y_i \mid x_i) = \mu_i. \qquad (2)$$

Equation (2) demonstrates an important property of the Poisson regression model, i.e., equidispersion, which implies that the expected value of $y_i$ is equal to its variance.

To ensure that the expected value of $y_i$ is nonnegative, the link function which displays a relationship between the expected value, and the independent variables should have the form [13]

$$\mu_i = E(y_i \mid x_i) = \exp(\tilde{x}_i' \beta) \qquad (3)$$

where $\beta$ denotes an unknown parameter vector, and $\tilde{x}_i'$ represents the transpose of the vector $\tilde{x}_i$, which is equal to $[1, x_i]$. Note that both $\beta$, and $\tilde{x}_i$ are $(p+1) \times 1$ vectors, where $p$ is the number of independent variables used in the data set. (1), and (3) jointly define the Poisson regression model.

The maximum likelihood estimation (MLE) technique is often used in the parameter estimation for regression models [15]. Given a set of $n$ observations, the log-likelihood function of the PRM is given by

$$\mathcal{L}(\beta) = \sum_{i=1}^{n} \left[\, y_i \tilde{x}_i' \beta - \exp(\tilde{x}_i' \beta) - \ln(y_i!) \,\right]. \qquad (4)$$

The commonly used method in MLE is the Newton-Raphson iteration technique. Other methods, such as the EM algorithm, also perform successfully for parameter estimation [16].
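To make the estimation step concrete, the following is a minimal sketch of fitting a PRM by numerically maximizing (4) with SciPy. It is an illustration only, not the authors' implementation; the synthetic data, sizes, and names (`prm_negloglik`, `true_beta`) are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def prm_negloglik(beta, X, y):
    """Negative PRM log-likelihood of (4); X already contains the leading 1 column."""
    eta = X @ beta                    # x~_i' beta
    # ln(y_i!) = gammaln(y_i + 1); constant in beta, kept so L is comparable across models
    return -np.sum(y * eta - np.exp(eta) - gammaln(y + 1))

rng = np.random.default_rng(0)
n, p = 807, 5                         # sizes mirroring the paper's fit data sets
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
true_beta = np.array([-1.0, 0.8, 0.3, 0.0, -0.2, 0.5])
y = rng.poisson(np.exp(X @ true_beta))

# a quasi-Newton optimizer standing in for Newton-Raphson iteration
fit = minimize(prm_negloglik, x0=np.zeros(p + 1), args=(X, y), method="BFGS")
print("estimated beta:", np.round(fit.x, 2), " log-likelihood:", -fit.fun)
```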
B. Zero-Inflated Poisson Model

A data set with an excess of zeros for the dependent variable is a commonly observed phenomenon in software quality modeling. This observation is especially prevalent for high assurance, and mission-critical software systems.

The ZIP model, first introduced by Lambert [16], assumes that all zeros come from two sources: the source representing the perfect modules in which no faults occur, and the source representing the non-perfect modules in which the number of faults in the modules follows the Poisson distribution. In the ZIP model, a parameter $\varphi_i$ is introduced to denote the probability of a module being perfect. Hence, the probability of the module being non-perfect is $1 - \varphi_i$. The PDF of the dependent variable under the ZIP model is therefore,

$$\Pr(y_i = j \mid x_i) = \begin{cases} \varphi_i + (1-\varphi_i)\,e^{-\mu_i}, & j = 0 \\ (1-\varphi_i)\,\dfrac{e^{-\mu_i}\mu_i^{j}}{j!}, & j > 0. \end{cases} \qquad (5)$$
A ZIP model may be obtained by adding the following two link functions:

$$\ln \mu_i = \tilde{x}_i' \beta \qquad (6)$$

$$\ln\frac{\varphi_i}{1-\varphi_i} = \tilde{x}_i' \gamma \qquad (7)$$

where $\gamma$ is an unknown parameter vector of dimension $(p+1)$.

The expected value, and the variance of $y_i$ in the ZIP model are, respectively,

$$E(y_i \mid x_i) = (1-\varphi_i)\,\mu_i \qquad (8)$$

and

$$\mathrm{Var}(y_i \mid x_i) = (1-\varphi_i)\,\mu_i\,(1 + \varphi_i\,\mu_i). \qquad (9)$$

Equations (8), and (9) display that the variance of $y_i$ exceeds its expected value, which is known as overdispersion. An excess of zeros for the dependent variable implies the overdispersion phenomenon.

We use the MLE technique as the standard parameter estimation technique for the ZIP model. Let $d_i$ denote an indicator variable that takes the value of 1 when $y_i = 0$, and the value of 0 otherwise. The log-likelihood function for the ZIP is therefore given by

$$\mathcal{L}(\beta,\gamma) = \sum_{i=1}^{n} \Big\{ d_i \ln\!\big[\varphi_i + (1-\varphi_i)\,e^{-\mu_i}\big] + (1-d_i)\big[\ln(1-\varphi_i) - \mu_i + y_i\ln\mu_i - \ln(y_i!)\big] \Big\}. \qquad (10)$$
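As an illustration of (10) under the links (6), and (7) (a sketch, not the authors' code; the function name `zip_loglik` is hypothetical):

```python
import numpy as np
from scipy.special import gammaln, expit   # expit(z) = 1 / (1 + exp(-z))

def zip_loglik(beta, gamma, X, y):
    """ZIP log-likelihood of (10) under the log link (6) and logit link (7)."""
    mu = np.exp(X @ beta)       # (6): ln mu_i = x~_i' beta
    phi = expit(X @ gamma)      # (7): logit(phi_i) = x~_i' gamma
    zero = (y == 0)             # the indicator d_i
    ll_zero = np.log(phi + (1.0 - phi) * np.exp(-mu))
    ll_pos = np.log(1.0 - phi) - mu + y * np.log(mu) - gammaln(y + 1)
    return np.sum(np.where(zero, ll_zero, ll_pos))
```

Estimates of $\beta$, and $\gamma$ can then be obtained by passing the negative of this function, over a stacked parameter vector, to a numerical optimizer such as `scipy.optimize.minimize`.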

(14)
(15)

where , and are the parameter vectors that are to be estimated.


Similar to the other count models, the MLE technique can be used
(10) to estimate the model parameters , , and . Further details
regarding the maximum likelihood estimation technique, and
the likelihood function for the ZINB model are presented in [13].
D. Zero-Inflated Negative Binomial Model

The zero-inflated negative binomial model is similar to the ZIP model. The primary difference is that, in the case of the ZINB model, the negative binomial distribution is used for the non-perfect modules group, as compared to the Poisson distribution used in the ZIP model.

The probability density function for the ZINB model is given by

$$\Pr(y_i = j \mid x_i) = \begin{cases} \varphi_i + (1-\varphi_i)\,(1+\alpha\mu_i)^{-1/\alpha}, & j = 0 \\ (1-\varphi_i)\,f(j \mid x_i), & j > 0 \end{cases} \qquad (13)$$

where $f(\cdot)$ is the negative binomial PDF of (11). The ZINB model is obtained by adding the following two link functions:

$$\ln \mu_i = \tilde{x}_i' \beta \qquad (14)$$

$$\ln\frac{\varphi_i}{1-\varphi_i} = \tilde{x}_i' \gamma \qquad (15)$$

where $\beta$, and $\gamma$ are the parameter vectors that are to be estimated. Similar to the other count models, the MLE technique can be used to estimate the model parameters $\beta$, $\gamma$, and $\alpha$. Further details regarding the maximum likelihood estimation technique, and the likelihood function for the ZINB model are presented in [13].

E. Hurdle Regression Models

The hurdle models partition the observations (modules in our study) into two parts, i.e., a lower count group, and a higher count group, depending on the selected threshold value, $\tau$. The chosen threshold can be any positive integer, according to the problem domain. The model assumes that each group follows some standard distribution.

In our empirical study, we wanted to investigate whether the threshold selection affected the performance of the respective hurdle model. We selected the threshold values of 1, and 2 faults to obtain the lower, and upper count groups. Therefore, software modules that had a number of faults greater than or equal to the selected threshold are categorized into the upper count group, and into the lower count group otherwise. Moreover, for each threshold value, we chose two distributions (Poisson, and negative binomial) to represent the groups of the hurdle model. When the Poisson distribution is used, the resulting model is a hurdle Poisson (HP) model. Whereas, when the negative binomial distribution is used, the resulting model is a hurdle negative binomial (HNB) model. The hurdle model has a general form of

$$\Pr(y_i = j) = \begin{cases} f_1(j), & j < \tau \\ \dfrac{1 - \sum_{l<\tau} f_1(l)}{1 - \sum_{l<\tau} f_2(l)}\, f_2(j), & j \ge \tau \end{cases} \qquad (16)$$

where $f_1$, and $f_2$ are some density functions that may have the same format, but with different parameters; and $\tau$ represents the selected threshold value. In this paper, we discuss four different hurdle models.

1) Case 1 (HP1): For a threshold value of 1 fault, both $f_1$, and $f_2$ are defined to have Poisson distributions. Therefore, the HP1 density function becomes

$$\Pr(y_i = j) = \begin{cases} e^{-\mu_{1i}}, & j = 0 \\ \dfrac{1 - e^{-\mu_{1i}}}{1 - e^{-\mu_{2i}}}\,\dfrac{e^{-\mu_{2i}}\,\mu_{2i}^{j}}{j!}, & j \ge 1 \end{cases} \qquad (17)$$

where $\mu_{1i} = \exp(\tilde{x}_i'\beta_1)$ is the Poisson mean for the group of lower counts, and $\mu_{2i} = \exp(\tilde{x}_i'\beta_2)$ is the Poisson mean for the group of higher counts. For this case, $f_1$, and $f_2$ respectively represent the zeros, and the positives groups.

2) Case 2 (HP2): For a threshold value of 2 faults, both $f_1$, and $f_2$ are defined to have Poisson distributions. Therefore, the HP2 density function becomes

$$\Pr(y_i = j) = \begin{cases} \dfrac{e^{-\mu_{1i}}\,\mu_{1i}^{j}}{j!}, & j = 0, 1 \\ \theta\,\dfrac{e^{-\mu_{2i}}\,\mu_{2i}^{j}}{j!}, & j \ge 2 \end{cases} \qquad (18)$$

where $\theta = \dfrac{1 - e^{-\mu_{1i}}(1+\mu_{1i})}{1 - e^{-\mu_{2i}}(1+\mu_{2i})}$.

3) Case 3 (HNB1): For a threshold value of 1 fault, both $f_1$, and $f_2$ are defined to have negative binomial distributions. The probability density function of HNB1 is therefore given by

$$\Pr(y_i = j) = \begin{cases} f_1(0), & j = 0 \\ \dfrac{1 - f_1(0)}{1 - f_2(0)}\, f_2(j), & j \ge 1 \end{cases} \qquad (19)$$

where $f_2$ is

$$f_2(j) = \frac{\Gamma(j + \alpha_2^{-1})}{\Gamma(\alpha_2^{-1})\,\Gamma(j+1)} \left(\frac{\alpha_2^{-1}}{\alpha_2^{-1}+\mu_{2i}}\right)^{\alpha_2^{-1}} \left(\frac{\mu_{2i}}{\alpha_2^{-1}+\mu_{2i}}\right)^{j}. \qquad (20)$$

Note that $f_1$ has the same form as $f_2$, except for the replacement of parameters $\mu_{2i}$, and $\alpha_2$ by $\mu_{1i}$, and $\alpha_1$, respectively. More specifically, $f_1$ has a negative binomial distribution with parameters $\mu_{1i}$, and $\alpha_1$.

4) Case 4 (HNB2): Similarly, for a threshold value of 2 faults, both $f_1$, and $f_2$ are defined to have negative binomial distributions. The probability density function of HNB2 is therefore given by

$$\Pr(y_i = j) = \begin{cases} f_1(j), & j = 0, 1 \\ \theta\, f_2(j), & j \ge 2 \end{cases} \qquad (21)$$

where $f_2$ is the same as in (20), and

$$\theta = \frac{1 - f_1(0) - f_1(1)}{1 - f_2(0) - f_2(1)}. \qquad (22)$$

$f_1$, and $f_2$ have the same forms as in (20), except that for $f_1$ the parameters $\mu_{2i}$, and $\alpha_2$ are replaced by $\mu_{1i}$, and $\alpha_1$, respectively.

The log-likelihood function for each hurdle model can be written as a sum of two individual functions [13]: $\mathcal{L}_1$, and $\mathcal{L}_2$.

$$\mathcal{L}_1 = \sum_{i:\,y_i<\tau} \ln f_1(y_i) + \sum_{i:\,y_i\ge\tau} \ln\Big(1 - \sum_{l<\tau} f_1(l)\Big) \qquad (23)$$

$$\mathcal{L}_2 = \sum_{i:\,y_i\ge\tau} \Big[\ln f_2(y_i) - \ln\Big(1 - \sum_{l<\tau} f_2(l)\Big)\Big] \qquad (24)$$

where $\mathcal{L}_1$ is the function of the parameters for the lower counts ($\beta_1$ for HP, ($\beta_1$, $\alpha_1$) for HNB); while $\mathcal{L}_2$ is the function of the parameters for the higher counts ($\beta_2$ for HP, ($\beta_2$, $\alpha_2$) for HNB). Because $\mathcal{L}_1$, and $\mathcal{L}_2$ are functionally independent of each other, the joint log-likelihood $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$ can be maximized by separately maximizing each individual function.
represent the zeros, and the positives groups.
2) Case 2 (HP2): For a threshold value of 2 faults, both , III. MODEL SELECTION TECHNIQUES
and are defined to have Poisson distributions. Therefore,
the HP2 density function becomes A. Likelihood Ratio Hypothesis Testing
The NBRM reduces to the PRM when the dispersion parameter
, . Hence, the null hypothesis, i.e., , against the
, alternative hypothesis, i.e., , can be tested by the LR
. test
(18) (25)
3) Case 3 (HNB1): For a threshold value of 1 fault, both ,
and are defined to have negative binomial distributions. where , and represent log-likelihood functions of
The probability density function of HNB1 is therefore given by the PRM, and the NBRM, respectively. Generally, if is true, the
likelihood ratio statistic, , follows a distribution with de-
, grees of freedom equal to the number of independent constraints
(19)
, [13]. However, the distribution of this statistic is nonstandard,
because of the restriction that .
where is The asymptotic distribution of the likelihood ratio statistic is
a weighted chi-square [13] distribution. For a given -signif-
(20) icance level, , is rejected when the test statistic exceeds
rather than . The statistic shows which regres-
Note that has the same form as , except for the re- sion model, PRM or NBRM, is preferred or is more appropriate.
placement of parameters , and by , and , respec- The same tests are used for the pairs ZIP versus ZINB, HP1 versus
tively. More specifically, has a negative binomial distri- HNB1, and HP2 versus HNB2. In the case of model simplification
bution with parameters , and . from HP to PRM, or from HNB to NBRM, a standard nested hy-
4) Case 4 (HNB2): Similarly, for a threshold value of 2 faults, pothesis is involved; therefore, the standard likelihood ratio test
both , and are defined to have negative binomial dis- is suitable.

B. Vuong's Hypothesis Testing

Most hypothesis tests on the regression coefficient, and dispersion parameters in count models can be obtained by applying the three classic hypothesis testing techniques: Likelihood Ratio test, Wald test, and Lagrange Multiplier (LM) test [15]. However, there are still some hypothesis tests which cannot be obtained by directly using any of these methods. The primary reason is that the relationship between the two models is different than the assumptions of the hypothesis tests.

Given two conditional models, we can group the relationship between the two models into two categories: nested, and non-nested models. Generally speaking, two models are considered nested if one model is a special case of the other. In contrast, non-nested models imply that neither model can be represented as a special case of the other.

Greene [15] pointed out that the PRM, and ZIP models are not nested. Consequently, he used a test proposed by Vuong [17] for the non-nested models, which is now briefly described. Let $\hat{\Pr}_1(y_i \mid x_i)$ be the predicted probability of the first model (Model 1), $\hat{\Pr}_2(y_i \mid x_i)$ be the predicted probability of the second model (Model 2), and $m_i$ be defined by

$$m_i = \ln\left[\frac{\hat{\Pr}_1(y_i \mid x_i)}{\hat{\Pr}_2(y_i \mid x_i)}\right]. \qquad (26)$$

Then, the Vuong's statistic for testing the hypothesis of Model 1 versus Model 2 is given by

$$V = \frac{\sqrt{n}\;\bar{m}}{s_m} \qquad (27)$$

where $\bar{m}$ denotes the mean of $m_i$, for $i = 1, \ldots, n$; and $s_m$ denotes the standard deviation of $m_i$, for $i = 1, \ldots, n$. Vuong's statistic, $V$, is used to test the hypothesis that $E(m_i) = 0$, i.e., for each observation, the two models have the same predicted probabilities for a given dependent variable $y_i$. Vuong showed that the statistic is bidirectional, and asymptotically s-normal. For a given s-significance level, $\alpha$, if $V > z_{\alpha/2}$, then Model 1 is chosen; if $V < -z_{\alpha/2}$, then Model 2 is selected; otherwise, i.e., $|V| \le z_{\alpha/2}$, either of the two models can be selected as the appropriate model. In this case, a simpler model is always preferred. For our study, Vuong's test is also suitable for testing the relative appropriateness of the NBRM, and the ZINB model.
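A compact sketch of (26)-(27) (illustrative; `p1`, and `p2` stand for the per-observation predicted probabilities $\hat{\Pr}_1$, and $\hat{\Pr}_2$):

```python
import numpy as np
from scipy.stats import norm

def vuong_statistic(p1, p2):
    """Vuong's V from (26)-(27), given each model's predicted probabilities."""
    m = np.log(p1) - np.log(p2)                          # (26)
    return np.sqrt(m.size) * m.mean() / m.std(ddof=1)    # (27)

def vuong_decision(v, alpha=0.05):
    z = norm.ppf(1.0 - alpha / 2.0)                      # z_{alpha/2}
    if v > z:
        return "Model 1"
    if v < -z:
        return "Model 2"
    return "either model (prefer the simpler one)"
```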
C. Information Criteria

Information criteria-based model selection techniques, which are based on the fitted log-likelihood function, can be used for both nested, and non-nested count models. It is assumed that the log-likelihood will increase as more parameters are added to a model. The penalty of increasing log-likelihood takes into account the number of parameters as well as the number of observations. The three information criteria measures considered in this study include:

1. Akaike information criterion: $\mathrm{AIC} = -2\mathcal{L} + 2k$, where $k$ is the number of parameters in the model, and $\mathcal{L}$ is the log-likelihood function.
2. Bayesian information criterion: $\mathrm{BIC} = -2\mathcal{L} + k \ln n$, where $n$ is the number of observations in the fit data set.
3. Consistent Akaike information criterion: $\mathrm{CAIC} = -2\mathcal{L} + k\,(\ln n + 1)$.

A count model with the lowest respective criterion-value (AIC, CAIC, or BIC) will be selected. As compared to hypothesis testing, model-selection based on information criteria (IC) can facilitate an ordering of the various models.
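These three criteria reduce to one-line computations; the sketch below (illustrative; the fitted log-likelihoods and parameter counts are hypothetical) ranks models by any of them:

```python
import math

def aic(ll, k):      return -2.0 * ll + 2.0 * k
def bic(ll, k, n):   return -2.0 * ll + k * math.log(n)
def caic(ll, k, n):  return -2.0 * ll + k * (math.log(n) + 1.0)

# hypothetical fitted log-likelihoods and parameter counts for n = 807 modules
fits = {"PRM": (-612.4, 6), "NBRM": (-598.1, 7), "ZINB": (-560.3, 13)}
ranked = sorted(fits, key=lambda m: bic(fits[m][0], fits[m][1], 807))
print(ranked)   # lowest BIC first
```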
D. Performance Measures

1) Pearson's Chi-Square Measure: In the case of parametric models, a crude diagnostic of studying the performance of a prediction model is to observe how close the fitted probabilities compare to the actual frequencies of the dependent variable. Suppose the dependent variable takes values $j = 0, 1, \ldots, J$, where $J$ is the maximum value that the dependent variable takes. Let the observed frequency, i.e., the fraction of the sample with $y = j$, be denoted by $\bar{p}_j$; and the corresponding fitted frequencies be denoted by $\hat{p}_j$, $j = 0, \ldots, J$, where the fitted frequency, $\hat{p}_j$, is computed as the average over the observations of the predicted probabilities fitted for count $j$.

Model performance can then be measured by comparing $\bar{p}_j$ with $\hat{p}_j$. The closer the two values are, the better is the model performance. Pearson's chi-squared is a closure measurement in statistics [13]. Suppose the range of $y$ is divided into $C$ mutually exclusive cells, where each cell may include one or more values of $y$. For example, $\{0\}, \{1\}, \{2\}, \{3\}, \{4\}, \{5\}$, and $\{6 \text{ or more}\}$ will result in $C = 7$. The Pearson's chi-square statistic is

$$\chi^2 = \sum_{j=1}^{C} \frac{n\,(\bar{p}_j - \hat{p}_j)^2}{\hat{p}_j}. \qquad (28)$$
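A small sketch of (28) (illustrative; the cell layout mirrors the seven-cell example above, with the last cell open-ended):

```python
import numpy as np

def pearson_chi_square(y, prob, cells):
    """Pearson's chi-square of (28). prob[i, j] holds the fitted probability
    Pr-hat(y_i = j); cells lists the y-values grouped into each cell, and the
    final cell is treated as open-ended (its first value or more)."""
    n = prob.shape[0]
    pbar = np.empty(len(cells))
    phat = np.empty(len(cells))
    for c, cell in enumerate(cells):
        if c == len(cells) - 1:                       # open-ended last cell
            pbar[c] = np.mean(y >= cell[0])
            phat[c] = prob[:, cell[0]:].sum(axis=1).mean()
        else:
            pbar[c] = np.mean(np.isin(y, cell))
            phat[c] = prob[:, cell].sum(axis=1).mean()
    return np.sum(n * (pbar - phat) ** 2 / phat)

cells = [[0], [1], [2], [3], [4], [5], [6]]           # {0},...,{5},{6 or more}
```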
2) Absolute, and Relative Errors: The accuracy of fault prediction models is measured by the average absolute error (AAE), and the average relative error (ARE), i.e.,

$$\mathrm{AAE} = \frac{1}{n}\sum_{i=1}^{n} \left|\hat{y}_i - y_i\right|, \quad \text{and} \quad \mathrm{ARE} = \frac{1}{n}\sum_{i=1}^{n} \frac{\left|\hat{y}_i - y_i\right|}{y_i + 1}$$

where $n$ is the number of modules, $y_i$ are the actual values of the dependent variable, and $\hat{y}_i$ are the predicted values of the dependent variable. In the case of ARE, because the actual value of the dependent variable may be zero, we add a "1" to the denominator to make the definition always well-defined [5].
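Both measures are one-liners; a sketch:

```python
import numpy as np

def aae(y, y_hat):
    """Average absolute error."""
    return np.mean(np.abs(y_hat - y))

def are(y, y_hat):
    """Average relative error; the +1 keeps the ratio defined when y_i = 0."""
    return np.mean(np.abs(y_hat - y) / (y + 1))
```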

IV. EMPIRICAL CASE STUDY

A. System Description

The software metrics, and fault data for this case study was collected from initial releases of two large Windows-based embedded software applications used primarily for customizing the configuration of wireless telecommunications products. The two applications, written in C++, provided similar functionalities, and contained common source code; and hence, are analysed as one software system. Each application contained more than 27.5 million lines of code. A program module is comprised of a single source file. Upon preprocessing & cleaning of the collected software metrics, and fault data, i.e., removing incomplete or illogical data points, 1211 modules remained in the data set.

TABLE I
SOFTWARE METRICS

The software metrics for each module were collected using a combination of tools, and databases. Among the 1211 modules, over two-thirds had no faults, and the maximum number of faults in one module was 97. The dependent variable, Fault, indicated the number of the software faults discovered in a source file during system test. In the context of this case study, Table I lists the five software metrics that are used as independent variables for calibrating software quality models.

The fit, and test data sets were obtained by applying an impartial data splitting technique to the original data set of 1211 program modules. Consequently, two-thirds of the modules, i.e., 807, were assigned to the fit data set, whereas, the remaining one-third of the modules, i.e., 404, were assigned to the test data set. To avoid biased results due to a lucky (or unlucky) data splitting, the original data set was randomly split 50 times, yielding 50 pairs of the fit, and test data sets. Empirical studies were performed for all 50 data splits.
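The splitting procedure can be reproduced schematically as follows (an illustration, not the authors' tooling; the 807/404 sizes come from the text):

```python
import numpy as np

def impartial_splits(n_modules=1211, n_fit=807, n_splits=50, seed=0):
    """Yield 50 random fit/test index pairs (two-thirds fit, one-third test)."""
    rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        perm = rng.permutation(n_modules)
        yield perm[:n_fit], perm[n_fit:]   # 807 fit modules, 404 test modules

for fit_idx, test_idx in impartial_splits():
    pass  # calibrate the eight count models on fit_idx, evaluate on test_idx
```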
considered to have similar performance. However, if one were
B. Results & Analysis

According to the modeling methodology of each count model (see Section II), eight count models were calibrated for each of the fifty data splits. To determine the best fitted models for the case study, we employed pairwise hypothesis testing techniques. Because we considered eight different count models, the number of possible pairwise comparisons is $\binom{8}{2} = 28$. However, we present only ten of the pairwise comparisons, and subsequently selected the three best count models. Other comparisons were either not performed because hypothesis testing was not feasible, or are not presented because the respective pairwise tests were irrelevant. The eight pairwise likelihood ratio-based hypothesis tests included PRM versus NBRM, ZIP versus ZINB, PRM versus HP1, PRM versus HP2, NBRM versus HNB1, NBRM versus HNB2, HP1 versus HNB1, and HP2 versus HNB2. In addition, the non-nested models were examined by applying Vuong's hypothesis testing to two pairwise comparisons: PRM versus ZIP, and NBRM versus ZINB.

TABLE II
HYPOTHESIS TESTING RESULTS

The testing results, which were consistent for all fifty data splits, are summarized in Table II. The first column of the table lists the ten pairwise comparisons, the second column shows the hypothesis testing techniques utilized, the third column indicates the recommended model according to the results of the respective pairwise testing, and the fourth column presents the p-values of the comparisons. The p-values indicate the s-significance by which the recommended model is s-better than its counterpart. The overall conclusion from the hypothesis testing indicated that the ZINB, and HNB models are s-significantly better than the other count models, for both threshold values. Determining the best model among these three is not feasible with hypothesis testing, i.e., to our knowledge, no hypothesis testing-based technique exists that can be used to compare ZINB versus HNB models, and the respective HNB models with different thresholds. Therefore, with respect to hypothesis testing-based model selection, these three models are considered to have similar performance. However, if one were to select the best count model among them, then information criteria (IC)-based model selection techniques could be used.

The next step involved IC-based model selection, i.e., AIC, BIC, and CAIC, and evaluating the results with those obtained by hypothesis testing. It was observed that for all three information criteria, similar model selection results were obtained. Moreover, the top three models selected by IC were the same as those determined by hypothesis testing. More specifically, ZINB, and HNB have the lowest IC values, indicating that they are the preferred models for this case study. In contrast, the PRM demonstrated the largest values with respect to information criteria, whereas the NBRM, ZIP, and HP models depicted intermediate information criteria values.

TABLE III
MODEL SELECTION BASED ON INFORMATION CRITERIA FOR COUNT MODELS

An important advantage of the information criteria-based comparative techniques over the hypothesis testing-based techniques is that the modeling techniques can be sorted according to their IC statistics. Table III shows the sorted models based on the IC, discussed in Section III-C, when applied to the fit data set. For all three IC techniques, i.e., AIC, BIC, and CAIC, the same performance order was obtained. The symbol "<" in the table indicates that the left hand side model has lower values of IC than the model on the right hand side, i.e., the left hand side model is preferred over the right hand side model. The modeling techniques sorted from left to right are the same for all fifty data splits except for one split (split number 22). As compared to the other 49 data splits, the only difference in the performance-order for split 22 is that the NBRM is on the right hand side of the ZIP, and HP1 models as compared to being on their left hand side for the other data splits. The different order of models obtained for split 22 may be due to a few influential data points grouped into the fit data set.

The next step in our empirical study involved evaluating the count models with respect to the Pearson's chi-square measure. A comparison of the actual, and fitted frequency distributions was made for the eight count models. The range of $y$ was divided into seven cells, i.e., $\{0\}, \{1\}, \{2\}, \{3\}, \{4\}, \{5\}$, and $\{6 \text{ or more}\}$. In the context of the Pearson's chi-square measure, we further grouped the cells into two sub-parts: Zeros, which included cell $\{0\}$; and Positives, which included all the other cells. The lower the Pearson's chi-square statistic, the better is the performance of the model.

TABLE IV
ANOVA TABLE: PEARSON'S CHI-SQUARE (FIT DATA SETS)

TABLE V
ANOVA TABLE: PEARSON'S CHI-SQUARE (TEST DATA SETS)

The Pearson's chi-square statistics for the eight count models were computed for each of the fifty data splits. The one-way ANOVA model indicated whether the eight count models performed s-significantly different than each other across all the data splits. The ANOVA results for the fit, and test data sets are presented in Tables IV and V, respectively. Each table is comprised of three components: Overall, Zeros, and Positives. Overall includes all seven cells, Zeros only includes cell $\{0\}$, and Positives includes the other cells. We observe that for both fit, and test data sets, and for each component, the average chi-square statistics over the fifty splits of the eight models are s-significantly different from each other, i.e., all p-values are 0.
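The original analysis cites SAS/STAT [18]; purely as an illustration of the one-way layout (eight models, fifty splits per model), SciPy's `f_oneway` performs the same test. The arrays here are hypothetical:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
# chi_sq[m] = the 50 per-split Pearson chi-square values of model m (hypothetical)
chi_sq = {m: rng.gamma(shape=s, scale=3.0, size=50)
          for m, s in [("PRM", 9), ("NBRM", 4), ("ZIP", 5), ("ZINB", 2),
                       ("HP1", 5), ("HP2", 8), ("HNB1", 2), ("HNB2", 2)]}
F, p = f_oneway(*chi_sq.values())
print(f"F = {F:.1f}, p = {p:.3g}")  # p ~ 0 -> the model means differ s-significantly
```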
Because the ANOVA test indicated that the eight count models performed s-significantly different than each other, we proceeded with Tukey's multiple comparison tests with the Pearson's chi-square measure as the dependent variable. For a given s-significance level, $\alpha$, we can calculate the minimum s-significant difference (MSD) [18], [19]. If the difference of any two group means exceeds this value of MSD, the differences are s-significant at the given level. Multiple comparisons results for fit, and test data sets are shown in Tables VI and VII, respectively. Each table consists of three parts: Overall, Zeros, and Positives. For each part, the mean values of methods over the fifty splits are sorted in descending order; N represents the group size (50 in our study), and the s-significance level is 0.15.
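For equal group sizes, the MSD is the studentized-range critical value scaled by the within-group variability. A sketch (assuming SciPy >= 1.7 for `studentized_range`; the MSE, and degrees-of-freedom values are hypothetical):

```python
import numpy as np
from scipy.stats import studentized_range

def tukey_msd(mse, n_per_group, k_groups, df_error, alpha=0.15):
    """Minimum s-significant difference for Tukey's test with equal group sizes."""
    q_crit = studentized_range.ppf(1.0 - alpha, k_groups, df_error)
    return q_crit * np.sqrt(mse / n_per_group)

# eight models, fifty splits each: df_error = 8 * (50 - 1) = 392
print(tukey_msd(mse=4.2, n_per_group=50, k_groups=8, df_error=392))
```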

TABLE VI
MULTIPLE PAIRWISE COMPARISONS: PEARSON'S CHI-SQUARE (FIT DATA SETS)
Note: Means with the same letter are not s-significantly different.

TABLE VII
MULTIPLE PAIRWISE COMPARISONS: PEARSON'S CHI-SQUARE (TEST DATA SETS)
Note: Means with the same letter are not s-significantly different.

The first column of the table, with label "Tukey's Grouping," categorizes the models into performance-based groups. The models represented with the same letters belong (according to relative s-significance) to the same group, and members in the same group are not s-significantly apart from each other. In contrast, models represented with different letters are in different groups, and members in the different groups are s-significantly apart from each other. For example, for the Overall part in Table VI, there are five groups: Group A—PRM, Group B—HP2, Group C—HP1 & ZIP, Group D—NBRM, and Group E—ZINB & HNB1 & HNB2. Hence, the chi-square mean of PRM is s-significantly larger than the other seven models, and the chi-square mean of HP1 is larger than that of ZIP, but the difference is not s-significant. The multiple comparison results of the Pearson's chi-square measure demonstrated several empirical observations for the software system studied in this paper:

• The PRM, and HP2 had s-significantly poorer performances than the other six methods in the Overall, Zeros, and Positives parts.
• All methods, except the PRM, and HP2 model, had similar performances in the Zeros part.
• The ZINB, and HNB models had s-significantly better performances than the other five methods for the Positives part, and the NBRM model had s-significantly better performance than the ZIP, and HP models for the same part.
• The performances of the models in the Positives part determined the Overall performances, because they had similar performances in the Zeros part, except for PRM, and HP2.
• The performance order (rank among the eight models) of a given count model for the test data set was the same as its performance order for the fit data set, except for three models in the Zeros part. The HNB2 model has better performance than the HP1, and ZINB models for the test data set. But for the fit data set, it performs worse than the HP1, and ZINB models. However, this inconsistency occurred only for models that have an s-insignificant difference in their relative performances. The top three models (for the Overall part) were the same as those determined by hypothesis testing-based, and information criteria-based model selection. These three models, in no particular order, are the ZINB, HNB1, and HNB2 models.

The final model selection & evaluation techniques investigated included the performance metrics AAE, and ARE. The use of these measures is similar to our previous studies related to count models [6]. The AAE, and ARE values were computed for both the fit, and test data sets. Once again, we grouped each data set (Overall) into two parts, i.e., Zeros, and Positives. The modules with no faults were assigned to the Zeros part, whereas those with non-zero faults were assigned to the Positives part. The respective AAE, and ARE values for the three parts were computed for all the fifty data splits. The one-way ANOVA, and Tukey's multiple comparison tests once again were employed to analyse the prediction performances of the different methods with respect to the AAE, and ARE values.
TABLE VIII
ANOVA TABLE: AAE, AND ARE (FIT DATA SETS)

TABLE IX
ANOVA TABLE: AAE, AND ARE (TEST DATA SETS)

Tables VIII and IX summarize the ANOVA test results for the fit, and test data sets, respectively. In each table, the upper half reports the ANOVA test results for AAE, whereas the lower half reports the ANOVA test results for ARE. The ANOVA test results indicate that the mean values of AAE, and ARE of the eight models over the fifty splits s-significantly differ from each other for the Overall, Zeros, and Positives parts. Hence, we once again proceeded with performing multiple comparisons of the different models.

TABLE X
MULTIPLE PAIRWISE COMPARISONS: AAE, AND ARE (FIT DATA SETS)
Note: Means with the same letter are not s-significantly different.

TABLE XI
MULTIPLE PAIRWISE COMPARISONS: AAE, AND ARE (TEST DATA SETS)
Note: Means with the same letter are not s-significantly different.

Tukey's multiple comparison results for the fit, and test data sets are shown in Tables X and XI, respectively. Each table consists of the comparisons of AAE (left), and ARE (right) of the different models. These comparisons were conducted for the Overall, Zeros, and Positives parts. The methods were sorted by their mean values of AAE, and ARE in descending order, and are grouped with critical values at a s-significance level of 0.15. The empirical observations based on these tables are summarized below.

• The NBRM showed s-significantly poorer performance (for both AAE, and ARE) than the other seven count models for the Positives, and Overall parts. For these two parts, the other seven models were similar at the s-significance level of 0.15.
• In the Zeros part, the prediction performance of the models was segregated into four groups: Group A—PRM, Group B—ZIP & HP1, Group C—ZINB & NBRM & HNB, and Group D—HP2. Models from the different groups are s-significantly apart from each other. It is seen that Group D had the best prediction accuracy; Group A had the worst prediction accuracy; and Groups B, and C had intermediate performances.
• The performance order for the test data sets was similar to that for the fit data sets. Exceptions occurred only for the models that have s-insignificant differences in their performances. For example, PRM was an exception. Based on AAE, PRM had better fitted accuracy than ZINB, and HNB in the Positives, and Overall parts; but worse (in the same parts) for the prediction accuracy based on the test data sets.
With respect to AAE, and ARE, the HP2 model is the preferred model because it demonstrated the best predictive quality, and a good quality of fit. This contradicts the hypothesis testing, information criteria, and Pearson's chi-square measure model selection results, which gave evidence that ZINB, HNB1, and HNB2 were the preferred models. This contradiction indicates that the best fitted model chosen by hypothesis testing-based or information criteria-based model selection may not ensure the best performance with respect to some other evaluation perspective. However, the consistency, and similarity of the model's quality of fit, and the model's prediction quality existed in the context of the Pearson's chi-square, AAE, and ARE measures. This implies that a good performance for the fit data set with respect to the Pearson's chi-square measure will yield a good performance for the test data set with respect to the Pearson's chi-square measure. This is also true in the case of the AAE, and ARE measures.

V. CONCLUSION

Count models, such as Poisson regression, are appropriate for software quality modeling because software faults are usually recorded as numerical counts. A comprehensive empirical investigation on the comparative performances of eight count models for predicting faults in program modules was performed. Case studies of multiple software systems were performed; however, only a case study of a telecommunications system has been presented. Hypothesis testing-based, and information criteria-based techniques were applied to determine the appropriateness of different count models. The relative predictive accuracies of the models were evaluated based on a one-way ANOVA model, and Tukey's multiple comparison technique.

Among the different count models, the zero-inflated negative binomial, and the hurdle negative binomial models demonstrated better fitting according to the hypothesis testing-based, information criteria-based, and the Pearson's chi-square measure-based model selection. The predictive accuracy (for test data) according to the AAE, and ARE suggested that the HP2 performed relatively better than the other count models. The contradiction of results based on hypothesis testing, AAE-based prediction, and ARE-based prediction quality suggested that different selection criteria may yield different models. However, it was found that the best fitted model using a given performance measure (Pearson's chi-square measure, AAE, or ARE) yielded the best predictive accuracy with respect to the same measure.

Future work related to software quality modeling with count models will involve additional empirical case studies; and will investigate the effect of outliers, and noise in the fit data set on the model performance. Specifically, effective techniques to identify & eliminate such data points will be investigated with the aim of improving model performance.

ACKNOWLEDGMENT

The authors would like to thank Naeem Seliya for his suggestions and assistance with patient editorial reviews, and modifications of the paper.

REFERENCES

[1] K. Ganesan, T. M. Khoshgoftaar, and E. B. Allen, "Case-based software quality prediction," International Journal of Software Engineering and Knowledge Engineering, vol. 10, no. 2, pp. 139–152, 2000.
[2] S. S. Gokhale and M. R. Lyu, "Regression tree modeling for the prediction of software quality," in Proceedings: 3rd International Conference on Reliability and Quality in Design, H. Pham, Ed., Anaheim, California, USA, Mar. 1997, pp. 31–36, International Society of Science and Applied Technologies.
[3] T. M. Khoshgoftaar and N. Seliya, "Tree-based software quality models for fault prediction," in Proceedings: 8th International Software Metrics Symposium, Ottawa, Ontario, Canada, Jun. 2002, pp. 203–214, IEEE Computer Society.
[4] Z. Xu, T. M. Khoshgoftaar, and E. B. Allen, "Application of fuzzy linear regression model for predicting program faults," in Proceedings: Sixth ISSAT International Conference on Reliability and Quality in Design, H. Pham and M.-W. Lu, Eds., Orlando, Florida, USA, Aug. 2000, pp. 96–101, International Society of Science and Applied Technologies.

[5] T. M. Khoshgoftaar, J. C. Munson, B. B. Bhattacharya, and G. D. Richardson, "Predictive modeling techniques of software quality from software measures," IEEE Trans. Software Engineering, vol. 18, no. 11, pp. 979–987, Nov. 1992.
[6] T. M. Khoshgoftaar, K. Gao, and R. M. Szabo, "An application of zero-inflated Poisson regression for software fault prediction," in Proceedings of the Twelfth International Symposium on Software Reliability Engineering, Hong Kong, China, Nov. 2001, pp. 66–73, IEEE Computer Society.
[7] L. C. Briand, W. L. Melo, and J. Wust, "Assessing the applicability of fault-proneness models across object-oriented software projects," IEEE Trans. Software Engineering, vol. 28, no. 7, pp. 706–720, Jul. 2002.
[8] T. M. Khoshgoftaar, B. Cukic, and N. Seliya, "Predicting fault-prone modules in embedded systems using analogy-based classification models," International Journal of Software Engineering and Knowledge Engineering, vol. 12, no. 2, pp. 201–221, Apr. 2002, World Scientific Publishing.
[9] W. M. Evanco, "Modeling the effort to correct faults," Journal of Systems and Software, vol. 29, pp. 75–84, 1995.
[10] P. Deb and P. K. Trivedi, "Demand for medical care by the elderly: A finite mixture approach," Journal of Applied Econometrics, vol. 12, pp. 313–336, 1997.
[11] J. Mullahy, "Specification and testing of some modified count data models," Journal of Econometrics, vol. 33, pp. 341–365, 1986.
[12] W. Pohlmeier and V. Ulrich, "An econometric model of the two-part decision making process in the demand for health care," The Journal of Human Resources, vol. 30, pp. 339–361, 1995.
[13] A. C. Cameron and P. K. Trivedi, Regression Analysis of Count Data. Cambridge, U.K.: Cambridge University Press, 1998.
[14] C. Dean, J. F. Lawless, and G. E. Willmot, "A mixed Poisson-inverse-Gaussian regression model," The Canadian Journal of Statistics, vol. 17, no. 2, pp. 171–181, 1989.
[15] W. H. Greene, Econometric Analysis, 4th ed. Upper Saddle River, New Jersey: Prentice-Hall, 2000.
[16] D. Lambert, "Zero-inflated Poisson regression, with an application to defects in manufacturing," Technometrics, vol. 34, no. 1, pp. 1–14, Feb. 1992.
[17] Q. H. Vuong, "Likelihood ratio tests for model selection and non-nested hypotheses," Econometrica, vol. 57, no. 2, pp. 307–333, Mar. 1989.
[18] SAS/STAT User's Guide, vol. 2. Cary, NC, USA: SAS Institute Inc., 1990.
[19] M. L. Berenson, D. M. Levine, and M. Goldstein, Intermediate Statistical Methods and Applications: A Computer Package Approach. Englewood Cliffs, New Jersey: Prentice-Hall, 1983.

Kehan Gao received the Ph.D. degree in Computer Engineering from Florida Atlantic University, Boca Raton, FL, USA, in 2003. She is currently an Assistant Professor in the Department of Mathematics and Computer Science at Eastern Connecticut State University. Her research interests include software engineering, software metrics, software reliability and quality engineering, computer performance modeling, computational intelligence, and data mining. She is a member of the IEEE Computer Society, and the Association for Computing Machinery.

Taghi M. Khoshgoftaar is a professor of the Department of Computer Science and Engineering, Florida Atlantic University, and the Director of the Empirical Software Engineering Laboratory, and the Data Mining and Machine Learning Laboratory. His research interests are in software engineering, software metrics, software reliability and quality engineering, computational intelligence, computer performance evaluation, data mining, machine learning, and statistical modeling. He has published more than 300 refereed papers in these areas. He is a member of the IEEE, IEEE Computer Society, and IEEE Reliability Society. He was the Program Chair, and General Chair of the IEEE International Conference on Tools with Artificial Intelligence in 2004, and 2005, respectively. He has served on technical program committees of various international conferences, symposia, and workshops. Also, he has served as North American Editor of the Software Quality Journal, and is on the editorial boards of the journals Software Quality and Fuzzy Systems.
