
Wuhan Univ. J. Nat. Sci. 2008, Vol.13 No.1, 014-020

Article ID 1007-1202(2008)01-0014-07
DOI 10.1007/s11859-008-0104-6

Predicting the Maintainability of Open Source Software Using Design Metrics

ZHOU Yuming 1,2, XU Baowen 1,3†

1. School of Computer Science and Engineering, Southeast University, Nanjing 210096, Jiangsu, China;
2. Department of Computing, Hong Kong Polytechnic University, Hong Kong, China;
3. Jiangsu Institute of Software Quality, Nanjing 210096, Jiangsu, China

Abstract: This paper empirically investigates the relationships between 15 design metrics and the maintainability of 148 Java open source software systems. The results show that size and complexity metrics are strongly related to the maintainability of open source software. However, cohesion and coupling, as currently captured by existing metrics, do not seem to have a significant impact on maintainability. When used together, these metrics can predict system maintainability fairly accurately (median MREs below 30%).

Key words: open source; object-oriented; maintainability; metric; prediction; regression
CLC number: TP 311.5

Received date: 2007-05-25
Foundation item: Supported by the National Natural Science Foundation of China (60425206, 60633010), the High Technology Research Project of Jiangsu Province (BG2005032), and the Specialized Research Fund for the Doctoral Program of Higher Education of China (20060286020)
Biography: ZHOU Yuming (1974-), male, Ph.D. candidate, visiting researcher of Hong Kong Polytechnic University, research direction: software metrics. E-mail: csyzhou@comp.polyu.edu.hk
† To whom correspondence should be addressed. E-mail: bwxu@seu.edu.cn

0 Introduction

Open source software is usually developed by volunteers from all over the world working cooperatively. Several empirical studies have to date been carried out to investigate the maintainability of open source software[1-5]. Existing empirical studies on the maintainability of open source software can be classified into three broad categories. The first category investigated whether open source software has better maintainability than closed source software. The second category investigated how maintainability evolves across versions. The third category investigated the relationships between design metrics and the maintainability of open source software. For example, Misra[5] found that design/code metrics can be useful for predicting the maintainability of open source software.

Our study falls into the third category mentioned above. Unlike previous work in this area, it investigates the relationships between a number of design metrics and maintainability based on a large number of Java open source software systems. More specifically, we collected 15 design metrics from 148 open source software systems written in Java. We first tested each of the metrics against a null hypothesis using linear regression. Then, we built prediction models based on these metrics.

1 The Empirical Study Design

In this section, we provide some background on the open source software investigated in this study, describe the dependent variable and the independent variables, and give the hypotheses that we will investigate.

1.1 Systems Investigated

We collected open source software from http://sourceforge.net/ and http://java-source.net, which are two well-established open source software websites. On the former website, open source software is classified into 19 topics. The first five topics involved are software development tools, internet, communications, database, and system, which together account for more than 60% of the total Java software. On the latter website, all software systems are implemented in Java. Although its software classification is somewhat different from that of the former website, it is not difficult to find the relatedness between them.

In this study, we downloaded 130 software systems from http://sourceforge.net/ and 18 software systems from http://java-source.net. The selection criteria for the former website are as follows: a) software must be written in pure Java; b) software is selected, as far as possible, from the first five topics of the total Java software on this website; c) software with a large number of downloads (such information is available) has priority to be selected, as this may indicate that more volunteers contribute to it. On the latter website, the number of downloads is unavailable for each software system. Therefore, we randomly selected software from the following topics: software development tools, internet, communications, database, and system. The purpose of the selection criteria is to make the selected systems as representative as possible of the total Java open source software.

1.2 Dependent Variables

In this study, system maintainability is quantified via a Maintainability Index (MI). MI is a combination of widely-used and commonly-available metrics that affect maintainability[4-7]. More precisely, MI is defined as follows:

    MI = 171 − 5.2 ln(aveV) − 0.23 aveV(g') − 16.2 ln(aveLOC) + 50 sin(sqrt(2.4 perCM))

where aveV is the average Halstead's Volume per module[8], aveV(g') is the average extended cyclomatic complexity per module, aveLOC is the average count of lines of source code per module, and perCM is the average percentage of lines of comments per module. MI takes into account multiple aspects of maintainability: the size, the complexity, and the self-descriptiveness of the code. This makes MI a suitable dependent variable for studying the relationships between design metrics and system maintainability[4-10].

1.3 Independent Variables

The independent variables consist of 15 object-oriented design metrics of size, complexity, coupling, cohesion, and inheritance. The definitions of these metrics are given in Table 1. They include the popular metrics suites proposed in Refs.[11,12] and other metrics that are commonly used and have been validated[13]. They have received considerable attention from researchers.

Table 1 Definitions of metrics

Metric    Description
NPAVGC    Average number of parameters per method
OSAVG     Average complexity per method
CSAO      Average number of attributes & methods per class
CSA       Average number of attributes per class
CSO       Average number of methods per class
SDIT      Average depth of inheritance tree per class
SLCOM     Average lack of cohesion on methods per class
SRFC      Average response per class
SWMC      Average weighted methods per class
SNOC      Average number of children per class
MHF       Method hiding factor
POF       Polymorphism factor
NCLASS    Number of classes
NMETH     Number of methods
PDIT      Maximum depth of inheritance trees

Note that many metrics, such as DIT, RFC, and NOC in the Chidamber and Kemerer metrics suite, are originally defined at the class level. However, our study is performed at the system level; in other words, each system provides one observation in our data set. Therefore, those metrics cannot be used directly as independent variables. In this study, for each such metric, its mean over the classes of a system is used as an independent variable. For example, SDIT is an independent variable that is actually the mean DIT per class in the system. In Table 1, such metrics include NPAVGC, OSAVG, CSAO, CSA, CSO, SDIT, SLCOM, SRFC, SWMC, and SNOC.

1.4 Hypotheses

Table 2 summarizes the hypotheses that relate object-oriented design metrics to system maintainability and gives an intuitive reason for each hypothesis. In particular, the "Relationship" column gives the conjecture on the direction of correlation between each metric and maintainability, where + means positive correlation and − means negative correlation.

Table 2 The hypotheses and intuitive analyses

Measure   Relationship   Intuitive reason (with an increase in the value of the measure)
NPAVGC    −    The methods become more complex and more difficult to understand
OSAVG     −    The classes become more complex and more difficult to understand
CSAO      −    The average size per class increases and thus the classes are more complex
CSA       −    The average number of attributes per class increases and thus the classes are more complex
CSO       −    The average number of methods per class increases and thus the classes are more complex
SDIT      −    The average number of definitions that a class inherits from its ancestors increases, and thus the coupling among classes increases
SLCOM     −    The amount of encapsulation decreases, increasing the complexity of the system, making understanding more difficult and implementing and testing more complex
SRFC      −    The number of methods, and the number of methods invoked by these methods, in a class increase and thus the coupling among classes increases
SWMC      −    The control flows of the methods in a class become more complex
SNOC      −    The average number of classes affected by a class increases
MHF       +    The amount of abstraction increases, thus gaining the benefits of information hiding
POF       −    The number of alternative methods for one class statement increases and the code is hence more difficult to understand and maintain
NCLASS    −    The intelligent content, the number of faults, and the difficulty of understanding and testing the system all increase
NMETH     −    The intelligent content, the number of faults, and the difficulty of understanding and testing the system all increase
PDIT      −    The maximum depth of inheritance trees increases

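As a concrete illustration of the dependent variable, the MI formula from Section 1.2 can be evaluated directly from a system's module-level averages. The sketch below is ours; the function name and example inputs are illustrative, not from the paper:

```python
import math

def maintainability_index(ave_v, ave_vg, ave_loc, per_cm):
    """Maintainability Index (MI) as defined in Section 1.2.

    ave_v   -- average Halstead's Volume per module
    ave_vg  -- average extended cyclomatic complexity per module
    ave_loc -- average count of lines of source code per module
    per_cm  -- average percentage of lines of comments per module
    """
    return (171
            - 5.2 * math.log(ave_v)
            - 0.23 * ave_vg
            - 16.2 * math.log(ave_loc)
            + 50 * math.sin(math.sqrt(2.4 * per_cm)))

# Hypothetical module-level averages; a higher MI means better maintainability.
mi = maintainability_index(ave_v=250.0, ave_vg=3.0, ave_loc=40.0, per_cm=0.2)
```

Note that larger volume, complexity, or module size lowers MI, while a higher comment percentage raises it, matching the intuition behind the hypotheses in Table 2.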
2 Data Analysis Methodology

In this section, we introduce the data analysis methodology, consisting of descriptive statistics, univariate and multivariate regression analyses, and prediction model evaluation.

2.1 Descriptive Statistics

The distribution and variance of each metric are examined to select those with enough variance for further analysis. Metrics with low variance do not differentiate systems very well and are therefore not useful predictors for our data set. As a rule of thumb, only metrics with more than five non-zero data points were considered for all subsequent analyses.

2.2 Univariate and Multivariate Regression Analysis

The linear regression technique was employed to analyze the object-oriented measurement data. It assumes that the dependent variable is a linear function of the independent variables. In our analysis, both univariate and multivariate regressions are used. Univariate regression analysis is used to examine the effect of each metric separately. Multivariate regression analysis is used to examine the combined effectiveness of the metrics. In this paper, the significance level is set at α = 0.05.

Linear regression is the most commonly used technique for modeling the relationship between independent variables and a dependent variable. It works by fitting a linear equation to the observed data. The general form of a multivariate linear regression (MLR) model is given by

    ŷ_i = a_0 + a_1 x_i1 + ⋯ + a_k x_ik
    y_i = a_0 + a_1 x_i1 + ⋯ + a_k x_ik + e_i

where x_i1, …, x_ik are the independent variables, a_0, …, a_k are the parameters to be estimated, ŷ_i is the predicted value of the dependent variable, y_i is the actual value of the dependent variable, and e_i is the error in the prediction of the ith case.

A linear regression model should be tested for influential observations and multi-collinearity. Cook's distance is a measure of the influence of an observation. In this study, an observation with a Cook's distance greater than 4/n is regarded as an influential observation, where n is the number of observations[5]. For the detection of multi-collinearity, the commonly used measure is the condition number. As suggested by Belsley et al[13], the degree of multi-collinearity is harmful when the condition number is greater than the critical value of 20.

2.3 Model Evaluation

We evaluate the goodness of fit of prediction models in terms of three standard measures: the coefficient of determination of the regression model (R² between actual and predicted maintainability), the absolute relative error (ARE), and the magnitude of relative error (MRE).

Assume that the training set consists of n observations. Given an observation i, the corresponding residual is the difference between the actual value and the predicted value. ARE is the absolute value of the residual, that is,

    ARE_i = |y_i − ŷ_i|

where y_i is the ith value of the dependent variable as observed in the data set and ŷ_i is the corresponding predicted value from the prediction model. Given an observation i, MRE_i is defined as

    MRE_i = |y_i − ŷ_i| / y_i

Furthermore, we employ leave-one-out (LOO) cross-validation to obtain a realistic estimate of the predictive power of a model when it is applied to data sets other than the one from which the model was derived. For a data set with n observations, LOO cross-validation divides the data set into n parts, each part in turn being used to assess a model built on the remainder of the data set.

3 Experimental Results

In this section, we present the details of the results of our analysis. We first give the descriptive statistics of the Java data set. We then perform univariate and multivariate analyses using linear regression to build and evaluate prediction models from the data set. Finally, we discuss the implications of design metrics for system maintainability.

3.1 Descriptive Statistics

Table 3 presents the descriptive statistics for the Java data set. Column "Skewness" states whether the data distribution is skewed, and column "Kurtosis" states whether the data are peaked or flat relative to a normal distribution.

As can be seen, the data set has low medians and means for SDIT and SNOC. This indicates that inheritance was not much used in most systems. This is further confirmed by the distribution of PDIT, with more than half of the systems having a PDIT value of less than 3. As a result, the polymorphism metric, POF, also shows a low median and mean. The distribution of the values of MHF shows that there are relatively few private or protected methods in those systems; some systems even have no such methods at all. This may reflect the lack of development experience of the open-source programmers involved in those systems. A low mean and median were found for SLCOM, implying that the cohesion of classes is relatively high.
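The methodology of Sections 2.2 and 2.3 — an ordinary least-squares fit, Cook's-distance screening with the 4/n cutoff, the condition-number check, MRE, and LOO cross-validation — can be sketched with plain NumPy. The data below are synthetic stand-ins for the 148-system data set, and all names are ours, not from the paper:

```python
import numpy as np

def design(X):
    """Prepend an intercept column to the metric matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit(X, y):
    """Ordinary least squares for y = a0 + a1*x1 + ... + ak*xk."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def cooks_distance(X, y):
    """Cook's distance per observation; values above 4/n flag influence."""
    A = design(X)
    n, p = A.shape
    resid = y - A @ fit(X, y)
    mse = resid @ resid / (n - p)
    h = np.diag(A @ np.linalg.inv(A.T @ A) @ A.T)      # leverage values
    return resid ** 2 * h / (p * mse * (1 - h) ** 2)

def condition_number(X):
    """Condition number of the column-scaled design matrix (> 20: harmful)."""
    A = design(X)
    A = A / np.linalg.norm(A, axis=0)
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

def mre(y, y_hat):
    """Magnitude of relative error per observation."""
    return np.abs(y - y_hat) / np.abs(y)

def loo_predictions(X, y):
    """Leave-one-out cross-validation: predict case i from the other n-1."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        out[i] = (design(X[i:i + 1]) @ fit(X[mask], y[mask]))[0]
    return out

# Synthetic stand-in: 60 "systems", 4 "design metrics", an MI-like response.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 80 - 30 * X[:, 0] + 5 * X[:, 2] + rng.normal(scale=2.0, size=60)

keep = cooks_distance(X, y) <= 4 / len(y)              # the "filtered" cases
coef = fit(X[keep], y[keep])                           # filtered model
errors = mre(y, design(X) @ coef)                      # evaluated on all cases
loo = loo_predictions(X, y)                            # realistic LOO estimate
```

A median of `errors` below 0.30 would correspond to the paper's criterion that more than half of the observations are predicted within a 30% relative error.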
Table 3 Descriptive statistics of the Java data set

Metric   Maximum  Upper     Median   Lower     Minimum    Mean      Standard   Skewness  Kurtosis
         value    quartile  value    quartile  value                deviation
NPAVGC   3.727    1.073     0.913    0.773     0.445      0.963     0.349      4.043     27.642
OSAVG    5.483    2.070     1.771    1.547     1.130      1.873     0.555      2.822     13.593
CSAO     32.563   11.892    9.885    7.633     3.958      10.498    4.019      1.805     5.902
CSA      18.219   4.158     3.141    2.393     1.167      3.616     2.183      3.350     17.047
CSO      21.313   8.010     6.459    5.004     2.167      6.883     2.769      1.782     5.229
SDIT     1.474    0.514     0.278    0.108     0          0.355     0.315      1.185     1.435
SLCOM    87.605   16.117    7.039    2.512     0.000      12.111    13.684     2.278     7.129
SRFC     41.125   16.239    11.505   8.797     2.708      13.077    6.649      1.867     4.279
SWMC     69.058   16.239    12.623   8.888     3.850      13.966    8.015      2.979     15.516
SNOC     0.765    0.393     0.238    0.103     0          0.257     0.188      0.419     −0.571
MHF      0.253    0.078     0.053    0.034     0          0.062     0.043      1.642     3.877
POF      0.731    0.163     0.101    0.043     0          0.123     0.126      2.369     7.868
NCLASS   2353     385       146      65        4          311.074   397.235    2.525     7.821
NMETH    19829    2932.25   965      467.750   16         2266.750  3127.471   2.932     10.771
PDIT     8        3         2        1         0          2.257     1.596      0.488     0.021
MI       88.655   52.355    37.177   15.630    −174.366   28.703    39.314     −2.040    7.142
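The statistics reported in Table 3 are standard sample summaries. As a sketch (with illustrative data, and moment-based estimators of skewness and excess kurtosis that may differ slightly from whatever tool the study used), each column can be computed as follows:

```python
import numpy as np

def describe(values):
    """Summary statistics for one metric, in the spirit of Table 3."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()                 # standardized values
    return {
        "max": x.max(),
        "upper_quartile": float(np.percentile(x, 75)),
        "median": float(np.median(x)),
        "lower_quartile": float(np.percentile(x, 25)),
        "min": x.min(),
        "mean": x.mean(),
        "std": x.std(ddof=1),                    # sample standard deviation
        "skewness": float((z ** 3).mean()),      # > 0: right-skewed
        "kurtosis": float((z ** 4).mean() - 3),  # excess; ~0 for a normal
    }

# Illustrative right-skewed sample (cf. the long upper tails in Table 3).
stats = describe([0.4, 0.9, 1.0, 1.1, 3.7])
```

Most metrics in Table 3 show positive skewness and large kurtosis, i.e., long upper tails: a few unusually large systems dominate the distributions.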

All metrics have large differences between the lower 25th percentile, the median, and the 75th percentile, thus showing strong variations across systems. Furthermore, all metrics have more than five non-zero data points and are hence considered for further analysis.

3.2 Univariate Analysis

Table 4 summarizes the results of the univariate linear regression analysis. For each metric, both the "all cases" model and the "filtered" model are provided. The former was built using all observations, while the latter was built using filtered observations (i.e., excluding all influential observations). Column "N" indicates how many observations were used to build the model. Column "R²" gives the proportion of the total variation in the dependent variable that is explained by the model. Columns "p-value", "Std err", and "Relationship" show the statistical significance, the standard error, and the sign of the regression coefficient for the independent variable, respectively.

As can be seen, NPAVGC, OSAVG, CSA, CSO, SDIT, SNOC, POF, and PDIT are statistically significant. Furthermore, the coefficients for NPAVGC, OSAVG, CSA, and POF are negative, thus supporting the corresponding hypotheses stated in Section 1.4. However, the coefficients for CSO, SDIT, SNOC, and PDIT are positive, contradicting the hypotheses. OSAVG has the largest R² value, indicating that it is the best predictor. Since the largest R² is only 0.521, most metrics do not explain more than half of the behavior of MI. Therefore, a multiple linear regression model should be considered.

3.3 Multivariate Analysis

Only those metrics that had been found to be statistically significant (i.e., p-value ≤ 0.05) in the univariate linear filtered models were used in building the multivariate linear regression model.

The multivariate linear regression model is shown in Table 5. To build this model, we used the stepwise
Table 4 Results of linear regression

Metric   Model      N    R²     p-value  Std err  Relationship
NPAVGC   All cases  148  0.014  0.154    9.265    −
         Filtered   119  0.100  0.000    10.783   −
OSAVG    All cases  148  0.298  0.000    4.914    −
         Filtered   125  0.521  0.000    3.765    −
CSAO     All cases  148  0.011  0.202    0.805    −
         Filtered   115  0.002  0.611    0.518    −
CSA      All cases  148  0.185  0.000    1.346    −
         Filtered   119  0.183  0.000    1.391    −
CSO      All cases  148  0.034  0.024    1.155    −
         Filtered   112  0.056  0.012    1.009    +
SDIT     All cases  148  0.047  0.008    10.079   +
         Filtered   108  0.382  0.000    5.178    +
SLCOM    All cases  148  0.000  0.921    0.238    +
         Filtered   124  0.003  0.578    0.193    +
SRFC     All cases  148  0.008  0.267    0.487    +
         Filtered   113  0.026  0.087    0.427    +
SWMC     All cases  148  0.052  0.005    0.395    +
         Filtered   123  0.019  0.125    0.325    −
SNOC     All cases  148  0.036  0.021    17.004   −
         Filtered   108  0.386  0.000    7.878    +
MHF      All cases  148  0.006  0.369    75.400   +
         Filtered   129  0.004  0.477    49.020   +
POF      All cases  148  0.008  0.286    25.626   +
         Filtered   118  0.053  0.012    23.100   −
NCLASS   All cases  148  0.017  0.112    0.008    −
         Filtered   118  0.017  0.163    0.007    +
NMETH    All cases  148  0.026  0.050    0.001    +
         Filtered   134  0.002  0.631    0.001    +
PDIT     All cases  148  0.067  0.001    1.969    +
         Filtered   122  0.065  0.005    1.258    +
variable selection procedure. The entry criterion used in the stepwise selection is the p-value of the F statistic being smaller than or equal to 0.05. The elimination criterion is the p-value of the F statistic being larger than or equal to 0.10. We will refer to this model as the "MLR model". The adjusted R² is 0.752. The condition number for this model is 15.301, well below the critical threshold of 20. The MLR model has 40 influential observations, which were excluded for the final model fitting. However, to retain the objectivity of the analysis results, influential outliers were kept during the evaluation of this model.

Table 5 MLR model

Metric     Coefficient  Std error  t-ratio   p-value   Std beta
Constant   77.9051      5.4517     14.2900   <0.0001   —
OSAVG      −36.0152     2.8795     −12.5073  <0.0001   −0.7108
SNOC       26.2737      6.7651     3.8837    0.0002    0.2125
CSO        5.2227       0.7820     6.6787    <0.0001   0.4242
CSA        −4.8610      1.1774     −4.1286   <0.0001   −0.2910

As can be seen, this model has four covariates. For each covariate, we provide its coefficient, the standard error of the coefficient, the t-ratio of the coefficient, the statistical significance of the coefficient, and the standardized beta coefficient. In particular, the standardized beta coefficients indicate that OSAVG is the most important metric for predicting maintainability; the next two most important metrics are CSO and CSA. The mean ARE is 18.563. The mean MRE is 1.005, which means that predictions show, on average, a relative error of 100.5%.

For the MLR model, the adjusted R² is 0.752. This is actually the filtered model, in which 40 influential observations were excluded from model fitting. If all observations were used for model fitting, the adjusted R² would be reduced to 0.471. As discussed above, to retain the objectivity of the analysis results, influential observations were kept when evaluating the prediction accuracy of the filtered MLR model.

LOO cross-validation produces an R² of 0.486 between actual and predicted maintainability for the MLR model. The mean ARE and MRE are 18.833 and 1.020, respectively. In addition, the mean of the residuals for the MLR model is −6.936. Therefore, the MLR model tends to overpredict maintainability.

3.4 Implications of Design Metrics for Maintainability

Based on the results of the regression analyses performed, this section discusses the implications of object-oriented design metrics for the maintainability of open source software. These implications can provide decision support for developers in controlling the maintainability of open source software at the relatively early phases of software development.

3.4.1 Influence of individual design metrics

Among the 15 design metrics under investigation, NPAVGC, OSAVG, CSA, SWMC, and POF were found to have statistically significant negative effects on maintainability. Therefore, the hypotheses for those metrics are supported. On the other hand, CSO, SDIT, SNOC, and PDIT were found to have statistically significant positive effects on maintainability. Thus, the hypotheses for these metrics are not supported, because the trend observed was contrary to what was expected. Also, the metrics CSAO, SLCOM, SRFC, MHF, NCLASS, and NMETH were found to have no statistically significant effects on maintainability. Therefore, the hypotheses for these metrics are not supported either. These findings are in line with what was reported in Ref.[5], with the exception of POF. In Ref.[5], the analyses were performed on a C++ data set and POF was found to have no significant effect on maintainability.

The results indicate that increases in the number of method parameters, the control flow complexity of methods, the number of attributes, and the amount of polymorphism will decrease maintainability. However, increases in the number of methods, the depth of the inheritance tree, and the number of child classes will increase maintainability. On the other hand, other factors, such as the amount of method hiding and the number of classes, have no significant influence on maintainability.

3.4.2 Capability of design metrics to predict maintainability

The multivariate linear regression model shows that OSAVG, CSO, CSA, and SNOC play the most dominant role in maintainability prediction. Among these four metrics, OSAVG is the most important metric for predicting maintainability, with the second and third most important being CSO and CSA. When applying the model to maintainability prediction, the goodness of fit indicates that an optimistic estimate would be an R² of around 0.75 between actual and predicted maintainability and a median MRE of around 0.29. On the other hand, the results of LOO cross-validation show that a more realistic picture would be an R² of around 0.5 and a median MRE of around 0.30. It is concluded that the prediction models can predict the maintainability of open source software reasonably well, because more than 50

percent of observations have estimates with an MRE of 30% or less.

4 Conclusion

In this paper, based on a data set of 148 Java open source systems, we employed classical linear regression to investigate the relationships between design metrics and the maintainability of open source software. We not only analyzed the influences of individual design metrics, but also reported their ability to predict how maintainable a system is when the design metrics are used together. Univariate analysis results have shown that many design metrics are strongly related to the maintainability of open source software. In particular, the average control flow complexity per method (OSAVG) appears to be the most important maintainability factor. On the other hand, cohesion and coupling, as currently captured by existing metrics, do not seem to have a significant impact on maintainability. The multivariate prediction model demonstrates reasonable accuracy, providing more than 50 percent of observations with an MRE of 30% or less.

References

[1] Mockus A, Fielding R, Herbsleb J. Two Case Studies of Open Source Software Development: Apache and Mozilla [J]. ACM Transactions on Software Engineering and Methodology, 2002, 11(3): 309-346.
[2] Paulson J, Succi G, Eberlein A. An Empirical Study of Open-Source and Closed-Source Software Products [J]. IEEE Transactions on Software Engineering, 2004, 30(4): 246-256.
[3] Samoladas I, Stamelos I, Angelis L, et al. Open Source Software Development Should Strive for Even Greater Code Maintainability [J]. Communications of the ACM, 2004, 47(10): 83-87.
[4] Schach S, Jin B, Wright D, et al. Maintainability of the Linux Kernel [J]. IEE Proceedings: Software, 2002, 149(1): 18-23.
[5] Misra S. Modeling Design/Coding Factors that Drive Maintainability of Software Systems [J]. Software Quality Journal, 2005, 13(3): 297-320.
[6] Welker K, Oman P. Software Maintainability Metrics Models in Practice [J]. Journal of Defense Software Engineering, 1995, 8(11): 19-23.
[7] Welker K, Oman P, Atkinson G. Development and Application of an Automated Source Code Maintainability Index [J]. Journal of Software Maintenance: Research & Practice, 1997, 9(3): 127-159.
[8] Halstead M. Elements of Software Science [M]. New York: Elsevier Science Inc, 1977.
[9] Coleman D, Lowther B, Oman P. The Application of Software Maintainability Models in Industrial Software Systems [J]. Journal of Systems Software, 1995, 29(1): 3-16.
[10] VanDoren E, Sciences K, Springs C. Maintainability Index Technique for Measuring Program Maintainability [EB/OL]. [2002-03-12]. http://www.sei.cmu.edu/str/descriptions/mitmpm_body.html.
[11] Chidamber S, Darcy D, Kemerer C. Managerial Use of Metrics for Object-Oriented Software: An Exploratory Analysis [J]. IEEE Transactions on Software Engineering, 1998, 24(8): 629-639.
[12] Brito F, Carapuca R. Object-Oriented Software Engineering: Measuring and Controlling the Development Process [C]// Proceedings of the 4th International Conference on Software Quality. Virginia: IEEE Press, 1994.
[13] Belsley D, Kuh E, Welsch R. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity [M]. New York: John Wiley and Sons, 1980.