A Genetic Algorithm Based Clustering Framework For Detection of Software Process Model Compatibilities

International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com

Volume 2, Issue 3, May June 2013 ISSN 2278-6856

Abstract: The most difficult task in software engineering is
to select an appropriate software process model, which
completely suits a particular situation. Inappropriate
selection leads to hindrance in software development. It will
consume more time, and will result in a higher budget than
the estimated one. This paper suggests a method to find out
the feasibility of not just one process model but a
combination of process models for optimizing development. A
clustering approach to ascertain the compatibility of process
models has also been proposed and implemented using
k-means and a Genetic Algorithm. Emphasis is on the integration
of compatible models based on the project characteristics.

Keywords: clustering, feasibility, compatibility, Genetic
Algorithm

1. INTRODUCTION
Various software process models have been proposed and
applied in the past. There are pros and cons associated
with each model, and in many cases the selection of the
appropriate model becomes a task unto itself. Software
projects mostly fail due to inappropriate modeling.
Analysts equate a troubled project to an insect caught in a
spider web of sticky constraints, trying desperately to
break free before the spider arrives to feed [6]. This
raises interesting questions on the prospects of process
model integration aimed at tapping the pros of the
constituent process models. The biggest roadblock in this
direction seems to be the identification of the compatibility
of the models considered for integration.
There are various characteristics of a software project to
be taken into account while selecting a process model.
Models that have a high degree of overlap in their
assumptions and constraints offer a greater scope for
integration. Hence, the research attempts to provide an
insight into the models that have a high degree of overlap
and hence can be integrated. The main objective is to
inform the software engineering community of the
prospects of software process model integration.
Data mining, the extraction of implicit, potentially
interesting, previously unknown patterns from large
volumes of data, has offered solutions to various problems
cutting across domains. This gives rise to the possibility
of utilizing data mining techniques to solve complex,
hitherto unsolved problems in Software Engineering. The
research applies one data mining technique, clustering,
to the problem of identifying compatible Software Process
Models for their possible integration.
Clustering is the division of data into groups of similar
objects. Each group, called a cluster, consists of objects
that are similar to one another and dissimilar to
objects of other groups [5]. These clusters correspond to
hidden patterns, and the search for clusters is termed
unsupervised learning. Several clustering techniques
are available in the literature [7, 8]. Some, like the widely
used K-means algorithm [7], optimize a distance
criterion either by minimizing the within-cluster spread
(as implemented in this article) or by maximizing the
inter-cluster separation.

2. GENETIC ALGORITHMS
A Genetic Algorithm (GA) is based on natural adaptation and
can produce effective solutions to problems [3]. GAs work
with a set of individual solutions called a population.
Associated with each individual is a fitness measure that
determines how good the solution represented is. Each
iteration of a GA is termed a generation. Some of the
main issues to be addressed while using a GA include:
- Objective function: to assign a fitness to each
  individual, a mechanism that assigns high fitness
  values to better solutions is needed.
- Encoding: a mechanism needs to be devised to encode
  the solution to a problem as a set of genes. The choice of
  this mechanism can have a significant impact on the
  performance of the GA.
- Selection and reproduction: pairs of individuals are
  chosen from the population based on fitness.
- Crossover: a mechanism for combining the contents of
  two individuals to create a new one.
- Mutation: every individual in the population has a
  probability of having its contents altered slightly.
- Population size: the number of solutions in a
  population.

2.1 The Genetic Algorithm Skeleton
The skeleton of the genetic algorithm can be expressed
as:
1. Generate the initial population; in most cases this is
   random.
2. Create a new population by applying selection and
   reproduction.
3. Apply the cross-over operator to the pairs of strings of the
   new population.
V. Therese Clara

Asst professor of Computer Science, Madurai Kamaraj University College, Madurai, India


4. Apply the mutation operator to each string in the new
   population.
5. Replace the old population with the newly created
   population.
6. If the number of iterations is less than the maximum, go
   to step 2. Else stop the process and display the best result
   found.
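
As an illustration only, the steps of the skeleton can be sketched in Python as follows. The function name `run_ga`, its parameters, and the bit-string toy problem used to exercise it are assumptions made for this sketch and are not part of the study. Selection here is fitness-proportional, and the best string seen so far is retained.

```python
import random


def run_ga(init_individual, fitness, crossover, mutate,
           pop_size=20, max_generations=50, rng=None):
    """Generic GA loop; pop_size is assumed even, fitness values positive."""
    rng = rng or random.Random(0)
    # Step 1: generate the initial population (randomly here).
    population = [init_individual(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    # Step 6: stop after a fixed number of generations.
    for _ in range(max_generations):
        # Step 2: selection and reproduction into a mating pool.
        weights = [fitness(ind) for ind in population]
        pool = rng.choices(population, weights=weights, k=pop_size)
        # Step 3: cross-over on consecutive pairs of the pool.
        children = []
        for a, b in zip(pool[0::2], pool[1::2]):
            children.extend(crossover(a, b, rng))
        # Step 4: mutation of every string in the new population.
        children = [mutate(c, rng) for c in children]
        # Step 5: the new population replaces the old one.
        population = children
        best = max(population + [best], key=fitness)
    return best


# Hypothetical toy problem: maximise the number of ones in a bit string.
def init_bits(rng):
    return [rng.randint(0, 1) for _ in range(16)]


def onemax(bits):
    return sum(bits) + 1  # +1 keeps every selection weight positive


def one_point(a, b, rng):
    cut = rng.randint(1, len(a) - 1)
    return [a[:cut] + b[cut:], b[:cut] + a[cut:]]


def flip(bits, rng, rate=0.02):
    return [1 - b if rng.random() < rate else b for b in bits]


best = run_ga(init_bits, onemax, one_point, flip)
```

The operators are passed in as functions so that the same loop can be reused with a different encoding, such as cluster centres.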

3. CLUSTERING USING GA
The problem of clustering can be formally stated as
follows. Clustering in N-dimensional Euclidean space
$\mathbb{R}^N$ is the partitioning of a set of n points into k
groups, or clusters, based on some similarity/dissimilarity
metric such that points within a single cluster exhibit a high
degree of similarity [2]. Let the n points be denoted as
$\{x_1, x_2, \ldots, x_n\}$ and the k clusters as
$C_1, C_2, \ldots, C_k$. The conditions are:

$$C_i \neq \emptyset \quad \text{for } i = 1, 2, \ldots, k,$$
$$C_i \cap C_j = \emptyset \quad \text{for } i = 1, 2, \ldots, k;\ j = 1, 2, \ldots, k \text{ and } i \neq j.$$
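
The two conditions on the clusters (no cluster is empty, and no two clusters share a point) can be checked mechanically. The following is a minimal sketch; the function name `is_valid_partition` and the representation of points as tuples are assumptions made for illustration.

```python
from itertools import combinations


def is_valid_partition(clusters):
    """Return True if every cluster is non-empty and clusters are pairwise disjoint."""
    sets = [set(c) for c in clusters]
    non_empty = all(sets)  # an empty set is falsy
    disjoint = all(a.isdisjoint(b) for a, b in combinations(sets, 2))
    return non_empty and disjoint
```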
Some of the clustering techniques available include the k-
means clustering, Branch and Bound technique and
graph-theoretic approaches. K-means is one of the most
popular clustering techniques but has the drawback that it
tends to produce sub-optimal solutions. For completeness,
the k-means clustering algorithm is formally stated as
follows:
1. Choose K initial cluster centres $z_1, z_2, \ldots, z_K$
   randomly from the n points $\{x_1, x_2, \ldots, x_n\}$.
2. Assign point $x_i$, $i = 1, 2, \ldots, n$, to cluster $C_j$,
   $j \in \{1, 2, \ldots, K\}$, iff
   $$\|x_i - z_j\| < \|x_i - z_p\|, \quad p = 1, 2, \ldots, K,\ j \neq p.$$
3. Compute new cluster centres $z_1^*, z_2^*, \ldots, z_K^*$ as
   $$z_i^* = \frac{1}{n_i} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, K,$$
   where $n_i$ is the number of points belonging to cluster $C_i$.
4. If $z_i^* = z_i$ for $i = 1, 2, \ldots, K$, terminate; else
   continue from step 2.
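
The k-means steps can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the implementation used in the study: the name `kmeans`, the `max_iter` safeguard, and the choice to keep the old centre when a cluster becomes empty are all assumptions, and ties are resolved in favour of the lowest cluster index.

```python
import math
import random


def kmeans(points, k, rng=None, max_iter=100):
    """Plain k-means on a list of equal-length coordinate tuples."""
    rng = rng or random.Random(0)
    # Step 1: choose K initial centres randomly from the points.
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centre.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: math.dist(x, centres[c]))
            clusters[j].append(x)
        # Step 3: recompute each centre as the mean of its members.
        new_centres = []
        for old, members in zip(centres, clusters):
            if members:
                new_centres.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(old)  # assumption: keep an empty cluster's centre
        # Step 4: terminate when the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters
```

Because the result depends on the random initial centres, different seeds can converge to different local optima, which is the sub-optimality mentioned above.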

4. EXPERIMENTAL STUDY
To apply clustering to reveal compatibilities between
process models, 50 projects undertaken by a local
software organization were studied. All the projects have
been successfully delivered. Various process models have
been chosen for the projects by the senior project
manager. The manager responsible for all the 50 projects
has over 10 years of experience in Software Development
and 5 years of experience in Software Projects
Management. The various process models used by the
projects and the number of projects for each process
model are tabulated below:

S. No   Process Model                          # of Projects
1       Waterfall Model                        3
2       Spiral Model                           9
3       Prototyping                            4
4       Extreme Programming                    12
5       Incremental Delivery                   5
6       Rapid Application Development (RAD)    5
7       SCRUM                                  10

The Software Project Manager had made the decisions
concerning the choice of the process models based on his
experience and competency. He had based his decision on
various characteristics of the projects. The objective of the
research is to utilize the decisions made by the manager
to derive process models that exhibit a high degree of
compatibility. Despite the subjectivity involved in the
decisions made, the research attempts to utilize the
objective project characteristics and the subjective
decisions of the manager to reveal process models that
have a high degree of compatibility and hence offer scope
for integration.
The characteristics of the projects considered for the study
are tabulated below. The Project Manager was given the
list of characteristics and asked to rate each characteristic
for all the projects. These are derived from [1] [4].

S. No   Characteristic
1       Certainty of Requirements (whether requirements are known at the
        beginning or they are subject to frequent changes)
2       Requirements Understanding (Understood or Not well understood)
3       Cost (Low, Medium, High or Very high)
4       Simplicity (Simple, Intermediate or Complex)
5       Risk involved (Low, Medium or High)
6       User Involvement (Only at beginning and end, Medium or High)
7       Required Flexibility
8       Novelty of the problem domain (Novel or Not novel)
9       Novelty of the Technology (Novel or Not novel)
10      Required Progress Visibility (High, Medium or Low)

5. APPLYING GENETIC ALGORITHM TO THE PROBLEM
To apply GA to the required clustering problem, the
following steps are applied.

5.1 String representation
A fundamental requirement for any GA is to encode the
solutions as chromosomes. For this problem, all
characteristics are assigned numbers in the range 1-4,
where 1 represents that the characteristic in question
is minimum or low and 4 represents that the
characteristic is maximum. The research uses the
following numbers for representing the various
characteristics.

Characteristic                    Numbers
Certainty of Requirements         1 - requirements are known at the beginning,
                                  4 - requirements are subject to frequent changes
Requirements Understanding        1 - Understood, 4 - Not well understood
Cost                              1 - Low, 2 - Medium, 3 - High, 4 - Very high
Simplicity                        1 - Simple, 2 - Intermediate, 3 - Complex,
                                  4 - Very Complex
Risk involved                     1 - Low, 2 - Medium, 3 - High, 4 - Very High
User Involvement                  1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High
Required Flexibility              1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High
Novelty of the problem domain     1 - Novel, 4 - Not novel
Novelty of the Technology         1 - Novel, 4 - Not novel
Required Progress Visibility      1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High

For an N-dimensional space with K clusters, the length of
the chromosome is N*K words, where the first N genes
represent the first cluster centre, the next N genes represent
the second cluster centre, and so on. This encoding is
essentially the same as that followed by [2].
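
A minimal sketch of this encoding is given below, assuming a chromosome is stored as a flat Python list. The helper names `encode` and `decode` are hypothetical. With the N = 10 characteristics of the study and K = 5 clusters, the chromosome would have 50 genes.

```python
def encode(centres):
    """Flatten K centres of N coordinates each into one chromosome of N*K genes."""
    return [gene for centre in centres for gene in centre]


def decode(chromosome, n_dims):
    """Split a chromosome back into centres of n_dims genes each."""
    return [chromosome[i:i + n_dims]
            for i in range(0, len(chromosome), n_dims)]


# Two hypothetical centres in a 3-dimensional space.
centres = [[1, 2, 3], [4, 3, 2]]
chromosome = encode(centres)
```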

5.2 Population Initialization
All the K cluster centers are initialized to random integer
values chosen from 1 to 4.
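
The initialization can be sketched as follows. The function name `init_population` and the `pop_size` parameter are assumptions, since the population size used in the study is not stated.

```python
import random


def init_population(pop_size, n_dims, k, rng=None):
    """Create pop_size chromosomes of n_dims*k genes, each a random integer in 1..4."""
    rng = rng or random.Random(0)
    return [[rng.randint(1, 4) for _ in range(n_dims * k)]
            for _ in range(pop_size)]
```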

Fitness Computation
After assigning points $x_i$, $i = 1, 2, \ldots, n$, to clusters
$C_j$ with centre $z_j$ such that
$$\|x_i - z_j\| < \|x_i - z_p\|, \quad p = 1, 2, \ldots, K,\ p \neq j,$$
and resolving all ties arbitrarily, the new centres $z_i^*$ are
calculated as
$$z_i^* = \frac{1}{n_i} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, K.$$
The clustering metric is now computed as
$$M = \sum_{i=1}^{K} M_i, \qquad M_i = \sum_{x_j \in C_i} \|x_j - z_i\|.$$
The fitness function is chosen as f = 1/M so that maximization
of the fitness function leads to the minimization of M.
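
The fitness computation can be illustrated with the sketch below. The name `clustering_fitness` is hypothetical, and three choices in it are assumptions rather than details given in the text: a centre with no assigned points is left unchanged, distances in M are measured to the recomputed centres, and the fitness is taken as infinite when M is zero.

```python
import math


def clustering_fitness(chromosome, points, n_dims):
    """Return (f, new_centres) with f = 1/M for the decoded cluster centres."""
    centres = [chromosome[i:i + n_dims]
               for i in range(0, len(chromosome), n_dims)]
    # Assign every point to its nearest centre (ties go to the lowest index).
    clusters = [[] for _ in centres]
    for x in points:
        j = min(range(len(centres)), key=lambda c: math.dist(x, centres[c]))
        clusters[j].append(x)
    # Recompute each centre as the mean of the points assigned to it.
    new_centres = []
    for old, members in zip(centres, clusters):
        if members:
            new_centres.append(
                [sum(col) / len(members) for col in zip(*members)])
        else:
            new_centres.append(list(old))  # assumption: empty cluster keeps its centre
    # Clustering metric M: sum of distances from points to their centre.
    m = sum(math.dist(x, z)
            for z, members in zip(new_centres, clusters)
            for x in members)
    f = 1.0 / m if m > 0 else math.inf  # assumption: guard against M = 0
    return f, new_centres
```

For four points (1,1), (1,3), (4,4), (4,2) and the chromosome [1, 1, 4, 4], the recomputed centres are (1,2) and (4,3), each point lies at distance 1 from its centre, so M = 4 and f = 0.25.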

5.3 Selection
Each chromosome is assigned a number of copies in the
mating pool proportional to its fitness; chromosomes are
then selected from this pool for cross-over.
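
One common way to realise proportional selection is roulette-wheel sampling, sketched below. This is an assumption about the mechanism, since the text does not say how the copies are allotted: here the expected number of copies of each chromosome, rather than the exact number, is proportional to its fitness. The name `build_mating_pool` is hypothetical.

```python
import random


def build_mating_pool(population, fitnesses, rng=None):
    """Sample a pool of the same size as the population, weighted by fitness."""
    rng = rng or random.Random(0)
    # Fitness values must be non-negative with a positive total.
    return rng.choices(population, weights=fitnesses, k=len(population))
```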
5.4 Cross-over
After selecting two chromosomes from the mating pool, a
random cross-over point is selected in the range [1, l-1],
where l represents the length of the chromosome. Genes
before the point are taken from one parent and
genes after the point are taken from the other.
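
Single-point cross-over as described can be sketched as follows. Returning two complementary children is an assumption of this sketch, and the function name is hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=None):
    """Cut both parents at one random point in [1, l-1] and swap the tails."""
    rng = rng or random.Random(0)
    length = len(parent_a)
    cut = rng.randint(1, length - 1)  # inclusive on both ends
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2
```

Because the cut point is never 0 or l, each child always receives at least one gene from each parent.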
5.5 Mutation
The strategy chosen for mutation is the same as that followed
by [2]. A number $\delta$ in the range [0,1] is generated with
a uniform distribution. If the value at a position is v, it
becomes $v \pm 2\delta v$ if $v \neq 0$, or $v \pm 2\delta$ if
$v = 0$. The reason for this selection is explained by [2].
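
The mutation rule can be sketched as below, assuming it reads v ± 2δv for v ≠ 0 and v ± 2δ for v = 0 with δ drawn uniformly from [0, 1]. The per-gene mutation probability `p_mut` and the equal chance of the plus and minus sign are assumptions of this sketch, as is the function name.

```python
import random


def mutate(chromosome, p_mut=0.05, rng=None):
    """Perturb each gene with probability p_mut using the v +/- 2*delta*v rule."""
    rng = rng or random.Random(0)
    mutated = []
    for v in chromosome:
        if rng.random() < p_mut:
            delta = rng.random()          # uniform in [0, 1)
            sign = rng.choice((-1, 1))    # assumption: both signs equally likely
            step = 2 * delta * v if v != 0 else 2 * delta
            v = v + sign * step
        mutated.append(v)
    return mutated
```

Note that a mutated gene is no longer restricted to the integer range 1-4 used at initialization.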

5.6 Termination
The procedure is stopped after going through a specified
number of iterations. Elitism is incorporated by
preserving the best string encountered up to that
generation.

6. RESULTS AND ANALYSIS
The results obtained by applying the k-means algorithm
for the selected problem for various values of K are
tabulated below. The M value which represents the error
value is also shown.

Table 1: Error values produced by k-means algorithm
K M value
5 672.86
4 891.73
3 1022.41

The results obtained by applying the genetic algorithm for
the selected problem for various values of K are tabulated
below. The M value which represents the error value is
also shown.

Table 2: Error values produced by GA Clustering
K M value
5 621.29
4 703.31
3 812.51
The least error value in both cases was obtained when
K=5, which is the maximum of the selected values of K.
It should be noted that the number of process models
considered is 7, and hence there is little meaning in
setting K to a value greater than 5. In that case
all the projects with a given model are likely to be placed
in a single cluster, and it would be difficult to ascertain
process model compatibility.
The performance difference between k-means and GA
tends to increase in favour of GA as the value of K
decreases. A lower value of K implies the attempt to place
projects with differences in the same cluster. This is a
complex problem and as the complexity of the problem
increases, GA tends to give more promising
improvements.
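
The gap between the two methods can be read directly from Tables 1 and 2. The short calculation below uses only the tabulated M values; the variable names are illustrative.

```python
# M values from Table 1 (k-means) and Table 2 (GA), keyed by K.
kmeans_m = {5: 672.86, 4: 891.73, 3: 1022.41}
ga_m = {5: 621.29, 4: 703.31, 3: 812.51}

# Absolute reduction in the error value M achieved by GA for each K.
gaps = {k: round(kmeans_m[k] - ga_m[k], 2) for k in kmeans_m}
# gaps == {5: 51.57, 4: 188.42, 3: 209.9}
```

In absolute terms the reduction in M grows from 51.57 at K=5 to 188.42 at K=4 and 209.90 at K=3.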



Figure 1 The Error value for k-means and GA

The process models of the projects in each cluster are of
utmost interest, as they throw light on the
compatibility of process models. The process models of the
projects in the various clusters obtained using GA with
K set to 5 are shown below:

Table 3: Proportion of projects in various clusters (for K=5)

Of all the clusters, cluster $C_3$ is interesting in that it is
dominated by projects following the Waterfall model.
Extreme Programming and SCRUM seem to have almost
equal shares in all clusters. In clusters $C_5$ and $C_2$ in
particular, these two models dominate. This
indicates that Extreme Programming and SCRUM may
offer more prospects of integration. One immediate
explanation may be that both are agile models.
Cluster $C_1$ is dominated by the Spiral and Prototyping
models. Again, this might indicate some compatibility
between the two.
Of all the models, the Waterfall model seems to be the one
that offers the least scope for integration with other
models. The share of Waterfall model projects in cluster
$C_3$ (84.29%) is the highest of all. This
indicates that projects with the Waterfall model tend to form a
cluster of their own and hence exhibit a low degree of
compatibility with other models.
Some projects are placed in more than one cluster. That is
the reason why the column-wise sums in the table are not
always 100%.

Figure 2 Proportion of projects in various clusters for k=5

7. CONCLUSIONS AND FUTURE WORK
A clustering approach for ascertaining the compatibility
of process models was proposed and implemented using
k-means and a Genetic Algorithm. It is found that GA
tends to give more promising results than k-means as the
complexity increases. It is also noted that Extreme
Programming and SCRUM have greater prospects of
integration, followed by the Prototyping and Spiral models.
The Waterfall model was found to offer the least prospect of
integration. The clustering is performed using the
manager's choice of process models for 50 projects.
The work can be extended to include many other project
characteristics and other models. The research provides
insight into how clustering can be of immense utility in
optimizing software development using Software Process
Model Integration. The results of the proposed approach
can be validated by collecting real-time data pertaining
to real-time projects employing a combination of models.

REFERENCE
[1] Manish Sharma, "A Survey of project scenario impact
in SDLC models selection process".
[2] Ujjwal Maulik, Sanghamitra Bandyopadhyay, "Genetic
algorithm-based clustering technique", Pattern
Recognition 33 (2000) 1455-1465.
[3] George F. Luger, Artificial Intelligence, 5th Edition,
Pearson Education.
[4] Dr. Deepshikha Jamwal, "Analysis of Software
Development Models", IJCST Vol. 1, Issue 2,
December 2010.
[5] Pavel Berkhin, "Survey of Clustering Data Mining
Techniques".
[6] Barry Boehm, Dan Port, Mohammed Al-Said,
"Avoiding the Software Model Clash Spiderweb",
University of Southern California, 2000, pp. 1-3.
[7] J.T. Tou, R.C. Gonzalez, Pattern Recognition
Principles, Addison-Wesley, Reading, 1974.
[8] K. Fukunaga, Introduction to Statistical Pattern
Recognition, Academic Press, New York, 1990.

Author

V. Therese Clara received her B.Sc. degree in
physics from Lady Doak College, Madurai,
India, in 1994, the Master of Computer Applications
degree from St. Joseph's College, Trichy, India, in
1997, and the Master of Philosophy in Computer Science degree
from Madurai Kamaraj University, Madurai, India, in 2007. She
has been working as an assistant professor in the Department of
Computer Science at Madurai Kamaraj University College since
2007.
