International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: editor@ijettcs.org, editorijettcs@gmail.com
Volume 2, Issue 3, May-June 2013 ISSN 2278-6856
Abstract: One of the most difficult tasks in software engineering is selecting a software process model that suits a particular situation. An inappropriate selection hinders software development: the project consumes more time and exceeds its estimated budget. This paper suggests a method for assessing the feasibility of not just one process model but a combination of process models for optimizing development. A clustering approach to ascertain the compatibility of process models is proposed and implemented using k-means and a Genetic Algorithm. The emphasis is on integrating compatible models based on the project characteristics.
1. INTRODUCTION
Various software process models have been proposed and applied in the past. There are pros and cons associated with each model, and in many cases the selection of the appropriate model becomes a task unto itself. Software projects mostly fail due to inappropriate modeling. Analysts equate a troubled project to an insect caught in a spider web of sticky constraints, trying desperately to break free before the spider arrives to feed [6]. This raises interesting questions about the prospects of process model integration aimed at tapping the strengths of the constituent process models. The biggest roadblock in this direction seems to be identifying the compatibility of the models considered for integration.

There are various characteristics of a software project to be taken into account while selecting a process model. Models that have a high degree of overlap in their assumptions and constraints offer greater scope for integration. Hence, the research attempts to provide insight into the models that have a high degree of overlap and can therefore be integrated. The main objective is to inform the software engineering community about the prospects of software process model integration.

Data mining, the extraction of implicit, potentially interesting, previously unknown patterns from large volumes of data, has offered solutions to various problems cutting across domains. This gives rise to the possibility of utilizing data mining techniques to solve complex, hitherto unsolved problems in software engineering. The research applies one data mining technique, clustering, to the problem of identifying compatible software process models for their possible integration. Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups [5].
These clusters correspond to hidden patterns, and the search for clusters is termed unsupervised learning. Several clustering techniques are available in the literature [7, 8]. Some, like the widely used k-means algorithm [7], optimize a distance criterion either by minimizing the within-cluster spread (as implemented in this article) or by maximizing the inter-cluster separation.
2. GENETIC ALGORITHMS
Genetic Algorithms (GAs) are based on natural adaptation and can produce effective solutions to problems [3]. GAs work with a set of individual solutions called a population. Associated with each individual is a fitness measure that determines how good the solution it represents is. Each iteration of a GA is termed a generation. Some of the main issues to be addressed while using a GA include:
- Objective function: a mechanism that assigns high fitness values to better solutions is needed to give each individual a fitness.
- Encoding: a mechanism needs to be devised to encode the solution to a problem as a set of genes. The choice of this mechanism can have a significant impact on the performance of the GA.
- Selection and reproduction: pairs of individuals are chosen from the population based on fitness.
- Crossover: a mechanism for combining the contents of two individuals to create a new one.
- Mutation: every individual in the population has a probability of having its contents altered slightly.
- Population size: the number of solutions in a population.
A Genetic Algorithm based Clustering Framework for Detection of Software Process Model Compatibilities
V. Therese Clara
Asst. Professor of Computer Science, Madurai Kamaraj University College, Madurai, India

2.1 The Genetic Algorithm Skeleton
The skeleton of the genetic algorithm can be expressed as:
1. Generate the initial population; in most cases this is random.
2. Create a new population by applying selection and reproduction.
3. Apply the cross-over operator to pairs of strings of the new population.
4. Apply the mutation operator to each string in the new population.
5. Replace the old population with the newly created population.
6. If the number of iterations is less than the maximum, go to step 2. Otherwise stop the process and display the best result found.
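As an illustration, the skeleton steps above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation used in this study: the function names (`run_ga`, `one_max`, `random_bits`, `one_point`, `flip_bits`), the parameter defaults, and the toy bit-counting objective are all hypothetical and chosen only to make the loop runnable.

```python
import random

def run_ga(fitness, new_individual, crossover, mutate,
           pop_size=20, max_generations=50, seed=0):
    """Generic GA loop following the six skeleton steps (illustrative only)."""
    if pop_size % 2:
        raise ValueError("pop_size must be even so parents can be paired")
    rng = random.Random(seed)
    # Step 1: generate the initial population at random.
    population = [new_individual(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(max_generations):  # Step 6: stop at the iteration limit
        # Step 2: fitness-proportional selection and reproduction.
        weights = [fitness(ind) for ind in population]
        parents = rng.choices(population, weights=weights, k=pop_size)
        # Step 3: cross-over applied to consecutive pairs of parents.
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            children.extend(crossover(a, b, rng))
        # Steps 4 and 5: mutate every string; the result replaces the old population.
        population = [mutate(child, rng) for child in children]
        # Remember the best individual seen so far, to report at the end.
        best = max(population + [best], key=fitness)
    return best

# Toy problem (an assumption for illustration): maximise the number of 1 bits.
def one_max(bits):
    return 1 + sum(bits)  # the +1 keeps every selection weight positive

def random_bits(rng, length=16):
    return [rng.randint(0, 1) for _ in range(length)]

def one_point(a, b, rng):
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def flip_bits(bits, rng, rate=0.02):
    return [1 - b if rng.random() < rate else b for b in bits]

best = run_ga(one_max, random_bits, one_point, flip_bits)
print(one_max(best) - 1, "ones out of", len(best))
```

The operators are passed in as functions so that the same loop could, in principle, be reused with the clustering-specific encoding and operators described in Section 5.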
3. CLUSTERING USING GA
The problem of clustering can be formally stated as follows. Clustering in N-dimensional Euclidean space $\mathbb{R}^N$ is the process of partitioning a set of $n$ points into $K$ groups or clusters based on some similarity/dissimilarity metric, such that points within a single cluster exhibit a high degree of similarity [2]. Let the $n$ points be denoted as $\{x_1, x_2, \ldots, x_n\}$ and the $K$ clusters as $C_1, C_2, \ldots, C_K$. The conditions are:

$$C_i \neq \emptyset \quad \text{for } i = 1, 2, \ldots, K,$$
$$C_i \cap C_j = \emptyset \quad \text{for } i = 1, 2, \ldots, K;\ j = 1, 2, \ldots, K \text{ and } i \neq j.$$

Some of the clustering techniques available include k-means clustering, the Branch and Bound technique and graph-theoretic approaches. K-means is one of the most popular clustering techniques but has the drawback that it tends to produce sub-optimal solutions. For completeness, the k-means clustering algorithm is formally stated as follows:
1. Choose $K$ initial cluster centers $z_1, z_2, \ldots, z_K$ randomly from the $n$ points $\{x_1, x_2, \ldots, x_n\}$.
2. Assign point $x_i$, $i = 1, 2, \ldots, n$, to cluster $C_j$ iff $\|x_i - z_j\| < \|x_i - z_p\|$, $p = 1, 2, \ldots, K$, $p \neq j$.
3. Compute new cluster centres $z_1^*, z_2^*, \ldots, z_K^*$ as $z_i^* = \frac{1}{n_i} \sum_{x_j \in C_i} x_j$, $i = 1, 2, \ldots, K$, where $n_i$ is the number of points belonging to cluster $C_i$.
4. If $z_i^* = z_i$ for $i = 1, 2, \ldots, K$, terminate; else continue from step 2.
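The four k-means steps above can be illustrated with a short pure-Python sketch. It is only a sketch under assumptions: the function name `kmeans`, the `max_iter` safeguard, the rule that ties go to the lowest-indexed centre, and the choice to keep an old centre when its cluster is empty are not part of the formal statement, and the six two-dimensional points are invented for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means following steps 1-4 (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: choose K initial centres randomly from the n points.
    centres = [list(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centre
        # (ties go to the lowest-indexed centre, an arbitrary rule).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[j].append(p)
        # Step 3: recompute each centre as the mean of its members
        # (an empty cluster keeps its old centre, an assumption).
        new_centres = []
        for old, members in zip(centres, clusters):
            if members:
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(old)
        # Step 4: terminate when the centres no longer change.
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

# Invented toy data: two loose groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(points, k=2)
```

Because the result depends on the random initial centres, different seeds may end in different partitions, which is the sub-optimality drawback noted above.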
4. EXPERIMENTAL STUDY
To apply clustering to reveal compatibilities between process models, 50 projects undertaken by a local software organization were studied. All the projects were successfully delivered. The process model for each project was chosen by the senior project manager. The manager responsible for all 50 projects has over 10 years of experience in software development and 5 years of experience in software project management. The process models used by the projects and the number of projects for each process model are tabulated below:
S. No   Process Model                          # of Projects
1       Waterfall Model                        3
2       Spiral Model                           9
3       Prototyping                            4
4       Extreme Programming                    12
5       Incremental Delivery                   5
6       Rapid Application Development (RAD)    5
7       SCRUM                                  10

The Software Project Manager had made the decisions concerning the choice of the process models based on his experience and competency. He had based his decisions on various characteristics of the projects. The objective of the research is to utilize the decisions made by the manager to derive process models that exhibit a high degree of compatibility. Despite the subjectivity involved in the decisions made, the research attempts to utilize the objective project characteristics and the subjective decisions of the manager to reveal process models that have a high degree of compatibility and hence offer scope for integration. The characteristics of the projects considered for the study are tabulated below. The Project Manager was given the list of characteristics and asked to rate each characteristic for all the projects. These are derived from [1] [4].

S. No   Characteristic
1       Certainty of Requirements (whether requirements are known at the beginning or they are subject to frequent changes)
2       Requirements Understanding (Understood or Not well understood)
3       Cost (Low, Medium, High or Very High)
4       Simplicity (Simple, Intermediate or Complex)
5       Risk involved (Low, Medium or High)
6       User Involvement (Only at beginning and end, Medium or High)
7       Required Flexibility
8       Novelty of the problem domain (Novel or Not Novel)
9       Novelty of the Technology (Novel or Not Novel)
10      Required Progress Visibility (High, Medium or Low)
5. APPLYING GENETIC ALGORITHM TO THE PROBLEM To apply GA to the required clustering problem, the following steps are applied.
5.1 String representation
A fundamental requirement for any GA is to encode the solutions as chromosomes. For the problem, all characteristics are assigned numbers in the range of 1-4, where 1 represents that the characteristic in question is minimum or low and 4 represents that the characteristic is maximum. The research uses the following numbers for representing the various characteristics.

Characteristic                   Numbers
Certainty of Requirements        1 - requirements are known at the beginning, 4 - requirements are subject to frequent changes
Requirements Understanding       1 - Understood, 4 - Not well understood
Cost                             1 - Low, 2 - Medium, 3 - High, 4 - Very High
Simplicity                       1 - Simple, 2 - Intermediate, 3 - Complex, 4 - Very Complex
Risk involved                    1 - Low, 2 - Medium, 3 - High, 4 - Very High
User Involvement                 1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High
Required Flexibility             1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High
Novelty of the problem domain    1 - Novel, 4 - Not Novel
Novelty of the Technology        1 - Novel, 4 - Not Novel
Required Progress Visibility     1 - Minimum, 2 - Moderate, 3 - High, 4 - Very High

For an N-dimensional space with K clusters, the length of the chromosome is N*K words, where the first N genes represent the first cluster centre, the next N genes represent the second cluster centre and so on. This encoding is essentially the same as that followed by [2].
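The N*K layout described above can be made concrete with a small Python sketch. The helper names `encode` and `decode`, the example project ratings and the two example centres are hypothetical and serve only to show how a chromosome maps to cluster centres; N = 10 corresponds to the ten characteristics listed in Section 4.

```python
N = 10  # number of project characteristics, each rated from 1 to 4

def encode(centres):
    """Flatten K centres of N genes each into one chromosome of N*K genes."""
    return [gene for centre in centres for gene in centre]

def decode(chromosome, n=N):
    """Split a chromosome back into its K centres of n genes each."""
    if len(chromosome) % n != 0:
        raise ValueError("chromosome length must be a multiple of n")
    return [chromosome[i:i + n] for i in range(0, len(chromosome), n)]

# A hypothetical project rated on the ten characteristics, in table order.
project = [4, 4, 2, 3, 2, 4, 4, 1, 4, 3]

# Two hypothetical cluster centres (K = 2): all-low and all-high ratings.
centres = [[1] * N, [4] * N]
chromosome = encode(centres)
print(len(chromosome))  # N*K = 20 genes
```

Each project is thus a point in a 10-dimensional space, and a chromosome is simply K such points written one after another.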
5.2 Population Initialization
All the K cluster centers are initialized to random integer values chosen from 1 to 4.

Fitness Computation
After assigning the points $x_i$, $i = 1, 2, \ldots, n$, to clusters $C_j$ with centre $z_j$ such that

$$\|x_i - z_j\| < \|x_i - z_p\|, \quad p = 1, 2, \ldots, K,\ p \neq j,$$

and resolving all ties arbitrarily, the new centres $z_i^*$ are calculated as

$$z_i^* = \frac{1}{n_i} \sum_{x_j \in C_i} x_j, \quad i = 1, 2, \ldots, K.$$

The clustering metric $M$ is now computed as

$$M = \sum_{i=1}^{K} M_i, \qquad M_i = \sum_{x_j \in C_i} \|x_j - z_i\|.$$

The fitness function is chosen as $f = 1/M$ so that maximization of the fitness function leads to the minimization of $M$.
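A hedged Python sketch of the initialization and fitness computation is given below. The names `init_chromosome` and `evaluate` are hypothetical. Two points are assumptions rather than statements from the text: ties are broken in favour of the lowest-indexed centre, and the metric M is measured against the recomputed centres, which are also written back into the chromosome as is done in [2].

```python
import math
import random

def init_chromosome(k, n, rng):
    """K centres of n genes each, every gene a random integer from 1 to 4."""
    return [float(rng.randint(1, 4)) for _ in range(n * k)]

def evaluate(chromosome, points, n):
    """Return (fitness, M, updated chromosome) for one individual."""
    centres = [chromosome[i:i + n] for i in range(0, len(chromosome), n)]
    k = len(centres)
    # Assign every point to its nearest centre (ties: lowest index).
    clusters = [[] for _ in range(k)]
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centres[c]))
        clusters[j].append(p)
    # Recompute each centre as the mean of its members; an empty
    # cluster keeps its previous centre (an assumption).
    new_centres = [
        [sum(col) / len(members) for col in zip(*members)] if members else centre
        for centre, members in zip(centres, clusters)
    ]
    # Clustering metric M: sum of distances from points to their centre.
    m = sum(math.dist(p, z)
            for z, members in zip(new_centres, clusters) for p in members)
    fitness = 1.0 / m if m > 0 else float("inf")
    return fitness, m, [g for z in new_centres for g in z]

# Small invented example in two dimensions (n = 2, K = 2).
points = [(1, 1), (1, 3), (4, 4), (4, 2)]
fitness, m, updated = evaluate([1.0, 2.0, 4.0, 3.0], points, n=2)
print(m, fitness)  # each point lies at distance 1 from its centre, so M = 4.0 and f = 0.25
```

Using the inverse of M as fitness means that a chromosome whose centres sit closer to their assigned projects receives proportionally more copies during selection.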
5.3 Selection
Every chromosome is assigned a number of copies proportional to its fitness in the mating pool, from which chromosomes are selected for cross-over.

5.4 Cross-over
After selecting 2 chromosomes from the mating pool, a random cross-over point is selected in the range [1, l-1], where l represents the length of the chromosome. Genes before the point are taken from one parent and genes after the point are taken from the other.

5.5 Mutation
The strategy chosen for mutation is the same as that followed by [2]. A number $\delta$ in the range [0, 1] is generated with a uniform distribution. If the value at a gene position is $v$, after mutation it becomes $v \pm 2\delta v$ if $v \neq 0$, or $v \pm 2\delta$ if $v = 0$. The reason for this selection is explained in [2].
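The three operators described above might look as follows in Python. This is a sketch under assumptions, not the code used in the study: the function names are hypothetical, selection is approximated by fitness-weighted sampling with replacement, cross-over returns both complementary children, the per-gene mutation probability `rate` is an invented parameter, and the sign of the mutation is drawn with equal probability.

```python
import random

def select_mating_pool(population, fitnesses, rng):
    """Fitness-proportional sampling: fitter chromosomes get more copies on average."""
    return rng.choices(population, weights=fitnesses, k=len(population))

def crossover(parent_a, parent_b, rng):
    """Single-point cross-over at a random point in [1, l-1]."""
    l = len(parent_a)
    point = rng.randint(1, l - 1)  # randint includes both end points
    child_1 = parent_a[:point] + parent_b[point:]
    child_2 = parent_b[:point] + parent_a[point:]
    return child_1, child_2

def mutate(chromosome, rate, rng):
    """Perturb each gene v with probability `rate`, as v +/- 2*delta*v (or +/- 2*delta if v is 0)."""
    mutated = []
    for v in chromosome:
        if rng.random() < rate:
            delta = rng.random()            # uniform in [0, 1)
            sign = rng.choice((-1.0, 1.0))  # assumed: both signs equally likely
            if v != 0:
                v = v + sign * 2 * delta * v
            else:
                v = v + sign * 2 * delta
        mutated.append(v)
    return mutated

rng = random.Random(1)
a, b = [1.0] * 6, [4.0] * 6
pool = select_mating_pool([a, b], [1.0, 3.0], rng)
child_1, child_2 = crossover(a, b, rng)
shaken = mutate([1.0, 0.0], rate=1.0, rng=rng)
```

Note that mutation can move a gene outside the 1-4 rating range; whether the values are clipped back is not stated in the text, so the sketch leaves them as they are.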
5.6 Termination
The procedure is stopped after going through a specified number of iterations. Elitism is incorporated by preserving the best string encountered up to that generation.
6. RESULTS AND ANALYSIS The results obtained by applying the k-means algorithm for the selected problem for various values of K are tabulated below. The M value which represents the error value is also shown.
Table 1: Error values produced by k-means algorithm
K    M value
5    672.86
4    891.73
3    1022.41
The results obtained by applying the genetic algorithm for the selected problem for various values of K are tabulated below. The M value which represents the error value is also shown.
Table 2: Error values produced by GA Clustering
K    M value
5    621.29
4    703.31
3    812.51

The least error value in both cases was obtained when K=5, which is the largest of the selected values of K. It should be noted that the number of process models considered is 7, and hence there is little meaning in setting K to a value greater than 5. In that case all the projects with a given model are likely to be placed in one cluster, and it would be difficult to ascertain the process model compatibility. The performance difference between k-means and GA tends to increase in favour of GA as the value of K decreases: the gap in M grows from 51.57 at K=5 to 209.90 at K=3. A lower value of K implies an attempt to place projects that differ from one another in the same cluster. This is a more complex problem, and as the complexity of the problem increases, GA tends to give more promising improvements.
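The comparison between the two tables can be reproduced with a few lines of Python. The M values are copied from Table 1 and Table 2; the variable names and the percentage column are additions made here for illustration.

```python
# M values from Table 1 (k-means) and Table 2 (GA clustering), keyed by K.
kmeans_m = {5: 672.86, 4: 891.73, 3: 1022.41}
ga_m = {5: 621.29, 4: 703.31, 3: 812.51}

gaps = {}
for k in sorted(kmeans_m, reverse=True):
    gaps[k] = kmeans_m[k] - ga_m[k]               # absolute reduction in M
    reduction = 100.0 * gaps[k] / kmeans_m[k]     # relative to the k-means error
    print(f"K={k}: gap={gaps[k]:.2f}, reduction={reduction:.1f}%")
```

The absolute gap widens as K decreases (51.57, 188.42, 209.90), while the relative reduction is roughly 7.7%, 21.1% and 20.5%, so most of the relative gain is already reached at K=4.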
Figure 1 The Error value for k-means and GA
The process models of the projects in each cluster are of utmost interest, as they throw light on the compatibility of process models. The process models of projects in the various clusters obtained by using GA for the case when K is set to 5 are shown below:
Table 3: Proportion of projects in various clusters (For K=5 )
Of all the clusters, cluster $C_3$ is interesting in that it is dominated by projects following the Waterfall model. Extreme Programming and SCRUM seem to have almost equal shares in all clusters. In clusters $C_5$ and $C_2$ in particular, these 2 models dominate. This indicates that Extreme Programming and SCRUM may offer more prospects of integration. One immediate reason for this may be that both are agile models. Cluster $C_1$ is dominated by the Spiral and Prototyping models. Again, this might indicate some compatibility between the two. Of all the models, the Waterfall model seems to be the one that offers the least scope for integration with other models. The proportion of Waterfall projects in cluster $C_3$ (84.29%) is the highest of all. This indicates that projects following the Waterfall model tend to form a cluster of their own and hence exhibit a low degree of compatibility with other models. Some projects are placed in more than one cluster, which is why the column-wise sums in the table are not always 100%.
Figure 2 Proportion of projects in various clusters for k=5
7. CONCLUSIONS AND FUTURE WORK
A clustering approach for ascertaining the compatibility of process models was proposed and implemented using k-means and a Genetic Algorithm. It is found that GA tends to give more promising results than k-means as the complexity of the problem increases. It is also noted that Extreme Programming and SCRUM have the greatest prospects of integration, followed by the Prototyping and Spiral models. The Waterfall model was found to offer the least prospect of integration. The clustering is performed using the manager's choice of process models for 50 projects. The work can be extended to include many other project characteristics and other models. The research provides insight into how clustering can be of immense utility in optimizing software development using software process model integration. The results of the proposed approach can be validated by collecting real-time data pertaining to real-time projects that employ a combination of models.
REFERENCES
[1] Manish Sharma, A Survey of Project Scenario Impact in SDLC Models Selection Process.
[2] Ujjwal Maulik, Sanghamitra Bandyopadhyay, Genetic algorithm-based clustering technique, Pattern Recognition 33 (2000) 1455-1465.
[3] George F. Luger, Artificial Intelligence, V Edition, Pearson Education.
[4] Dr. Deepshikha Jamwal, Analysis of Software Development Models, IJCST Vol. 1, Issue 2, December 2010.
[5] Pavel Berkhin, Survey of Clustering Data Mining Techniques.
[6] Barry Boehm, Dan Port, Mohammed Al-Said, "Avoiding the Software Model Clash Spiderweb", University of Southern California, 2000, pp. 1-3.
[7] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, 1974.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.
Author
V. Therese Clara received her B.Sc. degree in Physics from Lady Doak College, Madurai, India, in 1994, her Master of Computer Applications degree from St. Joseph's College, Trichy, India, in 1997, and her Master of Philosophy in Computer Science degree from Madurai Kamaraj University, Madurai, India, in 2007. She has been working as an assistant professor in the Department of Computer Science at Madurai Kamaraj University College since 2007.