
Problem Statement: COCOMO Estimation Using PCA and ANN

Working of the Basic Generic Model:


This generic model is based on both algorithmic and non-algorithmic methods.
Step 1: Sizing specifications, source code, and test cases:
The first step in any software estimate is to predict the sizes of the deliverables that must
be constructed. Sizing must include all major deliverables such as specifications, source code,
manuals, documents, and test cases. A variety of sizing methods can be used, such as:
a. Sizing based on function point metrics.
b. Sizing based on Source Lines of Code (SLOC) metrics.
Step 2: Specify the implementation attributes and cost drivers covering the product's
functional and operational characteristics.
2.1 Specify product factors: determine the rating value of required reliability,
reusability, product complexity, etc.
2.2 Specify computer factors: determine the rating value of the execution time
constraint, main storage constraint, etc.
2.3 Specify personnel factors: determine the rating value of analyst
capability, language and tool experience, etc.
2.4 Specify project factors: determine the rating value of the required
development schedule, use of software tools, etc.
Each cost driver is rated as very low, low, nominal, high, or very high. From these rating
values we can calculate the complexity of the project and how many person-months are required
to develop it.
Step 3: Implementation for effort and time:
Use the COCOMO model to calculate the effort and the development time.
Step 4: Estimation of cost:
Cost = Effort * Average salary per unit time. (A short sketch of Steps 2-4 is given after Step 5.)
Step 5: Recording of estimated data:
The next important objective of software estimation and measurement practice is
the recording of estimated data. In other words, in order to save time and bring efficiency and
maturity into the software cost estimation process, the organization should record the estimation
data for comparison and analysis in future projects.
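
As an illustration of Steps 2-4, the sketch below turns a small set of cost driver ratings into an effort adjustment factor and applies the intermediate COCOMO equations. It is a minimal sketch only: the two driver multiplier tables and the organic-mode constants are commonly quoted COCOMO 81 values used here for illustration, the function name `estimate` and its arguments are hypothetical, and a real estimate would use all 15 cost drivers.

```python
# Minimal sketch of Steps 2-4: cost driver ratings -> EAF -> effort, duration, cost.

# Illustrative effort multipliers for two of the 15 cost drivers (Step 2).
COST_DRIVER_MULTIPLIERS = {
    "RELY": {"very low": 0.75, "low": 0.88, "nominal": 1.00, "high": 1.15, "very high": 1.40},
    "CPLX": {"very low": 0.70, "low": 0.85, "nominal": 1.00, "high": 1.15, "very high": 1.30},
}

def estimate(kloc, ratings, avg_salary_per_month):
    """Return (effort in person-months, duration in months, cost)."""
    # Effort Adjustment Factor: product of the selected driver multipliers.
    eaf = 1.0
    for driver, rating in ratings.items():
        eaf *= COST_DRIVER_MULTIPLIERS[driver][rating]

    # Organic-mode constants a, b, c, d (Step 3).
    a, b, c, d = 3.2, 1.05, 2.5, 0.38
    effort = a * (kloc ** b) * eaf      # person-months
    duration = c * (effort ** d)        # calendar months

    # Step 4: cost = effort * average salary per unit time.
    cost = effort * avg_salary_per_month
    return effort, duration, cost

if __name__ == "__main__":
    e, t, cst = estimate(32.0, {"RELY": "high", "CPLX": "nominal"}, avg_salary_per_month=5000)
    print(f"Effort = {e:.1f} PM, Duration = {t:.1f} months, Cost = {cst:,.0f}")
```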














Working of the Proposed Generic Model
Schematic illustration for PCA+ANN


1. The entire data set of N samples is quality filtered, and the dimensionality is then
reduced by PCA to 10 PCA projections from the original M expression values.
2. The N2 test experiments are set aside, and the N1 training experiments are randomly
partitioned into three groups.
3. One of these groups is reserved for validation and the two remaining groups are
used for calibration.
4. ANN models are then calibrated using the 10 PCA values for each sample as input
and the phenotype category as output.
5. For each model the calibration is optimized with a number of iterative cycles
(epochs). This is repeated using each of the three groups for validation.
6. Samples are again randomly partitioned and the entire training process is repeated.
7. For each selection of validation group one model is calibrated, resulting in a total of
3 x K trained models. Once the models are calibrated, they are used to rank the
genes according to their importance for the classification.
8. The entire process (steps 2-7) is repeated using only the top-ranked genes.
9. The N2 test experiments are subsequently classified using all the calibrated models.
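
A minimal sketch of this calibration scheme is given below, assuming scikit-learn is available; the toy data, the value of K, and the majority-vote classification of the test samples are illustrative assumptions rather than part of the original procedure.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))          # N samples x M original values (toy data)
y = rng.integers(0, 2, size=60)         # output category for each sample

# Reduce the M original values to 10 PCA projections.
X10 = PCA(n_components=10).fit_transform(X)

# Set aside N2 test samples; the remaining N1 samples are used for training.
test_idx, train_idx = np.arange(12), np.arange(12, 60)

K = 5                                   # number of random re-partitions (assumed)
models = []
for _ in range(K):
    # Randomly partition the N1 training samples into three groups.
    groups = np.array_split(rng.permutation(train_idx), 3)
    for v in range(3):                  # each group serves once as the validation set
        calib = np.concatenate([groups[g] for g in range(3) if g != v])
        model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500, random_state=0)
        model.fit(X10[calib], y[calib])  # calibrate on the two remaining groups
        models.append(model)             # 3 x K calibrated models in total

# Classify the test samples with all calibrated models (majority vote).
votes = np.mean([m.predict(X10[test_idx]) for m in models], axis=0)
pred = (votes >= 0.5).astype(int)
```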







Schematic Illustration for Proposed System

[Figure: the sample data set supplies Size (2 parameters) and Cost Factors (15 parameters) as
inputs to PCA, which produces a domain matrix for each parameter; the domain matrix is fed to
COCOMO, whose outputs are Effort and Duration.]

Description
We used the following methodology in this study:
1. The input metrics were normalized using min-max normalization. Min-max
normalization performs a linear transformation on the original data.
Suppose that minA and maxA are the minimum and maximum values of an
attribute A. Min-max normalization maps a value v of A to v' in the range 0 to 1
using the formula:

   v' = (v - minA) / (maxA - minA)

2. Perform principal component analysis on the normalized metrics to
produce domain metrics.
3. We divided the data into training, test, and validation sets using a 3:1:1 ratio.
4. Develop the ANN model based on the training and test data sets.
5. Apply the ANN model to the validation data set in order to evaluate the accuracy
of the model.
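
A minimal sketch of Steps 1-3 is shown below, assuming scikit-learn is available; the toy metric matrix, target values, and variable names are hypothetical stand-ins for the actual data set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# X: one row per project, one column per input metric (toy data for illustration).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 17))
y = rng.normal(size=(100, 2))            # targets, e.g. effort and duration

# Step 1: min-max normalization, v' = (v - minA) / (maxA - minA), per column.
mins, maxs = X.min(axis=0), X.max(axis=0)
X_norm = (X - mins) / (maxs - mins)

# Step 2: PCA on the normalized metrics to produce the domain metrics.
domain = PCA().fit_transform(X_norm)

# Step 3: split into training, test, and validation sets in a 3:1:1 ratio.
X_train, X_rest, y_train, y_rest = train_test_split(domain, y, train_size=0.6, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```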


A. Principal Component (P.C.) Analysis
Cost-effort metrics have high correlation with each other. P.C. analysis transforms
the raw metrics into variables that are not correlated with each other; when the original data
are cost-effort metrics, we call the new P.C. variables domain metrics. P.C.
analysis is used to maximize the sum of squared loadings of each factor extracted
in turn. It aims at constructing new variables Pi, called principal
components (P.C.s), out of a given set of variables Xj (j = 1, 2, ..., k):

   Pi = bi1*X1 + bi2*X2 + ... + bik*Xk
All the bij's, called loadings, are worked out in such a way that the extracted P.C.s
satisfy the following two conditions:
(i) the P.C.s are uncorrelated (orthogonal), and
(ii) the first P.C. (P1) has the highest variance, the second P.C. has the next
highest variance, and so on.
The variables with high loadings help identify the dimension a P.C. is capturing,
but this usually requires some degree of interpretation. In order to identify these
variables and interpret the P.C.s, we consider the rotated components. As the
dimensions are independent, orthogonal rotation is used. There are various
strategies to perform such a rotation; we used varimax rotation, which is the most
frequently used strategy in the literature. An eigenvalue (or latent root) is
associated with each P.C.: when we take the sum of the squared loadings relating
to a dimension, that sum is referred to as the eigenvalue. The eigenvalue indicates the
relative importance of each dimension for the particular set of variables being
analyzed. The P.C.s with an eigenvalue greater than 1 are taken for interpretation.
Given an n by m matrix of multivariate data, P.C. analysis can reduce the number
of columns. In our study n represents the number of classes for which cost-effort
metrics have been collected. Using P.C. analysis, the n by m matrix is reduced to an n
by p matrix (where p < m).
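
A minimal sketch of this retention rule (eigenvalue greater than 1) is given below, assuming scikit-learn; the toy data are hypothetical, and the varimax rotation mentioned above would be applied to the retained loadings as a separate step.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X_norm: n classes x m cost-effort metrics (toy data for illustration).
rng = np.random.default_rng(2)
X_norm = rng.normal(size=(50, 17))

# PCA on standardized metrics, so eigenvalues refer to the correlation matrix.
Z = StandardScaler().fit_transform(X_norm)
pca = PCA().fit(Z)
eigenvalues = pca.explained_variance_

# Keep the P.C.s with eigenvalue > 1; this yields the n x p domain-metric matrix (p < m).
p = int(np.sum(eigenvalues > 1.0))
domain_metrics = pca.transform(Z)[:, :p]

# Loadings of the retained components; varimax rotation would be applied here
# to help identify which raw metrics each domain metric captures.
loadings = pca.components_[:p].T * np.sqrt(eigenvalues[:p])
print(f"Retained {p} of {Z.shape[1]} components")
```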

B. ANN Modeling
The network used in this work belongs to the class of multilayer feed-forward networks and is
referred to as an M-H-Q network, with M source nodes, H nodes in the hidden layer, and Q
nodes in the output layer. The input nodes are connected to every node of the
hidden layer but are not directly connected to the output nodes; thus the network
has no lateral or shortcut connections. The ANN repetitively adjusts the
weights so that the difference between the desired output and the actual
output of the network is minimized. The network learns by finding a vector of
connection weights that minimizes the sum of squared errors on the training data
set. A summary of the ANN used in this study is shown in the table below.

Architecture
  Layers: 3
  Input units: 17
  Hidden units: 170
  Output units: 2
Training
  Feature selection: PCA
  Algorithm: Back propagation

The ANN was trained with the standard error back-propagation algorithm at a
learning rate of 0.005, using the minimum squared error as the stopping criterion
for training.
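
A minimal sketch of the 17-170-2 network summarized in the table, assuming scikit-learn's back-propagation-trained multilayer perceptron; the toy training data and the tolerance used as a stand-in for the minimum-squared-error stopping criterion are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy stand-ins for the 17 domain metrics and the 2 targets (effort, duration).
rng = np.random.default_rng(3)
X_train, y_train = rng.normal(size=(80, 17)), rng.normal(size=(80, 2))
X_test, y_test = rng.normal(size=(20, 17)), rng.normal(size=(20, 2))

# 17 input units, one hidden layer of 170 units, 2 output units, trained by
# back-propagation (SGD) at a learning rate of 0.005.
ann = MLPRegressor(
    hidden_layer_sizes=(170,),
    solver="sgd",
    learning_rate_init=0.005,
    max_iter=2000,
    tol=1e-6,          # assumed stand-in for the minimum-squared-error stop rule
    random_state=0,
)
ann.fit(X_train, y_train)
predicted = ann.predict(X_test)
```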

C. Performance Evaluation
In this system the main measure used for evaluating model performance is the
Mean Absolute Relative Error (MARE). MARE is the preferred error measure of
software measurement researchers and is calculated as follows:

   MARE = (1/n) * Σ |estimate_i - actual_i| / actual_i

where estimate is the network output for each observation and n is the number of
observations. To assess whether models are biased and tend to over- or under-estimate,
the Mean Relative Error (MRE) is calculated as follows:

   MRE = (1/n) * Σ (estimate_i - actual_i) / actual_i

A large positive MRE suggests that the model overestimates the number of
lines changed per class, whereas a large negative value indicates the reverse.
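
A minimal sketch of the two error measures, assuming NumPy; the example values are hypothetical.

```python
import numpy as np

def mare(actual, estimate):
    """Mean Absolute Relative Error: average of |estimate - actual| / actual."""
    actual, estimate = np.asarray(actual, float), np.asarray(estimate, float)
    return np.mean(np.abs(estimate - actual) / actual)

def mre(actual, estimate):
    """Mean Relative Error: positive values indicate over-estimation."""
    actual, estimate = np.asarray(actual, float), np.asarray(estimate, float)
    return np.mean((estimate - actual) / actual)

# Example with hypothetical actual and estimated values.
actual = [100, 250, 80]
estimate = [110, 230, 95]
print(f"MARE = {mare(actual, estimate):.3f}, MRE = {mre(actual, estimate):+.3f}")
```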
