Introduction to Defect Prediction
Cmpe 589
Spring 2008
Problem 1
 How to tell if the project is on schedule and within budget?
 Earned-value charts.
Problem 2
 How hard will it be for another organization to maintain this software?
 McCabe Complexity
Problem 3
 How to tell when the subsystems are ready to be integrated?
 Defect Density Metrics.
Problem Definition
 Software development lifecycle:
   Requirements
   Design
   Development
   Test (takes ~50% of overall time)
 Detect and correct defects before delivering software.
 Test strategies:
   Expert judgment
   Manual code reviews
   Oracles/predictors as secondary tools
Problem Definition
(Figure: defect prediction as a secondary tool supporting testing)
Defect Prediction
 A 2-class classification problem:
   Non-defective: error = 0
   Defective: error > 0
 Two things are needed:
   Raw data: source code
   Software metrics -> static code attributes
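
As a quick illustration of this framing (a minimal sketch, assuming the per-module attributes sit in a CSV file with an "error" column; the file and column names are hypothetical):

    import pandas as pd

    # One row per module: static code attributes plus the recorded error count.
    data = pd.read_csv("metrics.csv")

    # Features are the static code attributes; the label is derived from the error count.
    X = data.drop(columns=["error"])
    y = (data["error"] > 0).astype(int)   # 1 = defective, 0 = non-defective

    print(f"{y.sum()} defective modules out of {len(y)} ({100 * y.mean():.1f}% defect rate)")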
Static Code Attributes
    void main()
    {
        // This is a sample code

        // Declare variables
        int a, b, c;

        // Initialize variables
        a = 2;
        b = 5;

        // Find the sum and display c if greater than zero
        c = sum(a, b);
        if (c > 0)
            printf("%d\n", c);
        return;
    }

    int sum(int a, int b)
    {
        // Returns the sum of two numbers
        return a + b;
    }

Module   LOC   LOCC   V   CC   Error
main()    16     4    5    2     2
sum()      5     1    3    1     0

LOC: Lines of Code
LOCC: Lines of Commented Code
V: Number of unique operands & operators
CC: Cyclomatic Complexity
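
To make the attribute-extraction step concrete, here is a rough sketch of how a few of these attributes could be approximated for a C source file (this is not the metrics tool used to build the datasets below; the regular expression and the counting rules are simplifying assumptions):

    import re

    def simple_static_attributes(source: str) -> dict:
        """Rough per-file approximations of a few static code attributes."""
        lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
        loc = len(lines)                                      # non-blank lines of code
        locc = sum(1 for ln in lines if ln.startswith("//"))  # commented lines
        # Cyclomatic complexity ~ 1 + number of decision points (very rough).
        decisions = re.findall(r"\b(?:if|for|while|case)\b|&&|\|\|", source)
        cc = 1 + len(decisions)
        return {"LOC": loc, "LOCC": locc, "CC": cc}

    with open("sample.c") as f:          # hypothetical file name
        print(simple_static_attributes(f.read()))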
Defect Prediction
 Machine learning based models.
 Defect density estimation
   Regression models: error-proneness
   First classification, then regression
 Defect prediction between versions
 Defect prediction for embedded systems
Constructing Predictors
 Baseline: Naive Bayes.
   Why? Best reported results so far (Menzies et al., 2007)
 Remove its assumptions and construct different models:
   Independent attributes -> multivariate distributions
   Attributes of equal importance -> weighted attributes
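
A minimal baseline along these lines, assuming the X and y from the earlier sketch and using scikit-learn's GaussianNB (a sketch of the setup, not necessarily the exact procedure in Menzies et al., 2007):

    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB

    # 10-fold cross-validated predictions from a plain Naive Bayes baseline.
    pred = cross_val_predict(GaussianNB(), X, y, cv=10)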
Weighted Naive Bayes
Naive Bayes:

    g_i(x) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log P(C_i)

Weighted Naive Bayes:

    g_i(x) = -\frac{1}{2} \sum_{j=1}^{d} w_j \left( \frac{x_j^t - m_{ij}}{s_j} \right)^2 + \log P(C_i)
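
Here m_{ij} is the mean of attribute j for class C_i, s_j its standard deviation, and w_j the weight of attribute j; with all w_j = 1 the weighted form reduces to plain Naive Bayes. A minimal sketch of this discriminant (assuming NumPy arrays for the class means, standard deviations, weights, and priors; this follows the formula above rather than any particular published implementation):

    import numpy as np

    def weighted_nb_discriminant(x, means, stds, weights, priors):
        """g_i(x) for every class i.

        x: (d,) test instance; means: (k, d) per-class attribute means;
        stds: (d,) attribute standard deviations; weights: (d,); priors: (k,).
        """
        z = (x - means) / stds                               # standardized distances
        return -0.5 * np.sum(weights * z**2, axis=1) + np.log(priors)

    def predict(x, means, stds, weights, priors):
        # Assign the class with the largest discriminant value.
        return int(np.argmax(weighted_nb_discriminant(x, means, stds, weights, priors)))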
Datasets
Name   #Features   #Modules   Defect Rate (%)
CM1        38          505         9
PC1        38         1107         6
PC2        38         5589         0.6
PC3        38         1563        10
PC4        38         1458        12
KC3        38          458         9
KC4        38          125        40
MW1        38          403         9
Performance Measures
Confusion matrix (rows: predicted, columns: actual defects):

                 Actual: no   Actual: yes
Predicted: no        A             B
Predicted: yes       C             D

Accuracy: (A + D) / (A + B + C + D)
Pd (Hit Rate): D / (B + D)
Pf (False Alarm Rate): C / (A + C)
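
These measures are straightforward to compute from predicted and actual labels; a small sketch follows (the "bal" column in the tables below is assumed to be the balance measure, bal = 1 - sqrt(pf^2 + (1 - pd)^2) / sqrt(2), as commonly used alongside pd/pf in this line of work):

    import numpy as np

    def defect_measures(y_true, y_pred):
        """Accuracy, pd (hit rate), pf (false alarm rate) and balance for binary labels."""
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        A = np.sum((y_pred == 0) & (y_true == 0))
        B = np.sum((y_pred == 0) & (y_true == 1))
        C = np.sum((y_pred == 1) & (y_true == 0))
        D = np.sum((y_pred == 1) & (y_true == 1))
        acc = (A + D) / (A + B + C + D)
        pd = D / (B + D)
        pf = C / (A + C)
        bal = 1 - np.sqrt(pf**2 + (1 - pd)**2) / np.sqrt(2)   # assumed definition of "bal"
        return acc, pd, pf, bal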
Results: InfoGain & GainRatio

Data    WNB+IG (%)        WNB+GR (%)        IG+NB (%)
        pd   pf   bal     pd   pf   bal     pd   pf   bal
CM1     82   39   70      82   39   70      83   32   74
PC1     69   35   67      69   35   67      40   12   57
PC2     72   15   77      66   20   72      72   15   77
PC3     80   35   71      81   35   72      60   15   70
PC4     88   27   79      87   24   81      92   29   78
KC3     80   27   76      83   30   76      48   15   62
KC4     77   35   70      78   35   71      79   33   72
MW1     70   38   66      68   34   67      44    7   60
Avg     77   31   72      77   32   72      65   20   61
Results: Weight Assignments
Benefiting from defect data in practice
 Within-company (WC) vs. cross-company (CC) data
   Investigated in the cost estimation literature
   No studies in defect prediction!
   No conclusions in cost estimation...
 Straightforward interpretation of results in defect prediction.
   Possible reason: well-defined features.
How much data do we need?
 Consider:
   Dataset size: 1000
   Defect rate: 8%
   Training instances: 90%
 1000 * 90% = 900 training instances, of which 1000 * 8% * 90% = 72 are defective
 900 - 72 = 828 training instances are non-defective
Intelligent data sampling
 With a random sample of 100 instances we can learn as well as with thousands.
 Can we increase performance with wiser sampling strategies?
   Which data?
 Practical aspects: industrial case study.
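
A sketch of the random-sampling baseline that wiser strategies would have to beat (assuming X and y as NumPy arrays and the defect_measures helper defined earlier):

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)

    # Train on a random subsample of 100 modules, test on the remaining ones.
    idx = rng.permutation(len(y))
    train, test = idx[:100], idx[100:]

    model = GaussianNB().fit(X[train], y[train])
    acc, pd, pf, bal = defect_measures(y[test], model.predict(X[test]))
    print(f"pd={pd:.2f}  pf={pf:.2f}  bal={bal:.2f}")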
WC vs CC Data?
• When to use WC or CC data?
• How much data do we need to construct a model?
Module Structure vs Defect Rate
 Fan-in, fan-out
 PageRank algorithm
 Call graph information on the code
 “small is beautiful”
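
As a sketch of the call-graph idea (using networkx; the graph and module names are made up for illustration):

    import networkx as nx

    # Directed call graph: an edge u -> v means module u calls module v.
    calls = nx.DiGraph([("main", "parse"), ("main", "sum"), ("parse", "sum"), ("report", "sum")])

    fan_in = dict(calls.in_degree())     # number of modules calling each module
    fan_out = dict(calls.out_degree())   # number of modules each module calls
    rank = nx.pagerank(calls)            # PageRank over the call graph

    print(fan_in, fan_out, rank)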
Performance vs. Granularity
(Chart: predictor performance vs. granularity of the prediction unit, from statement through method, class, file, and component up to project level.)
