
ACCOUNTING FOR CONTEXT AND LIFETIME FACTORS: A NEW APPROACH FOR EVALUATING REGRESSION TESTING TECHNIQUES

by

Hyunsook Do

A DISSERTATION

Presented to the Faculty of The Graduate College at the University of Nebraska In Partial Fulfillment of Requirements For the Degree of Doctor of Philosophy

Major: Computer Science

Under the Supervision of Professor Gregg Rothermel

Lincoln, Nebraska

May, 2007

ACCOUNTING FOR CONTEXT AND LIFETIME FACTORS: A NEW APPROACH FOR EVALUATING REGRESSION TESTING TECHNIQUES

Hyunsook Do, Ph.D.
University of Nebraska, 2007

Advisor: Gregg Rothermel

Regression testing is an expensive testing process performed on modified software to provide confidence that the software behaves correctly, and that modifications have not impaired its quality. Regression testing is widely used in industry; however, it is often performed inadequately. As a result, software quality and reliability can decrease over the software's lifetime. To address this problem, researchers have spent a great deal of effort creating and studying various methodologies for improving the cost-effectiveness of regression testing. To compare and assess such methodologies, researchers relied initially on analytical approaches. More recently, however, focus has shifted to empirical studies.

Empirical studies of regression testing techniques have greatly expanded our understanding of techniques and the factors that affect them, but to date, these studies have also suffered from several limitations which limit the extent to which their results may generalize to practice. (1) Most studies have considered only a few context factors (characteristics of the environment or engineering processes that may affect technique performance). (2) Prior studies have calculated costs and benefits using a snapshot view in which results are considered strictly per system version; this approach, however, ignores the fact that methodologies may exhibit different cost-benefit tradeoffs when assessed across entire system lifetimes than when assessed

relative to individual versions. (3) Previous studies have largely ignored cost-benefit tradeoffs, relying on comparisons strictly in terms of simple benefit and cost factors, using cost-benefit models that are naive in their handling of important revenue and cost components, or using metrics that render comparisons across specific types of techniques impossible.

Limitations such as these make it difficult or impossible to accurately compare and assess regression testing methodologies relative to practical software engineering contexts. Moreover, they can lead researchers and practitioners to inaccurate conclusions about the relative cost-effectiveness of techniques in practice, or the suitability of particular techniques to particular engineering processes.

This dissertation addresses these limitations. First, we surveyed the state of the art of empirical studies of regression testing techniques and identified problems with evaluation methods and processes, and problems related to infrastructure required for empirical studies. Second, we developed infrastructure to support empirical studies of regression testing considering a wide variety of software artifacts. Third, using the infrastructure developed in the second step, we conducted several initial empirical studies on regression testing techniques. Fourth, we developed a cost-benefit model to assess the cost-effectiveness of regression testing techniques considering system lifetime and context factors. Finally, we conducted an empirical study assessing regression testing techniques using this cost-benefit model.

Through our work, we provide several important advantages for practitioners and researchers. For practitioners, we provide new practical understanding of regression testing techniques. For researchers, we provide a new cost-benefit model that can be used to compare and empirically evaluate regression testing techniques, and that accounts for testing context and system lifetime factors. We identify problems involving infrastructure, and provide infrastructure that can help researchers conduct various

controlled experiments considering a wide variety of software artifacts. Finally, we provide a better understanding of empirical methodologies that can be used by other researchers to make further progress in this area.

ACKNOWLEDGEMENTS

I would like to thank the following people for their assistance and support during the course of my doctoral studies. My first thanks go to my advisor, Gregg Rothermel, who provided his unselfish commitment of time, valuable advice, constant support, and encouragement. I want to thank Sebastian Elbaum for his help and valuable advice in developing ideas regarding software-artifact infrastructure, and also for serving on my committee. I would like to thank Myra Cohen for providing me an opportunity to teach software engineering classes, for her time and valuable feedback on my teaching, and also for serving on my committee. I thank Fred Choobineh for serving on my committee. I thank Alex Orso, Mary Jean Harrold, David Rosenblum, and Mary Lou Soffa, for collaborating on research involving component metadata. I would like to thank Alexey Malishevsky, Alex Kinneer, Vasanth Williams, Weiyun Wu, and Soumya Chattopadhyay for providing and helping with various tools and software artifact preparation, which supported several of my experiments in this dissertation. I would like to thank Fred Ramsey, Steve Kachman, and Lorin Hochstein for providing assistance with statistical analyses for my experimentation results. I thank the National Science Foundation for supporting my research with grants. Also, I thank the Department of Computer Science and Engineering at the University of Nebraska - Lincoln for providing the Jensen Chair Graduate Research Assistant Fellowship. I give thanks to my mother, Youngja Cheon, for absolutely everything. Finally, I would like to thank my husband, Chulho Won, and my dearest son, Sean Won, for understanding and supporting me during my doctoral studies.

Contents
1 Introduction
  1.1 Introduction
  1.2 Goals of this Research
  1.3 Overview of this Dissertation

2 Background and Related Work
  2.1 Regression Testing
      2.1.1 Regression Test Selection
      2.1.2 Test Case Prioritization
  2.2 Cost-benefit Models for Regression Testing
  2.3 Empirical Studies

3 The State of the Art in Empirical Studies of Testing, and Assessment of and Guidelines for Studies
  3.1 A Survey of Studies of Testing
  3.2 A Survey of Literature on Empirical Studies: Reviews and Guidelines
      3.2.1 A Survey of Reviews of Empirical Studies
      3.2.2 A Survey of Experimentation Guidelines
      3.2.3 Observations and Problem Areas
  3.3 Criteria for the Evaluation of Empirical Studies of Testing Techniques
  3.4 The State of the Art in Controlled Experiments of Testing
  3.5 Conclusions

4 Infrastructure Support for Empirical Studies
  4.1 A Survey of Studies of Testing: Infrastructure Use
  4.2 Challenges for Experimentation
  4.3 Infrastructure
      4.3.1 Object Selection, Organization, and Setup
      4.3.2 Documentation and Supporting Tools
      4.3.3 Sharing and Extending the Infrastructure
      4.3.4 Examples of the Infrastructure Being Used
      4.3.5 Threats to Validity: Things to Keep in Mind When Using the Infrastructure
  4.4 Conclusion

5 Empirical Studies of Regression Testing Techniques
  5.1 Empirical Studies of Test Case Prioritization in a JUnit Testing Environment
      5.1.1 Study Overview
      5.1.2 JUnit Testing and Prioritization
      5.1.3 Experiments
      5.1.4 Discussion
      5.1.5 Cost-Benefits Analysis
      5.1.6 Conclusions
  5.2 Using Component Metadata to Regression Test Component-Based Software
      5.2.1 Study Overview
      5.2.2 Component Metadata-based Regression Test Selection
      5.2.3 Empirical Study of Code-based Regression Test Selection
      5.2.4 Empirical Study of Specification-based Regression Test Selection
      5.2.5 Discussion
  5.3 On the Use of Mutation Faults in Empirical Assessments of Test Case Prioritization Techniques
      5.3.1 Study Overview
      5.3.2 Mutation
      5.3.3 Experiment 1
      5.3.4 Experiment 2
      5.3.5 Discussion
      5.3.6 Conclusions

6 A Cost-benefit Model for Evaluating Regression Testing Techniques
  6.1 Regression Testing Process
  6.2 Costs Modelled
  6.3 A Cost-Benefit Model
  6.4 Evaluating and Comparing Techniques

7 Empirical Evaluation of Regression Testing Techniques using our new Cost-benefit Model
  7.1 Study Overview
  7.2 Experiment
      7.2.1 Research Questions
      7.2.2 Objects of Analysis
      7.2.3 Variables and Measures
      7.2.4 Experiment Setup
      7.2.5 Threats to Validity
      7.2.6 Data and Analysis
  7.3 Discussion
  7.4 Conclusions

8 Conclusions and Future Directions
  8.1 Merit and Impact of This Research
  8.2 Future Directions

Bibliography

List of Figures
2.1  Example illustrating the APFD measure.
3.1  The percentage of empirical studies: example, case study and experiment.
4.1  Object directory structure (top level).
4.2  Fault localization guidelines for C programs.
5.1  JUnit test suite structure.
5.2  JUnit framework and Galileo.
5.3  Overview of experiment process.
5.4  APFD boxplots, all programs. The horizontal axes list techniques, and the vertical axes list APFD scores. The left column presents results for test-class level test cases and the right column presents results for test-method level test cases. See Table 5.2 for a legend of the techniques.
5.5  Steps for gathering method profiling metadata.
5.6  Component diagram for VendingMachine and Dispenser.
5.7  Statechart specification of VendingMachine.
5.8  Statechart specification of Dispenser.
5.9  Application VendingMachine.
5.10 Component Dispenser.
5.11 Global statechart for VendingMachine and Dispenser.
5.12 Statechart specification of Dispenser.
5.13 Global statechart for VendingMachine and Dispenser.
5.14 Test selection results for the NO-META (black), META-C (dark grey), META-M (light grey), and META-S (white) techniques.
5.15 Normalized statechart for VendingMachine.
5.16 Normalized statechart for Dispenser.
5.17 Selective mutant generation process.
5.18 APFD boxplots, all programs, all techniques. The horizontal axes list techniques, and the vertical axes denote APFD scores. The plots on the left present results for test-class level test cases and the plots on the right present results for test-method level test cases. See Table 5.2 for a legend of the techniques.
5.19 APFD boxplots, all techniques for galileo (left) and nanoxml (right). The horizontal axes list techniques, and the vertical axes list APFD scores. The upper row presents results for mutation faults and the lower row presents results for hand-seeded faults. See Table 5.2 for a legend of the techniques.
5.20 APFD boxplots, all programs, for results with hand-seeded faults (replicated from [36]). The horizontal axes list techniques, and the vertical axes list fault detection rate.
5.21 Fault detection ability boxplots for selected small test suites across all program versions. The horizontal axes list techniques, and the vertical axes list fault (or mutant) detection ratios.
6.1  Maintenance and regression testing cycle.
7.1  Cost factor scenarios.

List of Tables
3.1  Research Papers Involving Testing and Empirical Studies in Six Major Venues, 1994-2003 (P: the total number of papers, T: the number of papers about software testing, EST: the number of papers about software testing that contained the empirical study).
3.2  The Numeric Data for Figure 3.1.
3.3  The Experiment Processes.
3.4  Summary of Testing Techniques.
3.5  Replicability Checklist.
4.1  Further classification of published empirical studies.
4.2  Challenges and Infrastructure.
4.3  Objects in our Infrastructure.
5.1  Experiment objects.
5.2  Test case prioritization techniques.
5.3  Mean Value and Standard Deviation (SD) of APFD Scores in Figure 5.4.
5.4  Wilcoxon Rank-Sum Test, Untreated (T1) vs Non-control Techniques (T4-T9), ant.
5.5  Wilcoxon Rank-Sum Test, Random (T2) vs Non-control Techniques (T4-T9), ant.
5.6  Wilcoxon Rank-Sum Test, All Non-control Techniques (T4-T9), ant.
5.7  Wilcoxon Rank-Sum Test, All Non-control Techniques (T4-T9), Test-Class Level vs Test-Method Level, ant.
5.8  Instrumentation Points per Method in Java.
5.9  Instrumentation Points per Function in C.
5.10 Comparisons: Untreated vs. Heuristics.
5.11 Comparisons: Random vs. Heuristics.
5.12 Comparisons: Between Heuristics.
5.13 Costs and Savings Data for Prioritization on Our Object Programs, Considering Block-addtl and Method-addtl Techniques versus Untreated Test Orders and Test-method Level Test Suites.
5.14 A Test Suite for VendingMachine.
5.15 Branches for VendingMachine and Dispenser.
5.16 Branch Coverage for Component Dispenser.
5.17 Testing Requirements for VendingMachine-Dispenser.
5.18 Execution Times (Hours:Minutes:Seconds) for Test Cases Selected by Techniques.
5.19 ANOVA for Test Time.
5.20 Comparisons Between NO-META and each META.
5.21 Comparisons Between Pairs of META Techniques.
5.22 The Number of States and Transitions for Individual and Global Statechart Diagrams Per Version. Each Entry Takes the Form: Number of states (Number of transitions).
5.23 Testing Requirement Selection Rates.
5.24 Object Programs Used in Prioritization Studies.
5.25 Results of Prior Prioritization Studies: Measured Using the APFD Metric, for all Programs Except OA.
5.26 Prioritization Results Grouped by Test Case Source: Measured Using the APFD Metric, for all Programs Except OA. To Facilitate Interpretation, the Last Row Indicates the Average Number of Faults per Version.
5.27 Mutation Operators for Java Bytecode.
5.28 Experiment Objects and Associated Data.
5.29 Experiment 1: Kruskal-Wallis Test Results, per Program.
5.30 Experiment 1: Bonferroni Analysis, All Programs, Test-class Level Granularity.
5.31 Experiment 1: Bonferroni Analysis, All Programs, Test-method Level Granularity.
5.32 Experiment Objects and Associated Data.
5.33 Experiment 2: Kruskal-Wallis Test Results, per Program.
5.34 Experiment 2: Bonferroni Analysis, per Program.
5.35 Experiment Objects and Associated Data.
7.1  Experiment Objects and Associated Data.
7.2  Relative Benefits Between Technique Pairs (dollars).

Chapter 1 Introduction
1.1 Introduction

As software is maintained, software engineers regression test it to validate new features and detect whether corrections and enhancements have introduced new faults into previously tested code. Such regression testing plays an integral role in maintaining the quality of subsequent releases of software, but it is also expensive, accounting for a large proportion of the costs of software [82, 97]. For this reason, researchers have studied various techniques for improving the cost-effectiveness of regression testing, such as regression test selection [23, 119], test suite minimization [22, 56, 96], and test case prioritization [42, 122, 141].

Initially, research on regression testing, like research on testing in general, relied primarily on analytical approaches to assess different techniques (e.g., [82, 118]). Testing techniques are heuristics, however, and to properly assess their cost-effectiveness on actual systems, empirical studies are essential. More recent research on regression testing, therefore (e.g., [33, 42, 72, 73, 86]), has emphasized empirical studies of techniques. A common way to conduct such studies has been to collect several software systems with multiple versions, and for each such system, find or create one or more test suites and locate or seed faults in these versions. Next, the techniques being studied are applied to each of the versions and test suites, and their results are assessed using metrics involving testing effort (typically in terms of numbers of test cases or time required for testing) and testing effectiveness (typically in terms of faults revealed).

Empirical studies such as these have allowed researchers to compare regression testing techniques in terms of costs and benefits. However, studies to date¹ suffer from several limitations in their abilities to assess cost-benefit tradeoffs relative to practical testing situations.

The first limitation pertains to context factors: characteristics of the environments
¹ Chapter 3 presents results of a survey of the literature on regression testing, and an analysis of that work and its limitations.

and engineering processes within which specific techniques are employed that may affect their cost-effectiveness. Previous studies have considered only a few context factors when assessing techniques. Most studies have considered differences in programs and regression testing techniques, but none have considered costs of other essential testing activities such as test setup and obsolete test identification, or collection and maintenance of resources (e.g., test coverage information) needed for retesting. Further, only a few studies have considered the effects of resource constraints on testing cost-effectiveness, or considered different testing processes. Depending on the regression testing process that an organization uses, or on resource availability in testing, the costs and benefits associated with testing techniques can differ.

For example, suppose organization A begins regression testing only when a release is feature-complete, and a period of software maintenance lasting several days, weeks, or months precedes this point. Suppose further that organization B uses a nightly build-and-test process, in which the testing and product development phases cannot reasonably be treated separately, and in which very short periods of maintenance occur. In the case of organization A, regression testing costs can be distributed across development and testing phases, since analyses required to apply regression testing techniques in the upcoming release can be accomplished during the maintenance period. In the case of organization B, however, all regression testing costs are incurred in each testing step because the maintenance period is not long enough to allow collection of the analysis information required for regression testing. Without considering differences in testing processes such as these, we cannot provide realistic assessments of regression testing techniques.

The second limitation pertains to lifetime factors. Previous studies have calculated costs and benefits independently per system version. This snapshot view of costs and benefits masks the fact that regression testing techniques are applied repeatedly across system lifetimes; that is, it ignores the software evolution setting in which regression testing occurs. This evolution setting, however, can fundamentally alter various characteristics of the testing process, and the cost-benefit tradeoffs for techniques across entire lifetimes may be more relevant for choosing a technique than the tradeoffs on single releases. For example, engineers may be able to incrementally adjust their testing activities based on results of prior testing sessions, reusing rather than recomputing particular relevant information, or relying on knowledge gained from those prior sessions. The snapshot approach essentially relies on a limited model of the economic value of regression testing techniques, a model that does not sufficiently capture factors related to the long-term utilization of techniques.

The third limitation pertains to cost-benefit models. Previous studies have typically not utilized such models at all, instead measuring only a few simple benefits or costs. Costs of missed faults and human time, and tradeoffs involving product revenue, have not been considered. Moreover, different techniques have often been evaluated using different metrics, rendering their relative performance incomparable. A few researchers have utilized simple cost-benefit models for regression testing, but in these, costs have been ignored or calculated solely in terms of time or numbers of

faults missed, and benefits have been calculated solely in terms of reduced test suite size or increased rate of fault detection.²

1.2 Goals of this Research

This dissertation addresses these limitations. Our ultimate objective is to improve the effectiveness and reduce the cost of regression testing. Toward this goal, our thesis is that empirical evaluations of testing techniques that do not consider software evolution setting and testing context can fail to account for factors that influence the costs and benefits of regression testing techniques across a system's lifetime. Ultimately, this failure can lead to inaccurate conclusions about the relative cost-effectiveness of techniques, and inappropriate decisions by engineers relying on such conclusions to select techniques for particular projects.

If this thesis is true, then researchers who empirically investigate regression testing techniques, and practitioners who might act on the results of those investigations, would be better served by empirical investigations founded on system lifetime-based views, evaluation schemes that consider different testing contexts, and cost models that appropriately account for both system lifetime and context factors.

In the process of addressing this thesis, we found that several preliminary tasks were required. First, we surveyed the state of the art of empirical studies of regression testing techniques and identified problems with evaluation methods and processes, and problems related to infrastructure required for empirical studies. Second, we developed infrastructure to support empirical studies of regression testing considering a wide variety of software artifacts. Third, using the infrastructure developed in the second step, we conducted several initial empirical studies on regression testing techniques. Fourth, we developed a cost-benefit model to assess the cost-effectiveness of regression testing techniques considering system lifetime and context factors. Finally, we conducted an empirical study assessing regression testing techniques using this cost-benefit model.

1.3 Overview of this Dissertation

The remainder of this dissertation is organized as follows. In Chapter 2, we provide background information on regression testing, cost-benefit models, and empirical studies. In Chapter 3, we present the state of the art of empirical studies of regression testing techniques. Through this chapter we show how researchers have evaluated regression testing techniques to date, and identify problems that researchers have overlooked in the evaluation processes.
² Section 2.2 describes prior work on cost-benefit models in detail.

In Chapter 4, we discuss designing and constructing infrastructure that supports experimentation with regression testing techniques. Investigating proper evaluation models and methodologies requires us to conduct a family of controlled experiments. These experiments require a wide variety of software artifacts; this step enables us to proceed with these experiments. In Chapter 5, we present several controlled experiments on regression testing techniques that involve various artifacts from the infrastructure, such as different types of test suites and different types of faults. These experiments help us adjust and improve our infrastructure, test the validity of empirical approaches, and incorporate and investigate initial aspects of factors and test processes in the evaluation process. In Chapter 6, we present a cost-benefit model that can be used to compare and empirically evaluate regression testing techniques, and that accounts for testing context and system lifetime factors. In Chapter 7, we present a controlled experiment, which provides an initial view of our cost-benefit model's feasibility, and helps us assess our model. In Chapter 8, we discuss conclusions and future directions of this research.

Chapter 2 Background and Related Work


In this chapter, we provide background information on regression testing, cost-benefit models for regression testing, and empirical studies that investigated regression testing problems.

2.1 Regression Testing

Let P_i be a release of program P, let P_{i+1} be a modified version of P_i, and let T_i be a test suite for P_i. Regression testing attempts to validate P_{i+1}. A typical approach by which engineers do this is as follows:

1. Identify obsolete¹ test cases in T_i, and repair or discard them, yielding T_{i+1}^r.
2. Select T_{i+1}^s ⊆ T_{i+1}^r, a set of test cases to execute on P_{i+1}.
3. Test P_{i+1} with T_{i+1}^s, to establish the correctness of P_{i+1} with respect to T_{i+1}^s.
4. If necessary, create T_{i+1}^n, a set of new functional or structural test cases for P_{i+1}.
5. Test P_{i+1} with T_{i+1}^n, to establish the correctness of P_{i+1} with respect to T_{i+1}^n.
6. Create T_{i+1}^f, a final test suite for P_{i+1}, from T_{i+1}^r, T_{i+1}^s, and T_{i+1}^n, as a test suite to carry forward in testing future releases of P.

In common practice [97], engineers often simply reuse all non-obsolete test cases in T_i to test P_{i+1}. This approach corresponds to letting T_{i+1}^s = T_{i+1}^r in Step 2 and is known as the retest-all technique [82]. The retest-all technique can be very expensive; for example, Srivastava et al. [132] cite a case in which an office productivity application of 1.8 million lines of code has a test suite of 3128 test cases that require over four days to run. For this reason, many researchers have addressed regression testing problems and proposed various techniques for improving the cost-effectiveness of regression testing, such as regression
¹ A test case in T_i is obsolete for P_{i+1} if it can no longer be applied to P_{i+1} (e.g., due to changes in inputs), is no longer needed to test P_{i+1} (e.g., due to being designed solely for code coverage of P_i, and now being redundant in coverage on P_{i+1}), or if its expected output on P_{i+1} differs (e.g., due to specification changes) from its expected output on P_i.

test selection [23, 119], test suite minimization [22, 56, 96], and test case prioritization [42, 122, 141]. The following subsections provide more detailed descriptions of regression test selection and test case prioritization techniques.

2.1.1 Regression Test Selection

Let P be a program, let P' be a modified version of P, and let T be a test suite developed for P. Regression testing attempts to validate P'. To facilitate regression testing, test engineers typically attempt to re-use T to the maximum extent possible. However, rerunning all the test cases in T can be expensive, and when only small portions of P have been modified, may involve unnecessary work. Regression test selection (RTS) techniques attempt to reduce unnecessary regression testing, increasing the efficiency of revalidation. RTS techniques (e.g., [13, 17, 23, 47, 57, 100, 119, 135, 136, 139]) use information about P, P', and T to select a subset of T with which to test P'. Most of these techniques are code-based, using information about code changes to guide the test selection process. A few techniques [17, 136], however, are specification-based, relying on some form of specification instead of code. (Rothermel and Harrold [118] survey RTS techniques.) Empirical studies [23, 51, 121, 120] have shown that these techniques can be cost-effective.

One important facet of RTS techniques involves safety. Safe RTS techniques (e.g., [23, 57, 119, 135]) guarantee that, given that certain preconditions are met, test cases not selected could not have exposed faults in P' [118]. Informally, these preconditions require that: (1) the test cases in T are expected to produce the same outputs on P' as they did on P; i.e., the specifications for these test cases have not changed; and (2) test cases can be executed deterministically, holding all factors that might influence test behavior constant with respect to their states when P was tested with T. This notion of safety is defined formally, and these preconditions are expressed more precisely, by Rothermel and Harrold [118].

Two other facets of RTS techniques involve precision and efficiency. Precision concerns the extent to which techniques correctly deduce that specific test cases need not be re-executed. Efficiency concerns the cost of collecting the data necessary to execute an RTS technique, and the cost of executing that technique. RTS techniques that operate at different levels of granularity (for example, analyzing code and test coverage at the level of functions rather than statements) exploit tradeoffs between precision and efficiency, and their relative cost-benefits vary with characteristics of programs, modifications, and test suites [10].
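To make the code-based approach concrete, the following minimal sketch (illustrative only, not any particular published RTS technique, and ignoring the safety preconditions just discussed) selects every test case whose coverage on P includes at least one entity changed in producing P'; whether entities are statements or functions is the caller's choice, reflecting the granularity tradeoff described above. All identifiers and data are hypothetical.

    # Illustrative coverage-based regression test selection (hypothetical names).
    # 'coverage' maps each test id to the set of code entities (statements or
    # functions) it exercised on P; 'changed' is the set of entities modified
    # between P and P'.
    def select_tests(coverage, changed):
        # A test is re-run if it exercised at least one changed entity;
        # coarser entities give cheaper analysis but less precise selection.
        return {t for t, covered in coverage.items() if covered & changed}

    coverage = {"t1": {"parse", "dispense"},   # hypothetical coverage data
                "t2": {"parse", "report"},
                "t3": {"report"}}
    print(sorted(select_tests(coverage, changed={"dispense"})))  # ['t1']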

2.1.2 Test Case Prioritization

Test case prioritization techniques [42, 122, 141] schedule test cases in an execution order according to some criterion. The test case prioritization problem is formally defined as follows [122]:

The Test Case Prioritization Problem:

Given: T, a test suite; PT, the set of permutations of T; and f, a function from PT to the reals.
Problem: Find T' ∈ PT such that (∀ T'')(T'' ∈ PT)(T'' ≠ T') [f(T') ≥ f(T'')].

In this definition, PT represents the set of all possible prioritizations (orderings) of T, and f is a function that, applied to any such ordering, yields an award value for that ordering. The purpose of test case prioritization is to increase the likelihood that if the test cases are used for regression testing in the given order, they will more closely meet some objective than they would if they were executed in some other order. For example, testers might schedule test cases in an order that achieves code coverage at the fastest rate possible, exercises features in order of expected frequency of use, or increases the likelihood of detecting faults early in testing.

Depending on the types of information available for programs and test cases, and the way in which those types of information are used, various test case prioritization techniques can be employed. One way in which techniques can be distinguished involves the type of code coverage information they use. Test cases can be prioritized in terms of the number of statements, basic blocks, or methods they executed on a previous version of the software. For example, a total block coverage prioritization technique simply sorts test cases in the order of the number of basic blocks (single-entry, single-exit sequences of statements) they covered, resolving ties randomly.

A second way in which prioritization techniques can be distinguished involves the use of feedback. When prioritizing test cases, if a particular test case has been selected as next best, information about that test case can be used to re-evaluate the value of test cases not yet chosen prior to selecting the next test case. For example, additional block coverage prioritization iteratively selects a test case that yields the greatest block coverage, then adjusts the coverage information for the remaining test cases to indicate coverage of blocks not yet covered, and repeats this process until all blocks coverable by at least one test case have been covered. This process is then repeated on remaining test cases.

A third way in which prioritization techniques can be distinguished involves their use of information about code modifications. For example, the amount of change in a code element can be factored into prioritization by weighting the elements covered using a measure of change.² Other dimensions along which prioritization techniques can be distinguished that have been suggested in the literature [41, 42, 72] include test cost estimates, fault severity estimates, estimates of fault propagation probability, test history information, and usage statistics obtained through operational profiles.

Most prioritization techniques proposed to date focus on increasing the rate of fault detection of a prioritized test suite.

² Section 5.1 provides more detailed descriptions of these three types of prioritization techniques.
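To make the distinction between these heuristic families concrete, the sketch below (illustrative code, assuming per-test block coverage from a previous version is available as sets of block ids; all names and data are hypothetical) contrasts total block coverage prioritization with the feedback-based additional block coverage prioritization described above.

    # Illustrative sketches of total and additional block coverage prioritization.
    # 'coverage' maps each test id to the set of basic-block ids it covered on a
    # previous version of the software.
    def total_coverage_order(coverage):
        # Sort tests by the number of blocks covered, most first (ties arbitrary).
        return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

    def additional_coverage_order(coverage):
        all_blocks = set().union(*coverage.values())
        order, remaining, uncovered = [], dict(coverage), set(all_blocks)
        while remaining:
            # Feedback step: pick the test adding the most not-yet-covered blocks.
            best = max(remaining, key=lambda t: len(remaining[t] & uncovered))
            gain = remaining[best] & uncovered
            if not gain and uncovered != all_blocks:
                # Every coverable block is covered: reset the coverage
                # information and repeat the process on the remaining tests.
                uncovered = set(all_blocks)
                continue
            order.append(best)
            uncovered -= gain
            del remaining[best]
        return order

    coverage = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {5}}   # hypothetical data
    print(total_coverage_order(coverage))        # ['t1', 't2', 't3']
    print(additional_coverage_order(coverage))   # ['t1', 't2', 't3']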

To measure rate of fault detection, we use a metric, APFD (Average Percentage Faults Detected), introduced for this purpose in [42], that measures the weighted average of the percentage of faults detected over the life of a test suite. APFD values range from 0 to 100; higher numbers imply faster (better) fault detection rates. More formally, let T be a test suite containing n test cases, and let F be a set of m faults revealed by T. Let TF_i be the first test case in ordering T' of T that reveals fault i. The APFD for test suite ordering T' is given by the equation:

    APFD = 1 - (TF_1 + TF_2 + ... + TF_m) / (n m) + 1 / (2n)

We illustrate this metric using an example reproduced from [42]. Consider a program with a test suite of ten test cases, A through J, such that the program contains eight faults detected by those test cases, as shown by the table in Figure 2.1.a. Consider two orders of these test cases, order T1: ABCDEFGHIJ, and order T2: IJEBCDFGHA. Figures 2.1.b and 2.1.c show the percentages of faults detected versus the fraction of the test suite used, for these two orders, respectively. The areas inside the inscribed rectangles (dashed boxes) represent the weighted percentage of faults detected over the corresponding fraction of the test suite. The solid lines connecting the corners of the inscribed rectangles interpolate the gain in the percentage of detected faults. The area under the curve thus represents the weighted average of the percentage of faults detected over the life of the test suite.

On test order T1 (Figure 2.1.b) the first test case executed (A) detects no faults, but after running test case B, two of the eight faults are detected; thus 25% of the faults have been detected after 0.2 of test order T1 has been used. After running test case C, one more fault is detected and thus 37.5% of the faults have been detected after 0.3 of the test order has been used. Test order T2 (Figure 2.1.c), in contrast, is a much faster detecting test order than T1: the first 0.1 of the test order detects 62.5% of the faults, and the first 0.3 of the test order detects 100%. (T2 is in fact an optimal ordering of the test suite, resulting in the earliest detection of the most faults.) The resulting APFDs for the two test case orders are 43.75% and 90.0%, respectively.
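The APFD computation can be carried out directly from this definition. The sketch below is illustrative code (not part of the original studies), and the small four-test example at the bottom is hypothetical rather than the ten-test suite of Figure 2.1:

    # Compute APFD for a test ordering. 'order' lists test ids in execution
    # order; 'detects' maps each test id to the set of faults it reveals.
    def apfd(order, detects):
        n = len(order)
        m = len(set().union(*(detects[t] for t in order)))
        # TF_i: 1-based position of the first test case revealing fault i.
        tf = {}
        for position, test in enumerate(order, start=1):
            for fault in detects[test]:
                tf.setdefault(fault, position)
        return 1 - sum(tf.values()) / (n * m) + 1 / (2 * n)

    detects = {"A": set(), "B": {1, 2}, "C": {3}, "D": {2}}   # hypothetical data
    print(apfd(["A", "B", "C", "D"], detects))   # 0.5416..., a slow ordering
    print(apfd(["B", "C", "A", "D"], detects))   # 0.7916..., faster detection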

2.2 Cost-benefit Models for Regression Testing

Empirical evaluations of the foregoing regression testing methodologies to date have relied, implicitly or explicitly, on relatively simple models of costs and/or benefits. Leung and White [83] present a cost model that considers some of the factors (testing time, technique execution time) that affect the cost of regression testing a software system, but their model does not consider benefits. Malishevsky et al. [86] extend this work with cost models for regression test selection and test case prioritization that incorporate benefits related to omission of faults and rate of fault detection.

[Figure 2.1 appears here. Part (a) tabulates the test suite (test cases A-J) and the eight faults each test case exposes; parts (b) and (c) plot Percent Detected Faults against Test Suite Fraction for test case order T1 (APFD = 43.75%) and test case order T2 (APFD = 90.00%), respectively.]

Figure 2.1: Example illustrating the APFD measure.

Harrold et al. [58] present a coverage-based predictive model of regression test selection effectiveness, but this predictor focuses only on reduction in numbers of test cases. There has been some work on models for testing (as distinct from regression testing). Muller and Padberg [91] present an economic model for the return on investment of test-driven development (TDD) compared to conventional development, provide a cost-benefit analysis, and identify a break-even point at which TDD becomes beneficial over conventional development. However, the analysis in [91] is based on an example, not on empirical data; thus, it has limitations as an exploration of the business value of TDD in general. More recently, Wagner [137] proposes an analytical model of the economics of defect detection techniques that incorporates various cost factors and revenues, considering uncertainty and sensitivity analysis to identify the

most relevant factors for model simplification. While cost-benefit models in the software testing and regression testing areas are not well established, in other software engineering areas, such as software process and development, models have been considered much more actively, and researchers have attempted to incorporate various economic considerations into these models. For example, recent research in value-based software engineering has sought to support the evaluation of software engineering activities (such as design, development, and maintenance) from an economic point of view. Return On Investment (ROI) models, which calculate the benefit return of any given investment [109], provide one such approach to evaluation, which supports systematic software business value analysis [14, 91, 129].
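As a point of reference (our own illustrative formulation here, not one taken from [109] or the models cited above), the basic ROI calculation relates the net gain from an activity to what was spent on it:

    ROI = (benefits - costs) / costs

Under such a model, a regression testing technique pays off only when the benefits it delivers (e.g., earlier fault detection or reduced testing time) exceed the costs it imposes (e.g., the time spent collecting coverage information and executing the technique); Chapter 6 develops a considerably richer cost-benefit formulation along these lines.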

2.3 Empirical Studies

Until recently, most studies of regression testing or related techniques focused on the effectiveness of individual techniques (typically relative to a baseline such as the retest-all technique or unprioritized test suites) [23, 119, 120, 135]. Other studies have focused on comparisons of techniques [10, 42, 52].³ Only a few studies have evaluated the cost-benefit tradeoffs for such techniques using cost models [33, 83, 86], and in these cases, the models utilized were the relatively simple ones described in Section 2.2.

Two recent studies have considered the notion of software evolution setting when evaluating regression testing techniques [72, 73]. In [73], Kim et al. consider an evaluation approach that treats regression testing as a continuous process, and report results of empirical studies in which test application frequency is varied, as could happen across multiple releases of a software system. They find that the effectiveness of regression testing techniques changes as the frequency of regression testing changes. Whereas the study reported in [73] does not provide an explicit evaluation model, in a later paper Kim et al. [72] do present regression testing processes that consider software evolution, and report results of empirical studies considering techniques that utilize historical test case performance data. Their results show that historical information may be useful in increasing the effectiveness of regression testing techniques in the long run.

These two studies thus begin to address questions about software evolution setting (system lifetime-based view) and its effects on testing, and they raise, at least implicitly, the issue of requiring appropriate evaluation schemes that we address explicitly in this dissertation. However, the studies performed and cost models utilized in [72, 73] have several limitations. First, the object programs utilized are small, and do not include true multiple program versions; instead, versions are simulated using fault injection, and the injected faults are the only modifications present in the programs. Thus, an open question is whether the results of the studies generalize to
³ Chapter 3 presents the current state of the art of empirical studies of testing.

actual, non-trivial systems, with actual sequences of releases containing true modifications. Second, the model used in [72] to evaluate techniques does not consider costs incurred by testing techniques; benefit is the only measure considered. The model also does not consider context factors, which yield different scenarios that an organization could face in testing, such as resource availability and time constraints. To increase our understanding of regression testing techniques, we need to evaluate the techniques using object programs of various nontrivial sizes that have multiple releases, and for this evaluation we require a model of costs and benefits that accounts for the actual costs incurred by regression testing practices, in an evolutionary setting and under varying testing contexts.


Chapter 3 The State of the Art in Empirical Studies of Testing, and Assessment of and Guidelines for Studies
Empirical studies are necessary when researchers or practitioners wish to evaluate new techniques or methods, or to make decisions about more cost-effective solutions, but the number of empirical studies in software engineering is relatively small compared to other disciplines. The lack of empirical evidence is likely to hamper the improvement of software quality, and may lead practitioners to make decisions about adopting new software technologies or methods by intuition and not by scientific evidence [45]. Intuition may provide a good starting point, but it must be followed by empirical validation [133].

Empirical studies can be very expensive due to the intrinsic complexity of experimentation and the considerable time required. Such empirical studies, however, can produce worthwhile benefits for practitioners and researchers; that is, the results of experiments can produce far more profit than the cost that a company expends on experimentation. For example, Tichy [133] shows that a five-year lead in software inspections, based on in-house experiments at Lucent Technologies, is yielding benefits for that company.

To provide an initial view on the state of the art in empirical studies of software testing, we surveyed recent research papers, focusing on studies of the techniques themselves (we exclude studies with human subjects). Section 3.1 presents the results of this survey. Next we wished to assess the empirical studies surveyed in Section 3.1, and expose open problems in both our empirical understanding of techniques and our understanding of how the techniques should be investigated. To accomplish this, we performed the following preliminary work: 1) surveyed papers that evaluate empirical studies in software engineering; 2) surveyed papers that provide experimentation guidelines; 3) established criteria for the evaluation of empirical studies of testing techniques.

Section 3.2 presents the first two steps of this preliminary work: a survey of

reviews of empirical studies and a survey of experimentation guidelines. Then, Section 3.3 presents criteria for evaluating empirical studies of testing techniques, which are based on the results of Section 3.2 together with additional evaluation criteria that we identify. Given the foregoing results, Section 3.4 presents an evaluation of existing controlled experiments on testing techniques (those investigated in Section 3.1) and discusses open problems and implications for empirical studies. Section 3.5 presents conclusions.

3.1 A Survey of Studies of Testing

To provide an initial view on the state of the art in empirical studies of software testing, we surveyed recent research papers following approaches used by Tichy et al. [134] and Zelkowitz et al. [144].¹ We selected two journals and four conferences recognized as pre-eminent in software engineering research and known for including papers on testing and regression testing: IEEE Transactions on Software Engineering (TSE), ACM Transactions on Software Engineering and Methodology (TOSEM), the ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), the ACM/IEEE International Conference on Software Engineering (ICSE), the ACM SIGSOFT Symposium on Foundations of Software Engineering (FSE), and the IEEE International Conference on Software Maintenance (ICSM). We considered all issues and proceedings from these venues over the period 1994 to 2003.

Table 3.1 summarizes the results of our survey with respect to the numbers of full research papers appearing in each venue per year. Of the 1,995 full research papers contained in the issues and proceedings of the six venues we considered, we identified 224 papers on topics involving software testing issues such as testing techniques, test case generation, testing strategies, and test adequacy criteria. We examined these papers and determined that 107 reported results of empirical studies. In this determination, we included all papers that either described their results as empirical or clearly evaluated their proposed techniques or methods through empirical studies. The table contains three columns of data per venue: P (the total number of papers published in that year), T (the number of papers about software testing published in that year), and EST (the number of papers about software testing that contained some type of empirical study).

As analysis of the data in the table shows, 11.2% (224) of the papers in the venues considered concern software testing, a relatively large percentage attesting to the importance of the topic. (This includes papers from ISSTA, which would be expected to have a large testing focus, but even excluding ISSTA, 9% of the papers in the other venues, which focus on software engineering generally, concern testing.) Of the testing-related papers, however, only 47.7% (107) report empirical studies.
¹ Much of the material presented in this section has appeared in [32].
³ ISSTA proceedings appear bi-annually.

Table 3.1: Research Papers Involving Testing and Empirical Studies in Six Major Venues, 1994-2003 (P: the total number of papers, T: the number of papers about software testing, EST: the number of papers about software testing that contained the empirical study).

           TSE              TOSEM            ISSTA³
  Year    P    T   EST     P    T   EST     P    T   EST
  2003   74    8    7      7    0    0      -    -    -
  2002   74    8    4     14    2    0     23   11    4
  2001   55    6    5     11    4    3      -    -    -
  2000   62    5    2     14    0    0     21   10    2
  1999   46    1    0     12    1    0      -    -    -
  1998   73    4    3     12    1    1     16    9    1
  1997   50    5    4     12    1    1      -    -    -
  1996   59    8    2     13    5    2     29   13    1
  1995   70    4    1     10    0    0      -    -    -
  1994   68    7    1     12    3    1     17   10    1
  Total 631   56   29    117   17    8    106   53    9

           ICSE             FSE              ICSM
  Year    P    T   EST     P    T   EST     P    T   EST
  2003   75    7    3     33    4    4     39    4    3
  2002   57    4    4     17    0    0     61    9    8
  2001   61    5    3     28    3    3     68    4    3
  2000   67    5    2     17    2    2     24    0    0
  1999   58    4    2     29    1    1     36    4    2
  1998   39    2    2     22    2    1     32    3    3
  1997   52    4    2     27    3    1     34    2    1
  1996   53    5    3     17    2    1     34    1    1
  1995   31    3    1     15    1    1     30    4    1
  1994   30    5    2     17    2    0     38    3    1
  Total 523   44   24    222   20   14    396   34   23
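The headline percentages quoted above follow directly from the totals row of Table 3.1; the short check below is illustrative code, using only numbers that appear in the table:

    # Totals (P, T, EST) from the bottom row of Table 3.1, per venue.
    totals = {"TSE": (631, 56, 29), "TOSEM": (117, 17, 8), "ISSTA": (106, 53, 9),
              "ICSE": (523, 44, 24), "FSE": (222, 20, 14), "ICSM": (396, 34, 23)}
    P = sum(p for p, _, _ in totals.values())   # 1995 full research papers
    T = sum(t for _, t, _ in totals.values())   # 224 papers on testing
    E = sum(e for _, _, e in totals.values())   # 107 with empirical studies
    print(100 * T / P)                  # 11.22..., the "11.2%" quoted above
    print(100 * E / T)                  # 47.76..., the "47.7%" quoted above
    print(100 * (T - 53) / (P - 106))   # 9.05..., the "9%" excluding ISSTA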

We next analyzed the 107 papers on testing that reported empirical studies to determine the type of empirical study performed. Determining the type of empirical study performed required a degree of subjective judgement, due to vague descriptions by authors and the absence of clear quantitative measures for differentiating study types. However, previous work [6, 79, 144] provides guidelines for classifying types of studies, and we used these to initially determine whether studies should be classified as controlled experiments, case studies, or examples. An additional source for distinguishing controlled experiments from other types of studies was Wohlin et al.'s criteria [140], which focus primarily on whether the study involved manipulating factors to answer research questions.

Many studies were relatively easy to classify: if a study utilizes a single program or version and it has no control factor then it is clearly a case study or an example; if a study utilizes multiple programs and versions and it has multiple control factors

then it is a controlled experiment. However, some studies were difficult to classify due to a lack of description of experiment design and objects. To classify the type of empirical study, we first considered the following elements: the number of programs and versions the empirical studies used, whether they used tests and faults, and whether they were involved in artifact sharing. Then we also considered the experiment design of a study. For example, if a study used multiple programs, versions, tests, and faults but its experiment design did not manipulate any factors, then we classified it as a case study or an example. On the other hand, if a study involved a single program but its experiment design manipulated and controlled factors such as versions or tests, we classified it as an experiment. Having determined which papers presented controlled experiments, we next considered the remaining papers again, to classify them as case studies or examples. If the studies utilized only single trivial programs (i.e., less than a couple of hundred lines of code), then we classified them as examples.
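The classification procedure just described can be summarized as a rough decision rule; the sketch below is an illustrative paraphrase of those criteria (the parameter names and the size threshold are ours), not the exact procedure applied, and borderline cases still required judgement:

    # Rough paraphrase of the study-classification criteria described above.
    def classify(num_programs, loc, manipulates_factors):
        # Controlled experiment: the design manipulates and controls factors
        # (e.g., versions or tests), even when only a single program is used.
        if manipulates_factors:
            return "controlled experiment"
        # Example: only a single, trivial program (a few hundred lines or less).
        if num_programs == 1 and loc <= 300:
            return "example"
        # Otherwise, with no manipulated factors: case study.
        return "case study"

    print(classify(num_programs=1, loc=150, manipulates_factors=False))   # example
    print(classify(num_programs=4, loc=20000, manipulates_factors=True))  # controlled experiment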
[Figure 3.1 appears here: a grouped bar chart showing, for each venue (TOTAL, TSE, TOSEM, ISSTA, ICSE, FSE, ICSM) and each period (1994-1998, 1999-2003, 1994-2003), the percentage of studies classified as examples, case studies, and experiments; the vertical axis is percent (0-100).]

Figure 3.1: The percentage of empirical studies: example, case study and experiment.

Figure 3.1 summarizes the results of our analysis (the precise data is given in Table 3.2). The figure reports the data for each venue in terms of three time periods: 1994-1998, 1999-2003, and 1994-2003. Across all venues over the ten year period, 34.5% of the studies presented were controlled experiments and 56% were case studies. Separation of this data into time periods suggests that trends are changing: 27.5% of the studies in the first five years (1994-1998) were controlled experiments, compared to 38.8% in the second five years (1999-2003). This trend occurs across all venues other than ISSTA and FSE, and it is particularly strong for TSE (9% vs. 50%).

As our analysis shows, the number of research papers that report empirical results has been increasing in recent years. This is a positive indication that more software engineers are aware of the importance of performing empirical studies. In that sense,


Table 3.2: The Numeric Data for Figure 3.1.

  Publication          Empirical Papers   Example   Case Study   Controlled Experiment
  TSE (1999-2003)             18              0          9                9
  TSE (1994-1998)             11              1          9                1
  TSE (1994-2003)             29              1         18               10
  TOSEM (1999-2003)            3              0          0                3
  TOSEM (1994-1998)            5              1          1                3
  TOSEM (1994-2003)            8              1          1                6
  ISSTA (1999-2003)            6              0          4                2
  ISSTA (1994-1998)            3              0          2                1
  ISSTA (1994-2003)            9              0          6                3
  ICSE (1999-2003)            14              1          6                7
  ICSE (1994-1998)            10              0          7                3
  ICSE (1994-2003)            24              1         13               10
  FSE (1999-2003)             10              2          6                2
  FSE (1994-1998)              4              0          2                2
  FSE (1994-2003)             14              2          8                4
  ICSM (1999-2003)            16              2         11                3
  ICSM (1994-1998)             7              3          3                1
  ICSM (1994-2003)            23              5         14                4
  Total (1999-2003)           67              5         36               26
  Total (1994-1998)           40              5         24               11
  Total (1994-2003)          107             10         60               37

empirical studies in software engineering have matured in a quantitative way; however, the quality of individual studies of testing has not been evaluated well enough to build knowledge about empirical studies of testing or to provide guidelines for future studies. Thus, we further investigate these issues in the following three sections through additional surveys, and provide insights regarding how empirical studies of testing techniques should be performed.

3.2 A Survey of Literature on Empirical Studies: Reviews and Guidelines

Early surveys of experimental computer science by Feldman et al. [44] and McCracken et al. [88] in 1979 present problems in performing experiments in computer science, such as poor support from industry and government, and lack of engagement in experimentation by computer scientists. An article by the Computer Science and Telecommunications Board [127] in 1994 points out that support for experimental computer science is still not sufficient, and that experimental computer scientists may face challenges in building their academic careers. Software engineering research faces these same problems [134], and as a consequence

we often find limited empirical evidence supporting software methods and techniques. To provide an overview of the current state of empirical studies in software engineering, this section reports the results of two surveys of the literature focusing on papers regarding empirical studies. First, we investigate articles that provide evaluations of empirical studies of software engineering. Second, we investigate articles that provide insights into how empirical studies should be performed. The first survey provides insights into the current status of systematic reviews in software engineering research. The second survey provides guidelines from which to formulate criteria for systematic reviews of the papers we consider in this chapter. After presenting results from the two surveys, we discuss observations and problems identified in those surveys.

3.2.1 A Survey of Reviews of Empirical Studies

While there are not many, there are nonetheless some reviews of empirical studies in software engineering that have been performed. This section presents, chronologically, five articles that have different focuses and evaluation schemes. The first three articles evaluated general software engineering studies, and the last two articles focused on studies of testing techniques. For each article, we present the source of the survey (for example, what publications were surveyed), the evaluation criteria that the article used, and the results and observations drawn. We use the term results only if an article provides statistical or numerical data; otherwise we use the term observations.

1. Basili et al. [6]: Basili et al. propose a framework for experimentation and evaluate existing empirical studies based on that framework. They focus on examining the quality of empirical studies, and identify several problems in experimentation.
Source of survey: Experimental studies on software engineering between the years 1972-1985 (40 papers). No selection criteria are given.
Criteria for evaluation: The authors define a framework for experimentation (definition, planning, operation, and interpretation), and then assess papers based on the framework. The detailed evaluation criteria are given in Section 3.2.2.
Observations: There is no universal model in software engineering. Experimental studies should consider the vast differences among environments and among people (there are an enormous number of factors that differ across environments). Studies need precise problem specifications. The experimental planning process should consider possible replication studies.

Several papers do not define their data well enough to enable a comparison of results across projects and environments. The presentation of experimental results should include appropriate qualification (generalization issue) and adequate exposure (graphical representation of data) to support their proper interpretation.

2. Tichy et al. [134]: Tichy et al. survey research articles in computer science and two other fields, optical engineering and neural computation, and compare them in a quantitative way.
Source of survey: All issues from 1991 to 1993 of the ACM Transactions on Computer Systems (TOCS); all issues from 1992 to 1993, and numbers 1 and 2 in 1994, of the ACM Transactions on Programming Languages and Systems (TOPLAS); all issues in 1993 of the IEEE Transactions on Software Engineering (TSE); all issues in 1993 of the SIGPLAN Conference on Programming Language Design and Implementation (PLDI); a random sample of 50 titles from the set of all works published by the ACM in 1993; all issues in 1993 of Neural Computation (NC); and numbers 1 and 3 from 1994 of Optical Engineering (OE). In total, 403 papers were examined.
Criteria for evaluation: The authors classify papers into five categories: formal theory, design and modeling, empirical work, hypothesis testing, and other. They then assess design and modeling papers based on the amount of space devoted to the description of experimental evaluation. They claim that the amount of space can reflect the quality of the experiment.
Results: Percentage of papers without empirical evaluation: CS samples, 43%; OE samples, 14%. The number of articles in NC and OE devoting over 20% of their space to experimental evaluation is larger than that in the CS samples (67% in OE vs. 31% in the random CS sample). Samples related to software engineering (TSE and TOPLAS) are worse than the random CS sample.
Observations: The results suggest that a large proportion of CS publications may not meet standards long established in the natural and engineering sciences. The youth of the computer science field is not a sufficient explanation for poor standards because the NC field is only six years old. Computer scientists have neglected to develop adequate measuring techniques. Many computer scientists agree that standards need to be raised, but they are reluctant to take the first step because building measurement guidelines and expertise requires a tremendous effort without being rewarded (slowing careers).

3. Zelkowitz et al. [144]: Zelkowitz et al. define 12 validation models for experimentation,

classify articles in software engineering using those models, and analyze them in a quantitative way.
Source of survey: All issues from the years 1985, 1990, and 1995 of the IEEE Transactions on Software Engineering (TSE), IEEE Software, and the International Conference on Software Engineering (ICSE). In total, 562 papers were examined.
Criteria for evaluation: The authors develop 12 different experimental models with respect to how data was collected, and group them into three broad categories: observational (project monitoring, case study, assertion, field study), historical (literature search, legacy, lessons learned, static analysis), and controlled (replicated, synthetic, dynamic analysis, simulation).
Results: Lessons learned and case studies are the most prevalent validation models (about 10% each). A third of the papers relied on assertion, and about a third of the papers had no experimental validation. The percentage of papers having no experimental validation fell from 36% in 1985 to 29% in 1990, and then to 19% in 1995.
Observations: Too many papers have no experimental validation. Many papers use an informal form of validation (assertion). Researchers often fail to state their goals clearly. Researchers often fail to state how they validate their hypotheses. Experimentation terminologies are used very loosely.

4. Juristo et al. [71]: Juristo et al. examine papers on testing techniques, and evaluate them relative to families of testing techniques. The evaluation is performed in a qualitative way.
Source of survey: Papers on testing techniques from the past 25 years. No selection criteria are given.
Criteria for evaluation: The authors measure the maturity level of testing techniques using four criteria: laboratory study, formal analysis, laboratory replication, and field study. The evaluation is made with respect to the family in which the testing technique is a member: random, functional, control-flow, data-flow, mutation, regression, and improvement.
Observations: Many studies are based solely on qualitative graph analysis. The response variables examined are of limited use in practice. The artifacts (programs and faults) are not representative. More experiments and replications need to be conducted to generalize results.

5. Do et al. [31]: Do et al. examine articles on empirical studies of testing techniques, and evaluate them from an artifact utilization point of view.

The analysis is done in a quantitative way. We report most of the results of this study in Sections 3.1 and 4.1 of this dissertation, but for the completeness of this section, we summarize that work here:
Source of survey: All issues and proceedings from two journals and four conferences over the period 1994 to 2003: IEEE Transactions on Software Engineering (TSE), ACM Transactions on Software Engineering and Methodology (TOSEM), ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA), ACM/IEEE International Conference on Software Engineering (ICSE), ACM SIGSOFT Symposium on Foundations of Software Engineering (FSE), and the IEEE International Conference on Software Maintenance (ICSM). In total, 1,995 papers were examined.
Criteria for evaluation: Of the 1,995 full research papers, the authors identify 224 papers on topics involving software testing issues, and select 107 papers that reported results of empirical studies for further analysis. The 107 papers are analyzed with respect to several characteristics: the type of empirical study performed (controlled experiment, case study, and example), the number of programs used as sources of data, the number of versions used as sources of data, whether test suites were utilized, whether fault data were utilized, and whether the study involved artifacts provided by or made available to other researchers.
Results: 11.2% of the papers concerned software testing. Of the testing-related papers, 47.7% reported empirical studies. Of the empirical study papers, 34.5% were controlled experiments, and 56% were case studies. 32.7% utilized data from one program, 44% utilized multiple versions, 35.5% utilized fault data, and 22.4% involved artifact sharing.
Observations: Over 50% of studies do not have empirical validation. The number of controlled experimental studies is small, but the trends are changing: 27.5% in the first five years (1994-1998), and 38.8% in the second five years (1999-2003). Researchers are becoming increasingly willing to conduct controlled experiments, and are increasing the extent to which they utilize shared artifacts.

These five articles investigated different sources and used different evaluation criteria, but they commonly observed that several important aspects of empirical research are neglected by researchers: studies often fail to state a precise specification of the problem, fail to provide a proper interpretation of results, or fail to provide detailed descriptions of data.2 However, as our own results have illustrated in Section 3.1, there is a positive aspect to current empirical studies: the number of empirical studies is increasing, and more computer scientists are proving willing to conduct empirical studies.

2 Section 3.2.3 will discuss these problems in detail.

3.2.2 A Survey of Experimentation Guidelines

There is a great deal of literature that provides experimental design and data analysis guidelines that can be applied to scientific and engineering areas in general [28, 61, 90]. One set of guidelines by Montgomery [90] recommends seven steps for an experiment: (1) recognition of the problem, (2) choice of factors, levels, and ranges, (3) selection of independent/dependent variables, (4) choice of experimental design, (5) performing the experiment, (6) statistical analysis of the data, and (7) conclusions and recommendations. Depending on the literature, these steps break down into more detailed processes, or some steps merge, but their principal guidelines are not different. Based primarily on these general guidelines, several software engineering researchers have established empirical study guidelines for software engineering and have created foundations to assist researchers in performing empirical studies [6, 45, 70, 79, 78, 102, 104, 105, 106, 107, 108, 140]. This section reports on our survey of these. With respect to the articles surveyed, we focused on guidelines for performing controlled experiments.

Table 3.3 presents experimentation steps defined in five different sources [6, 70, 78, 105, 140]. They are similar to those from the general guidelines mentioned at the beginning of this section, using slightly different terminologies and groupings, but delivering conceptually similar processes. In particular, Basili et al. and Wohlin et al. describe common components, with the exception of presentation and packaging in Wohlin et al.'s guidelines. Juristo et al. consider several factors that are specifically relevant to experiments in software engineering, but we omit them in this survey because their considerations are focused on human subjects. Detailed descriptions of Table 3.3 are as follows.

1. Basili et al. [6]: Basili et al. describe a framework for experimentation in software engineering, which includes four steps.
Definition: Define six parts - motivation of the study, object (the primary entity examined in the study), purpose, perspective (developer or customer), domain (programs or human subjects), and scope (single/multi project, replicated project, or blocked subject project).
Planning: Consider three parts - design (factorial or block design), criteria (direct/indirect reflection of cost/quality: cost, reliability, programmer understanding, etc.), and measurement (capture the cost/quality aspects: fault detection cost, fault detection effectiveness, etc.).
Operation: Consider three parts - preparation (subject expertise, object instrumentation, etc.), execution (collect and validate data), and analysis (check assumptions before applying statistical tests).

Table 3.3: The Experiment Processes.

Source | Steps
Basili et al. [6] | 1. Definition (motivation, object, purpose, perspective, domain, scope); 2. Planning (design, criteria, measurement); 3. Operation (preparation, execution, analysis); 4. Interpretation (interpretation context, extrapolation, impact)
Pfleeger [105] | 1. Conception (goals, objective); 2. Design (hypothesis formulation, subj./obj. selection, treatment identif., variables selection); 3. Preparation (environment setup); 4. Execution (measure); 5. Analysis (review measurements, statistical procedures); 6. Dissemination & decision-making (draw conclusions)
Wohlin et al. [140] | 1. Definition (object, purpose, quality focus, perspective, context); 2. Planning (context sel., hypothesis formulation, variables sel., subj./obj. sel., experi. design, instrumentation, validity eval.); 3. Operation (preparation, execution, data validation); 4. Analysis and Interpretation (descriptive statistics, data reduction, hypoth. testing); 5. Presentation and Package (present findings)
Juristo et al. [70] | 1. Definition (objectives, hypothesis); 2. Design (define experi. unit, subj./obj. selection, variables selection, define the parameter); 3. Execution (collect data); 4. Analysis (examination of data, statistical inference)
Kitchenham et al. [78] | 1. Experimental context (background, research hypothesis); 2. Design (identify population, define sampl. process, define treatm. allo., define the experi. unit); 3. Execution & data collection (define all software measures); 4. Analysis (graphical exam. of data, assumption check); 5. Presentation (describe stat. procedures used, present raw data, provide descrip. statistics); 6. Interpretation (define population, stat./prac. importance, define study type, limitations)

Interpretation: Consider three parts - interpretation context (statistical results should be interpreted in the context of the study), extrapolation (representativeness of the study), and impact.

2. Pfleeger [105]: Pfleeger defines six steps for experimentation.
Conception: Define the goals of the study, and state clearly the objective of the study (it should be stated as a question you want answered).
Design: Create a hypothesis, and generate a formal design to test the hypothesis (identify subject/object, treatment, and independent/dependent variables).
Preparation: Set up an environment for execution, considering tool preparation and hardware configuration.
Execution: Collect data by applying the treatment to the experimental subjects/objects.
Analysis: Review all measurements and apply statistical analyses.
Dissemination and decision-making: Draw a conclusion based on the results of the analysis step.

3. Wohlin et al. [140]: The authors define five steps for experimentation. The first four steps are similar to those defined by Basili et al., but the authors provide more detailed descriptions.
Definition: Determine why the experiment is conducted. Components to be considered are the object of study, purpose, quality focus (the primary effect under study, such as effectiveness or cost), perspective (whose viewpoint, i.e., developer or customer), and context (environment in which the experiment is run).
Planning: Focus on how the experiment is conducted. The components are context selection (off-line vs. on-line, student vs. professional, toy vs. real problems, and specific vs. general), hypothesis formulation, variable selection (independent and dependent variables), selection of subjects/objects (connected to the generalization of results from the experiment), experiment design (describes how the tests are organized and run), instrumentation (environment setup, such as fault seeding in a program, or interview preparation), and validity evaluation (threats to internal, external, construct, and conclusion validity).
Operation: The components are preparation, execution, and data validation. The experiment is carried out to collect the data that should be analyzed. Collected data should be validated by the experimenter.
Analysis and interpretation: The components are descriptive statistics (provide an understanding of the data set distribution), data set reduction, and hypothesis testing (find appropriate statistical test procedures based on the data distribution, draw conclusions, and discuss the statistical/practical significance of results).

Presentation and package: Define an experiment report outline: introduction, problem statement, experiment planning, operation, data analysis, interpretation, discussion, and conclusions.

4. Juristo et al. [70]: The authors define four experimentation steps focusing on statistical data analysis, considering experimentation in software engineering. They provide software-engineering-specific factors to be considered when performing an experiment, but most of those factors are related to human-subject-specific factors, such as learning, experience, boredom, and enthusiasm effects.
Definition: Define the objectives of the study, and formulate hypotheses for the experiment.
Design: Consider the components required in the design phase: experimental units, subjects/objects, independent/dependent variables, and parameters (the characteristics of software that are invariable throughout the experiment).
Execution: Run the experiment and collect data.
Analysis: Analyze the data. Examination of the data is recommended before applying statistical analysis. The statistical inference is made after statistical analysis.

5. Kitchenham et al. [78]: Kitchenham et al. propose a set of guidelines for empirical studies in software engineering. In the following description, we omit human-subject-specific guidelines.
Experimental context: Present background and related information for the study. Define the research hypotheses.
Experimental design: Identify the population, and define a selection process for subjects/objects, a process for assigning treatments, and experimental units.
Execution of the experiment and data collection: Define all software measures (entity, attribute, unit, and counting rules).
Analysis: Analyze the data. Graphical examinations are recommended before undertaking detailed analysis, and an assumption check should be performed on the test procedure.
Presentation of results: Provide the results of the analyzed data. Presentation3 of results should be detailed enough to allow replications of the study. Depending on the statistical package used, one might obtain slightly different results, so it is a good idea to specify what package has been used in the study. Provide appropriate graphical representations for data comprehension, and appropriate descriptive statistics.
Interpretation of results: Define the population and the type of study. The statistical and practical importance might be different, so it is important to interpret results from both perspectives. Any limitations of the study need to be specified.

3 The usage of presentation here is different from that in Wohlin's guidelines, which referred to documentation. Here, presentation involves analyzed data.

3.2.3 Observations and Problem Areas

In Sections 3.2.1 and 3.2.2, we presented surveys of the current state of reviews of empirical studies and of experimentation guidelines. Through these surveys we derive the following observations and identify problems that should be addressed.

1. Small number of systematic review articles: During the survey process, we found that there are a limited number of papers that report reviews of empirical studies. Moreover, some reviews do not have systematic evaluation criteria or a definition of the population (such as how they selected papers for review). For example, Basili et al. and Juristo et al. do not provide information on how they selected papers for review. In the absence of this information, it is difficult to draw correct inferences from the results of the study. Kitchenham et al. [77] urge that software engineers need to perform more systematic reviews to establish evidence-based software engineering. They recommend that systematic reviews should use appropriate search methodologies, and justify the method of search, such as the choice of journals and the selection of papers.

2. Few reviews of the quality of empirical studies: Many review articles focus on the quantitative aspects of studies; for example, how many papers do not have any empirical validation, or how many papers utilize fault data. Quantitative analysis is an important activity that identifies general (or specific) trends in attributes we are interested in, but we also need qualitative analysis to provide insights into how well individual studies were performed. A review by Juristo et al. [70] provides a qualitative point of view on studies of testing techniques, which they describe as the maturity level of a technique. Having more reviews of this type will build up knowledge about empirical studies in a specific area that researchers are interested in, and will provide guidelines for studying that area in future research.

3. No standard experimentation guidelines in software engineering: During the survey of experimentation guidelines, we found that terminologies are not fully standardized, so researchers tend to use terminologies familiar to them. For example, Basili et al. use extrapolation, while some others use generalization, which is more popularly used in the experimentation community. The terms state variable (response variable) and independent variable (dependent variable) are used interchangeably. Juristo et al. use parameter, which is defined as any characteristic of a software project that is to be invariable throughout the experimentation. This appears to be in reference to an environment or setting for an experiment, but the use of the terminology is not clear.

Some guidelines explicitly recommend graphical examination of data and checking of assumptions before applying statistical procedures, but some make no mention of assumptions whatsoever. Interpretation of test results is the most important part of the experiment since it answers specific research questions, and often research in software engineering carries industrial implications. It is thus important to report both the statistical and practical significance of experiment results. Most of the guidelines surveyed here miss this point. According to Kitchenham et al. [77], medical research has standard guidelines for conducting and reporting an experiment, known as the CONSORT statement, and medical journals follow those guidelines. Software engineering researchers could benefit from this sort of standard set of guidelines for conducting empirical studies.

3.3 Criteria for the Evaluation of Empirical Studies of Testing Techniques

This section presents criteria for the evaluation of empirical studies of testing techniques, based primarily on the surveys in Section 3.2, with additional evaluation criteria for testing techniques that we have identified. The criteria also serve as guidelines for performing empirical studies. The evaluation criteria and guidelines for empirical studies are as follows:

1. Objectives of the study: As recommended by most of the experiment guidelines in Section 3.2.2, for a formal experiment, specific hypotheses or research questions should be formulated from the objectives of the study. If a hypothesis is too general, the experiment results are also too general to be useful.

2. Experiment design: We adapt design guidelines from Kitchenham et al. [78]:
Define the sampling process for objects: In practice, it is difficult to obtain software programs by random sampling. There is no pool having all sorts of programs that supports random sampling. There are two cases here: 1) Open source software hosts such as SourceForge and Apache Jakarta can be one of the pools from which we can obtain pseudo-random samples in a limited sense. 2) Unlike researchers who perform experiments with human subjects, which are typically available for one-time experimental use, researchers who perform experiments with non-human objects often share object programs for their experiments [31]. In either case, a study description needs to specify how objects are obtained.
Define treatment: Defining the treatment helps readers to understand the cause and effect relationship clearly.

Define the process for treatment assignment: The treatment allocation to objects should be unbiased. A biased allocation is unlikely to happen when using object programs instead of human subjects, but a process should still be specified for treatment assignment to make the experiment process clear.
In addition to the three guidelines just described, all detailed attributes of objects should be specified. For example, studies of testing techniques possibly involve the following attributes: lines of code (or number of classes), the number of functions (or methods), the number of faults in a program, the type of faults (real, seeded, and mutated faults), the number of versions of a program, the test suite size, the type of test suite (specification-based, code-based, coverage-based, etc.), the number of mutants, and any other characteristics relevant to the experiment. This information helps readers understand the experiment results and the complexity of the experiment. Detailed descriptions of the experiment design facilitate replication studies.

3. Study setting: Depending on the software tools and hardware devices utilized, experiment results can be slightly different. To provide a clear view of the collected data, environment setups that may affect results should be specified (e.g., a specification for the machine utilized, and/or the name (source) of the software tool applied). This criterion is also very important if researchers want to replicate the study.

4. Data collection (measures): When researchers collect data, they should be certain that a measure is valid, and detailed descriptions of data collection procedures are helpful for understanding the experiment and conducting replications. The unit of collected data should be specified clearly, and sometimes researchers forget to do this. For instance, the first paper in Table 3.4 reports measures of execution time using a specific tool, but no unit of time is specified. Here, time could be seconds or hours, which represent very different scales, so readers cannot interpret the results correctly. For testing techniques, possible measures include: fault detection rate, test suite reduction rate, run time for testing tools/methods, and coverage rate.

5. Data examination and analysis: Before performing a formal data analysis, it is important to examine the collected data through graphical presentations. Doing this helps us understand how the data are distributed (normal or non-normal), whether the data include outliers (and if so, how extreme they are), and whether the variabilities between the data sets to be compared are severe. This provides an initial idea about what type of statistical procedures should be applied to the collected data, and serves as a check of assumptions before applying statistical procedures.

Depending on the distribution of the data, its size, and the variability between the data groups to be compared, the choice between parametric and non-parametric analysis is made. Parametric tests are more powerful than non-parametric tests, but researchers should make sure that the data meets the assumptions required for parametric tests (a minimal illustration of such an assumption check appears at the end of this section). Since different statistics packages may give slightly different results, specifying what statistics package the study used is important.

6. Interpretation: The most important part of the experiment process is the interpretation of results by connection to the research questions. A study may find statistical significance in its results, but there may be no practical significance, and vice versa. Montgomery [90] illustrates this point well using the following example: an engineer may determine that a modification to an automobile fuel injection system may produce a true mean improvement in gasoline mileage of 0.1 mi/gal. This is a statistically significant result. However, if the cost of the modification is $1000, then the 0.1 mi/gal difference is probably too small to be of any practical value.

7. Limit of the study: Researchers should report any limitations of the study, such as threats to the validity of the experiment. There are four types of threats to validity [140]: internal, external, construct, and conclusion validity. Researchers need to discuss at least internal and external validity [78].
Internal validity: Internal validity concerns confounding factors or biases that can affect the independent variables with respect to causality.
External validity: External validity concerns the generalization of results.
Construct validity: Construct validity concerns whether the experiment setting actually reflects the construct under study.
Conclusion validity: Conclusion validity concerns issues that affect the ability to draw the correct conclusions about relations between the treatment and the outcome of an experiment.

8. Replicability: Replication is very important from a scientific point of view. Replication studies can confirm the results of an original study, and so can build knowledge in a significant way, contributing to the generalization of results. According to a survey by Zelkowitz et al. [144], too many experiments use an informal validation; one third of the papers surveyed relied on informal validation. This problem can be partially solved through external replication.4 The replicability of a study is closely related to how detailed the study is in presenting the experiment design and data collection/measure processes.
4 There are two types of replications: internal and external replications [18, 70]. Internal replications are run within the experiment itself in order to establish the statistical validity of the experiment. External replications are run by independent researchers in order to build confidence in the results of the experiment and confirm the findings of other researchers.

In particular, for independent researchers (external replication), replicability is critical. Brooks et al. [18] performed several replicated studies and found that, even in a doctoral thesis, several details were absent from the experiment report, which introduced uncertainty into their replication studies and created sources of variability between the experiments. To judge the replicability of studies by other researchers, we consider four components, namely objects (details on the attributes of objects and the source of objects), treatments (descriptions of the techniques used), study setting, and data collection, and we examine whether each study provides detailed information about those components. We categorize studies into three levels: highly, moderately, and poorly replicable.

9. Long-term and lifetime view: As mentioned in Fenton et al.'s study [45], a long-term view could lead to conclusions very different from a short-term view. Their study provides one example from the NASA Goddard Software Engineering Conference, which investigated the benefits of using Ada instead of Fortran. It was reported that Ada was not effective in the first set of projects, but after three major Ada developments, results showed that there were significant benefits to using Ada instead of Fortran. A long-term assessment is more important for studies of testing techniques because testing is tightly related to system evolution. There has been considerable research on testing and regression testing. Only recently have people started to evaluate the techniques empirically. Typically these evaluations treat testing and regression testing as a one-time process, whereas what practitioners are really interested in is the cost-benefits across a system's entire lifetime. Ignoring system lifetime may lead to incorrect evaluation of techniques. For example, a testing technique A may be beneficial compared to a comparable technique B for a base version of a system, but when the system evolves, the same conclusion may not hold. Thus we consider whether studies treat testing a system as a lifetime process (long-term view) or a one-time process (short-term view).

10. Industrial significance: Often studies fail to draw a bigger picture from their results, and thus they miss important factors when they evaluate techniques. The findings from studies can be significant and useful to industry, so it is desirable to link the findings to industrial contexts and provide a prospective view on potential benefits resulting from the techniques. For example, savings as indicated by some measure on a small program may not be very beneficial, but for a large commercial program, they may be very significant.
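To make criterion 5 concrete, the following sketch shows one way the assumption check and test selection described above might be carried out. It is illustrative only: the data are hypothetical fault-detection-rate measurements for two techniques applied to the same programs (not drawn from any study), and it assumes the SciPy library is available.

```python
# Illustrative assumption check for criterion 5: examine paired data, test the
# normality assumption, and choose a parametric or non-parametric procedure.
from scipy import stats

# Hypothetical fault detection rates for two techniques on the same ten objects.
technique_a = [0.62, 0.71, 0.58, 0.90, 0.66, 0.73, 0.69, 0.81, 0.75, 0.64]
technique_b = [0.55, 0.69, 0.52, 0.88, 0.60, 0.70, 0.61, 0.79, 0.72, 0.59]
differences = [a - b for a, b in zip(technique_a, technique_b)]

# Assumption check: are the paired differences plausibly normal?
_, p_normal = stats.shapiro(differences)

if p_normal > 0.05:
    # Normality not rejected: a parametric paired t-test is defensible.
    statistic, p_value = stats.ttest_rel(technique_a, technique_b)
    test_used = "paired t-test"
else:
    # Normality rejected: fall back to the Wilcoxon signed-rank test.
    statistic, p_value = stats.wilcoxon(technique_a, technique_b)
    test_used = "Wilcoxon signed-rank test"

print(f"{test_used}: statistic={statistic:.3f}, p={p_value:.3f}")
```

Whatever the outcome, the p-value alone does not settle the question of practical significance (criterion 6); the magnitude of the difference must still be interpreted in its industrial context.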


3.4 The State of the Art in Controlled Experiments of Testing

Having defined evaluation criteria for empirical studies of testing in Section 3.3, we now choose a subset of the empirical studies surveyed in Section 3.1 (we focus on studies performing controlled experiments), and analyze them further in terms of how well they have been performed. Here, we consider a set of ACM Transactions on Software Engineering and Methodology (TOSEM) and IEEE Transactions on Software Engineering (TSE) papers. These papers are from all issues of these venues in the period 1994 to 2003. We analyze 15 papers in total, and our analysis is based on the criteria defined in the previous section. In the analysis process, we focus only on the experimental point of view, so discussion regarding the context of a study is outside the scope of this analysis.

Table 3.4: Summary of Testing Techniques.

Papers | RQ | Design | Study Setting | Data Collec. | Anal. | Interpretation | Replicability | Long-term View | Indus. Prac.
1 [1] | yes | yes | yes | yes | no | yes/yes | moder. | no | no
2 [10] | no | yes | yes | yes | no | yes/yes | high | no | yes
3 [16] | no | yes | yes/no | yes | yes | yes/yes | moder. | no | no
4 [42] | yes | yes | yes | yes | yes | yes/yes | high | no | yes
5 [46] | no | yes | yes | yes | no | yes/no | moder. | no | no
6 [52] | yes | yes | yes | yes | no | yes/yes | high | no | no
7 [58] | yes | yes | no | yes | no | yes/yes | high | no | no
8 [80] | yes | yes | yes | no | no | yes/no | poor | no | no
9 [87] | yes | yes | no | yes | yes | yes/yes | high | no | no
10 [89] | no | yes | no | no | no | yes/no | poor | no | no
11 [94] | yes | yes | yes | yes | no | yes/yes | high | no | no
12 [115] | yes | yes | no | yes | yes | yes/no | high | no | no
13 [119] | no | yes | yes | yes | no | yes/no | high | no | no
14 [120] | no | yes | yes | yes | no | yes/yes | high | no | no
15 [122] | yes | yes | yes | yes | yes | yes/yes | high | no | no

(The Limit column of this table records the validity threats each study discusses; its non-blank entries, reading down the column, are: I/E/C, I/E/C, I/E/C, I/E/C, E, I/E, E, I/E/C, I/E/C, E, I/E/C.)

Table 3.4 summarizes the results of our analysis. The description of each column is as follows:
Papers: Lists the order of the papers' appearance in the bibliography.
Research Questions (RQ): Specifies whether a study states specific research questions (yes or no).
Design: Specifies whether a study describes all necessary components in the experimental design, such as objects, treatments, and treatment assignment (yes or no).

Table 3.5: Replicability Checklist.

Papers | Object Description | Object Source | Technique Description | Study Setting | Data Collec. Details | Unit of Measure | Replicab.
1 [1] | yes | yes | yes | yes | yes | no | moder.
2 [10] | yes | yes | yes | yes | yes | yes | high
3 [16] | yes | yes | yes | yes/no | yes | yes | moder.
4 [42] | yes | yes | yes | yes | yes | yes | high
5 [46] | yes | yes | no | yes | yes | yes | moder.
6 [52] | yes | yes | yes | yes | yes | yes | high
7 [58] | yes | yes | yes | no | yes | yes | high
8 [80] | no | no | yes | yes | no | yes | poor
9 [87] | yes | yes | yes | no | yes | yes | high
10 [89] | yes | no | yes | no | no | yes | poor
11 [94] | yes | no | yes | yes | yes | yes | high
12 [115] | yes | yes | yes | no | yes | yes | high
13 [119] | yes | yes | yes | yes | yes | yes | high
14 [120] | yes | yes | yes | no | yes | yes | high
15 [122] | yes | yes | yes | yes | yes | yes | high

Study Setting: Specifies whether a study describes the environmental settings or tools that apply to all objects and treatments, such as machine specifications (yes or no). When a study contains multiple settings and describes only some of them, we specify yes/no.
Data Collection: Specifies whether a study describes its data collection and measurement processes (yes or no).
Analysis: Specifies whether a study performs statistical analysis (yes or no).
Interpretation: Specifies whether a study interprets results with respect to the research questions and considers practical issues related to the results ((yes or no)/(yes or no)); the first value indicates whether relevant interpretations are made, and the second indicates whether practical issues are discussed.
Limit: Specifies whether a study describes the limitations of the study, such as threats to internal, external, construct, and conclusion validity. Since none of the studies we consider address threats to conclusion validity, we denote this column using the validity initials (I/E/C).
Replicability: Specifies the degree of replicability of a study (high, moderate, or poor). Decisions are made from Table 3.5, considering whether the following elements are addressed in the study: object description details, source of objects, technique implementation details, study setting, data collection details, and unit of measure. Based on this information, the degree of replicability is determined. Two exceptions are made, for the first and third studies: these studies satisfy most of the requirements, but in each, one component poses a high risk to replication.5

Long-term View: Specifies whether a study considers a long-term view of the lifetime of techniques (yes or no).
Industrial Practice: Specifies whether a study considers results in any industrial context (yes or no).

Having analyzed these papers, we now discuss observations drawn from our analysis and open problems in understanding testing techniques from an empirical point of view. Some of these items are related to the earlier discussion.

Researchers often neglect to formulate specific research questions: Of the 15 papers we reviewed, only nine studies clearly stated specific research questions. Others provided only goals or objectives, which are often too general. For example, the third paper in Table 3.4 states the objective of its study as follows: Empirically evaluate tree-based integration testing strategies. We cannot determine what the study intends to investigate from such a description. The objective should specify at least those attributes of testing strategies that the researchers want to compare, such as the number of stubs the testing strategies require.

Statistical analysis is not well established: To draw meaningful conclusions from a study, we must perform statistical analysis and interpret the results if the number of data points is large enough to do so. The papers we considered report on formal experiments, but surprisingly, the number of studies (five) that include statistical analysis is very small. Even more problematic is that only one study among the five includes an assumption check before applying statistical tests. Without checking assumptions, statistical tests may be incorrectly applied. Kitchenham et al. [77] point out this issue as follows: the researchers were likely to find statistically significant results that did not really exist.

Practical importance should be drawn from results: As we learned in Section 3.3, researchers often overlook the practical implications of their results. Ten of the studies discuss the practical implications of their results. It is an encouraging sign that many researchers are trying to interpret their results in a practical way, but this should be more widespread.

Value to industrial practice is not addressed: Only two studies address this issue in their discussion: the second paper in Table 3.4 discusses a cost model for comparing regression testing techniques in terms of product development phases, and the fourth paper performs a cost-benefit analysis considering the industrial context (such as translating potential savings from techniques into savings in the organization). Researchers should think one step further in their studies and not just interpret their results in a limited context.

Long-term and lifetime views are overlooked: Despite its importance, none of the studies consider long-term or lifetime views of the techniques.
5 The first study uses a mutation tool for program instrumentation, but the tool is not available publicly, and detailed descriptions are omitted. The third study uses a reverse engineering tool to determine the attributes of objects. Hence their results are dependent on that particular reverse engineering tool.

It is not an easy task to consider these factors because doing so requires complex and large-scale experiments. However, in some areas, particularly in testing, we can investigate testing techniques' effectiveness in a static environment by using software programs having several versions, which can be obtained from open-source software hosts, or (preferably) from software companies. This task is still not easy, but it is worthwhile to investigate techniques in this manner, as it may make a significant contribution to the testing community.

Replicability requires substantial infrastructure support: Even when descriptions of experiment processes and all relevant components are detailed enough to permit replication studies, there exists a fundamental problem that may introduce uncertainty into replication studies: studies of testing techniques tend to be very dependent on the tools utilized during the experiment. For example, the third paper in Table 3.4 uses a reverse engineering tool to build objects from the Java programs on which they conduct their experiment. Hence their results are dependent on that particular reverse engineering tool. Without being able to obtain this tool, a replication study would be confounded. One possible solution to this problem is to build an infrastructure that helps with artifact sharing between researchers. Chapter 4 discusses several challenges that researchers face when they conduct an experiment, and addresses the replicability problem as one of those challenges.

Presentation and documentation are not well organized: During our evaluation of these papers, we found that the presentation of experiment processes is problematic. Some papers introduce research questions in the introduction section, instead of in the experiment section. All information relevant to an experiment should appear in the experiment section. Some papers tend to describe entire experiment processes and results in one section without any break points. This style makes it difficult to understand what is described. We recommend that the experiment section in a study utilize a standard template that can be generally understood. Having a standard template, in turn, helps researchers remember to report all the important information about an experiment.

3.5 Conclusions

We have provided an overview of the current state of controlled experimentation on testing techniques, and established evaluation criteria for studies based on two surveys of empirical studies. During the evaluation of those studies in Section 3.4, we identified several areas for improvement in empirical studies. While we have recognized problems with experimentation from a specific set of papers, we also noted several additional problems. The papers reviewed in Section 3.4 are considered to be high-quality papers in software engineering, having been selected from preeminent journals. Thus, the problems described in this chapter raise concerns. First, if journal papers exhibit problems in experimentation, then these problems may be more serious in conference papers, which have space constraints

and tend to be simplified. Second, papers appearing in journals typically go through a lengthy review process, and thus these papers are significantly revised in response to comments from reviewers. Yet we still see problems in these studies. We conjecture that reviewers or editors may not be aware of the importance of experimentation and thus may not have the appropriate knowledge to recognize the problems we discussed. There is no quick remedy to these problems. Software engineers need to be aware of these problems and build a consensus that performing empirical studies is an essential activity for providing scientific evidence and inputs to the decision-making processes in an organization. The observations and problems presented in this chapter motivate the thesis we are investigating in this dissertation, and they also motivate our concern with infrastructure, which we discuss in greater detail in the next chapter.


Chapter 4 Infrastructure Support for Empirical Studies


As we have asserted earlier, it is important for researchers and practitioners to understand the tradeoffs and factors that influence testing techniques.1 Some understanding of these can be obtained by using analytical frameworks, subsumption relationships, or axioms. In general, however, testing techniques are heuristics and their performance varies with different workloads and contexts; thus, they must be studied empirically. Empirical studies of testing techniques depend on numerous software-related artifacts, including software systems, test suites, and fault data; for regression testing experimentation, multiple versions of software systems are also required. Obtaining such artifacts and organizing them in a manner that supports controlled experimentation is a difficult task. Section 4.1 illustrates these difficulties by further analyzing the papers surveyed in Section 3.1 with respect to the software artifacts utilized. Section 4.2 further discusses these difficulties in terms of the challenges faced by researchers wishing to perform controlled experimentation, which include the needs to generalize results, ensure replicability, aggregate findings, isolate factors, and amortize the costs of experimentation. To help address these challenges, we have been designing and constructing infrastructure to support controlled experimentation with software testing and regression testing techniques.2 Section 4.3 presents this infrastructure, describing its organization and primary components, our approaches for making it available and augmenting it, some examples of the infrastructure being used, and potential threats to validity. Section 4.4 concludes by reporting on the impact this infrastructure has had, and can be expected to have, on further controlled experimentation.
1 Much of the material presented in this chapter has appeared in [32].
2 Work described here is part of a much larger ongoing project; this chapter describes both work completed for this dissertation and ongoing work. As you will see, some of the infrastructure described here was constructed during the process of performing experiments described later in this dissertation.


4.1 A Survey of Studies of Testing: Infrastructure Use

In Section 3.1, we surveyed research papers to provide a view of the state of the art in empirical studies of software testing. In this section, we further analyze these papers by focusing on the utilization of software artifacts in empirical studies. Table 4.1 shows a further classification of the empirical studies reported in Table 3.2 with the following columns: Example (the number of papers that presented examples), Case Study (the number of papers that presented case studies), Controlled Experiment (the number of papers that presented controlled experiments), Multiple Programs (the number of papers that utilized multiple programs), Multiple Versions (the number of papers that utilized multiple versions of a program), Tests (the number of papers that utilized test suites), Faults (the number of papers that utilized fault data), and Shared Artifacts (the number of papers that involved artifacts provided by or made available to other researchers).

Table 4.1: Further classification of published empirical studies.
Publication | Example | Case Study | Controlled Experiment | Multiple Programs | Multiple Versions | Tests | Faults | Shared Artifacts
TSE (1999-2003) | 0 | 9 | 9 | 15 | 6 | 16 | 7 | 5
TSE (1994-1998) | 1 | 9 | 1 | 6 | 2 | 8 | 2 | 1
TSE (1994-2003) | 1 | 18 | 10 | 21 | 8 | 24 | 9 | 6
TOSEM (1999-2003) | 0 | 0 | 3 | 3 | 3 | 3 | 3 | 2
TOSEM (1994-1998) | 1 | 1 | 3 | 4 | 2 | 5 | 3 | 1
TOSEM (1994-2003) | 1 | 1 | 6 | 7 | 5 | 8 | 6 | 3
ISSTA (1999-2003) | 0 | 4 | 2 | 5 | 2 | 6 | 1 | 1
ISSTA (1994-1998) | 0 | 2 | 1 | 3 | 1 | 3 | 1 | 0
ISSTA (1994-2003) | 0 | 6 | 3 | 8 | 3 | 9 | 2 | 1
ICSE (1999-2003) | 1 | 6 | 7 | 9 | 6 | 14 | 8 | 6
ICSE (1994-1998) | 0 | 7 | 3 | 6 | 7 | 10 | 5 | 2
ICSE (1994-2003) | 1 | 13 | 10 | 15 | 13 | 24 | 13 | 8
FSE (1999-2003) | 2 | 6 | 2 | 5 | 2 | 8 | 1 | 0
FSE (1994-1998) | 0 | 2 | 2 | 3 | 2 | 4 | 2 | 0
FSE (1994-2003) | 2 | 8 | 4 | 8 | 4 | 12 | 3 | 0
ICSM (1999-2003) | 2 | 11 | 3 | 10 | 11 | 10 | 3 | 5
ICSM (1994-1998) | 3 | 3 | 1 | 3 | 3 | 2 | 2 | 1
ICSM (1994-2003) | 5 | 14 | 4 | 13 | 14 | 12 | 5 | 6
Total (1999-2003) | 5 | 36 | 26 | 47 | 30 | 57 | 23 | 19
Total (1994-1998) | 5 | 24 | 11 | 25 | 17 | 32 | 15 | 5
Total (1994-2003) | 10 | 60 | 37 | 72 | 47 | 89 | 38 | 24

As analysis of the data in the table shows, 32.7% of the studies utilize data from only one program (although this is not necessarily problematic for case studies). Also, only 44% of the studies utilize multiple versions and only 35.5% utilize fault data. Finally, the table shows that of the 107 studies, only 22.4% involved artifact sharing. This table exhibits an increasing trend in sharing from 12.5% in the early time period to 28% in the later period.
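These artifact-related percentages can be recomputed from the Total rows of Table 4.1; the short Python sketch below does so. The variable names are ours, and the single-program figure is derived by subtracting the multiple-program count from the total number of studies.

```python
# Recompute the artifact-utilization percentages from the totals in Table 4.1.
studies = 107                 # Total (1994-2003) empirical studies (Table 3.2)
multiple_programs = 72
multiple_versions = 47
fault_data = 38
shared_artifacts = 24
shared_early, studies_early = 5, 40    # 1994-1998
shared_late, studies_late = 19, 67     # 1999-2003

single_program = studies - multiple_programs   # studies using only one program

print(f"one program only:   {100.0 * single_program / studies:.1f}%")     # 32.7%
print(f"multiple versions:  {100.0 * multiple_versions / studies:.1f}%")  # 43.9% (reported as 44%)
print(f"fault data:         {100.0 * fault_data / studies:.1f}%")         # 35.5%
print(f"artifact sharing:   {100.0 * shared_artifacts / studies:.1f}%")   # 22.4%
print(f"sharing, 1994-1998: {100.0 * shared_early / studies_early:.1f}%") # 12.5%
print(f"sharing, 1999-2003: {100.0 * shared_late / studies_late:.1f}%")   # 28.4% (reported as 28%)
```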

Further investigation of this data is revealing. Of the 24 papers in which artifacts were shared among researchers, 21 use one or both of a set of programs known as the Siemens programs, or a somewhat larger program known as space. (Four of these 21 papers also use one or two other large programs, but these programs have not to date been made available to other researchers as shared artifacts.) The Siemens programs, originally introduced to the research community by Hutchins et al. [63] and subsequently augmented, reorganized, and made available as sharable infrastructure in prior efforts, consist of seven C programs, each of no more than 1000 lines of code, 132 seeded faults for those programs, and several sets of test suites satisfying various test adequacy criteria. Space, appearing initially in papers by other researchers [135, 142] and also processed and made available as sharable infrastructure in prior efforts, is a single application of nearly 10,000 lines of code, provided with various test suites and 35 actual faults. In the cases in which multiple versions of software systems are used in studies involving these programs, these versions differ only in terms of faults, rather than in terms of a set of changes of which some have caused faults; ignoring these cases, only four studies exist in which actual, realistic multiple versions of programs are utilized. This is just a starting point for infrastructure sharing; as we describe later in Section 4.2, the use of the Siemens programs and space poses several threats to validity.

4.2 Challenges for Experimentation

Researchers attempting to conduct controlled experiments examining the application of testing techniques to artifacts face several challenges. The survey of the literature just summarized provides evidence of the effects of these challenges: in particular, in the small number of controlled experiments, the small percentage of studies utilizing multiple programs, versions, and fault data, and the limited artifact sharing evident. The survey also suggests, however, that researchers are becoming increasingly willing to conduct controlled experiments, and are increasing the extent to which they utilize shared artifacts. These tendencies are related: utilizing shared artifacts is likely to facilitate controlled experimentation. The Siemens programs and space, in spite of their limitations, have facilitated a number of controlled experiments that might not otherwise have been possible. This argues for the utility of making additional infrastructure available to other researchers, as we have done. Before proceeding further, however, it is worthwhile to identify the challenges faced by researchers performing experimentation on testing techniques in the presence of limited infrastructure. Identifying such challenges provides insights into the limited progress in this area that go beyond the availability of artifacts. Furthermore, identifying these challenges has helped us define the infrastructure requirements for such experiments, and has helped us shape the design of an experiment infrastructure.


Challenge 1: Supporting replicability across experiments.


A scientific finding is not trusted unless it can be independently replicated. When performing a replication, researchers duplicate the experimental design of an experiment on a different sample to increase confidence in the findings [140], or on an extended hypothesis to evaluate additional variables [7]. Supporting replicability for controlled experiments requires establishing control over experimental factors and context; this is increasingly difficult to achieve as the units of analysis and context become more complex. When performing controlled experimentation with software testing techniques, several replicability challenges exist. First, artifacts utilized by researchers are rarely homogeneous. For example, programs may belong to different domains and have different complexities and sizes, versions may exhibit different rates of evolution, the processes employed to create programs and versions may vary, and the faults available for the study of fault detection may vary in type and magnitude. Second, artifacts are provided in widely varying levels of detail. For example, programs freely available through the open source initiative are often missing formal documentation or rigorous test suites. On the other hand, confidentiality agreements often constrain the industry data that can be utilized in published experiments, especially data related to faults and failures. Third, experiment design and process details are often not standardized or reported in sufficient detail. For example, different types of oracles may be used to evaluate technique effectiveness, different, non-comparable tools may be used to capture coverage data, and when fault seeding is employed it may not be clear who performed the activity and what process they followed.

Challenge 2: Supporting aggregation of findings.


Individual experiments may produce interesting findings, but can claim only limited validity under different contexts. In contrast, a family of experiments following a similar operational framework can enable the aggregation of findings, leading to generalization of results and further theory development. Opportunities for aggregation are highly correlated with the replicability of an experiment (Challenge 1); that is, a highly replicable experiment is likely to provide detail sufficient to determine whether results across experiments can be aggregated. (This reveals just one instance in which the relationship between challenges is not orthogonal, and in which providing support to address one challenge may impact others.) Still, even high levels of replicability cannot guarantee correct aggregation of findings unless there is a systematic capture of experimental context [110]. Such systematic capture typically does not occur in the domain of testing experimentation. For example, versions utilized in experiments to evaluate regression testing techniques may represent minor internal versions or major external releases; these two scenarios

clearly involve very distinct levels of validation. Although capturing complete context is often infeasible, the challenge is to provide enough support so that the evidence obtained across experiments can be leveraged.

Challenge 3: Reducing the cost of controlled experiments.


Controlled experimentation is expensive, and there are several strategies available for reducing this expense. For example, experiment design and sampling processes can reduce the number of participants required for a study of engineer behavior, thereby reducing data collection costs. Even with such reductions, obtaining and preparing participants for experimentation is costly, and that cost varies with the domain of study, the hypotheses being evaluated, and the applicability of multiple and repeated treatments to the same participants. Controlled experimentation in which testing techniques are applied to artifacts does not require human participants; it requires objects such as programs, versions, tests, and faults. This is advantageous because artifacts are more likely to be reusable across experiments, and multiple treatments can be validly applied across all artifacts at no cost to validity. Still, artifact reuse is often jeopardized due to several factors. First, artifact organization is not standardized. For example, different programs may be presented in different directory structures, with different build processes, fault information, and naming conventions. Second, artifacts are incomplete. For example, open source systems seldom provide comprehensive test suites, and industrial systems are often sanitized to remove information on faults and their corrections. Third, artifacts require manual handling. For example, build processes may require software engineers to configure various files, and test suites may require a tester to control execution and audit results.
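The kind of automation that removes such manual handling can be quite simple. The sketch below is purely illustrative: it assumes a hypothetical layout in which each test case is a line of command-line arguments in a "universe" file and expected outputs are stored in an oracle directory; it is not a description of any particular artifact repository or tool.

```python
# Hypothetical driver that runs a test suite against a subject program and
# audits each outcome automatically by comparing output against a saved oracle.
import subprocess
from pathlib import Path

def run_suite(executable: str, universe_file: str, oracle_dir: str) -> int:
    """Run every test listed in the universe file; return the number of failures."""
    failures = 0
    oracles = Path(oracle_dir)
    with open(universe_file) as universe:
        tests = [line.split() for line in universe if line.strip()]
    for test_id, arguments in enumerate(tests):
        result = subprocess.run([executable, *arguments],
                                capture_output=True, text=True, timeout=60)
        expected = (oracles / f"t{test_id}.out").read_text()
        if result.stdout != expected:
            failures += 1
            print(f"test {test_id}: output differs from oracle")
    return failures

# Example use (all paths are placeholders):
# run_suite("./subject_v2", "testplans/universe.txt", "oracles/v2")
```

Once a subject has been packaged in this way, rerunning the entire suite on a new version, or under a new technique, becomes a single command rather than a manual auditing effort.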

Challenge 4: Obtaining sample representativeness.


Sampling is the process of selecting a subset of a population with the intent of making statements about the entire population. The degree of representativeness of the sample is important because it directly impacts the applicability of the conclusions to the rest of the population. However, representativeness needs to be balanced against the homogeneity of the sampled artifacts in order to facilitate replication as well. Within the software testing domain, we have found two common problems for sample representativeness. First, sample size is limited. Since preparing an artifact is expensive, experiments often use small numbers of programs, versions, and faults. Further, researchers trying to reduce costs (Challenge 3) do not prepare artifacts for repeated experimentation (e.g., test suite execution is not automated). Lack of preparation for reuse limits the growth of the sample size even when the same researchers perform similar studies.

Second, samples are biased. Even when a large number of programs are collected, they usually belong to a set of similar programs. For example, as described in Section 4.1, many researchers have employed the Siemens programs in controlled experiments with testing. This set of objects includes seven programs with faults, versions, processing scripts, and automated test suites. The Siemens programs, however, each involve fewer than 1000 lines of code. Other sources of sample bias include the types of faults seeded or considered, processes used for test suite creation, and code changes considered.

Challenge 5: Isolating the effects of individual factors.


Understanding causality relationships between factors lies at the core of experimentation. Blocking and manipulating the effects of a factor increases the power of an experiment to explain causality. Within the testing domain, we have identified two major problems for controlling and isolating individual effects. First, artifacts may not offer the same opportunities for manipulation. For example, programs with multiple faults offer opportunities for analyzing faults individually or in groups, a choice that can affect the performance of testing techniques because grouping faults introduces masking effects. Another example involves whether or not automated and partitionable test suites are available; these may offer opportunities for isolating test case size as a factor. Second, artifacts may make it difficult to decouple factors. For example, it is often not clear whether program changes in a given version occurred in response to a fault, an enhancement, or both. Furthermore, it is not clear at what point the fault was introduced in the first place. As a result, the assessment of testing techniques designed to increase the detection of regression faults may be biased.

4.3 Infrastructure

We have described what we believe are the primary challenges faced by researchers wishing to perform controlled experimentation with testing techniques, challenges that have limited progress in this area. Some of these challenges involve issues of experiment design, and guidelines such as those we provide in Section 3.3 can help address them. Other challenges relate to the process of conducting families of experiments with which to incrementally build knowledge, and lessons such as those presented by Basili et al. [7] could be valuable in addressing these. All of these challenges, however, can be traced at least partly (and some primarily) to issues involving infrastructure. To address these challenges, we have been designing and constructing infrastructure to support controlled experimentation with software testing and regression testing techniques. Our infrastructure includes, first and foremost, artifacts (programs, versions, test cases, faults, and scripts) that enable researchers to perform controlled experimentation and replications.

We have been staging our artifact collection and construction efforts along two dimensions, breadth and depth, where breadth involves adding new object systems to the infrastructure, and depth involves adding new attributes to the systems currently contained in the infrastructure. Where breadth is concerned, we focused initially on systems prepared in C; however, given the increasing interest on the part of the testing community in object-oriented systems and Java, we then shifted our attention to Java systems. Where depth is concerned, we have been incrementally expanding our infrastructure to accommodate additional attributes. Ultimately, we expect to continually extend these attributes based on input from other researchers who utilize our infrastructure. Our current set of attributes, however, can be broadly categorized as follows:

System attributes: source code, and (as available) user documentation. We also store all necessary system build materials, together with information on configurations supported, and compilers and operating systems on which we have built and executed the system.

Version attributes: system releases created in the course of system enhancement and maintenance. We store versions of source code, versions of documentation, and all associated system attributes for those versions.

Test attributes: pools of input data, and test suites of various types. We store all harnesses, stubs, drivers, test classes, oracles, and materials necessary to execute test cases.

Fault attributes: fault data about the system and versions, and information on fault-revealing inputs.

Execution attributes: operational profiles of the system's execution, expected runtimes for tests or analyses, results of analysis runs, and data on test coverage obtained.

Beyond these artifacts, our infrastructure also includes documentation on the processes used to select, organize, and further set up artifacts, and supporting tools that help with these processes. Together with our plans for sharing and extending the infrastructure, these objects, documents, tools, and processes help address the challenges described in the preceding section, as summarized in Table 4.2. The following subsections provide further details on each of these aspects of our infrastructure. Further subsections then describe examples of the use of the infrastructure that have occurred, and threats to validity to consider with respect to the use of the infrastructure.

Table 4.2: Challenges and Infrastructure.


(Rows of the table list the challenges: Support Replicability, Support Aggregation, Reduce Cost, Representativeness, and Isolate Effects. Columns list the infrastructure attributes: Artifact Selection; Docs; Organization; Setup; Tools; Share, Extend. An X marks each infrastructure attribute that helps address a given challenge.)

4.3.1 Object Selection, Organization, and Setup

Our infrastructure provides guidelines for object selection, organization, and setup processes. The selection and setup guidelines assist in the construction of a sample of complete artifacts. The organization guidelines provide a consistent context for all artifacts, facilitating the development of generic experiment tools, and reducing the experimentation overhead for researchers.

Object selection. Object selection guidelines direct persons assembling infrastructure in the task of selecting suitable objects, and are provided through a set of on-line instructions that include artifact selection requirements. Thus far, we have specified two levels of required qualities for objects: first-tier required qualities (minimum lines of code required, source freely available, five or more versions available) and second-tier required qualities (runs on platforms we utilize, can be built from source, allows automation of test input application and output validation). When assembling objects, we first identify objects that meet first-tier requirements, which can be determined relatively easily; we then prioritize these and, for each, investigate second-tier requirements. Part of the object selection task involves ensuring that programs and their versions can be built and executed automatically. Because experimentation requires the ability to repeatedly execute and validate large numbers of test cases, automatic execution and validation must be possible for candidate programs. Thus, our infrastructure currently excludes programs that require graphical input/output that cannot easily be automatically executed or validated. At present we also require programs that execute, or through edits can be made to execute, deterministically; this too is a requirement for automated validation, and implies that programs involving concurrency and heavy thread use might not be directly suitable. Our infrastructure now consists of 17 C and seven Java programs, as shown in Table 4.3.

Table 4.3: Objects in our Infrastructure.


Objects        Size           No. of Versions  No. of Tests  No. of Faults  Release Status
tcas           173            1                1608          41             released
schedule2      374            1                2710          10             released
schedule       412            1                2650          9              released
replace        564            1                5542          32             released
tot_info       565            1                1052          23             released
print_tokens2  570            1                4115          10             released
print_tokens   726            1                4130          7              released
space          9564           1                13585         35             released
gzip           6582           6                217           15             released
sed            11148          5                1293          40             released
flex           15297          6                567           81             released
grep           15633          6                809           75             released
make           27879          5                1043          17             released
bash           48171          10               1168          69             released
emp-server     64396          10               1985          90             near release
pine           156037         4                288           24             near release
vim            224751         9                975           7              released
nanoxml        7646 (24)      6                217           33             released
siena          6035 (26)      8                567           3              released
galileo        15200 (79)     16               1537          0              released
ant            179827 (627)   9                150           21             released
xml-security   41404 (143)    4                14            6              released
jmeter         79516 (389)    6                28            9              released
jtopas         19341 (50)     4                11            5              released

The first eight programs listed are the Siemens programs and space, which constituted our first set of experiment objects; the remaining programs include nine larger C programs and seven Java programs, selected via the foregoing process. The other columns are as follows. The Size column presents the total number of lines of code, including comments, present in each program, and illustrates our attempts to incorporate progressively larger programs. For Java programs, the additional parenthetical number denotes the number of classes in the program. The No. of Versions column lists how many versions each program has. The Siemens programs and space are available only in single versions (with multiple faults), a serious limitation, although the availability of multiple faults has been leveraged, in experiments, to create various alternative versions containing one or more faults. Our more recently collected objects, however, are available in multiple, sequential releases (corresponding to actual field releases of the systems).

Figure 4.1: Object directory structure (top level). Each object directory contains scripts, source, versions.alt, inputs, testplans, testplans.alt, outputs, outputs.alt, traces, traces.alt, and info subdirectories.
The No. of Tests column lists the number of test cases available for the program (for multi-version programs, this is the number available for the final version). Each program has one or more types of test cases and/or one or more types of test suites (described below). Two of the Java programs (nanoxml and siena) are also provided with test drivers that invoke classes under test. The last four Java programs (ant, xml-security, jmeter, and jtopas) have JUnit test suites; these test suites were supplied with each Java program by its open source software host. The No. of Faults column indicates the total number of faults available for each of the programs; for multi-version programs we list the sum of faults available across all versions. The Release Status column indicates the current release status of each object as either "released" or "near release". The Siemens programs and space, as detailed above, have been provided to and used by many other researchers, so we categorize them as released. Emp-server and pine are undergoing final formatting and testing and thus are listed as near release. The rest of the programs listed are now available in our infrastructure repository, and we also categorize them as released.

Our object selection process helps provide consistency in the preparation of artifacts, supporting replicability. The same process also reduces costs by discarding, early in the process, artifacts that are not likely to meet the experimental requirements. Finally, the selection mechanism lets us adjust our sampling process to facilitate the collection of a representative set of artifacts.

Object organization. We organize objects and associated artifacts into a directory structure that supports experimentation. Each object we create has its own object directory, as shown in Figure 4.1. An object directory is organized into specific subdirectories (which in turn may contain subdirectories), as follows.

The scripts directory is the staging platform from which experiments are run; it may also contain saved scripts that perform object-related tasks.

The source directory is a working directory in which, during experiments, the program version being worked with is temporarily placed.

The versions.alt directory contains various variants of the source for building program versions. Variants are alternative copies of the object, created when the initial copy will not function for some particular purpose, but when a somewhat modified alternative will. For example: the basic variant, which every object has, is contained in a subdirectory of versions.alt called versions.orig, which contains subdirectories v0, v1, ..., vk, where v0 is the earliest version and the other vj contain the next sequential versions in turn. A second variant provides faults; for this purpose a directory that may exist is versions.seeded, which also contains subdirectories v1, v2, ..., vk. In this directory each vk area contains the .c and .h files (or .java files) needed to build the version with some number of seeded faults inserted, and a file FaultSeeds.h, which contains all declarations needed to define all the faults. A third general class of variants is created to handle cases in which some prototype analysis tool is not robust enough to process a particular syntactic construct; in such cases a variant of a program may be created in which that construct is replaced by a semantically equivalent alternative. In the case of Java programs, if a program consists of applications and components, then the versions directory itself is subdivided into multiple subdirectories for these applications and components; each subdirectory contains variants such as versions.orig and versions.seeded subdirectories.

The inputs directory contains files containing inputs, or directories of inputs used by various test cases.

The testplans.alt directory contains subdirectories v0, v1, ..., vk, each of which contains testing information for a system version; this information typically includes a universe file containing a pool of test cases, and various test suites drawn from that pool. We organize the vj subdirectories into four types of files and subdirectories:

General files. These are .universe, .tsl, and .frame files. A universe file (.universe extension) is a file listing test cases. An automated test-script generation program transforms universe files into various types of scripts that can be used to automatically execute the test cases or gather traces for them.

TSL and frame files facilitate the use of black-box test cases designed using the category-partition method (described further in Section 4.3.1). TSL files (.tsl extension) are named vk.tsl (k = 0, 1, 2, ...) for different versions. The sets of test frames generated from .tsl files (.frame extension) are named vk.frame, in reference to their corresponding TSL (vk.tsl) files.

Link files. These are files that link to previous versions of general files. These links allow the inheriting of testing materials from a prior version, preventing multiple copies.

Testsuites subdirectories. Some objects have directories containing various test suites built from the universe files. For example, the Siemens programs each have test suites containing randomly selected test cases from the universe file, test suites that are statement-coverage adequate, and test suites that are both statement-coverage adequate and minimal in size.

The testscripts subdirectory. If the test cases require startup or exit activities prior to or after execution, scripts encoding these are stored in a testscripts subdirectory.

The traces.alt directory contains subdirectories v0, v1, ..., vk, each holding trace information for a version of the system, in the form of individual test traces or summaries of coverage information.

The outputs.alt directory permanently stores the outputs of test runs, which is especially useful when experimenting with regression testing, where outputs are compared against previous outputs.

The testplans, outputs, and traces directories serve as staging platforms during specific experiments. Data from a specific testplans.alt subdirectory is placed into the testplans directory prior to experimentation; data from the outputs and traces directories is placed into subdirectories of their corresponding .alt directories following experimentation.

The info directory contains additional information about the program, especially information gathered by analysis tools and worth saving for experiments, such as fault-matrix information (which describes the faults that various test cases reveal).

Java objects may contain a testdrivers directory that contains test drivers that invoke the application or its components.

Our object organization supports consistent experimentation conditions and environments, allowing us to write generic tools for experimentation that know where to find things, and that function across all of our objects (or a significant subset, such as those written in Java). This in turn helps reduce the costs of executing and replicating controlled experiments, and of aggregating results across experiments.

The use of this structure can potentially limit external validity by restricting the types of objects that can be accommodated, and the transformation of objects to fit the infrastructure can create some internal validity threats. However, the continued use of this structure and the generic tools it supports ultimately reduces a large class of potential threats to internal validity arising from errors in automation, by facilitating cross-checks on tools and leveraging previous tool validation efforts. The structure also accommodates objects with various types and classes of artifacts, such as multiple versions, fault types, and test suites, enabling us to control for and isolate individual effects in conducting experimentation.

Object setup.

Test suites. Systems we have selected for our repository have only occasionally arrived equipped with anything more than rudimentary test suites. When suites are provided, we incorporate them into our infrastructure because they are useful for case studies. For controlled experiments, however, we prefer to have test suites created by uniform processes. Such test suites can also be created in ways that render them partitionable, facilitating studies that isolate factors such as test case size, as mentioned in Section 4.2 (Challenge 5). To construct test suites that represent those that might be constructed in practice for particular programs, we have relied primarily on two general processes, following the approach used by Hutchins et al. [63] in their initial construction of the Siemens programs. The first process involves specification-based testing using the category-partition method, based on a test specification language, TSL, described in [101]. A TSL specification is written for an initial version of an object, based on its documentation, by persons who have become familiar with that documentation and the functionality of the object. Subsequent versions of the object inherit this specification, or most of it, and may need additional test cases to exercise new functionality, which can be encoded in an additional specification added to that version, or in a refined TSL specification. TSL specifications are processed by a tool, provided with our infrastructure, into test frames, which describe the requirements for specific test cases. Each test case is created and encoded in its proper place within the object directory. The second process we have used involves coverage-based testing, in which we instrument the object program, measure the code coverage achieved by specification-based test cases, and then create test cases that exercise code not covered by them. Employing these processes using multiple testers helps reduce threats to validity involving the specific test cases that are created. Creating larger pools of test cases in this fashion and sampling them to obtain various test suites, such as test suites that achieve branch coverage or test suites of specific sizes, provides further assistance with generalization. We store such suites with the objects along with their pools of test cases.
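To make the sampling step concrete, the following sketch shows one simple way a coverage-adequate test suite might be drawn from a pool of test cases using a greedy selection over per-test coverage data. This is an illustrative sketch only, not one of the infrastructure's actual tools; the class name, the use of integer branch IDs, and the example data in main are all hypothetical.

```java
import java.util.*;

/**
 * Illustrative sketch (not an infrastructure tool): greedily samples a
 * coverage-adequate test suite from a pool, given the set of branches
 * (or blocks) each test case covers.
 */
public class CoverageAdequateSampler {

    /** Returns a subset of the pool that covers every branch coverable by the pool. */
    public static List<String> sampleAdequateSuite(Map<String, Set<Integer>> coverage) {
        Set<Integer> uncovered = new HashSet<>();
        for (Set<Integer> branches : coverage.values()) {
            uncovered.addAll(branches);          // all branches covered by at least one test
        }
        List<String> suite = new ArrayList<>();
        while (!uncovered.isEmpty()) {
            String best = null;
            int bestGain = 0;
            for (Map.Entry<String, Set<Integer>> e : coverage.entrySet()) {
                Set<Integer> gain = new HashSet<>(e.getValue());
                gain.retainAll(uncovered);       // branches this test would newly cover
                if (gain.size() > bestGain) {
                    bestGain = gain.size();
                    best = e.getKey();
                }
            }
            suite.add(best);
            uncovered.removeAll(coverage.get(best));
        }
        return suite;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> coverage = new LinkedHashMap<>();
        coverage.put("t1", new HashSet<>(Arrays.asList(1, 2, 3)));
        coverage.put("t2", new HashSet<>(Arrays.asList(3, 4)));
        coverage.put("t3", new HashSet<>(Arrays.asList(4, 5)));
        System.out.println(sampleAdequateSuite(coverage));  // e.g., [t1, t3]
    }
}
```

Repeating such a sampling procedure with different random orderings or size constraints is one way to obtain the multiple suite types stored with each object.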

In addition to the two types of test suites just described, for our Java objects we are also beginning to provide JUnit test suites (and have done so for ant, xml-security, jmeter, and jtopas). JUnit [69] is a Java testing framework that allows automation of tests for classes, and that is increasingly being used with Java systems. As noted in our description of Table 4.3, the JUnit test suites were retrieved with each Java program from its open source software host. At present, not all of our objects possess equivalent types of test cases and test suites, but one goal in extending our infrastructure is to ensure that specific types of test cases and test suites are available across all objects to which they are applicable, to aid with the aggregation of findings. A further goal, of course, is to provide multiple instances and types of test suites per object, a goal that has been achieved for the Siemens programs and space, allowing the completion of several comparative studies. Meeting this goal will be further facilitated through sharing of the infrastructure and collaboration with other researchers.

Faults. For studies of fault detection, we have provided processes for two cases: the case in which naturally occurring faults can be identified, and the case in which faults must be seeded. Either possibility presents advantages and disadvantages: naturally occurring faults are costly to locate and typically cannot be found in large numbers, but they represent actual events. Seeded faults can be costly to place, but can be provided in larger numbers, allowing more data to be gathered than otherwise possible, though with less external validity. To help with the process of hand-seeding faults, and to increase the potential external validity of results obtained on these faults, we insert faults by following fault localization guidelines such as those shown in Figure 4.2 (excerpted from our infrastructure documentation). These guidelines provide direction on fault placement. We also provide fault classifications based on published fault data (such as the simple one present in the figure), so that faults will correspond, to the extent possible, to faults found in practice. To further reduce the potential for bias, fault seeding is performed independently of experimentation, by multiple persons with sufficient programming experience who do not possess knowledge of specific experiment plans. Another motivation for seeding faults occurs when experimentation concerned with regression testing is the goal. For regression testing, we wish to investigate errors caused by code change (regression faults). With the assistance of a differencing tool, fault seeders locate code changes, and place faults within those. Fault seeding can also be performed using program mutation [19, 94]. Program mutation can produce large numbers of faults at low cost, and a recent study [2] indicates that mutation faults can in fact be representative of real faults. If these results generalize, then we can extend the validity of experimental results by using mutation, and the large number of faults that result can yield data sets on which statistically significant conclusions can be obtained, facilitating the understanding of causality (Section 5.3 of this dissertation investigates this issue further, and provides some further evidence of generalizability).

Figure 4.2: Fault localization guidelines for C programs.

1. When the goal is to seed regression faults in versions, use a differencing tool to determine where the changes occurred. If the changes between versions are large, the tool may carry differences forward, even when the code is identical. Verify the differences independently and use diff only as a guide.

2. The fault seeding process needs to reflect real types of faults. A simple fault classification scheme can be helpful for thinking about different types of faults: (a) faults associated with variables: definition of variable, redefinition of variable, deletion of variable, change of the value of a variable in an existing assignment statement; (b) faults associated with control flow: addition of a new block of code, deletion of a path, removal of a block, redefinition of an execution condition, change in order of execution, new call to an external function, removal of a call to an external function, addition of a function, removal of a function; (c) faults associated with memory allocation: allocated memory not freed or not initialized, or erroneous use of pointers.

3. When creating regression faults, assume that the programmer who made the modification inserted the fault, and simulate this behavior by inserting artificial faults into the modified code. There is no restriction on the size or type of fault at this stage. The only requirement is that the changed program can still be compiled and executed.

4. Since more than two programmers perform the seeding process independently, some faults may be repeated. Repeated faults must be removed after all faults have been inserted. (Although the programmers could work together to avoid overlap, this would hurt the validity and credibility of the process.) The modified code from multiple programmers needs to be merged. This should be a simple and short (maximum 10 modifications) cut/paste process, but needs to be done carefully. Make sure you compile, link, and test your program.

5. Next, run the test suite on each version to perform further filtering. Filter out two types of faults: (a) faults that are exposed by more than 20% of the tests, because if they are introduced, they are likely to be detected during unit testing; and (b) faults that are not detected by any tests.

6. Keep only those faults that have not been filtered out. Test the program and then move on to the next version.
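As an illustration of the filtering step in guideline 5 of Figure 4.2, the following sketch assumes a fault matrix mapping each fault to the set of tests that expose it; the class name, method name, and example data are hypothetical, and the 20% threshold is taken directly from the guideline.

```java
import java.util.*;

/**
 * Illustrative sketch of the filtering step in guideline 5: keep only faults
 * that are detected by at least one test and by no more than 20% of all tests.
 */
public class FaultFilter {

    public static Map<String, Set<String>> filter(Map<String, Set<String>> faultMatrix,
                                                  int totalTests) {
        Map<String, Set<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : faultMatrix.entrySet()) {
            int detecting = e.getValue().size();
            boolean tooEasy = detecting > 0.20 * totalTests; // likely caught in unit testing
            boolean undetected = detecting == 0;             // no test reveals it
            if (!tooEasy && !undetected) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> matrix = new LinkedHashMap<>();
        matrix.put("F1", new HashSet<>(Arrays.asList("t1", "t7")));                   // kept
        matrix.put("F2", new HashSet<>());                                            // dropped: undetected
        matrix.put("F3", new HashSet<>(Arrays.asList("t1", "t2", "t3", "t4", "t5"))); // dropped: too easy
        System.out.println(filter(matrix, 20).keySet());  // prints [F1]
    }
}
```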

Such studies can then be coupled with studies using other fault types to address external validity concerns. Similarly, in the future, other researchers might be interested in other types of seeded faults, such as integration faults or faults related to side-effects. By defining appropriate mutation operators or tables of fault types, these researchers can simulate these fault types, allowing studies of their detection. For example, Delamaro et al. [29] have assessed the adequacy of tests for integration testing by mutating interfaces, in ways that simulate integration faults. Faults of these types can then also be incorporated into our infrastructure. All of the objects in our infrastructure that are listed in Table 4.3, except space, contain seeded faults; space contains 35 real faults that were identified during the testing and operational use of the program.

4.3.2 Documentation and Supporting Tools

Documentation and guidelines supplied with our infrastructure provide detailed procedures for object selection and organization, test generation, fault localization, tool usage, and current object descriptions. The following materials are available:

C (Java) Object Handbook: these documents describe the steps we follow to set up a typical, less than 30K LOC C program (or a typical 5-15K LOC Java program) as an experiment object. They are written primarily as sets of instructions to persons who have or wish to set up these objects, but they are also useful as an aid to understanding the object setup, and choices made in that setup.

C (Java) Object Download: these web pages provide access to the most recent releases of each of our C (Java) objects, in tarred, gzipped directories.

C (Java) Object Biographies: these web pages provide information about each of the C (Java) objects we currently make available, including what they are and how they were prepared.

Tools: this web page describes and provides download capabilities for the tools. Short descriptions of the supporting tools are as follows:

tsl: this program generates test frames based on a TSL specification.

javamts/mts: these programs generate test scripts of various forms and functionalities from universe files.

gen-fault-matrix: this module contains various scripts needed to generate a fault matrix file (a file relating faults to the test cases that expose them) for an object; an illustrative sketch of such a computation appears below, after the tool descriptions.

adi: this tool produces a list of functions changed across two versions of a C object.

Reporting Problems: this web page provides instructions regarding how to report problems in the infrastructure.

As suggested in Section 4.2, guidelines such as those we provide support sharing (and thus cost reduction), as well as facilitating replication and aggregation across experiments. Documentation and guidelines are thus as important as objects and associated artifacts. Depending on the research questions being investigated, experiment designs and processes used for testing experiments can be complex and require multiple executions, so automation is important. Our infrastructure provides a set of testing tools that build scripts that execute tests automatically, gather traces for tests, generate test frames based on TSL specifications, generate mutation faults, and generate fault matrices for objects. These tools make experiments simpler to execute, and reduce the possibility of human errors such as typing mistakes, supporting replicability as well. The automated testing tools function across all objects, given the uniform directory structure for objects; thus, we can reuse these tools on new objects as they are completed, reducing the costs of preparing such objects.
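The gen-fault-matrix scripts themselves are not reproduced here; the following Java sketch shows one way such a matrix could be computed, under the simplifying assumption that a test exposes a fault exactly when the output of the version containing that fault differs from the output of the base version on that test. The class name, method signatures, and example data are hypothetical.

```java
import java.util.*;

/**
 * Illustrative sketch of fault-matrix computation by output comparison
 * (not the actual gen-fault-matrix scripts, whose details differ).
 */
public class FaultMatrixSketch {

    /**
     * @param baseOutputs   test name -> output of the base (fault-free) version
     * @param faultyOutputs fault name -> (test name -> output of the version with that fault)
     * @return fault name -> set of test names that expose the fault
     */
    public static Map<String, Set<String>> computeMatrix(
            Map<String, String> baseOutputs,
            Map<String, Map<String, String>> faultyOutputs) {
        Map<String, Set<String>> matrix = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> fault : faultyOutputs.entrySet()) {
            Set<String> exposing = new TreeSet<>();
            for (Map.Entry<String, String> test : baseOutputs.entrySet()) {
                String faultyOut = fault.getValue().get(test.getKey());
                if (faultyOut != null && !faultyOut.equals(test.getValue())) {
                    exposing.add(test.getKey());   // output differs: test exposes this fault
                }
            }
            matrix.put(fault.getKey(), exposing);
        }
        return matrix;
    }

    public static void main(String[] args) {
        Map<String, String> base = new LinkedHashMap<>();
        base.put("t1", "ok");
        base.put("t2", "ok");
        Map<String, Map<String, String>> faulty = new LinkedHashMap<>();
        faulty.put("F1", Map.of("t1", "ok", "t2", "crash"));
        System.out.println(computeMatrix(base, faulty));  // {F1=[t2]}
    }
}
```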

4.3.3 Sharing and Extending the Infrastructure

Our standard object organization and tool support help make our infrastructure extensible; objects that meet our requirements can be assembled using the required formats and tools. This is still an expensive process, but in the long run such extension will help us achieve sample representativeness, and help with problems in replicability and aggregation as discussed in Section 4.2. In our initial infrastructure construction, we have focused on gathering objects and artifacts for regression testing studies, and on facilitating this with faults, multiple versions, and tests. Such materials can also be used, however, for experimentation with testing techniques generally, and with other program analysis techniques. Still, we intend that our infrastructure be extended through the addition of objects with other types of associated artifacts, such as may be useful for different types of controlled experiments. For example, one of our Java objects, nanoxml, is provided with UML statechart diagrams, and this would facilitate experimentation with UML-based testing techniques. Extending our infrastructure can be accomplished in two ways: by our research group, and through collaboration with other research groups. To date we have proceeded primarily through the first approach, but the second has many benefits. First, it is cost-effective, mutually leveraging the efforts of others. Second, through this approach we can achieve greater diversity among objects and associated artifacts, which will be important in helping to increase sample size and achieve representativeness. Third,

sharing implies more researchers inspecting the artifact setup, tools, and documentation, reducing threats to internal validity. Ultimately, collaboration in constructing and sharing infrastructure can help us contribute to the growth in the ability of researchers to perform controlled experimentation on testing in general. As mentioned earlier, other researchers have made the Siemens and space infrastructure available, on request, for several years. Web pages that provide this infrastructure, together with all of the more recently created infrastructure described in this chapter, and all of the programs listed in Table 4.3 with the exception of those listed as near release, have been created. These pages reside at http://esquared.unl.edu/sir/, which provides a password-protected portal to any researchers who request access to the infrastructure, with the proviso that they agree to report to us any experiences that will help us to improve the infrastructure.

4.3.4 Examples of the Infrastructure Being Used

In this section we present three examples of our infrastructure being used by other researchers and ourselves. The first two cases describe examples involving the C programs and their artifacts, and the third case is an example of the use of Java programs and their artifacts. (This third example is described in detail in Section 5.1; we summarize it here for completeness.) In presenting each example we address the following questions: What problem did the researchers wish to address? What types of artifacts were needed to investigate this problem? What did the researchers take from our infrastructure? What did the researchers learn from their study?

Example 1: Improving test suites via operational abstraction. Harder et al. [55] present the operational difference technique for improving test suites using augmentation, minimization, and generation processes. The authors evaluated improved test suites by comparing them with other techniques in terms of the fault detection ability and code coverage of the test suites. To do this, the authors required objects that have test suites and faults. They selected eight such C objects from our infrastructure: the Siemens programs and space. They describe why they selected these programs for their experiment: the programs are well understood from previous research, and no other programs that have human-generated tests and faults were immediately available. Through their experiment the authors discovered that their technique produced test suites that were smaller, and slightly more effective at fault detection, than branch coverage suites.

Example 2: Is mutation an appropriate tool for testing experiments? Andrews et al. [2] investigate whether automatically generated faults (mutants) can be used to assess the fault detection effectiveness of testing techniques. To answer their research question, they compared the fault detection ability of test suites on hand-seeded, automatically generated (mutation), and real-world faults. For their experiment, they required objects that had test suites and faults. Similar to Example 1, they also used eight C programs: the Siemens programs and space. Since the Siemens programs have seeded faults and space contains real faults, the only additional artifacts they needed to obtain were automatically generated faults (mutants). The authors generated mutants over the C programs using a mutation generation tool. The reason the authors chose these programs was that they considered the associated artifacts to be mature due to their extensive usage in experiments. The authors compared the adequacy ratio of test suites in terms of mutants and faults. Their analysis suggests that mutants can provide a good indication of the fault detection ability of a test suite; generated mutants were similar to the real faults in terms of fault detection, but different from the hand-seeded faults.

Example 3: Empirical studies of test case prioritization in a JUnit testing environment. In [35] we investigate the effectiveness of prioritization techniques on Java programs tested using JUnit test cases. We measured the effectiveness of prioritization techniques using the prioritized test suites' rate of fault detection. To answer our research questions, we required Java programs that provided JUnit test suites and faults. The four Java programs (ant, jmeter, xml-security, and jtopas) from our infrastructure have the required JUnit test cases and seeded faults. Through our experiment, we found that test case prioritization can significantly improve the rate of fault detection of JUnit test suites, but the experiment also reveals differences with respect to previous studies that can be related to the language and testing paradigm.

4.3.5 Threats to Validity: Things to Keep in Mind When Using the Infrastructure

As mentioned in Section 4.3.3, sharing our infrastructure can bring many benefits to both ourselves and other researchers, but it may also introduce other problems, such as the possibility that users of the infrastructure might misinterpret some artifacts or object organization mechanisms. This, in turn, can generate experiments with misleading findings. We believe that our description of the infrastructure organization and its artifacts is detailed enough to limit misuse and misinterpretation, and in practice many researchers are using the infrastructure. Using the infrastructure also demands caution of its users; they must read documentation and follow directions carefully.

Extending the infrastructure by collaborating with other researchers also introduces potential threats to validity (internal and construct). First, to extend the infrastructure, we need to motivate others to contribute. It is not easy to convince people to do this, because it requires extra effort to adjust their artifacts to our standard format, and some researchers may not be willing to share their artifacts. We expect, however, that our direct collaborators will contribute to the infrastructure in the next phase of its expansion, and they, in turn, will bring more collaborators who can contribute to the infrastructure. Second, if people contribute their artifacts, then we need a way to check the quality of the artifacts contributed. We expect the primary responsibility for quality to lie with contributors, but again, by sharing contributed artifacts we can reduce this problem, since researchers will inspect artifacts as they use them. Another potential problem with our infrastructure involves threats to the external validity of experiments performed using it, since the artifacts we provide have not been chosen by random selection. The infrastructure we are providing is not intended to be a benchmark; rather, we are creating a resource to support experimentation. We hope, however, to increase the external validity of experimentation using our infrastructure over time by providing a larger pool of artifacts through continuous support from our research group and collaboration with other researchers.

4.4 Conclusion

We have presented our infrastructure for supporting controlled experimentation with testing techniques, and we have described several of the ways in which it can potentially help address many of the challenges faced by researchers wishing to conduct controlled experiments on testing. This section provides additional discussion of the impact, both demonstrated and potential, of our infrastructure to date. In our review of the literature, we have found no similar organized provision of artifacts for controlled experimentation in software testing. On the one hand, the willingness of other researchers to use the Siemens and space artifacts attests to the potential for infrastructure, once made available, to have an impact on research. On the other hand, this same willingness also illustrates the need for improvements to infrastructure, given that the Siemens and space artifacts represent only a small sample of the population of programs, versions, tests, and faults. It seems reasonable, then, to expect our extended infrastructure to be used for experimentation by others, and to help extend the validity of experimental results through widened scope. Indeed, we ourselves have been able to use several of the newer infrastructure objects that are about to be released in controlled experiments described in recent publications [33, 35, 38, 39, 86, 116]. In terms of impact, it is also worthwhile to discuss the costs involved in preparing infrastructure; it is these costs that we save when we reuse infrastructure. For example, the emp-server and bash objects required between 80 and 300 person-hours

per version to prepare; two faculty and five graduate research assistants were involved in this preparation. The flex, grep, make, sed, and gzip programs involved two faculty, three graduate students, and five undergraduate students; these students worked 10-20 hours per week on these programs for between 20 and 30 weeks. Such costs are not typically affordable by individual researchers; it is only by amortizing them over the potential controlled experiments that can follow that we render them acceptable. Finally, there are several additional potential benefits to be realized through sharing of infrastructure in terms of challenges addressed; these translate into a reduction of threats to validity that would exist were the infrastructure not shared. By sharing our infrastructure with others, we have been able to receive feedback that will improve it. User feedback will allow us to improve the robustness of our tools and the clarity and completeness of our documentation, enhancing the opportunities for replication of experiments, aggregation of findings, and manipulation of individual factors. More important for this dissertation, building infrastructure enabled us to proceed with a family of controlled experiments investigating better evaluation models and methodologies for regression testing techniques. We present these controlled experiments, which involve various artifacts from the infrastructure described in this chapter, in greater detail in the chapters that follow.


Chapter 5 Empirical Studies of Regression Testing Techniques


Concurrent with our construction of infrastructure, we conducted three initial empirical studies of regression testing techniques, considering different types of artifacts and techniques. These empirical studies focus on systems written in Java. Through these initial empirical studies, we identify problems that require improvements to infrastructure and empirical methodologies, and important factors to consider when assessing regression testing techniques. The following subsections present these studies. The results presented here appeared previously in [33, 35, 98]; our discussion in this chapter draws on the results from those papers.

5.1 Empirical Studies of Test Case Prioritization in a JUnit Testing Environment

We have created infrastructure to allow us to consider Java programs, but to validate that infrastructure as a tool for assessing regression testing techniques on Java systems, we wanted to use it in an initial study. A natural first study topic, given the large amount of data available on test case prioritization for C, is how test case prioritization for Java compares. We therefore set out to design a study replicating earlier C studies, but using our Java infrastructure.

5.1.1 Study Overview

Numerous empirical studies have shown that prioritization can improve a test suite's rate of fault detection, but the extent to which these results generalize is an open question, because the studies have all focused on systems developed in C and on a few specific types of test suites. In particular, Java and the JUnit testing framework are being used extensively in practice, and the effectiveness of prioritization techniques on Java systems tested under JUnit has not been investigated. We therefore designed

and performed a controlled experiment examining whether test case prioritization can be effective on Java programs tested under JUnit, and comparing the results to those achieved in earlier studies.

5.1.2 JUnit Testing and Prioritization

JUnit testing involves Java classes that contain one or more test methods and that are grouped into test suites, as shown in Figure 5.1. The figure presents a simple hierarchy having only a single test-class level, but the tree can extend deeper through additional nesting of Test Suites. The leaf nodes in such a hierarchy, however, always consist of test methods, where a test method is a minimal unit of test code.
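The objects in this study use JUnit 3-style tests; the following minimal example illustrates the hierarchy of Figure 5.1, with a top-level suite grouping test classes that each contain test methods. The class names are invented for illustration only.

```java
import junit.framework.Test;
import junit.framework.TestCase;
import junit.framework.TestSuite;

// A hypothetical test class: each testXxx method is a test-method level test case.
class StackTest extends TestCase {
    public void testPush() { /* ... */ }
    public void testPop()  { /* ... */ }
}

// Another hypothetical test class.
class QueueTest extends TestCase {
    public void testEnqueue() { /* ... */ }
}

// The top-level Test Suite groups test classes (the test-class level of Figure 5.1).
public class AllTests {
    public static Test suite() {
        TestSuite suite = new TestSuite("All tests");
        suite.addTestSuite(StackTest.class);  // contributes testPush, testPop
        suite.addTestSuite(QueueTest.class);  // contributes testEnqueue
        return suite;
    }
}
```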
Figure 5.1: JUnit test suite structure. (The figure depicts a Test Suite composed of test classes, the test-class level, each of which contains test methods, the test-method level.)
JUnit test classes can be run individually. Running individual test classes is reasonable for small programs, but for programs having large numbers of test classes it can be expensive, because each independent execution of a test class incurs startup costs. Thus in practice, developers design JUnit test suites that invoke sequences of test classes. Studies of C programs have shown that choices in test suite granularity (the number and size of the test cases making up a test suite) can affect the cost of executing test suites and the cost-effectiveness of prioritization techniques [117]. As the test hierarchy presented in Figure 5.1 shows, a test suite granularity choice exists for JUnit tests as well: we can prioritize JUnit tests at the test-class level, or at the test-method level, ignoring groupings into classes. (Prioritization at the test-method level assumes that test methods have no dependencies on one another and thus can be ordered arbitrarily; this is typical practice in JUnit test design, and is also true of the programs that we study empirically in this work.) We refer to these levels of focus as coarse granularity and fine granularity, respectively. Following [117], we refer to the minimal units of testing code that can be defined and executed independently at a particular level of focus as test cases. We then describe test cases at the coarse-granularity level of JUnit TestCase classes as test-class level test cases, and test cases at the fine-granularity level of JUnit methods within TestCase classes as test-method level test cases.

In this work, we wish to investigate the effects of fine- versus coarse-granularity choices in test design on JUnit prioritization, and to do this, we need to ensure that the JUnit framework allows us to achieve the following four objectives:

1. identify and execute each JUnit TestCase class individually;
2. reorder TestCase classes to produce a (test-class level) prioritized order;
3. identify and execute each individual test method within a JUnit TestCase class individually;
4. reorder test methods to produce a (test-method level) prioritized order.

Objectives 1 and 2 can be trivially achieved, because the default unit of test code that can be specified for execution in the JUnit framework is a TestCase class. Thus it is necessary only to extract the names of all TestCase classes invoked by the top-level TestSuite for the object program (a simple task, repeated iteratively if Test Suites are nested in other Test Suites) and then execute them individually with the JUnit test runner in a desired order. Objectives 3 and 4 are more difficult to achieve, because a TestCase class is also the minimal unit of test code that can be specified for execution in the normal JUnit framework. Since a TestCase class can define multiple test methods, all of which are executed when that TestCase is specified for execution, providing the ability to treat individual methods across a set of TestCase classes as test cases required us to extend the JUnit framework. Since the fundamental purpose of the JUnit framework is to discover and execute test methods defined in TestCase classes, the problem of providing test-method level testing reduces to the problem of uniquely identifying each test method discovered by the framework and making it available for individual execution by the tester. We addressed this problem by subclassing various components of the framework and inserting mechanisms for assigning numeric test IDs to each test method identified. Within the JUnit framework, a single test executor class is responsible for the actual execution and collection of the outcomes of all JUnit test methods. To implement ordinal test selection, we introduced a new subclass of the test executor that accepts a list of selected test numbers and uses an internal counter to track the current test number. By checking the current test number against the list of selected tests prior to each test, the extended test executor can determine whether any given test is to be run. A new test runner (SelectiveTestRunner) instantiates an instance of the extended test executor in place of the standard test executor. The relationship between our extensions and the existing JUnit framework is shown in Figure 5.2, which also shows how the JUnit framework is related to the Galileo system [50] for analyzing Java bytecode, which we used to obtain coverage information for use in prioritization (in subsequent versions, Galileo has been renamed Sofya). Our new SelectiveTestRunner is able to access test cases individually using numeric test IDs.
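The actual SelectiveTestRunner and test executor extensions are not reproduced here; the following simplified sketch, written against the JUnit 3 API, illustrates the underlying idea of discovering test methods, assigning numeric IDs in discovery order, and executing a selected subset in a specified order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import junit.framework.Test;
import junit.framework.TestResult;
import junit.framework.TestSuite;

/**
 * Simplified sketch (not the actual SelectiveTestRunner implementation) of how
 * individual JUnit 3 test methods can be numbered and executed selectively.
 */
public class SelectiveRunnerSketch {

    /** Recursively flattens a suite into its leaf tests (individual test methods). */
    static void flatten(Test test, List<Test> leaves) {
        if (test instanceof TestSuite) {
            Enumeration<?> e = ((TestSuite) test).tests();
            while (e.hasMoreElements()) {
                flatten((Test) e.nextElement(), leaves);
            }
        } else {
            leaves.add(test);  // a single test method; its ID is its position in this list
        }
    }

    /** Runs the tests whose IDs appear in 'order', in exactly that order. */
    static TestResult runInOrder(Test topLevelSuite, int[] order) {
        List<Test> leaves = new ArrayList<Test>();
        flatten(topLevelSuite, leaves);
        TestResult result = new TestResult();
        for (int id : order) {
            leaves.get(id).run(result);   // execute the selected test method
        }
        return result;
    }
}
```

For example, given the AllTests suite sketched earlier, runInOrder(AllTests.suite(), new int[]{2, 0}) would execute the test methods assigned IDs 2 and 0, in that order.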

Figure 5.2: JUnit framework and Galileo. (The figure shows the JUnit framework's TestRunner, our prioritization extensions' SelectiveTestRunner, and Galileo's Instrumentor and JUnitFilter, which handles generation of traces from instrumented code, together with the object's JUnit test classes.)
To implement prioritization at the test-method level we also needed to provide a way for the test methods to be executed in a tester-specified order. Because the JUnit framework must discover the test methods, and our extensions assign numeric IDs to tests in the order of discovery, executing the test cases in an order other than the one in which they are provided requires that all test cases be discovered prior to execution. We accomplished this by further extending the executor subclass to perform a simple two-pass technique. When the test executor is instructed to operate in prioritizing mode, the first time it is asked to execute a test suite, it simply passes through all of the test classes (and test methods) in that suite and creates a mapping from test numbers to corresponding test methods. The test runner then issues a special callback to the test executor, at which time the tests are retrieved from the map and executed in exactly the order in which the list of tests is provided to the test executor. In practice, this technique introduces minimal overhead, because the first pass by the test executor is extremely fast.

5.1.3 Experiments

5.1.3.1 Research Questions

We wish to address the following research questions:

RQ1: Can test case prioritization improve the rate of fault detection of JUnit test suites?

RQ2: How do the three types of information and information use that distinguish prioritization techniques (type of coverage information, use of feedback, and use of modification information) impact the effectiveness of prioritization techniques?

RQ3: Can test suite granularity (the choice of running test-class level versus test-method level JUnit test cases) impact the effectiveness of prioritization techniques?

In addition to these research questions, we examine whether test case prioritization results obtained from systems written in an object-oriented language (Java) and using JUnit test cases reveal different trends than those obtained from systems written in a procedural language (C) using traditional coverage- or requirements-based system test cases. To address our questions we designed several controlled experiments. The following subsections present, for these experiments, our objects of analysis, independent variables, dependent variables and measures, experiment setup and design, threats to validity, and data and analysis.

5.1.3.2 Objects of Analysis

We used four Java programs as objects of analysis: ant, xml-security, jmeter, and jtopas. Ant is a Java-based build tool [3]; it is similar to make, but instead of being extended with shell-based commands it is extended using Java classes. Jmeter is a Java desktop application designed to load-test functional behavior and measure performance [66]. Xml-security implements security standards for XML [143]. Jtopas is a Java library used for parsing text data [68]. Several sequential versions of each of these programs were available in our infrastructure and were selected for these experiments. Table 5.1 lists, for each of our objects, Versions (the number of versions), Classes (the number of classes), KLOCs (the total number of thousands of lines of code, excluding comments), Test Classes (the number of JUnit test-class level test cases), Test Methods (the number of JUnit test-method level test cases), and Faults (the number of faults). The number of classes (KLOCs) corresponds to the total number of class files (lines of code) in the final version. The numbers for test classes (test methods) list the number of test classes (test methods) in the most recent version. The number of faults indicates the total number of faults available for each of the objects (explained further in Section 5.1.3.4). To help assess the representativeness of the objects we selected, we collected size metrics on a number of popular open source Java programs available through the SourceForge [130] and Apache Jakarta [65] project web sites. Using all the projects available through Jakarta, and the Java projects in the top 50 most popular on SourceForge, we obtained 27 projects and measured the numbers of classes and lines of code in those programs. The programs ranged in size from 13 to 1168 classes, and from 2.9 to 157.9 KLOCs, containing 333 classes and 41.2 KLOCs on average. The objects we use in these experiments have a similar average size and range of sizes, and thus are similar in terms of size to a significant class of Java programs.

Table 5.1: Experiment objects.


Objects       Versions  Classes  KLOCs  Test Classes  Test Methods  Faults
ant           9         627      80.4   150           877           21
jmeter        6         389      43.4   28            78            9
xml-security  4         143      16.3   14            83            6
jtopas        4         50       5.4    11            128           5

5.1.3.3 Variables and Measures

Independent Variables. Our experiments manipulated two independent variables: prioritization technique and test suite granularity.

Variable 1: Prioritization Technique. We consider nine different test case prioritization techniques, which we classify into three groups to match an earlier study on prioritization for C programs [42]. Table 5.2 summarizes these groups and techniques. The first group is the control group, containing three techniques that serve as experimental controls. (We use the term "technique" here as a convenience; in actuality, the control group does not involve any practical prioritization heuristics; rather, it involves various orderings against which practical heuristics should be compared.) The other two groups contain the (non-control) techniques that we wish to investigate: the second group is the block level group, containing two fine granularity prioritization techniques, and the third group is the method level group, containing four coarse granularity prioritization techniques.

Table 5.2: Test case prioritization techniques.
Label  Mnemonic           Description
T1     untreated          original ordering
T2     random             random ordering
T3     optimal            ordered to optimize rate of fault detection
T4     block-total        prioritize on coverage of blocks
T5     block-addtl        prioritize on coverage of blocks not yet covered
T6     method-total       prioritize on coverage of methods
T7     method-addtl       prioritize on coverage of methods not yet covered
T8     method-diff-total  prioritize on coverage of methods and change information
T9     method-diff-addtl  prioritize on coverage of methods/change information, adjusted on previous coverage

Control techniques.

(T1) No prioritization (untreated): One control that we consider is simply the application of no technique; this lets us consider untreated JUnit test suites.

(T2) Random prioritization (random): As a second control we use random prioritization, in which we randomly order the test cases in a JUnit test suite.

(T3) Optimal prioritization (optimal): To measure the effects of prioritization techniques on rate of fault detection, our empirical study uses programs that contain known faults. For the purposes of experimentation we can determine, for any test suite, which test cases expose which faults, and using this information we can determine an (approximate) optimal ordering of test cases in a JUnit test suite for maximizing that suite's rate of fault detection. This is not a viable practical technique, but it provides an upper bound on the effectiveness of prioritization heuristics.

Block level techniques.

(T4) Total block coverage prioritization (block-total): By instrumenting a program we can determine, for any test case, the number of basic blocks in that program that are exercised by that test case. We can prioritize these test cases according to the total number of blocks they cover simply by sorting them in terms of that number (and resolving ties randomly).

(T5) Additional block coverage prioritization (block-addtl): Additional block coverage prioritization combines feedback with coverage information. It iteratively selects a test case that yields the greatest block coverage, adjusts the coverage information on subsequent test cases to indicate their coverage of blocks not yet covered, and repeats this process until all blocks covered by at least one test case have been covered. If multiple test cases cover the same number of blocks not yet covered, they are ordered randomly. When all blocks have been covered, this process is repeated on the remaining test cases until all have been ordered.

Method level techniques.

(T6) Total method coverage prioritization (method-total): Total method coverage prioritization is the same as total block coverage prioritization, except that it relies on coverage in terms of methods.

(T7) Additional method coverage prioritization (method-addtl): Additional method coverage prioritization is the same as additional block coverage prioritization, except that it relies on coverage in terms of methods.

(T8) Total diff method coverage prioritization (method-diff-total): Total diff method coverage prioritization uses modification information; it sorts test cases in the order of their coverage of methods that differ textually (as measured by a Java parser that extracts pairs of individual Java methods and passes them through the Unix diff function). If multiple test cases cover the same number of differing methods, they are ordered randomly.

63 through the Unix di function). If multiple test cases cover the same number of diering methods, they are ordered randomly. (T9) Additional di method coverage prioritization (method-di-addtl): Additional di method coverage prioritization uses both feedback and modication information. It iteratively selects a test case that yields the greatest coverage of methods that dier, adjusts the information on subsequent test cases to indicate their coverage of methods not yet covered, and then repeats this process until all methods that dier and have been covered by at least one test case have been covered. If multiple test cases cover the same number of diering methods not yet covered, they are ordered randomly. This process is repeated until all test cases that execute methods that dier have been used; additional method coverage prioritization is applied to remaining test cases. The foregoing set of techniques matches the set examined in [42] in all but two respects. First, we use three control techniques, considering an untreated technique in which test cases are run in the order in which they are given in the original JUnit test cases. This is a sensible control technique for our study since in practice developers would run JUnit test cases in their original ordering. Second, the studies with C programs used statement and function level prioritization techniques, where coverage was based on source code, whereas our study uses coverage based on Java bytecode. Analysis at the bytecode level is appropriate for Java environments. Since Java is a platform independent language, vendors or programmers might choose to provide just class les for system components. In such cases we want to be able to analyze even those class les, and bytecode analysis allows this. The use of bytecode level analysis does aect our choice of prioritization techniques. As an equivalent to C function level coverage, a method level granularity was an obvious choice. As a statement level equivalent, we could use either individual bytecode instructions, or basic blocks of instructions, but we cannot infer a one-to-one correspondence between Java source statements and either bytecode instructions or blocks.3 We chose the basic block because the basic block representation is a more cost-eective unit of analysis for bytecode. Although basic blocks of bytecode and source code statements represent dierent types of code entries we can still study the eect that an increase in granularity (from basic blocks to methods) has on prioritization in Java, as compared to in C. Variable 2: Test Suite Granularity To investigate the impact of test suite granularity on the eectiveness of test case prioritization techniques we considered two test suite granularity levels for JUnit test cases: test-class level and test-method level as described in Section 5.1.2. At the
3 A Java source statement typically compiles to several bytecode instructions, and a basic block from bytecode often corresponds to more than one Java source code statement.

At the test-class level, JUnit test cases are of relatively coarse granularity; each test-class that is invoked by a TestSuite is considered to be one test case consisting of one or more test-methods. At the test-method level, JUnit test cases are of relatively fine granularity; each test-method is considered to be one test case.

Dependent Variables and Measures

Rate of Fault Detection

To investigate our research questions we need to measure the benefits of various prioritization techniques in terms of rate of fault detection. To measure rate of fault detection, we use the metric described in Section 2.1.2, APFD (Average Percentage Faults Detected).

5.1.3.4 Experiment Setup

To perform test case prioritization we required several types of data. Since the process used to collect this data is complicated and requires significant time and effort, we automated a large part of the experiment.
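To make the dependent measure concrete, the following sketch computes APFD for a reordered test suite using the standard formula from the prioritization literature (the metric defined in Section 2.1.2): APFD = 1 - (sum of TF_i)/(n*m) + 1/(2n), where TF_i is the 1-based position of the first test that detects fault i, n is the number of tests, and m the number of detected faults. The class name and the map-based fault matrix representation are illustrative, not our tooling.

```java
import java.util.*;

/** Illustrative APFD computation. faultMatrix maps each test id to the set of
    fault ids that test detects; order is the prioritized test sequence. */
public class Apfd {
  public static double compute(List<String> order, Map<String, Set<Integer>> faultMatrix) {
    // Faults detected by at least one test in the suite.
    Set<Integer> faults = new HashSet<>();
    for (String t : order) faults.addAll(faultMatrix.getOrDefault(t, Collections.emptySet()));
    int n = order.size(), m = faults.size();
    if (n == 0 || m == 0) return Double.NaN;  // undefined when no tests or no detected faults

    long sumTF = 0;
    for (int fault : faults) {
      for (int pos = 0; pos < n; pos++) {
        if (faultMatrix.getOrDefault(order.get(pos), Collections.emptySet()).contains(fault)) {
          sumTF += pos + 1;  // TF_i: first (1-based) position exposing this fault
          break;
        }
      }
    }
    // Usually reported as a percentage (multiply by 100), as in our tables.
    return 1.0 - ((double) sumTF) / ((long) n * m) + 1.0 / (2.0 * n);
  }
}
```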
[Figure 5.3 (diagram): the object program and its JUnit tests are analyzed by Galileo (bytecode analyzer) to produce coverage information; the prioritization techniques (untreated, random, optimal, total, addtl, diff-total, diff-addtl) use coverage and change information to produce reordered test suites; the reordered suites and fault matrices feed the APFD computation, which yields APFD scores.]
Figure 5.3: Overview of experiment process.

Figure 5.3 illustrates our experiment process. There were three types of data to be collected prior to applying prioritization techniques: coverage information, fault-matrices, and change information, as follows:

We obtained coverage information by running test cases over instrumented object programs using the Galileo system for analysis of Java bytecode in conjunction with a special JUnit adaptor. This information lists which test cases exercised which blocks and methods; a previous version's coverage information is used to prioritize the current set of test cases.

Fault-matrices list which test cases detect which faults and are used to measure the rate of fault detection for each prioritization technique. Since the optimal technique needs to know which test cases expose which faults in advance to determine an optimal ordering of test cases, it uses fault-matrices when the prioritization technique is applied.

Change information lists which methods differ from those in the preceding version and how many lines of each method were changed (deleted and added methods are also listed). This information is used when the method-diff-total (T8) and method-diff-addtl (T9) techniques are applied.

Each prioritization technique uses some or all of this data to prioritize JUnit test suites based on its analysis; then APFD scores are obtained from the reordered test suites. The collected scores are analyzed to determine whether the techniques improved the rate of fault detection. For random orderings, we generated random orders from untreated test cases 20 times per version for each object program, collected APFD values from those orderings, and calculated the average of these APFD values.

Object Instrumentation

To perform our experiment, we required that programs be instrumented to support the techniques described in Section 5.1.3.3. We instrumented class files in two ways: all basic blocks, and all method entry blocks (prior to the first instruction of the method), using the Galileo bytecode analysis system (see Figure 5.2).

Test Cases and Test Automation

As described previously, test cases were obtained from each object program's distribution, and we considered test suites at the two levels of granularity previously described. To execute and validate test cases automatically, we created test scripts that invoke JUnit test cases (at the test-class or test-method level), save all outputs from test execution, and compare outputs with those for the previous version. As shown in Figure 5.2, JUnit test cases are run through JUnitFilter and TestRunner (SelectiveTestRunner) over the instrumented classes because this produces coverage data for subsequent prioritizations.

Faults

We wished to evaluate the performance of prioritization techniques with respect to detection of regression faults. The object programs we obtained were not supplied with any such faults or fault data. As described in Section 4.3.1, there are two choices for obtaining seeded faults: mutation and fault seeding. In this study, to obtain faults, we chose the second approach, and following the procedure described in [63], used also in the study described in [42], we seeded faults. Two graduate students performed this fault seeding; they were instructed to insert faults that were as realistic as possible based on their experience with real programs, and that involved code inserted into, or modified in, each of the versions. To be consistent with previous studies of C programs to which we wish to compare our results, we excluded any faults that were detected by more than 20% of the test cases at both granularity levels.

5.1.3.5 Threats to Validity

In this section we describe the external, internal, and construct threats to the validity of our experiments, and the approaches we used to limit the effects that these threats might have.

External Validity

Three issues limit the generalization of our results. The first issue is object program representativeness. Our programs are of small and medium size, and other industrial programs with different characteristics may be subject to different cost-benefit tradeoffs. The second issue involves testing process representativeness. The test suites our object programs possess are only one type of test suite that could be created by practitioners. On the other hand, as Section 5.1.3.2 points out, our programs and test suites do represent an important class of such objects found in practice. The third issue for generalization involves fault representativeness. We used seeded faults that were as realistic as possible, but they were not actual faults found in the wild. Future studies will need to consider other fault types.

Internal Validity

The inferences we have made about the effectiveness of prioritization techniques could have been affected by two factors. The first factor involves potential faults in our experiment tools. To control for this threat, we validated tools through testing on various sizes of Java programs. The second factor involves the faults seeded in our object programs. Our procedure for seeding faults followed a set process as described in Section 5.1.3.4, which as mentioned above reduced the chances of obtaining biased faults. Some of our object programs, however (xml-security and jtopas), ultimately contained a relatively small number of faults, and this could affect the inferences we are able to draw.

Construct Validity

The dependent measure that we have considered, APFD, is not the only possible measure of prioritization effectiveness and has some limitations. For example, APFD assigns no value to subsequent test cases that detect a fault already detected; such inputs may, however, help debuggers isolate the fault, and for that reason might be worth measuring. Also, APFD does not account for the possibility that faults and test cases may have different costs. Third, APFD focuses on all faults, considering the value added by detecting each fault earlier, but in some situations, engineers may be

concerned solely with the first fault detected by a test suite. Future studies will need to consider other measures of effectiveness, such as the cost of analysis, execution, result checking, and test suite maintenance.

5.1.3.6 Data and Analysis

To provide an overview of all the collected data we present boxplots in Figure 5.4. The left side of the figure presents results from test case prioritization applied to test-class level test cases, and the right side presents results from test case prioritization applied to test-method level test cases. Each row presents results per object program. Each plot contains a box for each of the nine prioritization techniques, showing the distribution of APFD scores for that technique, across each of the versions of the object program. (See Table 5.2 for a legend of the techniques.) Because we apply each technique to each pair of successive versions of each program, the number of data points represented by the boxes for a given program is equal to the number of versions of that program minus 1 (thus, 8 for ant, 3 for xml-security, 5 for jmeter, and 3 for jtopas). To further facilitate interpretation, Table 5.3 presents the mean value and standard deviation of the APFD scores for each boxplot that appears in the figure. The data sets depicted in Figure 5.4 served as the basis for our analyses of results. The following sections describe, for each of our research questions in turn, the experiments relevant to that question, and the associated analyses.

Table 5.3: Mean Value and Standard Deviation (SD) of APFD Scores in Figure 5.4.
Test-class level, Mean (SD):
Objects    T1       T2       T3      T4       T5       T6       T7       T8       T9
ant        46 (26)  59 (8)   98 (1)  57 (27)  83 (8)   57 (27)  79 (12)  59 (26)  76 (22)
jmeter     54 (34)  64 (12)  97 (2)  57 (37)  65 (33)  57 (37)  65 (33)  59 (36)  62 (35)
xml-sec.   62 (43)  67 (24)  95 (1)  92 (3)   86 (11)  95 (2)   95 (2)   91 (4)   91 (4)
jtopas     32 (15)  56 (1)   94 (1)  41 (38)  40 (21)  36 (30)  37 (13)  40 (15)  40 (15)

Test-method level, Mean (SD):
Objects    T1       T2       T3      T4       T5       T6       T7       T8       T9
ant        38 (28)  64 (13)  99 (1)  52 (30)  87 (11)  51 (29)  84 (12)  54 (31)  76 (28)
jmeter     48 (37)  60 (19)  99 (1)  34 (38)  74 (24)  34 (38)  77 (18)  42 (35)  55 (35)
xml-sec.   48 (34)  71 (17)  99 (1)  96 (3)   96 (3)   97 (3)   87 (12)  96 (3)   96 (3)
jtopas     35 (21)  61 (12)  99 (1)  68 (50)  97 (2)   68 (51)  97 (2)   77 (19)  75 (19)

Figure 5.4: APFD boxplots, all programs. The horizontal axes list techniques, and the vertical axes list APFD scores. The left column presents results for test-class level test cases and the right column presents results for test-method level test cases. See Table 5.2 for a legend of the techniques.

RQ1: Prioritization effectiveness

Our first research question considers whether test case prioritization can improve the rate of fault detection for JUnit test cases applied to our Java programs.

An initial indication of how each prioritization technique affected our JUnit test suites' rates of fault detection in this study can be obtained from Figure 5.4. Comparing the boxplots of untreated (T1) and random (T2) to the boxplot for optimal (T3), it is apparent on each object program and test suite level that they are not close to an optimal prioritization order's rate of fault detection. Comparing the boxplots of untreated (T1) to those of random (T2), it appears that random test case orderings outperform untreated test case orderings at the test-method level on all programs other than jmeter, and on two of the four programs (ant and jtopas) at the test-class level. Comparing results for untreated (T1) to results for actual, non-control techniques (T4-T9) for both test suite levels, it appears that all non-control techniques yield improvement for ant and xml-security at both test suite levels. However, the results from jmeter and jtopas show that some techniques yield improvement with respect to untreated while others do not. The comparison of the results of random orderings (T2) with those produced by non-control techniques shows varying results across all programs. For instance, the results from ant show that all non-control techniques yield improvement at the test-class level, but at the test-method level, some techniques yield improvement while others do not.

To determine whether the differences observed in the boxplots are statistically significant we performed three sets of analyses, considering test-class and test-method levels independently, for ant. (We performed a formal analysis only for ant because the other programs provide too few data points to support such an analysis. Formal analyses in RQ2 and RQ3 are also performed only for ant for the same reason.) The analyses were:

1. UNTREATED (T1) vs NON-CONTROL (T4-T9): We consider untreated and non-control techniques to determine whether there is a difference between untreated and each non-control technique. We use the Wilcoxon Rank-Sum nonparametric test [113] because the variability between techniques is very different and distributions for some techniques are not normal. We used the S-PLUS statistics package [131] to perform the analysis.

2. RANDOM (T2) vs NON-CONTROL (T4-T9): We perform the same analyses as in (1), against the random ordering.

3. UNTREATED (T1) vs RANDOM (T2): We consider untreated and random techniques to determine whether there is a difference between the two techniques using the Wilcoxon Rank-Sum test.

Table 5.4 presents the results of analysis (1), for a significance level of 0.05. The Wilcoxon Rank-Sum test4 reports that block-addtl (T5), method-addtl (T7), and method-diff-addtl (T9) are significantly different from untreated (T1) at both test suite levels. Table 5.5 presents the results of analysis (2). Similar to the first analysis, the Wilcoxon Rank-Sum test reports that block-addtl (T5), method-addtl (T7), and method-diff-addtl (T9) are significantly different from random (T2) at the test-class
4 The Wilcoxon Rank-Sum test used here reports a two-sided p-value.

level, and block-addtl and method-addtl are significantly different from random at the test-method level. For the comparison between untreated and random, the Wilcoxon Rank-Sum test reports that there is no difference between the two test orderings at the test-class level (two-sided p-value = 0.1949), but there is a significant difference at the test-method level (two-sided p-value = 0.0281).

Table 5.4: Wilcoxon Rank-Sum Test, Untreated (T1) vs Non-control Techniques (T4-T9), ant.
                    untr. vs  untr. vs  untr. vs   untr. vs   untr. vs        untr. vs
p-value             bl-tot    bl-add    meth-tot   meth-add   meth-diff-tot   meth-diff-add
Test-class level    0.5992    0.0148    0.5992     0.02       0.4             0.0379
Test-method level   0.5054    0.0047    0.4418     0.0047     0.3442          0.0379
Table 5.5: Wilcoxon Rank-Sum Test, Random (T2) vs Non-control Techniques (T4-T9), ant.
                    rand vs   rand vs   rand vs    rand vs    rand vs         rand vs
p-value             bl-tot    bl-add    meth-tot   meth-add   meth-diff-tot   meth-diff-add
Test-class level    0.9591    0.0003    0.9691     0.0047     0.7984          0.0281
Test-method level   0.3282    0.003     0.3282     0.0104     0.5054          0.1049
RQ2: The effects of information types and use on prioritization results

Our second research question concerns whether differences in the types of information and information use that distinguish prioritization techniques (type of coverage information, use of feedback, type of modification information) impact the effectiveness of prioritization.

Comparing the results of block-total (T4) to method-total (T6) and block-addtl (T5) to method-addtl (T7) for both test suite levels for all programs, it appears that the level of coverage information utilized (fine vs coarse) has no effect on prioritization results. Comparing the results of block-total to block-addtl and method-total to method-addtl at both test suite levels for ant and jmeter, it appears that techniques using feedback outperform those not using feedback. However, in the case of xml-security and jtopas, no differences between those techniques are apparent. Finally, comparison of the results of method-total (T6) to method-diff-total (T8) and method-addtl (T7) to method-diff-addtl (T9) suggests no apparent effect from using modification information on prioritization effectiveness for all programs.

To determine whether the differences observed in the boxplots are statistically significant we compared each pair of non-control techniques using a Wilcoxon Rank-Sum test, considering test-class level and test-method level independently, for ant. Table 5.6 presents the results of these analyses. The results show the following with respect to information types and information use:

Coverage information. The results indicate that there is no difference between comparable block-level and method-level techniques at either test suite level, considering block-total (T4) versus method-total (T6) and block-addtl (T5) versus method-addtl (T7). Different levels of coverage information did not impact the effectiveness of prioritization.

Use of Feedback. The results indicate a significant difference between techniques that use feedback and those that do not use feedback at the test-method level, namely block-total (T4) versus block-addtl (T5) and method-total (T6) versus method-addtl (T7). At the test-class level, there is a significant difference between block-total and block-addtl, but not between method-total and method-addtl.

Modification information. The results indicate no significant difference between techniques that use modification information and those that do not use modification information, namely method-total (T6) versus method-diff-total (T8) and method-addtl (T7) versus method-diff-addtl (T9), at either test suite level.

RQ3: Test suite granularity effects

Our third research question considers the impact of test suite granularity, comparing test-class level test suites to test-method level test suites. The boxplots and the analysis related to our first two research questions suggest that there is a difference between the two levels of test suite granularity, thus we performed Wilcoxon Rank-Sum tests for non-control techniques and ant comparing test-method level to test-class level. Table 5.7 presents the results of these analyses. The results indicate that test suite granularity did not affect the rate of fault detection for any of the non-control techniques.

Table 5.6: Wilcoxon Rank-Sum Test, All Non-control Techniques (T4-T9), ant.
                    bl-tot vs  meth-tot vs  bl-tot vs  bl-add vs  meth-tot vs     meth-add vs
p-value             bl-add     meth-add     meth-tot   meth-add   meth-diff-tot   meth-diff-add
Test-class level    0.0499     0.1304       0.9161     0.7525     0.7525          1
Test-method level   0.0281     0.0281       1          0.7909     0.8335          1
Table 5.7: Wilcoxon Rank-Sum Test, All Non-control Techniques (T4-T9), Test-Class Level vs Test-Method Level, ant.
           bl-tot   bl-add   meth-tot   meth-add   meth-diff-tot   meth-diff-add
p-value    0.5054   0.4418   0.4619     0.2786     0.5737          0.8785
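For readers who wish to script comparisons like those in Tables 5.4-5.7, the following sketch uses Apache Commons Math's MannWhitneyUTest, which implements the Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test we ran in S-PLUS). This is an illustrative alternative, not the tooling used in the study, and the APFD arrays are placeholders rather than our data.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;

/** Illustrative rank-sum comparison of two samples of per-version APFD scores. */
public class RankSumExample {
  public static void main(String[] args) {
    // Placeholder APFD samples for two techniques (one value per program version).
    double[] untreated  = {46.0, 51.5, 38.2, 61.0, 44.7, 39.9, 55.1, 48.3};
    double[] blockAddtl = {83.4, 88.0, 79.6, 91.2, 86.5, 84.9, 90.1, 87.7};

    MannWhitneyUTest test = new MannWhitneyUTest();
    double p = test.mannWhitneyUTest(untreated, blockAddtl);  // asymptotic two-sided p-value
    System.out.printf("two-sided p-value = %.4f (significant at 0.05: %b)%n", p, p < 0.05);
  }
}
```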

5.1.4 Discussion

Our results support the conclusion that some of the test case prioritization techniques we considered can improve the rate of fault detection for JUnit test suites applied to Java programs.5 Considering results from ant, at both test suite levels, there were prioritization techniques that outperformed both untreated and randomly ordered test suites.

We also observed that random test case orderings outperformed untreated test case orderings at the test-method level. We conjecture that this difference is due to the construction of the JUnit test cases supplied with the programs used in this study. It is typical in practice for developers to add new test cases at the end of a test suite. Since newer test cases tend to exercise new code, these new test cases may be more likely to be fault-revealing than previous test cases. Randomly ordered test cases, on average, achieve orders that do not put the new test cases later. The practical implication of this result, then, is that the worst thing JUnit users can do is leave their test suites untreated. To further investigate the foregoing effect and conjecture, we measured the rate of fault detection achieved by reversing the order of test cases supplied with the programs studied, and our results were supportive of the conjecture. The average APFDs for all programs were 82 for test-method level and 75 for test-class level, much better than the original untreated order (and better, even, than random order). Thus, a minimalist approach to improving JUnit testing in future versions of JUnit
5 Although we applied formal analyses only to ant, examination of the boxplots for the other programs suggests that results on these programs are consistent with this conclusion.

might involve offering an option by which its execution mechanism can be reversed.

We now consider whether the results we have observed are consistent with those from the previous studies with C programs and coverage- or requirements-based test suites. Both this study (at the test-method level) and previous studies [40, 42, 116, 117, 122] showed that prioritization techniques can improve the rate of fault detection of test suites compared to random and untreated orderings. Also, both this study and earlier studies found that techniques using additional coverage information were usually better than other techniques, for both fine and coarse granularity test cases. There were some sources of variation between the studies, however.

Regarding the impact of granularity, previous studies on C showed that statement-level techniques as a whole were better than function-level techniques. This study found, however, that block-level techniques were not significantly different overall from method-level techniques. This result may be due to the fact that the instrumentation granularity we used for Java programs differs from that used for C programs, as we explained in Section 5.1.3.3. Block-level instrumentation is not as sensitive as statement-level instrumentation because a block may combine a sequence of consecutive statements into a single unit in a control flow graph. A related explanatory factor is that the instrumentation difference between blocks and methods is not as pronounced in Java as is the difference between statements and functions in C. Java methods tend to be more concise than C functions, possibly due to object-oriented language characteristics [64] and code refactoring [48], which tend to result in methods that contain small numbers of blocks. Constructors, get methods, and set methods are examples of methods that frequently contain only one basic block. A study reported in [111] supports this interpretation, providing evidence that the sizes of the methods called most frequently in object-oriented programs are between one and nine statements on average, which generally corresponds (in our measurements on our object programs) to only one or two basic blocks.

To further investigate this result, we measured the number of instrumentation points per method in the four Java programs used in this study and the number of instrumentation points per function in 13 C programs studied in other prioritization research [42, 43]. Tables 5.8 and 5.9 show the resulting data for the initial version of each program. Columns 2 through 7 present the number of methods or functions that contain the listed range of instrumentation points. The additional parenthetical number denotes the percentage of the total number of methods (functions) that contain that number of points. In ant, for example, 609 methods contain zero or one instrumentation point, and in printtokens, four functions contain three instrumentation points. As the data shows, the majority of Java methods tend to contain a small number of instrumentation points (in all cases, over 46% contain two or fewer), whereas C functions tend to include larger numbers (4-9, 10-19, and 20+). Because instrumentation at the method level is less expensive than instrumentation at the basic-block level, if these results generalize, they suggest that practitioners may not lose much, when working with Java programs, by selecting method-level rather than block-level instrumentation approaches.


Table 5.8: Instrumentation Points per Method in Java


Objects        Blocks: 0-1    2             3            4-9           10-19        20+
ant            609 (37.99%)   305 (19.00%)  95 (5.92%)   353 (22.00%)  141 (8.80%)  100 (6.29%)
jmeter         403 (27.13%)   421 (28.35%)  140 (9.42%)  341 (22.96%)  134 (9.02%)  46 (3.12%)
xml-security   398 (19.26%)   563 (27.25%)  202 (9.77%)  598 (28.94%)  202 (9.77%)  103 (5.01%)
jtopas         62 (31.00%)    48 (24.00%)   18 (9.00%)   48 (24.00%)   16 (8.00%)   8 (4.00%)

Table 5.9: Instrumentation Points per Function in C


Objects        Nodes: 0-1   2            3           4-9          10-19        20+
printtokens    0 (0.00%)    0 (0.00%)    4 (22.22%)  7 (38.89%)   5 (27.78%)   2 (11.11%)
printtokens2   0 (0.00%)    2 (10.53%)   3 (15.79%)  7 (36.84%)   3 (15.79%)   4 (21.05%)
replace        2 (9.50%)    4 (19.10%)   0 (0.00%)   4 (19.10%)   8 (38.10%)   3 (14.20%)
schedule       0 (0.00%)    1 (5.60%)    2 (11.10%)  11 (61.10%)  3 (16.70%)   1 (5.50%)
schedule2      1 (6.25%)    0 (0.00%)    0 (0.00%)   10 (62.50%)  4 (25.00%)   1 (6.25%)
tcas           4 (44.45%)   0 (0.00%)    0 (0.00%)   3 (33.33%)   1 (11.11%)   1 (11.11%)
totinfo        0 (0.00%)    1 (14.29%)   1 (14.29%)  0 (0.00%)    3 (42.85%)   2 (28.57%)
space          1 (0.74%)    7 (5.15%)    7 (5.15%)   17 (12.50%)  44 (32.35%)  60 (44.11%)
flex           1 (0.70%)    15 (10.80%)  8 (5.80%)   45 (32.37%)  36 (25.90%)  34 (24.43%)
grep           14 (10.60%)  10 (7.60%)   3 (2.27%)   40 (30.30%)  27 (20.45%)  38 (28.78%)
gzip           1 (1.20%)    4 (4.80%)    1 (1.20%)   28 (33.73%)  13 (15.70%)  39 (43.37%)
make           10 (5.37%)   15 (8.00%)   6 (3.20%)   42 (22.58%)  41 (22.00%)  72 (38.85%)
sed            2 (2.70%)    10 (13.70%)  3 (4.10%)   29 (39.70%)  11 (15.00%)  18 (24.80%)

Where test suite granularity effects are concerned, an early study of C programs [116] revealed no differences between fine and coarse granularity prioritization techniques, but a subsequent study [117] found that fine granularity could increase prioritization effectiveness compared to coarse granularity. Like the early C study, this study also revealed no difference between fine and coarse granularity prioritization techniques. Since the scope of each test case in a test-class level test suite is limited to a specific class under test,6 one possible explanation for this result is that, in our object programs, faults are located only in a few of the classes under test, and the number of faults in several cases is relatively small.
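To make the two granularities concrete for readers unfamiliar with JUnit, the sketch below shows (in the JUnit 3 style used by the versions of our object programs) how a test-class level test case groups all of a TestCase class's methods into one schedulable unit, while the test-method level treats each test method as a separate test case. FooTest and its methods are hypothetical placeholders.

```java
import junit.framework.Test;
import junit.framework.TestSuite;

/** Hypothetical illustration of test-class vs test-method granularity (JUnit 3 style). */
public class GranularityExample {

  /** Placeholder TestCase with two test methods. */
  public static class FooTest extends junit.framework.TestCase {
    public FooTest(String name) { super(name); }
    public void testA() { assertTrue(true); }
    public void testB() { assertTrue(true); }
  }

  /** Test-class level: one coarse test case comprising all of FooTest's test methods. */
  public static Test classLevel() {
    TestSuite suite = new TestSuite();
    suite.addTest(new TestSuite(FooTest.class));  // the whole class runs (and is reordered) as a unit
    return suite;
  }

  /** Test-method level: each test method is an individually schedulable test case. */
  public static Test methodLevel() {
    TestSuite suite = new TestSuite();
    suite.addTest(TestSuite.createTest(FooTest.class, "testA"));
    suite.addTest(TestSuite.createTest(FooTest.class, "testB"));
    return suite;
  }
}
```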

5.1.5 Cost-Benefits Analysis

Our results show that there can be differences in the rates of fault detection achieved by various test case prioritization techniques applied to Java programs and JUnit test cases. However, the improvements in the rates of fault detection demonstrated by certain techniques do not guarantee the practical cost-effectiveness of those techniques, because the techniques also have associated costs, and because the benefits of early fault detection can vary. In general, detecting faults earlier may have practical advantages only when the ratio of benefits of early detection relative to prioritization costs is sufficiently high [42].

In this section, we provide further understanding of the implications of our empirical results for practice, via a cost-benefits analysis accounting for the factors that affect the costs and benefits of the prioritization techniques considered. To do this we require (1) cost models that capture the important factors, and (2) test process models that consider the different settings in which prioritization may be performed. In the next subsection, we present such models. Then, in the subsection Applying the Models to Our Data, we (1) present a general strategy for analyzing the cost-effectiveness of prioritization techniques based on the models, and (2) provide an analysis of the specific data obtained from our study.

Prioritization Cost Models

As briefly outlined in Section 2.2, Malishevsky et al. [86] present cost models and methodologies for assessing the practical cost-benefits of prioritization techniques, considering the factors affecting those costs and benefits. Following [86], let P be a program, let P' be a modified version of P, let T be the test suite for P, and consider the application of prioritization techniques relative to P and P'. We define the following variables related to the cost of applying prioritization to P, P', and T:

Ca(T) is the cost of analysis needed to support prioritization.

Cp(T) is the cost of using a prioritization technique itself.
6 This is true in the objects that we considered; in general, however, the scope of a unit test could be subject to a developer's specific practices.

C(T) = Ca(T) + Cp(T) is the total cost associated with the use of a prioritization technique.

In the foregoing equations, Cp(T) varies with the prioritization technique used; for example, techniques that use feedback perform more computation than techniques that do not. Ca(T) varies with technique, as well, but its constituent costs can include the costs of program instrumentation (Cainst), analysis of changes between old and new versions (Cachanges), collection of execution traces (Catraces), and other non-test-dependent activities (Cant). For example, the analysis cost for a prioritization technique that uses modification information can include all four of these constituent costs: Ca(T) = Cainst + Catraces + Cachanges + Cant. In practice, however, the constituent costs that are significant for a technique in a given context depend on the regression testing process that is in use. Whereas the foregoing [86] considered only one regression testing process (batch), we consider two: a batch process and an incremental process, as follows.

Batch Process

In a batch regression testing process, a period of software maintenance lasting several days, weeks, or months occurs until a release is ready for test, and then regression testing begins. In this process, we distinguish two different phases of regression testing - the preliminary and critical phases - which correspond to the times before and after the release is available for testing. Prioritization costs are distributed across these phases.

Preliminary phase costs

Preliminary phase costs are incurred as maintenance proceeds. During this phase, analysis required for prioritization for the upcoming release can be accomplished with the following costs:

Ca(T) = Cainst + Catraces + Cant

The cost for prioritization in this phase, Cprelim(T), is then:

Cprelim(T) = Ca(T) = Cainst + Catraces + Cant

Critical phase costs

Critical phase costs, in contrast to preliminary phase costs, occur in the time period designated specifically for regression testing. At this point, product development is essentially complete. In this phase, the primary cost incurred is the cost of prioritization (Cp(T)), but depending on the prioritization technique applied, additional analysis costs, such as the cost of analysis of code modifications required for the method-diff-total and method-diff-addtl techniques, may also be incurred. Thus, the phase involves the following costs:

Ca(T) = Cachanges

Ccrit(T) = Ca(T) + Cp(T) = Cachanges + Cp(T)

Incremental Process

A batch process may not suit typical development practices associated with the use of JUnit test cases, and thus we also consider an incremental process. In the incremental process, no distinction is made between preliminary and critical phases because regression testing is performed very frequently (iteratively) and thus the testing and product development phases cannot reasonably be treated separately. In this case all prioritization costs are incurred in each testing step, and may be characterized as follows (modulo adjustments in analysis cost constituents for specific prioritization techniques):

Ca(T) = Cainst + Catraces + Cachanges + Cant

C(T) = Ca(T) + Cp(T) = Cainst + Catraces + Cachanges + Cant + Cp(T)

Discussion of Processes and Models

Depending on which regression testing process an organization uses, the cost model associated with prioritization differs. Under the batch process, the data needed for prioritization techniques (with the exception of Cachanges) can potentially be collected automatically in the preliminary phase, with little or no effect on release time. Prioritization techniques can then be applied during the critical phase with relatively low analysis costs. Under the incremental process, all analysis costs can affect release time; thus, it may not be appropriate to apply stronger (but more expensive) techniques.

Note further that the incremental model covers a wide range of development and testing processes that are currently in use or under study. For example, many software organizations use nightly build-and-test processes [97], in which modifications made during the day are tested overnight, all directed toward a product release which may itself then be tested under a batch process. More recently, Extreme Programming [138] processes advocate testing following every modification or small set of related modifications, at a rate much more frequent than nightly. Finally, researchers have

also recently begun to investigate continuous testing [124, 125, 126], in which tests are run in the time periods (potentially just a few minutes or seconds long) between program edits. Although other regression testing processes could be considered, the two that we have described represent two ends of a spectrum that exists in practice. It might also be beneficial in practice to use a combined approach, such as employing weaker (but less expensive) prioritization techniques during periods in which testing is frequent, and employing stronger prioritization techniques during the major release testing cycles. The analyses that we perform in the rest of this section can be applied to such alternative processes, as well.

Quantifying Costs

To use the foregoing cost models to assess prioritization cost-effectiveness, we need to quantify the costs involved. In general, where improving the rate of fault detection is the goal, the decision to use a certain prioritization technique depends on the benefits of discovering faults sooner versus the cost of the technique itself. One procedure for savings quantification is to translate the cumulative cost of waiting for each fault to be exposed while executing test suite T under order O, defined as delays_O by Malishevsky et al. [86], to a meaningful value scale based on the assessment of the benefits of detecting faults early. Following [86], given an order O of test suite T, where T contains n test cases and detects m faults, delays_O is defined as follows:
\[ delays_O \;=\; \sum_{i=1}^{m} \; \sum_{k=1}^{TF_i^O} e_k^O \, f_i \]
In this equation, TF_i^O is the test number under order O that first detects fault i, e_k^O is the cost associated with the time required to run and validate test k in suite T under order O, and f_i is the cost of waiting a unit of time for fault i to be exposed. The cost savings due to the application of a prioritization technique that creates order O at cost C(T), relative to some other test order O' obtained at no cost (i.e., the original test order), is: delays_{O'} - delays_O - C(T). From this equation, we can state that the prioritization technique producing test order O is cost-effective compared to some original test order O' if and only if: C(T) < delays_{O'} - delays_O. When comparing two prioritization techniques 1 and 2, with costs C1(T) and C2(T), respectively, that produce test orders O1 and O2, respectively, technique 1 is cost-beneficial with respect to technique 2 if and only if:

\[ C_1(T) - C_2(T) \;<\; delays_{O_2} - delays_{O_1} \qquad (1) \]

Note that as presented, this cost model is unit-less: the delays measure is defined in terms of the costs associated with test execution and waiting for faults to be exposed. In a particular application, these costs must be defined by the organization in a manner appropriate to their situation. We further consider this issue in the following sections.

Applying the Models to Our Data

To better understand our empirical data in light of the foregoing models, we apply those models to that data and discuss their implications for the use of prioritization in that context. Our prioritization tools and the analysis tools on which they depend are prototypes, and are not implemented for efficiency. Moreover, the (delays) variables on the right hand sides of our formulas for assessing the cost savings due to prioritization techniques involve many organization-specific factors, such as testers' salaries, the impact of a late release of a system, or the reliability expectations for the system. A direct comparison of such cost-related data as could be gathered in our study supports only a limited examination of tradeoffs.

We thus begin our investigation of implications by using an approach that is more general than the use of specific data allows. This approach, used originally in [42], uses savings factors to establish relationships between delays and a savings scale. The approach also provides a general method by which practitioners could assess prioritization cost-effectiveness in their particular application setting. Following this more general illustration, we then consider an analysis of our particular data.

General Analysis

Let A and B be prioritization techniques with costs CA(T) and CB(T), respectively, let the test orders produced by A and B be OA and OB, respectively, and let the delays associated with A and B be delays_{OA} and delays_{OB}, respectively. The decision to use technique A rather than B can be framed as one of determining whether CA(T) - CB(T) < delays_{OB} - delays_{OA}. Even when we do not have specific cost values for the techniques (left hand side), we can complete the right hand side of this equation to compute potential savings; the result constitutes an upper bound on possible cost savings that could be achieved given favorable conditions. Furthermore, the resulting analysis applies for either of the cost models presented in this section.

Recall that our delays measure is unit-less, and to be applied in practice must be translated into a concrete measure of benefit. A savings factor (SF) is a weight that translates a reduction in delays into such a measure of benefit. For example, it can associate benefits such as time or dollars saved with reductions in delays. The greater the SF, the greater the benefits generated by a reduction in delays.
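To make these definitions concrete, the sketch below computes delays for a given test order (following the formula from the subsection Quantifying Costs) and applies formula (1) with the delays difference translated into the same units as technique cost via a savings factor SF. The class, method, and parameter names are illustrative, and the sketch assumes every test and fault has an entry in the corresponding cost map.

```java
import java.util.*;

/** Sketch of the delays measure and SF-weighted comparison of two test orders.
    faultMatrix maps test id -> faults detected; testCost gives e_k (cost of running
    and validating test k); faultCost gives f_i (cost per unit of delay for fault i). */
public class DelaysAnalysis {

  public static double delays(List<String> order,
                              Map<String, Set<Integer>> faultMatrix,
                              Map<String, Double> testCost,
                              Map<Integer, Double> faultCost) {
    // Faults detected by the suite under this order.
    Set<Integer> faults = new HashSet<>();
    for (String t : order) faults.addAll(faultMatrix.getOrDefault(t, Collections.emptySet()));

    double total = 0.0;
    for (int fault : faults) {
      double waited = 0.0;
      for (String t : order) {                        // accumulate e_k up to and including TF_i
        waited += testCost.get(t);
        if (faultMatrix.getOrDefault(t, Collections.emptySet()).contains(fault)) break;
      }
      total += waited * faultCost.get(fault);         // sum_{k=1}^{TF_i} e_k * f_i
    }
    return total;
  }

  /** Technique producing orderA (at cost costA) is cost-beneficial with respect to
      orderB (cost costB) when costA - costB < SF * (delays_B - delays_A); the break-even
      savings factor is therefore (costA - costB) / (delays_B - delays_A). */
  public static boolean costBeneficial(double costA, double costB,
                                       double delaysA, double delaysB, double sf) {
    return (costA - costB) < sf * (delaysB - delaysA);
  }
}
```

As a usage note, with the ant figures discussed later in this section (a method-addtl cost of 1566 seconds under the batch process against an untreated order, which has zero prioritization cost, and a delays difference of 426.5), the break-even savings factor is 1566 / 426.5, approximately 3.7, matching the analysis below.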

By applying a range of scaling factors to a specific difference in delays values between two test orders A and B, we can obtain a view of the potential payoffs associated with that range of savings factors. This view can help us assess the break-even point at which A (and the technique that produced it) truly becomes more cost-effective than B (and the technique that produced it).

Tables 5.10, 5.11, and 5.12 show the results of applying this approach to our data, presenting (i) comparisons between untreated test suites and test suites ordered by heuristics, (ii) comparisons between randomly ordered test suites and test suites ordered by heuristics, and (iii) comparisons between various heuristics, respectively, for each of the four object programs considered in our studies. Comparisons are made at the test-method level, in which overall improvements in rates of fault detection were observed. For comparisons (i) and (ii), we consider only those heuristics for which statistically significant differences were reported in Tables 5.4, 5.5 and 5.6. (Although, for programs other than ant, we do not have any statistical evidence of significant differences involving those techniques, we investigate them to see if their results have any practical implications.) For comparisons between heuristics we consider two cases that showed statistically significant differences: block-total vs block-addtl, and method-total vs method-addtl.

Table 5.10: Comparisons: Untreated vs. Heuristics
Untreated (T1) vs block-addtl (T5):
Objects   delays(T1)  delays(T5)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500    SF=1000
ant       694.3       248.5       445.7   445.7   2228.8   4457    22288   44576    222880    445760
jmeter    57.7        26.7        31.0    31.0    155.0    310     1550    3100     15500     31000
xml.      81.6        8.0         73.6    73.6    368.3    736     3683    7360     36830     73600
jtopas    121.1       5.8         115.3   115.3   576.5    1153    5765    11530    57650     115300

Untreated (T1) vs method-addtl (T7):
Objects   delays(T1)  delays(T7)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500    SF=1000
ant       694.3       267.8       426.5   426.5   2132.5   4265    21325   42650    213250    426500
jmeter    57.7        23.4        34.3    34.3    171.5    343     1715    3430     17150     34300
xml.      81.6        18.3        63.3    63.3    316.5    633     3165    6330     31650     63300
jtopas    121.1       5.8         115.3   115.3   576.5    1153    5765    11530    57650     115300

Untreated (T1) vs method-diff-addtl (T9):
Objects   delays(T1)  delays(T9)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500    SF=1000
ant       694.3       338.8       355.5   355.5   1777.5   3555    17775   35550    177750    355500
jmeter    57.7        45.5        12.2    12.2    61.0     122     610     1220     6100      12200
xml.      81.6        8.3         73.3    73.3    366.5    733     3665    7366     36650     73300
jtopas    121.1       39.5        81.6    81.6    408.3    816     4083    8160     40830     81600

Columns 2 and 3 of Tables 5.10, 5.11 and 5.12 present the delays values calculated for our techniques using the formula for calculating delays presented in the subsection Quantifying Costs, averaged across the versions of the object programs, and column 4 presents the differences between those average delays. Note that in this usage, these delays values are calculated relative to our test case orders and fault detection information in which the values concerned are test indices, not times; thus, these

Table 5.11: Comparisons: Random vs. Heuristics


Random (T2) vs block-addtl (T5):
Objects   delays(T2)  delays(T5)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500   SF=1000
ant       444.2       248.5       195.7   195.7   978.5    1957    9785    19570    97850    195700
jmeter    45.1        26.7        18.4    18.4    92.0     184     920     1840     9200     18400
xml.      46.8        8.0         38.8    38.8    194.0    388     1940    3880     19400    38800
jtopas    67.3        5.8         61.5    61.5    307.5    615     3075    6150     30750    61500

Random (T2) vs method-addtl (T7):
Objects   delays(T2)  delays(T7)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500   SF=1000
ant       444.2       267.8       176.4   176.4   882.0    1764    8820    17640    88200    176400
jmeter    45.1        23.4        21.7    21.7    108.5    217     1085    2170     10850    21700
xml.      46.8        18.3        28.5    28.5    142.5    285     1425    2850     14250    28500
jtopas    67.3        5.8         61.5    61.5    307.5    615     3075    6150     30750    61500

Table 5.12: Comparisons: Between Heuristics


Block-total (T4) vs block-addtl (T5):
Objects   delays(T4)  delays(T5)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500    SF=1000
ant       521.4       248.5       272.9   272.9   1364.5   2729    13645   27290    136450    272900
jmeter    82.0        26.7        55.3    55.3    276.5    553     2765    5530     27650     55300
xml.      8.3         8.0         0.3     0.3     1.5      3       15      30       150       300
jtopas    77.8        5.8         72.0    72.0    360.0    720     3600    7200     36000     72000

Method-total (T6) vs method-addtl (T7):
Objects   delays(T6)  delays(T7)  diff    SF=1    SF=5     SF=10   SF=50   SF=100   SF=500    SF=1000
ant       530.9       267.8       263.1   263.1   1315.5   2631    13155   26310    131550    263100
jmeter    82.2        23.4        58.8    58.8    294.0    588     2940    5880     29400     58800
xml.      7.3         18.3        -11.0   -11.0   -55.0    -110    -550    -1100    -5500     -11000
jtopas    77.8        5.8         72.0    72.0    360.0    720     3600    7200     36000     72000

delays values measure delays solely in terms of numbers of tests spent waiting for faults to be revealed. Columns 5 through 11 of the tables show the potential savings that result from these differences in delays for each object program, for each of seven SFs: 1, 5, 10, 50, 100, 500, 1000. Here, the values in the table represent the delays values, which in this case have been calculated in terms of numbers of tests spent waiting to reveal faults, multiplied by scaling factors, but these scaling factors can be thought of as assigning more meaningful cost-benefit values to the differences in delays. For example, a scaling factor may be taken to represent the savings, in dollars, associated with the removal of one unit. In such cases, for ant, in Table 5.10, if a one unit difference in delays is worth one dollar to the organization, then block-addtl prioritization saves $445.7 in comparison to an untreated order, and if a one unit difference in delays is worth $1000, it saves the organization $445,760. As we would expect, given that we have used only test indices in calculating

delay values, the data is similar to the APFD differences observed for the techniques: with the exception of the results on xml-security in Table 5.12, all results favor the technique in the third column over the technique in the second column. That is, Tables 5.10 and 5.11 show that the prioritization techniques considered are better than untreated and random techniques. Table 5.12 shows that techniques using feedback information perform better than techniques without feedback information. In particular ant, which has the largest number of classes and tests among our object programs, has large weighted delays values on all comparisons. (For xml-security, which has negative values in Table 5.12, the superiority of method-total holds since the cost of method-total is less than that of method-addtl.)

More important, the use of various scaling factors in the tables lets us examine the extent to which these differences in delays might translate, in practice, into savings that are greater than the additional cost incurred by prioritization techniques (or than the costs incurred by the superior prioritization technique in the case of comparison between two prioritization techniques). For sufficiently large SF values, prioritization techniques may yield large reductions in delays (and corresponding savings) even for a program like jmeter, on which, considering untreated versus method-diff-addtl prioritization (Table 5.10), a small difference in delays (12.2) translates into 12200 when SF is 1000. If SF is small, however, even large differences in delays may not translate into savings; for example, on ant, an SF of 1 yields a difference of only 445.7 for block-addtl vs untreated (Table 5.10).

As an illustration, we use the analysis strategy just presented to assess the possible practical implications of our results on the last two versions of ant, the largest of our four object programs and the one with the largest number of test cases. Using the cost models from Section 6.1, we measured the costs of applying the method-addtl technique (run-time, in seconds). The costs were: Cainst: 1156; Catraces: 589; Cp(T): 1566 (Cant was too small to be measured, and thus is omitted). Given these numbers, under the batch process, the costs of prioritization in the preliminary and critical phases are 1745 and 1566 seconds, respectively. Under the incremental process, the cost is 3311 seconds. If we compare these costs with the unweighted difference in delays gained by method-addtl, they would not be cost-beneficial, because the unweighted delays differences between control orders and method-addtl are less than the cost incurred by the method-addtl prioritization technique: 426.5 in comparison with untreated, and 176.4 in comparison with random. However, somewhere between SF 3 and SF 8, the benefits of method-addtl reach a break-even point in comparison to untreated orders, a point at which they begin to pay off. For the batch regression testing process with its lower cost (1566, because its preliminary costs can be ignored) the break-even point is at SF = 3.7, whereas for the incremental process with cost 3311, the break-even point is at SF = 7.7. As SF grows higher, possible savings increase; if an organization's estimated SF is 1000, where SF is a measure of the time (in seconds) saved by a one unit difference in delays, then the savings gained from the method-addtl technique would be considerably higher

(426,500), and thus, the costs incurred by the prioritization technique would not be detrimental to savings. Moreover, if prioritization tools were optimized for industrial use, the cost of prioritization would have less impact even for low values of SFs.

Analysis Relative to Specific Data

The foregoing analysis shows, in general terms, the degree to which certain prioritization techniques might be effective relative to the objects we studied if certain cost-benefit relationships (as captured abstractly by appropriate scaling factors) were to hold. Whether these particular relationships would in fact hold in practice for the particular objects we studied and under a particular testing process, however, is an open question. The test suites for the object programs we studied are not particularly long-running, and for many development and testing processes (such as batch testing, or overnight incremental testing) could not be cost-effectively prioritized. For shorter testing cycles, however, gains from prioritization are at least theoretically possible even for these suites.

To investigate whether this possibility holds for our objects, as well as to show how the analysis strategy we have presented can be applied in practice, we next turn to our specific data and examine it relative to the incremental testing process in which testing increments are shortest: continuous testing [124, 125, 126]. We consider two prioritization techniques that resulted in differences (statistically significant on ant) in prioritization results, block-addtl and method-addtl, and compare them to untreated test case orderings, at the test-method level. In our analysis, we consider three of our object programs (jmeter, xml-security, and jtopas) in terms of average costs across versions, because these programs possess test suites that remain relatively constant across versions. For ant, whose test suites vary considerably in size, we examine averages across four subsets of versions that possess similar test suite sizes: v1-v2 (around 100 test cases), v3-v4 (around 200 test cases), v5-v7 (around 500 test cases), and v8 (877 test cases).

Table 5.13 presents, for each program or subset of ant versions, the average time required to execute a test case for that program or subset of versions (column 2), and the delays differences calculated for that program or subset of versions for the comparisons of block-addtl and method-addtl to untreated orders, respectively (columns 3 and 4). The time was measured on a machine running SuSE Linux 9.1 with 1 GB of RAM and a 3 GHz processor. Columns 5 and 6 translate delays values into actual delays differences in seconds, effectively using the average time required to execute a test as the scaling factor introduced in the subsection General Analysis, multiplying that number by the delays value. Columns 7 and 8 present the average time required to prioritize the test cases for each program or subset of versions, using the block-addtl and method-addtl techniques, respectively.

Table 5.13: Costs and Savings Data for Prioritization on Our Object Programs, Considering Block-addtl and Method-addtl Techniques versus Untreated Test Orders and Test-method Level Test Suites.
Objects       Avg. run time    Delay diff   Delay diff     AVG*D1   AVG*D2   Prior. time   Prior. time
              per test (sec)   untr. vs     untr. vs       (sec)    (sec)    bl-add (T5)   meth-add (T7)
              (AVG)            bl-add (D1)  meth-add (D2)                    (sec)         (sec)
ant (v1-v2)   0.30             90.4         90.4           27.12    27.12    9.56          10.42
ant (v3-v4)   0.59             143.9        88.4           84.90    52.15    46.15         51.22
ant (v5-v7)   0.78             560.0        601.6          436.80   469.24   447.46        465.31
ant (v8)      0.64             1416.0       1248.0         905.24   798.72   1460.42       1566.94
jmeter        0.37             31.0         34.3           11.47    12.69    5.92          6.89
xml-sec.      0.42             73.6         63.3           30.91    26.58    8.82          8.74
jtopas        0.87             115.3        115.3          100.31   100.31   1.21          1.24

Given this data, we can assess the cost-benefits of prioritization during continuous testing as follows. Using formula (1) from the subsection Quantifying Costs, we know that a prioritization heuristic is more cost-effective than the untreated ordering if and only if the following constraint is satisfied:8

prioritization time < AVG * (delay difference)

With respect to our data, this constraint is satisfied for jmeter, xml-security, and jtopas when considering the block-addtl and method-addtl heuristics. Among these cases, the largest savings are observed for jtopas: 99.1 and 99.07 seconds for block-addtl and method-addtl, respectively. Results on ant vary across subsets of versions. On the first and second subsets (v1-v2 and v3-v4) prioritization produces some savings: 17.56 and 16.7 seconds for block-addtl and method-addtl, respectively, on the first subset, and 38.75 and 0.93 seconds on the second. However, the third and fourth subsets do not result in savings; for these, the cost of applying prioritization exceeds the potential reduction in delays.

In assessing the implications of these results, it is important to recognize the iterative nature of the continuous testing process. In their empirical study of continuous testing [124], Saff et al.'s subjects utilized, on average, 266 and 116 individual test runs on the two programs they considered. In such situations, even small savings on individual test runs such as those we see on some of our programs can be magnified across the project cycle. For instance, in a development cycle of 266 iterations, the 99.1-second savings per jtopas test run would translate to 26361 seconds (438 minutes) across the entire cycle.
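As a worked instance of this constraint, using the jtopas row of Table 5.13 (block-addtl) and the 266-iteration cycle length reported in [124]:

\[ \mathrm{AVG} \times D1 - \text{prioritization time} \;=\; 0.87 \times 115.3 - 1.21 \;\approx\; 100.31 - 1.21 \;=\; 99.1 \text{ seconds saved per run}; \qquad 266 \times 99.1 \;\approx\; 26361 \text{ seconds per cycle.} \]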
8 On the left side of the equation, the cost of prioritizing for the untreated order is zero, and we assume that the development environment includes facilities for instrumenting code during compilation, and executes tests on that instrumented code regardless of the prioritization method used; thus, the costs of instrumentation for the prioritization heuristic and the untreated order are equivalent and cancel out. Note that using time as the component by which delays are translated into values makes sense in the context of continuous testing, where testing is vying with short edit cycles for attention. Also, another measure of interest in this context might be the delay until exposure of a first fault, which could be of particular interest in the continuous testing context. We retain our focus on delays, however, as they capture more fully the potential benefits of a complete prioritized order.


5.1.6 Conclusions

We have presented a study of prioritization techniques applied to JUnit test suites provided with four Java programs. Although several studies of test case prioritization have been conducted previously, most have focused on a single procedural language, C, and on only a few specific types of test suites. Our study, in contrast, applied prioritization techniques to an object-oriented language (Java) tested under the JUnit testing framework, to investigate whether the results observed in previous studies generalize to other language and testing paradigms.

Our results regarding the effectiveness of prioritization techniques confirm several previous findings [40, 42, 116, 117, 122], while also revealing some differences regarding the effects of prioritization technique granularity and test suite granularity. As discussed in Section 5.1.4, these differences can be explained in relation to characteristics of the Java language and the JUnit testing paradigm. We also investigated the practical impact of our results with respect to differences in delays values, relating these to the many cost factors involved in regression testing and prioritization processes, and illustrating the approach with respect to our particular data, relative to a continuous testing process.

As discussed in Section 5.1.4, we speculated that our results might be affected by the number of faults located in programs. Thus, it is worth investigating the effectiveness of prioritization techniques using a wider range and distribution of faults, and different types of faults, such as faults seeded by mutation. We present the results of such an investigation in detail in Section 5.3.

5.2

Using Component Metadata to Regression Test Component-Based Software

In Section 5.1, we investigated regression testing problems focusing on test case prioritization techniques using a particular type of test cases, JUnit, and a code-based approach. However, in practice, dierent types of testing techniques have dierent, and possibly complementary, strengths and weaknesses. No single technique will be most appropriate in all situations, and depending on the goal testers want to achieve, dierent types of regression testing techniques might be considered. Thus, in this study, we consider another type of regression testing technique, regression test selection, and design a study using our Java infrastructure, which also provides further validation of our infrastructure. In particular, we investigate two dierent types of regression test selection techniques that dier substantially in terms of the types of information they utilize and the approaches they take. We use our results to further explore some implications related to infrastructure and empirical studies of regression testing techniques.

86

5.2.1

Study Overview

We investigate the problem of regression testing an application that uses components after those components have been modied, and a metadata-based method for supporting regression testing tasks on component-based applications. We focus on regression test selection techniques [97, 118], as described in Section 2.1. The two dierent types of regression testing techniques we consider are code-based and specication-based regression test selection techniques (RTS). Code-based RTS techniques select test cases from an existing test suite based on a coverage goal expressed in terms of some measurable characteristic of the code. There are many such characteristics that can be considered, including statements, branches, paths, methods, and classes. In this study, we use the DEJAVU approach [119] to select test cases based on metadata about test coverage of branches. Specication-based RTS techniques select test cases based on some form of functional specication of a system, such as natural language specications, FSM diagrams, or UML diagrams. In this study, we consider an approach to regression test selection based on UML statechart diagrams.

5.2.2

Component Metadata-based Regression Test Selection

5.2.2.1 Component Metadata Component metadata consists of metadata and metamethods; metadata are information about components and metamethods are methods, associated with components, that can compute or retrieve metadata. Component metadata can provide a wide range of static and dynamic information about a component, such as coverage information, built-in test cases, abstract representations of source code, or assertions about security properties. Thus, we envision dierent types of component metadata and dierent protocols for accessing them. We distinguish between a priori and on-demand metadata. In the rst case, a priori metadata provide information about a component that can be computed beforehand and attached to the component (e.g., a components version number). In the second case, on-demand metadata provide information that either (1) can be gathered only by execution or analysis of the component in the context of the application (e.g., the execution prole of the component in the component-users environment), or (2) are too expensive to compute exhaustively a priori (e.g., metadata for impact analysis that provide forward slices on all program points in a component). Like metadata, metamethods are highly dependent on the characteristics of the data involved. Metamethods for a priori metadata are in general simple, returning static metadata. Metamethods for on-demand metadata, in contrast, usually involve an interaction protocol between the component user (which could be another component, a tool, or a human being) and the component. Consider, for example, metadata by which self-checking code obtains a log of assertion violations during an execution. In this case, the user may have to invoke a metamethod for enabling assertion checking and logging, a metamethod for disabling assertion checking and logging, and a

87 metamethod for retrieving log information. In the metadata framework that underlies this work [99], each component provides, in addition to specic metamethods, two generic metamethods that let the user gather information about the metadata available for the component: getMetadata and getMetadataUsage. Method getMetadata provides a list of metadata available for the component, and getMetadataUsage provides information on how to gather a given type of metadata. To illustrate, Figure 5.5 shows a possible interaction with a component to obtain method proling information.
1. Get the list of types of proling metadata provided by component c: List lo = c.getMetadata(analysis/dynamic/profiling) 2. Check whether lo contains the metadata needed (e.g., analysis/dynamic/proling/method); 3. Get information on how to gather proling data: MetadataUsage ou = c.getMetadataUsage(analysis/dynamic/profiling/method) 4. Based on information in ou, gather the proling data by rst enabling built-in coverage facilities: c.enableMetadata(analysis/dynamic/profiling/method) 5. The built-in proling facilities provided with c are now enabled, execute the application that uses c. 6. Disable the built-in coverage facilities: c.disableMetadata(analysis/dynamic/profiling/method) 7. Get the proling data: Metadata md = getMetadata(analysis/dynamic/profiling/method)

Figure 5.5: Steps for gathering method proling metadata. This example assumes the existence of some hierarchical scheme for naming and accessing available metamethods, as described by Orso et al. [99]. Obviously, many mechanisms for identifying and retrieving information about metadata may be available, and the mechanism shown here is just one possible alternative. In the remainder of this article we assume that a mechanism such as this is available, and we use this mechanism to describe metadata-based techniques. 5.2.2.2 Component Metadata-based RTS: Example Our example represents a case in which application A uses a component C. (Our approaches presented also handle the case in which an application uses multiple components, but our single-component example simplies the presentation.) The application models a vending machine. A user can insert coins into the machine, ask the machine to cancel the transaction, which results in the machine returning all the coins inserted and not consumed, or ask the machine to vend an item. If an item is not available, a users credit is insucient, or a selection is invalid, the machine prints an error message and does not dispense the item, but instead returns any accumulated coins.

88 Because the techniques that we present involve code-based and specication-based testing techniques, we provide both an implementation and a possible specication for the example, expressing the latter using component diagrams and statechart diagrams in UML (Unied Modeling Language [15]) notation.
VendingMachine Dispenser

Figure 5.6: Component diagram for VendingMachine and Dispenser. Figure 5.6 shows a component diagram that represents application VendingMachine, component Dispenser, and their interactions. Dispenser provides an interface that is used by VendingMachine. VendingMachine uses the services provided by Dispenser to manage credits inserted into the vending machine, validate selections, and check for availability of requested items. Figure 5.7 shows a statechart specication of VendingMachine. The machine has ve states (NoCoins, SingleCoin, MultipleCoins, ReadyToDispense, and Dispensing), accepts ve events (insert, cancel, vend, nok, and ok), and produces three actions (setCredit, dispense, and returnCoins). If the machine receives an insert event while in state NoCoins, it reaches state SingleCoin, from which an additional insert event takes it to state MultipleCoins; if the machine receives a vend or cancel event while in state NoCoins, it remains in that state. In states SingleCoin and MultipleCoins, a vend event triggers action setCredit (with dierent parameters in the two cases: 1 and 2..max, respectively) and brings the machine to state ReadyToDispense. (We use notation 2..max to indicate that the value of the parameter can vary between 2 and some predened maximum.) In state ReadyToDispense, the machine produces action dispense and enters state Dispensing, from which it returns to state NoCoins when it receives a nok or ok event; in both cases, any remaining coins are returned through a returnCoins event. Figure 5.8 shows a statechart specication of Dispenser. The machine has three states (Empty, Insucient, and Enabled), accepts two events (setCredit and dispense), and produces two actions (nok and ok). In state Empty, the machine accepts event setCredit and stays in state Empty, reaches state Insucient, or reaches state Enabled based on the value of the credit, as specied by the guards in the gure. In all three states, the machine accepts event dispense. Event dispense triggers a nok or ok action depending on the availability of the requested item. Figure 5.9 shows a Java implementation of VendingMachine. Methods insert and cancel increment and reset the counter of inserted coins. Method vend calls Dispenser.setCredit and Dispenser.dispense. If the value returned by Dispenser.dispense is greater than 0 (which means that the selection is valid, the item is available, and the credit is sucient), the item is dispensed and the change, if any, is returned to the user; otherwise, an error message is displayed and credit is reset to 0. Figure 5.10 shows a Java implementation of Dispenser. The code contains an error:

89

Figure 5.7: Statechart specication of VendingMachine.

Figure 5.8: Statechart specication of Dispenser. after a successful call to Dispense.dispense (i.e., a call resulting in the dispensing of the requested item), the value returned to the caller is the actual cost of the item (COST), rather than the number of coins consumed (COST divided by COINVALUE). As a result, VendingMachine fails in computing the change to be returned. For example, the sequence of calls: v.insert(); v.insert(); v.insert(); v.vend(4), where v is an instance of VendingMachine, causes item 4 to be dispensed, but the one-coin change is not

90
1. public class VendingMachine { 2. private int coins; 3. private Dispenser d; 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. } public VendingMachine() { coins = 0; d = new Dispenser(); } public void insert() { coins++; System.out.println("Inserted coins = "+coins ); } public void cancel() { if( coins > 0 ) { System.out.println("Take your change" ); } coins = 0; } public void vend( int item ) { if( coins == 0 ) { System.out.println("Insufficient credit"); return; } d.setCredit(coins); int result = d.dispense( item ); if( result > 0 ) { \\ event OK System.out.println("Take your item" ); coins -= result; } else switch( result ) { \\ event NOK case -1: System.out.println("Invalid selection "+item); break; case -2: System.out.println("Item "+item+" unavailable"); break; case -3: System.out.println("Insufficient credit"); break; } cancel(); } // class VendingMachine

Figure 5.9: Application VendingMachine.

91 returned to the user.


40. class Dispenser { 41. final private int COINVALUE = 25; 42. final private int COST = 50; 43. final private int MAXSEL = 4; 44. private int credit; 45. private int itemsInStock[] = {2, 0, 0, 5, 4}; 46. public Dispenser() { 47. credit = 0; 48. } 49. public void setCredit(int nOfCoins) { 50. if( credit != 0 ) 51. System.out.println("Credit already set"); 52. else 53. credit = nOfCoins * COINVALUE; 54. } 55. public int dispense( int selection ) { 56. int val = 0; 57. if ( selection > MAXSEL ) 58. val = -1; // Invalid selection 59. else if ( itemsInStock[selection] < 1 ) 60. val = -2; // Selection unavailable 61. else if ( credit < COST ) 62. val = -3; // Insufficient credit 63. else { 64. val = COST; 65. itemsInStock[selection]--; 66. } 67. credit = 0; 68. return val; 69. } 70. } // class Dispenser

Figure 5.10: Component Dispenser. We present this example solely to illustrate our techniques. Thus, in both the code and the specication, we omit unnecessary details and show only the parts of the code or specications that are involved in the interactions between the application and the component. For example, the set of available items is hard-coded in the Dispenser component and we do not provide a way to update it. To complete the example, Table 5.14 shows a test suite for VendingMachine, created to cover usage scenarios of VendingMachine based on our knowledge of its behavior. Each test case in this test suite is a sequence of method calls. For brevity, we omit initial calls to the constructor of class VendingMachine, which is implicitly

92 invoked when the class is instantiated. The test cases are grouped into three sets (114, 1519, 2024) based on the value of parameter selection that is passed to method VendingMachine.vend. The table indicates whether each test case passes or fails. Test cases 6 and 12 fail due to the error in Dispenser.dispense. Table 5.14: A Test Suite for VendingMachine.
TC# Test Case Result Parameter to vend: 3 (valid selection, available item) 1 cancel Passed 2 vend Passed 3 insert, cancel Passed 4 insert, vend Passed 5 insert, insert, vend Passed 6 insert, insert, insert, vend Failed 7 insert, insert, cancel, vend Passed 8 insert, cancel, insert, vend Passed 9 insert, cancel, insert, insert, vend Passed 10 insert, insert, cancel, insert, vend Passed 11 insert, insert, vend, insert, insert, vend Passed 12 insert, insert, insert, insert, vend, vend Failed 13 insert, insert, vend, vend Passed 14 insert, vend, insert, vend Passed Parameter to vend: 2 (valid selection, unavailable item) 15 insert, vend Passed 16 insert, insert, vend Passed 17 insert, cancel, insert, vend Passed 18 insert, insert, insert, vend Passed 19 insert, cancel, insert, insert, vend Passed Parameter to vend: 35 (invalid selection) 20 insert, vend Passed 21 insert, insert, vend Passed 22 insert, cancel, insert, vend Passed 23 insert, insert, insert, vend Passed 24 insert, cancel, insert, insert, vend Passed

5.2.2.3 Code-Based RTS Using Component-Metadata The rst type of approach that we present involves code-based RTS techniques. We begin by describing traditional code-based RTS techniques, and then present corresponding techniques that utilize metadata.

93 Code-based Regression Test Selection Code-based RTS techniques select test cases from an existing test suite based on a coverage goal expressed in terms of some measurable characteristic of the code. There are many such characteristics that can be considered, including statements, branches, paths, methods, and classes. In particular, for techniques that use branch-coverage information, the program under test is instrumented such that, when it executes, it records which branches (i.e., method entries and outcomes of decision statements) are traversed by each test case in the test suite. Given coverage of branches, we can infer coverage of all other edges (i.e., transitions between individual statements)9 in the program. For example, coverage of edge (26,27) in VendingMachine (Figure 5.9) is implied by coverage of branch (25,26). For our example application and component, the branches are shown in Table 5.15. Table 5.15: Branches for VendingMachine and Dispenser.
VendingMachine (4,5), (8,9), (12,13), (18,19) (13,14), (13,16), (19,20), (19,23) (25,26), (25,29), (29,30), (29,32) (29,34), (29,37) Dispenser (46,47), (49,50), (55,56) (50,51), (50,53), (57,58) (57,59), (59,60), (59,61) (61,62), (61,64)

Method Entries Branches

Suppose the developer of VendingMachine runs the test suite shown in Table 5.14 on that system, with the faulty version of Dispenser incorporated. In this case, test cases 6 and 12 fail. Suppose the component user communicates this failure to the component developer, who xes the fault by changing statement 64 to val = COST / COINVALUE;, and releases a new version Dispenser of the component. When the component user integrates Dispenser into VendingMachine, it is important to regression test the resulting application. For eciency, the component user could choose to rerun only those test cases that exercise code modied in changing Dispenser to Dispenser . However, without information about the modications to Dispenser and how they relate to the test suite, the component user is forced to run all test cases that exercise the component (20 of the 24 test cases all except test cases 1, 2, 3, and 7). Code-based RTS techniques (e.g., [23, 57, 119, 121, 139]) construct a representation, such as a control-ow graph, call graph, or class-hierarchy graph, for a program P , and record the coverage achieved by the original test suite T with respect to entiTechnically, edge coverage measures coverage of edges in a programs control-ow graph, hence the use of the term edge. To simplify the discussion, we refer to edges as if they were in the program, implicitly assuming the mapping of edges from the ow graph to the code.
9

94 ties (i.e., nodes, branches, or edges) in that representation. When a modied version P of P becomes available, these techniques construct the same type of representation for P that they constructed for P . The algorithms then compare the representations for P and P to select test cases from T for use in testing P . The selection is based on dierences between such representations with respect to the entities considered and on information about which test cases cover the modied entities. Consider the Dejavu approach [119], which utilizes control-ow graph representations of the original and modied versions of the program, treating edges in the graph as entities. To select test cases, Dejavu performs a synchronous traversal of the control-ow graph (CFG) for P and the control-ow graph (CFG ) for P , and identies aected edges, that is, edges that lead to statements that have been added, deleted, or modied from CFG to CFG . Then, the algorithm uses these aected edges to infer a set of dangerous branches. Dangerous branches are branches that control the execution of aected edges (i.e., branches that, if executed, may lead to the execution of aected edges). For example, if edge (20,21) in Figure 5.9 were identied as an aected edge, branch (19,20) would be the corresponding dangerous branch. Finally, the algorithm selects the test cases in T that cover dangerous branches in P as test cases to be rerun on P . Following the terminology of Section 2.1.1, Dejavu is a safe RTS technique. Because code-based RTS techniques consider only changes in code, if there are no such changes, the technique selects no test cases. When applied to systems built from components for which code is not available, Dejavu makes conservative approximations. For example, to perform RTS on VendingMachine when Dispenser is changed to Dispenser , Dejavu constructs control-ow graphs CFG for methods in VendingMachine . However, because the code for Dispenser is unavailable to the developer of VendingMachine, Dejavu cannot construct control-ow graphs for the methods in Dispenser. Therefore, Dejavu can select test cases based on the analysis of CFG and CFG for VendingMachine only by conservatively considering each branch that leads to a call to component Dispenser as dangerous. In this case, when Dejavu performs its synchronous traversal of CFG and CFG , it identies branch (19,23) as dangerous because it leads to a call to component Dispenser, and selects all test cases that exercise this branch test cases 46 and 824. (This result is identical to the result, discussed above, in which the component user selects all test cases that exercise the component. In this context, in fact, Dejavu is an automated approach to identifying that set of test cases.) A Component-Metadata-Based Approach To achieve more precise regression test selection results in situations such as the foregoing, we can use component metadata. To do this, we require three types of information for each component: 1. coverage of the component achieved by the test suite for the application, when the component is tested within the application;

95 2. the component version; 3. information on the dangerous branches in the component, given the previous and the current version of the component. The component developer can provide the last two forms of information with the component, retrievable through metamethods. In particular, they can compute the set of dangerous branches using the Dejavu approach discussed above and provide this set as a priori metadata in Dispenser . Coverage information, however, can be provided only as on-demand metadata; it must be collected by the application developer while testing the application. Therefore, the component developer must provide built-in instrumentation facilities with the component, as metamethods packaged with the component. For example, the component can be equipped with an instrumented and an uninstrumented version of each method; a test at the beginning of each method would decide which version to execute (similar to the approach presented by Arnold and Ryder [4]). Given the foregoing metadata and metamethods, when the component user acquires and wishes to regression test Dispenser , they begin by constructing a coverage table for Dispenser, as follows: 1. Verify that coverage metadata are available for Dispenser 2. Enable the built-in instrumentation facilities in Dispenser 3. For each test case t in T (a) run t and gather coverage information (b) use coverage information for t to incrementally populate the coverage table 4. Disable the built-in coverage facilities in Dispenser (In cases in which more than one component is involved, the same process would be applied simultaneously for all components of concern.) Table 5.16 shows the coverage information that would be computed by this approach for component Dispenser and the test suite in Table 5.14. Given this coverage information, the component user invokes a metadata-aware version of Dejavu, DejavuM B , on their application. DejavuM B proceeds as follows: 1. On methods contained in the users application, for which source code is available, DejavuM B performs its usual actions, previously described 2. On methods contained in Dispenser DejavuM B performs the following actions: (a) retrieve Dispenser s version number (b) use this information to query Dispenser about the dangerous branches with respect to Dispenser

96 Table 5.16: Branch Coverage for Component Dispenser.


TC# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Branches Covered none none none (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) none (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56) (49,50) (50,53) (55,56)

(57,59) (59,61) (61,62) (57,59) (59,61) (61,64) (57,59) (59,61) (61,64) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,59) (57,58) (57,58) (57,58) (57,58) (57,58) (59,61) (59,61) (59,61) (59,61) (59,61) (59,61) (59,61) (59,60) (59,60) (59,60) (59,60) (59,60) (61,62) (61,64) (61,62) (61,64) (61,64) (61,64) (61,62)

(c) select the test cases associated with dangerous branches, referencing the coverage table. In our example, dierences between Dispenser and Dispenser cause branch (61,64) to be the only dangerous branch reported. This branch is exercised by test cases 5, 6, 9, 11, 12, and 13, so these six test cases are selected for re-execution many fewer test cases than would be required without component metadata. 5.2.2.4 Specication-based Regression Test Selection Specication-based Regression Test Selection Specication-based RTS techniques [17, 136] select test cases based on some form of functional specication of a system, such as natural language specications, FSM diagrams, or UML diagrams [15], and are complementary to code-based techniques. We consider an approach to regression test selection based on UML statecharts. Our technique applies to software that integrates an application A with a set of software components C by (1) combining specications in the form of statecharts for A and C to build a global behavioral model, (2) identifying dierences in the

97 global behavioral model when a new version of C is integrated, and (3) selecting the test cases that exercise changed sections of the model. Note, however, that a naive algorithm for step 1 could result in a global statechart of size exponential in the numbers of states and transitions in the individual statecharts, so a heuristic for cost-eectively combining statecharts is needed. (One could also directly compare statecharts for an original and new version of a component. However, the focus here is on the interactions between applications and components, and the combined behavioral model facilitates addressing this issue.) To illustrate, consider again the case in which a component user is integrating a set of components C with their application A. In step 1 of the approach, the user constructs a global behavioral model GB for their application (including components) by composing the statecharts for components incrementally, using an heuristic reduction algorithm and composition rules that reduce the size of a composed statechart by eliminating some unreachable states. To perform this step, we use a version of the technique for composing general communicating nite state machines presented by Sabnani, Lapone, and Uyar [123] and specialized for integration testing by Hartmann, Imoberdorf, and Meisinger [59]. We generalize these approaches, which are targeted primarily at communicating sequential processes, to extend the applicability of the technique to a wider set of systems. To this end, we allow for simple parameterized events and actions, and the use of a constrained form of guards. Intuitively, we allow for scalar parameters and for guards containing conditions either related to a parameter and composed of a single clause, or unrelated to parameters. Note that this is just one possible technique for composing specications at the component level to derive a specication for the overall system. There exist other, related techniques that use dierent notations with similar goals (e.g., [5, 53]). Given a global behavioral model GB, the component user can generate a set of testing requirements for the application using any testing technique based on statemachine coverage, such as Binders adaptation [12] of Chows method [24]. These testing requirements, which are typically paths through the state-machine, are then used by the tester to generate test cases, resulting in test suite T . When the statechart specications for one or more components in C are changed, and new versions C of these components incorporating these changes are released, the component user must retest their application with these changed components. To perform this task using the model composition approach, the component user does the following: 1. generate a new global behavioral model GB by composing statecharts for A with components in C ; 2. compare GB and GB by performing a pairwise walk of the two models, marking dangerous transitions, that is, edges that have been added, deleted, or modied in the new model, or edges leading to states that dier;

98 3. select all test cases in T that traverse at least one dangerous transition. The algorithm for performing the pairwise walk in step 2 of this process is similar to that used by Dejavu on control-ow graphs for code-based selection, and follows a similar motivation the notion that test cases that reach changes should be selected, because these changes indicate places in which program behavior might change and erroneous behavior might be revealed by tests. (In this case, however, the graphs model required behavior, and the changes involve changes in requirements.) The algorithm begins at initial states, comparing states reached along identically-labeled outgoing edges, until dierences are found. Then the dangerous transitions that are dened in step 2 are identied; detailed descriptions of the process of identifying dangerous transitions can be found in [119]. When new edges out of a node are found, two approaches are possible: they can be ignored for purposes of test selection (under the assumptions that previous testing requirements cannot include them, and that new test cases will be created to exercise them), or all transitions into their source node can be identied as dangerous. Because testing requirements are generated based on the statechart specication of the system, and dened as sequences of states and transitions, once the dangerous transitions have been identied, selecting the test cases associated with dangerous transitions requires a simple set union over entries in the test coverage table. Rather than present an example of this technique at this point, we defer it to the next section, where we illustrate the metadata-based adaptation of the approach. A Component-Metadata-Based Approach The technique just described requires UML statecharts for components, and is applicable only in cases in which the component developer can provide them (e.g., when sharing components in-house). The component developer can provide these statecharts as component metadata, by encoding them in appropriate data structures accessed through metamethods. One simple approach utilizes statecharts output by Rational Rose [114]; these can be parsed by a statechart composition tool, and the results rendered in the same format, for use in generating testing requirements and test cases and selecting test cases. Analogous to the code-based approach, if a new version of C, C , is released, and no information about the changes between C and C is available, the component user may need to execute all of the test cases for A that exercise components in C. If the specication for the components in C is available, however, the component user can exploit it to perform regression test selection, as we now illustrate using our vending machine example. (Note that the process can also be applied when components in A are modied, but we focus on the case in which C alone is changed for simplicity.) Consider the application VendingMachine and its statechart (Figure 5.7). When the component user rst acquires Dispenser, they retrieve its statechart using a metamethod. Next, they use an implementation of the composition algorithm dis-

99 cussed above to compose that statechart with the statechart for VendingMachine, obtaining the global statechart VendingMachine-Dispenser shown in Figure 5.11. 10

Figure 5.11: Global statechart for VendingMachine and Dispenser. As described above, using the global statechart, the component user can create a set of test cases by applying various state-machine-based testing approaches (e.g., [9, 24, 59]). Table 5.17 shows one possible set of testing requirements for VendingMachine, designed to cover each path p = (s0 , s1 , . . . , sn ) in the global statechart such that (1) n = 0 and s0 = sn = NoCoins Empty, (2) n = 0 and si = NoCoins Empty for i = 1, . . . , n 1, and (3) p does not traverse the same edge twice. (The user would
10 Because a global statechart is constructed after normalizing the individual statecharts to contain only one event or action on each transition, some of the states in the global statechart are composed of states that are not present in the original statecharts. For example, the two composing states of state Dispensing InsucientDispensing are Dispensing and InsucientDispensing; the latter is a state of the normalized statechart for Dispenser. The normalized statecharts for VendingMachine and Dispenser are provided in in Figures 5.15 anf 5.16.

100 Table 5.17: Testing Requirements for VendingMachine-Dispenser.


TR# 1 2 3 4 5 6 7 8 9 10 Testing Requirement vend(i) returnCoins insert, returnCoins insert, insert, returnCoins insert, insert, insert, returnCoins insert, vend(i), setCredit(1), dispense, nok insert, insert, vend(i), setCredit(2..max), dispense[unavail], nok insert, insert, vend(i), setCredit(2..max), dispense[avail], ok insert, insert, insert, vend(i), setCredit(2..max), dispense[unavail], nok insert, insert, insert, vend(i), setCredit(2..max), dispense[avail], ok

create test cases for each of these requirements, using appropriate combinations of test drivers and inputs.) Suppose the component developer releases a new version of Dispenser, Dispenser , in which the specication is changed by inserting a new intermediate state, Preparing, between states Enabled and Empty (see Figure 5.12). When Dispenser is acquired, the user uses a metamethod to retrieve its new specication, and build a new global behavioral model (Figure 5.13).

Figure 5.12: Statechart specication of Dispenser .

101

Figure 5.13: Global statechart for VendingMachine and Dispenser . Finally, the user applies the statechart-based version of Dejavu. When this algorithm processes the models in Figures 5.12 and 5.13, it identies the transition from ReadyToDispenseEnabled to DispensingEnabledAvail as dangerous, because it leads to states that dier in the two models. This transition is exercised by test cases 8 and 10, so these two testing requirements (or ultimately, their associated test cases) are selected.

5.2.3

Empirical Study of Code-based Regression Test Selection

For component-metadata-based approaches to regression test selection to be useful, they must be able to reduce testing costs. RTS algorithms have shown potential for reducing testing costs in prior empirical studies [23, 51, 100, 121, 120]; however, these studies have not involved component-based software systems, and it is not appropriate to conclude that their results will generalize to such systems. Moreover, the tradeos

102 that exist among dierent types of metadata should be investigated for such systems. We therefore performed an empirical study examining the results of applying several dierent code-based RTS techniques to a non-trivial component-based system. The remainder of this section describes our study design and results. 5.2.3.1 Research Question Given an application A that uses a set of externally-developed components C, and has been tested with a test suite T , and given a new version of this set of components C , can we exploit component metadata to select a test suite T T that can reveal possible regression faults in A due to changes in C and that does not exclude test cases impacted by such changes? 5.2.3.2 Object of Analysis As an object for the study we used several versions of the Java implementation of the Siena server [21]. Siena (Scalable Internet Event Notication Architecture) is an Internet-scale event notication middleware for distributed event-based applications deployed over wide-area networks, responsible for accepting notications from publishers and for selecting notications that are of interest to subscribers and delivering those notications to the clients via access points. To investigate the eects of using component metadata we required an application program constructed using external components. Siena is logically divided into a set of six components (consisting of nine classes of about 1.5KLOC), which constitute a set of external components C, and a set of 17 other classes of about 2KLOC, which constitute an application that could be constructed using C. We obtained the source code for eight dierent sequentially released versions of Siena (versions 1.8 through 1.15). Each version provides enhanced functionality or corrections with respect to the preceding version. The net eect of this process was the provision of eight successive versions of Siena, A1 , A2 , . . . , A8 , constructed using C1 , C2 , . . . , C8 , respectively. These versions of Siena represent a sequence of versions, each of which the developer of A would want to retest. The pairs of versions (Ak , Ak+1 ), 1 k 7, formed the (version, modied-version) pairs for our study. To investigate the impact of component metadata on regression test selection we also required a comprehensive test suite for our base version A1 of Siena that could be reused in retesting subsequent versions. Such a test suite did not already exist for the Siena release we considered, so we had one created. To do this in an unbiased manner, we asked one of the persons involved in requirements denition and design of Siena to independently create a black-box test specication using the category-partition method and TSL test specication language [101]. We then created individual test cases for these testing requirements. The resulting suite contains 567 test cases and served as the object regression test suite for our study.

103 5.2.3.3 Variables and Measures Independent Variables The independent variable in this study is the particular code-based RTS technique utilized. In the following, let A be an application that uses a set of externallydeveloped components C, and that has been tested with a test suite T . Let C be a new version of set of components C. Based on the discussion in Section 5.2.2.3, we consider four safe RTS techniques: No component metadata. The developer of A knows only that one or more of the components in C have been modied, but not which ones. Therefore, to selectively retest A safely, the developer must rerun any test case in T that exercises code in one or more of the components in C. We refer to this as the NO-META technique. Component-level RTS. The developer of A possesses component metadata provided by the developer of C, supporting selection of test cases that exercise components changed in producing C from C. We refer to this as the META-C technique. Method-level RTS. The developer of A possesses component metadata provided by the developer of C, supporting selection of test cases that exercise methods changed in producing C from C. We refer to this as the META-M technique Statement-level RTS. The developer of A possesses component metadata provided by the developer of C, supporting selection of test cases that exercise statements changed in producing C from C. We refer to this as the META-S technique. The NO-META technique does not use component metadata and serves as a control technique. The other three techniques do use metadata, and rely on dierent levels of information about changes between versions of components: componentlevel, method-level, or statement-level. We refer to these three techniques collectively as META techniques. By observing the application of these techniques we can investigate the eectiveness of component-metadata-based techniques generally, along with the further question of whether the level of information about changes between versions of components can aect the performance of such techniques. Dependent Variables and Measures Our dependent variable involves technique eectiveness in terms of savings in testing eort. We utilize two measures for this variable: reduction in test suite size and reduction in test-execution time.

104 RTS techniques provide savings by reducing the eort required to regression test a modied program. Thus, one method used to compare such techniques [10] considers the degree to which the techniques reduce test-suite size for given modied versions of a program. Using this method, for each RTS technique R that we consider, and for each (version, subsequent-version) pair (Pi ,Pi+1 ) of program P , where Pi is tested by test suite T , we measure the percentage of T selected by R to test Pi+1 . The fact that an RTS technique reduces the number of test cases that must be run does not guarantee that the technique will be cost-eective. That is, even though we reduce the number of test cases that need to be rerun, if this does not produce savings in testing time, the reduction in number of test cases will not produce savings. Moreover, savings in testing time might not be proportional to savings in number of test cases for example, consider the case in which the test cases excluded are all inexpensive, while those not excluded are expensive. (See Section 2.2 for an applicable cost model.) Thus, to further evaluate savings, for each RTS technique, we measure the time required to execute the selected subset of T on Pi+1 . 5.2.3.4 Experiment Procedure Because the implementation of component metadata and support tools for directly applying our techniques would be expensive, we sought a way to study the use of metadata, initially, without creating such infrastructure. This approach makes sense from the standpoint of research methodologies because, if there is no evidence that component metadata can be useful for regression test selection, there may be no reason to create infrastructure to support direct experimentation with the use of component metadata. Furthermore, if a study conducted without formal infrastructure suggests that component metadata are useful, its results can help direct the subsequent implementation eort. We therefore designed a procedure by which we could determine precisely, for a given test suite and (program, modied-version) pair, which test cases would be selected by our four target techniques. For each (program, modied-version) pair (Pi , Pi+1 ), we used the Unix diff utility and inspection of the code to locate dierences between Pi and Pi+1 , including modied, new, and deleted code. In cases where variable or type declarations diered, we determined the components, methods, or statements in which those variables or types were used, and treated those components as if they had been modied. We used this information to determine (for the METAC, META-M, and META-S techniques, respectively) the components, methods, or statements in Pi that would be reported changed for that technique. For each of the META techniques, we instrumented the changed code (component, method, or statement) for each version of the object program so that, when reached, the instrumentation code outputs the text selected, and we then constructed executables of the application from this instrumented code (one for each META technique). Given this procedure, to determine which test cases in T would be selected by the META techniques for (Pi , Pi+1 ) it was sucient to execute all test

105 cases in T on our instrumented version of Pi , and record which test cases caused Pi to output (one or more times) the text selected. By construction, these are exactly the test cases that would be selected by an implementation of that META technique. Determining the test cases that would be selected by the NO-META technique required a similar, but simpler approach. We instrumented the application developers portion of the code for P , inserting code that outputs selected prior to any invocation of any method in C, and then we executed the test cases in T on that instrumented version. The foregoing process requires us to execute all test cases in T to determine which would be selected by an actual RTS tool, and thus is useful only for experimentation. However, the approach lets us determine exactly the test cases that would be selected by the techniques, without providing full implementations. We applied this approach to each of the seven (program, modied-version) pairs of the Siena system with our given test suite, and recorded, for each of the four RTS techniques, the percentage of the test suite selected by that technique for that (program, modied-version) pair, and the time required to run the selected test cases. These percentages and times served as the data sets for our analysis. 5.2.3.5 Threats to Validity Like any empirical study, this study has limitations that must be considered when interpreting its results. We have considered the application of four componentmetadata-based RTS techniques to a single program and test suite and seven subsequent modied versions of the components that make up that program, and cannot claim that these results generalize to other programs and versions. On the other hand, the program and versions used are part of an actual implementation of a non-trivial software system, and our test suite represents a test suite that could realistically be used in practice. Nevertheless, additional studies with other objects are needed to address such questions of external validity. Other limitations involve internal and construct validity. We have considered only two measures of regression test selection eectiveness: percentage reduction in test suite size and percentage reduction in test-execution time. Other costs, such as the cost of providing component metadata and performing test selection, may be important in practice. Also, the execution times we report do not factor in the cost of the analysis required to perform test selection, which would add costs to test selection; however, in other studies of test selection, those costs have been shown to be quite low [120]. Finally, our execution times include only the times required to execute, and not to validate, our test cases. Measuring validation costs would further increase test execution time, and increase savings associated with reductions in test-suite size.

106 5.2.3.6 Data and Analysis Figure 5.14 depicts the test selection results measured. In the graph, each modied version of Sienas components occupies a position along the horizontal axis, and the test selection data for that version are represented by a vertical bar, black for the NOMETA technique, dark grey for the META-C technique, light grey for the META-M technique, and white for the META-S technique. The height of the bar depicts the percentage of test cases selected by the technique on that version. As the gure shows, the NO-META technique always selected 99.2% of the test cases. Only 0.8% of the test cases for Siena do not exercise components in C (the set of external components), and thus all others must be re-executed. Also, because the NO-META technique selects all test cases that execute any components in C, and the test cases in the test suite that encounter C did not vary across versions, the NO-META technique selected the same test cases for each version. As the gure also shows, the three META techniques always selected a smaller subset of the test suite than the NO-META technique. For the META-C technique, the selected subset did not dier greatly from the subset selected by the NO-META technique. The META-C technique always selected 98.9% of the test cases, a dierence of only 0.3% in comparison with the NO-META technique.
100 90 80 70 60 50 40 30 20 10 0 C2 C3 C4 C5 C6 C7 C8

pct. of tests selected

modified version

Figure 5.14: Test selection results for the NO-META (black), META-C (dark grey), META-M (light grey), and META-S (white) techniques. The META-M and META-S techniques provided greater savings. In the case of version C7, the dierences were extreme: the META-M technique selected only 1.4% of the test cases in the test suite and the META-S technique selected none of the test cases, whereas the NO-META technique selected 99.2% of the test cases. (The fact that META-S selected no test cases on version C7 is not a drawback of that technique: it simply shows that existing test cases were inadequate for testing the code that was modied. Running existing test cases cannot help in testing this code, and re-using such test cases in this case is wasted eort. This case does suggest, however, that the META-S technique can help testers identify areas of the modied system requiring additional testing.) This large dierence arose because the changes within C7 involved

107 only a few methods and statements, where these methods were encountered by only a few test cases, and these statements were encountered by none of the test cases. On other versions, for the META-M technique, dierences in selection were more modest, ranging from 0.7% to 32.6% of the test suite. The overall savings for the META-S technique, however, were greater, ranging from 11.8% to 84.6%. Note that on versions C3, C5, and C6, inspection of the data shows that the META-M technique selected identical test cases, even though the code changes in those versions diered. This occurred because the code changes involved the same sets of methods. Similarly, inspection shows that the META-S technique selected identical test cases on versions C5 and C6. This occurred because the code changes involved the same sets of statements. Table 5.18: Execution Times (Hours:Minutes:Seconds) for Test Cases Selected by Techniques.
Version C2 C3 C4 C5 C6 C7 C8 average cumulative NO-META 2:20:23 2:20:37 2:20:19 2:20:37 2:20:30 2:20:16 2:20:19 2:20:26 16:23:01 META-C 2:20:05 2:19:49 2:19:50 2:19:58 2:19:51 2:20:06 2:19:47 2:19:55 16:19:26 META-M 2:04:44 2:01:19 1:34:49 2:02:13 2:02:13 0:03:15 2:19:33 1:43:59 12:08:06 META-S 2:04:01 0:20:43 0:20:30 2:02:02 2:02:03 0:00:00 1:57:26 1:15:15 8:46:45

We next consider test-execution times. Table 5.18 shows, for each version of Sienas components considered, the hours, minutes and seconds required to test that version. The columns show the version number, and the time required to run the test cases selected by the NO-META, META-C, META-M, and META-S techniques, respectively. The last two rows of the table show average and cumulative times. On average over the seven modied versions, the META-C technique produced little reduction in testing time: from 2 hours, 20 minutes and 26 seconds to 2 hours, 19 minutes and 55 seconds. The META-M technique did better, reducing average regression testing time to 1 hour, 43 minutes and 59 seconds, and META-S reduced average testing time to 1 hour, 15 minutes and 15 seconds. Because regression testing is repeated, as a software system evolves, over sequences of releases, savings over such sequences are also important, so we also consider these. For the META-C technique, cumulative time savings were only 3 minutes and 35 seconds (0.37% of total time.) For the META-M and META-S techniques, however, cumulative savings were larger: 4 hours, 14 minutes and 55 seconds (25.9% of total time) in the former case, and 7 hours, 36 minutes and 16 seconds (46.4% of total

108 time) in the latter case. Since the runtime for each test case was roughly the same across test cases, the percentage of cumulative time savings were about the same as the percentage of cumulative numbers of test cases saved, for all techniques; 0.4% savings of test cases for META-C, 25.9% for META-M, and 46.4% for META-S. Considering results for META-M and META-S on individual versions, in their worst cases, the META-M technique saved only 46 seconds (0.55%) of testing time (on version C8), and the META-S technique saved 16 minutes and 22 seconds (11.7%) of testing time (on version C2). In their best cases, both on version C7, the META-M technique saved 2 hours, 17 minutes and 1 second (97.7%) of testing time and the META-S technique saved 2 hours, 20 minutes and 16 seconds (100%) of testing time. Here too, similar to the case with the cumulative results, savings in time are similar to savings in numbers of tests executed. Of course, savings of a few hours or a few minutes and seconds, such as those exhibited in the dierences in testing time seen in this study, may be unimportant. In practice, however, regression testing can require days, or even weeks of eort, and much of this eort may be human-intensive. For the META-C technique, a savings of 0.37% of the overall testing eort for a sequence of seven releases would most likely be unimportant. For the META-M and META-S techniques, however, results suggest the possibility of achieving meaningful savings. If results such as those demonstrated by this study scale up, a savings of 25.9% of the overall testing eort for a sequence of seven releases using the META-M technique may be worthwhile, and a savings of 97.7% of the testing eort for a version may be substantial. To determine whether the impact of RTS technique on test-execution time in our study was statistically signicant, we performed an ANOVA test [113] on the testexecution time data in Table 5.18. Table 5.2.3 shows the results of this analysis. The results indicate that there is strong evidence that at least one of the techniques testexecution times dier from one of the other techniques test-execution times (p-value = 0.0035), for a signicance level of 0.05. Table 5.19: ANOVA for Test Time
Source Technique Version Residuals Sum of squares 74803768 50215323 68811162 d.f 3 6 18 Mean square 24934589 8369220 3822842 p-value 0.0035 0.0924

We next performed a multiple comparison with a control technique (the NOMETA technique) to investigate whether there is a dierence between each META technique and the NO-META technique using Dunnetts method [62]. Table 5.2.3 presents the results of these comparisons. In the table, we mark with **** cases that were statistically signicant (which indicates condence intervals that do not include zero), with a 95% condence interval. The results indicate that the dierences between the NO-META and META-C techniques, and between the NO-META and

109 Table 5.20: Comparisons Between NO-META and each META.


multiple comparison with a control by Dunnetts method Comparison Estimate Std. Error Lower Bound Upper Bound META-C : NO-META -30.7 1050 -2710 2650 META-M : NO-META -2180.0 1050 -4860 494 META-S : NO-META -3910.0 1050 -6590 -1230

****

Table 5.21: Comparisons Between Pairs of META Techniques.


all pair comparison by Tukeys method Comparison Estimate Std. Error Lower Bound Upper Bound META-M : META-S 1730.0 1110 -1250 4700 META-M : META-C -2150.0 1110 -5130 819 META-S : META-C -3880.0 1110 -6850 -907

****

META-M techniques, are not statistically signicant. However, the results show that there is a statistically signicant dierence between the META-S and the NO-META techniques. (The average test-execution time for the META-S technique is 3910.0 seconds less than the average test-execution time for the NO-META technique) We also performed an all-pair comparison for all the META techniques to investigate whether there was a dierence between META techniques, using Tukeys method [62]. Table 5.2.3 presents the results of these comparisons. Also in this case, we mark with **** statistically-signicant cases, with a 95% condence interval. The results show that there is a statistically signicant dierence between the META-S and META-C techniques. Therefore, these results provide evidence that testing with component metadata could provide savings in testing costs, and that component metadata could be useful for regression test selection in component-based software. Furthermore, the dierences among the META techniques in savings in test-execution time indicate that the level of information about changes can aect the degrees of savings in test-execution time, and suggest that a proper focus of implementation eorts would be development of the META-S technique.

5.2.4

Empirical Study of Specication-based Regression Test Selection

To investigate whether the use of component metadata can benet specication-based regression test selection for applications built with external components, we performed an empirical study similar to our rst study, but focusing on our statechart-based technique.

110 5.2.4.1 Research Question Given an application A that uses a set of externally-developed components C, and has been tested with a test suite T , and given a new version of this set of components C , can we exploit component metadata to select a test suite T T that can reveal possible regression faults in A due to changes in C and that does not exclude test cases impacted by such changes? 5.2.4.2 Object of Analysis In practice, dierent techniques are expected to be of dierent appropriateness for application to dierent programs. The statechart-based technique presented here is not appropriate for Siena Siena involves only a single connection between application and component, which is not adequate to illustrate general software behavior where more than one interface between application and components exist. Thus, as an object for this study we selected six sequentially released versions of the Java implementation of an XML parser, NanoXML [93], a component library consisting of 17 classes and 226 methods. We also obtained an application program, JXML2SQL, that uses this library to read XML documents and respond to user queries about those documents, generating either an HTML le (showing its contents in tabular form) or an SQL le. Because neither a textual specication nor a statechart diagram for NanoXML or JXML2SQL was available, we constructed UML statechart diagrams for these systems by reverse engineering from their code; this step was performed by several graduate students experienced in UML, but unaquainted with the plans for this study or the technique being investigated. We grouped the classes in the NanoXML library into four logical groups: Parser, Validator, Builder, and XMLElement handler. We obtained one statechart diagram for the application, and four statechart diagrams (one per logical group) for each of the six dierent sequentially released versions of the NanoXML library. We used the Rational Rose Case tool, which complies with UML notations, to build these statechart diagrams. 5.2.4.3 Variables and Measures Independent Variables The independent variable is again RTS technique; we considered two specic techniques: UML-state-diagram-based component metadata. Components in C are provided with UML statecharts, in the form of metadata, sucient for the developer of A to (1) build a global behavioral model of their application, (2) identify dangerous transitions, and (3) select test cases through dangerous transitions, following the procedure described in Section 5.2.2.4. We refer to this as the META-U technique.

No component metadata. The developer of A knows only that the statecharts for one or more of the components in C have been modified, but not which. Therefore, to selectively retest A safely, the developer must rerun any test case in T that exercises code in one or more of the components in C. This technique is the same control technique used in our first study, and we continue to refer to it as the NO-META technique.

Dependent Variables and Measures

Our dependent variable is a single measure of efficiency. Analogous to the measures we used to investigate code-based techniques, we consider RTS techniques' abilities to reduce retesting effort by reducing the number of requirements needing retesting. Specifically, for each (version, subsequent-version) pair (GSDi, GSDi+1) of global statechart diagrams for program P, where a set of testing requirements TRi can be generated from GSDi, we measure the percentage of requirements in TRi selected by the technique as necessary to test Pi+1, given its statechart diagram GSDi+1.

5.2.4.4 Procedure

For each version of statechart diagrams SDi1, SDi2, ..., SDi5, 1 ≤ i ≤ 6, we constructed a global statechart diagram GSDi by composing the statechart diagrams incrementally using a composition tool implementing the technique described in Section 5.2.2.4. Table 5.22 shows the number of states and edges in each version of the individual component statechart diagrams and global statechart diagrams for the object program.11 Note that by inspection, we determined that versions 3 and 6 contained no functional (statechart-level) changes with respect to preceding versions, whereas other versions did contain changes (this includes V5 which, though possessing the same total number of states and transitions in the global statechart as V4, did contain different transitions, due to differences in the statecharts for component 4).

11 By applying the heuristics described in Section 5.2.2.4, the number of states in global statechart diagrams can be reduced well beyond the number that might be present given a naive approach. For instance, for V1, the total number of states in the global statechart diagram is 87 instead of 18480 (the product of the numbers of states in the application and component statecharts for V1: 8 x 14 x 5 x 11 x 3), which is the number that would have been obtained simply by naively composing all the statecharts.

To generate testing requirements from the global statechart diagrams, we constructed a tool to generate linearly independent test paths. A linearly independent path is a path that includes at least one edge that has not been traversed previously (in a given set of paths under construction) [112], and such sets of linearly independent paths are used by testers as requirements for test cases [9]. Applying this tool to the statecharts produced 54 testing requirements for the global statechart for version 1, and 66 testing requirements for each of the global statecharts for versions 2 through 5.
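To make these two steps concrete (enumeration of linearly independent paths as testing requirements, and the later selection of requirements through dangerous transitions, described below), the following minimal sketch is provided. It is hypothetical, not the tool used in this study: it represents a statechart simply as a set of labeled transitions, performs an exhaustive depth-first enumeration suitable only for small models, and keeps a path as a requirement only when it covers a transition not covered by previously kept paths.

    import java.util.*;

    // Sketch: testing requirements are linearly independent paths over a statechart;
    // a candidate path is kept only if it covers at least one transition not covered
    // by a previously kept path. select() keeps the requirements that contain at
    // least one dangerous transition reported by statechart differencing.
    public class StatechartRequirements {

        // state -> outgoing transitions, each encoded as "from --label--> to"
        private final Map<String, List<String>> out = new HashMap<>();

        public void addTransition(String from, String label, String to) {
            out.computeIfAbsent(from, k -> new ArrayList<>())
               .add(from + " --" + label + "--> " + to);
        }

        public List<List<String>> generate(String initialState) {
            List<List<String>> kept = new ArrayList<>();
            Set<String> covered = new HashSet<>();
            dfs(initialState, new ArrayList<>(), new HashSet<>(), kept, covered);
            return kept;
        }

        private void dfs(String state, List<String> path, Set<String> onPath,
                         List<List<String>> kept, Set<String> covered) {
            boolean extended = false;
            for (String edge : out.getOrDefault(state, Collections.emptyList())) {
                if (onPath.contains(edge)) continue;   // never repeat an edge within one path
                extended = true;
                String target = edge.substring(edge.indexOf("--> ") + 4);
                path.add(edge);
                onPath.add(edge);
                dfs(target, path, onPath, kept, covered);
                onPath.remove(edge);
                path.remove(path.size() - 1);
            }
            // At a maximal path, keep it as a requirement if it adds an uncovered edge.
            if (!extended && !covered.containsAll(path)) {
                kept.add(new ArrayList<>(path));
                covered.addAll(path);
            }
        }

        // Selection step: retain requirements that traverse a dangerous transition.
        public static List<List<String>> select(List<List<String>> requirements,
                                                Set<String> dangerousTransitions) {
            List<List<String>> selected = new ArrayList<>();
            for (List<String> requirement : requirements) {
                if (!Collections.disjoint(requirement, dangerousTransitions)) {
                    selected.add(requirement);
                }
            }
            return selected;
        }
    }

Under these assumptions, the selection rates reported later in Table 5.23 correspond to the size of the selected set divided by the size of the full requirement set for a version pair.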

Table 5.22: The Number of States and Transitions for Individual and Global Statechart Diagrams Per Version. Each Entry Takes the Form: Number of states (Number of transitions).

              V1        V2        V3        V4        V5        V6
  App        8(13)     8(13)     8(13)     8(13)     8(13)     8(13)
  Comp1     14(22)    16(26)    16(26)    16(25)    16(25)    16(25)
  Comp2      5(12)     5(14)     5(14)     5(14)     5(14)     5(14)
  Comp3     11(18)    11(19)    11(19)    11(19)    11(19)    11(19)
  Comp4      3(10)     3(10)     3(10)     3(10)     3(11)     3(11)
  Global   87(262)   89(285)   89(285)   84(277)   84(277)   84(277)

Given the global statechart diagrams and testing requirements for each version, for each pair (GSDi, GSDi+1) of sequential global statechart diagrams, we used an implementation of the DejaVu algorithm for statecharts to locate differences between GSDi and GSDi+1. This implementation outputs dangerous transitions; all testing requirements containing these transitions are selected. We applied these procedures to each of the five (GSDi, GSDi+1) pairs of NanoXML and its application, and recorded the percentage of testing requirements selected by the spec-based RTS technique for that (GSDi, GSDi+1) pair. These percentages served as the data sets for our analysis.

5.2.4.5 Data and Analysis

Table 5.23 shows the testing requirement selection rates observed in this study. On versions 3 and 6, no testing requirements were selected by either technique because there were no changes made to the statecharts for these versions: the code changes for these versions did not involve functional changes. Because the main goal of statechart-based regression test selection is to identify test cases addressing changed system requirements, these results are appropriate.

Table 5.23: Testing Requirement Selection Rates

  Version      C2        C3       C4        C5       C6
  NO-META   100.0%     0.0%    100.0%    100.0%    0.0%
  META-U     83.33%    0.0%     86.36%   39.39%    0.0%

On versions 2, 4, and 5, selection was able to reduce the number of requirements needing retesting, with respect to those identified by the NO-META technique. Selection rates for versions 2 and 4 were 83.33% and 86.36%, respectively, and the selection rate for version 5 was 39.39%. On versions 2 and 4, in fact, only a few (in each case, 4) dangerous edges were identified, but these edges were present in most testing requirements.

On version 5, 12 dangerous edges were identified, but these were present in fewer requirements.

Like our study of code-based regression, this study has limitations. We have considered only a single program, and constructed statechart diagrams through reverse engineering, and cannot claim that these results will generalize to other systems and diagrams. Additional studies with other objects that can provide statechart diagrams as component metadata are needed to address such questions of external validity. These results do provide, however, an initial look at the feasibility of statechart-based regression test selection using metadata, and suggest the potential utility of further work.

5.2.5

Discussion

In this study, we have presented the results of our investigation of whether component metadata can be leveraged to support and improve the cost-effectiveness of software-engineering tasks for component-based applications. In particular, we focused on regression testing. We have introduced two new techniques for performing regression test selection based on component metadata. Being code-based and specification-based, respectively, the two techniques are potentially complementary on systems to which both can be applied. We have illustrated the application of these techniques on examples, and provided initial empirical results of applying them to two real Java systems. These results show that component metadata can feasibly be used to produce savings in retesting cost through regression test selection; such a demonstration is a prerequisite for arguing for the utility of further research on component metadata.

Having fulfilled this prerequisite, however, there are many open questions that must be addressed. A first set of questions continues this study's focus on regression testing. Here, we have considered the problem of re-testing applications as the components they use evolve. The re-testing of applications as they use entirely new variants of particular components is another problem worth addressing. We also limit discussion to selection of existing test cases, but identification of situations where new test cases are needed is also important. Finally, we consider one mechanism for combining one particular form of specification, but many alternatives could also be investigated.

A second set of questions involves whether other uses of component metadata may provide cost-effective solutions to software-engineering problems involving components. Can component metadata aid maintainers in judging the impact of changes in components on their applications? Can component metadata help engineers verify security properties for their applications that use components? If so, can component metadata be guaranteed to reflect actual properties of the component in a manner similar to that achieved by proof-carrying code? Also, can component metadata be retrofitted with new information after the component has been deployed?

A third set of questions involves the mechanics of component metadata. What is the overhead, in terms of component size and component execution time, of incorporating metadata and metamethods into a component? How does this overhead vary as we increase the amount and variety of metadata and metamethods? How can component-metadata-based techniques function in the presence of deep nestings of components? For example, suppose application A depends on component C1, and component C1 depends on component C2, but C1 and C2 are developed by different (or even multiple) vendors, and it is the developer of A who decides which flavor of C1 and C2 to integrate. In this case, what mechanisms are needed to allow C1 to detect and extract metadata from C2 as part of its metadata generation for A? What happens if the developer of A decides to switch vendors from one version of a component to the next?

A fourth set of questions involves trust and privacy. What kind of software-engineering tasks can be addressed using component metadata without revealing too much information about the component? For the tasks for which component metadata must contain sensitive information (e.g., a dependence graph), can the information be encoded so that the component user cannot understand it, but the technique can still use it? In general, is there some way of mathematically showing that adding a certain kind of metadata does not reveal too much about a component?

The results of the research presented in this study provide evidence that component metadata can have value, and thus, that these sets of questions are worth addressing. Through such consideration, it may be possible to find new ways to address some of the critical software-engineering problems raised by component-based software systems.

Figure 5.15: Normalized statechart for VendingMachine.


Figure 5.16: Normalized statechart for Dispenser.

5.3

On the Use of Mutation Faults in Empirical Assessments of Test Case Prioritization Techniques

In the empirical study presented in Section 5.1 we learned that the number of faults considered in studies might affect the results of evaluations of test case prioritization. One possible way to accommodate a larger number of faults is to use mutation faults.12 A recent study by Andrews et al. [2] suggests that mutation faults can be representative of real faults, and that the use of hand-seeded faults can be problematic for the validity of empirical results focusing on fault detection. We therefore set out to design a study to assess the effects of fault type on evaluations of regression testing techniques using our Java infrastructure, focusing in particular on mutation faults.

12 Mutation faults are syntactic code changes made to program source code to cause that program's behavior to change.

5.3.1

Study Overview

In this study, we designed and performed two controlled experiments to assess the ability of test case prioritization techniques to improve the rate of fault detection, measured relative to mutation faults. In the first experiment, we examine the abilities of several prioritization techniques to improve the rate of fault detection of JUnit test suites on four Java systems, while also varying other factors that affect prioritization effectiveness. In the second experiment we replicate the first, but we consider a pair of Java programs provided with system, rather than JUnit, test suites.

We explore four important issues through the experimental results: (1) an analysis of the prioritization results obtained in these experiments; (2) an analysis of the differences between mutation and hand-seeded faults with respect to prioritization results; (3) an analysis (replicating the analysis performed by Andrews et al. [2]) of the differences between mutation and hand-seeded faults with respect to fault detection ability; and (4) a discussion of the practical implications of our results.

The State of the Empirical Understanding of Prioritization to Date

Since we are investigating issues related to the types of faults used in experimentation with prioritization techniques, we here analyze prior research that has involved similar experimentation, to provide insights into the relationships that exist between the objects used in experiments and prioritization results. There have been no prior studies conducted of prioritization using different types of faults over the same programs. Thus, we are not able to directly compare prior empirical results to see whether or not the types of faults utilized could affect results for the same programs; however, we can obtain some general ideas by comparing prioritization results across studies qualitatively. As shown earlier in Chapter 2, many such studies have been conducted; for this analysis we chose five ([37, 42, 39, 81, 132]) that involve different object programs and types of faults.

Table 5.24 summarizes characteristics of the object programs used in the five studies we consider. Five of the programs (javac, ant, jmeter, xml-security, and jtopas) are written in Java, and the rest are written in C/C++. Program size varies from 138 LOC to 1.8 million LOC. Six programs (OA, GCC, Jikes, javac, space, and QTB) have real faults, and the others have hand-seeded faults. As a general trend observed in the table, the number of real faults is larger than the number of hand-seeded faults on all programs except bash and some of the Siemens programs. The number of faults for OA was not provided in [132].

Table 5.25 shows prioritization results measured using the APFD metric, for all programs except OA, and for four prioritization techniques and one control technique (random ordering) investigated in the papers. The result for OA presents the percentage of faults in the program detected by the first test sequence in the prioritized order. (A test sequence as defined in [132] is a list of test cases that achieves maximal coverage of the program, relative to the coverage achieved by the test suite being prioritized.) In the experiment described in [132], the first sequence contained four test cases, and the entry in Table 5.25 indicates that those four test cases detected 85% of the faults in the program. For GCC, Jikes, and javac, a prioritization technique (comb) that combines test execution profile distribution and coverage information was applied. For the other programs, two coverage-based prioritization techniques, total and addtl, which order test cases in terms of their total coverage of program components (functions, methods, or statements), or their coverage of program components not yet covered by test cases already ordered, were applied. As a control technique, a random ordering of test cases was used in all studies other than the one involving OA.

Table 5.24: Object Programs Used in Prioritization Studies

  Studies                          Program           Size (LOC)   Ver.  Test Cases  Faults  Type of Faults  Prioritization Techniques
  Srivastava & Thiagarajan [132]   OA (Office App.)  1.8 million    2       157        -     real            coverage & change-based
  Leon & Podgurski [81]            GCC                  230K        1      3333       27     real            distribution & coverage-based
                                   Jikes                 94K        1      3149      107     real
                                   javac                 28K        1      3140       67     real
  Elbaum et al. [42]               Siemens           0.1K-0.5K      1      6-19     7-41     seeded          coverage-based (random, func-total, func-addtl)
                                   space                 6.2K       1       155       35     real
                                   grep                  7.4K       5       613       11     seeded
                                   flex                    9K       5       525       18     seeded
  Elbaum et al. [39]               QTB                   300K       6       135       22     real            coverage-based (random, func-total, func-addtl)
                                   bash                 65.6K      10      1168       58     seeded
                                   gzip                  6.5K       6       217       15     seeded
  Do et al. [37]                   ant                  80.4K       9       877       21     seeded          coverage-based (random, meth-total, meth-addtl)
                                   jmeter               43.4K       6        78        9     seeded
                                   xml-security         16.3K       4        83        6     seeded
                                   jtopas                5.4K       4       128        5     seeded

Table 5.25: Results of Prior Prioritization Studies: Measured Using the APFD Metric, for all Programs Except OA

  Technique    OA    GCC   Jikes   javac   Siemens   space   grep   flex
  ccb         85%     -      -       -        -        -       -      -
  comb.        -     84     74      77        -        -       -      -
  total        -      -      -       -       86       92      38     66
  addtl        -      -      -       -       82       94      92     97
  random       -     80     58      64       68       85      78     88

  Technique   QTB   bash   gzip   ant   jmeter   xml.   jtopas
  ccb          -      -      -     -       -       -       -
  comb.        -      -      -     -       -       -       -
  total       78     90     50    51      34      97      68
  addtl       67     96     88    84      77      87      97
  random      63     80     75    64      60      71      61

Examining the data in Tables 5.24 and 5.25, we observed that the results vary across programs, and thus we further analyzed the data to see what sorts of attributes, if any, might have affected these results, considering several attributes:

Program size. To investigate whether program size affected the results, we compared results considering three different classes of program size that are applicable to the programs we consider: small (smaller than 10K LOC): Siemens, space, grep, flex, gzip, and jtopas; medium (larger than 10K LOC and smaller than 100K LOC): Jikes, javac, bash, ant, jmeter, and xml-security; and large (larger than 100K LOC): OA, GCC, and QTB. While large programs are associated with moderate fault detection rates and with prioritization techniques outperforming random ordering, small and medium sized programs do not show any specific trends.

Test case source. The test cases used in the five studies were obtained from one of two different sources: provided with the programs by developers, or generated by researchers. Table 5.26 shows prioritization results grouped by test case source, considering the types of test cases involved (traditional and JUnit) separately. For traditional test suites, prioritization techniques reveal different trends across the two groups: for the provided group, prioritization techniques are always better than random ordering. In particular, bash displays relatively high fault detection rates. For the generated group, we can classify programs into two groups relative to results: (1) Siemens and space, and (2) grep, flex, and gzip. The results on Siemens and space show that prioritization techniques outperform random ordering. Results on the other three programs, however, differ: on these, the total coverage technique does not improve the rate of fault detection, but the additional coverage technique performs well. One possible reason for this difference is that test cases for Siemens and space were created to rigorously achieve complete code coverage of branches and statements. The test cases for the other three programs, in contrast, were created primarily based on the programs' functionality, and do not possess strong code coverage. For JUnit test suites, all of which came with the programs, the results vary depending on program, and it is difficult to see any general trends in these results. In general, however, since JUnit test cases do not focus on code coverage, varying results are not surprising.

Number of faults. On all artifacts equipped with JUnit test suites, as well as on grep, flex, and gzip, the number of faults per version is relatively small compared to the other programs. This, too, may be responsible for the greater variance in prioritization results on the associated programs.

Type of faults. Considering hand-seeded versus real faults, results using real faults show that prioritization techniques always outperform random ordering, but fault detection rates varied across programs. For example, while fault detection rates on space are very high, fault detection rates on QTB, Jikes, and javac are relatively low.

Table 5.26: Prioritization Results Grouped by Test Case Source: Measured Using the APFD Metric, for all Programs Except OA. To Facilitate Interpretation, the Last Row Indicates the Average Number of Faults per Version

                           JUnit (Provided)
  Technique            ant    jmeter   xml-security   jtopas
  ccb                   -        -          -            -
  comb.                 -        -          -            -
  total                51       34         97           68
  addtl                84       77         87           97
  random               64       60         71           61
  # faults per ver.   2.3      1.5        1.5          1.3

  (a) Prioritization Results Grouped by Test Case Source: JUnit

                              Provided                             Generated
  Technique            OA    GCC   Jikes  javac   QTB   bash   Siemens  space  grep  flex  gzip
  ccb                 85%     -      -      -      -     -        -       -      -     -     -
  comb.                -     84     74     77      -     -        -       -      -     -     -
  total                -      -      -      -     78    90       86      92     38    66    50
  addtl                -      -      -      -     67    96       82      94     92    97    88
  random               -     80     58     64     63    80       68      85     78    88    75
  # faults per ver.    -     27    107     67    3.7   5.8      7-41     35    2.2   3.6   2.5

  (b) Prioritization Results Grouped by Test Case Source: Traditional

The study of OA does not use the APFD metric, but from the high fault detection percentage (85%) obtained by the first prioritized test sequence derived for OA and the APFD metric calculation method (see Section 2.1.2), we can infer that the prioritization technique for OA also yields high fault detection rates. For programs using hand-seeded faults, results vary across programs. For Siemens, bash, xml-security, and jtopas, prioritization techniques outperform random ordering. For grep, flex, gzip, ant, and jmeter, the total coverage technique is not better than random ordering, while the additional coverage technique performs better than random ordering.

Other attributes. We also considered two other attributes that might have affected the results: the type of language used, and the type of testing being performed (JUnit versus functional), but we can observe no specific trends regarding these attributes.

From the foregoing analysis, we conclude that at least three specific attributes could have affected prioritization results: the type of faults, the number of faults, and the source of test cases. The fact that the type and number of faults could affect prioritization results provides further motivation toward the investigation of the usefulness of mutation faults in empirical investigations of prioritization. Further, this provides motivation for considering evaluations of client testing techniques, and for using different types of faults in relation to the Andrews et al. study. Since the source of test cases could affect prioritization results, some consideration of this factor may also be worthwhile.
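Because the comparisons above, and the experiments that follow, are all expressed in terms of APFD, it may help to recall how that metric is computed. The following minimal sketch (hypothetical code; the detection-matrix input format is invented) follows the standard formulation summarized in Section 2.1.2: APFD = 1 - (TF1 + ... + TFm)/(n*m) + 1/(2n), where n is the number of test cases, m the number of faults, and TFj the position in the prioritized order of the first test case that detects fault j.

    // Sketch of the standard APFD computation. Assumes every fault is detected
    // by at least one test case in the ordering, as the APFD metric does.
    public class Apfd {

        // detects[i][j] is true if the i-th test in the prioritized order detects fault j.
        public static double compute(boolean[][] detects) {
            int n = detects.length;        // number of test cases
            int m = detects[0].length;     // number of faults
            double sumTF = 0.0;
            for (int j = 0; j < m; j++) {
                for (int i = 0; i < n; i++) {
                    if (detects[i][j]) {
                        sumTF += i + 1;    // 1-based position of first detecting test
                        break;
                    }
                }
            }
            return 1.0 - sumTF / (n * (double) m) + 1.0 / (2.0 * n);
        }
    }

For example, if the first of ten test cases in an ordering detects all faults, APFD approaches 0.95, whereas an ordering whose detecting tests appear late yields a value closer to zero; this is the sense in which higher APFD values indicate faster fault detection.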

5.3.2

Mutation

5.3.2.1 Program Mutation

As briefly discussed in Chapter 4, the notion of mutation faults grew out of the notion of mutation testing, a testing technique that evaluates the adequacy of a test suite for a program [20, 30, 54] by inserting simple syntactic code changes into the program, and checking whether the test suite can detect these changes. The potential effectiveness of mutation testing has been suggested through many empirical studies (e.g., [49, 95]) focusing on procedural languages. Recently, researchers have begun to investigate mutation testing of object-oriented programs written in Java [11, 74, 75, 85]. While most of this work has focused on implementing object-oriented-specific mutant generators, Kim et al. [75] apply mutation faults to several testing strategies for object-oriented software, and assess those strategies in terms of their effectiveness. Most recently, Andrews et al. [2] investigated the representativeness of mutation faults by comparing the fault detection ability of test suites on hand-seeded, mutation, and real faults, focusing on C systems, with results favorable to mutation faults and problematic for hand-seeded faults. Coupled with the fact that mutation faults are much less expensive to produce than hand-seeded faults, mutation faults may provide an attractive alternative for researchers when their experiments require programs with faults. Additional studies are needed, however, to further generalize this conclusion. In this study we further investigate the findings of Andrews et al. in the context of test case prioritization using Java programs and JUnit test suites, considering mutation faults and hand-seeded faults.

5.3.2.2 Mutation Approach

To conduct our investigation we required a tool for generating program mutants for systems written in Java. The mutation testing techniques described in the previous section use source-code-based mutant generators, but for this study we implemented a mutation tool that generates mutants for Java bytecode. There are benefits associated with this approach. First, it is easier to generate mutants for bytecode than for source code, because this does not require the parsing of source code. Instead, we manipulate Java bytecode using pre-defined libraries contained in BCEL (the Byte Code Engineering Library) [8], which provides convenient facilities for analyzing, creating, and manipulating Java class files.

Second, because Java is a platform-independent language, vendors or programmers might choose to provide just class files for system components, and bytecode mutation lets us handle these files. Third, working at the bytecode level means that we do not need to recompile Java programs after we generate mutants.

Mutation Operators

To create reasonable mutants for Java programs, we surveyed papers that consider mutation testing techniques for object-oriented programs [11, 74, 85]. There are many mutation operators suggested in these papers that handle aspects of object orientation such as inheritance and polymorphism. From among these operators, we selected the following mutation operators that are applicable to Java bytecode (Table 5.27 summarizes):

Table 5.27: Mutation Operators for Java Bytecode
  Operators   Descriptions
  AOP         Arithmetic Operator Change
  LCC         Logical Connector Change
  ROC         Relational Operator Change
  AFC         Access Flag Change
  OVD         Overriding Variable Deletion
  OVI         Overriding Variable Insertion
  OMD         Overriding Method Deletion
  AOC         Argument Order Change

Arithmetic OPerator change (AOP). The AOP operator replaces an arithmetic operator with other arithmetic operators. For example, the addition (+) operator is replaced with a subtraction, multiplication, or division operator.

Logical Connector Change (LCC). The LCC operator replaces a logical connector with other logical connectors. For example, the AND connector is replaced with an OR or XOR connector.

Relational Operator Change (ROC). The ROC operator replaces a relational operator with other relational operators. For example, the greater-than-or-equal-to operator is replaced with a less-than-or-equal-to, equal-to, or not-equal-to operator.

Access Flag Change (AFC). The AFC operator replaces an access flag with other flags. For example, this operator changes a private access flag to a public access flag.

Overriding Variable Deletion (OVD). The OVD operator deletes a declaration of overriding variables. This change makes a child class attempt to reference the variable as defined in the parent class.

Overriding Variable Insertion (OVI). The OVI operator causes behavior opposite to that of OVD. The OVI operator inserts variables from a parent class into the child class.

Overriding Method Deletion (OMD). The OMD operator deletes a declaration of an overriding method in a subclass so that the overridden method is referenced.

Argument Order Change (AOC). The AOC operator changes the order of arguments in a method invocation, if there is more than one argument. The change is applied only if arguments have the appropriate type.

The first three of these operators are also typical mutation operators for procedural languages, and the other operators are object-oriented specific.

Mutation Sets for Regression Testing

Because this study focuses on regression faults, we needed to generate mutants that involve only code modified in transforming one version of a system to a subsequent version. To do this, we built a tool that generates a list of the names of Java methods, in a version of program P, that differ from those in a previous version of P. Our mutant generator generates mutants using this information. We refer to this mutant generator as a selective mutant generator.

Figure 5.17 illustrates the selective mutant generation process. A differencing tool reads two consecutive versions of a Java source program, P and P′, and generates a list of names (diff_method_name) of methods that are modified in P′ with respect to P, or newly added to P′. The selective mutant generator reads diff_method_name and Java class files for P′, and generates mutants (Mutant 1, Mutant 2, ..., Mutant k) only in the listed (modified) methods.13 We then compared outputs from program runs in which these mutants were enabled (one by one) with outputs from a run of the original program, and retained mutants only if their outputs were different. This process is reasonable because we are interested only in mutants that can be revealed by our test cases, since prioritization affects only the rate at which faults that can be revealed by a test suite are detected in a use of that suite. We also discarded mutants that caused verification errors14 during execution, because these represent errors that would be revealed by any simple execution of the program.

13 The selective mutant generator generates mutants that occur in both the changed code and its neighborhood, where the neighborhood is the enclosing function, and this process matches our hand-seeding process.
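As an illustration of how a bytecode-level operator can be applied within a changed method, the following sketch uses BCEL to perform an AOP-style change (replacing each integer addition, iadd, with a subtraction, isub), one instruction at a time, and writes each resulting mutant to its own class file. The sketch is hypothetical: it is not the mutant generator used in this work, it covers only one operator, and the class name, output directory, and command-line arguments are invented.

    import org.apache.bcel.classfile.ClassParser;
    import org.apache.bcel.classfile.JavaClass;
    import org.apache.bcel.classfile.Method;
    import org.apache.bcel.generic.ClassGen;
    import org.apache.bcel.generic.ConstantPoolGen;
    import org.apache.bcel.generic.IADD;
    import org.apache.bcel.generic.ISUB;
    import org.apache.bcel.generic.InstructionHandle;
    import org.apache.bcel.generic.MethodGen;

    // Sketch: generate one AOP (iadd -> isub) mutant class file per candidate
    // instruction, restricted to a single method named in the diff list.
    public class AopMutantSketch {

        public static void main(String[] args) throws Exception {
            String classFile = args[0];       // a class file of the new version P'
            String changedMethod = args[1];   // a method name from diff_method_name
            JavaClass clazz = new ClassParser(classFile).parse();
            int mutantId = 0;
            for (Method m : clazz.getMethods()) {
                if (!m.getName().equals(changedMethod) || m.getCode() == null) continue;
                int candidates = countIadds(m, clazz);
                for (int k = 0; k < candidates; k++) {
                    ClassGen cg = new ClassGen(clazz);   // fresh copy per mutant
                    MethodGen mg = new MethodGen(m, cg.getClassName(), cg.getConstantPool());
                    int seen = 0;
                    for (InstructionHandle ih : mg.getInstructionList().getInstructionHandles()) {
                        if (ih.getInstruction() instanceof IADD && seen++ == k) {
                            ih.setInstruction(new ISUB());   // the actual mutation
                        }
                    }
                    mg.setMaxStack();
                    cg.replaceMethod(m, mg.getMethod());
                    // Assumes a "mutants" output directory already exists.
                    cg.getJavaClass().dump("mutants/Mutant" + (++mutantId) + ".class");
                }
            }
        }

        // Counts iadd instructions in the method, so each can be mutated in isolation.
        private static int countIadds(Method m, JavaClass clazz) {
            MethodGen mg = new MethodGen(m, clazz.getClassName(),
                    new ConstantPoolGen(clazz.getConstantPool()));
            int count = 0;
            for (InstructionHandle ih : mg.getInstructionList().getInstructionHandles()) {
                if (ih.getInstruction() instanceof IADD) count++;
            }
            return count;
        }
    }

Because the mutated method is serialized into a separate class file, enabling a mutant for a test run amounts to substituting that file for the original class on the classpath; other operators in Table 5.27 would follow the same pattern over different bytecode instructions.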

Figure 5.17: Selective mutant generation process.

5.3.3

Experiment 1

Our primary goal is to replicate prior experiments with prioritization using a new population of faults (mutation faults) in order to consider whether prioritization results obtained with mutation faults differ from those obtained with hand-seeded faults, and if there are differences, explore what factors might be involved in those differences and what implications this may have for empirical studies of prioritization. In doing this, we also gain the opportunity to generalize our empirical knowledge about prioritization techniques, taking into account new study settings. We begin with a controlled experiment utilizing the same object programs and versions used in an earlier study [36] in which only hand-seeded faults were considered. Our experimental design replicates that of [36].

5.3.3.1 Research Questions

We investigate all three research questions considered in Section 5.1.3 using a new population of faults, and further address whether prioritization results obtained with mutation faults differ from those obtained with hand-seeded faults. If there are differences, we explore what factors might be involved in those differences and what implications this may have for empirical studies of prioritization.

14 As part of the class loading process, a thorough verification of the bytecode in the file being loaded takes place, to ensure that the file holds a valid Java class and does not break any of the rules for class behavior [84].

Table 5.28: Experiment Objects and Associated Data

  Objects        No. of    No. of   No. of test cases    No. of test cases     No. of   No. of    No. of
                 versions  classes  (test-class level)   (test-method level)   faults   mutants   mutant groups
  ant                9       627           150                  877              21      2907        187
  xml-security       4       143            14                   83               6       127         52
  jmeter             6       389            28                   78               9       295        109
  jtopas             4        50            11                  128               5         8          7

5.3.3.2 Objects of Analysis

We used four Java programs with JUnit test cases as objects of analysis: ant, xml-security, jmeter, and jtopas. The detailed descriptions of these programs are given in Section 5.1.3.2. Table 5.28 lists, for each of our objects, the following data:

No. of versions. The number of versions of the program that we utilized.

No. of classes. The total number of class files in the latest version of that program.

No. of test cases (test-class level). The number of distinct test cases in the JUnit suites for the programs following a test-class level view of testing.

No. of test cases (test-method level). The number of distinct test cases in the JUnit suites for the programs following a test-method level view of testing.

No. of faults. The total number of hand-seeded faults available (summed across all versions) for each of the objects.

No. of mutants. The total number of mutants generated (summed across all versions) for each of the objects.

No. of mutant groups. The total number of sets of mutants that were formed randomly for each of the objects for use in experimentation (summed across all versions); this is explained further in Section 5.3.3.4.

5.3.3.3 Variables and Measures

Independent Variables

The experiment manipulated two independent variables: prioritization technique and test suite granularity. We considered the following seven different test case prioritization techniques (detailed descriptions of the techniques are given in Section 5.1.3.3): untreated (T1), random (T2), optimal (T3), total block coverage (T4), additional block coverage (T5), total method coverage (T6), and additional method coverage (T7).

We also considered the same test suite granularity levels for JUnit test cases described in Section 5.1.3.3: test-class level and test-method level.

Dependent Variables and Measures

Rate of Fault Detection

To investigate our research questions we need to measure the benefits of various prioritization techniques in terms of rate of fault detection. To measure rate of fault detection, we again used the APFD metric described in Section 2.1.2.

5.3.3.4 Experiment Setup

To assess test case prioritization relative to mutation faults, we needed to generate mutants. As described in Section 5.3.2, we considered mutants created, selectively, in locations in which code modifications occurred in a program version, relative to the previous version. The foregoing process created mutant pools, one for each version of each object after the first (base) version. The numbers of mutants contained in the mutant pools for our object programs (summed across versions) are shown in Table 5.28.

These mutant pools provide universes of potential program faults. In actual testing scenarios, however, programs do not typically contain faults in numbers as large as the size of these pools. To simulate more realistic testing scenarios, we randomly selected smaller sets of mutants, mutant groups, from the mutant pools for each program version, as sketched below. Each mutant group thus selected varied randomly in size between one and five mutants, and no mutant was used in more than one mutant group. We limited the number of mutant groups to 30 per program version, but many versions did not have enough mutants to allow formation of this many groups, so in these cases we stopped generating mutant groups for each object when no additional unique groups could be created. This resulted in several cases in which the number of mutant groups is smaller than 30; for example, jtopas has only seven mutant groups across its three versions.

Given these mutant groups, our experiment then required application of prioritization techniques over each mutant group. The rest of our experiment process is similar to the process described in Section 5.1.3.4.

5.3.3.5 Threats to Validity

This experiment shares most of the threats to validity addressed in Section 5.1.3.5.
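The following minimal sketch illustrates the mutant-group construction described in Section 5.3.3.4. It is hypothetical code (mutant identifiers and method names are invented): groups of one to five mutants are drawn at random from a version's mutant pool, no mutant is reused across groups, and at most 30 groups are formed per version.

    import java.util.*;

    // Sketch of mutant-group formation for one program version.
    public class MutantGroups {

        public static List<List<String>> form(List<String> mutantPool, long seed) {
            Random rand = new Random(seed);
            List<String> pool = new ArrayList<>(mutantPool);
            Collections.shuffle(pool, rand);              // random draw without replacement
            List<List<String>> groups = new ArrayList<>();
            int next = 0;
            while (groups.size() < 30 && next < pool.size()) {
                int size = 1 + rand.nextInt(5);           // group size between 1 and 5
                int end = Math.min(next + size, pool.size());
                groups.add(new ArrayList<>(pool.subList(next, end)));
                next = end;                                // no mutant appears in two groups
            }
            return groups;
        }
    }

Under this scheme, versions whose pools are small are exhausted before 30 groups can be formed, which is why an object such as jtopas, with only eight mutants in total, yields just seven groups.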


Figure 5.18: APFD boxplots, all programs, all techniques. The horizontal axes list techniques, and the vertical axes denote APFD scores. The plots on the left present results for test-class level test cases and the plots on the right present results for test-method level test cases. See Table 5.2 for a legend of the techniques.

5.3.3.6 Data and Analysis

To provide an overview of the collected data we present boxplots in Figure 5.18. The plots on the left side of the figure present results from test case prioritization applied to the test-class level test cases, and the plots on the right side present results from test case prioritization applied to the test-method level test cases. Each row presents results for one object program. Each plot contains a box for each of the seven prioritization techniques, showing the distribution of APFD scores for that technique across all of the mutant groups used for all of the versions of that object program. See Table 5.2 for a legend of the techniques.

Examining the boxplots for each object program, we observe that the results vary substantially across programs. For example, while the boxplots for xml-security indicate that the spread of results among non-control techniques was very small for both test suite levels, and all non-control techniques improved fault detection rate with respect to randomly ordered and untreated test suites, the boxplots for jtopas show various spreads across techniques and some cases in which heuristics were no better than randomly ordered or untreated test suites.

Table 5.29: Experiment 1: Kruskal-Wallis Test Results, per Program

                               Test class                        Test method
  Program    Control   chi-square  d.f.   p-value      chi-square  d.f.   p-value
  ant        untrtd       56.12     4     < 0.0001        124.79    4     < 0.0001
  ant        rand        136.69     4     < 0.0001        145.04    4     < 0.0001
  jmeter     untrtd       71.81     4     < 0.0001         37.09    4     < 0.0001
  jmeter     rand         55.79     4     < 0.0001          5.38    4       0.2499
  xml sec.   untrtd      134.68     4     < 0.0001        136.18    4     < 0.0001
  xml sec.   rand        125.1      4     < 0.0001        114.47    4     < 0.0001
  jtopas     untrtd       71.86     4     < 0.0001         37.09    4     < 0.0001
  jtopas     rand          9.52     4       0.0492         16.24    4       0.0027

For this reason, we analyze the data for each program separately. For each program, following the procedure used in Section 5.1, we first consider the data descriptively, and then we statistically analyze the data to (1) compare the heuristics to randomly ordered and untreated test suites, (2) consider the effects of information types on the performance of heuristics, and (3) consider the effects of feedback on the performance of heuristics.

For our statistical analyses, we used the Kruskal-Wallis non-parametric one-way analysis of variance followed (in cases where the Kruskal-Wallis showed significance) by Bonferroni's test for multiple comparisons. (We used the Kruskal-Wallis test because our data did not meet the assumptions necessary for using ANOVA: our data sets do not have equal variance, and some data sets have severe outliers. For multiple comparisons, we used the Bonferroni method for its conservatism and generality.) We used the Splus statistics package [131] to perform the analyses.

For each program, we performed two sets of analyses, considering both test suite levels: untreated versus non-control and random versus non-control. Table 5.29 presents the results of the Kruskal-Wallis tests, for a significance level of 0.05, and Tables 5.30 and 5.31 present the results of the Bonferroni tests at the test-class level and test-method level, respectively. In the two Bonferroni tables, cases in which the differences between the techniques compared were statistically significant are marked by **** (which indicates confidence intervals that do not include zero). Cases in which Bonferroni tests were not performed are marked by -.

Analysis of results for ant

The boxplots for ant suggest that non-control techniques yielded improvements over random and untreated orderings at both test suite levels. As shown in Table 5.29, the Kruskal-Wallis test reports that there is a significant difference between techniques for both test suite levels. Thus we performed multiple pair-wise comparisons on the data using the Bonferroni procedure for both test suite levels. The results in Tables 5.30 and 5.31 confirm that non-control techniques improved the rate of fault detection compared to both randomly ordered and untreated test suites (as shown in the first eight rows in Tables 5.30 and 5.31).

Regarding the effects of information types on prioritization, comparing the boxplots of block-total (T4) to method-total (T6) and block-addtl (T5) to method-addtl (T7), it appears that the level of coverage information utilized (block versus method) had no effect on techniques' rate of fault detection, at both test-method and test-class levels. In contrast, comparing the results of block-total to block-addtl and method-total to method-addtl at the test-method level, it appears that techniques using feedback did yield improvement over those not using feedback. The Bonferroni analyses in Tables 5.30 and 5.31 confirm these impressions.

Table 5.30: Experiment 1: Bonferroni Analysis, All Programs, Test-class Level Granularity. Each Entry Gives Estimate (Lower, Upper); **** Marks Statistically Significant Differences.

  Comparison   ant                        jmeter                     xml-security               jtopas
  T1-T4        -14.7 (-19.3, -10.0) ****  -19.5 (-26.5, -12.6) ****  -35.4 (-41.6, -29.1) ****    0.6 (-43.6, 44.8)
  T1-T5        -16.1 (-20.8, -11.4) ****  -14.3 (-21.2,  -7.3) ****  -34.7 (-41.0, -28.5) ****  -16.8 (-61.0, 27.4)
  T1-T6        -14.3 (-18.9, -9.62) ****  -18.3 (-25.2, -11.4) ****  -35.1 (-41.3, -28.8) ****   -3.4 (-47.6, 40.7)
  T1-T7        -15.2 (-19.9, -10.6) ****  -13.4 (-20.3,  -6.4) ****  -34.2 (-40.5, -28.0) ****  -30.1 (-74.3, 14.1)
  T2-T4        -16.6 (-20.9, -12.4) ****  -13.7 (-20.2,  -7.2) ****  -17.7 (-21.1, -14.4) ****   38.6 ( -4.0, 81.2)
  T2-T5        -18.1 (-22.4, -13.8) ****   -8.4 (-15.0,  -1.9) ****  -17.1 (-20.4, -13.7) ****   21.2 (-21.5, 63.8)
  T2-T6        -16.3 (-20.6, -12.0) ****  -12.5 (-19.0,  -5.9) ****  -17.4 (-20.8, -14.1) ****   34.5 ( -8.1, 77.1)
  T2-T7        -17.2 (-21.5, -12.9) ****   -7.5 (-14.1,  -1.0) ****  -16.6 (-19.9, -13.3) ****    7.8 (-34.8, 50.5)
  T4-T5         -1.4 ( -5.7,   2.8)         5.2 ( -1.2,  11.8)         0.6 ( -2.6,   4.0)       -17.4 (-61.6, 26.8)
  T4-T6          0.3 ( -3.9,   4.6)         1.2 ( -5.2,   7.7)         0.3 ( -3.0,   3.6)        -4.0 (-48.3, 40.1)
  T4-T7         -0.5 ( -4.8,   3.7)         6.1 ( -0.3,  12.7)         1.1 ( -2.1,   4.4)       -30.7 (-74.9, 13.5)
  T5-T6          1.8 ( -2.4,   6.1)        -4.0 (-10.5,   2.4)        -0.3 ( -3.6,   2.9)        13.3 (-30.9, 57.5)
  T5-T7          0.8 ( -3.4,   5.1)         0.8 ( -5.6,   7.4)         0.4 ( -2.8,   3.8)       -13.3 (-57.5, 30.9)
  T6-T7         -0.9 ( -5.2,   3.3)         4.9 ( -1.6,  11.4)         0.8 ( -2.5,   4.1)       -26.6 (-70.8, 17.6)

Table 5.31: Experiment 1: Bonferroni Analysis, All Programs, Test-method Level Granularity. Each Entry Gives Estimate (Lower, Upper); **** Marks Statistically Significant Differences; - Marks Comparisons That Were Not Performed.

  Comparison   ant                        jmeter                     xml-security               jtopas
  T1-T4        -21.7 (-26.4, -17.1) ****   -9.4 (-17.0,  -2.0) ****  -42.7 (-48.9, -36.6) ****   9.9e+000 ( -28.3,  48.2)
  T1-T5        -28.6 (-33.3, -23.9) ****  -12.8 (-20.3,  -5.3) ****  -40.6 (-46.7, -34.5) ****  -6.2e+001 (-101.0, -24.3) ****
  T1-T6        -22.2 (-26.8, -17.5) ****   -9.7 (-17.2,  -2.2) ****  -42.9 (-49.0, -36.7) ****   9.9e+000 ( -28.3,  48.2)
  T1-T7        -28.3 (-33.0, -23.7) ****  -13.8 (-21.3,  -6.3) ****  -42.7 (-48.8, -36.6) ****  -6.2e+001 (-100.0, -24.0) ****
  T2-T4         -9.7 (-13.4,  -6.0) ****         -                   -11.2 (-15.1,  -7.2) ****   5.9e+001 (  23.2,  95.5) ****
  T2-T5        -16.6 (-20.3, -12.9) ****         -                    -9.0 (-13.0,  -5.0) ****  -1.3e+001 ( -49.3,  23.0)
  T2-T6        -10.2 (-13.9,  -6.4) ****         -                   -11.3 (-15.2,  -7.3) ****   5.9e+001 (  23.2,  95.5) ****
  T2-T7        -16.3 (-20.0, -12.6) ****         -                   -11.1 (-15.1,  -7.1) ****  -1.2e+001 ( -48.9,  23.4)
  T4-T5         -6.8 (-10.6,  -3.1) ****   -3.3 (-10.6,   3.9)         2.1 ( -1.8,   6.0)       -7.2e+001 (-109.0, -36.3) ****
  T4-T6         -0.4 ( -4.1,   3.2)        -0.2 ( -7.4,   7.0)        -0.1 ( -4.0,   3.8)        2.0e-014 ( -36.2,  36.2)
  T4-T7         -6.5 (-10.3,  -2.9) ****   -4.3 (-11.6,   2.9)         0.0 ( -3.9,   4.0)       -2.2e+001 (-108.0, -36.0) ****
  T5-T6          6.4 (  2.7,  10.1) ****    3.1 ( -4.1,  10.4)        -2.2 ( -6.2,   1.7)        7.2e+001 (  36.3, 109.0) ****
  T5-T7          0.2 ( -3.4,   3.9)        -0.9 ( -8.2,   6.2)        -2.0 ( -6.0,   1.8)        3.3e+001 (  35.8,  36.5) ****
  T6-T7         -6.1 ( -9.8,  -2.4) ****   -4.0 (-11.3,   3.1)         0.1 ( -3.7,   4.1)       -7.2e+001 (-108.0,  36.0)
Analysis of results for jmeter

The boxplots for jmeter suggest that non-control techniques improved rate of fault detection with respect to randomly ordered and untreated test suites at the test-class level, but display fewer differences at the test-method level. The Kruskal-Wallis test reports that there is a significant difference between techniques at both test suite levels with respect to untreated suites, but the analysis for random orderings reveals differences between techniques only at the test-class level. Thus we conducted multiple pair-wise comparisons using the Bonferroni procedure at both test suite levels in the analysis with untreated suites, and at just the test-class level in the analysis with random orderings.

The results show that non-control techniques significantly improved the rate of fault detection compared to random and untreated orderings in all cases other than the one involving random orderings at the test-method level.

Regarding the effects of information types and feedback, in the boxplots we observe no visible differences between techniques. The Bonferroni analyses confirm that there are no significant differences, at either test suite level, between block-level and method-level coverage, or between techniques that do and do not use feedback.

Analysis of results for xml-security

The boxplots for xml-security suggest that non-control techniques were close to optimal, with the exception of the presence of outliers. Similar to the results on ant, the Kruskal-Wallis test reports that there are significant differences between techniques at both test suite levels. Thus we conducted multiple pair-wise comparisons using Bonferroni in all cases; the results show that non-control techniques improved the rate of fault detection compared to both randomly ordered and untreated test suites.

Regarding the effects of information types and feedback, the results of each technique are very similar, so it is difficult to observe any differences. Similar to results observed on jmeter, the Bonferroni analyses revealed no significant differences between block-level and method-level coverage at either test suite level, or between techniques that use and do not use feedback.

Analysis of results for jtopas

The boxplots of jtopas are very different from those of the other three programs. It appears from these plots that some non-control techniques at the test-method level are better than random and untreated orderings, but other techniques are no better than these orderings. No non-control prioritization technique produces results better than random orderings at the test-class level. From the Kruskal-Wallis test, for a comparison with random orderings, there is a significant difference between techniques at the test-method level, but just suggestive evidence of differences between techniques at the test-class level (p-value = 0.0492). The Bonferroni results with both untreated and random orderings at the test-class level show that there was no significant difference between pairs of techniques. The multiple comparisons at the test-method level, however, show that some non-control techniques improved the rate of fault detection compared to untreated orderings.

Regarding the effects of information types and feedback on prioritization, the multiple comparisons among heuristic techniques report that there is no difference between block-level and method-level tests at either test suite level. Further, techniques using feedback information did outperform those without feedback at the test-method level.

Table 5.32: Experiment Objects and Associated Data

  Objects    Versions   Classes   Tests   Faults   Mutants   Mutant groups
  galileo        9         87      1533      35      1568         231
  nanoxml        6         26       216      33       132          60

5.3.4

Experiment 2

To investigate our research questions further, we replicated Experiment 1 using two additional Java programs with different types of test suites.

5.3.4.1 Objects of Analysis

As objects of analysis, we selected two Java programs that are equipped with specification-based test suites constructed using the category-partition method and TSL (Test Specification Language) presented in [101]: galileo and nanoxml. Galileo is a Java bytecode analyzer, and nanoxml is a small XML parser for Java. Galileo was developed by a group of graduate students who created its TSL test suite during its development. The detailed descriptions of nanoxml are given in Section 5.2.4.2. Both of these programs, along with all artifacts used in the experiment reported here, are publicly available as part of our infrastructure described in Chapter 4. To obtain seeded faults for these programs, we followed the same procedure used originally to seed faults in the Java objects used in Experiment 1, summarized in Section 5.1.

Table 5.32 lists, for each of these objects, data similar to that provided for the objects in our first experiment (see Section 5.3.3.2); the only exception is that the test suites used for these objects are all system level, and thus the distinction between test-class and test-method levels does not apply here.

Variables and Measures

This experiment manipulated just one independent variable, prioritization technique. We consider the same set of prioritization techniques used in Experiment 1 and described in Section 5.3.3.3. Similarly, as our dependent variable, we use the same metric, APFD, described in Section 2.1.2.

Experiment Setup

This experiment used the same setup as Experiment 1 (see Section 5.3.3), but in addition to the steps detailed for that experiment, we also needed to gather prioritization data using our seeded faults, since that data was not available from the previous study described in Section 5.1. We did this following the same procedure given in Section 5.3.3.4.

Threats to Validity

This experiment also shares most of the threats to validity addressed in Section 5.1.3.5, together with additional questions involving the representativeness of the TSL test cases created for the objects. On the other hand, by considering additional objects of study and a new type of test suite, this experiment helps to generalize those results, reducing threats to external validity.

Data and Analysis

To provide an overview of the collected data we present boxplots in Figure 5.19. The left side of the figure presents results from test case prioritization applied to galileo, and the right side presents results from test case prioritization applied to nanoxml. The upper row presents results for mutation faults, and the lower row presents results for hand-seeded faults. (We postpone discussion of results for hand-seeded faults until Section 5.3.5, but we include them in this figure to facilitate comparison at that time.) Each plot contains a box for each of the seven prioritization techniques, showing the distribution of APFD scores for that technique across each of the versions of the object program. See Table 5.2 for a legend of the techniques.

Examining the boxplots for each object program, we observe that results on the two programs display several similar trends: all prioritization heuristics outperform untreated test suites, but some heuristics are no better than randomly ordered test suites. Results on galileo, however, display more outliers than do results on nanoxml, and the variance and skewness in APFD values achieved by corresponding techniques across the two programs differ. For example, APFD values for randomly ordered test suites (T2) show different variance across the programs, and the APFD values from block-total (T4) for galileo appear to form a normal distribution, while they are more skewed for nanoxml. For this reason, we analyzed the data for each program separately.

For statistical analysis, for reasons similar to those used in Experiment 1, we used a Kruskal-Wallis non-parametric one-way analysis of variance followed by Bonferroni's test for multiple comparisons. Again, we compared the heuristics to randomly ordered and untreated test suites, in turn, and also considered the effects of information types and feedback on the performance of heuristics. Table 5.33 presents the results of the Kruskal-Wallis tests, and Table 5.34 presents the results of the Bonferroni tests.

Table 5.33: Experiment 2: Kruskal-Wallis Test Results, per Program

  Program   Control   chi-sq.   d.f.   p-value
  galileo   untrtd     459.2     4     < 0.0001
  galileo   rand       338.5     4     < 0.0001
  nanoxml   untrtd     135.3     4     < 0.0001
  nanoxml   rand        60.8     4     < 0.0001


Figure 5.19: APFD boxplots, all techniques, for galileo (left) and nanoxml (right). The horizontal axes list techniques, and the vertical axes list APFD scores. The upper row presents results for mutation faults and the lower row presents results for hand-seeded faults. See Table 5.2 for a legend of the techniques.

Analysis of results for galileo

The boxplots for galileo suggest that non-control techniques yielded improvement over untreated test suites, and some non-control techniques were slightly better than randomly ordered test suites. The Kruskal-Wallis test (Table 5.33) reports that there is a significant difference between techniques with respect to untreated and randomly ordered test suites. Thus we performed multiple pair-wise comparisons on the data using the Bonferroni procedure. The results (Table 5.34) confirm that non-control techniques improved the rate of fault detection compared to untreated test suites. No non-control techniques produced results better than random orderings; however, random orderings outperformed both total techniques overall. (Note, however, that random orderings can often yield worse performance in specific individual runs due to their random nature. The boxplots for random orderings show APFD values that are averages of 20 runs for each instance, but individual runs exhibit large variance in APFD values. We discuss this further in Section 5.3.5.)

Table 5.34: Experiment 2: Bonferroni Analysis, per Program. Each Entry Gives Estimate (Lower, Upper); **** Marks Statistically Significant Differences.

  Comparison   galileo                    nanoxml
  T1-T4         -8.6 (-13.6,  -3.5) ****  -32.9 (-40.3, -25.5) ****
  T1-T5        -37.4 (-42.4, -32.3) ****  -45.6 (-53.0, -38.1) ****
  T1-T6        -20.1 (-25.1, -15.0) ****  -33.0 (-40.5, -25.6) ****
  T1-T7        -34.5 (-39.6, -29.5) ****  -43.0 (-50.4, -35.6) ****
  T2-T4         24.7 ( 20.4,  29.0) ****    5.0 ( -1.1,  11.2)
  T2-T5         -4.1 ( -8.4,   0.2)        -7.6 (-13.8,  -1.4) ****
  T2-T6         13.2 (  8.9,  17.5) ****    4.8 ( -1.2,  11.1)
  T2-T7         -1.2 ( -5.5,   3.0)        -5.1 (-11.3,   1.0)
  T4-T5        -28.8 (-33.1, -24.4) ****  -12.7 (-18.9,  -6.5) ****
  T4-T6        -11.5 (-15.8,  -7.1) ****   -0.1 (-6.33,   6.0)
  T4-T7        -25.9 (-30.3, -21.6) ****  -10.1 (-16.3,  -3.9) ****
  T5-T6         17.3 ( 13.0,  21.6) ****   12.5 ( 6.35,  18.7) ****
  T5-T7          2.8 ( -1.4,   7.1)         2.5 ( -3.6,   8.7)
  T6-T7        -14.5 (-18.8, -10.1) ****   -9.9 (-16.2,  -3.8) ****
Regarding the effects of information types and their use in prioritization, comparing the boxplots of block-total (T4) to method-total (T6) and block-addtl (T5) to method-addtl (T7), it appears that the level of coverage information utilized (block vs method) had an effect on techniques' rate of fault detection for total coverage techniques, but not for additional coverage techniques. Comparing the results of block-total to block-addtl and method-total to method-addtl, it appears that techniques using feedback do yield improvements over those not using feedback. The Bonferroni analyses (Table 5.34) confirm these impressions.

Analysis of results for nanoxml

Similar to results on galileo, the boxplots for nanoxml suggest that non-control techniques improved rate of fault detection with respect to untreated test suites. Comparing results from randomly ordered test suites, however, techniques using feedback information appear to improve rate of fault detection, but techniques using total coverage information appear to be worse than randomly ordered test suites. The Kruskal-Wallis test reports that there is a significant difference between techniques with respect to untreated and randomly ordered test suites. Thus we conducted multiple pair-wise comparisons using the Bonferroni procedure. The results show that all non-control techniques significantly improved the rate of fault detection compared to untreated test suites, whereas the only significant difference involving randomly ordered test suites was an improvement associated with block-addtl.

Regarding the effects of information types and feedback and their use in prioritization, the results are the same as those seen on galileo, except for one case (block-total (T4) versus method-total (T6)). The Bonferroni analyses confirm these observations.


5.3.5

Discussion

To further explore the results of our experiments we consider four topics: (1) a summary of the prioritization results obtained in these experiments and prior studies; (2) an analysis of the differences between mutation and hand-seeded faults with respect to prioritization results; (3) an analysis (replicating the analysis performed by Andrews et al. [2]) of the differences between mutation and hand-seeded faults with respect to fault detection ability; and (4) a discussion of the practical implications of our results.

Prioritization Results

Results from this study show that non-control prioritization techniques outperformed both untreated and randomly ordered test suites in all but a few cases for the JUnit object programs, and outperformed untreated test suites for the TSL object programs. The level of coverage information utilized (block vs method) had no effect on techniques' rate of fault detection, with one exception on galileo (block-total vs method-total). The effects of feedback information varied across programs: results on ant and jtopas at the test-method level, and on galileo and nanoxml, were cases in which techniques using feedback produced improvements over those not using feedback.

Results from our previous prioritization study described in Section 5.1, which used the same set of JUnit programs as those used in Experiment 1 (with hand-seeded faults), also showed that the non-control prioritization techniques we examined outperformed both untreated and randomly ordered test suites, as a whole, at the test-method level. Overall, at the test-class level, non-control prioritization techniques did not improve effectiveness compared to untreated or randomly ordered test suites, but individual comparisons indicated that techniques using additional coverage information did improve the rate of fault detection.

Results from previous studies of C programs [40, 42, 116, 122] showed that non-control prioritization techniques improved the rate of fault detection compared to both random and untreated orderings. Those studies found that techniques using additional coverage information were usually better than other techniques, for both fine and coarse granularity test cases. They also showed that statement-level techniques as a whole were better than function-level techniques.

Interestingly, the results of this study exhibit trends similar to those seen in studies of prioritization applied to the Siemens programs and space [42], with the exception of results for jtopas. Our results include some outliers, but overall the data distribution patterns for both studies appear similar, with results on jmeter being most similar to results on the Siemens programs. The results for xml-security are more comparable to those for space, showing a small spread of data and high APFD values across all non-control techniques.


Figure 5.20: APFD boxplots, all programs, for results with hand-seeded faults (replicated from [36]). The horizontal axes list techniques, and the vertical axes list fault detection rate.

Mutation versus Hand-Seeded Faults: Prioritization Effects

We next consider the implications, for experimentation on prioritization, of using mutation versus hand-seeded faults. Our results from Experiment 1 show that non-control test case prioritization techniques (assessed using mutation faults) outperformed both untreated and randomly ordered test suites in all but a few cases. Comparing these results with those observed in the earlier study of test case prioritization using hand-seeded faults (reproduced from Section 5.1 in Figure 5.20) on the same object programs and test suites, we observe both similarities and dissimilarities.

First, on all programs, results of Experiment 1 often show less spread of data than do results from the study with hand-seeded faults. In particular, the total techniques (T4 and T6) on ant and jtopas, and all non-control techniques at the test-class level on jmeter, exhibit large differences. We believe that this result is primarily due to the fact that the number of mutants placed in the programs is much larger than the number of seeded faults, which implies that findings from studies with hand-seeded faults might be biased compared to studies with mutation faults due to larger sampling errors.

Second, results on jtopas differ from results for the other three programs. On jtopas, total coverage techniques are no better than random orderings for both test suite levels, and the data spread among techniques is not consistent, showing some similarities with results of the study with hand-seeded faults. We believe that this result is due to the small number of mutants that were placed in jtopas. In fact, the total number of mutants for jtopas, eight, is much smaller than the numbers of mutants placed in other programs, which varied from 127 to 2907, and is in fact close to the number of hand-seeded faults for the program, five.

Similar to the results of Experiment 1, results of Experiment 2 show that non-control test case prioritization techniques (assessed using mutation faults) outperformed untreated test suites, and some non-control techniques were better than randomly ordered test suites. Comparing these results with those using hand-seeded faults (see Figure 5.19) on the same object programs and test suites, we also observe both similarities and dissimilarities, and these observations are somewhat different from the observations drawn above.

First, unlike observations drawn from Experiment 1, results using mutation and hand-seeded faults show similar trends and distribution patterns: all non-control techniques are better than the untreated technique, total coverage techniques are worse than randomly ordered test suites, and the variance between corresponding techniques is not much different. This observation also supports our conjecture regarding the relationship between numbers of faults and prioritization results. Galileo and nanoxml have larger numbers of hand-seeded faults, 35 and 33, respectively, than the object programs used in Experiment 1. When we consider the number of hand-seeded faults per version, the difference between the two groups of programs persists: while galileo and nanoxml have 3.9 and 5.5 faults per version on average, respectively, the JUnit object programs have 2.3, 1.5, 1.5, and 1.4 faults per version on average, respectively.

Overall, results with mutation faults reveal higher fault detection rates and more outliers than those with hand-seeded faults. In particular, the total techniques using mutation faults show more visible differences: for galileo, these techniques yield much higher fault detection rates and less spread of data with outliers; for nanoxml, they yield higher fault detection rates, but more spread of data. The total techniques are worse than randomly ordered test suites for results using both mutation and hand-seeded faults, and this trend is more apparent with hand-seeded faults. One possible reason for this trend is the location of faults in the program and the code coverage ability of the test suites that reveal those faults.

For example, some faults in nanoxml cause exception handling errors, and thus test cases that reveal those faults tend to have small amounts of code coverage, because once a test case reaches the location that causes an exception handling error, program execution is terminated with only a small portion of the code exercised.

From these observations, we infer that studies of prioritization techniques using small numbers of faults may lead to inappropriate assessments of those techniques. Small data sets, and possibly biased results due to large sampling errors, could significantly affect the legitimacy of findings from such studies.

Mutation versus Hand-Seeded Faults: Fault Detection Ability

To further understand the results just described, we consider a view of the data similar to that considered by Andrews et al. [2] when comparing mutation to hand-seeded faults, comparing the fault detection abilities of test suites on mutation faults and hand-seeded faults, and noting how our findings differ from those of Andrews et al. [2]. To obtain this view of the data we measured fault detection rates for our six object programs following the experimental procedure used by Andrews et al. In their study, for each program, 5000 test suites of size 100 were formed by randomly sampling the available test pool.15 In our case, since the numbers of test cases for our object programs are relatively small compared to those available for the Siemens programs and space, we randomly selected between 20 and 100 test suites16 of size 10 for each version of each program.

Figure 5.21 shows the fault detection abilities of the test suites created by our sampling process, measured on our mutation and hand-seeded faults. The upper row presents mutation fault detection rates for the four programs used in Experiment 1,17 where JUnit test suites were employed, and the lower row presents results of mutation (left side) and hand-seeded (right side) fault detection rates for the programs used in Experiment 2, where TSL test suites were employed. The vertical axes indicate fault detection ratios, which are calculated for each test suite S on each program version V by the equation Dm(S)/Nm(V), where Dm(S) is the number of mutants detected by S, and Nm(V) is the total number of mutants in V.

Unlike the results of the Andrews et al. study, our results vary widely across and between programs with different types of test suites. The result for ant shows relatively low fault detection ability, which means that mutants in ant were relatively difficult to detect; this might be caused by any of several factors. As two possibilities, test cases for ant may not have strong coverage of the program, and the subsets of these test cases that we randomly grouped may have relatively little overlapping coverage.
15 The authors experimented using various test suite sizes (10, 20, 50, and 100), but these other sizes obtained similar results.
16 The number of test suites selected varied depending on the number of test cases available in the pools for the programs.
17 Since the results of hand-seeded fault detection rates for all four programs are similar to the result for jtopas, we omitted them.
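To illustrate the fault detection ratio Dm(S)/Nm(V) defined above, the sketch below forms fixed-size suites by random sampling and measures the fraction of a version's mutants each suite detects; the detects matrix, pool size, and seed are hypothetical stand-ins for the per-version data used in the study.

import random

def detection_ratio(suite, detects, num_mutants):
    # Dm(S)/Nm(V): fraction of the version's mutants revealed by at least one test in S.
    detected = {f for t in suite for f in range(num_mutants) if detects[t][f]}
    return len(detected) / num_mutants

def sampled_ratios(num_tests, detects, num_mutants, num_suites=100, suite_size=10, seed=0):
    # Randomly sample suites of a fixed size from the test pool, as in the procedure above.
    rng = random.Random(seed)
    return [detection_ratio(rng.sample(range(num_tests), suite_size), detects, num_mutants)
            for _ in range(num_suites)]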



Figure 5.21: Fault detection ability boxplots for selected small test suites across all program versions. The horizontal axes list programs, and the vertical axes list fault (or mutant) detection ratios.

We speculate that the latter effect is the more plausible cause of differences, since the ant test suite taken as a whole can detect all mutants. In other words, the test suite for ant may have fewer coverage-redundant test cases compared to the test suites for the Siemens programs and space.

Results for xml-security are more similar to those of the Andrews et al. study (mean for space: 0.75) than those of other programs; the fault detection rates (APFD metric) for xml-security are similar to those for space (means for func-total and func-addtl are 94 and 96, respectively). As mentioned in the discussion of results for ant, the test suite for xml-security might contain many redundant test cases, or each group of test cases might cover more functionality in the program than the test cases for ant. To further consider this point, we compared the ratio of the number of test cases (at the test-method level) to the number of class files (a measure of program size) for ant and xml-security. The last version of ant has 877 test cases and 627 class files (ratio: 877/627 = 1.39), and the last version of xml-security has 83 test cases and 143 class files (83/143 = 0.58).18

This means that, proportionally, xml-security has a smaller number of test cases relative to program size than ant, favoring the suggestion that each group of test cases might cover more functionality as a reason for its higher fault detection ratio.

Fault detection values for jtopas have a large spread; this result is also due to the small number of mutants in the program. The first version has only one mutant, so the fault detection ratio for this version can take only two distinct values, 0 or 1. The fault detection ratio for jmeter also appears to be low, but it does have a normal distribution with a couple of outliers.

The fault detection ability of hand-seeded faults observed in Section 5.1, and reconsidered here, is overall similar to the result seen on the mutation faults in jtopas. We conjecture that this is primarily due to the small numbers of faults in these cases. Even ant, which has the largest number of hand-seeded faults in total, displays results similar to those on jtopas with mutation faults, because five out of eight versions of ant contain only one or two faults, and thus the majority of fault detection ratios take 0 or 1 values.

While results among programs with JUnit test suites vary widely, results of mutation detection rates from galileo and nanoxml show more consistent trends. The distributions of detection rates are a bit skewed, but close to a normal distribution, and in particular the fault detection rate distribution for nanoxml is very close to that measured for space in the Andrews et al. study (mean: 0.75, max: 0.82, min: 0.62, 75%: 0.77, and 25%: 0.74). The fault detection rate on hand-seeded faults for these two programs also appears different than that of the JUnit object programs. As shown in Figure 5.21 (lower right), these fault detection rate values are also skewed, but show some variability.

From the two sets of analyses of mutation and hand-seeded fault detection rates, we also observe that the two types of test suites considered are associated with different fault detection rates. The methods used to construct test suites might be a factor in this case: JUnit test suites perform unit tests of Java class files, so the code coverage achieved by these test cases is limited to the class under test. TSL test suites, in contrast, perform functional system-level tests, which cover larger portions of the code than unit tests.

Practical Implications

While our results show that there can be statistically significant differences in the rates of fault detection achieved by various test case prioritization techniques applied to Java programs with JUnit and TSL test cases, the improvements seen in rates of fault detection cannot be assumed to be practically significant. Thus, we further consider the effect sizes of the differences to see whether the differences we observed through statistical analyses are practically meaningful [92].
18 We also measured the ratio using the number of lines of code instead of the number of class files, and found results consistent with these: for ant, 877/80.4 KLOCs = 10.9 test cases per KLOC; for xml-security, 83/16.3 KLOCs = 5 test cases per KLOC.

Table 5.35 shows the effect sizes of the differences between non-control and control techniques; we calculated these effect sizes only in cases in which the differences between control and non-control techniques showed statistical significance. With the exception of one case (effect size = 0.3 for galileo), the effect sizes range from 0.6 to 4.7, which are considered to be large effect sizes [25], so we can say that the differences we observed in this study are indeed practically significant.

Table 5.35: Effect Sizes of Differences between Control and Non-control Techniques
Control: orig orig orig orig rand rand rand rand
Non-control: block-total block-addtl method-total method-addtl block-total block-addtl method-total method-addtl
Effect sizes, per program:
ant: 1.0 1.4 1.0 1.4 0.6 1.3 0.6 1.2
jmeter: 0.5 0.7 0.5 0.7
xml-security, jtopas: 2.7 2.4 4.7 2.7 2.7 4.5 1.4 0.8 1.4 1.4 -
galileo: 0.3 2.0 0.9 1.7 -
nanoxml: 2.0 2.9 2.0 2.7 0.8 -
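The text does not spell out the effect-size statistic used here; the sketch below assumes a standardized mean difference (Cohen's d with a pooled standard deviation) over the APFD values observed for a control ordering and a non-control technique, with the conventional rule of thumb that values of 0.8 or more count as large. The function and argument names are illustrative.

from statistics import mean, stdev

def cohens_d(control_apfds, technique_apfds):
    # Standardized mean difference between two APFD samples (assumed measure).
    # Requires at least two observations in each group.
    n1, n2 = len(control_apfds), len(technique_apfds)
    s1, s2 = stdev(control_apfds), stdev(technique_apfds)
    pooled_sd = (((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)) ** 0.5
    return (mean(technique_apfds) - mean(control_apfds)) / pooled_sd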

Even though the foregoing analysis of effect size indicates the practical significance of the differences that we observed in our statistical analyses, a further practical issue to consider is the relationship between the associated costs of prioritization techniques and the benefits of applying them. In practice, prioritization techniques have associated costs, and depending on the testing processes employed and other cost factors, these techniques may not provide savings even though they provide higher rates of fault detection. Our previous study described in Section 5.1 investigated issues involving costs and practical aspects of prioritization results. The study considered models of two different testing processes, batch and incremental testing, and used the resulting models to consider the practical implications for prioritization cost-effectiveness across the different approaches. Further, the study investigated the practical impact of empirical results relative to the four Java programs with JUnit test suites and seeded faults used in Experiment 1, with respect to differences in delays values, which represent the cumulative cost of waiting for faults to be exposed while executing a test suite. The study showed that several of the prioritization techniques considered did indeed produce practically significant reductions in delays relative to costs, and thus, that these techniques could be cost-effectively applied in at least certain situations. Because Experiment 1 utilized the same set of Java programs and JUnit tests for which the foregoing analysis showed practical benefits, the analysis utilized in Section 5.1, and its results, are applicable in this case as well.

Because our results using mutation faults exhibit better fault detection rates for non-control techniques than those using hand-seeded faults, we expect better practical savings in testing costs relative to these faults in at least the cases considered in Experiment 1.

5.3.6 Conclusions

The possible usage of mutation faults for controlled experiments with testing techniques had been largely overlooked prior to the work by Andrews et al. [2]. Whereas Andrews et al. consider the usage of mutation faults on C programs and the relative fault detection effectiveness of test suites, however, we consider this issue in the context of a study assessing prioritization techniques using mutation faults, focusing on Java programs. We have examined prioritization effectiveness in terms of rate of fault detection, considering the abilities of several prioritization techniques to improve the rate of fault detection of JUnit and TSL test suites on open-source Java systems, while also varying other factors that affect prioritization effectiveness.

Our analyses show that non-control test case prioritization can improve the rate of fault detection of both types of test suites, assessed relative to mutation faults, but the results vary with the numbers of mutation faults and with the test suites' fault detection ability. Our results also reveal similarities and dissimilarities between results using hand-seeded and mutation faults; in particular, different data spreads between the two were observed. As discussed in Section 5.3.5, this difference can be partly explained in relation to the sizes of the mutation fault sets and hand-seeded fault sets, but more studies and analysis should be done to further investigate this effect.

More important, comparing our results to those collected in earlier studies with hand-seeded faults, our results reveal several implications for researchers performing empirical studies of test case prioritization techniques, and testing techniques in general. In particular, mutation faults may provide a low-cost avenue to obtaining data sets on which statistically significant conclusions can be obtained, with prospects for assessing causal relationships.


Chapter 6 A Cost-benefit Model for Evaluating Regression Testing Techniques


In Chapter 5, we extended our infrastructure to accommodate several experiments that helped us investigate specific research questions. In particular, we investigated regression testing problems considering test case prioritization as well as regression test selection. We also provided initial cost-benefit models that enable us to evaluate regression testing techniques. Our initial empirical results show that test case prioritization and regression test selection techniques could be beneficial for Java systems. However, the improvements in savings or rates of fault detection demonstrated by certain techniques do not guarantee the practical cost-effectiveness of those techniques, because the techniques also have associated costs. Thus, initially we investigated cost-benefit tradeoffs between techniques using our relatively simple cost-benefit model, as shown in Section 5.1.5 of Chapter 5. This investigation allowed us to understand relative benefits of regression testing techniques, but it did not consider some important factors. Therefore, we need to provide further understanding of the implications of our empirical results for practice, via a cost-benefits analysis accounting for the factors that affect the costs and benefits of regression testing techniques.

To provide such understanding,1 we identify limitations that can make it difficult to properly evaluate regression testing techniques in practical contexts, and provide a cost model that addresses these limitations. As described in Chapter 1, these limitations involve context factors, lifetime factors, and cost-benefit models, and can be summarized as follows.

Context factors. Previous empirical studies have considered only a few context factors when assessing techniques. Most studies have considered differences in programs and regression testing techniques, but none have considered costs of other essential testing activities such as test setup and obsolete test identification, or collection and maintenance of resources (e.g., test coverage information) needed for retesting.
1 Much of the material presented in this chapter has appeared in [34].

And only a few studies have considered the effects of time constraints on testing cost-effectiveness.

Lifetime factors. Previous studies have calculated costs and benefits independently per system version. This snapshot view of costs and benefits masks the fact that regression testing techniques are applied repeatedly across system lifetimes. The cost-benefit tradeoffs for techniques across entire lifetimes may be more relevant for choosing a technique than the tradeoffs on single releases.

Cost-benefit models. Previous studies have relied on limited cost-benefit models. Costs are often ignored, or calculated solely in terms of time or numbers of faults missed. Benefits are often calculated solely in terms of reduced test suite size or increased rate of fault detection. Costs of missed faults and human time, and tradeoffs involving product revenue, have not been considered. Moreover, different techniques are often evaluated using different metrics, rendering their relative performance incomparable.

Limitations such as these can make it difficult to empirically compare regression testing techniques, or can lead evaluations to improperly assess the costs and benefits of techniques in practical contexts. Ultimately, this can lead to inaccurate conclusions about the relative cost-effectiveness of techniques, and inappropriate decisions by engineers relying on such conclusions to select techniques for particular situations. It follows that researchers who empirically investigate regression testing techniques, and practitioners who might act on the results of those investigations, would be better served by empirical investigations founded on more comprehensive cost-benefit models for those techniques, incorporating richer context and lifetime factors. Therefore, we provide such a model, and use it to compare several common regression testing techniques.

We now describe our new cost-benefit model. We begin our discussion by outlining the regression testing process on which our model is based. Section 6.2 describes the constituent costs of regression testing techniques that we model, Section 6.3 presents our model, and Section 6.4 describes how the model is utilized to assess and compare techniques.

6.1 Regression Testing Process

Cost-benefit models capture costs and benefits of methodologies relative to particular processes. We described various process models in Chapter 5. In this work, we focus on a model of the regression testing process that corresponds to the most commonly used approach for regression testing at the system test level [97], a batch process model, and which, though simple, is sufficient to allow us to investigate our research questions. Figure 6.1 presents a timeline depicting the maintenance, regression testing, and post-release phases for a single release of a software system following a batch process model.

[Figure 6.1 timeline: t1, maintenance; t2, regression testing and fault correction; t3, post-release (revenue); t4; with the scheduled product release date marked.]

Figure 6.1: Maintenance and regression testing cycle.

Time t1 represents the time at which maintenance (including all planning, analysis, design, and implementation activities) of the release begins. At time t2 the release is code-complete, and regression testing and fault correction begin. (These activities may be repeated and may overlap within time interval t2-t3, as faults are found and corrected.) When this phase ends, at time t3, product release can occur; at this time, revenue associated with the release (together with associated increases in the company's market value) begins to accrue. In a perfect world, actual product release coincides with the scheduled release time, following completion of testing and fault correction activities, and this is the situation depicted in the figure.

This process model relates to the regression testing techniques we wish to investigate, regression test selection and test case prioritization, as follows. Suppose the timeline shows the situation in which the retest-all technique is employed. In this case, regression test selection techniques attempt to reduce time interval t2-t3 by reducing testing time, with, for non-safe techniques, a possible increase in the number of faults that slip through testing and are detected in the post-release phase. Test case prioritization techniques attempt to reduce time interval t2-t3 by allowing greater portions of the fault correction activities that occur in that period to be performed in parallel with testing, rather than afterward. If either of these techniques succeeds, the software can be released prior to its scheduled release date, and overall revenue can increase. If prioritization is unsuccessful and fault correction activities cause time interval t2-t3 to increase, then the release will be delayed and revenue can decrease.

We also use this model to explore one further dimension of regression testing that occurs commonly in practice, involving the interaction between resource availability and process decisions related to product release and revenue. Organizations that create software for sale regression test it with the goal of improving its dependability and attracting greater revenue by reducing the costs of post-release fault correction and increasing the perceived value of the released software and the market value of the company [26]. The cost of this testing activity competes, however, with the desire to field the software earlier, which itself can also result in greater revenue and company market value. Releasing software at time t2 in the timeline can increase revenue due to the benefits of timeliness, but potentially increases costs due to missed faults.

In practice, pressure to release software and preserve revenue may cause organizations to terminate testing early. In this case, also, revenue may increase, but with potential for costs downstream due to missed faults. An analogous situation occurs when the maintenance period runs long and the organization terminates testing early in order to meet scheduled release dates, although in this case the focus is on not losing revenue. Note that in such cases, test case prioritization can decrease the degree to which such costs occur, by increasing the likelihood that faults are detected prior to the termination of testing. In our empirical study we investigate the effects of early regression testing termination.

The process model we have just described makes several assumptions. For example, organizations may also create software for reasons other than to generate revenue. Organizations that complete testing early could in theory spend additional time performing other forms of verification until the scheduled release date arrives, and this could lead to increased revenue via reduced fault cost downstream. Moreover, revenue itself is not the sole measure of benefit, because market value is also important. These differences noted, this process model does allow us to investigate a cost-benefit model that is much more complex than those used in research on regression testing to date. More important, we believe that the cost-benefit model we present here can be adjusted to accommodate relaxations in the foregoing assumptions, as well as process differences.

6.2 Costs Modelled

We now describe the constituent costs of regression testing techniques that we consider in this work. In this section we focus on what these costs are, not on methods for measuring or estimating them. To model the costs and benefits of regression testing, we consider nine constituent cost components. Here we describe each component and some of the factors that cause it to vary.

Test setup (CS). CS includes the cost of activities required to prepare to run tests, such as setting up the testing environment (hardware and software) and arranging for the use of resources. Thus, CS varies with characteristics of the system under test, such as whether it exists standalone, in a distributed environment, or in an environment involving special hardware or human interaction.

Identifying obsolete test cases (COi). COi represents the cost of determining which of the test cases in a test suite are still applicable to a new system version to be tested. This cost varies with the type of test cases (e.g., specification-based, code-based, system, unit), the amount of change occurring between consecutive versions, and the availability of documentation or engineer experience.

Repairing obsolete test cases (COr). Often, obsolete test cases are still potentially useful for the current system. For example, when a class interface is changed by one parameter type, existing test cases related to that class cannot be used directly, but a simple change may repair them.

Similarly, a test case for which inputs remain the same but for which expected output has changed can require edits of oracle information. This cost varies with the number of test cases needing repair, and with the complexity of the repairs, test cases, oracle procedures, and system.

Supporting analysis (CA). CA represents the cost of the analysis needed to support a regression testing technique. For the techniques being considered here, CA can include costs of instrumenting code, analyzing changes between old and new versions, and collecting test execution traces, and thus can vary widely with characteristics of techniques, programs, tests, and executions. Significantly, CA can also vary with the extent to which data from previous testing sessions is reused or leveraged in the current testing session. For example, suppose engineers previously instrumented and collected test execution traces for release r1 of program P in order to apply a regression testing technique to a subsequent release r2 of P. When r2 is regression tested, to prepare for the next release, r3, engineers must instrument and collect test execution traces for r2. As a software system evolves, however, a large percentage of its code may be shared between consecutive versions. Thus, engineers can re-instrument a version incrementally by identifying code changes between consecutive versions, and using previous instrumentation in unchanged code. Similarly, engineers can collect test execution traces for only the subset of test cases that are affected by instrumentation changes, reusing prior traces for others. If the costs of instrumentation and trace collection are sufficiently high and the changes between versions are sufficiently small, then we would expect lower costs to be associated with this approach.

Regression testing technique execution (CR). CR represents the cost of applying a regression testing technique or tool (for regression test selection or test case prioritization), itself, after supporting analyses have been completed. This cost also varies with characteristics of techniques, programs, test suites, and changes [43].

Test execution (CE). CE represents the cost of executing tests. This cost varies with test execution processes (e.g., manual, automatic, or semi-automatic), as well as with characteristics of the system under test and the particular test cases utilized. Many organizations attempt to run test cases automatically, but many others continue to use manual or semi-automated testing approaches; for example, in human/machine interface testing, test cases may primarily involve human interaction [26].

Test result validation (CVd and CVi). CVd and CVi represent the cost of checking test results to determine whether or not test cases reveal failures. These two variables represent two components of the validation task: (1) CVd is the cost of using automated differencing tools on test outputs to detect output differences with respect to prior testing sessions, and (2) CVi is the (human time) cost of inspecting test outputs flagged as different to determine whether the difference in fact represents a failure. These costs vary with the number of test cases and the complexity of test output, as well as with the automated technique used to check outputs for differences. A regression testing technique that reduces the number of test cases to be executed also reduces CVd and CVi.

Missing faults (CF). CF represents the cost of missing faults. Regression test selection techniques can miss faults due to omission of existing test cases that could, if executed, have revealed them. In this work, we focus on the costs of missing faults that the regression test suite could, if executed in full, have detected. (In addition, all regression testing techniques can miss faults that are not detectable by any of the test cases executed; however, these costs are incurred similarly by all techniques, so we do not consider them here.) CF varies with regression testing technique; clearly, non-safe techniques incur this cost to a greater extent than safe techniques. As discussed in Section 6.1, CF also varies with the testing organization's processes (e.g., with reduced testing time caused by early test termination). Finally, CF varies with business and financial characteristics such as market conditions, product sensitivity to the market, and the severity of missed faults.

Delayed fault detection feedback (CD). CD captures the cost of delayed fault detection feedback. When faults are detected late in a regression testing cycle, efforts to correct them can delay product release. Faults detected early in a cycle can potentially be addressed prior to completion of the cycle. As a simple example, suppose a fault requiring five days to correct is discovered on the last day of a ten-day regression testing cycle. In this case, product delivery is delayed by the four days required to correct the fault, and also by the time required to (again) regression test the corrected program (another ten days under the retest-all approach). If this fault is detected prior to the fifth day of the testing cycle, it does not add any additional delay to product delivery time, beyond the time required to retest the corrected program.

Other costs not considered. In addition to the costs we have described, there are other testing costs, such as initial test case development, initial automation costs, test tool development, test suite maintenance, management overhead, database cost, and the cost of developing new test cases. In this work we restrict our attention to the costs just listed, but our cost-benefit model could easily be extended to incorporate these other costs.
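To make the bookkeeping concrete, the sketch below groups the per-release cost components above into a simple record; the class and field names are hypothetical, and every field is a time measured in whatever unit u an organization adopts.

from dataclasses import dataclass

@dataclass
class ReleaseCosts:
    # Constituent regression testing costs for a single release.
    setup: float = 0.0               # CS: test setup
    identify_obsolete: float = 0.0   # COi: identifying obsolete test cases
    repair_obsolete: float = 0.0     # COr: repairing obsolete test cases
    instrument: float = 0.0          # CA (part): instrumenting code
    collect_traces: float = 0.0      # CA (part): collecting test execution traces
    run_technique: float = 0.0       # CR: executing the selection/prioritization technique
    run_tests: float = 0.0           # CE: executing test cases
    diff_outputs: float = 0.0        # CVd: automated output differencing
    inspect_outputs: float = 0.0     # CVi: human inspection of flagged outputs
    missed_fault: float = 0.0        # CF: cost per fault missed after release
    delayed_feedback: float = 0.0    # CD: delayed fault detection feedback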

6.3 A Cost-Benefit Model

We use the preceding costs to formulate a cost-benefit model that allows us to investigate the research questions we focus on in this dissertation. We consider all of the costs just outlined, and for analysis costs we consider two analyses on which the specific regression test selection and test case prioritization techniques we study depend: the cost of inserting instrumentation into the system, and the cost of collecting test traces. The model that we present is constructed based on the regression testing process model discussed in Section 6.1, but the method we have used to construct the model can be used to construct models for other processes.

Before we describe our cost-benefit model, we define several terms and coefficients that are used in the model, most of which instantiate the general constituent costs outlined in Section 6.2. Assume that we are considering regression testing technique R, n releases of software system S denoted S1, S2, ..., Sn, and n versions of test suite T (one per release of S) denoted T1, T2, ..., Tn.

i is an index denoting a particular release Si of S.
u is a unit of time (e.g., hours or days).
REV is an organization's revenue in dollars per time unit u, relative to S.
ED(i) is the expected time-to-delivery in units u for release Si when testing begins (in Figure 6.1, interval t2-t3).
PS is a measure of the cost (average hourly salary) associated with employing a programmer per unit of time u.
CS(i) is the setup cost for testing release Si.
COi(i) is the cost of identifying obsolete tests for release Si.
COr(i) is the cost of repairing obsolete tests for release Si.
CAin(i) is the time needed to instrument all units in i.2
CAtr(i) is the time required to collect traces for test cases in Ti-1 for use in analyses needed to regression test release Si.
CR(i) is the time required to execute R itself on release Si.
CE(i) is the time required to execute test cases on release Si (either all of the test cases in Ti or some subset of Ti).
CVd(i) is the cost of applying automated differencing tools to the outputs of test cases run on release Si (all test cases in Ti or some subset of Ti).
CVi(i) is the (human) cost of checking the results of test cases determined to have produced different outputs when run on release Si (all test cases in Ti or some subset of Ti).
CF(i) is the cost associated with a missed fault after the delivery of release Si.
CD(i) is the cost associated with delayed fault detection feedback on release Si.
a_in(i) is a coefficient used to capture reductions in costs of instrumentation required for release i following changes, in terms of the ratio of the number of units instrumented in i to the total number of units in i:
a_in(i) = numberOfUnitsInstrumented / totalNumberOfUnits    (6.1)

When all units are instrumented, this ratio equals 1.

2 Systems can be incrementally instrumented at various levels, such as per file, per class, or per method. We use "unit" generically to account for this; in our studies we consider instrumentation at the level of class files.

a_tr(i) is a coefficient used to capture reductions in costs of the trace collection required for i following changes, in terms of the ratio of the reduced number of traces collected when focusing on changes in i to the total number of traces that would need to have been collected otherwise:


a_tr(i) = numberOfTracesCollected / totalNumberOfTraces    (6.2)

When all traces are collected, this ratio equals 1.

b(i) is a coefficient used to capture reductions in costs of executing and validating test cases for i, when only a subset of T is rerun:
b(i) = NumberOfTestsRerun / TotalNumberOfTestsInT    (6.3)

When all test cases are run, this ratio equals 1.

c(i) is the number of faults that could be detected by T on release i but that are missed due to execution of subsets of T.

To formulate a cost-benefit model incorporating the foregoing costs, we must ensure that all costs are measured in identical units. To do this, we initially record all costs for which the mnemonics take the form CX using a time metric in some unit u. We then convert these costs into monetary values so that we can combine them in calculations involving revenues. To perform this conversion, we categorize the costs into two groups: costs related to human effort (CS, COi, COr, CVi, and CF), and costs related to machine time (CAin, CAtr, CR, CVd, and CE). We then project the cost-benefits of regression testing by considering techniques in light of their business value to organizations, in terms of how much organizations pay for applying the techniques and how much revenue they gain or lose by doing so. This involves two equations: one that captures costs in terms of salaries of the engineers who perform regression testing tasks (using PS to translate time spent by one or more engineers into monetary values), and one that captures revenue gains or losses related to changes in product release time (using REV to translate times into monetary values). Further, in keeping with our desire to account for lifetime factors by tracking costs and benefits across entire system lifetimes, our equations calculate costs and benefits across entire sequences of system releases, rather than simply on individual system releases. The two equations that comprise our model are as follows:
Cost = PS · Σ_{i=2..n} [ CS(i) + COi(i) + COr(i) + b(i)·CVi(i) + c(i)·CF(i) ]    (6.4)

Benefit = REV · Σ_{i=2..n} [ ED(i) - ( CS(i) + COi(i) + COr(i) + a_in(i-1)·CAin(i-1) + a_tr(i-1)·CAtr(i-1) + CR(i) + b(i)·( CE(i) + CVi(i) + CVd(i) ) + CD(i) ) ]    (6.5)

Relating these formulas to our prior discussions of processes and cost-benefits: if an organization does not test its product at all before delivery, then it gains potential revenue by reducing all of the cost terms other than CF in Equation 6.4 to zero, and all the cost terms of form CX in Equation 6.5 to zero. If CF is zero, the resulting revenue increase is proportional to the saved expected delivery time ED. When a regression testing technique reduces (increases) testing time, either through selection or prioritization, the right-hand side of Equation 6.5 is positive (negative), indicating an increase (decrease) in revenue. These revenue changes are coupled, however, with changes in costs captured in Equation 6.4 in determining whether techniques are cost-beneficial overall.

Note that of the costs that we consider in this work, several (CS, COi, COr, CA) can potentially be partially offloaded from the critical testing phase to the maintenance phase; that is, the phase denoted t1-t2 in Figure 6.1. For example, test engineers can make test hardware ready or perform preliminary analyses on modules on which maintenance is complete. In this case, costs may decrease: they continue to have associated salary and hardware aspects, but may be less likely to contribute directly to delays in release dates. Four other costs (CR, CE, CVi, CVd) are incurred primarily during the regression testing phase. CD occurs during the regression testing and fault correction phase, but may also extend into the post-release phase. CF is incurred during the post-release phase.

In constructing the foregoing model we make several simplifying assumptions. We assume that S has just one (evolving) test suite, that tests have equal run times, that instrumentation costs per unit and trace are uniform, and that fault costs are all the same. We assume that test case execution, analysis, and regression testing technique costs involve only machine time, with no human cost component, and we consider test setup and obsolete test detection to have only human effort cost (an assumption appropriate to our experiment objects). In this work, where we consider the relative efficacy of regression testing techniques that re-use T, we consider only fault losses incurred due to execution of subsets of T. We make these assumptions for convenience, but all of these assumptions can be relaxed given appropriate changes made in the model and sufficiently accurate measurement instruments.
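The following sketch implements Equations 6.4 and 6.5 directly; it is a minimal illustration, not our measurement infrastructure. Each release record r (for i = 2, ..., n) is a dictionary whose keys mirror the terms defined above, with CAin_prev and CAtr_prev holding the instrumentation and trace collection times measured on the preceding release, and a_in, a_tr, b, and c holding the coefficients and missed-fault count for that release.

def total_cost(releases, PS):
    # Equation 6.4: human-effort costs across releases, converted to dollars via salary PS.
    return PS * sum(r["CS"] + r["COi"] + r["COr"]
                    + r["b"] * r["CVi"]
                    + r["c"] * r["CF"]
                    for r in releases)

def total_benefit(releases, REV):
    # Equation 6.5: revenue gained (or lost) through changes in time-to-delivery.
    total = 0.0
    for r in releases:
        time_spent = (r["CS"] + r["COi"] + r["COr"]
                      + r["a_in"] * r["CAin_prev"]
                      + r["a_tr"] * r["CAtr_prev"]
                      + r["CR"]
                      + r["b"] * (r["CE"] + r["CVi"] + r["CVd"])
                      + r["CD"])
        total += r["ED"] - time_spent
    return REV * total

A technique that keeps the per-release testing time below the expected time-to-delivery contributes positively to Benefit for that release, mirroring the discussion above.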

6.4 Evaluating and Comparing Techniques

The foregoing cost model can be used in cost-benefit analyses in various ways. Let A and B be regression testing techniques with costs Cost_A and Cost_B, and benefits Benefit_A and Benefit_B. We can determine whether A is beneficial by calculating:
Benefit_A - Cost_A    (6.6)

Further, we can determine the difference in value between A and B by calculating:


(Benefit_A - Cost_A) - (Benefit_B - Cost_B)    (6.7)

with positive values indicating that A has greater value than B, and negative values indicating that A has lesser value than B.
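A direct rendering of this comparison scheme, assuming Benefit and Cost values computed as in the earlier sketch (the function names are illustrative):

def value(benefit, cost):
    # Equation 6.6: positive values indicate the technique is beneficial.
    return benefit - cost

def relative_value(benefit_a, cost_a, benefit_b, cost_b):
    # Equation 6.7: positive values indicate technique A has greater value than B.
    return value(benefit_a, cost_a) - value(benefit_b, cost_b)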

Having developed a cost-benefit model and evaluation schemes in this chapter, we conducted empirical studies to investigate the model's feasibility and to examine how regression testing techniques are assessed using it; the next chapter describes these studies.

Chapter 7 Empirical Evaluation of Regression Testing Techniques using our new Cost-benefit Model
The model described in Chapter 6 captures a richer set of factors than have been considered in prior research on regression testing techniques, and allows us to address various questions about those techniques. We therefore designed a study to provide an initial view of our cost-benefit model's feasibility and to explore the usefulness of the model for assessing regression testing techniques.1

7.1 Study Overview

In this study, we compare several common regression testing techniques using our cost-benefit model. We investigate the effects of two context factors (time constraints and incremental resource availability) on regression testing cost-effectiveness, focusing on two particular classes of techniques: test case prioritization and regression test selection. Furthermore, our model facilitates the investigation and comparison of these two classes of techniques, which has not previously been possible.

7.2 Experiment

7.2.1 Research Questions

RQ1: What effect does the imposition of time constraints have on the relative cost-benefits of regression testing techniques?

RQ2: What effect does availability of incremental resources have on the relative cost-benefits of regression testing techniques?

RQ3: What are the relative cost-benefits of regression test selection and test case prioritization techniques?
1 Much of the material presented in this chapter has appeared in [34].

Table 7.1: Experiment Objects and Associated Data

Objects        Versions  Classes  Size (KLOCs)  Test Cases  Mutants
ant            9         627      80.4          877         2907
xml-security   4         143      16.3          83          127
jmeter         6         389      43.4          78          295
galileo        16        87       15.2          1533        1923
nanoxml        6         26       7.6           216         132

None of these research questions has been addressed previously in empirical studies of regression testing. In fact, no cost-benefit models previously defined capture the necessary factors. In the case of RQ1 and RQ2, no prior models consider the relationship between fault omission or rate of fault detection and technique execution costs. In the case of RQ3, no prior models have been capable of expressing the cost-effectiveness of these two classes of techniques in comparable units. All three of these questions are important, however, for practitioners who wish to determine which technique might be most cost-effective in their organizations.

7.2.2 Objects of Analysis

We used five Java programs as objects of analysis: ant, xml-security, jmeter, galileo, and nanoxml. Detailed descriptions of these programs are given in Sections 5.1.3.2, 5.2.4.2, and 5.3.4.1. Table 7.1 lists, for each of our objects, its associated Versions (the number of versions of the object program), Classes (the number of class files in the most recent version of that program), Size (KLOCs) (the total lines of code in the most recent version of that program), and Test Cases (the number of test cases available for the most recent version of that program). The rightmost column is described in Section 7.2.4.

7.2.3 Variables and Measures

Independent Variables

Our study manipulated one independent variable, regression testing technique. We consider the three different regression testing methodologies described earlier in Chapter 2: retest-all, regression test selection, and test case prioritization. For each of these methodologies, we consider one or more specific techniques, as follows.

Retest-all (control). The retest-all technique (reusing an entire existing test suite), together with the original test case order, serves as our control technique, representing the typical common practice of running all non-obsolete test cases on a new version of a system, in whatever order they are presented in.

Regression test selection. For regression test selection we consider a safe technique that selects test cases which exercise code that has been changed to produce a modified program version. The technique relies on control flow graphs and program coverage information at the basic block level to select all test cases that execute changed code. (These techniques have already been described in Section 2.1.1.)

Test case prioritization. For test case prioritization we consider two coverage-based techniques: total block coverage prioritization and additional block coverage prioritization. (These techniques have already been described in Section 2.1.2.)

Techniques facing time constraints. We also consider each of the techniques just described in a manner that reflects the effects of time constraints, in which regression testing activities are terminated early. To do this, for each of the foregoing techniques, we shorten the test execution process by 50%, simulating the effects of having the testing process halted halfway through.

Techniques using incremental resources. To investigate the effects of incremental resource availability, we consider versions of each of our prioritization techniques and our regression test selection technique that re-use analysis data pertaining to instrumentation from previous testing sessions. In contrast to the non-incremental techniques just discussed, which re-instrument all code and re-execute all tests under instrumentation, these incremental techniques re-instrument only classes that have changed, and re-execute only test cases known to have passed through changed classes previously.

Dependent Variables and Measures

Our dependent variables are the cost and benefit factors presented in Section 6.3, and calculated by Equations 6.4 and 6.5. These values are measured in dollars, and their calculation depends on several constituent cost measures, which we collect as follows.

Cost of test setup (CS). For our objects, the cost of test setup involves only human resources, not hardware resources. The relevant activities include setting up a working directory for testing, compiling the program version to be tested, configuring test drivers and test scripts, and (in some cases) performing minor edits to test scripts. We measured the costs of these activities directly as an average of the time taken by two graduate students (Ph.D. students from our research group) to perform them.

Cost of identifying obsolete test cases (COi). For our objects, identification of obsolete test cases as versions were developed would have required manual inspection of a version and its test cases, and determination, given modifications made to the system, of the test cases that need to be modified for the next version (due to changes in inputs or expected output). Our objects were already provided with test suites, so to measure this cost we asked a graduate student to perform these activities, working with the given suites.

Cost of repairing obsolete test cases (COr). For our objects, the cost of repairing obsolete test cases includes the costs of examining specifications, existing test cases, and test drivers, as well as observing the execution of suspect tests and drivers. Although all of our objects had obsolete test cases, and the cost of identifying them was measured as described above, repaired tests were present for only one object, nanoxml. To measure the cost of repairing tests on this object, we asked two graduate students (Ph.D. students from our research group) to perform these activities. We averaged the times taken by these students.

Cost of supporting analysis, non-incremental (CA). The analysis costs for the non-incremental regression testing techniques include the costs of instrumenting programs (CAin) and collecting test execution traces (CAtr). We calculated these values directly for each version of each object program, by measuring the time required to run the Sofya system [76] for instrumentation of Java bytecode, and the time required to execute the test cases for that version on that instrumented version.

Cost of supporting analysis, incremental (CA). Incremental analysis costs consist of the time required to re-instrument only modified classes for a given version (given a version previously fully instrumented), and the time required to re-execute, on that version, only those test cases known to have reached modified classes in the prior version. Our code instrumenter does not support incremental instrumentation, so we partially estimated these values by utilizing the directly measured non-incremental analysis costs collected as just described, and (as shown in Equation 6.5) multiplying this number by (in the case of re-instrumentation) the ratio of the number of classes requiring re-instrumentation to the total number of classes and (in the case of re-execution) the ratio of the number of traces requiring recollection to the total number of traces.

Cost of regression testing technique execution (CR). We directly measured the time required to apply each regression testing technique studied, by running it against each version of each object program using appropriate analysis information.

Cost of test execution (CE). For cases in which all test cases were executed, we directly measured execution time of test suites automatically, by running them against each version of each object program using appropriate analysis information. For cases in which a subset of a test suite was executed, we estimated execution time by multiplying the cost of executing the entire test suite by the ratio of the number of test cases being rerun to the total number of test cases, as shown in Equation 6.5.2

Cost of test result validation (automatic, via differencing) (CVd). For cases in which all test cases were executed, we directly measured this validation time automatically, by measuring the cost of running a differencing tool on test outputs as test cases were executed, for each version of each object program.
2

For cases in which a subset of a test suite was executed, for reasons similar to those discussed immediately above, we estimated this time by multiplying the cost of validating the entire test suite by the ratio of the number of test cases being rerun to the total number of test cases.

Cost of test result validation (human, via inspection) (CVi). To measure the cost of validating test results, we averaged the time taken by two graduate students (Ph.D. students from our research group) to compare program outputs across versions, for each pair of versions. For cases in which a subset of a test suite was executed, we estimated this time (for reasons discussed above) by multiplying the cost thus measured by the ratio of the number of test cases being rerun to the total number of test cases.

Cost of missing faults (CF). For each regression testing technique that could omit faults, we measured the number of faults omitted during a testing session on each version of each object program. Determining the cost of missing faults, however, is much more difficult. Given the many factors that can contribute to these costs, and the long-term nature of these costs, we could not obtain this measure directly. Instead, we rely on data provided in [128] to obtain estimates of the costs of faults. Because fault difficulties range widely, we decided to analyze results relative to two classes of fault importance: one corresponding to costs attributed in [128] to severe faults, and one corresponding to costs attributed to ordinary faults. These costs, respectively, are 22 and 1.5 hours.

Cost of delayed fault detection feedback (CD). For each prioritization technique applied to each object version and test suite, we measured the rate of fault detection using the APFD metric described in Section 2.1.2 for that version and test suite. Then, following the approach described in Section 5.1.5, we translated APFD scores into the cumulative costs (in time) of waiting for each fault to be exposed while executing test cases under a particular order, defined as delays.

Revenue (REV). A second metric that we cannot measure directly relative to our object programs involves revenue, and to utilize our cost models we required an estimate of this value. To obtain such an estimate, we utilized revenue values cited in survey data from software products [27], ranging from $116,000 to $596,000 per employee. Because our object programs are relatively small compared to many commercial software systems, we utilize the smallest revenue figure and a headcount of ten in this study.

Programmer salary (PS). A third metric that we cannot measure directly on our object programs involves the salaries of programmers. To obtain an estimate, we rely on a figure of $100 per person-hour, obtained by adjusting an amount cited in [67] by an appropriate cost-of-living factor.

Expected time-to-delivery (ED). We do not calculate ED, because the comparisons we need to perform to address our research questions do not require its calculation.

To explain: we use Equation 6.7 to compare techniques, and this equation subtracts the benefit value for a second technique from the benefit value for the first. In so doing, because ED is necessarily identical for two techniques compared on the same version, the value of ED is canceled out.

7.2.4 Experiment Setup

To perform test case prioritization and regression test selection we required two types of data: coverage information and fault data. We obtained coverage information by running test cases on our object programs instrumented using Sofya [76]. The resulting information lists which test cases exercised which blocks in the program; a previous version's coverage information is then used to prioritize a current version's set of test cases, and to support the selection of a subset of test cases for the current version.

To measure rate of fault detection for test case prioritization techniques, and fault omission for non-safe regression test selection, we required object programs containing faults. The object programs we obtained had not been supplied with any such faults or fault data. Thus we used mutation faults generated using a Java bytecode mutant generator described in Section 5.3.2. Because our focus is regression testing, however, we use only generated mutants that fall within modified areas of code. The number of mutants created for each of our object programs is shown in column five of Table 7.1.

In actual testing scenarios, programs do not typically contain as many faults as the number of mutants we generated. Also, we wish to investigate the use of regression testing techniques (relative to the lifetime factor) across the entire sequences of versions of our object programs. To do this, for each version of each program we randomly selected several mutant groups from the mutant pool for that version; each mutant group's size varied randomly between one and five.3 Then, for each program, we obtained 30 sequences of mutant groups by randomly selecting a mutant group for each version of that program.

Given these materials, to collect the data necessary to investigate our research questions, we considered each object program in turn, and for each version of that program, applied each regression testing technique, and collected the appropriate values for the necessary cost variables (as indicated in Section 7.2.3). In this process, all times were measured on a PC running SuSE Linux 9.1 with 1 GB RAM and a 3 GHz processor. Given these cost variables we calculated, for each object program and each technique, the benefit and cost of that technique applied to the sequence of versions (with their associated test suites) of that program for each of its 30 sequences of mutant groups. These benefit and cost numbers serve as the data for our subsequent analysis.4
3 These numbers were chosen to maintain consistency with setup procedures followed in an earlier experiment [33].
4 Complete data sets can be obtained from the author.
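The sampling procedure just described proceeds in two steps: draw several groups of one to five mutants for each version, and then build 30 sequences by choosing one group per version. The following is a minimal illustration under the assumption that each version's mutant pool is available as a simple list; the number of groups drawn per version is a hypothetical parameter, since the text says only that several groups were selected.

    import random

    NUM_SEQUENCES = 30   # 30 sequences of mutant groups per program

    def sample_groups_for_version(pool, num_groups, rng):
        """Draw several mutant groups for one version; each group has 1-5 mutants."""
        groups = []
        for _ in range(num_groups):
            size = min(rng.randint(1, 5), len(pool))
            groups.append(rng.sample(pool, size))
        return groups

    def build_sequences(pools_by_version, num_groups_per_version, rng):
        """Build 30 sequences, each picking one pre-sampled group per version."""
        groups_by_version = [sample_groups_for_version(pool, num_groups_per_version, rng)
                             for pool in pools_by_version]
        return [[rng.choice(groups) for groups in groups_by_version]
                for _ in range(NUM_SEQUENCES)]

    # Example with three versions and small hypothetical mutant pools.
    pools = [["m1", "m2", "m3"], ["m4", "m5"], ["m6", "m7", "m8", "m9"]]
    sequences = build_sequences(pools, num_groups_per_version=4, rng=random.Random(0))
    print(len(sequences), len(sequences[0]))   # -> 30 3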


7.2.5 Threats to Validity

In this section we describe the construct, internal, and external threats to the validity of our study, and the approaches we used to limit the effects of these threats.

External Validity. The Java programs that we study are relatively small (7K–80K), and their test suites' execution times are relatively short. Complex industrial programs with different characteristics may be subject to different cost-benefit tradeoffs, and may also involve different amounts of revenue, which could further alter those tradeoffs. The testing process we used is not representative of all processes used in practice, and our results should be interpreted in light of this. The tools we use in this study are prototypes, and thus may not reflect tools used in a typical industrial environment. Control for these threats can be achieved only through additional studies with wider populations of programs, other testing processes, and enhanced, more efficient tools.

Internal Validity. The inferences we have made about the cost-benefits of regression testing techniques could have been affected by two factors. The first factor involves potential faults in the tools that we used to collect data. To control for this threat, we validated our tools on several simple Java programs. The second factor involves the actual values we used to calculate costs, some of which involve estimation. For example, we used code change ratios to estimate incremental instrumentation costs, and an average test case execution time over the instrumented program to estimate incremental trace collection costs. We also measured the costs of test setup, finding obsolete tests, repairing obsolete tests, and validating outputs by measuring the time taken by one or two graduate students. The use of such estimates could confound results. The values we used for revenue and for the costs of correcting and missing faults are estimated based on surveys found in the literature, but such values can be situation-dependent; for example, Perry and Stieg [103] present a different set of fault costs. However, we did choose a relatively small revenue figure so as not to inflate results, given that our object programs are relatively small. In summary, we exercised care in selecting reasonable estimates relevant to our object programs, but larger-scale industrial case studies will be needed to follow up on these results.

Construct Validity. The dependent measures that we have considered for costs and benefits are not the only possible measures related to regression testing cost-effectiveness. As described in Section 6.2, other testing costs might be worth measuring for different testing situations and organizations.


7.2.6 Data and Analysis

In our analysis of results, in keeping with our research questions, we organize the data considering two different context factors: time constraints (captured in our process model, and through various factors in our cost-benefit model, through the early termination of testing activities), and availability of incremental analysis resources (captured in our cost-benefit model in terms of the incorporation of differing forms of analysis costs related to instrumentation and trace collection). The combinations of these context factors yield four classes of technique applications, as illustrated in Figure 7.1. Each of these classes (each of the four boxes in the figure) denotes a different scenario that an organization could face in testing, depending on resource availability and time constraints. We describe each scenario further as follows.

Upper left (Box 1): no time constraints are applied and no incremental analysis resources are available. In this situation, five of the regression testing techniques we consider apply: three test case prioritization techniques (original test order (org), total block coverage (tot), and additional block coverage (add)), and two regression test selection techniques (retest-all (rta) and regression test selection (rts)).

Upper right (Box 2): no time constraints are applied and incremental analysis resources are available. In this situation we consider the same five techniques considered in Box 1, but three of the heuristics (tot, add, and rts) use incremental analysis resources. To identify these techniques succinctly we add the tag i to each technique's mnemonic: tot.i, add.i, and rts.i.
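For reference, the two coverage-based prioritization heuristics just introduced (tot and add) can be characterized by the following minimal sketch of their standard greedy formulations; this is not the implementation used in our study, and the coverage input (test id mapped to the set of block ids it covered in the previous version) is an assumed format used only for illustration.

    def total_block_coverage_order(coverage):
        """tot: schedule test cases in decreasing order of blocks covered."""
        return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

    def additional_block_coverage_order(coverage):
        """add: greedily pick the test covering the most not-yet-covered blocks;
        once no remaining test adds new blocks, reset and continue."""
        remaining, order, covered = set(coverage), [], set()
        while remaining:
            best = max(remaining, key=lambda t: len(coverage[t] - covered))
            if not coverage[best] - covered:
                if covered:
                    covered = set()          # all coverable blocks covered: reset
                    continue
                order.extend(remaining)      # leftover tests add nothing at all
                break
            order.append(best)
            covered |= coverage[best]
            remaining.remove(best)
        return order

    # Example: block coverage from a previous version (hypothetical).
    cov = {"t1": {1, 2, 3}, "t2": {3, 4}, "t3": {5}, "t4": {1, 2}}
    print(total_block_coverage_order(cov))        # e.g. ['t1', 't2', 't4', 't3']
    print(additional_block_coverage_order(cov))   # e.g. ['t1', 't2', 't3', 't4']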
[Figure 7.1 appears here: a two-by-two grid crossing the time-constraints factor (no/yes) with the incremental-analysis-resources factor (no/yes). Box 1 lists org, rta, rts, tot, and add; Box 2 lists org, rta, rts.i, tot.i, and add.i; Box 3 lists org.50, rta.50, rts.50, tot.50, and add.50; Box 4 lists org.50, tot.50.i, and add.50.i. The legend identifies org as the original order, rta as retest-all, rts as safe regression test selection, tot as total coverage prioritization, and add as additional coverage prioritization.]
Figure 7.1: Cost factor scenarios.

Lower left (Box 3): time constraints are applied and no incremental analysis resources are available. In this situation we also consider five techniques, representing the case in which testing activities following the application of techniques in Box 1 are terminated early. We eliminate the second half of the test suites for four of the techniques (org.50, tot.50, add.50, and rta.50).5 For regression test selection, we chose a different approach, randomly selecting half of the test cases in the test suite, because we wished all test suites for a given version in Box 3 to have the same size.

5 In principle test suites are sets, but in practice test cases are ordered, and thus the notion of using the first half of a suite applies.

Lower right (Box 4): time constraints are applied and incremental analysis resources are available. In this situation we consider only three techniques, org.50, tot.50.i, and add.50.i, because incremental analysis does not apply to a test suite obtained (for regression test selection) by random reduction.

To address each of our research questions, we need to compare pairs of techniques for cost-benefit tradeoffs, and then compare the relationships that occur between techniques under one set of factors to the relationships that occur under another set of factors. For example, we ask whether the relationship between org and rts in Box 1 is the same as the relationship between org and rts in Box 3, in order to assess whether the effects of early test termination alter the relative cost-benefits of these two techniques.

We first perform technique comparisons within each box. Table 7.2 summarizes the results of these comparisons, reporting the relative cost-benefit relationship measured for each pair of techniques within each box, per program, using Equation 6.7. For these comparisons, we use the average of the values obtained for the 30 sequences of mutant groups. Table 7.2 contains one subtable corresponding to Box 1, one subtable corresponding to Box 2, two subtables for Box 3 (one for the case in which non-severe faults are utilized in the cost-benefit equations, and another for the case in which severe faults are utilized), and one subtable corresponding to Box 4 (containing data for both types of faults). (The use of pairs of subtables for Boxes 3 and 4 corresponds to our wish to analyze results relative to two classes of faults differing in severity. Note, however, that differences between fault severities have effects only for cases in which time constraints limit test execution, because when constraints are not applied and full test suites are executed, there are no omitted faults and thus no fault costs. Thus these results are reported only for Boxes 3 and 4.) All of the data in Table 7.2 is represented in dollar values, obtained by converting time measurements using the formulas and values described in Sections 6.3 and 7.2.3, respectively.

Within each subtable, columns are labeled with the pairs of regression testing techniques compared, and rows are labeled with the object programs considered. If the entry in column B(T1, T2) and row foo contains a positive amount, then T1 yields benefit by that amount, in dollars, over T2 for program foo. If the entry in column B(T1, T2) and row foo contains a negative amount, then T2 yields benefit by that amount, in dollars, over T1 for foo. For example, the cell in column B(tot, org), row ant, in the topmost subtable of Table 7.2 lists the result of applying Equation 6.7 treating tot as technique A and org as technique B; the amount listed, -930, is the dollar-cost advantage (or rather, disadvantage) of applying A rather than B to ant.

We now use the data in Table 7.2 to address each of our research questions, in turn.

RQ1: Effects of time constraints

Our first research question considers whether the imposition of time constraints affects the relative cost-benefits of regression testing techniques. To answer this question, we compare technique pairs in Boxes 1 and 2 in Figure 7.1 to corresponding technique pairs in Boxes 3 and 4, respectively. We restrict our attention to comparisons between heuristics and control techniques, deferring comparisons between regression test selection and test case prioritization techniques to our discussion of RQ3.

Columns 2 through 4 in Table 7.2, in the subtable for Box 1, indicate that heuristic regression testing techniques are not beneficial compared to corresponding control techniques in all but one case (add versus org for nanoxml). The comparisons yield negative numbers, indicating that the original-order and retest-all techniques outperformed the heuristics in those cases. Data in the same columns of the subtable for Box 2 shows similar trends: heuristic regression testing techniques are not beneficial compared to corresponding control techniques in all but two cases (tot.i versus org and add.i versus org for nanoxml). Comparing this data to that for corresponding technique comparisons in Boxes 3 and 4 for non-severe faults reveals different trends: in all but four cases in Box 3 and three in Box 4, heuristics outperform control techniques. Furthermore, even the few cases in which heuristics are not beneficial over control techniques are altered when we consider the case in which faults are severe.

RQ2: Effects of incremental resource use

Our second research question considers whether the availability of incremental resources affects the relative cost-benefits of regression testing techniques. To answer this question, we compare technique pairs in Boxes 1 and 3 in Figure 7.1 to corresponding technique pairs in Boxes 2 and 4, respectively. Again, we focus on comparisons between heuristics and control techniques.

As noted previously, all three comparisons between heuristics and control techniques in Box 1 show no benefits accruing to heuristics except for the case of add versus org for nanoxml. When we consider the use of incremental analysis resources (Box 2), however, the comparisons do reveal a few differences. In all cases, the use of incremental analysis yields advantages over the use of non-incremental analysis: all numbers in the table are higher than their corresponding numbers in Box 1. In particular, for nanoxml, in both the comparison of tot to org and that of add to org, the use of incremental analysis adds additional benefit to the heuristic.

Table 7.2: Relative Benefits Between Technique Pairs (dollars)

No incremental analysis resource & no time constraints (Box 1)

Object         B(tot,org)  B(add,org)  B(rts,rta)  B(rts,tot)  B(rts,add)
ant                  -930       -1145       -1799        -655        -439
jmeter               -290        -279        -552        -134        -145
xml-security         -100         -98        -223         -48         -50
nanoxml               -35          62        -188         196          99
galileo             -1006        -360        -415        2968        2321

Incremental analysis resource & no time constraints (Box 2)

Object         B(tot.i,org)  B(add.i,org)  B(rts.i,rta)  B(rts.i,tot.i)  B(rts.i,add.i)
ant                    -614          -829         -1483            -655            -439
jmeter                 -136          -125          -399            -134            -145
xml-security            -25           -23          -148             -48             -50
nanoxml                   2           100          -150             196              99
galileo                -774          -127          -182            2968            2321

No incremental analysis resource & time constraints (Box 3: non-severe faults)

Object         B(tot.50,org.50)  B(add.50,org.50)  B(rts.50,rta.50)  B(rts.50,tot.50)  B(rts.50,add.50)
ant                        -727              -805               204               978              1056
jmeter                      -58                43               194               274               172
xml-security                335               372               340                11               -26
nanoxml                     620               844               928               322                98
galileo                    -993               261              1788              3041              1786

No incremental analysis resource & time constraints (Box 3: severe faults)

Object         B(tot.50,org.50)  B(add.50,org.50)  B(rts.50,rta.50)  B(rts.50,tot.50)  B(rts.50,add.50)
ant                        3699              5904              3109              -543             -2748
jmeter                     3538              5024              2961              -555             -2041
xml-security               7025              7912              5182             -1836             -2723
nanoxml                   13210             16408             14140               945             -2253
galileo                    1773             32635             27244             25731             -5130

Incremental analysis resource & time constraints (Box 4)

Object         B(tot.50.i,org.50)           B(add.50.i,org.50)
               non-severe       severe      non-severe       severe
ant                  -411         4015            -489         6220
jmeter                 95         3692             197         5177
xml-security          410         7100             447         7987
nanoxml               659        13248             882        16447
galileo              -761         2005             494        32867
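For concreteness, each cell of Table 7.2 is obtained by applying Equation 6.7 (the benefit of the first technique minus the benefit of the second) to each of the 30 mutant-group sequences for a program and then averaging, as in the following minimal sketch; the data layout and the example values are hypothetical, and the per-sequence benefit values themselves come from the cost model of Chapter 6.

    def relative_benefit(benefits, program, tech_a, tech_b):
        """B(tech_a, tech_b) for one program: the mean, over the 30 sequences,
        of the per-sequence benefit difference, in dollars."""
        diffs = [a - b for a, b in zip(benefits[program][tech_a],
                                       benefits[program][tech_b])]
        return sum(diffs) / len(diffs)

    # Two fabricated sequences (instead of 30) chosen so that the result matches
    # the B(tot, org) cell for ant in the Box 1 subtable.
    benefits = {"ant": {"tot": [70.0, 130.0], "org": [1000.0, 1060.0]}}
    print(relative_benefit(benefits, "ant", "tot", "org"))   # -> -930.0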

RQ3: Test selection versus prioritization

Our third research question considers whether the relative benefits of regression test selection and test case prioritization techniques differ. Columns 5 and 6 in the subtables for Boxes 1 and 2 show the comparison results between these techniques when no time constraints are applied and safe regression test selection is involved. (Note that Columns 5 and 6 in Box 1 contain values identical to those in Box 2; this is because the techniques used in the two boxes differ only in terms of their use of incremental analysis resources, and in the case of these particular techniques, where time constraints are not applied, the costs of the activities performed do not differ across the boxes.)

The results show that the regression test selection technique (rts) is more cost-effective than the test case prioritization techniques (tot and add) for the two object programs (nanoxml and galileo) that have specification-based test suites. For the other three programs, which use JUnit test suites, test case prioritization techniques are more beneficial than regression test selection. This result is important because it suggests that in practice, the preferred technique might vary with test suite type. Further study of this effect is needed, however, to determine whether test suite type, technique, or their interaction is responsible for it.

Turning to the subtables for Box 3, when we compare results between test case prioritization and regression test selection in the case in which time constraints apply, we see different relationships between techniques. For non-severe faults, the selection technique is better than the tot and add techniques in all but one case. For severe faults, the comparison between selection and add reveals that the add technique is better than selection, but the comparison between selection and tot reveals two cases (nanoxml and galileo) in which the tot technique ceases to be better than selection.

7.3 Discussion

To further explore the results of our study we consider two topics: (1) the ramifications for practice of the results we obtained; and (2) a comparison of our results with those obtained in earlier work using different cost models.

Where the first topic is concerned, our results support the conclusion that accounting for different context factors makes a difference when assessing the relative benefits of regression testing techniques. In particular, our analysis shows that the time constraints factor had a large impact on relative benefits. In practice, cases in which time constraints intervene to affect product release are frequent in the software industry; Hendricks and Singhal [60] report that a typical reason for product delays is the need for additional testing and debugging. At other times, organizations cut back on testing activities in order to ensure timely release of their product. Further study of our data suggests, in fact, that the primary cause of the impact of time constraints was the tradeoff between the costs of applying additional tests and not missing faults, and the costs of reduced (non-safe) testing in which faults are missed. Delivering on time but with incomplete testing can lead to revenue increases, but if the product contains defects after delivery, the organization can suffer post-delivery revenue losses (due to additional defect removal costs and the loss of customers who no longer trust the product). Meanwhile, delivering late but with complete testing can lead to smaller numbers of post-release defects, but if the delivery date is delayed too long, the company can lose opportunities to earn revenue from the product. These inferences are not unexpected, but what our empirical results suggest is that cost models such as ours can be used to ascertain the regression testing technique best suited to a particular scenario, based on expected revenues and the values of other factors related to testing costs.

Regarding the use of incremental resources, our results show that this factor, too, can affect evaluations of regression testing techniques, but such impact was apparent in only some cases. One cause of this was the relationship observed, for our objects, between instrumentation and trace collection costs. In general, we expect that if we reduce the number of class files that need to be instrumented to collect information for a testing session, we could also reduce the number of trace files to be collected by a proportional amount, because we need to collect only traces that are affected by instrumentation changes. However, this expectation was often not met on our objects. For example, in the case of nanoxml, incremental instrumentation required only 30% of total instrumentation time, but incremental trace collection required 98% of total trace collection cost. This result occurred because some of the newly instrumented files are accessed by most test cases. The lesson learned from this example, where our study and the use of cost models are concerned, is that it can be important to decouple factors in such models, to avoid conflating different effects.

Where our second topic of discussion is concerned, in this work we evaluated regression testing techniques using a cost model that (1) allows comparisons of previously incomparable classes of techniques (prioritization and selection) and (2) includes a richer set of factors than has been employed in prior evaluations. Our comparison of prioritization and selection (RQ3) illustrates the effects of considering such factors on the relative cost-effectiveness of these classes of techniques: not only time constraints, but also fault severity and test suite type potentially affect tradeoffs between them.

To gain further insight into how evaluations of regression testing techniques differ as cost models vary, we compared our results with those from a previous empirical study of prioritization techniques, described in Section 5.3, in which three of the same JUnit object programs were used (ant, jmeter, and xml-security). The previous study evaluated a prioritization technique (additional block coverage) using a cost model that considered only two cost components (test case execution time and prioritization technique cost) and one benefit (fault detection rate). That study showed that the additional block coverage technique was beneficial compared to original test case orderings for two of the three programs (jmeter and xml-security). This result is quite different from our results in this study, which show no benefits for any heuristics over control techniques in the case in which time constraints do not apply. One lesson suggested by this observation is that evaluations based on different cost models can lead to quite different conclusions about the cost-benefits of techniques, and so efforts to capture richer sets of factors in such models, as we have done in this work, are worthwhile.
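As an illustration of the decoupling point raised above, the following sketch (with hypothetical inputs rather than our measurements) shows why savings from incremental instrumentation and savings from incremental trace collection must be modeled as separate factors: the former scales with the fraction of class files that are re-instrumented, while the latter scales with the fraction of test cases that touch at least one re-instrumented class.

    def incremental_fractions(changed_classes, classes_per_test, all_classes):
        """Fraction of classes needing re-instrumentation versus fraction of
        test cases whose traces must be re-collected."""
        inst_fraction = len(changed_classes) / len(all_classes)
        affected = [t for t, classes in classes_per_test.items()
                    if classes & changed_classes]
        trace_fraction = len(affected) / len(classes_per_test)
        return inst_fraction, trace_fraction

    # A nanoxml-like situation: few classes change, but one changed class is
    # loaded by every test, so trace collection sees almost no saving.
    all_classes = {f"C{i}" for i in range(10)}
    changed = {"C0", "C1", "C2"}
    classes_per_test = {f"t{i}": {"C0", f"C{i % 10}"} for i in range(50)}
    inst, trace = incremental_fractions(changed, classes_per_test, all_classes)
    print(f"instrumentation: {inst:.0%} of classes; traces: {trace:.0%} of tests")
    # -> instrumentation: 30% of classes; traces: 100% of tests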


7.4 Conclusions

Empirical assessments of regression testing techniques depend on cost-benefit models. In the previous chapter we presented such a model, which captures a richer set of the factors (including context and lifetime factors) that affect technique cost-effectiveness than prior models. Our model facilitates the investigation and comparison of techniques along dimensions that were not previously accessible, and the empirical results presented in this chapter indicate that this expanded view has practical implications for users and researchers of techniques. Although the cost-benefit model that we present captures specific testing-related factors relative to just one (common) regression testing process, it can be adapted to include other factors and to apply to other processes, and our future work will consider such adaptations.

In the study reported in this chapter, we evaluated regression testing techniques using systems of modest size (7K–80K) and relatively small revenue estimates. Program size, however, does not appear to be a factor in our results for the programs that we consider; thus, we conjecture that similar trends could be expected for larger, industrial systems. Such larger systems, however, will also be associated with higher revenues than those considered here, and we expect that in such cases the context factors we have considered will have an even greater impact on the relative cost-benefits of regression testing techniques.


Chapter 8 Conclusions and Future Directions


In this dissertation, we have addressed problems involving empirical methodologies and the evaluation of testing techniques. We have taken the following steps to achieve our research goal:

- We have surveyed empirical studies of regression testing techniques and analyzed them to identify problems in evaluation processes.
- We have developed infrastructure to support empirical studies of regression testing techniques.
- We have performed initial empirical studies using the infrastructure developed in the second step, and discussed and addressed issues related to the infrastructure and empirical methodologies.
- We have developed a cost-benefit model that can be used to assess the cost-effectiveness of regression testing techniques considering system lifetime and context factors.
- We have performed empirical studies with regression testing techniques, and evaluated the cost-effectiveness of techniques using our new cost model.

8.1 Merit and Impact of This Research

This research promises to provide two classes of benefits: benefits to researchers studying regression testing and empirical methodologies, and benefits to testing practitioners seeking better regression testing techniques.

For practitioners, we provide new practical understanding of regression testing techniques. For researchers, we provide a new cost-benefit model that can be used to compare and empirically evaluate regression testing techniques, and that accounts for testing context and system lifetime factors. We identify problems involving infrastructure, and provide infrastructure that can help researchers conduct various controlled experiments considering a wide variety of software artifacts. Finally, we provide a better understanding of empirical methodologies that can be used by other researchers to make further progress in this area.

8.2 Future Directions

Our continuing research will focus on software testing, empirical studies considering the effects of testing processes, and improving our ability to assess techniques through more rigorous cost-benefit models. In particular, we will focus on the following directions:

- So far we have considered only a few of the many different regression testing techniques that have been proposed in the research literature, and other techniques, such as non-safe regression test selection and test suite reduction techniques, are subject to different considerations. We intend to study these techniques.

- We intend to perform additional empirical studies using a wider range of object programs and other artifacts to address questions about the generality of our previous studies' results. For example, these include studies using larger programs, additional types of test suites, a wider range and distribution of hand-seeded faults, and different types of faults, such as real faults and mutation faults related to various mutation operators.

- Our initial studies of our cost-benefit model have shown that a model that accounts for context and lifetime factors can evaluate techniques differently. There are several directions in which we anticipate additional work on a cost-benefit model will be needed, and that we expect to pursue in future work, as follows:

  - Develop cost-benefit models for other common regression testing processes. The process model assumed in our cost-benefit model is only one model, and there are many practical testing scenarios that it does not accurately represent. To account for such scenarios, we will consider common regression testing processes such as big-bang, nightly-build-and-test, and test-driven development, and develop cost-benefit models for them.

  - Improve and refine models. Generally, with any model being developed, there will be choices made in terms of the costs and benefits to consider, and there will be simplifying assumptions involved; any of these may be addressed through improvements and refinements. We will address these issues through families of empirical studies and analytical approaches, such as sensitivity analysis.

Along with the future research directions stated above, we intend to explore the ramifications for practice of the results of that research, including: 1) helping practitioners choose appropriate techniques by providing guidelines or methodologies based on knowledge gained from families of empirical studies; 2) helping practitioners estimate the cost-benefits associated with specific regression testing techniques; and 3) creating new regression testing processes and methodologies, investigation of which our cost-benefit model can facilitate.

So far we have studied testing problems using generic desktop applications. Depending on the application domain and testing environment, however, we might face different problems and constraints related to testing. In particular, embedded, concurrent, and real-time systems involve very complicated testing issues because they involve cross-development environments and tight resource and timing constraints on execution platforms. We intend to investigate these issues. This involves activities much like those performed for this dissertation, including infrastructure, experimentation, and cost-benefit models. However, there are also new challenges; for example, testing embedded systems typically involves many resources and complications, so it is very costly to test them in real environments, and thus testing is often performed in a simulated environment. Testing in a real environment, however, might uncover problems that cannot be exposed in a simulated environment, so it is important to investigate, through cost-benefit models and various experiments, a cost-effective transition point from a simulated environment to a real environment.


Bibliography
[1] J. Andrews and Y. Zhang. General test result checking with log file analysis. IEEE Transactions on Software Engineering, 29(7):634–648, July 2003.
[2] J. H. Andrews, L. C. Briand, and Y. Labiche. Is mutation an appropriate tool for testing experiments? In Proceedings of the International Conference on Software Engineering, pages 402–411, May 2005.
[3] http://ant.apache.org.
[4] M. Arnold and B. Ryder. A framework for reducing the cost of instrumented code. In ACM SIGPLAN Notices, volume 36(5), pages 168–179, May 2001.
[5] M. Barnett, C. Campbell, W. Schulte, and M. Veanes. Specification, simulation and testing of COM components using abstract state machines. In Proceedings of the International Workshop on Abstract State Machines (ASM 2001), pages 266–270, Feb. 2001.
[6] V. Basili, R. Selby, E. Heinz, and D. Hutchens. Experimentation in software engineering. IEEE TSE, 12(7):733–743, July 1986.
[7] V. Basili, F. Shull, and F. Lanubile. Building knowledge through families of experiments. IEEE Transactions on Software Engineering, 25(4):456–473, 1999.
[8] http://jakarta.apache.org/bcel.
[9] B. Beizer. Black-Box Testing: Techniques for Functional Testing of Software and Systems. John Wiley & Sons, 1999.
[10] J. Bible, G. Rothermel, and D. Rosenblum. Coarse- and fine-grained safe regression test selection. ACM Trans. Softw. Eng. Meth., 10(2):149–183, Apr. 2001.
[11] J. M. Bieman, S. Ghosh, and R. T. Alexander. A technique for mutation of Java objects. In Proc. Automated Softw. Eng., pages 337–340, Nov. 2001.
[12] R. Binder. Testing Object-Oriented Systems. Addison Wesley, Reading, MA, 2000.

[13] D. Binkley. Semantics guided regression test cost reduction. IEEE Transactions on Software Engineering, 23(8):498–516, Aug. 1997.
[14] B. Boehm. Value-based software engineering. Software Engineering Notes, 28(2), Mar. 2003.
[15] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language User Guide. Addison-Wesley, Reading, Massachusetts, USA, first edition, 1999.
[16] L. Briand, Y. Labiche, and Y. Wang. An investigation of graph-based class integration test order strategies. IEEE Transactions on Software Engineering, 29(7), July 2003.
[17] L. C. Briand, Y. Labiche, and G. Soccar. Automating impact analysis and regression test selection based on UML design. In Proceedings of the International Conference on Software Maintenance, pages 252–261, Oct. 2002.
[18] A. Brooks, J. Daly, M. Miller, M. Roper, and M. Wood. Replication of experimental results in software engineering. Technical Report ISERN-96-10, ISERN, 1996.
[19] T. Budd and A. Gopal. Program testing by specification mutation. Computer Languages, 10(1):63–73, 1985.
[20] T. A. Budd. Mutation analysis of program test data. Ph.D. dissertation, Yale University, 1980.
[21] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf. Design and evaluation of a wide-area event notification service. ACM Transactions on Computer Systems, 19(3):332–383, August 2001.
[22] T. Chen and M. Lau. Dividing strategies for the optimization of a test suite. Information Processing Letters, 60(3):135–141, Mar. 1996.
[23] Y. Chen, D. Rosenblum, and K. Vo. TestTube: A system for selective regression testing. In Intl. Conf. Softw. Eng., pages 211–220, May 1994.
[24] T. Chow. Testing software design modeled by finite-state machines. IEEE Transactions on Software Engineering, SE-4(3):178–187, May 1978.
[25] R. Coe. What is an effect size? CEM Centre, Durham University, Mar. 2000.
[26] R. D. Craig and S. P. Jaskiel. Systematic Software Testing. Artech House Publishers, Boston, MA, first edition, 2002.
[27] http://www.culpepper.
[28] A. Dean and D. Voss. Design and Analysis of Experiments. Springer, 1999.

[29] M. Delamaro, J. Maldonado, and A. Mathur. Interface mutation: An approach for integration testing. IEEE Transactions on Software Engineering, 27(3), Mar. 2001.
[30] R. A. DeMillo, R. J. Lipton, and F. G. Sayward. Hints on test data selection: Help for the practicing programmer. IEEE Computer, pages 34–41, 1978.
[31] H. Do, S. Elbaum, and G. Rothermel. Infrastructure support for controlled experimentation with software testing and regression testing techniques. In Intl. Symp. Emp. Softw. Eng., pages 60–70, Aug. 2004.
[32] H. Do, S. Elbaum, and G. Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Intl. J. Emp. Softw. Eng., 10(4):405–435, 2005.
[33] H. Do and G. Rothermel. A controlled experiment assessing test case prioritization techniques via mutation faults. In Conf. Softw. Maint., pages 113–124, Sept. 2005.
[34] H. Do and G. Rothermel. An Empirical Study of Regression Testing Techniques Incorporating Context and Lifecycle Factors and Improved Cost-Benefit Models. In Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering, Nov. 2006.
[35] H. Do, G. Rothermel, and A. Kinneer. Empirical studies of test case prioritization in a JUnit testing environment. In Proc. Intl. Symp. Softw. Rel. Engr., pages 113–124, Nov. 2004.
[36] H. Do, G. Rothermel, and A. Kinneer. Prioritizing JUnit test cases: An empirical assessment and cost-benefits analysis. Intl. J. Emp. Softw. Eng., 11(1):33–70, 2006.
[37] H. Do, G. Rothermel, and A. Kinneer. Prioritizing JUnit test cases: An empirical assessment and cost-benefits analysis. Empirical Software Engineering: An International Journal, 11(1):33–70, Mar. 2006.
[38] S. Elbaum, D. Gable, and G. Rothermel. The Impact of Software Evolution on Code Coverage. In Proceedings of the International Conference on Software Maintenance, pages 169–179, Nov. 2001.
[39] S. Elbaum, P. Kallakuri, A. Malishevsky, G. Rothermel, and S. Kanduri. Understanding the effects of changes on the cost-effectiveness of regression testing techniques. Journal of Software Testing, Verification and Reliability, 12(2), 2003.
[40] S. Elbaum, A. Malishevsky, and G. Rothermel. Prioritizing test cases for regression testing. In Intl. Symp. Softw. Test. Anal., pages 102–112, Aug. 2000.

[41] S. Elbaum, A. Malishevsky, and G. Rothermel. Incorporating varying test costs and fault severities into test case prioritization. In Proc. Intl. Conf. Softw. Eng., pages 329–338, May 2001.
[42] S. Elbaum, A. G. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies. IEEE Trans. Softw. Eng., 28(2):159–182, Feb. 2002.
[43] S. Elbaum, G. Rothermel, S. Kanduri, and A. G. Malishevsky. Selecting a cost-effective test case prioritization technique. Softw. Quality J., 12(3), 2004.
[44] J. Feldman and W. R. Sutherland. Rejuvenating experimental computer science: A report to the National Science Foundation and others. Comm. ACM, 22(9):497–502, Sept. 1979.
[45] N. Fenton and S. Pfleeger. Science and substance: A challenge to software engineers. IEEE Software, pages 86–95, July 1994.
[46] R. Ferguson and B. Korel. The chaining approach for software test data generation. ACM Trans. Softw. Eng. Meth., 5(1), Jan. 1996.
[47] K. Fischer, F. Raji, and A. Chruscicki. A methodology for retesting modified software. In Proceedings of the Natl. Tele. Conference B-6-3, pages 1–6, Nov. 1981.
[48] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-Wesley, 1999.
[49] P. G. Frankl, S. N. Weiss, and C. Hu. All-uses versus mutation testing: An experimental comparison of effectiveness. J. Sys. Softw., 38(3):235–253, 1997.
[50] http://csce.unl.edu/galileo/pub/galileo.
[51] T. L. Graves, M. J. Harrold, J.-M. Kim, A. Porter, and G. Rothermel. An empirical study of regression test selection techniques. In Proc. Intl. Conf. Softw. Eng., Apr. 1998.
[52] T. L. Graves, M. J. Harrold, J.-M. Kim, A. Porter, and G. Rothermel. An empirical study of regression test selection techniques. ACM Trans. Softw. Eng. Meth., 10(2):184–208, Apr. 2001.
[53] W. Grieskamp, Y. Gurevich, W. Schulte, and M. Veanes. Testing with abstract state machines. In Proceedings of the International Workshop on Abstract State Machines (ASM 2001), pages 257–261, Feb. 2001.
[54] R. G. Hamlet. Testing programs with the aid of a compiler. IEEE Transactions on Software Engineering, 3(4):279–290, 1977.

[55] M. Harder, J. Mellen, and M. D. Ernst. Improving test suites via operational abstraction. In Proc. 25th International Conference on Software Engineering, May 2003.
[56] M. J. Harrold, R. Gupta, and M. L. Soffa. A methodology for controlling the size of a test suite. ACM Transactions on Software Engineering and Methodology, 2(3):270–285, July 1993.
[57] M. J. Harrold, J. Jones, T. Li, D. Liang, A. Orso, M. Pennings, S. Sinha, S. Spoon, and A. Gujarathi. Regression test selection for Java software. In Proc. Conf. O.-O. Programming, Systems, Langs., and Apps., Oct. 2001.
[58] M. J. Harrold, D. Rosenblum, G. Rothermel, and E. Weyuker. Empirical studies of a prediction model for regression test selection. IEEE Transactions on Software Engineering, 27(3):248–263, Mar. 2001.
[59] J. Hartmann, C. Imoberdorf, and M. Meisinger. UML-based integration testing. In Proceedings of the International Symposium on Software Testing and Analysis, pages 60–70, August 2000.
[60] K. B. Hendricks and V. R. Singhal. Delays in new product introductions and the market value of the firm: The consequences of being late to the market. Mgmt. Science, 43(4):422–436, Apr. 1997.
[61] K. Hinkelmann and O. Kempthorne. Design and Analysis of Experiments: Introduction to Experimental Design. John Wiley and Sons, 1999.
[62] J. Hsu. Multiple Comparisons: Theory and Methods. Chapman & Hall, London, 1996.
[63] M. Hutchins, H. Foster, T. Goradia, and T. Ostrand. Experiments on the effectiveness of dataflow- and controlflow-based test adequacy criteria. In Proc. Intl. Conf. Softw. Eng., pages 191–200, May 1994.
[64] K. Ishizaki, M. Kawahito, T. Yasue, M. Takeuchi, T. Ogasawara, T. Suganuma, T. Onodera, H. Komatsu, and T. Nakatani. Design, implementation and evaluation of optimisations in a just-in-time compiler. In ACM 1999 Java Grande Conf., pages 119–128, June 1999.
[65] http://jakarta.apache.org.
[66] http://jakarta.apache.org/jmeter.
[67] C. Jones. Applied Software Measurement: Assuring productivity and quality. McGraw-Hill, 1997.
[68] http://jtopas.sourceforge.net/jtopas.
[69] http://www.junit.org.

[70] N. Juristo and A. M. Moreno. Basics of Software Engineering Experimentation. Kluwer Academic Publishers, 2001.
[71] N. Juristo, A. M. Moreno, and S. Vegas. Reviewing 25 years of testing technique experiments. Empirical Software Engineering: An International Journal, 9(1), Mar. 2004.
[72] J. Kim and A. Porter. A history-based test prioritization technique for regression testing in resource constrained environments. In Intl. Conf. Softw. Eng., May 2002.
[73] J. Kim, A. Porter, and G. Rothermel. An empirical study of regression test application frequency. In Intl. Conf. Softw. Eng., pages 126–135, June 2000.
[74] S. Kim, J. A. Clark, and J. A. McDermid. Class mutation: Mutation testing for object-oriented programs. In Proc. Net.ObjectDays Conf. Object-Oriented Softw. Sys., Oct. 2000.
[75] S. Kim, J. A. Clark, and J. A. McDermid. Investigating the effectiveness of object-oriented testing strategies with the mutation method. J. of Softw. Testing, Verif., and Rel., 11(4):207–225, 2001.
[76] A. Kinneer, M. Dwyer, and G. Rothermel. Sofya: A flexible framework for development of dynamic program analysis for Java software. Technical Report TR-UNL-CSE-2006-0006, University of Nebraska–Lincoln, Apr. 2006.
[77] B. Kitchenham, T. Dyba, and M. Jorgensen. Evidence-based Software Engineering. In Proc. Intl. Conf. Softw. Eng., 2004.
[78] B. Kitchenham, S. Pfleeger, L. Pickard, P. Jones, D. Hoaglin, K. El Emam, and J. Rosenberg. Preliminary guidelines for empirical research in software engineering. IEEE Transactions on Software Engineering, 28(8):721–734, Aug. 2002.
[79] B. Kitchenham, L. Pickard, and S. Pfleeger. Case studies for method and tool evaluation. IEEE Software, pages 52–62, July 1995.
[80] P. Koppol, R. Carver, and K. Tai. Incremental integration testing of concurrent programs. IEEE Transactions on Software Engineering, 28(6), June 2002.
[81] D. Leon and A. Podgurski. A comparison of coverage-based and distribution-based techniques for filtering and prioritizing test cases. In Proceedings of the International Symposium on Software Reliability Engineering, pages 442–453, Nov. 2003.
[82] H. Leung and L. White. Insights into regression testing. In Conf. Softw. Maint., pages 60–69, Oct. 1989.

[83] H. Leung and L. White. A cost model to compare regression test strategies. In Conf. Softw. Maint., Oct. 1991.
[84] C. Lindsey, J. Tolliver, and T. Lindblad. JavaTech: An Introduction to Scientific and Technical Computing with Java. Cambridge University Press, 2005.
[85] Y. Ma, Y. Kwon, and J. Offutt. Inter-class mutation operators for Java. In Proc. Intl. Symp. Softw. Rel. Engr., pages 352–363, Nov. 2002.
[86] A. Malishevsky, G. Rothermel, and S. Elbaum. Modeling the cost-benefits tradeoffs for regression testing techniques. In Conf. Softw. Maint., pages 204–213, Oct. 2002.
[87] M. Marre and A. Bertolino. Using spanning sets for coverage testing. IEEE TSE, 29(11), Nov. 2003.
[88] D. D. McCracken, P. J. Denning, and D. H. Brandin. An ACM executive committee position on the crisis in experimental computer science. Comm. ACM, 22(9):503–504, Sept. 1979.
[89] C. Michael, G. McGraw, and M. Schatz. Generating software test data by evolution. IEEE Transactions on Software Engineering, 27(12), Dec. 2001.
[90] D. C. Montgomery. Design and Analysis of Experiments. John Wiley and Sons, New York, fourth edition, 1997.
[91] M. Muller and F. Padberg. About the return on investment of test-driven development. In EDSER-1, May 2003.
[92] C. Murphy and B. Myors. Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. Lawrence Erlbaum Associates, 1998.
[93] http://nanoxml.sourceforge.net/orig.
[94] A. Offutt, G. Rothermel, R. Untch, and C. Zapf. An experimental determination of sufficient mutant operators. ACM Transactions on Software Engineering and Methodology, 5(2), Apr. 1996.
[95] A. J. Offutt, J. Pan, K. Tewary, and T. Zhang. An experimental evaluation of data flow and mutation testing. Softw. Pract. and Exp., 26(2):165–176, Feb. 1996.
[96] J. Offutt, J. Pan, and J. M. Voas. Procedures for reducing the size of coverage-based test sets. In Proc. Intl. Conf. Testing Comp. Softw., pages 111–123, June 1995.

[97] K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma. Regression testing in an industrial environment. Comm. ACM, 41(5):81–86, May 1998.
[98] A. Orso, H. Do, G. Rothermel, M. J. Harrold, and D. S. Rosenblum. Using component metadata to regression test component-based software. Journal of Software Testing, Verification, and Reliability, 2005.
[99] A. Orso, M. J. Harrold, and D. S. Rosenblum. Component metadata for software engineering tasks. In W. Emmerich and S. Tai, editors, EDO 00, volume 1999 of Lecture Notes in Computer Science, pages 126–140. Springer-Verlag / ACM Press, November 2000.
[100] A. Orso, N. Shi, and M. J. Harrold. Scaling regression testing to large software systems. In Found. Softw. Eng., Nov. 2004.
[101] T. Ostrand and M. J. Balcer. The category-partition method for specifying and generating functional tests. Comm. ACM, 31(6), June 1988.
[102] D. Perry, A. Porter, and L. Votta. Empirical Software Engineering: A Roadmap. In Proceedings of the International Conference on Software Engineering, pages 345–355, May 2000.
[103] D. E. Perry and C. S. Stieg. Software faults in evolving a large, real-time system: A case study. In Eur. S.E. Conf., 1993.
[104] S. L. Pfleeger. Design and analysis in software engineering. Part 1: The language of case studies and formal experiments. ACM SIGSOFT Software Engineering Notes, 19(4):16–20, Oct. 1994.
[105] S. L. Pfleeger. Experimental design and analysis in software engineering. Part 2: How to set up an experiment. ACM SIGSOFT Software Engineering Notes, 20(1):22–26, Jan. 1995.
[106] S. L. Pfleeger. Experimental design and analysis in software engineering. Part 3: Types of experimental design. ACM SIGSOFT Software Engineering Notes, 20(2):14–16, Apr. 1995.
[107] S. L. Pfleeger. Experimental design and analysis in software engineering. Part 4: Choosing an experimental design. ACM SIGSOFT Software Engineering Notes, 20(3):13–15, July 1995.
[108] S. L. Pfleeger. Experimental design and analysis in software engineering. Part 5: Analyzing the data. ACM SIGSOFT Software Engineering Notes, 20(5):14–16, Dec. 1995.
[109] J. J. Phillips. Return on Investment in Training and Performance Improvement Programs. Gulf Publishing Company, Houston, TX, 1997.

[110] L. Pickard and B. Kitchenham. Combining empirical results in software engineering. Inf. Softw. Tech., 40(14):811–821, Aug. 1998.
[111] J. Power and J. Waldron. A method-level analysis of object-oriented techniques in Java applications. Technical Report NUM-CS-TR-2002-07, National University of Ireland, July 2002.
[112] R. Pressman. Software Eng.: A Practitioner's Approach. McGraw-Hill, New York, NY, 1987.
[113] F. L. Ramsey and D. W. Schafer. The Statistical Sleuth. Duxbury Press, 1st edition, 1997.
[114] IBM Rational Software, 2003.
[115] G. Rothermel, M. Burnett, L. Li, C. Dupuis, and A. Sheretov. A methodology for testing spreadsheets. ACM Trans. Softw. Eng. Meth., 10(1), Jan. 2001.
[116] G. Rothermel, S. Elbaum, A. Malishevsky, P. Kallakuri, and B. Davia. The impact of test suite granularity on the cost-effectiveness of regression testing. In Proc. Intl. Conf. Softw. Eng., May 2002.
[117] G. Rothermel, S. Elbaum, A. G. Malishevsky, P. Kallakuri, and X. Qiu. On test suite composition and cost-effective regression testing. ACM Trans. Softw. Eng. Meth., 13(3):227–331, July 2004.
[118] G. Rothermel and M. J. Harrold. Analyzing regression test selection techniques. IEEE Trans. Softw. Eng., 22(8):529–551, Aug. 1996.
[119] G. Rothermel and M. J. Harrold. A safe, efficient regression test selection technique. ACM Trans. Softw. Eng. Meth., 6(2):173–210, Apr. 1997.
[120] G. Rothermel and M. J. Harrold. Empirical studies of a safe regression test selection technique. IEEE Trans. Softw. Eng., 24(6):401–419, June 1998.
[121] G. Rothermel, M. J. Harrold, and J. Dedhia. Regression test selection for C++ programs. Journal of Software Testing, Verification, and Reliability, 10(2), June 2000.
[122] G. Rothermel, R. Untch, C. Chu, and M. J. Harrold. Prioritizing test cases for regression testing. IEEE Trans. Softw. Eng., 27(10):929–948, Oct. 2001.
[123] K. K. Sabnani, A. M. Lapone, and M. U. Uyar. An algorithmic procedure for checking safety properties of protocols. IEEE Transactions on Communications, 37(9):940–948, September 1989.
[124] D. Saff and M. D. Ernst. Reducing wasted development time via continuous testing. In Proc. Intl. Symp. Softw. Rel. Engr., pages 281–292, Nov. 2003.

[125] D. Saff and M. D. Ernst. An experimental evaluation of continuous testing during development. In Proceedings of the 2004 International Symposium on Software Testing and Analysis, pages 76–85, July 2004.
[126] D. Saff and M. D. Ernst. Continuous testing in Eclipse. In Proceedings of the 2nd Eclipse Technology Exchange Workshop, Mar. 2004.
[127] Computer Science and Telecommunications Board. Academic careers for experimental computer scientists and engineers. Comm. ACM, 37(4):87–90, Apr. 1994.
[128] F. Shull et al. What we have learned about fighting defects. In Intl. Softw. Metrics Symp., 2002.
[129] R. Solingen. Measuring the ROI of software process improvement. IEEE Software, pages 32–38, May 2004.
[130] http://sourceforge.net.
[131] http://www.insightful.com/products/splus.
[132] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. In Intl. Symp. Softw. Test. Anal., pages 97–106, July 2002.
[133] W. Tichy. Should computer scientists experiment more? IEEE Computer, 31(5):32–40, May 1998.

[134] W. Tichy, P. Lukowicz, E. Heinz, and L. Prechelt. Experimental evaluation in computer science: a quantitative study. Journal of Systems and Software, 28(1):9–18, Jan. 1995.
[135] F. I. Vokolos and P. G. Frankl. Empirical evaluation of the textual differencing regression testing technique. In Intl. Conf. Softw. Maint., pages 44–53, Nov. 1998.
[136] A. von Mayrhauser and N. Zhang. Automated regression testing using DBT and Sleuth. Journal of Software Maintenance, 11(2):93–116, 1999.
[137] S. Wagner. A model and sensitivity analysis of the quality economics of defect-detection techniques. In Proceedings of the International Symposium on Software Testing and Analysis, July 2006.
[138] D. Wells. Extreme Programming: A Gentle Introduction. http://www.extremeprogramming.org, Jan. 2003.

[139] L. White and H. Leung. A firewall concept for both control-flow and data-flow in regression integration testing. In Proc. Conf. Softw. Maint., pages 262–270, Nov. 1992.

[140] C. Wohlin, P. Runeson, M. Host, M. Ohlsson, B. Regnell, and A. Wesslen. Experimentation in software engineering: an introduction. Kluwer Academic Publishers, 2000.
[141] W. Wong, J. Horgan, S. London, and H. Agrawal. A study of effective regression testing in practice. In Intl. Symp. Softw. Rel. Engr., pages 230–238, Nov. 1997.
[142] W. E. Wong, J. R. Horgan, A. P. Mathur, and A. Pasquini. Test set size minimization and fault detection effectiveness: A case study in a space application. In Proceedings of the 21st Annual International Computer Software & Applications Conference, pages 522–528, Aug. 1997.
[143] http://xml.apache.org/security.
[144] M. Zelkowitz and D. Wallace. Experimental models for validating technology. IEEE Computer, pages 23–31, May 1998.
