
Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 2, January 7, 2015
Patrick Lam
version 1

Faults, Errors, and Failures


For this course, we are going to define the following terminology.
Fault (also known as a bug): a static defect in software, i.e. incorrect lines of code.
Error: an incorrect internal state, not necessarily observed yet.
Failure: external, incorrect behaviour with respect to the expected behaviour; it must be
visible (e.g. EPIC FAIL).
These terms are not used consistently in the literature. Don't get stuck on memorizing them.
Motivating Example. Here's a train-tracks analogy.

(all railroad pictures inspired by: Bernd Bruegge & Allen H. Dutoit, Object Oriented Software
Engineering: Using UML, Patterns and Java.)
Is it a failure? An error? A fault? Clearly, it's not right, but no failure has occurred yet; there
is no behaviour. I'd also say that nothing analogous to execution has occurred yet either. If there
were a train on the tracks, pre-derailment, then there would be an error. That picture most closely
corresponds to a fault.

Perhaps it was caused by mechanical stresses.

Or maybe it was caused by poor design.


Software-related Example. Let's get back to software and consider the code from last time:

public static int numZero(int[] x) {
    // effects: if x is null, throw NullPointerException
    //          otherwise, return number of occurrences of 0 in x.
    int count = 0;
    for (int i = 1; i < x.length; i++) {
        // program point (*)
        if (x[i] == 0) count++;
    }
    return count;
}

As we saw, it has a fault (independent of whether it is executed or not): it's supposed to return the
number of 0s, but it doesn't always do so. We define the state for this method to be the variables
x, i, count, and the program counter (PC).
Feeding this numZero the input {2, 7, 0} exhibits a wrong state.
The wrong state is as follows: x = {2, 7, 0}, i = 1, count = 0, PC = (*), on the first time
around the loop.
The expected state is: x = {2, 7, 0}, i = 0, count = 0, PC = (*).
However, running numZero on {2, 7, 0} executes the fault and causes a (transient) error state,
but doesn't result in a failure, as the output value count is 1, as expected.
On the other hand, running numZero on {0, 2, 7} causes an error state with count = 0 on return,
hence leading to a failure.
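To make the fault/error/failure distinction concrete, here is a sketch in Python (my hypothetical port of the Java numZero, with the off-by-one fault preserved):

```python
def num_zero(x):
    """Count occurrences of 0 in x; a faulty port of numZero."""
    count = 0
    # fault: the loop should start at index 0, not 1
    for i in range(1, len(x)):
        if x[i] == 0:
            count += 1
    return count

# The fault executes on both inputs, producing an error state each time,
# but only the second input propagates the error to the output (a failure).
print(num_zero([2, 7, 0]))  # 1: correct output despite the error state
print(num_zero([0, 2, 7]))  # 0: should be 1, so this run is a failure
```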

RIP Fault Model


To get from a fault to a failure:
1. The fault must be reachable (Reachability);
2. the program state subsequent to reaching the fault must be incorrect (Infection); and
3. the infected state must propagate to the output to cause a visible failure (Propagation).
Applications of the RIP model: automatic generation of test data, mutation testing.

Dealing with Faults, Errors and Failures


Three strategies for dealing with faults are avoidance, detection and tolerance. Or, you can just
try to declare that the fault is not a bug, if the specification is ambiguous.
Fault Avoidance. Certain faults can simply be avoided by not programming in vulnerable
languages; buffer overflows, for instance, are impossible in Java. Better system design can also
help avoid faults, for instance by making an error state unreachable.
Fault Detection. Testing (construed broadly) is the primary means of fault detection. Software
verification also qualifies. Once you have detected a fault, you might repair it, if doing so is
economically viable.

Fault Tolerance. You are never going to remove all of the bugs, and some errors arise from
conditions beyond your control (such as hardware faults). It's worthwhile to tolerate faults too.
Strategies include redundancy and isolation. An example of redundancy is provisioning extra
hardware in case a server goes down. Isolation includes things as simple as checking preconditions.

Testing vs Debugging
Recall from last time:
Testing: evaluating software by observing its execution.
Debugging: finding (and fixing) a fault given a failure.
I said that you really need to automate your tests. But even so, testing is still hard: only certain
inputs expose the fault in the form of a failure. As you've experienced, debugging is hard too: you
have the failure, but you have to find the fault.
Contrived example. Consider the following code:
if (x - 100 <= 0)
    if (y - 100 <= 0)
        if (x + y - 200 == 0)
            crash();

Only one input, x = 100 and y = 100, will trigger the crash. If you're just going to do a random
brute-force search over all pairs of 32-bit integers, you are never going to find the crash.
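A quick simulation backs this up. The sketch below (mine, not from the notes) runs a seeded random search over 32-bit integer pairs; the chance of hitting exactly (100, 100) is about 2^-64 per trial, so the search comes up empty:

```python
import random

def crashes(x, y):
    # the guarded crash above: only x == 100, y == 100 reaches crash()
    return x - 100 <= 0 and y - 100 <= 0 and x + y - 200 == 0

random.seed(0)  # make the experiment reproducible
LO, HI = -2**31, 2**31 - 1
found = any(crashes(random.randint(LO, HI), random.randint(LO, HI))
            for _ in range(200_000))
print(found)  # False: random brute force misses the single crashing input
```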

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 3, January 9, 2015
Patrick Lam
version 2

(We saw the numZero example today. That example is in the L02 lecture notes.)
Exercise: findLast.
Here's a faulty program from last time, again.

1 static public int findLast(int[] x, int y) {
2     /* bug: loop test should be >= */
3     for (int i = x.length - 1; i > 0; i--) {
4         if (x[i] == y) {
5             return i;
6         }
7     }
8     return -1;
9 }
On the assignment, you will have a question like this:
(a) Identify the fault, and fix it.
(b) If possible, identify a test case that does not execute the fault.
(c) If possible, identify a test case that executes the fault, but does not result in an error state.
(d) If possible, identify a test case that results in an error, but not a failure. (Hint: PC)
(e) For the given test case, identify the first error state. Be sure to describe the complete state.
I asked you to work on that question in class. (I noticed that most people were on-topic, thanks!)
Here are some answers:
(a) The loop condition must include index 0: i >= 0.
(b) This is a bit of a trick question. To avoid executing the fault, you must never evaluate the
loop condition. The only way to do that is to pass in x = null: evaluating x.length then throws
the exception before the loop test runs. You should also state, for instance, y = 3.
The expected output is a NullPointerException, which is also the actual output.
(c) Inputs where y appears in the second or later position execute the fault but do not result in
the error state; nor do inputs where x is an empty array. (There may be other inputs as well.)
So, a concrete input: x = {2, 3, 5}; y = 3. Expected output = actual output = 1.
(d) An input that leads to an error but not a failure is one where y does not occur in x. That results
in an incorrect PC after the final executed iteration of the loop.

(e) Take a test case from (d), say x = {2, 3, 5}; y = 4. After the iteration with i = 1 completes
(line 6), the decrement occurs, followed by the evaluation of i > 0, causing the PC to exit the
loop (statement 8) instead of returning to statement 4. The faulty state is x = {2, 3, 5}; y = 4;
i = 0; PC = 8, while the correct state would be PC = 4.
Someone asked about distinguishing errors from failures. These questions were about failures at
the method level, and so a wrong return value is a failure when we're asking about methods.
In the context of a bigger program, that return value might not be visible to the user, and so it
might not constitute a failure, just an error.

Line Intersections
We then talked about different ways of validating a test suite. Consider the following Python code,
found by Michael Thiessen on stackoverflow (http://stackoverflow.com/questions/306316/
determine-if-two-rectangles-overlap-each-other).

class LineSegment:
    def __init__(self, x1, x2):
        self.x1 = x1; self.x2 = x2

def intersect(a, b):
    return (a.x1 < b.x2) & (a.x2 > b.x1)
We could construct test suites that:
- execute every statement in intersect (node coverage, as we'll see later). Well, that's not very
useful; any old test case will do that. There are no branches, so what we'll call edge coverage
doesn't help either.
- feed random inputs to intersect; unfortunately, interesting behaviours are not random, so
that won't help much in general.
- check all outputs of intersect (i.e. a test case with lines that intersect and one with lines
that don't intersect): we're getting somewhere; that will certify that the method works in
some cases, but it's easy to think of situations that we missed.
- check different values of the clauses a.x1 < b.x2 and a.x2 > b.x1 (logic coverage): better than
the above coverage criteria, but still misses interesting behaviours;
- analyze possible inputs and cover all interesting combinations of inputs (input space coverage):
can create an exhaustive test suite, if your analysis is sound.
Let's try to prove correctness of intersect. There's an old saying about testing: supposedly, it
can only show the presence of bugs, not their absence. This is not completely true, especially if you
have reliable software that automatically constructs an exhaustive test suite. But that is beyond
the state of the practice in 2015, for the most part.
[Below is a cleaner presentation than the one in class. It also has the advantage of being not wrong.]

Inputs to intersect. There are essentially four inputs to this function. Rename them a, A, b, B,
for a.x1, a.x2, b.x1, b.x2.
- Let's first assume that all points are distinct. We should make a note to ourselves to check
violations of this as well: we may have a = A, a = b, a = B, and symmetrically for b:
b = a, b = A, b = B.
- For the purpose of the analysis, let's assume that a < b; when constructing test cases, we can
swap a and b around. That's why there are duplicate assert statements below.
- Without loss of generality, we can assume that a < A and b < B. (We ought to update the
constructor if we want to make that assumption.)
With these assumptions, we have to test the three possible orderings aAbB, abAB, and abBA.
It is simple to construct test cases for these orderings, using Python's unittest framework:

def test_aAbB(self):
    a = LineSegment(0, 2)
    b = LineSegment(3, 7)
    self.assertFalse(intersect(a, b))
    self.assertFalse(intersect(b, a))

def test_abAB(self):
    a = LineSegment(0, 4)
    b = LineSegment(3, 7)
    self.assertTrue(intersect(a, b))
    self.assertTrue(intersect(b, a))

def test_abBA(self):
    a = LineSegment(0, 4)
    b = LineSegment(1, 2)
    self.assertTrue(intersect(a, b))
    self.assertTrue(intersect(b, a))
Those test cases pass. However, if you construct test cases for equality (as I've committed to the
repository), you see that the given intersect function fails on line segments that intersect only at
a point. Replacing < with <= and > with >= fixes the code.
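As a sketch of the repaired version (my reconstruction of the fix described above, not the committed code):

```python
class LineSegment:
    def __init__(self, x1, x2):
        self.x1 = x1
        self.x2 = x2

def intersect(a, b):
    # <= and >= so that segments touching at a single point count as intersecting
    return (a.x1 <= b.x2) and (a.x2 >= b.x1)

print(intersect(LineSegment(0, 2), LineSegment(2, 5)))  # True: touch at x = 2
print(intersect(LineSegment(0, 2), LineSegment(3, 7)))  # False: disjoint
```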

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 4, January 12, 2015
Patrick Lam
version 1

About Testing
We can look at testing statically or dynamically.
Static testing (ahead-of-time): this includes static analysis, which is typically automated and runs
at compile time (or, say, nightly), as well as human-driven static testing: walk-throughs (informal)
and code inspection (formal).
Dynamic testing (at run-time): observe program behaviour by executing it; this includes black-box
testing and white-box testing.
Usually the word "testing" means dynamic testing.
Naughty words. People like to talk about "complete testing", "exhaustive testing", and "full
coverage". However, for many systems, the number of potential inputs is infinite. It's therefore
impossible to completely test a nontrivial system, i.e. run it on all possible inputs. There are both
practical limitations (time and cost) and theoretical limitations (e.g. the halting problem).
In the absence of complete testing, we will define testing criteria and evaluate test suites with them.

Test cases
Informally, a test case contains:
- what you feed to the software; and
- what the software should output in response.
Here are two definitions to help evaluate how hard it might be to create test cases.
Definition 1 Observability is how easy it is to observe the system's behaviour, e.g. its outputs and
effects on the environment, hardware and software.
Definition 2 Controllability is how easy it is to provide the system with needed inputs and to get
the system into the right state.

Anatomy of a Test Case


Consider testing a cellphone from the off state:

    ⟨on⟩             prefix values
    1 519 888 4567   test case values
    ⟨talk⟩           verification values
    ⟨end⟩            exit commands

(The verification values and exit commands together form the postfix values.)
Definition 3
- Test case values: input values necessary to complete some execution of the software
(often called the test case itself).
- Expected results: the result that will be produced iff the program satisfies its intended behaviour on a test case.
- Prefix values: inputs to prepare the software for the test case values.
- Postfix values: inputs for the software after the test case values; these subdivide into:
  - verification values: inputs to show the results of the test case values; and
  - exit commands: inputs to terminate the program or return it to an initial state.
Definition 4
- Test case: the test case values, expected results, prefix values, and postfix values necessary to
evaluate the software under test.
- Test set: a set of test cases.
- Executable test script: a test case prepared in a form that can be executed automatically and that
generates a report.
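The pieces of Definitions 3 and 4 can be captured in a small data structure. This is just an illustrative sketch; the field names and the sample expected result are mine, not from the notes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TestCase:
    prefix_values: List[str]        # e.g. ["<on>"]
    test_case_values: List[str]     # e.g. ["1 519 888 4567"]
    expected_results: List[str]     # what a correct implementation should produce
    verification_values: List[str]  # e.g. ["<talk>"]
    exit_commands: List[str]        # e.g. ["<end>"]

tc = TestCase(prefix_values=["<on>"],
              test_case_values=["1 519 888 4567"],
              expected_results=["call connects"],  # hypothetical expected result
              verification_values=["<talk>"],
              exit_commands=["<end>"])
print(tc.test_case_values)
```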

On Coverage
Ideally, we'd run the program on the whole input space and find the bugs. Unfortunately, such a plan
is usually infeasible: there are too many potential inputs.
Key Idea: Coverage. Find a reduced space and cover that space.
We hope that covering the reduced space is going to be more thorough than arbitrarily creating
test cases. It at least tells us when we can plausibly stop testing.
The following definition helps us evaluate coverage.
Definition 5 A test requirement is a specific element of a (software) artifact that a test case must
satisfy or cover.

We write TR for a set of test requirements; a test set may cover a set of TRs.
For instance, consider three ice cream cone flavours: vanilla, chocolate and mint. A possible test
requirement would be to test one chocolate cone. (Volunteers?)
Two software examples:
- cover all decisions in a program (branch coverage); each decision gives two test requirements:
branch is true; branch is false.
- each method must be called at least once; each method gives one test requirement.
Definition 6 A coverage criterion is a rule or collection of rules that impose test requirements on
a test set.
A test set may or may not satisfy a coverage criterion. The coverage criterion gives a recipe for
generating TRs systematically.
Returning to the ice cream example, a flavour criterion might be cover all flavours, and that
would generate three TRs: {flavour: chocolate, flavour: vanilla, flavour: mint}.
We can test an ice cream stand by running two test sets on it, for instance: test set 1 includes 3
chocolate cones and 1 vanilla cone, while test set 2 includes 1 chocolate cone, 1 vanilla cone, and 1
mint cone.
Definition 7 (Coverage). Given a set of test requirements TR for a coverage criterion C, a test
set T satisfies C iff for every test requirement tr ∈ TR, at least one t ∈ T satisfies tr.
Infeasible Test Requirements. Sometimes, no test case will satisfy a test requirement. For
instance, dead code can make statement coverage infeasible, e.g.:
if (false)
unreachableCall();
or, a real example from the Linux kernel:
while (0)
{local_irq_disable();}
Hence, a criterion which says test every statement is going to be infeasible for many programs.
Quantifying Coverage. How good is a test set? It's great if it covers everything, but sometimes
that's impossible. We can instead assign a number.
Definition 8 (Coverage Level). Given a set of test requirements TR and a test set T , the coverage
level is the ratio of the number of test requirements satisfied by T to the size of TR.
Returning to our example, say TR = {flavour: chocolate, flavour: vanilla, flavour: mint}, and test
set T1 contains {3 chocolate, 1 vanilla}, then the coverage level is 2/3 or about 67%.
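The coverage-level computation is a one-liner; here is a sketch using the ice cream TRs (the function name is mine):

```python
def coverage_level(tr, covered):
    # fraction of the test requirements satisfied by the test set
    return len(tr & covered) / len(tr)

TR = {"chocolate", "vanilla", "mint"}
T1 = ["chocolate", "chocolate", "chocolate", "vanilla"]  # 3 chocolate, 1 vanilla
print(coverage_level(TR, set(T1)))  # 2/3, about 0.67
```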

Subsumption
Sometimes one coverage criterion is strictly more powerful than another: any test set that
satisfies C1 automatically satisfies C2.
Definition 9 Criteria subsumption: coverage criterion C1 subsumes C2 iff every test set that
satisfies C1 also satisfies C2 .
Software example: branch coverage (Edge Coverage) subsumes statement coverage (Node Coverage). Which is stronger?
Evaluating coverage criteria. Subsumption is a rough guide for comparing criteria, but it's
hard to use in practice. Consider also:
1. difficulty of generating test requirements;
2. difficulty of generating tests;
3. how well tests reveal faults.

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 5, January 14, 2015
Patrick Lam
version 0

Graph Coverage
We will discuss graph coverage in great detail. Many forms of software testing reduce to graph
coverage, so once you understand how graph coverage works, you will have a good understanding
of a key software testing topic.
Definition of Graphs. (You should already be familiar with this material, but let's establish
the terms we'll use in this class.) Let's start with an example graph.
Consider a graph G with:

    N : the set of nodes, here {A, B, C, D, E}
    N0 : the set of initial nodes, here {A}
    Nf : the set of final nodes, here {E}
    E ⊆ N × N : the edges, e.g. (A, B) and (C, E);
        C is the predecessor and E is the successor in (C, E).

Subgraph: let G′ be a subgraph of G; then the nodes of G′ must be a subset Nsub of N. The
initial nodes of G′ are N0 ∩ Nsub and its final nodes are Nf ∩ Nsub. The edges of G′ are
E ∩ (Nsub × Nsub).
For example, consider the case where we set Nsub = {A, B, E}. This induces the subgraph with
nodes A, B, E and whichever of G's edges join them.

Note that graphs need not be connected.


Paths. The most important thing about a graph, for testing purposes, is the paths through the
graph. Here is a graph G# and some example paths through it:
- path 1: [2, 3, 5], with length 2.
- path 2: [1, 2, 3, 5, 6, 2], with length 5.
- not a path: [1, 2, 5].
We say that path 1 is from node 2 to node 5. We can also say that path 1 is from edge (2, 3) to
(3, 5).
Definition 1 A path is a sequence of nodes from a graph G whose adjacent pairs all belong to the
set of edges E of G.
Note that length 0 paths are still paths.
Definition 2 A subpath is a subsequence of a path.
This textbook definition is ambiguous: is [1, 2, 5] a subpath of [1, 2, 3, 5, 6, 2]?
Definition 3 A subsequence is a sequence that can be derived from another sequence by deleting
some elements without changing the order of the remaining elements.
Yes, [1, 2, 5] is a subsequence of [1, 2, 3, 5, 6, 2]. However, [1, 2, 5] is not a path, so we are going to
say that it is not a subpath.
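In code, the distinction is that a subpath must be a subsequence which is itself also a path. A sketch (the edge set for G# is my reconstruction from the example paths above, so treat it as an assumption):

```python
def is_subsequence(sub, seq):
    it = iter(seq)
    return all(x in it for x in sub)  # 'x in it' advances the iterator

def is_path(p, edges):
    # every adjacent pair of nodes must be an edge
    return all((u, v) in edges for u, v in zip(p, p[1:]))

def is_subpath(sub, path, edges):
    # the lecture's resolution: a subpath is a subsequence that is a path
    return is_subsequence(sub, path) and is_path(sub, edges)

G = {(1, 2), (2, 3), (3, 5), (5, 6), (6, 2), (6, 7)}  # assumed edges of G#
print(is_subsequence([1, 2, 5], [1, 2, 3, 5, 6, 2]))  # True
print(is_subpath([1, 2, 5], [1, 2, 3, 5, 6, 2], G))   # False: (2, 5) is no edge
```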

Test cases and test paths. Some paths are also test paths. Here is a test path from G#:
[1, 2, 3, 5, 6, 7];
here is another one:
[1, 2, 3, 5, 6, 2, 3, 5, 6, 7].
You can easily come up with more paths. Now, test paths are linked to test cases. First, let's define
the notion of a test path.

Definition 4 A test path is a path p (possibly of length 0) that starts at some node in N0 and
ends at some node in Nf .
Running a test case on the program or method yields one or more test paths. A test path may
represent many test cases (for instance, if the program takes the same branches on all of those test
cases); or a test path may represent no test cases at all (if it is infeasible).
Paths and semantics. When a graph is a program's control-flow graph, some of the paths in
the graph may not correspond to the program's semantics. Consider the CFG for a statement
if (false) { ... }: the node for the branch body is in the graph, but it will clearly never execute.
In this course, we will generally only talk about the syntax of a graph, i.e. its nodes and edges,
and not its semantics.
However, in the following definition, we'll talk about both notions.
However, in the following definition, well talk about both notions.
Definition 5 A node n (or edge e) is syntactically reachable from ni if there exists a path from
ni to n (or e). A node n (or edge e) is semantically reachable from ni if one of the paths from ni
to n (or e) can actually be executed on some input.
Standard graph algorithms, like breadth-first search and depth-first search, can compute syntactic
reachability. (Semantic reachability is undecidable; no algorithm can precisely compute semantic
reachability for all programs.)
We define reach_G(x) as the portion of the graph syntactically reachable from x. (Here x might
be a node, an edge, a set of nodes, or a set of edges.) For example:
- reach_G(N0) is the set of nodes and edges reachable from the initial node(s);
- reach_G#(2) in the above graph G# is {2, 3, 4, 5, 6, 7};
- reach_G#(7) is {7}.
Note that we include x in the set of nodes reachable from x, because paths of length 0 are paths.
When we talk about the nodes or edges in a graph G in a coverage criterion, we'll generally mean
reach_G(N0); the unreachable nodes tend to (1) be uninteresting and (2) frustrate coverage
criteria.
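A sketch of computing reach_G with breadth-first search (the graph at the bottom is a made-up example, not G#):

```python
from collections import deque

def reach(edges, start):
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    seen = set(start)   # length-0 paths: every node reaches itself
    work = deque(start)
    while work:
        n = work.popleft()
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                work.append(m)
    return seen

edges = [(0, 1), (1, 2), (3, 4)]  # hypothetical graph; 3 is unreachable from 0
print(reach(edges, {0}))  # {0, 1, 2}
```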

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 6, January 16, 2015
Patrick Lam
version 1

Some binary distinctions


Let's digress for a bit and define some older terms which we won't use much in this course, but
which we should discuss briefly.
Black-box testing: deriving tests from external descriptions of the software: specifications,
requirements, designs; anything but the code.
White-box testing: deriving tests from the source code, e.g. branches, conditions, statements.
Our model-based approach makes this distinction less important.

Test paths and cases


We resume our graph coverage content with the following definition:
Definition 1 A graph is single-entry/single-exit (SESE) if N0 and Nf have exactly one element
each. Nf must be reachable from every node in N , and no node in N \ Nf may be reachable from
Nf , unless N0 = Nf .
The graphs that we'll be talking about in this course will almost always be SESE.
Here's another example of a graph, which happens to be SESE, along with test paths in that graph.
We'll call this graph D, for double diamond, and it'll come up a few times.

D has nodes n0 through n6, with N0 = {n0}, Nf = {n6}, and edges (n0, n1), (n0, n2), (n1, n3),
(n2, n3), (n3, n4), (n3, n5), (n4, n6), (n5, n6).
Here are the four test paths in D:
[n0, n1, n3, n4, n6]
[n0, n1, n3, n5, n6]
[n0, n2, n3, n4, n6]
[n0, n2, n3, n5, n6]
We next focus on the path p = [n0, n1, n3, n4, n6] and use it to explain several path-related
definitions. We can say that p visits node n3 and edge (n0, n1); we write n3 ∈ p and (n0, n1) ∈ p
respectively.
Let p′ = [n1, n3, n4]. Then p′ is a subpath of p.
Test cases and test paths. We connect test cases and test paths with a mapping path_G from
test cases to test paths; e.g. path_G(t) is the set of test paths corresponding to test case t.
- Usually we just write path, since G is obvious from the context.
- We can lift the definition of path to test sets T by defining path(T) = {path(t) | t ∈ T}.
- Each test case gives at least one test path. If the software is deterministic, then each test case
gives exactly one test path; otherwise, multiple test paths may arise from one test case.
Here's an example of deterministic and nondeterministic control-flow graphs: a branch on
if (x > 3), leading to x = 0 on one side and x = 5 on the other, is deterministic; a branch on
if (x.hashCode() > 3) is not.
Causes of nondeterminism include dependence on inputs; on the thread scheduler; and on memory
addresses, for instance as seen in calls to the default Java hashCode() implementation.
Nondeterminism makes it hard to check test case output, since more than one output might be a
valid result of a single test input.
Indirection. Note that we will describe coverage criteria with respect to test paths, but we always
run test cases.

Example. Here is a short method, the associated control-flow graph, and some test cases and
test paths.

int foo(int x) {
    if (x < 5) {
        x++;
    } else {
        x--;
    }
    return x;
}

The CFG has nodes (1) if (x < 5); (2) x++; (3) x--; and (4) return x, with a T edge from (1)
to (2), an F edge from (1) to (3), and edges from (2) and (3) to (4).

Test case: x = 5; test path: [(1), (3), (4)].


Test case: x = 2; test path: [(1), (2), (4)].
Note that (1) we can deduce properties of the test case from the test path; and (2) in this example,
since our method is deterministic, the test case determines the test path.

Graph Coverage
Having defined all of the graph notions we'll need for now, we apply them to graph coverage.
Recall our previous definition of coverage:
Definition 2 Given a set of test requirements TR for a coverage criterion C, a test set T satisfies
C iff for every test requirement tr in TR, at least one t in T exists such that t satisfies tr.
We apply this definition to graph coverage:
Definition 3 Given a set of test requirements TR for a graph criterion C, a test set T satisfies C
on graph G iff for every test requirement tr in TR, at least one test path p in path(T ) exists such
that p satisfies tr.
We'll use this notion to define a number of standard testing coverage criteria. (At this point, the
textbook defines predicates, but mostly ignores them afterwards. I'll just ignore them right away.)
Recall the double-diamond graph D from earlier. For the node coverage criterion, we get the
following test requirements:
{n0, n1, n2, n3, n4, n5, n6}
That is, any test set T which satisfies node coverage on D must include test cases t; the cases t
give rise to test paths path(t), and some path must include each node from n0 to n6. (No single
path need include all of these nodes; the requirement applies to the set of test paths.)
Let's formally define node coverage.

Definition 4 Node coverage: For each node n ∈ reach_G(N0), TR contains a requirement to visit
node n.
We will state all of the coverage criteria in the following form:
Criterion 1 Node Coverage (NC): TR contains each reachable node in G.
We can then write
TR = {n0 , n1 , n2 , n3 , n4 , n5 , n6 }.
Let's consider an example of a test set which satisfies node coverage on D, the double-diamond
graph from last time.
Start with a test case t1 ; assume that executing t1 gives the test path
path(t1 ) = p1 = [n0 , n1 , n3 , n4 , n6 ].
Then test set {t1 } does not give node coverage on D, because no test case covers node n2 or n5 . If
we can find a test case t2 with test path
path(t2 ) = p2 = [n0 , n2 , n3 , n5 , n6 ],
then the test set T = {t1 , t2 } satisfies node coverage on D.
What is another test set which satisfies node coverage on D?
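Checking node coverage mechanically is straightforward. Here is a sketch over D, writing nodes n0 through n6 as the integers 0 through 6:

```python
def satisfies_node_coverage(test_paths, nodes):
    # every node must be visited by some test path in the set
    visited = {n for p in test_paths for n in p}
    return set(nodes) <= visited

p1 = [0, 1, 3, 4, 6]  # path(t1)
p2 = [0, 2, 3, 5, 6]  # path(t2)
print(satisfies_node_coverage([p1], range(7)))      # False: misses n2 and n5
print(satisfies_node_coverage([p1, p2], range(7)))  # True
```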

Here is a more verbose definition of node coverage.


Definition 5 Test set T satisfies node coverage on graph G if and only if for every syntactically
reachable node n ∈ N, there is some path p in path(T) such that p visits n.
A second standard criterion is that of edge coverage.
Criterion 2 Edge Coverage (EC). TR contains each reachable path of length up to 1, inclusive,
in G.
We describe edge coverage this way so that, as far as possible, new criteria in a series will subsume
previous criteria.
Here are some examples of paths of length up to 1: a single node is a path of length 0, and an edge
(together with its endpoints) is a path of length 1.
Note that since we're not talking about test paths, these reachable paths need not start in N0.
In general, paths of length up to 1 consist of nodes and edges. (Why not just say edges?)
On a graph containing an isolated node, i.e. one with no incident edges, saying "edges" would not
be the same as saying "paths of length up to 1": only the latter includes the isolated node.

Here is a more involved example: nodes n0, n1, n2, with an edge from n0 to n1 labelled x < y, an
edge from n0 to n2 labelled x ≥ y, and an edge from n1 to n2.
Let's define
    path(t1) = [n0, n1, n2]
    path(t2) = [n0, n2]
Then
    T1 = {t1} satisfies node coverage (every node appears in path(t1)), and
    T2 = {t1, t2} satisfies edge coverage (t1 alone misses the edge (n0, n2)).

Going beyond length 1. So far we've seen length 0 (node coverage) and length 1 (edge coverage).
Of course, we can go to length 2, etc., but we quickly get diminishing returns. Here is the criterion
for length 2.
Criterion 3 Edge-Pair Coverage. (EPC) TR contains each reachable path of length up to 2,
inclusive, in G.
Here's an example, filled in during class:
nodes:
edges:
paths of length 2:
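The TRs for NC, EC and EPC are just the reachable paths of length up to 0, 1 and 2 respectively. A sketch that enumerates them for a small hypothetical graph (not the one drawn in class):

```python
def paths_up_to(nodes, edges, k):
    # all paths with at most k edges; length-0 paths are single nodes
    paths = [[n] for n in nodes]
    frontier = [[n] for n in nodes]
    for _ in range(k):
        frontier = [p + [v] for p in frontier for (u, v) in edges if u == p[-1]]
        paths.extend(frontier)
    return paths

nodes = [1, 2, 3]
edges = [(1, 2), (2, 3), (3, 1)]  # hypothetical 3-cycle
trs = paths_up_to(nodes, edges, 2)
print(len(trs))  # 9: three nodes, three edges, three paths of length 2
```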

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 7, January 19, 2015
Patrick Lam
version 2

Further properties of paths


We've seen so far length-0 (NC), length-1 (EC) and length-2 (EPC) paths. We could keep on
going, but that gets us to Complete Path Coverage (CPC), which requires an infinite number of
test requirements.
Criterion 1 Complete Path Coverage. (CPC) TR contains all paths in G.
Note that CPC is impossible to achieve for graphs with loops.
Instead of going to CPC, we would like to capture the essence of all loops. To do so, we first set
up a few definitions:
Definition 1 A path is simple if no node appears more than once in the path, except that the first
and last nodes may be the same.
In the graphs above, some simple paths are:
but not:

Some properties of simple paths:
- no internal loops;
- we can bound their length;
- we can create any path by composing simple paths; and
- many simple paths exist (too many!)
Because there are so many simple paths, let's instead consider prime paths, which are simple paths
of maximal length.

For instance, in the graph drawn in class (with nodes including 3 and 4):
Simple paths:
Prime paths:
Definition 2 A path is prime if it is simple and does not appear as a proper subpath of any other
simple path.
Criterion 2 Prime Path Coverage. (PPC) TR contains each prime path in G.
There is a problem with using PPC as a coverage criterion: a prime path may be infeasible even
though it contains feasible simple paths.
Example:
One could replace infeasible prime paths in TR with feasible subpaths, but we won't bother.
Beyond CFGs. Let's now move beyond control-flow graphs and think about a different type of
graph. The next few graphs represent finite state machines rather than control-flow graphs. Our
motivation will be to set up criteria that visit round trips in cyclic graphs.
One example is a finite state machine for file operations, with states n1, n2, n3 and transitions
open, read, and close. Or perhaps a sockets API, with states q1 through q7 and transitions
socket, bind, listen, accept, send, closeS, and closeC.

Prime paths apply to both CFGs and other graphs. The next criteria are mostly not for CFGs.
Definition 3 A round trip path is a prime path of nonzero length that starts and ends at the same
node.
Criterion 3 Simple Round Trip Coverage. (SRTC) TR contains at least one round-trip path
for each reachable node in G that begins and ends a round-trip path.
Criterion 4 Complete Round Trip Coverage. (CRTC) TR contains all round-trip paths for
each reachable node in G.
Exercise: Computing Prime Paths. Last time, we saw this graph:
I recommend computing all of the simple paths for this example. (There are 53 simple paths, with
length up to 5, and 12 prime paths.) Hint: write out the simple paths of length up to N. To get
the simple paths of length N + 1, start with the paths of length N and extend the ones that are
extendable; the non-extendable paths are prime, so mark them as you go along.
Here's another graph: N = {0, 1, 2, 3, 4, 5, 6}, N0 = {0}, Nf = {6}, with edges (0, 1), (0, 2),
(1, 2), (2, 3), (2, 4), (3, 5), (3, 6), (4, 6), (5, 3).
This graph has 38 simple paths and 9 prime paths:
[0, 1, 2, 3, 6], [0, 1, 2, 4, 6], [0, 2, 3, 6], [0, 2, 4, 6], [0, 1, 2, 3, 5], [0, 2, 3, 5], [5, 3, 6], [3, 5, 3], [5, 3, 5].
To compute the prime paths, we could also enumerate all of the simple paths and note the
non-extendable paths and cycles (which are both prime). In doing so, one would also get all of the
information necessary to write out the test requirements for NC, EC and EPC. Since there is a
loop, it is impossible to explicitly write out all test requirements for CPC, but one could write out
some of them.
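The hint above translates directly into code. Below is a sketch that grows simple paths breadth-first and keeps those that are not proper subpaths of any other simple path; the edge set is my reconstruction from the nine prime paths listed for the second graph:

```python
def simple_and_prime_paths(nodes, edges):
    succ = {n: [v for (u, v) in edges if u == n] for n in nodes}

    def simple(p):
        if len(set(p)) == len(p):
            return True
        # the first and last nodes may coincide (a cycle)
        return p[0] == p[-1] and len(set(p[:-1])) == len(p) - 1

    # grow simple paths breadth-first, starting from length-0 paths
    simple_paths = [[n] for n in nodes]
    frontier = [[n] for n in nodes]
    while frontier:
        frontier = [p + [v] for p in frontier for v in succ[p[-1]]
                    if simple(p + [v])]
        simple_paths.extend(frontier)

    def subpath(p, q):
        # contiguous occurrence of p inside q
        return any(q[i:i + len(p)] == p for i in range(len(q) - len(p) + 1))

    primes = [p for p in simple_paths
              if not any(p != q and subpath(p, q) for q in simple_paths)]
    return simple_paths, primes

nodes = range(7)
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 5), (3, 6), (4, 6), (5, 3)]
sp, pp = simple_and_prime_paths(nodes, edges)
print(len(sp), len(pp))  # 38 9, matching the counts above
```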

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 8, January 21, 2015
Patrick Lam
version 1

Another coverage criterion that I've mentioned, but which is not in the notes:
Criterion 1 Specified Path Coverage. (SPC) TR contains a specified set S of paths.
Specified path coverage might be useful for encoding a set of usage scenarios.
Prime Path Coverage versus Complete Path Coverage. First, consider a graph with nodes
n0, n1, n2, n3, as drawn in class:
Prime paths:
path(t1) =
path(t2) =
T1 = {t1, t2} satisfies both PPC and CPC.
Next, consider a graph with nodes q0 through q4, as drawn in class:
Prime paths:
path(t3) =
path(t4) =
T2 = {t3, t4} satisfies PPC but not CPC.
Now, we'll see another graph-theory-inspired coverage criterion. First, some definitions:

Definition 1 A graph G is connected if every node in G is reachable from the set of initial nodes
N0 .
Definition 2 An edge e is a bridge for a graph G if G is connected, but removing e from G results
in a disconnected graph.
(This is similar to an articulation point in a graph, except that articulation points are nodes, not
edges.)
Criterion 2 Bridge Coverage. (BC) TR contains all bridges.
Assume that a graph contains at least two nodes, and all nodes in a graph are reachable from
the initial nodes. Does bridge coverage subsume node coverage? Justify your answer, providing a
counterexample if appropriate.
Specifying versus meeting test requirements. Consider the graph with nodes n0, n1, n2, n3
drawn in class.
The following simple (and loop-free) path is, in fact, prime:
p =
PPC includes this path as a test requirement. The test path
meets the test requirement induced by p even though that test path is not prime. Note that a test
path may satisfy a prime path test requirement even though the test path itself is not prime.
Graph Coverage Exercise. Consider the graph defined by the following sets.
N = {1, 2, 3, 4, 5, 6, 7}
N0 = {1}
Nf = {7}
E = {[1, 2], [1, 7], [2, 3], [2, 4], [3, 2], [4, 5], [4, 6], [5, 6], [6, 1]}
Also consider the following test paths:
t0 = [1, 2, 4, 5, 6, 1, 7]
t1 = [1, 2, 3, 2, 4, 6, 1, 7]
Answer the following questions.

(a) Draw the graph.


(b) List the test requirements for EPC. [hint: 12 requirements of length 2]
(c) Does the given set of test paths satisfy EPC? If not, identify what is missing.
(d) Consider the simple path [3, 2, 4, 5, 6] and test path [1, 2, 3, 2, 4, 6, 1, 2, 4, 5, 6, 1, 7]. Is the simple
path a subpath of the test path?
(e) List the test requirements for NC, EC, and PPC on this graph.
(f) List a test path that achieves NC but not EC on the graph.
(g) List a test path that achieves EC but not PPC on the graph.
(Note: We've talked about test sets meeting criteria in the past. For (f) and (g), we are simply
talking about a test set with one test case that induces the given test paths.)

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 9 January 23, 2015
Patrick Lam
version 1

Graph Coverage for Source Code


So far, we've seen a number of coverage criteria for graphs, but I've been vague about how to
actually construct graphs. For the most part, it's fairly obvious.

Structural Graph Coverage for Source Code


Fundamental graph for source code: Control-Flow Graph (CFG).
CFG nodes: zero or more statements;
CFG edges: an edge (s1 , s2 ) indicates that s1 may be followed by s2 in an execution.
Basic Blocks. We can simplify a CFG by grouping together statements which always execute
together (in sequential programs):

    1  x = 5
    2  z = 2
    3  q0: if (z < 17) goto q1
    4  z = z + 1
    5  print(x)
    6  goto q0
    7  q1: nop

[Figure: CFG with basic blocks 1 (L1-2), 2 (L3), 3 (L4-6), and 4 (L7); block 2's
false edge goes to block 4]
We use the following definition:
Definition 1 A basic block has one entry point and one exit point.
Note that a basic block may have multiple successors. However, there may not be any jumps into
the middle of a basic block (which is why the statement labelled q0 has its own basic block).
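This grouping can be computed mechanically with the classic "leaders" rule: the first instruction, every jump target, and every instruction immediately after a jump each begin a new basic block. A minimal sketch (my own illustration, with instructions modelled as indices; names hypothetical):

```java
import java.util.*;

// Sketch: compute basic-block leaders. jumpTarget[i] is the index instruction i
// may jump to (or -1 if none); isJump[i] marks (conditional or unconditional) jumps.
public class Leaders {
    public static SortedSet<Integer> leaders(int n, int[] jumpTarget, boolean[] isJump) {
        SortedSet<Integer> result = new TreeSet<>();
        result.add(0);                                         // first instruction
        for (int i = 0; i < n; i++) {
            if (jumpTarget[i] >= 0) result.add(jumpTarget[i]); // jump target
            if (isJump[i] && i + 1 < n) result.add(i + 1);     // instr after a jump
        }
        return result;
    }

    public static void main(String[] args) {
        // The 7-line example above, 0-indexed: instruction 2 is
        // "q0: if (z < 17) goto q1" (target 6), instruction 5 is "goto q0" (target 2).
        int[] target = {-1, -1, 6, -1, -1, 2, -1};
        boolean[] jump = {false, false, true, false, false, true, false};
        System.out.println(leaders(7, target, jump));  // prints [0, 2, 3, 6]
    }
}
```

Leaders {0, 2, 3, 6} give exactly the four blocks in the figure: lines 1-2, line 3, lines 4-6, and line 7 (1-indexed).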

Some Examples
Well now see how to construct control-flow graph fragments for various program constructs.

if statements: The book puts the conditions (and hence uses) on the control-flow edges, rather
than in the if node. I prefer putting the condition in the node.
    1  if (z < 17)
    2      print(x);
    3  else
    4      print(y);

[Figure: CFG with node 1 (L1) branching to node 2 (L2) and node 3 (L4)]

    1  if (z < 17)
    2      print(x);

[Figure: CFG with node 1 (L1); a false edge bypasses node 2 (L2)]
Short-circuit if evaluation is more complicated; I recommend working it out yourself.


(Recall that node coverage does not imply edge coverage.)
case / switch statements:
    1  switch (n) {
    2  case I: ...; break;
    3  case J: ...; /* fall thru */
    4  case K: ...; break;
    5  }
    6  // ...

[Figure: CFG with case-test nodes 1 (L1), 2 (L2), 3 (L3), 4 (L4); body nodes 5 (L2),
6 (L3), 7 (L4), with fall-through from node 6 to node 7; all paths reach 8 (L6)]
while statements:
    1  x = 0; y = 20;
    2  while (x < y) {
    3      x++; y--;
    4  }

[Figure: CFG with node 1 (x = 0; y = 20), loop-test node 2 (L2) with exit edges
labelled x < y and !(x < y), body node 3 (L3) looping back to node 2, and exit node 4]
Note that arbitrarily complicated structures may occur inside the loop body.
for statements:
    for (int i = 0; i < 57; i++) {
        if (i % 3 == 0) {
            print(i);
        }
    }

(an exercise for the reader!)

This example uses Java's enhanced for loop, which iterates over all of the elements in widgetList:

    for (Widget w : widgetList) {
        decorate(w);
    }
    // ...

I will accept the simplified CFG on the left or the more useful one on the right:

[Figure: left, a simplified CFG with nodes 1 (L1: w in widgetList), 2 (L2), and 3 (L4);
right, a CFG with it = wL.iterator(), a loop test it.hasNext(), a body
w = it.next(); decorate(w), and a false edge to node 3 (L4)]
All of these graphs admit the notions of node coverage (statement coverage, basic block coverage)
and edge coverage (branch coverage).
Larger example. You can draw a 7-node CFG for this program:

     1  /** Binary search for target in sorted subarray a[low..high] */
     2  int binary_search(int[] a, int low, int high, int target) {
     3      while (low <= high) {
     4          int middle = low + (high - low)/2;
     5          if (target < a[middle])
     6              high = middle - 1;
     7          else if (target > a[middle])
     8              low = middle + 1;
     9          else
    10              return middle;
    11      }
    12      return -1; /* not found in a[low..high] */
    13  }

[Figure: the CFG, with nodes 1: if (low <= high) (L3); 2 (L4-5); 4 (L6); 5 (L7);
6 (L9); 7: return middle (L10); and 8: return -1 (L12)]
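Once the CFG is drawn, branch coverage suggests concrete inputs. Here is a sketch of a test driver (my own illustration; the method is reproduced so the sketch is self-contained, with the a[middle] typo fixed):

```java
// Sketch: exercise both loop outcomes and all three comparison branches of
// binary_search from the notes.
public class BinarySearchTest {
    static int binary_search(int[] a, int low, int high, int target) {
        while (low <= high) {
            int middle = low + (high - low) / 2;
            if (target < a[middle])
                high = middle - 1;
            else if (target > a[middle])
                low = middle + 1;
            else
                return middle;
        }
        return -1; /* not found in a[low..high] */
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9};
        System.out.println(binary_search(a, 0, 4, 5));   // prints 2: found, final else
        System.out.println(binary_search(a, 0, 4, 0));   // prints -1: target < a[middle]
        System.out.println(binary_search(a, 0, 4, 10));  // prints -1: target > a[middle]
    }
}
```

The three calls together cover both outcomes of the loop test and each comparison branch.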

Exercises
Here are more exercise programs that you can draw CFGs for.
    /* effects: if x == null, throw NullPointerException
       otherwise, return number of elements in x that are odd, positive or both. */
    int oddOrPos(int[] x) {
        int count = 0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] % 2 == 1 || x[i] > 0) {
                count++;
            }
        }
        return count;
    }
    // example test case: input: x = [-3, -2, 0, 1, 4]; output: 3

Next, we have a really poorly-designed API (I'd give it a D at most, maybe an F) because it's
impossible to succinctly describe what it does. Do not design functions with interfaces like
this. But we can still draw a CFG, no matter how bad the code is.
    /** Returns the mean of the first maxSize numbers in the array,
        if they are between min and max. Otherwise, skip the numbers. */
    double computeMean(int[] value, int maxSize, int min, int max) {
        int i, ti, tv, sum;
        i = 0; ti = 0; tv = 0; sum = 0;
        while (ti < maxSize) {
            ti++;
            if (value[i] >= min && value[i] <= max) {
                tv++;
                sum += value[i];
            }
            i++;
        }
        if (tv > 0)
            return (double) sum / tv;
        else
            throw new IllegalArgumentException();
    }

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 10 January 26, 2015
Patrick Lam
version 0

We are going to completely switch gears and talk about testing concurrent programs next. For
those of you in 3A, SE 350 is the first real exposure to these concepts; you'll see them more in CS
343. ECE 459 also teaches you how to leverage parallelism.
Context: Multicores are everywhere today! For the past 10 years, chips have not been
getting more GHz. We still have more transistors, though. Hardware manufacturers have been
sharing this bounty with us, through the magic of multicore processors!

If you want performance today, then you need parallelism.


The dark side is that concurrency bugs will bite you.
More often than not, printing a page on my dual-G5 crashes the application.
The funny thing is, printing almost never crashes on my (single-core) G4
PowerBook.
http://archive.oreilly.com/pub/post/dreaded_concurrency.html

The most famous kind of concurrency bug is the race condition. Let's look at this code.

    #include <iostream>
    #include <thread>

    int counter = 0;

    void func() {
        int tmp;
        tmp = counter;
        tmp++;
        counter = tmp;
    }

    int main() {
        std::thread t1(func);
        std::thread t2(func);
        t1.join();
        t2.join();
        std::cout << counter;
        return 0;
    }

When we run it:

    plam@polya /tmp> ./a.out
    2
    plam@polya /tmp> ./a.out
    2
    plam@polya /tmp> ./a.out
    1
    plam@polya /tmp> ./a.out
    1
    plam@polya /tmp> ./a.out
    2
    plam@polya /tmp> ./a.out
    2

Yes, that's a race condition.


A race occurs when you have two concurrent accesses to the same memory location, at least
one of which is a write.
Race conditions arise between variables which are shared between threads. Note that when there's
a race, the final state may not be the same as running one access to completion and then the other.
Tools to the rescue. While races may be entertaining, race conditions are never good. You
have several dynamic analysis tools at your disposal to eradicate races, including:
Helgrind (part of Valgrind)
lockdep (Linux kernel)
Thread Analyzer (Oracle Solaris Studio)
Thread Analyzer (Coverity)
Intel Inspector XE 2011 (formerly Intel Thread Checker)
We can run the race condition shown above under Helgrind:
plam@polya /tmp> g++ -std=c++11 race.C -g -pthread -o race
plam@polya /tmp> valgrind --tool=helgrind ./race
[...]
==6486== Possible data race during read of size 4 at 0x603E1C by thread #3
==6486== Locks held: none
==6486==
at 0x400EA1: func() (race.C:8)
==6486==
by 0x402254: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1732)
==6486==
by 0x4021AE: std::_Bind_simple<void (*())()>::operator()() (functional:1720)
==6486==
by 0x402147: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
==6486==
by 0x4EF196F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20)
==6486==
by 0x4C2F056: mythread_wrapper (hg_intercepts.c:234)
==6486==
by 0x56650A3: start_thread (pthread_create.c:309)
==6486==
by 0x595FCCC: clone (clone.S:111)
==6486==
==6486== This conflicts with a previous write of size 4 by thread #2
==6486== Locks held: none
==6486==
at 0x400EB1: func() (race.C:10)
==6486==
by 0x402254: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1732)
==6486==
by 0x4021AE: std::_Bind_simple<void (*())()>::operator()() (functional:1720)
==6486==
by 0x402147: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
==6486==
by 0x4EF196F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.20)
==6486==
by 0x4C2F056: mythread_wrapper (hg_intercepts.c:234)
==6486==
by 0x56650A3: start_thread (pthread_create.c:309)
==6486==
by 0x595FCCC: clone (clone.S:111)
==6486== Address 0x603e1c is 0 bytes inside data symbol "counter"

Not enough. OK, great. Now you've eliminated all races (as required by the specification). Of
course, there are still lots of bugs that your program might contain. Some of them are even
concurrency bugs. In the following code, there are no longer any contended accesses, but we
cheated by caching the value. Atomic operations would be safe.
    #include <iostream>
    #include <thread>
    #include <mutex>

    int counter = 0;
    std::mutex m;

    void func() {
        int tmp;
        m.lock();
        tmp = counter;
        m.unlock();
        tmp++;
        m.lock();
        counter = tmp;
        m.unlock();
    }
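The remark above that atomic operations would be safe can be sketched in Java (my addition, not code from the lecture): an atomic read-modify-write replaces the racy load/increment/store triple, so the result is deterministic.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: the same two-thread counter, but with an atomic read-modify-write;
// incrementAndGet() cannot lose updates, so the result is always 2 * N.
public class AtomicCounter {
    static final int N = 1000;

    static int runTwoThreads() {
        AtomicInteger counter = new AtomicInteger(0);
        Runnable func = () -> {
            for (int i = 0; i < N; i++)
                counter.incrementAndGet();   // atomic increment: no race
        };
        Thread t1 = new Thread(func), t2 = new Thread(func);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runTwoThreads());   // prints 2000, every run
    }
}
```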
We can test our code for additional concurrency bugs:
run the code multiple times
add noise (sleep, more system load, etc)
Helgrind and friends
force scheduling (e.g. Java PathFinder)
static approaches: lock-set, happens-before, state-of-the-art techniques
Reentrant/recursive locks. What happens if you have two requests for a POSIX/C++11 lock?
If the requests are in different threads, the second thread waits for the first thread to unlock. But
if the requests are in the same thread, that thread waits for itself to unlock... forever!
To avoid this unhappy situation, we can use recursive locks. Each lock knows how many times its
owner has locked it. The owner must then unlock the same number of times to liberate the lock.
Java locks work this way, e.g.

    class SynchronizedIsRecursive {
        int x;

        synchronized void f() {
            x--;
            g(); // does not hang!
        }

        synchronized void g() {
            x++;
        }
    }
Although every Java object is a lock, and we can synchronized() over every lock, ReentrantLocks
are more special:
we can explicitly lock() & unlock() them (or even tryLock())!
However, inexpertly-written Java programs might hog the lock. To avoid that, use a try/finally
construct:
    Lock lock = new ReentrantLock();
    lock.lock();
    try {
        // you got the lock! workworkwork
    } finally {
        // might have thrown an exception
        lock.unlock();
    }
Tools for detecting lock usage issues


The reference for the next example is Engler et al [ECCH00]. This example falls short of excellence:
    /* 2.4.0:drivers/sound/cmpci.c:cm_midi_release: */
    lock_kernel(); // [PL: GRAB THE LOCK]
    if (file->f_mode & FMODE_WRITE) {
        add_wait_queue(&s->midi.owait, &wait);
        ...
        if (file->f_flags & O_NONBLOCK) {
            remove_wait_queue(&s->midi.owait, &wait);
            set_current_state(TASK_RUNNING);
            return -EBUSY; // [PL: OH NOES!!1]
        }
        ...
    }
    unlock_kernel();

The problem: lock() and unlock() calls must be paired! They are paired on the happy path, but
not on the -EBUSY path. [ECCH00] describes a tool that allows developers to describe calls that
must be paired.
Another example:

    foo(p, ...);            foo(p, ...);            foo(p, ...);
    bar(p, ...);            bar(p, ...);            // ERROR: foo, no bar!
Our tool might then give the following results: 23 errors, 11 false positives. (A false positive is a
case where the tool reports an error but there is no real error, for instance because the flagged
path is infeasible.)
The next challenge is: how do we find such rules? In particular, we want to find rules of the form
"A() must be followed by B()"; writing a(); ... b(); denotes a MAY-belief that b() follows
a(). iComment, by Lin Tan et al., proposes mining comments to find such rules [TYKZ07].
Here is some OpenSolaris code that demonstrates expectations, as well as different locking
primitives.

    /* opensolaris/common/os/taskq.c: */
    /* Assumes: tq->tq_lock is held. */
    static void taskq_ent_free(...) { ... }

    static taskq_t
    *taskq_create_common(...) { ...
        // [different lock primitives below:]
        mutex_enter(...);
        taskq_ent_free(...); /* consistent */
        ...
    }
Getting back to actual bugs, here is a bad comment automatically detected by iComment:

     1  /* mozilla/security/nss/lib/ssl/sslsnce.c: */
     2  /* Caller must hold cache lock when calling this. */
     3  static sslSessionID * ConvertToSID(...) { ... }
     4
     5  static sslSessionID *ServerSessionIDLookup(...)
     6  {
     7      ...
     8      UnlockSet(cache, set); ...
     9      sid = ConvertToSID(...);
    10      ...
    11  }
We observe a specification in the comment at line 2, and then a usage error at line 9, where we
unlock the cache and then call ConvertToSID. The badness of the comment was confirmed by
Mozilla developers.
Issue: Comments are not updated alongside code. Bad comments can and do cause bugs.
Here's another bad comment in the Linux kernel, also automatically detected.

     1  // linux/drivers/ata/libata-core.c:
     2
     3  /* LOCKING: caller. */
     4  void ata_dev_select(...) { ... }
     5
     6  int ata_dev_read_id(...) {
     7      ...
     8      ata_dev_select(...);
     9      ...
    10  }

Once again, the specification at line 3 states that the caller is to lock. But line 8 calls
ata_dev_select() without holding the lock. The badness of this comment was confirmed by
Linux developers.

Deadlocks
Another concurrency problem is deadlocks. We focus on a particular form of deadlock here, which
occurs when code may get interrupted by interrupt handlers, and the code shares locks with the
interrupt handler. This problem inspired aComment [TZP11].

In particular, if the spinlock is taken by code that runs in interrupt context (either a hardware
or software interrupt), then code requesting the lock must use the spin lock form that disables
interrupts. Otherwise, sooner or later, the code will deadlock.
Here's an example of the right way to do things:

    spinlock_t mr_lock = SPIN_LOCK_UNLOCKED;
    unsigned long flags;

    spin_lock_irqsave(&mr_lock, flags);
    /* critical section... */
    spin_unlock_irqrestore(&mr_lock, flags);

spin_lock_irqsave() disables interrupts locally and provides the spinlock on symmetric
multiprocessors (SMPs). spin_unlock_irqrestore() restores interrupts to their state at the time
the lock was acquired. This covers both interrupt and SMP concurrency issues.

References
[ECCH00] Dawson R. Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system
rules using system-specific, programmer-written compiler extensions. In Michael B.
Jones and M. Frans Kaashoek, editors, OSDI, pages 1-16. USENIX Association, 2000.
[TYKZ07] Lin Tan, Ding Yuan, Gopal Krishna, and Yuanyuan Zhou. /*icomment: bugs or bad
comments?*/. In Thomas C. Bressoud and M. Frans Kaashoek, editors, SOSP, pages
145-158. ACM, 2007.
[TZP11] Lin Tan, Yuanyuan Zhou, and Yoann Padioleau. aComment: mining annotations from
comments and code to detect interrupt-related concurrency bugs. In Richard N. Taylor,
Harald Gall, and Nenad Medvidovic, editors, ICSE, pages 11-20. ACM, 2011.

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 11 January 28, 2015
Patrick Lam
version 1

Assertions are a key ingredient in writing tests, so I thought I would explicitly describe them.
Definition 1 An assertion contains an expression that is supposed to evaluate to true.
For instance, if we have a doubly-linked list, the doubly-linked list property states that prev is the
inverse of next, i.e. for nodes n, we have n.next.prev == n. After inserting a node, one might
expect this property to be true of that node (possibly with caveats, for instance about the end of
the list).

We use assertions in unit tests to say what's supposed to be true.
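As a concrete sketch (hypothetical Node class, my own illustration), the doubly-linked-list property can be written as an assert directly after the insertion:

```java
// Sketch: assert the doubly-linked-list property n.next.prev == n after insertion.
// (Run with java -ea to enable assert statements.)
public class DLList {
    static class Node {
        int val;
        Node next, prev;
        Node(int val) { this.val = val; }
    }

    // Insert a fresh node containing val immediately after node n.
    static Node insertAfter(Node n, int val) {
        Node fresh = new Node(val);
        fresh.next = n.next;
        fresh.prev = n;
        if (n.next != null) n.next.prev = fresh;
        n.next = fresh;
        assert n.next.prev == n;                               // the property itself
        assert fresh.next == null || fresh.next.prev == fresh; // caveat: end of list
        return fresh;
    }

    public static void main(String[] args) {
        Node head = new Node(1);
        insertAfter(head, 2);
        insertAfter(head, 3);               // list is now 1, 3, 2
        System.out.println(head.next.val);  // prints 3
    }
}
```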


Preconditions and postconditions. More generally, we can express what is supposed to be
true upon entry to & exit from a method. We saw this code in Linux:

    /* LOCKING: caller. */
    void ata_dev_select(...) { ... }
This expresses an assertion (although not written as a program statement) that the lock is held
upon entry.
Assume/Guarantee Reasoning. Why would you use preconditions and postconditions? Reasoning
about programs is difficult, and preconditions and postconditions simplify that reasoning.
When reasoning about the callee, you get to assume that the precondition holds upon entry.
When reasoning about the caller, you must guarantee that the precondition holds before the call.
The reverse holds for the postcondition.

aComment
We talked about the aComment approach for locking-related annotations. In particular, it:
extracts locking-related annotations from code;
extracts locking-related annotations from comments; and
propagates annotations to callers.

Tools
To provide some context, consider the OS X Mavericks "goto fail" bug:

    if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)
        goto fail;
    if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
        goto fail;
        goto fail; /* MISTAKE! THIS LINE SHOULD NOT BE HERE */
    if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
        goto fail;
    err = sslRawVerify(...);
    fail:
        return err;

The bug:
opensource.apple.com/source/Security/Security-55471/libsecurity_ssl/lib/sslKeyExchange.c

No bug:
www.opensource.apple.com/source/Security/Security-55179.13/libsecurity_ssl/lib/sslKeyExchange.c

A writeup:
nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-fail-apples-ssl-bug-explained-plus-an-unofficial-patch/

Detecting goto fail. In retrospect, a number of tools could've found this bug:
the compiler's -Wunreachable-code option;
PC-Lint: warning 539, "Did not expect positive indentation";
PVS-Studio: V640, "Logic does not match formatting".
The problem is that these tools also report many other issues, and live issues can be buried among
those other issues.

The Landscape of Testing and Static Analysis Tools


Here's a survey of your options:
manual testing;
running a JUnit test suite, manually generated;
running automatically-generated tests;
running static analysis tools.
We'll examine several points on this continuum today. More on this later (Lecture 23), thanks to
a guest lecturer. Some examples:
Coverity: a static analysis tool used by 900+ companies, including BlackBerry, Mozilla, etc.
Microsoft requires Windows device drivers to pass their Static Driver Verifier for certification.

Tools for Java that you can download


FindBugs. An open-source static bytecode analyzer for Java out of the University of Maryland.
findbugs.sourceforge.net
It finds bug patterns:
off-by-one;
null pointer dereference;
ignored read() return value;
ignored return value (immutable classes);
uninitialized read in constructor;
and more. . .
FindBugs gives some false positives. Here are some techniques to help avoid them:
patricklam.ca/papers/14.msr.saa.pdf
Java Path Finder (JPF), NASA. The key idea: Implement a Java Virtual Machine, but
explore many thread interleavings, looking for concurrency bugs.
"JPF is an explicit-state software model checker for Java bytecode."
JPF can also search for deadlocks and unhandled exceptions (NullPointerException, AssertionError);
race conditions; missing heap bounds checks; and more.
javapathfinder.sourceforge.net
Korat (University of Illinois). Key Idea: Generate Java objects from a representation invariant
specification written as a Java method.

For instance, here's a binary tree. One characteristic of a binary tree:
the left & right pointers don't refer to the same node.
We can express that characteristic in Java as follows:


    boolean repOk() {
        if (root == null) return size == 0; // empty tree has size 0
        Set visited = new HashSet(); visited.add(root);
        List workList = new LinkedList(); workList.add(root);
        while (!workList.isEmpty()) {
            Node current = (Node)workList.removeFirst();
            if (current.left != null) {
                if (!visited.add(current.left)) return false; // acyclicity
                workList.add(current.left);
            }
            if (current.right != null) {
                if (!visited.add(current.right)) return false; // acyclicity
                workList.add(current.right);
            }
        }
        if (visited.size() != size) return false; // consistency of size
        return true;
    }
Korat then generates all distinct (non-isomorphic) trees, up to a given size (say 3). It uses these
trees as inputs for testing the add() method of the tree (or for any other methods.)
korat.sourceforge.net/index.html
Randoop (MIT). Key Idea: Writing tests is a difficult and time-consuming activity, and yet
it is a crucial part of good software engineering. Randoop automatically generates unit tests for
Java classes.
Randoop generates random sequences of method calls, looking for object contract violations.
To use it, simply point it at a program & let it run.
Randoop discards bad method sequences (e.g. illegal argument exceptions). It remembers method
sequences that create complex objects, and sequences that result in object contract violations.
code.google.com/p/randoop/
Here is an example generated by Randoop:

    public static void test1() {
        LinkedList list = new LinkedList();
        Object o1 = new Object();
        list.addFirst(o1);
        TreeSet t1 = new TreeSet(list);
        Set s1 = Collections.synchronizedSet(t1);
        // violated in the Java standard library!
        Assert.assertTrue(s1.equals(s1));
    }

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 14 February 4, 2015
Patrick Lam
version 1

Recall that we've been discussing beliefs. Here are a couple of beliefs that are worthwhile to
check (examples courtesy of Dawson Engler).
Redundancy Checking. 1) Code ought to do something. So, when you have code that doesn't
do anything, that's suspicious. Look for identity operations, e.g.

    x = x, 1 * y, x & x, x | x.

Or, a longer example:

    /* 2.4.5-ac8/net/appletalk/aarp.c */
    da.s_node = sa.s_node;
    da.s_net = da.s_net;

Also, look for unread writes:

    for (entry=priv->lec_arp_tables[i];
         entry != NULL; entry=next) {
        next = entry->next; // never read!
        ...
    }
Redundancy suggests conceptual confusion.
So far, we've talked about MUST-beliefs; violations are clearly wrong (in some sense). Let's examine
MAY-beliefs next. For such beliefs, we need more evidence to convict the program.
Process for verifying MAY beliefs. We proceed as follows:
1. Record every successful MAY-belief check as "check".
2. Record every unsuccessful belief check as "error".
3. Rank errors based on the check : error ratio.
The most likely errors occur where check is large and error is small.
Example. One example of a belief is use-after-free:

    free(p);
    print(*p);

That particular case is a MUST-belief. However, other resources are freed by custom (undocumented)
free functions. It's hard to get a list of what is a free function and what isn't. So, let's derive
them behaviourally.

Inferring beliefs: finding custom free functions. The key idea is: if pointer p is not used
after calling foo(p), then derive a MAY-belief that foo(p) frees p.
OK, so which functions are free functions? Well, just assume all functions free all arguments:
emit check at every call site;
emit error at every use.
(In reality, filter functions with suggestive names.)
Putting that into practice, we might observe:

    foo(p);    foo(p);    foo(p);    bar(p);    bar(p);    bar(p);
    p = x;     p = x;     p = x;     p = 0;     p = 0;     p = x;

We would then rank bar's error first. Plausible results might be: 23 free errors, 11 false positives.
Inferring beliefs: finding routines that may return NULL. The situation: we want to know
which routines may return NULL. Can we use static analysis to find out?
Sadly, this is difficult to know statically (return p->next;?), and
we get false positives: some functions return NULL under special cases only.
Instead, let's observe what the programmer does. Again, rank errors based on checks vs. non-checks.
As a first approximation, assume all functions can return NULL:
if a pointer is checked before use, emit check;
if a pointer is used before any check, emit error.
This time, we might observe:

    p = bar(...);    p = bar(...);       p = bar(...);       p = bar(...);
    p = x;           if (!p) return;     if (!p) return;     if (!p) return;
                     p = x;              p = x;              p = x;

Again, sort errors based on the check:error ratio. Plausible results: 152 errors, 16 false positives.

General statistical technique. When we write a(); ... b();, we mean a MAY-belief that a() is
followed by b(). We don't actually know that this is a valid belief. It's a hypothesis, and we'll
try it out. Algorithm:
assume every a-b is a valid pair;
emit check for each path with a() and then b();
emit error for each path with a() and no b().
(Actually, prefilter functions that look paired.)
Consider:

    foo(p, ...);            foo(p, ...);            foo(p, ...);
    bar(p, ...); // check   bar(p, ...); // check   // error: foo, no bar!
This applies to the course project as well.


    void scope1() {
        A(); B(); C(); D();
    }
    void scope2() {
        A(); C(); D();
    }
    void scope3() {
        A(); B();
    }
    void scope4() {
        B(); D(); scope1();
    }
    void scope5() {
        B(); D(); A();
    }
    void scope6() {
        B(); D();
    }

A() and B() must be paired: either A() then B(), or B() then A().
Support = the number of times a pair of functions appears together; here, support({A,B}) = 3.
Confidence({A,B},{A}) = support({A,B})/support({A}) = 3/4.
Sample output for support threshold 3, confidence threshold 65% (intra-procedural analysis):

    bug:A in scope2, pair: (A B), support: 3, confidence: 75.00%
    bug:A in scope3, pair: (A D), support: 3, confidence: 75.00%
    bug:B in scope3, pair: (B D), support: 4, confidence: 80.00%
    bug:D in scope2, pair: (B D), support: 4, confidence: 80.00%
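The support and confidence numbers above are easy to compute. Here is a minimal sketch (hypothetical helper, my own illustration; intra-procedural, with each scope modelled as the set of functions it calls):

```java
import java.util.*;

// Sketch: support({a,b}) = number of scopes calling both a and b;
// confidence({a,b},{a}) = support({a,b}) / support({a}).
public class PairMiner {
    public static int support(List<Set<String>> scopes, String... fns) {
        int count = 0;
        for (Set<String> scope : scopes)
            if (scope.containsAll(Arrays.asList(fns))) count++;
        return count;
    }

    public static double confidence(List<Set<String>> scopes, String a, String b) {
        return (double) support(scopes, a, b) / support(scopes, a);
    }

    public static void main(String[] args) {
        // the six scopes from the example above
        List<Set<String>> scopes = Arrays.asList(
            new HashSet<>(Arrays.asList("A", "B", "C", "D")), // scope1
            new HashSet<>(Arrays.asList("A", "C", "D")),      // scope2
            new HashSet<>(Arrays.asList("A", "B")),           // scope3
            new HashSet<>(Arrays.asList("B", "D", "scope1")), // scope4
            new HashSet<>(Arrays.asList("B", "D", "A")),      // scope5
            new HashSet<>(Arrays.asList("B", "D")));          // scope6
        System.out.println(support(scopes, "A", "B"));        // prints 3
        System.out.println(confidence(scopes, "A", "B"));     // prints 0.75
    }
}
```

A real checker would, of course, also track the order of the calls and the paths on which they occur; the sketch only counts co-occurrence.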

The point is to find examples like the one from cmpci.c where there's a lock_kernel() call but,
on an exceptional path, no unlock_kernel() call.
Summary: Belief Analysis. We don't know what the right spec is. So, look for contradictions.
MUST-beliefs: contradictions = errors!
MAY-beliefs: pretend they're MUST-beliefs, and rank by confidence.
(A key assumption behind this belief analysis technique: most of the code is correct.)

Further references. Dawson R. Engler, David Yu Chen, Seth Hallem, Andy Chou and Benjamin
Chelf. Bugs as Deviant Behavior: A General Approach to Inferring Errors in Systems Code. In
SOSP '01.
Dawson R. Engler, Benjamin Chelf, Andy Chou, and Seth Hallem. Checking system rules using
system-specific, programmer-written compiler extensions. In OSDI '00 (best paper). www.
stanford.edu/~engler/mc-osdi.pdf
Junfeng Yang, Can Sar and Dawson Engler. eXplode: a Lightweight, General System for Finding
Serious Storage System Errors. In OSDI '06. www.stanford.edu/~engler/explode-osdi06.pdf

Using Linters
We will also talk about linters in this lecture, based on Jamie Wong's blog post jamie-wong.com/
2015/02/02/linters-as-invariants/.
First there was C. In statically-typed languages, like C,

    #include <stdio.h>

    int main() {
        printf("%d\n", num);
        return 0;
    }

the compiler saves you from yourself. The guaranteed invariant:
if code compiles, all symbols resolve.
Less-nice languages. OK, so you try to run that in JavaScript and it crashes right away.
Invariant?
if code runs, all symbols resolve?
But what about this:

    function main(x) {
        if (x) {
            console.log("Yay");
        } else {
            console.log(num);
        }
    }

    main(true);

Nope! The above invariant doesn't work.


OK, what about this invariant:
if code runs without crashing, all symbols referenced in the code path executed resolve?
Nope!

    function main() {
        try {
            console.log(num);
        } catch (err) {
            console.log("nothing to see here");
        }
    }

    main();

So, when you're working in JavaScript and maintaining old code, you always have to deduce:
is this variable defined?
is this variable always defined?
do I need to load a script to define that variable?
We have computers. They're powerful. Why is this the developer's problem?!

Consider the same broken program again:

    function main(x) {
        if (x) {
            console.log("Yay");
        } else {
            console.log(num);
        }
    }

    main(true);

Now:

    $ nodejs /usr/local/lib/node_modules/jshint/bin/jshint --config jshintrc foo.js
    foo.js: line 5, col 17, num is not defined.
    1 error

Invariant:
If code passes JSHint, all top-level symbols resolve.

Strengthening the Invariant. Can we do better? How about adding a pre-commit hook?
If code is checked-in and the commit hook ran, all top-level symbols resolve.
Of course, sometimes the commit hook didn't run. Better yet: block deploys on test failures.
Better invariant:
If code is deployed, all top-level symbols resolve.
Even better yet. It is hard to tell whether code is deployed or not. Use git feature branches,
and merge when deployed.
If code is in master, all top-level symbols resolve.

Software Testing, Quality Assurance and Maintenance
Winter 2015
Lecture 15 February 6, 2015
Patrick Lam
version 1

Data flow Criteria
So far we've seen structure-based criteria which imposed test requirements solely based on the
nodes and edges of a graph. These criteria have been oblivious to the contents of the nodes.
However, programs mostly move data around, so it makes sense to propose some criteria based on
the flow of data around a program. We'll be talking about du-pairs, which connect definitions and
uses of variables:

    x = 5       // def(x)
    ...            (du-pair)
    print(x)    // use(x)
Let's look at some graphs.

[Figure: a two-node graph, n0 : x = 5 with an edge to n1 : print(x)]

We write def(n0 ) = use(n1 ) = {x}.
Note that edges can also have defs and uses, for instance in a graph corresponding to a finite state
machine. In that case, we could write use(n0 , n1 ) = {}.
Here's another example.

[Figure: a graph with q0 : z = x - y, q1 : if (z < w), q2 : print(z), q3 : print(w),
and q4 ]

A particular def d of variable x may (or may not) reach a particular use u. If a def may reach
a particular use, then there exists a path from d to u which is free of redefinitions of x. In the
following graph, the def at n2 does not reach the use at n3 , since no path goes from n2 to n3 .
[Figure: two nodes with no path between them: n2 : x = 5 and n3 : use(x)]

Another example of a definition which does not reach:

[Figure: a path n0 : x = 2, then n1 : x = 17, then n2 : use(x)]

We say that the definition at n1 kills the definition at n0 , so that def(n0 ) does not reach n2 . We
are therefore looking for def-clear paths.
Definition 1 A path p from ℓ1 to ℓm is def-clear with respect to variable v if for every node nk and
every edge ek on p from ℓ1 to ℓm , where k ≠ 1 and k ≠ m, v is not in def(nk ) or in def(ek ).
That is, nothing on the path p from location ℓ1 to location ℓm redefines v. (Locations are edges or
nodes.)
Definition 2 A def of v at ℓi reaches a use of v at ℓj if there exists a def-clear path from ℓi to ℓj
with respect to v.
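Definition 1 is straightforward to check mechanically. A minimal sketch (my own illustration, with hypothetical representation: a path as a list of node names and a def set per node; defs on edges are ignored for brevity):

```java
import java.util.*;

// Sketch: a path n_1 .. n_m is def-clear w.r.t. v if no interior node
// (k != 1 and k != m) redefines v.
public class DefClear {
    public static boolean defClear(List<String> path, Map<String, Set<String>> def, String v) {
        for (int k = 1; k < path.size() - 1; k++)   // interior nodes only
            if (def.getOrDefault(path.get(k), Set.of()).contains(v))
                return false;
        return true;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> def = Map.of("n1", Set.of("x"));
        System.out.println(defClear(List.of("n0", "n1", "n2"), def, "x")); // false: n1 kills x
        System.out.println(defClear(List.of("n0", "n1", "n2"), def, "y")); // true
    }
}
```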
Quick poll: does the def at n0 reach the use at n5 ?
[Figure: a graph with n0 : x = 1729, n1 , n2 : x = 84, n3 , n4 , and n5 : use(x)]

Building on the notion of a def-clear path:


Definition 3 A du-path with respect to v is a simple path that is def-clear with respect to v from
a node ni , such that v is in def(ni ), to a node nj , such that v is in use(nj ).
(This definition could easily be modified to use edges ei and ej .)
Note the following three points about du-paths:
a du-path is associated with a variable;
it is simple (otherwise there are too many);
there may be any number of uses on a du-path.

Coverage criteria using du-paths


We next create groups of du-paths. Consider again the following double-diamond graph D:
n9 : x = 9

n: use(x)

u1

n5 : x = 5

u2 : use(x)

n3 : x = 3

u0 : use(x)

We will define two sets of du-paths:


def-path sets: fix a def and a variable, e.g.
du(n5 , x) =
du(n3 , x) =
def-pair sets: fix a def, a use, and a variable, e.g. du(n5 , n, x) =
These sets will give the notions of all-defs coverage (tour at least one du-path from each def-path
set; a weak criterion) and all-uses coverage (tour at least one du-path from each def-pair set).
How can there be multiple elements in a def-pair set?

Here's an example with two du-paths in a def-pair set.

[Figure: a graph in which the def n17: x = 17 reaches the use n: use(x) along two different def-clear paths.]

We then have
du(n17 , n, x) =
Note the general relation

du(ni, v) = ⋃_{nj} du(ni, nj, v)
There are more def-pair sets than def-path sets. Cycles are always allowed as du-paths, as long as
the du-path is simple; you can always tour a du-path with a non-simple path, of course.
Useful exercise. Create an example where one def-path set splits into several def-pair sets; you
can get a smaller example than the one in the book.
We can use the above definitions to provide coverage criteria.
Criterion 1 All-Defs Coverage (ADC). For each def-path set S = du(n, v), TR contains at least
one path d in S.
Criterion 2 All-Uses Coverage (AUC). For each def-pair set S = du(ni , nj , v), TR contains at
least one path d in S.

What do these criteria mean? For each def,


ADC: reach at least one use;
AUC: reach every use somehow;
In the context of the earlier example,
ADC requires:
AUC requires:
Nodes versus edges. So far, we've assumed definitions and uses occur on nodes.
uses on edges (p-uses) work as well;
defs on edges are trickier, because a du-path from an edge to an edge may not be simple.
(We could make things work out with more work.)
Another example.
[Figure: a graph with n0: def(x), uses at n1: use(x) and n5: use(x), and intermediate nodes n2, n3, n4, n6.]
Some test sets that meet these criteria:


ADC:
AUC:

Software Testing, Quality Assurance and Maintenance

Winter 2015

Lecture 16 February 9, 2015


Patrick Lam

version 1

Subsumption Chart
Here are the subsumption relationships between our graph criteria.

Complete Path Coverage (CPC)
  subsumes Prime Path Coverage (PPC), which subsumes:
    All-du-Paths Coverage (ADUPC), which subsumes All-Uses Coverage (AUC),
      which subsumes All-Defs Coverage (ADC) and Edge Coverage (EC);
    Edge-Pair Coverage (EPC), which subsumes Edge Coverage (EC),
      which subsumes Node Coverage (NC);
    Complete Round-Trip Coverage (CRTC), which subsumes
      Simple Round-Trip Coverage (SRTC).

We know EPC subsumes EC subsumes NC from before, and clearly CRTC subsumes SRTC.
Also, CPC subsumes PPC subsumes EPC, and PPC subsumes CRTC, because simple paths
include round trips.
In this offering, we didn't talk about ADUPC at all, and we only touched briefly on CRTC
and SRTC.
Assumptions for dataflow criteria:
1. every use preceded by a def (guaranteed by Java)
2. every def reaches at least one use

3. for every node with multiple outgoing edges, at least one variable is used on each out edge,
and the same variables are used on each out edge.
Then:
AUC subsumes ADC.
Each edge has at least one use, so AUC subsumes EC.
Finally, each du-path is also a simple path, so PPC subsumes ADUPC (not discussed this
term, but ADUPC subsumes AUC). (Note that prime paths are simpler to compute than
data-flow relationships, especially in the presence of pointers.)

Dataflow Graph Coverage for Source Code


Last time, we saw how to construct graphs which summarized a control-flow graph's structure.
Let's enrich our CFGs with definitions and uses to enable the use of our dataflow criteria.
Definitions.

Here are some Java statements which correspond to definitions.

x = 5: x occurs on the left-hand side of an assignment statement;


foo(T x) { ... }: implicit definition for x at the start of a method;

bar(x): during a call to bar, x might be defined if x is a C++ reference parameter.


(subsumed by others): x is an input to the program.
Examples:

Uses. The book lists a number of cases of uses, but it boils down to "x occurs in an expression
that the program evaluates." Examples: RHS of an assignment, or as part of a method parameter,
or in a conditional.
Complications. As I said before, the situation in real life is more complicated: we've assumed
that x is a local variable with scalar type.
What if x is a static field, an array, an object, or an instance field?
How do we deal with aliasing?

One answer is to be conservative and note that we've said that a definition d reaches a use u if it
is possible that the address defined at d refers to the same address used at u. For instance:
class C { int f; }
void foo(C q) { use(q.f); }

x = new C(); x.f = 5;
y = new C(); y.f = 2;
foo(x);
foo(y);

Our definition says that both definitions reach the use.


Exercise.

Consider the following graph:

N = {0, 1, 2, 3, 4, 5, 6, 7}
N0 = {0}
Nf = {7}
E = {(0, 1), (1, 2), (1, 7), (2, 3), (2, 4), (3, 2), (4, 5), (4, 6), (5, 6), (6, 1)}
test paths:

t1 = [0, 1, 7]
t2 = [0, 1, 2, 4, 6, 1, 7]
t3 = [0, 1, 2, 4, 5, 6, 1, 7]
t4 = [0, 1, 2, 3, 2, 4, 6, 1, 7]

def(0) = def(3) = use(5) = use(7) = { x }


(a) Draw the graph.
(b) List all du-paths with respect to x. Include all du-paths, even those that are subpaths of others.
(c) List a minimal test set (using the given paths) satisfying all-defs coverage with respect to x.
(d) List a minimal test set (using the given paths) satisfying all-uses coverage with respect to x.
Compiler tidbit. In a compiler, we use intermediate representations to simplify expressions,
including definitions and uses. For instance, we would simplify:
x = foo(y + 1, z * 2)

Basic blocks and defs/uses. Basic blocks can rule out some definitions and uses as irrelevant.
Defs: consider the last definition of a variable in a basic block. (If we're not sure whether x
and y are aliased, leave both of them.)
Uses: consider only uses that arent dominated by a definition of the same variable in the
same basic block, e.g. y = 5; use(y) is not interesting.

Graph Coverage for Design Elements


We next move beyond single methods to design elements, which include multiple methods, classes,
modules, packages, etc. Usually people refer to such testing as integration testing.

Structural Graph Coverage.


We want to create graphs that represent relationships, or couplings, between design elements.
Call Graphs.

Perhaps the most common interprocedural graph is the call graph.

design elements, or nodes, are methods (or larger program subsystems)


couplings, or edges, are method calls.
Consider the following example.
[Figure: a call graph over methods A, B, C, D, E.]

For method coverage, must call each method.


For edge coverage, must visit each edge; in this particular case, must get to both A and C
from both of their respective callers.
Like any other type of testing, call graph-based testing may require test harnesses. Imagine, for
instance, a library that implements a stack; it exposes push and pop methods, but needs a test
driver to exercise these methods.
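As a sketch of what such a driver might look like (the IntStack class here is an illustrative stand-in for the library under test, not code from the lecture):

```java
// The library itself never calls push/pop, so a driver must supply the calls
// that cover the corresponding edges of the call graph.
class IntStack {
    private final int[] data = new int[16];
    private int size = 0;
    void push(int x) { data[size++] = x; }
    int pop() { return data[--size]; }
    boolean isEmpty() { return size == 0; }
}

class StackDriver {
    public static void main(String[] args) {
        IntStack s = new IntStack();
        s.push(1);                                      // covers the driver -> push edge
        s.push(2);
        if (s.pop() != 2) throw new AssertionError();   // covers the driver -> pop edge
        if (s.pop() != 1) throw new AssertionError();
        if (!s.isEmpty()) throw new AssertionError();
        System.out.println("driver exercised push and pop");
    }
}
```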

Data Flow Graph Coverage for Design Elements


The structural coverage criteria for design elements were not very satisfying: basically we only had
call graphs. Lets instead talk about data-bound relationships between design elements.
caller: unit that invokes the callee;
actual parameter: value passed to the callee by the caller; and
formal parameter: placeholder for the incoming variable.
Illustration.
caller: foo(actual1, actual2);
callee: void foo(int formal1, int formal2) { }

Software Testing, Quality Assurance and Maintenance

Winter 2015

Lecture 17 February 11, 2015


Patrick Lam

version 1

We want to define du-pairs between callers and callees. Here's an example.


[Figure: caller CFG with 1: x = 5 followed by a branch to either 2: x = 4 or 3: x = 3, both leading to the call 4: B(x); callee B(int y) with nodes 11 (entry), 12: z = y, 13: t = y, 14: print(y).]

We'll say that the last-defs are 2, 3; the first-use is 12. We formally define these notions:
Definition 1 (Last-def). The set of nodes N that define a variable x for which there is a def-clear
path from a node n ∈ N through a call to a use in the other unit.
Definition 2 (First-use). The set of nodes N that have uses of y and for which there exists a path
that is def-clear and use-clear from the entry point (if the use is in the callee) or the callsite (if the
use is in the caller) to the nodes N.
We need the following side definition, analogous to that of def-clear:
Definition 3 A path p = [n1, . . . , nj] is use-clear with respect to v if for every nk ∈ p, where k ≠ 1
and k ≠ j, v is not in use(nk).
In other words, the last-def is the definition that goes through the call or return, and the first-use
picks up that definition.

Here are two more examples:


main() {
x = 1;
x = 2; // last-def
if (...) { x = 3; /* last-def */ }
B(x);
}
B(int y) {
if (...) {
Z = y; // first-use
} else {
T = y; // first-use
}
print(y);
}

x = 14;     // last-def
y = g(x);
print(y);   // first-use

int g(a) {
    print(a);  // first-use
    b = 24;    // last-def
    return b;
}

One can create pairs of triples linking last-defs and first-uses; characterize a last-def or a first-use
by the method name, the variable name, and the statement, and then link them.
Tests can, of course, go beyond just testing the first-use. Our first-use and last-def definitions,
however, make testing slightly more tractable. We could, of course, carry out full inter-procedural
data-flow, i.e. covering all du-pairs, but this would be more expensive.

Syntax-Based Testing
We are going to completely switch gears now. We will see two applications of context-free grammars:
1. input-space grammars: create inputs (both valid and invalid)
2. program-based grammars: modify programs (mutation testing)
Mutation testing. The basic idea behind mutation testing is to improve your test suites by
creating modified programs (mutants) which force your test suites to include test cases which
verify certain specific behaviours of the original programs by killing the mutants.

Generating Inputs: Regular Expressions and Grammars


Consider the following Perl regular expression for Visa numbers:
^4[0-9]{12}(?:[0-9]{3})?$
Idea: generate "valid" tests from regexps and invalid tests by mutating the grammar/regexp. (Why
did I put "valid" in quotes? What is the fundamental limitation of regexps?)

Instead, we can use grammars to generate inputs (including sequences of input events).
Typical grammar fragment:

mult_exp  = unary_exp | mult_exp STAR unary_arith_exp | mult_exp DIV unary_arith_exp;
unary_exp = quant_exp | unary_exp DOT INT | unary_exp up;
...
start = header? declaration
Using Grammars. Two ways you can use input grammars for software testing and maintenance:
recognizer: can include them in a program to validate inputs;
generator: can create program inputs for testing.
Generators start with the start production and replace nonterminals with their right-hand sides to
get (eventually) strings belonging to the input languages.
We specify three coverage criteria for inputs with respect to a grammar G.
Criterion 1 Terminal Symbol Coverage (TSC). TR contains each terminal of grammar G.
Criterion 2 Production Coverage (PDC). TR contains each production of grammar G.
Criterion 3 Derivation Coverage (DC). TR contains every possible string derivable from G.
PDC subsumes TSC. DC often generates infinite test sets, and even if you limit to fixed-length
strings, you still have huge numbers of inputs.
Another Grammar.
roll    = action*
action  = dep | deb
dep     = "deposit" account amount
deb     = "debit" account amount
account = digit{3}
amount  = "$" digit+ "." digit{2}
digit   = [0-9]

Examples of valid strings.

Note: creating a grammar for a system that doesn't have one, but should, is a useful QA exercise.
Using this grammar at runtime to validate inputs can improve software reliability, although it
makes tests generated from the grammar less useful.
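A generator for this grammar can be sketched directly, one method per nonterminal. The method names mirror the grammar's nonterminals, but the code itself is my own illustration, not from the lecture:

```java
import java.util.Random;

// Sketch: generate "valid" inputs by expanding nonterminals of the
// deposit/debit grammar, starting from action.
class BankInputGenerator {
    static final Random rng = new Random(42);

    static String digit() { return String.valueOf(rng.nextInt(10)); }

    static String account() {                 // account = digit{3}
        return digit() + digit() + digit();
    }

    static String amount() {                  // amount = "$" digit+ "." digit{2}
        StringBuilder sb = new StringBuilder("$");
        int len = 1 + rng.nextInt(3);         // pick an expansion length for digit+
        for (int i = 0; i < len; i++) sb.append(digit());
        return sb.append(".").append(digit()).append(digit()).toString();
    }

    static String action() {                  // action = dep | deb
        String verb = rng.nextBoolean() ? "deposit" : "debit";
        return verb + " " + account() + " " + amount();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) System.out.println(action());
    }
}
```

Each generated string, e.g. something of the shape deposit 123 $45.67, belongs to the input language.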
Some Grammar Mutation Operators.
Nonterminal Replacement; e.g.
dep = "deposit" account amount  ⇒  dep = "deposit" amount amount

(Use your judgement to replace nonterminals with similar nonterminals.)


Terminal Replacement; e.g.
amount = "$" digit+ "." digit{2}  ⇒  amount = "$" digit+ "$" digit{2}

Terminal and Nonterminal Deletion; e.g.


dep = "deposit" account amount  ⇒  dep = "deposit" amount

Terminal and Nonterminal Duplication; e.g.


dep = "deposit" account amount  ⇒  dep = "deposit" account account amount

Using grammar mutation operators.


1. mutate the grammar, then generate (invalid) inputs; or,
2. use the correct grammar, but mis-derive a rule once, which gives inputs closer to valid ones
(since you only miss once).
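Option 2 can be sketched by mis-deriving a single rule during generation; here the terminal replacement from above ("." becomes "$" in amount) is applied once. The class and method names are mine:

```java
import java.util.Random;

// Sketch of terminal replacement applied during generation: the mutated
// amount rule uses "$" where the original grammar has ".".
class MutatedGenerator {
    static final Random rng = new Random(7);
    static String digit() { return String.valueOf(rng.nextInt(10)); }

    // original: amount = "$" digit+ "." digit{2}
    // mutant:   amount = "$" digit+ "$" digit{2}
    static String mutatedAmount() {
        // digit+ expanded to a single digit here, for brevity
        return "$" + digit() + "$" + digit() + digit();
    }

    public static void main(String[] args) {
        String s = mutatedAmount();
        System.out.println(s);
        // the mutant's output is rejected by the original amount rule
        System.out.println(s.matches("\\$\\d+\\.\\d{2}"));  // false
    }
}
```

The generated string is close to a valid amount, which makes it a good near-miss test input.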

Why test invalid inputs? Bill Sempf (@sempf), on Twitter: "QA Engineer walks into a bar.
Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders
a sfdeljknesv."
If youre lucky, your program accepts strings (or events) described by a regexp or a grammar. But
you might not use a parser or regexp engine. Generating using the regexp or grammar helps detect
deviations, both right now and in the future.
As you saw in Assignment 1, it's the easiest thing to overlook invalid inputs. Yet they may lead to
undefined behaviour.
What we'll do is mutate the grammars and generate test strings from the mutated grammars.
Some notes:
The book claims we don't have much experience using grammar-based operators.
Can generate strings still in the grammar even after mutation.
Recall that we aren't talking about semantic checks.
Some programs accept only a subset of a specified larger language, e.g. Blogger HTML
comments. Then testing intersection is useful.

Software Testing, Quality Assurance and Maintenance

Winter 2015

Lecture 18 February 13, 2015


Patrick Lam

version 1

Mutation Testing
The second major way to use grammars in testing is mutation testing. Strings are usually programs,
but could also be inputs, especially for testing based on invalid strings.
Definition 1 Ground string: a (valid) string belonging to the language of the grammar.
Definition 2 Mutation Operator: a rule that specifies syntactic variations of strings generated
from a grammar.
Definition 3 Mutant: the result of one application of a mutation operator to a ground string.
Mutants may be generated either by modifying existing strings or by changing a string while it is
being generated.
It is generally difficult to find good mutation operators. One example of a bad mutation operator
might be to change all predicates to true.
Note that mutation is hard to apply by hand, and automation is complicated. The testing
community generally considers mutation to be a gold standard: a benchmark against which to
compare other testing criteria.
More on ground strings. Mutation manipulates ground strings to produce variants on these
strings. Here are some examples:
the program that we are testing; or
valid inputs to a program.
If we are testing invalid inputs, we might not care about ground strings.
Credit card number examples.

Valid strings:

Invalid strings:
We can also create mutants by applying a mutation operator during generation, which is useful
when you don't need the ground string.
NB: There are many ways for a string to be invalid but still be generated by the grammar.

Some points:
How many mutation operators should you apply to get mutants? One.
Should you apply every mutation operator everywhere it might apply? More work, but can
subsume other criteria.

Killing Mutants
Generate a mutant m for an original ground string m0 .
Definition 4 Test case t kills m if running t on m gives different output than running t on m0 .
(The book uses a derivation D rather than a ground string m0 .)
We can also define a mutation score, which is the percentage of mutants killed.
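As a quick illustration of the score (the helper name is mine):

```java
// Sketch: mutation score as the percentage of mutants killed by a test suite.
class MutationScore {
    static double score(int killed, int total) {
        return 100.0 * killed / total;
    }

    public static void main(String[] args) {
        // e.g. a suite that kills 3 of 4 mutants has a mutation score of 75%
        System.out.println(score(3, 4));  // 75.0
    }
}
```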
I'm going to list a trio of coverage criteria from the book. I don't think that the mutation coverage
criteria are ever used. If one were trying to use mutation testing, one would measure the effectiveness
of a test suite (the mutation score) and keep adding tests until reaching a desired mutation score.
Criterion 1 Mutation Coverage (MC). For each mutant m, TR contains requirement kill m.
(This definition does not describe the set of mutants required.)
Criterion 2 Mutation Operator Coverage (MOC). For each mutation operator op, TR contains requirement to create a mutated string m derived using op.
Criterion 3 Mutation Production Coverage (MPC). For each mutation operator op and each
production p that op can be applied to, TR contains requirement to create a mutated string from p.

Program Based Grammars


The usual way to use mutation testing is by generating mutants by modifying programs according
to the language grammar, using mutation operators.
Mutants are valid programs (not tests) which ought to behave differently from the ground string.
Our task, in mutation testing, is to create tests which distinguish mutants from originals.
Example. Given the ground string x = a + b, we might create mutants x = a - b, x = a * b,
etc. A possible original, followed by a mutant:

int foo(int x, int y) { // original
    if (x > 5) return x + y;
    else return x;
}

int foo(int x, int y) { // mutant
    if (x > 5) return x - y;
    else return x;
}

Once we find a test case that kills a mutant, we can forget the mutant and keep the test case. The
mutant is then dead.
Uninteresting Mutants. Three kinds of mutants are uninteresting:
stillborn: such mutants cannot compile (or immediately crash);
trivial : killed by almost any test case;
equivalent: indistinguishable from original program.
The usual application of program-based mutation is to individual statements in unit-level (per-method) testing.
Mutation Example. Here are some mutants.

// original
int min(int a, int b) {
    int minVal;
    minVal = a;
    if (b < a) {
        minVal = b;
    }
    return minVal;
}

// with mutants
int min(int a, int b) {
    int minVal;
    minVal = a;
    minVal = b;                   // 1
    if (b < a) {
    if (b > a) {                  // 2
    if (b < minVal) {             // 3
        minVal = b;
        BOMB();                   // 4
        minVal = a;               // 5
        minVal = failOnZero(b);   // 6
    }
    return minVal;
}
Conceptually we've shown 6 programs, but we display them together for convenience.
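For instance, a test can kill mutant 1 (minVal = a replaced by minVal = b) by comparing outputs. This sketch inlines both versions side by side purely for illustration:

```java
// Sketch: run the same test against the original and mutant 1, then
// compare outputs. A test kills the mutant iff the outputs differ.
class KillDemo {
    static int minOriginal(int a, int b) {
        int minVal = a;
        if (b < a) minVal = b;
        return minVal;
    }

    static int minMutant1(int a, int b) {
        int minVal = b;            // mutant 1: was minVal = a
        if (b < a) minVal = b;
        return minVal;
    }

    public static void main(String[] args) {
        // a=5, b=7: original returns 5, mutant returns 7 -> this test kills mutant 1
        System.out.println(minOriginal(5, 7) != minMutant1(5, 7)); // true
        // a=7, b=5: both return 5 -> this test does not kill mutant 1
        System.out.println(minOriginal(7, 5) != minMutant1(7, 5)); // false
    }
}
```

Once the first test is found, mutant 1 is dead and we keep the test.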
Goals of mutation testing:
1. mimic (and hence test for) typical mistakes;
2. encode knowledge about specific kinds of effective tests in practice, e.g. statement coverage
(4), checking for 0 values (6).
Reiterating the process for using mutation testing (see picture at end):
Goal: kill mutants
Desired Side Effect: good tests which kill the mutants.
These tests will help find faults (we hope). We find these tests by intuition and analysis.

Software Testing, Quality Assurance and Maintenance

Winter 2015

Lecture 19 February 23, 2015


Patrick Lam

version 0

Weak and strong mutants


So far we've talked about requiring differences in the output for mutants. We call such mutants
strong mutants. We can relax this by only requiring changes in the state, which we'll call weak
mutants.
In other words,
strong mutation: fault must be reachable, infect state, and propagate to output.
weak mutation: a fault which kills a mutant need only be reachable and infect state.
The book claims that experiments show that weak and strong mutation require almost the same
number of tests to satisfy them.
We restate the definition of killing mutants which we've seen before:
Definition 1 Strongly killing mutants: Given a mutant m for a program P and a test t, t is said
to strongly kill m iff the output of t on P is different from the output of t on m.
Criterion 1 Strong Mutation Coverage (SMC). For each mutant m, TR contains a test which
strongly kills m.
What does this criterion not say?

Definition 2 Weakly killing mutants: Given a mutant m that modifies a source location ℓ in
program P and a test t, t is said to weakly kill m iff the state of the execution of P on t is different
from the state of the execution of m on t, immediately after some execution of ℓ.
How does this criterion differ from what we've tested recently in unit tests?

Criterion 2 Weak Mutation Coverage (WMC). For each mutant m, TR contains a test which
weakly kills m.

Let's consider mutant 1 from before, i.e. we change minVal = a to minVal = b. In this case:
reachability: unavoidable;
infection: need b ≠ a;
propagation: wrong minVal needs to return to the caller; that is, we can't execute the body
of the if statement, so we need b > a.
A test case for strong mutation is therefore a = 5, b = 7 (return value = , expected ), and for
weak mutation a = 7, b = 5 (return value = , expected ).
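To make the weak/strong distinction concrete, here is a sketch for the test a = 7, b = 5; the helper names (runOriginal, runMutant) are mine, and each helper records the state immediately after the (un)mutated statement alongside the return value:

```java
// Sketch: mutant 1 on min, observed both at the infected state and at the output.
class WeakVsStrong {
    // original: minVal = a
    static int[] runOriginal(int a, int b) {
        int minVal = a;
        int stateAfter = minVal;    // state right after the statement of interest
        if (b < a) minVal = b;
        return new int[]{stateAfter, minVal};
    }

    // mutant 1: minVal = b
    static int[] runMutant(int a, int b) {
        int minVal = b;
        int stateAfter = minVal;
        if (b < a) minVal = b;
        return new int[]{stateAfter, minVal};
    }

    public static void main(String[] args) {
        int[] orig = runOriginal(7, 5), mut = runMutant(7, 5);
        // state differs (7 vs 5) right after the mutated statement: weakly killed
        System.out.println("infected state: " + (orig[0] != mut[0]));  // true
        // but both versions then assign minVal = b and return 5: not strongly killed
        System.out.println("same output:    " + (orig[1] == mut[1]));  // true
    }
}
```

So a = 7, b = 5 weakly kills mutant 1 without strongly killing it, exactly the gap between WMC and SMC.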
Now consider mutant 3, which replaces b < a with b < minVal. This mutant is an equivalent
mutant, since minVal = a at that point. (The infection condition boils down to false.)
Equivalence testing is, in its full generality, undecidable, but we can always estimate.

Testing Programs with Mutation


Here's a possible workflow for actually performing mutation testing.

Start with program P.
1. Create mutants M; eliminate known-equivalent mutants.
2. Define a threshold (target mutation score).
3. Generate test cases T.
4. Run T on P. If the output of P on T is incorrect, fix P (filtering out bogus t ∈ T) and rerun.
5. Run T on all mutants M.
6. Enough mutants killed? If no, generate more test cases and repeat from step 4; if yes, done.

Mutation Operators
We'll define a number of mutation operators, although precise definitions are specific to a language
of interest. Typical mutation operators will encode typical programmer mistakes, e.g. by changing

relational operators or variable references; or common testing heuristics, e.g. fail on zero. Some
mutation operators are better than others.
The book contains a more exhaustive list of mutation operators. How many (intraprocedural)
mutation operators can you invent for the following code?

int mutationTest(int a, int b) {
    int x = 3 * a, y;
    if (m > n) {
        y = n;
    } else if (!(a > -b)) {
        x = a * b;
    }
    return x;
}
Integration Mutation. We can go beyond mutating method bodies by also mutating interfaces
between methods, e.g.
change calling method by changing actual parameter values;
change calling method by changing callee; or
change callee by changing inputs and outputs.
class M {
    int f, g;
    void c(int x) {
        foo(x, g);
        bar(3, x);
    }
    int foo(int a, int b) {
        return a + b * f;
    }
    int bar(int a, int b) {
        return a * b;
    }
}
[Absolute value insertion, operator replacement, scalar variable replacement, statement replacement
with crash statements...]

Mutation for OO Programs. One can also use some operators specific to object-oriented
programs. Most obviously, one can modify the object on which field accesses and method calls
occur.
class A {
    public int x;
    Object f;
    Square s;
    void m() {
        int x;
        f = new Object();
        this.x = 5;
    }
}
class B extends A {
    int x;
}

Exercise.

Come up with a test case to kill each of these types of mutants.

ABS: Absolute Value Insertion


x = 3 * a  ⇒  x = 3 * abs(a), x = 3 * -abs(a), x = 3 * failOnZero(a);
ROR: Relational Operator Replacement
if (m > n)  ⇒  if (m >= n), if (m < n), if (m <= n), if (m == n), if (m != n), if
(false), if (true)
UOD: Unary Operator Deletion
if (!(a > -b))  ⇒  if (a > -b), if (!(a > b))
Summary of Syntax-Based Testing.

                          Program-based                          Input Space
Grammar                   Programming language                   Input languages / XML
Summary                   Mutates programs / tests integration   Input space testing
Use Ground String?        Yes (compare outputs)                  No
Use Valid Strings Only?   Yes (mutants must compile)             Invalid only
Tests                     Mutants are not tests                  Mutants are tests
Killing                   Generate tests by killing              Not applicable

Notes:
Program-based testing has a notion of strong and weak mutants; applied exhaustively, program-based testing subsumes many other techniques.
Sometimes we mutate the grammar, not strings, and get tests from the mutated grammar.
Tool support. PIT Mutation testing tool: http://pitest.org. Mutates your program, reruns
your test suite, tells you how it went.

Software Testing, Quality Assurance and Maintenance

Winter 2015

Lecture 20 February 25, 2015


Patrick Lam

version 1

We've talked about mutation testing in the past few lectures. I thought I'd summarize some recent
research out of Waterloo, in collaboration with the University of Washington and the University of
Sheffield, about: (1) the effectiveness of mutation testing; and (2) what coverage gets you in terms
of test suite effectiveness.

Is Mutation Testing Any Good?


We've talked about mutation testing as a metric for evaluating test suites and making sure that
test suites exercise the system under test sufficiently. The problem with metrics is that they can
be gamed, or that they might measure not quite the right thing. When using metrics, it's critical
to keep in mind what the right thing is. In this case, the right thing is the fault detection power of
a test suite.
Some researchers set out to determine just that. They carried out a study, using realistic code,
where they isolated a number of bugs, and evaluated whether or not there exists a correlation
between real fault detection and mutant detection.
Summary. The answer is yes: test suites that kill more mutants are also better at finding real
bugs. The researchers also investigated when mutation testing fell short; they enumerated types
of bugs that mutation testing, as currently practiced, would not detect.
Methodology. The authors used 5 open-source projects. They isolated a total of 357 reproducible faults in these projects using the projects' bug reporting systems and source control repositories. They then generated 230,000 mutants using the Major mutation framework and investigated
the ability of both developer-written test suites and automatically-generated test suites (EvoSuite,
Randoop, JCrasher) to detect the 357 faults.
For each fault, the authors started with a developer-written test suite Tbug that did not detect the
fault. Then, using the source repository, they extracted a developer-written test that detects the
fault. Call this suite Tfix . Does Tfix detect more mutants than Tbug ? If so, then we can conclude
that the mutant behaves like a bug.
Results. The authors found that Major-generated mutation tests could detect 73% of the faults.
In other words, for 73% of faults, some mutant will be killed by a test that also detects the fault.
Increasing mutation coverage thus also increases the likelihood of finding faults.
The analogous numbers for branch coverage and statement coverage are, respectively, 50% and
40%. Specifically: the 357 tests that find faults only increase branch coverage 50% of the time,
and they only increase statement coverage 40% of the time. So: improving your test suite often

doesn't get rewarded with a better statement coverage score, and half the time doesn't result in
a better branch coverage score. Conversely, improving statement coverage doesn't help find more
bugs because you're already reaching the fault, but you aren't sensitive to the erroneous state.
The authors also looked at the 27% of remaining faults that are not found by mutants. For 10%
of these, better mutation operators could have helped. The remaining 17% were not suitable for
mutation testing: they were fixed by e.g. algorithmic improvements or code deletion.
Reference. René Just, Darioush Jalali, Laura Inozemtseva, Michael D. Ernst, Reid Holmes, and
Gordon Fraser. Are Mutants a Valid Substitute for Real Faults in Software Testing? In Foundations
of Software Engineering 2014, pp. 654-665. http://www.linozemtseva.com/research/
2014/fse/mutant_validity/

What Does (Graph) Coverage Buy You?


We've talked about graph coverage, notably statement coverage (node coverage) and branch
coverage (edge coverage). They're popular because they are easy to compute. But are they any
good? Reid Holmes (a Waterloo CS prof) and his student Laura Inozemtseva set out to answer
that question.
Answer. Coverage does not correlate with high quality when it comes to test suites. Specifically:
test suites that are larger are better because they are larger, not because they have higher coverage.
Methodology. The authors picked 5 large programs and created test suites for these programs
by taking random subsets of the developer-written test suites. They measured coverage and they
measured effectiveness (defined as % of mutants detected; we've seen above that detecting mutants
is good).
Result. In more technical terms: after controlling for suite size, coverage is not strongly correlated
with effectiveness.
Furthermore, stronger coverage (e.g. branch vs. statement, logic vs. branch) doesn't buy you better
test suites.
Discussion. So why are we making you learn about coverage? Well, it's what's out there, so you
should know about it. But be aware of its limitations.
Plus: if you are not covering some program element, then you obviously get no information about
the behaviour of that element. Low coverage is bad. But high coverage is not necessarily good.
Reference. Laura Inozemtseva and Reid Holmes. Coverage is Not Strongly Correlated with
Test Suite Effectiveness. In International Conference on Software Engineering 2014, pp. 435-445.
http://www.linozemtseva.com/research/2014/icse/coverage/