
A Chaos Engineering System for Live Analysis and Falsification of Exception-handling in the JVM

Long Zhang (1), Brice Morin (2), Philipp Haller (1), Benoit Baudry (1), and Martin Monperrus (1)
(1) KTH Royal Institute of Technology, Sweden
(2) SINTEF, Norway

arXiv:1805.05246v1 [cs.SE] 14 May 2018

Abstract—Software systems contain resilience code to handle the failures and unexpected events that happen in production. It is essential for developers to understand and assess the resilience of their systems. Chaos engineering is a technology that aims at assessing resilience and uncovering weaknesses by actively injecting perturbations in production.

In this paper, we propose a novel design and implementation of a chaos engineering system in Java called ChaosMachine. It provides a unique and actionable analysis of exception-handling capabilities in production, at the level of try-catch blocks. To evaluate our approach, we have deployed ChaosMachine on top of 3 large-scale and well-known Java applications totaling 630k lines of code. Our results show that ChaosMachine reveals both strengths and weaknesses of the resilience code of a software system at the level of exception handling.

I. INTRODUCTION

Chaos engineering is a new field that consists in injecting faults in production systems to assess the resilience of a software system [5]. The core idea of chaos engineering is active probing: the chaos engineering system actively injects a controlled perturbation into the production system and observes the impact of the perturbation as well as the reaction of the system under study [2], [7], [13]. This aids developers in gaining confidence in their system's resilience, and can help to find weaknesses in error handling and disaster recovery routines [3], [5].

The major advantage of chaos engineering is that it is complementary to static analysis and testing, because it 1) generates new knowledge about the system under study and 2) identifies strengths and weaknesses that can only be observed in production [12], [16]. An example of a chaos engineering system is Netflix' ChaosMonkey, which randomly shuts down machines to make sure that the system is capable of spawning new ones automatically.

In this paper, we present the design and implementation of a novel chaos engineering system called ChaosMachine. Its core novelty and uniqueness is that it considers error-handling capabilities at the fine-grained level of programming language exceptions. While bad exception-handling is known to be the cause of up to 92% of critical failures [25], the chaos engineering vision has yet to be applied to exception-handling. The contribution of this paper is ChaosMachine, which is capable of revealing the resilience strengths and weaknesses for every try-catch block executed in production.

ChaosMachine is designed around three components. For each service of the system under study, there is a monitoring sidecar (component #1) and a perturbation injector (component #2) attached to it. The monitoring sidecar is responsible for collecting all information needed for the resilience analysis, and the perturbation injector is able to throw a specific exception at runtime. Component #3 is the chaos controller, which controls all the perturbation injectors and analyzes the information collected by every monitoring sidecar. Eventually, the chaos controller produces a report that gives developers unique and actionable knowledge about their system's resilience.

We evaluate ChaosMachine by deploying it on top of 3 large-scale and well-known open-source Java applications in the domains of file-sharing, content-management system and e-commerce. All the experiments are conducted in a production-ready environment with end-user level workload. The results show that ChaosMachine is capable of analyzing the resilience of 339 try-catch blocks located in 212 Java classes. ChaosMachine successfully identifies the strongly resilient try-catch blocks (18/339) that should remain resilient in subsequent versions. It also identifies the weakest ones, called silent try-catch blocks (34/339), which are possible debug nightmares when developers try to understand failures happening in production.

To sum up, our main contributions are the following.
• The conceptual foundations of chaos engineering in the context of exception-handling in Java: 1) the definition of four categories of try-catch blocks according to their resilience characteristics; 2) a systematic procedure based on fault injection to assess resilience of try-catch blocks.
• A novel system, called ChaosMachine, that assesses exception-handling capabilities in production. ChaosMachine is based on bytecode instrumentation and remote control of fine-grained fault injection. It provides valuable and actionable feedback to the developers. The system is publicly available for future research (https://github.com/kth-tcs/chaosmachine).
• An empirical evaluation of ChaosMachine on 3 real-world Java systems totaling 630k lines of code, containing 339 try-catch blocks executed by the considered production traffic. It shows the effectiveness of ChaosMachine to reveal both strengths and weaknesses of a software system's resilience at the exception-handling
level.

The rest of the paper is organized as follows: Section II presents the background, Section III and Section IV describe the design and evaluation of ChaosMachine. Section V and Section VI discuss the literature and future work.

II. BACKGROUND ON CHAOS ENGINEERING

In this section, we give some background on chaos engineering [5] for readers who are not familiar with the concept.

A. Brief Overview

Let us first start with a metaphor: chaos engineering is vaccination for software. In medicine, people are vaccinated to prevent particular diseases. Administration of the vaccine consists in injecting a potentially dangerous yet controlled preparation that resembles a virus, which helps the body strengthen its defenses against this virus. Chaos engineering is such a technology in software engineering: it consists in injecting a potentially dangerous perturbation that resembles a failure or heavy load, which helps the developer understand and improve resilience of the system [17].

Basiri et al.'s seminal paper [5] about chaos engineering tells us that the main goals of chaos engineering are: 1) to verify error handling capabilities and resilience in production settings; 2) to learn about error handling behavior in production. The mantra of chaos engineering is to experiment with the system in production. To this extent, Schermann et al. [22] consider chaos engineering as one facet of continuous experimentation.

Let us now discuss a concrete example. Consider two micro-services interacting with each other to provide a feature. Those two micro-services have error-handling code to deal with problems in the communication link. Chaos engineering on this system would mean injecting perturbation in the communication link in production. If the system continues to provide the expected service under this perturbation, the developers gain confidence in the error-handling code. If any perturbation breaks the system's provided features, it means that the developers need to fix the error-handling code. This is the meaning behind the primary idea of chaos engineering: "experimenting on a distributed system in order to build confidence in the system's capability to withstand unexpected conditions in production" [1].

B. Core Concepts

Chaos engineering is founded on the following concepts.

A perturbation is a change in the application execution flow, state, or environment, made in a proactive and controlled manner. Working with our prior example, one can inject a timeout into a communication link between two micro-services. An example of perturbation in the system environment is when one cuts down the memory available to the system to see how the application reacts.

A hypothesis is a stipulated relation between a perturbation and some monitored behaviors. In the case of a video streaming service, one monitored behavior can be the number of streams started per second. The behaviors of interest can be caught using a wide range of tools: core business metrics (number of streams), system-level invariants, execution traces, environment metrics like I/O usage, etc. For example, a hypothesis may be: in our web page rendering system, if one stops the cache (perturbation simulating that the cache subsystem is broken), the correct content is still delivered to users (monitored behavior).

An experiment is the process of validating or falsifying a hypothesis. An experiment includes injecting perturbations into the system, monitoring how the system reacts and inferring whether the hypothesis is validated or falsified. In the example above, an experiment for this hypothesis is: 1) to inject an exception into the page rendering service and 2) to monitor the system's reaction. If one still gets the correct output, the experiment validates the hypothesis, indicating that the error-handling code works well under such a perturbation. Otherwise, if there is a difference between the behavior under injection and the normal behavior, the experiment is considered to have falsified the hypothesis.

The blast radius defines the impact level of a perturbation. In practice, it is undesirable that perturbations cause too much trouble to the service. Controlling the potential impact of perturbations and mitigating their side-effects is an essential part of chaos engineering.

C. Basic Chaos Methodology

Per [1], there are four main steps to apply chaos engineering to a system.

The first three steps are related to designing the hypothesis. First of all, one must find metrics which capture the essential performance and correctness characteristics of the system's steady state. The steady state is characterized by a range of metric values, with departure from that range meaning the system should be considered impacted. Secondly, one defines perturbations which simulate possible real-world events, such as connection timeout, hard drive exhaustion, thread death, etc. Then one defines two phases: a control phase and an experimental phase. A control phase is a monitoring period without perturbation, while the experimental phase is a study of the system behaviors under perturbation. At the end of these first three steps, the hypothesis and the experiment design are set.

Fourth, one performs the actual experiment, consisting of injecting perturbations into the system and monitoring the metrics. A report is eventually generated by analyzing the differences in the effect of the perturbation between the control phase and experimental phase. There are two possible outcomes: 1) when a hypothesis is validated, the confidence in the system resilience capability is improved; 2) when a hypothesis is falsified, the issue is reported to the development team, which then has to fix the error-handling code.

D. Concrete Example

Arguably, the most famous chaos engineering system to date is Netflix' ChaosMonkey. ChaosMonkey is used to randomly shut down systems that are part of a fleet providing a service
in production and then to analyze the impact on that system. Developers at Netflix use the number of video starts per second as a metric to define the system's steady state [6]. In this context, 1) a hypothesis is that one instance terminating abnormally has no influence on the number of served videos; 2) a perturbation is ChaosMonkey shutting down a specific instance; 3) a chaos experiment is the whole procedure of applying ChaosMonkey and analyzing the system's behavior to validate or falsify the hypothesis.

III. DESIGN OF A CHAOS SYSTEM FOR EXCEPTIONS

This section presents our system for controlled chaos engineering in the Java Virtual Machine, called ChaosMachine. Its core novelty is that it does chaos engineering at the level of exception handling and try-catch blocks, which is more fine-grained than all chaos engineering systems we are aware of.

[Figure 1: architecture diagram. Labels: End Users, Production traffic, Java Virtual Machine 1 to 3, Service 1 to 3, Monitoring Sidecar, Perturbation Injector, Chaos Controller, Application & Chaos logs, Report, Developer Team. Legend: Normal Application Communication, Chaos Perturbation Commands, Chaos Machine Report, Monitoring Information.]

Fig. 1. The components of ChaosMachine
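Before turning to the design details, the control-phase versus experimental-phase comparison of Section II-C can be made concrete with a minimal sketch. It assumes a single numeric metric (for instance, video starts per second) and a steady state defined as the range observed in the control phase, widened by a tolerance. The class name `SteadyStateCheck`, its methods, and the tolerance parameter are hypothetical and are not part of ChaosMachine.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: a steady state is a metric range learned in the control phase. */
final class SteadyStateCheck {

    /** Closed interval [min, max] of metric values considered steady. */
    static final class Range {
        final double min;
        final double max;

        Range(double min, double max) {
            this.min = min;
            this.max = max;
        }

        boolean contains(double value) {
            return value >= min && value <= max;
        }
    }

    /** Control phase: derive the range from unperturbed samples, widened by a relative tolerance. */
    static Range learnSteadyState(List<Double> controlSamples, double tolerance) {
        if (controlSamples.isEmpty()) {
            throw new IllegalArgumentException("no control samples");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double sample : controlSamples) {
            min = Math.min(min, sample);
            max = Math.max(max, sample);
        }
        double slack = (max - min) * tolerance;
        return new Range(min - slack, max + slack);
    }

    /** Experimental phase: the hypothesis holds only if every perturbed sample stays in range. */
    static boolean hypothesisHolds(Range steady, List<Double> experimentSamples) {
        for (double sample : experimentSamples) {
            if (!steady.contains(sample)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Range steady = learnSteadyState(Arrays.asList(100.0, 110.0, 105.0), 0.1);
        System.out.println("validated: " + hypothesisHolds(steady, Arrays.asList(101.0, 109.5)));
        System.out.println("validated: " + hypothesisHolds(steady, Arrays.asList(101.0, 40.0)));
    }
}
```

A real deployment would likely track several metrics and use a statistically sounder range than min/max; the sketch only shows where validation and falsification come from.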

A. Overview

The goal of ChaosMachine is twofold: 1) falsify hypotheses and 2) infer hypotheses. The former is the classical goal of chaos engineering systems [5], and the latter is the key contribution of this paper.

Hypotheses. ChaosMachine considers error-handling hypotheses in Java applications. We define the following four chaos engineering hypotheses at the level of try-catch blocks, from the most beneficial to the most problematic:
• Resilience hypothesis. A try-catch block is said to be resilient if the observable behavior of the catch block, executed upon exception, is equivalent to the observable behavior of the try-block when no exception happens [8].
• Observability hypothesis. A try-catch block is said to be observable if an exception caught in the catch block results in user-visible effects.
• Debug hypothesis. A try-catch block is said to be debuggable if an exception caught in the catch block results in an explicit message in the application logs.
• Silence hypothesis. A try-catch block is said to be silent if it fails to provide the expected behavior upon exception while providing no troubleshooting information whatsoever, i.e., it is neither observable nor debuggable. If the silent try-catch block later causes a user-visible failure, it would be extremely hard for the developers to understand that the root cause is the silent try-catch block, and to fix the failure accordingly.

Experiments. ChaosMachine performs two kinds of experiments:
• Falsification experiments. They aim at validating or falsifying a hypothesis about the behavior of a try-catch block. This hypothesis can be stated upfront by developers or can be inferred through exploration experiments.
• Exploration experiments. They aim at monitoring the behavior of try-catch blocks under perturbation in order to infer new hypotheses.

Modes. When ChaosMachine performs exploration experiments, it is said to be in exploration mode. When ChaosMachine performs falsification experiments, it is in falsification mode. Finally, when ChaosMachine does not introduce chaos, it is simply in observation mode.

B. Input to ChaosMachine

ChaosMachine works on arbitrary software written in Java; no manual change is required in the code. To use ChaosMachine, the application is deployed in production as usual, and ChaosMachine is attached to it in an automated manner, in observation mode by default. Optionally, developers can also feed ChaosMachine with manually-written hypotheses.

C. Architecture of ChaosMachine

Figure 1 presents the main components of ChaosMachine and their interactions. ChaosMachine is meant to be deployed on any modern Internet application, such as search engines or transaction systems. Those applications typically are distributed over several different servers, where the servers either provide different services (as shown in the figure with three different services), or provide redundancy and elasticity for the same service. Per the best practices, and without loss of generality, all services are considered deployed in separate virtual machines for the sake of isolation.

ChaosMachine attaches a monitoring sidecar and a perturbation injector to each service. The monitoring sidecar (Section III-C1) collects the information needed to infer hypotheses and study the outcome of chaos experiments. The perturbation injector (Section III-C2) is responsible for injecting perturbations according to a given perturbation model. The chaos controller (Section III-C3) is a standalone component that is separated from the application services, and it has three responsibilities: 1) controlling the behavior of perturbation injectors; 2) aggregating monitoring information from each monitoring sidecar; 3) generating a report for the developers about the quality of error-handling in their code, which contains novel and actionable feedback about error-handling in production. We further describe these outputs in Section III-D.
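The four hypotheses of Section III-A can be read as a small decision procedure over what is observed when an exception is injected into a try-catch block. The sketch below is one possible reading of those definitions, not the authors' implementation: the `Observation` fields and the class name `HypothesisClassifier` are assumptions introduced for illustration, and a block may satisfy several hypotheses at once.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of the four try-catch hypotheses as a classification rule. */
final class HypothesisClassifier {

    enum Hypothesis { RESILIENCE, OBSERVABILITY, DEBUG, SILENCE }

    /** What was observed while an exception was injected (illustrative fields only). */
    static final class Observation {
        final boolean behaviorEquivalent;      // same observable behavior as the unperturbed run
        final boolean userVisibleEffect;       // e.g. an error shown to the user
        final boolean loggedInApplicationLogs; // explicit message in the application logs

        Observation(boolean behaviorEquivalent, boolean userVisibleEffect,
                    boolean loggedInApplicationLogs) {
            this.behaviorEquivalent = behaviorEquivalent;
            this.userVisibleEffect = userVisibleEffect;
            this.loggedInApplicationLogs = loggedInApplicationLogs;
        }
    }

    /** Returns every hypothesis that the observation supports. */
    static Set<Hypothesis> classify(Observation o) {
        EnumSet<Hypothesis> result = EnumSet.noneOf(Hypothesis.class);
        if (o.behaviorEquivalent) {
            result.add(Hypothesis.RESILIENCE);
        }
        if (o.userVisibleEffect) {
            result.add(Hypothesis.OBSERVABILITY);
        }
        if (o.loggedInApplicationLogs) {
            result.add(Hypothesis.DEBUG);
        }
        // Silent: expected behavior is lost, and nothing is visible or logged.
        if (!o.behaviorEquivalent && !o.userVisibleEffect && !o.loggedInApplicationLogs) {
            result.add(Hypothesis.SILENCE);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(classify(new Observation(false, true, true)));
        System.out.println(classify(new Observation(false, false, false)));
    }
}
```

How "equivalent behavior" and "user-visible effect" are measured is application-specific; the evaluation sections refine these notions per application.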
TABLE I
INTERPLAY BETWEEN THE 3 COMPONENTS AND THE 3 MODES OF CHAOSMACHINE

Component | Observation Mode | Exploration Mode | Falsification Mode
Monitoring Sidecar | Monitors all the relevant execution information | Monitors how the system reacts according to a perturbation | Monitors whether a hypothesis is falsified
Perturbation Injector | Not active | Injects a specific perturbation | Injects a specific perturbation
Chaos Controller | Deactivates all the perturbation injectors to keep the system running as usual | Controls perturbation injectors to conduct a sequence of chaos experiments so as to infer new hypotheses | Controls perturbation injectors according to a specific hypothesis
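The rule encoded by Table I, that perturbation injectors may only be active outside observation mode, can be written down as a small enum. This is an illustrative sketch; the names `ChaosModes`, `Mode`, and `injectorsMayBeActive` are hypothetical and do not come from the ChaosMachine code base.

```java
/** Hypothetical sketch of the three modes and the injector rule from Table I. */
final class ChaosModes {

    enum Mode {
        OBSERVATION(false),   // monitoring only, all injectors deactivated
        EXPLORATION(true),    // injections used to infer new hypotheses
        FALSIFICATION(true);  // injections driven by one specific hypothesis

        private final boolean injectorsMayBeActive;

        Mode(boolean injectorsMayBeActive) {
            this.injectorsMayBeActive = injectorsMayBeActive;
        }

        boolean injectorsMayBeActive() {
            return injectorsMayBeActive;
        }
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> injectors may be active: " + mode.injectorsMayBeActive());
        }
    }
}
```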

1) Monitoring Sidecars: Chaos engineering consists of studying the influence of perturbations on the system behavior, as captured by metrics [1]. Example metrics include the number of streamed videos for Netflix and the HTTP response code for web applications. These metrics are part of the evaluation of whether the system can provide acceptable services, even under perturbation. The main role of the monitoring sidecars is to collect these metrics at runtime.

In order to gather sufficient information about error-handling in Java applications, ChaosMachine uses the following monitoring scheme. For each try-catch block found in code as it is loaded into the JVM, it records: 1) its position in the code, 2) the type of the caught exception, 3) the number of executions (both in observation mode and in exploration mode), 4) whether exceptions are recorded in application logs.

Monitoring sidecars also collect the following generic metrics:
• The classes that have been loaded so far into the JVM.
• ChaosMachine's logs, which include information about when and where an injection has happened, together with the corresponding stacktrace.
• The exit status, i.e., whether a service has exited normally or not (crash).
• A set of operating system metrics including CPU usage, memory usage, and peak thread number.
• The application logs.

2) Perturbation Injectors: The main responsibility of a perturbation injector is to generate a specific perturbation when the chaos controller sends the corresponding command, i.e., throwing an exception at the beginning of a try-catch block, resulting in short-circuiting the try-block. An injector is added to every try-block, using automated code instrumentation. Each injector can be activated (in exploration mode or in falsification mode) and deactivated individually.

Listing 1. The Application Code is Automatically Transformed for Injecting Perturbations
 1 try {
 2   // injection point #1, type: Exception1
 3   if (perturbationInjector1.isActive()) {
 4     throw new Exception1();
 5   }
 6   if (perturbationInjector2.isActive()) {
 7     throw new Exception3();
 8   }
 9   ...original code...
10 } catch (Exception1 e1) {
11   ...original code...
12   try {
13     if (perturbationInjector3.isActive()) {
14       throw new Exception2();
15     }
16     ...original code...
17   } catch (Exception2 e2) {
18     ...original code...
19   }
20 } catch (Exception3 e3) {
21   ...original code...
22 }

Listing 1 gives an example of how this perturbation injector works. There are two try-blocks in this code snippet: Exception1 and Exception3 might happen in the first try block during the execution of the omitted code at line 9, and Exception2 might happen in the second try block at line 16. Consequently, there are three injection points in total, corresponding to each caught exception type. When an injector is activated in exploration or falsification mode, it throws the corresponding exception. Each injector can be controlled separately.

3) Chaos Controller: The chaos controller has two goals: 1) infer new hypotheses and 2) validate or falsify existing hypotheses. The chaos controller activates or deactivates specific perturbation injectors at specific points in time, according to a hypothesis being inferred or falsified. Then it analyzes the information recorded by the monitoring sidecars. From this, the chaos controller reports whether the injected perturbation has broken the hypothesis under consideration.

The chaos controller is also responsible for containing the blast radius. It decides how many perturbation injectors are active concurrently, as well as how long the perturbation injectors are active.

The roles of the above three components in different modes are shown in Table I.

D. Output for the developer

ChaosMachine produces a report for the developer, containing the hypotheses validated or falsified for each try-catch block, sorted according to their criticality. This provides developers with an overview of the resilience of their system. Silent catch blocks are usually the ones that require the most urgent attention, as they hurt the resilience and/or debuggability of the system. Resilient catch blocks help the resilience, by keeping the system running even when certain exceptions happen.
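The interplay between injectors and controller described in Sections III-C2 and III-C3 can be sketched in plain Java. This is a simplified, in-process illustration under stated assumptions: the real system instruments bytecode and talks to the controller over sockets, whereas here the injector is an ordinary object, the controller caps the number of concurrently active injectors as a stand-in for blast-radius control, and all names (`InjectionSketch`, `register`, `activate`, `readConfig`) are hypothetical.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of injectors toggled by a controller with a concurrency cap. */
final class InjectionSketch {

    /** One switch per injection point; the instrumented try-block consults it. */
    static final class PerturbationInjector {
        private final AtomicBoolean active = new AtomicBoolean(false);

        boolean isActive() {
            return active.get();
        }
    }

    /** Controller that limits how many injectors are active at the same time. */
    static final class ChaosController {
        private final Map<String, PerturbationInjector> injectors = new ConcurrentHashMap<>();
        private final int maxConcurrentlyActive;

        ChaosController(int maxConcurrentlyActive) {
            this.maxConcurrentlyActive = maxConcurrentlyActive;
        }

        PerturbationInjector register(String injectionPointId) {
            return injectors.computeIfAbsent(injectionPointId, id -> new PerturbationInjector());
        }

        /** Activates the injector unless that would exceed the cap; returns whether it is active. */
        synchronized boolean activate(String injectionPointId) {
            PerturbationInjector injector = injectors.get(injectionPointId);
            if (injector == null) {
                return false;
            }
            if (injector.isActive()) {
                return true;
            }
            long activeCount = injectors.values().stream()
                    .filter(PerturbationInjector::isActive)
                    .count();
            if (activeCount >= maxConcurrentlyActive) {
                return false;
            }
            injector.active.set(true);
            return true;
        }

        synchronized void deactivate(String injectionPointId) {
            PerturbationInjector injector = injectors.get(injectionPointId);
            if (injector != null) {
                injector.active.set(false);
            }
        }
    }

    /** Shape of an instrumented method: the injection happens at the start of the try-block. */
    static String readConfig(PerturbationInjector injector) {
        try {
            if (injector.isActive()) {
                throw new IOException("injected by chaos experiment");
            }
            return "config-loaded";   // stands in for the original try-block code
        } catch (IOException e) {
            return "default-config";  // stands in for the original catch-block code
        }
    }

    public static void main(String[] args) {
        ChaosController controller = new ChaosController(1);
        PerturbationInjector injector = controller.register("Config/read,IOException,0");
        System.out.println(readConfig(injector));
        controller.activate("Config/read,IOException,0");
        System.out.println(readConfig(injector));
    }
}
```

Limiting how long an injector stays active, the second blast-radius control mentioned above, could be layered on top with a scheduled `deactivate` call.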
E. Implementation

ChaosMachine is written in Java in 2.1k lines of code. Both the monitoring sidecars and the perturbation injectors are woven into the application services using a JVM agent [11]. The agent adds the monitoring and injection code using binary code transformation with the ASM library (see http://asm.ow2.org). The chaos controller is a standalone service communicating with the monitoring sidecars and the injectors using sockets. For the sake of open science, the code is made publicly available at https://github.com/kth-tcs/chaosmachine.

IV. EVALUATION

In our evaluation, we apply ChaosMachine to 3 different real-world Java projects, including TTorrent (a peer-to-peer file downloading tool based on the BitTorrent protocol), BroadLeaf (a web-based commercial system) and X-Wiki (a web-based wiki system). Following the chaos engineering principles, all applications are set up in a production environment. In the following, we present the protocol, experimental results, case studies, and discussions for each project.

Once the application is set up, ChaosMachine verifies and evaluates try-catch blocks following the procedure described in Section III-A.

A. Evaluation on TTorrent

1) Overview of the BitTorrent protocol: BitTorrent is a peer-to-peer data transfer protocol, which is widely used to download files over the Internet. The core concept of the BitTorrent protocol is that users who want to download a file also serve to other users the file parts that they have already downloaded. There are 4 parts in a typical file transfer scenario with the BitTorrent protocol: 1) a torrent file which includes information about the shared files and the tracker servers; 2) several tracker servers which receive client registrations and announce resource information to new clients; 3) clients who want to download files, and then get the torrent files and register their download status with the tracker server; 4) clients who have already downloaded files and provide pieces of the files to others (called the "seeders").

2) Experiment protocol: In this experiment, we consider the BitTorrent client called TTorrent (version 1.5), written in Java. It is built as a .jar file and can be used on the command line. We attach ChaosMachine to this client, and then use it to download ubuntu-14.04.5-server-i386.iso, a Linux distribution installer of 623.9MB from the Canonical company. This means that we use tracker servers from somewhere else in the Internet, and use many seeders that are providing pieces of the downloaded file.

First, we classify try-catch blocks in TTorrent by refining the four hypotheses discussed in Section III-A, combining them with monitoring metrics specific to this application domain.
• Resilient try-catch block. Despite injected exceptions in this block, the client successfully downloads the file and exits normally. The chaos controller also detects some error messages in the application log. Even though there is an exception thrown at the very beginning of the try-catch block, the application still fulfills the user's requirement correctly. This kind of try-catch block contributes to the application's resilience, as the application still supports the users' requests even though the entire logic of the try-block has been discarded.
• Observable try-catch block. A try-catch block is said to be observable if the client directly crashes or exits with an error message under perturbations, i.e., the perturbation in this try-catch block causes user-visible behaviors of the client.
• Debuggable try-catch block. A try-catch block is said to be debuggable if the system metrics become abnormal or the exception information is captured in application logs when an exception is injected. The information is useful for developers to debug and improve the system's resilience.
• Silent try-catch block. When an exception occurs in this block, the client does not download the file and just keeps running indefinitely. Worse still, there is no error information about the injected error. This is a bad case for both users and developers: users are not made aware that the download is stalled and developers have no feedback whatsoever about the problem. Developers can improve such blocks so as to be able to detect and debug such a problem if it happens naturally in production.

Then, in an initial observation mode, the client downloads the full file once until successful completion. During this phase, ChaosMachine analyzes the client's behavior.

Next, for the covered try-catch blocks, ChaosMachine executes the procedure defined in Section III-A while re-downloading the file, and gathers the data shown in Table II. In exploration mode, the perturbed clients might not be able to exit normally, so ChaosMachine keeps the client alive for at most 300 seconds. After this delay, the client is killed and information is logged indicating that the client was killed after this timeout.

3) Experimental results: Table II reads as follows: there are 27 try-blocks covered by the production traffic, i.e., the code in the try-blocks is executed while the client is downloading the file. Each row contains one try-block's information. The first column is the basic information about each try-catch block, including the class and method names, caught exception type and the index number. The second column records the number of executions, in both the observation mode and exploration mode. The third column indicates whether the developers have logged the exception in their application logs when such an exception is caught. The fourth column shows whether the client has successfully downloaded the file when exceptions are injected in this try-block. The fifth column records the client's exit status. The sixth column indicates differences in system metrics (if any) between the observation mode and the exploration mode. Finally, the last four columns indicate how this try-catch block meets our pre-defined four hypotheses.
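The 300-second bound from the experiment protocol, and the three exit statuses that appear in Table II (normally, crashed, stalled), can be illustrated with a timeout-guarded run. The sketch below is an assumption-laden stand-in: it runs the "client" as an in-process `Callable` rather than as a separate JVM process, and the names `BoundedRun`, `ExitStatus`, and `runWithTimeout` are hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: run a perturbed client under a time limit and record how it ended. */
final class BoundedRun {

    enum ExitStatus { NORMALLY, CRASHED, STALLED }

    static ExitStatus runWithTimeout(Callable<Void> client, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<Void> run = executor.submit(client);
        try {
            run.get(timeout, unit);
            return ExitStatus.NORMALLY;
        } catch (ExecutionException e) {
            // The client terminated with an uncaught exception.
            return ExitStatus.CRASHED;
        } catch (TimeoutException e) {
            // The client is still running after the limit: stop it and report a stall.
            run.cancel(true);
            return ExitStatus.STALLED;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the client", e);
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(runWithTimeout(() -> null, 1, TimeUnit.SECONDS));
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(5_000);
            return null;
        }, 100, TimeUnit.MILLISECONDS));
    }
}
```

With an external client process, the same structure would apply using `Process.waitFor(long, TimeUnit)` followed by `Process.destroyForcibly()`.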
TABLE II
THE RESULTS OF CHAOS EXPERIMENTATION WITH EXCEPTION INJECTION ON 27 TRY-CATCH BLOCKS IN THE TTORRENT BITTORRENT CLIENT

Try-catch block information Execution (Anal./Expl.) Logged Downl. Exit status System metrics RH OH DH SH
BEValue/getBytes,ClassCastException,0 41 / 1 yes no crashed - x x
BEValue/getNumber,ClassCastException,0 15 / 1 yes no crashed - x x
BEValue/getString,ClassCastException,0 37 / 1 yes no crashed - x x
BEValue/getString,UnsupportedEncodingException,1 37 / 1 yes no crashed - x x
ClientMain/main,CmdLineParser$OptionException,0 1/1 yes no crashed - x x
ClientMain/main,Exception,1 1/1 yes no crashed - x x
Announce/run,AnnounceException,0 1 / 60 yes no stalled - x x
Announce/run,InterruptedException,2 1 / 760 no yes normally more threads x
Announce/run,InterruptedException,3 1/1 no yes normally no diff x
Announce/run,AnnounceException,4 1/1 yes yes normally no diff x x
Announce/stop,InterruptedException,0 1/1 no yes normally no diff x
ConnectionHandler/run,SocketTimeoutException,0 1290 / 1030 no yes normally no diff x
ConnectionHandler/run,IOException,1 1290 / 1 yes yes stalled higher cpu x
ConnectionHandler/run,InterruptedException,2 1290 / 2 yes no stalled no diff x
ConnectionHandler/stop,InterruptedException,0 1/1 no yes normally no diff x
ConnectionHandler$ConnectorTask/run,Exception,0 50 / 50 yes no stalled no diff x
Handshake/craft,UnsupportedEncodingException,0 50 / 48 yes no stalled no diff x
PeerExchange/send,InterruptedException,0 90763 / 210 no no stalled no diff x
PeerExchange/stop,InterruptedException,0 46 / 44 no yes normally no diff x
PeerExchange$OutgoingThread/run,InterruptedException,0 90805 / 32984841 no no stalled higher cpu x x
PeerExchange$OutgoingThread/run,InterruptedException,1 90763 / 288 no no stalled no diff x
PeerExchange$OutgoingThread/run,IOException,2 90805 / 43 yes no stalled no diff x
PeerExchange$OutgoingThread/run,IOException,3 90763 / 46 yes no stalled no diff x
Piece/validate,NoSuchAlgorithmException,0 2564 / 5427 yes no stalled higher cpu x
HTTPAnnounceRespMessage/parse,InvalidBEncodingException,0 3 / 30 yes no stalled no diff x
HTTPAnnounceRespMessage/parse,InvalidBEncodingException,1 3 / 30 yes no stalled no diff x
HTTPAnnounceResponseMessage/parse,UnknownHostException,2 3 / 30 yes no stalled no diff x
total: 27/52 460626 / 32992950 18/27 8/27 7/27 4/27 6/27 7/27 20/27 3/27

Since exceptions change the execution flow of the application, the number of executions differs between analysis mode and exploration mode.

Take the first row as an example. It shows that there is a try-catch block in the getBytes method inside the BEValue class, which handles a ClassCastException. Through the entire process of downloading the file, it is executed 41 times. When the perturbation injector throws a ClassCastException at the beginning of the try-block, the client does not download the file and crashes. The chaos controller also detects that a specific error message is logged in the application log before its crash. Based on these behaviors, this try-catch block validates the observability hypothesis (OH) and debug hypothesis (DH).

In total, there are 27 try-catch blocks covered by this file-download operation in production. Some of them are executed only once, others up to 90805 times (cf. column Execution Anal. of Table II). This information is very important for the developer. Thanks to ChaosMachine, the developer is able to identify: 6 resilient try-catch blocks, 7 observable try-catch blocks, 20 debuggable try-catch blocks, and 3 silent try-catch blocks.

4) Case studies: In the following we detail 4 case studies.

Listing 2. ClassCastException in BEValue/getBytes
public byte[] getBytes() throws InvalidBEncodingException {
  try {
    return (byte[])this.value;
  } catch (ClassCastException cce) {
    throw new InvalidBEncodingException(cce.toString());
  }
}

Listing 2 shows a part of the getBytes method, containing a single try-catch statement. This try-catch statement is executed 41 times. When perturbed with an exception injection, the chaos controller verifies that two core hypotheses are validated in production: the exception is logged, and the client exits with an error status. The developer has no further action to take because this try-catch block is both observable and debuggable.

Listing 3. InterruptedException in Announce/run
while (!this.stop) {
  ...
  try {
    Thread.sleep(this.interval * 1000);
  } catch (InterruptedException ie) {
    // Ignore
  }
}

Listing 3 shows the run method in class Announce. The try-block is a piece of code running in a sub-thread. The announce thread starts by making the initial "started" announce request to register on the tracker and get an interval value. In the observation mode, the try-catch block is executed once. However, in the exploration mode with exception injection, the try-catch block is executed 760 times. Indeed, due to the skip of the Thread.sleep, the while loop runs more times
before reaching its objective. When the perturbation injector injects the exception, the catch-block simply “swallows” this exception and does not do anything to handle it. This results in using more computing resources. As shown in the comment, the developer knows about this behavior. However, thanks to ChaosMachine, she is made aware that ignoring the exception is not good for performance, and she is even given a quantitative measurement (per the system metrics collected by the monitoring sidecar).

Listing 4. AnnounceException in Announce/run

    if (!this.forceStop) {
        ...

        try {
            this.getCurrentTrackerClient().announce(event, true);
        } catch (AnnounceException ae) {
            logger.warn(ae.getMessage());
        }
    }

Listing 4 is also from the run method in the Announce class. The exception type is AnnounceException and this try-catch block is executed once in the observation mode, and once in the exploration mode. When the perturbation injector injects the exception, the file is still correctly downloaded. Once the client finishes the download, it exits with a normal exit code, and some error messages about this exception appear in the application log. In this case, the try-catch block successfully prevents the AnnounceException from breaking the system. Thanks to ChaosMachine, the developer has improved confidence that the exception-handling design here is correct: it fully works in production.

Listing 5. InterruptedException in PeerExchange/send

    public void send(PeerMessage message) {
        try {
            this.sendQueue.put(message);
        } catch (InterruptedException ie) {
            // Ignore, our send queue will only block if it contains
            // MAX_INTEGER messages, in which case we’re already in big
            // trouble, and we’d have to be interrupted, too.
        }
    }

Listing 5 shows method send in class PeerExchange. It is executed 90763 times in the observation mode and 210 times in the exploration mode. In this case, when the perturbation injector injects an InterruptedException, the client just keeps running until some external entity (the user or ChaosMachine) kills the process. No information is logged in the application logs. This means that, when this exception happens naturally, users have absolutely no debug information to give to developers. Here, ChaosMachine helps the developer to identify “nightmare” debug cases in the form of purely silent try-catch blocks. Based on the ChaosMachine report, the developer is urged to change the exception-handling behavior.

5) Falsification on Next Version: It is of utmost importance that the resilience capabilities do not degrade over time. We try to falsify all hypotheses in a version of TTorrent (1.6) that is subsequent to the analyzed one, with the same protocol. The result is that none of the hypotheses inferred on version 1.5 is falsified on version 1.6, which means that the resilient try-catch blocks are still capable of handling unanticipated exceptions and keeping the system steady.

Main result of the TTorrent experiment: In a real-world production usage, ChaosMachine identifies 6 resilient try-catch blocks and 3 silent ones in the TTorrent client. Each silent try-catch block indicates a potential debug case that would be extremely difficult to fix (no visible behavior, no log can be provided by the user). ChaosMachine precisely detects those silent try-catch blocks and reports them to the developer. In subsequent versions, ChaosMachine verifies that the 6 resilient try-catch blocks remain resilient thanks to falsification experiments.

B. Evaluation on XWiki

1) Introduction of XWiki: XWiki is a widely-used open-source wiki system developed in Java, and has been active over the past 14 years. XWiki requires external dependencies like a database server and a web application server.

2) Experiment Protocol: We use a full-fledged production setup of XWiki version 9.11.1, which is deployed into Tomcat-8.5.29 and configured to connect to a MySQL server. We collect end-user traffic performed through a web browser: 1) visit pages, 2) log in with a username and a password, 3) add some comments on the main page and on a specific user page, 4) update personal page information and 5) log out. We record every HTTP user request, as well as the associated HTTP responses (including response code, header and body). This end-user traffic is replayed on the production system to perform each experiment, per the chaos engineering practices observed in previous work [20]. First, ChaosMachine runs the observation mode to monitor all the dynamic try-catch information and the normal behavior without any perturbation. Then, an exploration mode is activated. ChaosMachine activates the corresponding perturbation injector for each covered try-catch block. The injector is active for 1 minute and ChaosMachine collects the HTTP responses, which are then compared to those collected in observation mode.

In XWiki’s experiment, we define the four classes of try-catch blocks as follows:
• Resilient try-catch block. Despite injected exceptions in this block, users still get the expected response content or succeed in adding comments and updating personal profile.
• Observable try-catch block. A try-catch block is said observable if the response code changes from “200 OK”
to others. Consequently, users also get an error page or a request redirection under the corresponding exceptions.
• Debuggable try-catch block. A try-catch block is said debuggable if the exception information is captured in application logs when an exception is injected.
• Silent try-catch block. A silent try-catch block only causes the response body to change while the response code stays the same as usual, and there is no error information about the injected exception in application logs.

3) Experimental Results: We recorded 290 user requests: 276 GET requests and 14 POST ones. This traffic contains: 97 GET requests directly on downloading resources, 178 GET requests and 10 POST requests on rendering pages, 4 POST requests on logging in, adding comments, updating user data, and 1 GET request on logging out.

In total, 1567 try-catch blocks are registered in ChaosMachine, and 268 of them are covered by the traffic we recorded. Table III summarizes the data aggregated over packages. The first column is the abbreviated package name. The second column displays the number of try-catch blocks that are covered by the production traffic. The third column is the total number of try-catch block executions in the observation mode. The fourth column is the total number of executions in the exploration mode. The fifth column is the number of catch blocks which capture the injected exception in the application log. Finally, the last four columns show the classification of try-catch blocks given by ChaosMachine, including: 7 resilient try-catch blocks, 33 observable try-catch blocks, 233 debuggable try-catch blocks, and 23 silent try-catch blocks.

TABLE III
RESULTS ON CHAOS EXPERIMENTATION ON 268 TRY-CATCH BLOCKS IN XWIKI COVERED BY THE CONSIDERED WORKLOAD

Packages       Covered    Executions (Anal. / Expl.)   RH   OH   DH    SH
org/xwiki/a*   1          273 / 273                    0    0    1     0
org/xwiki/c*   20         112968 / 119544              0    6    20    0
org/xwiki/d*   2          855 / 1398                   0    0    2     0
org/xwiki/e*   11         20882 / 99204                0    1    11    0
org/xwiki/f*   23         44813 / 222                  0    0    23    0
org/xwiki/i*   8          1142 / 280                   0    0    8     0
org/xwiki/l*   12         295530 / 73048               0    1    12    0
org/xwiki/m*   9          38360 / 37739                0    1    9     0
org/xwiki/n*   10         62 / 190837                  0    0    8     2
org/xwiki/o*   2          43753 / 68154                0    0    2     0
org/xwiki/p*   4          5403 / 3075                  0    0    4     0
org/xwiki/q*   3          262 / 142                    0    0    3     0
org/xwiki/r*   93         1137420 / 272944             5    7    70    14
org/xwiki/s*   15         20522 / 31826                2    5    15    0
org/xwiki/t*   2          83 / 81                      0    0    2     0
org/xwiki/u*   20         13795 / 6229                 0    8    16    1
org/xwiki/v*   5          3201 / 831                   0    2    5     0
org/xwiki/w*   21         2526 / 3140                  0    2    16    5
org/xwiki/x*   7          890 / 580                    0    0    6     1
Total          268/1567   1742740 / 909547             7    33   233   23

Take the row “org/xwiki/s*” as an example. For all the try-catch blocks in the package whose name begins with org/xwiki/s, there are 15 try-catch blocks covered by this set of chaos experiments. Under normal conditions, these 15 try-catch blocks are executed 20522 times. When ChaosMachine activates the corresponding perturbation injectors in these try-catch blocks, the same blocks are executed 31826 times in total. After classification by ChaosMachine, the developer knows that: 1) 2 try-catch blocks satisfy the resilience hypothesis, 2) 5 try-catch blocks satisfy the observable hypothesis, 3) 15 try-catch blocks satisfy the debug hypothesis and 4) none of the try-catch blocks satisfy the silence hypothesis.

With the help of this report, developers gain more knowledge on XWiki’s error-handling capabilities in production. They are also encouraged to take action: 1) go over the silent try-catch blocks to confirm whether they need to record more information when an exception occurs and 2) focus on the try-catch blocks which have a serious impact on the system’s steady state, i.e. the observable ones. For example, an exception in a specific try block may lead the system to generate a 500 response code instead of 200. As a result, the response content also changes to an error page for users. The chaos experiment provides more clues for developers to review the try-catch block and helps them improve the fault tolerance ability.

4) Case studies: In the following, we detail two interesting cases found in the XWiki experiment.

Listing 6. XWikiException in XWikiCachingRightService/authenticateUser

    try {
        XWikiUser user = context.getWiki().checkAuth(context);
        if (user != null) {
            userReference = resolveUserName(user.getUser(),
                new WikiReference(context.getWikiId()));
        }
    } catch (XWikiException e) {
        LOGGER.error("Caught exception while authenticating user.", e);
    }

Listing 6 shows part of method authenticateUser in class XWikiCachingRightService. There is only one try-catch block in this method. It is executed 151 times in observation mode and 153 times with perturbation. When the exception occurs, this catch block logs the error information. According to the monitored behavior, this perturbation actually has a visible impact on certain requests: it leads to an HTTP response code 302 (Redirect) instead of 200. Per our definition, this try-catch block satisfies both the observability and the debug hypothesis.

Listing 7 shows part of method runInternal in class DefaultSolrIndexer’s private inner class Resolver. This try block is executed 11 times in observation mode and is executed only once with perturbation. ChaosMachine identifies that this perturbation does not influence the output of any request. The monitoring sidecar also detects that the exception is caught in the application log. As we can see from the source code, developers log the exception information and also assign queueEntry to RESOLVE_QUEUE_ENTRY_STOP in the catch block, which is a valid error-handling strategy in this context. Through the chaos experiment, the developers gain more confidence that this exception-handling design actually
works in production.

Listing 7. InterruptedException in DefaultSolrIndexer$Resolver/runInternal

    try {
        queueEntry = resolveQueue.take();
    } catch (InterruptedException e) {
        logger.warn("The SOLR resolve thread has been interrupted", e);
        queueEntry = RESOLVE_QUEUE_ENTRY_STOP;
    }

    if (queueEntry == RESOLVE_QUEUE_ENTRY_STOP) {
        // Stop the index thread: clear the queue and send the stop signal without blocking.
        indexQueue.clear();
        indexQueue.offer(INDEX_QUEUE_ENTRY_STOP);
        break;
    }

Main result of the XWiki experiment: ChaosMachine analyzes 268 try-catch blocks and identifies 7 that satisfy the resilience hypothesis, and 23 try-catch blocks that are silent, violating the silence hypothesis. This experiment shows that our prototype implementation of ChaosMachine scales to a system with 440k LOC and 1567 try-catch blocks loaded in the JVM.

C. Evaluation on Broadleaf

1) Introduction of Broadleaf: Broadleaf Commerce is a series of open-source products in an eCommerce platform written in Java. There are three components in Broadleaf which can be deployed separately into different servers: administration website, end-user shopping website and data fetching APIs.

2) Experiment protocol: We choose to conduct chaos experiments on Broadleaf version 5.0.0-GA. It provides an embedded Tomcat server, a HyperSQL database and a startup script, which simplifies deployment. For this experiment, we focus on the end-user shopping website. Similar to the experiments on XWiki, we deploy Broadleaf and randomly interact with the website system, including: 1) visiting product pages, 2) logging in with a username and a password, 3) adding products to a shopping cart, 4) checking out, and 5) logging out. As before, we record every user request and its associated response. In this experiment, we define resilient, observable, debuggable and silent try-catch blocks as per the XWiki experiment in Section IV-B2, since they are both web systems with the same core characteristics.

3) Experimental results: The recorded operations include 384 requests in total. There are 362 requests responsible for directly downloading files, all of which are GET requests. There are 15 GET requests about rendering pages. All of the 6 functional requests are POST, including logging in, updating the shopping cart, and checking out. The request for logging out is a request of type GET. These requests are replayed continuously by GoReplay during the experiments, and the time to finish this sequence of operations is less than 90 seconds. First, ChaosMachine runs in observation mode for 90 seconds to gather the system’s normal behaviors. At the same time, it also obtains information about covered try-catch blocks. Then, for each covered try-catch block, ChaosMachine runs in exploration mode for 90 seconds. The results are generated and discussed next.

Table IV summarizes the results. The recorded traffic covers 44 try-catch blocks. In the first step of the experiment, we leave ChaosMachine running automatically. In this case, it does not detect any resilient try-catch blocks, which leads us to do some further analysis. In the second step, we manually analyze all logs generated by monitoring sidecars. This analysis reveals that some of the diff-logs are semantically equivalent, but the monitoring sidecar marks the output as different if it is not verbatim identical.

TABLE IV
RESULTS ON CHAOS EXPERIMENTATION ON 44 TRY-CATCH BLOCKS IN BROADLEAF COVERED BY THE CONSIDERED WORKLOAD

Packages                  Covered   Executions (Anal. / Expl.)   RH   OH   DH   SH
o/b/cms/file*             1         53 / 50                      0    1    1    0
o/b/cms/url*              3         288 / 111                    2    0    0    1
o/b/common/audit*         2         40 / 13                      0    0    1    1
o/b/common/classloader*   2         1596 / 849                   0    2    2    0
o/b/common/i18n*          1         10660 / 51                   0    1    1    0
o/b/common/persistence*   1         24 / 2                       0    0    1    0
o/b/common/security*      2         14 / 40                      0    1    2    0
o/b/common/util*          1         30 / 21                      0    1    1    0
o/b/common/web*           4         188 / 60                     0    2    3    1
o/b/core/catalog*         1         2 / 2                        0    0    0    1
o/b/core/order*           5         34 / 84                      0    3    5    0
o/b/core/payment*         1         1 / 1                        1    0    0    0
o/b/core/pricing*         1         5 / 21                       0    1    1    0
o/b/core/rating*          2         6 / 6                        0    0    2    0
o/b/core/search*          2         44 / 38                      0    2    2    0
o/b/core/web*             10        615 / 340                    1    5    7    2
o/b/openadmin/audit*      3         16 / 14                      0    1    1    2
o/b/profile/core*         1         3 / 2                        1    0    0    0
o/b/vendor/sample*        1         1 / 1                        0    1    1    0
Total                     44/355    13620 / 1706                 5    21   31   8

For instance, Broadleaf uses JSON objects to handle the prices of products with different properties. The price of an XL-size black T-shirt is $17, which is displayed as {"options":[1, 14], "price":17}. In the snippet, number 1 stands for “XL” and number 14 stands for “black”. It is obvious that {"options":[14, 1], "price":17} has the same meaning. However, the current implementation of our monitoring sidecar regards these as different outputs. This phenomenon reflects one limitation of the monitoring sidecar: it is not sophisticated enough to determine semantic equivalence.

Following the manual comparison between the response bodies, the revised report about try-catch resilience is: 5 resilient try-catch blocks, 21 observable try-catch blocks, 31 debuggable try-catch blocks, and 8 silent try-catch blocks.

4) Case studies: Next, we discuss one of the most interesting cases found in the chaos experiment on Broadleaf. As shown in Listing 8, ChaosMachine identifies this try-catch block as a resilient one. As the method name suggests, the method is used for obtaining the sub-division of a country.
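As an aside on the monitoring-sidecar limitation noted in the Broadleaf results (the two price payloads that differ only in the order of their options), the following minimal Java sketch contrasts a verbatim comparison with an order-insensitive one. It is an illustration under stated assumptions, not the ChaosMachine implementation: the class PriceResponseDiff and its methods are hypothetical names, the pattern only recognizes the flat payload shape quoted in the text, and treating "options" as an unordered set is itself a piece of domain knowledge assumed here.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of ChaosMachine): compares two price payloads. */
final class PriceResponseDiff {

    // Recognizes only the flat shape quoted in the text: {"options":[1, 14], "price":17}
    private static final Pattern SHAPE = Pattern.compile(
        "\\{\\s*\"options\"\\s*:\\s*\\[([^\\]]*)\\]\\s*,\\s*\"price\"\\s*:\\s*(\\d+)\\s*\\}");

    /** Verbatim comparison: any textual difference counts as a changed output. */
    static boolean verbatimEqual(String observed, String perturbed) {
        return observed.equals(perturbed);
    }

    /** Order-insensitive comparison: the option identifiers are treated as a set (assumption). */
    static boolean semanticallyEqual(String observed, String perturbed) {
        return normalize(observed).equals(normalize(perturbed));
    }

    // Builds a canonical form with sorted options; unknown shapes are left unchanged,
    // so they fall back to a verbatim comparison.
    private static String normalize(String json) {
        Matcher m = SHAPE.matcher(json.trim());
        if (!m.matches()) {
            return json;
        }
        Set<Integer> options = new TreeSet<>();
        for (String token : m.group(1).split(",")) {
            if (!token.trim().isEmpty()) {
                options.add(Integer.parseInt(token.trim()));
            }
        }
        return options + "|" + m.group(2);
    }

    public static void main(String[] args) {
        String observed = "{\"options\":[1, 14], \"price\":17}";
        String perturbed = "{\"options\":[14, 1], \"price\":17}";
        System.out.println(verbatimEqual(observed, perturbed));     // false
        System.out.println(semanticallyEqual(observed, perturbed)); // true
    }
}
```

A real response comparator would need a proper JSON parser and per-field rules for which arrays are order-sensitive; the point of the sketch is only that the resilience verdict depends on how "same output" is defined.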
Listing 8. NoResultException in CountrySubdivisionDaoImpl/findSubdivisionByCountryAndAltAbbreviation

    public CountrySubdivision findSubdivisionByCountryAndAltAbbreviation(...) {
        TypedQuery<CountrySubdivision> query = new ...
        try {
            return query.getSingleResult();
        } catch (NoResultException e) {
            return null;
        }
    }

The method is executed 3 times in observation mode and 2 times in exploration mode. When the perturbation injector is activated, the query result is always “null”. The reason why the response content stays the same is that in observation mode, the user’s country information does not contain sub-divisions. Thus, no matter if there is an exception, the query result remains “null”. However, this try-catch block may impact the system’s output if a specific user has sub-division information.

This phenomenon exposes another limitation of ChaosMachine. Since it uses production traffic to evaluate the resilience of try-catch blocks, the traffic might not be sufficient to give definitive conclusions. For some try-catch blocks that are currently classified as resilient, different traffic may be able to falsify the hypotheses. The accuracy of the report of ChaosMachine can be improved by feeding in more varied production traffic and analyzing the output using more detailed domain-specific knowledge.

5) Discussion of Broadleaf experiment: Through this experiment, we finally succeed in analyzing Broadleaf’s error-handling resilience capabilities under unanticipated exceptions. However, it is necessary to invest more manual work compared to the experiments on XWiki. The limitations of the monitoring sidecar make ChaosMachine unable to determine which try-catch blocks really influence the system’s steady state. In the first step of the experiment, ChaosMachine detects 0 resilient try-catch blocks. However, it is feasible to improve the abilities of the monitoring sidecar to calculate differences. The variety of production traffic also limits the identifications performed by ChaosMachine.

Main result of the Broadleaf experiment: ChaosMachine identifies 5 resilient, 21 observable, 31 debuggable, and 8 silent try-catch blocks. This experiment exposes two important facts: (a) the monitoring sidecar may need to embed some domain-specific knowledge in order to better interpret the application output and logs, and (b) the length of the captured production traffic during chaos experimentation matters.

V. RELATED WORK

Chaos engineering is a new field which is little researched, hence the closely related work is relatively scarce. Beyond chaos engineering, we discuss here the related yet completely different areas of fault-injection and error-handling static analysis.

Netflix [7] is well known for its ChaosMonkey, which randomly shuts down Amazon instances in production. It is used to ensure that the user experience is not impacted by the loss of an Amazon instance. This methodology has been extended to more failure types both at Netflix [13] and other companies [19]. An example of a cloud-oriented tool is the one by Sheridan et al. [24], who presented a fault injection tool for cloud applications, where faults are resource stress or service outage. While those tools conduct chaos experiments between services at the OS level or the network level, ChaosMachine is, to the best of our knowledge, the first to perform chaos experiments in a white-box fashion. It perturbs the runtime status inside the JVM, and enables developers to detect internal weaknesses at the code level (not the service interaction level).

Hyosoon Lee et al. [15] proposed SFIDA, a fault injection tool to test the dependability of distributed applications on the Linux platform, by injecting transient or permanent hardware faults in the runtime environment. Bartolomeo Montrucchio et al. [18] and Segall et al. [4], [23] also presented similar injection techniques for memory faults. Kao et al. [14] invented “FINE”, a fault injection and monitoring tool to inject both hardware-induced software errors and software faults. None of those approaches are meant to be used in production; they are not chaos engineering tools. On the contrary, ChaosMachine evaluates an application’s error-handling capabilities by studying the effect of unanticipated exceptions in production.

Now we discuss exception analysis. Fu and Ryder [10] described a static analysis method for exception chains in Java. Zhang and Elbaum [26] presented an approach that amplifies tests to validate exception handling. Cornu et al. [8] proposed a classification of try-catch blocks at testing time. Here, the problem domain and implementation techniques are different: those authors use test suites to study resilience. On the contrary, ChaosMachine operates in production, using real production traffic to conduct the analysis.

Czeck et al. [9] described a methodology for modeling fault effects on system behavior. They construct a behavior model based on a small set of workloads and use the model to infer the fault behavior of other workloads. In comparison, ChaosMachine is directly applied to the production system to make and falsify hypotheses about its resilience. Finally, chaos engineering relates to failure-oblivious computing [21]: both are engineering techniques for production failures, yet failure-oblivious computing is not about the active injection of faults in the production systems.

VI. CONCLUSION

This paper presented ChaosMachine, which analyzes and falsifies exception-handling hypotheses in Java programs running in production. We showed, on three large applications, that ChaosMachine is able to produce actionable reports for developers to gain more confidence about the resilience of their system, and to point out critical try-catch blocks that need more attention. In future work, we will improve the
monitoring sidecar to capture more precisely the steady state of the system. We will also design more advanced perturbation models, for example by changing the timing of methods invocation or with finer-grained strategies to control the blast radius of chaos engineering experiments.

ACKNOWLEDGEMENT

This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

REFERENCES

[1] Principles of chaos engineering. http://principlesofchaos.org/, April 2018.
[2] John Allspaw. Fault injection in production. Queue, 10(8):30:30–30:35, August 2012.
[3] Peter Alvaro, Kolton Andrus, Chris Sanden, Casey Rosenthal, Ali Basiri, and Lorin Hochstein. Automating failure testing research at internet scale. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC ’16, pages 17–28, New York, NY, USA, 2016. ACM.
[4] J. H. Barton, E. W. Czeck, Z. Z. Segall, and D. P. Siewiorek. Fault injection experiments using fiat. IEEE Transactions on Computers, 39(4):575–582, Apr 1990.
[5] A. Basiri, N. Behnam, R. de Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal. Chaos engineering. IEEE Software, 33(3):35–41, May 2016.
[6] Aaron Blohowiak Nora Jones Casey Rosenthal, Lorin Hochstein and Ali Basiri. Chaos engineering - building confidence in system behavior through experiments. https://www.oreilly.com/ideas/chaos-engineering, September 2017.
[7] Michael Alan Chang, Bredan Tschaen, Theophilus Benson, and Laurent Vanbever. Chaos monkey: Increasing sdn reliability through systematic network destruction. SIGCOMM Comput. Commun. Rev., 45(4):371–372, August 2015.
[8] Benoit Cornu, Lionel Seinturier, and Martin Monperrus. Exception Handling Analysis and Transformation Using Fault Injection: Study of Resilience Against Unanticipated Exceptions. Information and Software Technology, 57:66–76, January 2015.
[9] E. W. Czeck and D. P. Siewiorek. Observations on the effects of fault manifestation as a function of workload. IEEE Transactions on Computers, 41(5):559–566, May 1992.
[10] Chen Fu and Barbara G. Ryder. Exception-chain analysis: Revealing exception handling architecture in java server applications. In Proceedings of the 29th International Conference on Software Engineering, ICSE ’07, pages 230–239, Washington, DC, USA, 2007. IEEE Computer Society.
[11] Sudipto Ghosh and John L. Kelly. Bytecode fault injection for java software. J. Syst. Softw., 81(11):2034–2043, November 2008.
[12] Haryadi S. Gunawi, Thanh Do, Joseph M. Hellerstein, Ion Stoica, Dhruba Borthakur, and Jesse Robbins. Failure as a service (faas): A cloud service for large-scale, online failure drills. Technical Report UCB/EECS-2011-87, EECS Department, University of California, Berkeley, Jul 2011.
[13] Yury Izrailevsky and Ariel Tseitlin. The netflix simian army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, July 2011.
[14] W. I. Kao, R. K. Iyer, and D. Tang. Fine: A fault injection and monitoring environment for tracing the unix system behavior under faults. IEEE Transactions on Software Engineering, 19(11):1105–1118, Nov 1993.
[15] Hyosoon Lee, Youngshik Song, and Heonshik Shin. Sfida: a software implemented fault injection tool for distributed dependable applications. In Proceedings Fourth International Conference/Exhibition on High Performance Computing in the Asia-Pacific Region, volume 1, pages 410–415 vol.1, May 2000.
[16] Tanakorn Leesatapornwongsa and Haryadi S. Gunawi. The case for drill-ready cloud computing. In Proceedings of the ACM Symposium on Cloud Computing, SOCC ’14, pages 13:1–13:8, New York, NY, USA, 2014. ACM.
[17] Martin Monperrus. Principles of Antifragile Software. In Proceedings of the Salon des Refusés 2017, 2017.
[18] B. Montrucchio, M. Rebaudengo, and A. Velasco. Fault injection in the process descriptor of a unix-based operating system. In 2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pages 281–286, Oct 2014.
[19] Heather Nakama. Inside Azure search: Chaos engineering. https://azure.microsoft.com/en-us/blog/inside-azure-search-chaos-engineering/, July 2015.
[20] Kyle Parrish and David Halsey. Too big to test: Breaking a production brokerage platform without causing financial devastation. https://conferences.oreilly.com/velocity/devops-web-performance-ny-2015/public/schedule/detail/45012, October 2015.
[21] M. Rinard, C. Cadar, D. Dumitran, D.M. Roy, T. Leu, and W.S. Beebee Jr. Enhancing server availability and security through failure-oblivious computing. In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, pages 21–21. USENIX Association, 2004.
[22] Gerald Schermann, Jürgen Cito, Philipp Leitner, Uwe Zdun, and Harald C. Gall. We’re doing it live: A multi-method empirical study on continuous experimentation. Information and Software Technology, 99:41–57, 2018.
[23] Z. Segall, D. Vrsalovic, D. Siewiorek, D. Yaskin, J. Kownacki, J. Barton, R. Dancey, A. Robinson, and T. Lin. Fiat-fault injection based automated testing environment. In [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers, pages 102–107, June 1988.
[24] Craig Sheridan, Darren Whigham, and Matej Artac. DICE fault injection tool. CoRR, abs/1707.06420, 2017.
[25] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay Jain, and Michael Stumm. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In 11th USENIX Symposium on Operating Systems Design and Implementation, pages 249–265, 2014.
[26] Pingyu Zhang and Sebastian Elbaum. Amplifying tests to validate exception handling code: An extended study in the mobile application domain. ACM Trans. Softw. Eng. Methodol., 23(4):32:1–32:28, September 2014.