RECIFE, MARCH 2013
Cataloging at source
Librarian: Jane Souto Maior, CRB4-571
Includes bibliography.
1. Computer science. 2. Distributed systems. I. Garcia, Vinicius Cardoso (advisor). II. Title.
004 MEI2013 054
Master's dissertation presented by Thiago Pereira de Brito Vieira to the Graduate Program in Computer Science (Pós-Graduação em Ciência da Computação) of the Centro de Informática of the Universidade Federal de Pernambuco, under the title "An Approach for Profiling Distributed Applications Through Network Traffic Analysis", advised by Prof. Vinicius Cardoso Garcia and approved by the examining board formed by the professors:

______________________________________________
Prof. José Augusto Suruagy Monteiro
Centro de Informática / UFPE

______________________________________________
Prof. Denio Mariz Timoteo de Souza
Instituto Federal da Paraíba

______________________________________________
Prof. Vinicius Cardoso Garcia
Centro de Informática / UFPE
Acknowledgements
[...] contributed to making these days dedicated to the master's degree very pleasant. I would like to thank Paulo Fernando, Lenin Abadie, Marco Machado, Dhiego Abrantes, Rodolfo Arruda, Francisco Soares, Sabrina Souto, Adriano Tito, Hélio Rodrigues, Jamilson Batista, Bruno Felipe, and the other people I had the pleasure of meeting during this period of the master's degree.
I also thank all my old friends from João Pessoa, Geisel, UFPB, and CEFET-PB, who gave me so much support and encouragement to develop this work.
Finally, to all those who collaborated directly or indirectly in the realization of this work.
Thank you very much!
Resumo
Abstract
Distributed systems have been adopted for building modern Internet services and cloud computing infrastructures, in order to obtain services with high performance, scalability, and reliability. Cloud computing SLAs require short times to identify, diagnose, and solve problems in a cloud computing production infrastructure, in order to avoid negative impacts on the quality of service provided to clients. Thus, the detection of error causes, and the diagnosis and reproduction of errors, are challenges that motivate efforts towards the development of less intrusive mechanisms for monitoring and debugging distributed applications at runtime.
Network traffic analysis is one option for measuring distributed systems, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic under varying resource demands.
The goal of this dissertation is to analyse the processing capacity problem of measuring distributed systems through network traffic analysis, in order to evaluate the performance of distributed systems at a data center, using commodity hardware and cloud computing services, in a minimally intrusive way.
We propose a new approach based on MapReduce for deep inspection of distributed application traffic, in order to evaluate the performance of distributed systems at runtime, using commodity hardware. In this dissertation we evaluated the effectiveness of MapReduce for a deep packet inspection algorithm, its processing capacity, completion time speedup, processing capacity scalability, and the behavior of the MapReduce phases when applied to deep packet inspection for extracting indicators of distributed applications.
Keywords: Distributed Application Measurement, Profiling, MapReduce, Network Traffic Analysis, Packet Level Analysis, Deep Packet Inspection
Contents

List of Figures
List of Tables
List of Acronyms
Introduction
1.1 Motivation
1.2 Problem Statement
1.3 Contributions
1.4 Dissertation Organization
Bibliography
List of Figures

2.1 Differences between packet level analysis and deep packet inspection
2.2 MapReduce input dataset splitting into blocks and into records
3.1–3.4 (captions lost in extraction)
4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application network traffic
4.2 DPI Processing Capacity for 90Gb
4.3 MapReduce Phases Behaviour for DPI of 90Gb: (a) Phases Time for DPI; (b) Phases Distribution for DPI
4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the approach with and without splitting into packets
4.5 CountUp completion time and speed-up of 90Gb: (a) P3 evaluation; (b) CountUpDriver evaluation
4.6 CountUp processing capacity for 90Gb: (a) P3 processing capacity; (b) CountUpDriver processing capacity
4.7 MapReduce Phases time of CountUp for 90Gb: (a) MapReduce Phases Times of P3; (b) MapReduce Phases Times for CountUpDriver
4.8 MapReduce Phases Distribution for CountUp of 90Gb: (a) Phases Distribution for P3; (b) Phases Distribution for CountUpDriver
4.9 (a) DPI Completion Time and Speed-up of MapReduce for 30Gb of a JXTA-application network traffic; (b) DPI Processing Capacity of 30Gb
List of Tables

3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI of a JXTA-based network traffic
3.2 Factors and levels to evaluate the defined metrics
3.3 Hypotheses to evaluate the defined metrics
3.4 Hypothesis notation
3.5 Completion time to process 16 GB split into 35 files
3.6 Completion time to process 34 GB split into 79 files
4.1 Metrics for evaluating MapReduce for DPI and packet level analysis
4.2 Factors and Levels
4.3 Non-Distributed Execution Time in seconds
List of Acronyms
Introduction
Though nobody can go back and make a new beginning, anyone can
start over and make a new ending.
CHICO XAVIER
1.1 Motivation
Distributed systems have been adopted for building high performance systems, due to the possibility of obtaining high fault tolerance, scalability, availability, and efficient use of resources (Cox et al., 2002; Antoniu et al., 2007). Modern Internet services and cloud computing infrastructures are commonly implemented as distributed systems, to provide services with high performance and reliability (Mi et al., 2012). Cloud computing SLAs require short times to identify, diagnose, and solve problems in the production infrastructure, in order to avoid negative impacts on the quality of service provided to clients. Thus, monitoring and performance analysis of distributed systems in production environments became more necessary with the growth of cloud computing and the use of distributed systems to provide services and infrastructure as a service (Fox et al., 2009; Yu et al., 2011).
In the development, maintenance, and administration of distributed systems, the detection of error causes, and the diagnosis and reproduction of errors, are challenges that motivate efforts towards the development of less intrusive and more effective mechanisms for monitoring and debugging distributed applications at runtime (Armbrust et al., 2010). Distributed measurement systems (Massie et al., 2004) and log analysers (Oliner et al., 2012) provide relevant information regarding some aspects of a distributed system, but this information can be complemented by correlated information from other sources (Zheng et al., 2012), such as network traffic analysis, which can provide valuable information about a distributed application and its environment, and also increase the number of information sources, making the evaluation of complex distributed systems more effective. Simulators (Paul, 2010), emulators, and testbeds (Loiseau et al., 2009; Gupta et al., 2011) are also used to evaluate distributed systems, but these approaches fail to reproduce the production behavior of a distributed system and its relation to a complex environment, such as a cloud computing environment (Loiseau et al., 2009; Gupta et al., 2011).
Monitoring and diagnosing production failures of distributed systems require low intrusion, high accuracy, and fast results. Achieving these requirements is complex, because distributed systems usually involve asynchronous communication, unpredictable network message delays, a high number of resources to be monitored in a short time, and black-box components (Yuan et al., 2011; Nagaraj et al., 2012). To measure distributed systems with less intrusion and less dependency on developers, approaches with low dependency on source code or instrumentation are necessary, such as log analysis or network traffic analysis (Aguilera et al., 2003).
It is possible to measure, evaluate, and diagnose distributed applications through the evaluation of information from communication protocols, flows, throughput, and load distribution (Mi et al., 2012; Nagaraj et al., 2012; Sambasivan et al., 2011; Aguilera et al., 2003; Yu et al., 2011). This information can be collected through network traffic analysis, but to retrieve this kind of information from distributed application traffic it is necessary to recognize application protocols and perform deep packet inspection (DPI) to retrieve details of application behaviors, sessions, and states.
Network traffic analysis is one option to evaluate the performance of distributed systems (Yu et al., 2011), although there are limitations in the processing capacity to deal with large amounts of network traffic in a short time, in the scalability to process network traffic under varying resource demands, and in the complexity of obtaining information about a distributed application's behavior from network traffic (Loiseau et al., 2009; Callado et al., 2009). To evaluate application information from network traffic it is necessary to use DPI and extract information from application protocols, which requires additional effort in comparison with traditional DPI approaches, which usually do not evaluate the content of application protocols and application states.
In the production environment of a cloud computing provider, DPI can be used to evaluate and diagnose distributed applications, through the analysis of application traffic inside a data center. However, this kind of DPI presents differences and requires more effort than common DPI approaches. DPI is usually used to inspect all network traffic that arrives at a data center, but this approach would not provide reasonable performance for inspecting application protocols and their states, due to the massive volumes of network traffic to be evaluated online, and the computational cost of performing this kind of evaluation in a short time (Callado et al., 2009).
Packet level analysis can also be used to evaluate packet flows and the load distribution of network traffic inside a data center (Kandula et al., 2009), providing valuable information about the behavior of a distributed system and about the dimension, capacity, and usage of network resources. However, with packet level analysis it is not possible to evaluate application messages, protocols, and their states.
Although much work has been done to improve DPI performance (Fernandes et al., 2009; Antonello et al., 2012), the evaluation of application states through traffic analysis decreases the processing capacity of DPI for evaluating large amounts of network traffic. With the growth of link speeds, Internet traffic exchange, and the use of distributed systems to provide Internet services (Sigelman et al., 2010), approaches are needed that can deal with the analysis of the growing amount of network traffic, to permit the efficient evaluation of distributed systems through network traffic analysis.
MapReduce (Dean and Ghemawat, 2008), which was proposed for the distributed processing of large datasets, can be an option to deal with large amounts of network traffic. MapReduce is a programming model and an associated implementation for processing and generating large datasets. It has become an important programming model and distribution platform for processing large amounts of data, with diverse use cases in academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce is a restricted programming model to easily and automatically parallelize the execution of user functions and to provide transparent fault tolerance (Dean and Ghemawat, 2008). Based on combinators from functional languages, it provides a simple programming paradigm for parallel processing that is increasingly being used for data-intensive applications in cloud computing environments.
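As a minimal, single-process sketch of the model (an illustration of the map/reduce contract only, not the Hadoop implementation), the flow from input records to grouped results can be written in Python; the packet-like records and the field name `proto` are hypothetical:

```python
from collections import defaultdict

def simple_mapreduce(records, map_fn, reduce_fn):
    """Sequentially emulate a MapReduce job: map, group by key, reduce."""
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Reduce phase: each key and its grouped values produce one output value.
    return {key: reduce_fn(key, values) for key, values in sorted(intermediate.items())}

# Toy records standing in for parsed packets (hypothetical field name).
packets = [{"proto": "TCP"}, {"proto": "UDP"}, {"proto": "TCP"}]

counts = simple_mapreduce(
    packets,
    map_fn=lambda pkt: [(pkt["proto"], 1)],     # emit (protocol, 1) per packet
    reduce_fn=lambda key, values: sum(values),  # sum the ones per protocol
)
# counts == {"TCP": 2, "UDP": 1}
```

In the real framework the Map calls run in parallel on many nodes and the grouping step is the distributed Shuffle and Sort; only the two user-defined functions are the same.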
MapReduce can be used for network packet level analysis (Lee et al., 2011), which evaluates each packet individually to obtain information from the network and transport layers. Lee et al. (2011) proposed an approach to perform network packet level analysis through MapReduce, using network traces split into packets, to process each packet individually and to extract indicators from IP, TCP, and UDP. However, profiling an application through network traffic analysis requires deep packet inspection, in order to evaluate the content of the application layer, to evaluate application protocols, and to reassemble application messages.
Because the approach proposed by Lee et al. (2011) is not able to evaluate more than one packet per MapReduce iteration or to analyse application messages, a new MapReduce approach is necessary to perform DPI algorithms for profiling applications through network traffic analysis.
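The data dependency behind this requirement can be sketched as follows. The 4-byte length-prefix framing used here is a hypothetical stand-in for real application-message framing (it is not the JXTA wire format), chosen only to show why the payloads of several packets of one flow must be buffered together before a single message can be evaluated:

```python
import struct

def reassemble_messages(payloads):
    """Concatenate per-flow TCP payloads and extract length-prefixed messages.

    Assumes each message starts with a 4-byte big-endian length header;
    this framing is illustrative, not the JXTA wire format.
    """
    buffer = b"".join(payloads)
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack(">I", buffer[:4])
        if len(buffer) < 4 + length:
            break  # message still incomplete: more packets are needed
        messages.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return messages, buffer

# One 11-byte message split across three packets of the same flow:
segments = [b"\x00\x00\x00\x0bhel", b"lo wo", b"rld"]
msgs, rest = reassemble_messages(segments)
```

A Map function that sees only one packet at a time can never run this loop to completion, which is exactly the restriction discussed above.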
The kind of workload submitted for processing by MapReduce impacts the behaviour and performance of MapReduce (Tan et al., 2012; Groot, 2012), requiring specific configuration to obtain optimal performance. Information about the occupation of the MapReduce phases, about the processing characteristics (whether the job is I/O or CPU bound), and about the mean duration of Map and Reduce tasks can be used to optimize the parameter configuration of MapReduce, in order to improve resource allocation and task scheduling.
Although studies have been done to understand, analyse, and improve workload management decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes the MapReduce behaviour or identifies its optimal configuration to achieve the best performance for packet level analysis and DPI.
1.2 Problem Statement
MapReduce can express several kinds of problems, but not all. MapReduce does not efficiently express incremental, dependent, or recursive computations (Bhatotia et al., 2011; Lin, 2012), because its approach adopts batch processing and functions executed independently, without shared state or data. Although MapReduce is restrictive, it provides a good fit for many problems of processing large datasets. MapReduce's expressiveness limitations may be reduced by decomposing problems into multiple MapReduce iterations, or by combining MapReduce with other programming models for sub-problems (Lämmel, 2007; Lin, 2012), although the decomposition into iterations increases the completion time of MapReduce jobs (Lämmel, 2007).
DPI algorithms require the evaluation of one or more packets to retrieve information from application layer messages; this represents a data dependency when assembling an application message from network packets, and it is a restriction on using MapReduce for DPI. Because the approach of Lee et al. (2011) for packet level analysis with MapReduce processes each packet individually, it cannot be used to evaluate more than one packet per Map function or to efficiently reassemble an application message from network traces. Thus a new approach is necessary to use MapReduce to perform DPI, evaluating the effectiveness of MapReduce in expressing DPI algorithms.
In elastic environments, like cloud computing providers, where users can request or discard resources dynamically, it is important to know how to perform provisioning and resource allocation in an optimal way. To run MapReduce jobs efficiently, the allocated resources need to be matched to the workload characteristics, and the allocated resources should be sufficient to meet a requested processing capacity or deadline (Lee, 2012).
The main performance evaluations of MapReduce concern text processing (Zaharia et al., 2008; Chen et al., 2011; Jiang et al., 2010; Wang et al., 2009), where the input data is split into blocks and into records, to be processed by parallel and independent Map functions. Although studies have been done to understand, analyse, and improve workload decisions in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes the MapReduce behavior or identifies its optimal configuration to achieve the best performance for packet level analysis and DPI. Thus, it is necessary to characterize MapReduce jobs for packet level analysis and DPI, in order to permit their optimal configuration for the best performance, and to obtain information that can be used to predict or simulate the completion time of a job with given resources, in order to determine whether the job will finish by its deadline with the allocated resources (Lee, 2012).
The goal of this dissertation is to analyse the processing capacity problem of measuring distributed systems through network traffic analysis, proposing a solution able to perform deep inspection of distributed application traffic, in order to evaluate distributed systems at a data center, using commodity hardware and cloud computing services, in a minimally intrusive way. Thus we developed an approach based on MapReduce to evaluate the behavior of distributed systems through DPI, and we evaluated the effectiveness of MapReduce for a DPI algorithm and its completion time scalability under node addition to the cluster, to measure a JXTA-based application using virtual machines of a cloud computing provider. We also evaluated the MapReduce performance for packet level analysis and DPI, characterizing the behavior of the MapReduce phases, processing capacity scalability, and speed-up. In this evaluation we measured the impact caused by the variation of input size, block size, and cluster size.
1.3 Contributions
1.4 Dissertation Organization
2.1 Background
2.1.1 Network Traffic Analysis
Network traffic measurement can be divided into active and passive measurement, and a measurement can be performed at the packet or flow level. In packet level analysis, the measurements are performed on each packet transmitted across the measurement point. Common packet inspection only analyses the content up to the transport layer, including the source address, destination address, source port, destination port, and protocol type, but packet inspection can also analyse the packet payload, performing deep packet inspection.
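A minimal sketch of that common (header-only) packet inspection, assuming an untagged Ethernet frame carrying IPv4/TCP and ignoring IP options beyond the IHL field; the synthetic frame below is built only to exercise the parser:

```python
import socket
import struct

def five_tuple(frame):
    """Extract the 5-tuple from a raw Ethernet frame carrying IPv4/TCP.

    Simplified sketch: assumes no VLAN tags and a TCP transport payload.
    """
    # Ethernet header is 14 bytes; bytes 12-13 hold the EtherType.
    ethertype = struct.unpack("!H", frame[12:14])[0]
    assert ethertype == 0x0800, "not IPv4"
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                 # IP header length in bytes
    proto = ip[9]                            # 6 means TCP
    src_ip = socket.inet_ntoa(ip[12:16])
    dst_ip = socket.inet_ntoa(ip[16:20])
    src_port, dst_port = struct.unpack("!HH", ip[ihl:ihl + 4])
    return (src_ip, src_port, dst_ip, dst_port, proto)

# A synthetic frame: IPv4/TCP from 10.0.0.1:1234 to 10.0.0.2:80.
frame = (
    b"\x00" * 12 + b"\x08\x00"                           # Ethernet addresses + EtherType
    + bytes([0x45, 0, 0, 40, 0, 0, 0, 0, 64, 6, 0, 0])   # IP header (IHL=5, proto=TCP)
    + socket.inet_aton("10.0.0.1") + socket.inet_aton("10.0.0.2")
    + struct.pack("!HH", 1234, 80)                       # TCP source/destination ports
)
tuple5 = five_tuple(frame)
```

Everything here stops at the transport layer; DPI, discussed next, continues into the bytes after the TCP header.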
Risso et al. (2008) presented a taxonomy of the methods that can be used for network traffic analysis. According to Risso et al. (2008), Packet Based No State (PBNS) operates by checking the value of some fields present in each packet, such as the TCP or UDP ports; this method is therefore computationally very simple. Packet Based Per Flow State (PBFS) requires a session table to manage the session identification (source/destination address, transport-layer protocol, source/destination port) and the corresponding application layer protocol, in order to be able to scan the payload looking for a specific rule, usually an application-layer signature, which increases the processing complexity of this method. Message Based Per Flow State (MBFS) operates on messages instead of packets. This method requires a TCP/IP reassembler to handle IP fragments and TCP segments. In this case, memory requirements increase because of the additional state information that must be kept for each session and because of the buffers required by the TCP/IP reassembler. Message Based Per Protocol State (MBPS) interprets exactly what each application sends and receives. An MBPS processor understands not only the semantics of a message, but also the different phases of a message exchange, because it has a full understanding of the protocol state machine. Memory requirements become even larger, because this method needs to take into account not only the state of the transport session, but also the state of each application layer session. The required processing power is also the highest, because protocol conformance analysis requires processing the entire application data, while the previous methods are limited to the first packets of each session.
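The contrast between the first two methods in the taxonomy can be sketched compactly; the port table and the `GET ` signature below are illustrative stand-ins, not a real classifier:

```python
# PBNS: stateless classification by a transport-layer field (the port).
PORT_TABLE = {80: "HTTP", 53: "DNS"}           # illustrative mapping

def pbns_classify(dst_port):
    return PORT_TABLE.get(dst_port, "unknown")

# PBFS: per-flow state plus payload signature matching.
SIGNATURES = {b"GET ": "HTTP"}                 # illustrative signature

def pbfs_classify(sessions, flow_key, payload):
    """flow_key is the 5-tuple; sessions maps each flow to its protocol."""
    if flow_key not in sessions:
        for signature, protocol in SIGNATURES.items():
            if payload.startswith(signature):
                sessions[flow_key] = protocol
                break
        else:
            sessions[flow_key] = "unknown"
    return sessions[flow_key]
```

PBNS needs only one dictionary lookup per packet, while PBFS already pays for a session table and a payload scan; MBFS and MBPS add message reassembly and protocol state on top of this.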
Figure 2.1 illustrates the difference between packet level analysis and DPI of PCAP files, showing that packet level analysis evaluates each packet individually, while DPI requires the evaluation of more than one packet, to reassemble packets and obtain an application message.
Figure 2.1 Differences between packet level analysis and deep packet inspection
DPI refers to examining both the packet header and the complete payload to look for predefined patterns or rules. A pattern or rule can be a particular TCP connection, defined by source and destination IP addresses and port numbers; it can also be the signature string of a virus, or a segment of malicious code (Piyachon and Luo, 2006). Antonello et al. (2012) argue that many critical network services rely on the inspection of the packet payload, instead of only looking at the information in packet headers. Although DPI systems are essentially more accurate in identifying application protocols and application messages, they are also resource-intensive and may not scale well with growing link speeds. MBFS, MBPS, and DPI evaluate the content of the application layer, and thus need to recognize the content of the evaluated message, but encrypted messages can make this kind of evaluation infeasible.
2.1.2 JXTA
JXTA is a language and specification for peer-to-peer networking that attempts to formulate standard peer-to-peer protocols, in order to provide an infrastructure for building peer-to-peer applications, through basic functionalities for peer resource discovery, communication, and organization. JXTA introduces an overlay on top of the existing physical network, with its own addressing and routing (Duigou, 2003; Halepovic and Deters, 2003).
According to the JXTA specification (Duigou, 2003), JXTA peers communicate through messages transmitted by pipes, which are an abstraction of virtual channels composed of input and output channels for peer-to-peer communication. Pipes are not bound to a physical location; each pipe has its own unique ID, and a peer carries its pipe with itself even when its physical network location changes. Pipes are asynchronous, unidirectional, and unreliable, but bi-directional and reliable services are provided on top of them. JXTA uses source-based routing: each message carries its routing information as a sequence of peers, and peers along the path may update this information. The JXTA socket adds reliability and bi-directionality to JXTA communications through a layer of abstraction on top of the pipes (Antoniu et al., 2005), and provides an interface similar to the POSIX sockets specification. JXTA messages are XML documents composed of well-defined and ordered message elements.
Halepovic and Deters (2005) proposed a performance model describing important metrics to evaluate JXTA throughput, scalability, services, and behavior across different versions. Halepovic et al. (2005) analysed JXTA performance to show the increasing cost, or latency, under higher workload and concurrent requests, and suggested more evaluation of JXTA scalability with large peer groups in direct communication. Halepovic (2004) notes that network traffic analysis is a feasible approach to the performance evaluation of JXTA-based applications, but does not adopt it due to the lack of JXTA traffic characterization. Although there are performance models and evaluations of JXTA, there are no evaluations of its current versions, and there are no mechanisms to evaluate JXTA applications at runtime. Because JXTA is still used for building peer-to-peer systems, such as U-Store (Fonseca et al., 2012), which motivates our research, a solution is necessary to measure JXTA-based applications at runtime and provide information about their behavior and performance.
2.1.3 MapReduce
MapReduce (Dean and Ghemawat, 2008) is a programming model and a framework for processing large datasets through distributed computing, providing fault tolerance and high scalability for big data processing. The MapReduce model was designed for unstructured data processed by clusters of commodity hardware. Its functional style of Map and Reduce functions automatically parallelizes and executes large jobs over a cluster. MapReduce also handles failures, application deployment, task duplication, and aggregation of results, thereby allowing programmers to focus on the core logic of applications.
An application executed through MapReduce is called a job. The input data of a job, which is stored in a distributed file system, is split into even-sized blocks and replicated for fault tolerance. Figure 2.2 shows the input dataset splitting adopted by MapReduce.
Figure 2.2 MapReduce input dataset splitting into blocks and into records
Initially the input dataset is split into blocks and stored in the adopted distributed file system. During the execution of a job, each split is assigned to be processed by a Mapper; thus the number of splits of the input determines the number of Map tasks of a MapReduce job. Each Mapper reads its split from the distributed file system and divides it into records, to be processed by the user-defined Map function. Each Map function generates intermediate data from the evaluated block, which will be fetched, ordered by key, and processed by the Reducers to generate the output of the MapReduce job.
A MapReduce job is divided into Map and Reduce tasks, which are composed of the user-defined Map and Reduce functions. The execution of these tasks can be grouped into phases, representing the Map and Reduce phases, and Reduce tasks can be further divided into other phases, the Shuffle and Sort phases. A job is submitted by a user to the master node, which selects worker nodes with idle slots and assigns Map or Reduce tasks.
The execution of a Map task can be divided into two phases. In the first, the Map phase reads the task's split from the distributed file system, parses it into records, and applies the user-defined Map function to each record. In the second, after the user-defined Map function has been applied to each input record, the commit phase registers the final output with the TaskTracker, which then informs the JobTracker that the task has finished executing. The output of the Map phase is consumed by the Reduce phase.
The execution of a Reduce task can be divided into three phases. The first phase, called the Shuffle phase, fetches the Reduce task's input data; each Reduce task is assigned a partition of the key range produced by the Map phase. The second phase, called the Sort phase, groups records with the same key. The third phase, called the Reduce phase, applies the user-defined Reduce function to each key and its values (Kavulya et al., 2010).
A Reduce task cannot fetch the output of a Map task until the Map has finished and committed its output to disk. Only after receiving its partition from all Map outputs does the Reduce task start the Sort phase; until then, the Reduce task executes the Shuffle phase. After the Sort phase, the Reduce task enters the Reduce phase, in which it executes the user-defined Reduce function for each key and its values. Finally, the output of the Reduce function is written to a temporary location on the distributed file system (Condie et al., 2010).
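A single-process sketch of the Shuffle and Sort phases, assuming hash partitioning of keys across Reduce tasks (the behavior of Hadoop's default partitioner, simplified here to plain Python dictionaries):

```python
from collections import defaultdict

def shuffle_sort(map_outputs, num_reducers):
    """Emulate the Shuffle and Sort phases in one process.

    map_outputs: list of (key, value) pairs emitted by all Map tasks.
    Returns one key-sorted, grouped partition per Reduce task.
    """
    # Shuffle: assign each key to a Reduce task by hash partitioning.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[hash(key) % num_reducers][key].append(value)
    # Sort: within each partition, group the records in key order.
    return [sorted(partition.items()) for partition in partitions]

# Intermediate pairs from the Map phase, spread over two Reduce tasks.
parts = shuffle_sort([("a", 1), ("b", 1), ("a", 2)], num_reducers=2)
```

Each element of `parts` is what one Reduce task would see before applying the user-defined Reduce function to every (key, values) group.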
MapReduce worker nodes are configurable to concurrently execute up to a defined number of Map and Reduce tasks, according to the number of Map and Reduce slots. Each worker node of a MapReduce cluster is configured with a fixed number of Map slots and another fixed number of Reduce slots, which define how many Map or Reduce tasks can be executed concurrently per node. During job execution, if all slots are occupied, pending tasks must wait until some slots are freed. If the number of tasks in the job is bigger than the number of available slots, Maps or Reduces are first scheduled to execute on all available slots; these tasks compose the first wave of tasks, which is followed by subsequent waves. If an input is broken into 200 blocks and there are 20 Map slots in a cluster, the number of Map tasks is 200 and the Map tasks are executed in 10 waves (Lee et al., 2012). The number of waves, and their sizes, can aid the configuration of tasks for improved cluster utilization (Kavulya et al., 2010).
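The wave arithmetic from the example can be written out directly; the input and block sizes below are hypothetical values chosen only to reproduce the 200-block case:

```python
import math

def map_waves(input_mb, block_mb, map_slots):
    """Number of Map tasks and execution waves for a given input and cluster."""
    num_tasks = math.ceil(input_mb / block_mb)    # one Map task per input block
    num_waves = math.ceil(num_tasks / map_slots)  # slots run one wave in parallel
    return num_tasks, num_waves

# The example from the text: 200 blocks and 20 Map slots give 10 waves.
tasks, waves = map_waves(input_mb=12800, block_mb=64, map_slots=20)
```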
The Shuffle phase of the first Reduce wave may be significantly different from the Shuffle phases of the subsequent Reduce waves. This happens because the Shuffle phase of the first Reduce wave overlaps with the entire Map phase, and hence it depends on the number of Map waves and their durations (Verma et al., 2012b).
Each Map task is independent of the other Map tasks, meaning that all Mappers can be executed in parallel on multiple machines. The number of concurrent Map tasks in a MapReduce system is limited by the number of slots and by the number of blocks into which the input data was divided. Reduce tasks can also be executed in parallel during the Reduce phase, and the number of Reduce tasks in a job is specified by the application and limited by the number of Reduce slots per node.
MapReduce tries to achieve data locality in its job executions, which means that a Map task and the input data block it will process should be located as close to each other as possible, so that the Map task can read its input data block incurring as little network traffic as possible.
Hadoop¹ is an open source implementation of MapReduce, which relies on HDFS for distributed data storage and replication. HDFS is an implementation of the Google File System (Ghemawat et al., 2003), which was designed to store large files, and was adopted by the MapReduce system as the distributed file system to store its files and intermediate data.
The input data type and workload characteristics impact MapReduce performance, because each application has a different bottleneck resource and requires specific configuration to achieve optimal resource utilization (Kambatla et al., 2009). Hadoop has a set of configuration parameters whose default values are based on the typical configuration of machines in clusters and the requirements of a typical application, which usually processes text-like inputs, although optimal MapReduce resource utilization depends on the resource consumption profile of the application.
1 http://hadoop.apache.org/
Because the input data type and workload characteristics of MapReduce jobs impact
MapReduce performance, it is necessary to evaluate MapReduce behavior and
performance for different purposes. Although much work has been done to understand
and analyse MapReduce for different input data types and workloads (Lu et al., 2012;
Groot, 2012), there is no evaluation that characterizes MapReduce behavior and identifies
its optimal configuration for an application of packet-level analysis and DPI.
2.2 Related Work

2.2.1 Distributed Debugging
2.2.2
Lee et al. (2010) proposed a network flow analysis method using MapReduce, in which
the network traffic was captured, converted to text and used as input to Map tasks. As a
result, improvements in fault tolerance and computation time were shown, compared
with flow-tools². The conversion time from binary network traces to text represents
relevant additional time, which can be avoided by adopting binary data as the input for
MapReduce jobs.
Lee et al. (2011) presented a Hadoop-based packet trace processing tool to process
large amounts of binary network traffic. A new input type for Hadoop was developed,
the PcapInputFormat, which encapsulates the complexity of processing captured binary
PCAP traces and extracting the packets through the LibPCAP (Jacobson et al., 1994)
library. Lee et al. (2011) compared their approach with CoralReef³, a network traffic
analysis tool that also relies on LibPCAP; the results of the evaluation showed a
completion-time speed-up for a case processing packet traces with more than 100 GB.
This approach implemented a packet-level evaluation, extracting indicators from IP, TCP
and UDP, and evaluating the job completion time achieved with different input sizes and
two cluster configurations. The authors implemented their own component to save
network traces into blocks, and the developed PcapInputFormat relies on a timestamp-based
heuristic, using a sliding window, to find the first packet of each block. These
implementations for iterating over the packets of a network trace can present an accuracy
limitation compared with the accuracy obtained by Tcpdump⁴ and LibPCAP for the same
functionalities.
The approach proposed by Lee et al. (2011) is not able to evaluate more than one
packet per MapReduce iteration, because each block is divided into packets that are
evaluated individually by the user-defined Map function. Therefore, a new MapReduce
approach is necessary to perform DPI algorithms, which require reassembling more
than one packet to mount an application message, in order to evaluate message contents,
application states and application protocols.
3 http://www.caida.org/tools/measurement/coralreef
4 http://www.tcpdump.org/

2.3 Chapter Summary
efforts to develop less intrusive mechanisms for monitoring and debugging distributed
applications at runtime. Network traffic analysis is one option for distributed systems
measurement, although there are limitations on the capacity to process large amounts of
network traffic in a short time, and on the scalability to process network traffic under
varying resource demand.
Although MapReduce can be used for packet-level analysis, an approach is necessary
to use MapReduce for DPI, in order to evaluate distributed systems in a data center
through network traffic analysis, using commodity hardware and cloud computing
services, in a minimally intrusive way. Due to the lack of evaluations of MapReduce for
traffic analysis and the peculiarity of this kind of data, it is necessary to evaluate the
performance of MapReduce for packet-level analysis and DPI, characterizing the
behavior of the MapReduce phases, its processing capacity scalability and speed-up,
under variations of the most important MapReduce configuration parameters.
In this chapter, we first look at the problems in distributed application monitoring, in
the processing capacity for network traffic, and in the restrictions on using MapReduce
for profiling the network traffic of distributed applications.
Network traffic analysis can be used to extract performance indicators from the
communication protocols, flows, throughput and load distribution of a distributed system.
In this context, network traffic analysis can enrich diagnoses and provide a mechanism
for measuring distributed systems in a passive way, with low overhead and low
dependency on developers.
However, there are limitations on the capacity to process large amounts of network
traffic in a short time, and on processing capacity scalability when processing network
traffic under variations of throughput and resource demands. To address this problem, we
present an approach for profiling application network traffic using MapReduce.
Experiments show the effectiveness of our approach for profiling a JXTA-based
distributed application through DPI, and its completion time scalability through node
addition, in a cloud computing environment.
In Section 3.1 we begin this chapter by motivating the need for an approach using
MapReduce for DPI; then we describe, in Section 3.2, the proposed architecture and the
DPI algorithm to extract indicators from the network traffic of a JXTA-based distributed
application. Section 3.3 presents the adopted evaluation methodology and the experiment
setup used to evaluate our proposed approach. The obtained results are presented in
Section 3.4 and discussed in Section 3.5. Finally, Section 3.6 concludes and summarizes
this chapter.
3.1 Motivation
Modern Internet services and cloud computing infrastructures are commonly implemented
as distributed systems, to provide services with high performance, scalability and
reliability. Cloud computing SLAs require a short time to identify, diagnose and solve
problems in the infrastructure, in order to avoid negative impacts on the provided quality
of service.
Monitoring and performance analysis of distributed systems became more necessary
with the growth of cloud computing and the use of distributed systems to provide services
and infrastructure (Fox et al., 2009). In distributed systems development, maintenance
and administration, the detection of error causes and the diagnosis and reproduction of
errors are challenges that motivate efforts to develop less intrusive mechanisms for
debugging and monitoring distributed applications at runtime (Armbrust et al., 2010).
Distributed measurement systems (Massie et al., 2004) and log analyzers (Oliner et al.,
2012) provide relevant information about some aspects of a distributed system. However,
this information can be complemented by correlating it with information from network
traffic analysis, making it more effective and broadening the information sources
available to evaluate a distributed system ubiquitously.
Low overhead, transparency and scalability are common requirements for an efficient
solution for measuring distributed systems. Many approaches have been proposed in this
direction, using instrumentation or logging, which cause overhead and a dependency on
developers. It is possible to diagnose and evaluate distributed application performance by
evaluating information from communication protocols, flows, throughput and load
distribution (Sambasivan et al., 2011; Mi et al., 2012). This information can be collected
through network traffic analysis, enriching a diagnosis, and also providing an approach
for measuring distributed systems in a passive way, with low overhead and low
dependency on developers.
Network traffic analysis is one option for evaluating distributed systems performance
(Yu et al., 2011), although there are limitations on the capacity to process large numbers
of network packets in a short time (Loiseau et al., 2009; Callado et al., 2009) and on the
scalability to process network traffic under variations of throughput and resource
demands. To obtain information about the behaviour of distributed systems from network
traffic, it
is necessary to use DPI and to evaluate information about application states, which
requires additional effort in comparison with traditional DPI approaches, which usually
do not evaluate application states.
Although much work has been done to improve DPI performance (Fernandes et al.,
2009; Antonello et al., 2012), the evaluation of application states still decreases the
processing capacity of DPI for evaluating large amounts of network traffic. With the
growth of link speeds, Internet traffic exchange and the use of distributed systems to
provide Internet services (Sigelman et al., 2010), new approaches need to be developed to
deal with the analysis of the growing amount of network traffic, and to permit the
efficient evaluation of distributed systems through network traffic analysis.
MapReduce (Dean and Ghemawat, 2008) has become an important programming model
and distribution platform for processing large amounts of data, with diverse use cases in
academia and industry (Zaharia et al., 2008; Guo et al., 2012). MapReduce can be used
for packet-level analysis: Lee et al. (2011) proposed an approach to process large
amounts of network traffic, which splits network traces into packets and evaluates each
one individually to extract indicators from the IP, TCP and UDP layers.
However, for profiling distributed applications through network traffic analysis, it is
necessary to analyse the content of more than one packet, up to the application layer, to
evaluate application messages and their protocols. Due to TCP and message
segmentation, the desired application message may be split into several packets.
Therefore, it is necessary to evaluate more than one packet per MapReduce iteration to
perform deep packet inspection, in order to be able to reassemble packets into application
messages and to retrieve information from application sessions, states and protocols.
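As a minimal sketch of the reassembly this requires, the payload bytes of one message must be ordered by TCP sequence number before parsing (the segment representation below is hypothetical, not an actual capture format):

```python
def reassemble(segments):
    """Concatenate the TCP payloads of one flow in sequence-number order,
    recovering an application message that was split across packets."""
    ordered = sorted(segments, key=lambda s: s["seq"])
    return b"".join(s["payload"] for s in ordered)

# One application message split across two segments, captured out of order.
segments = [
    {"seq": 2000, "payload": b" world"},
    {"seq": 1000, "payload": b"hello"},
]
print(reassemble(segments))  # b'hello world'
```

A per-packet Map function never sees both segments at once, which is exactly the restriction discussed above.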
DPI refers to examining both the packet header and the complete payload to look for
predefined patterns or rules, which can be a signature string or an application message.
According to the taxonomy presented by Risso et al. (2008), deep packet inspection
can be classified as message-based per flow state (MBFS), which analyses application
messages and their flows, or as message-based per protocol state (MBPS), which analyses
application messages and their application protocol states; the latter is what is required to
evaluate distributed applications through network traffic analysis and extract application
indicators.
MapReduce is a restricted programming model for parallelizing user functions
automatically and providing transparent fault tolerance (Dean and Ghemawat, 2008),
based on combinators from functional languages. MapReduce does not efficiently
express incremental, dependent or recursive data (Bhatotia et al., 2011; Lin, 2012),
because its approach adopts batch processing and functions executed independently,
without shared state.
Although restrictive, MapReduce provides a good fit for many problems of processing
large datasets. Also, its expressiveness limitations may be reduced by decomposing a
problem into multiple MapReduce iterations, or by combining MapReduce with other
programming models for subproblems (Lämmel, 2007; Lin, 2012), although this
approach may not be optimal in some cases. DPI algorithms require the evaluation of one
or more packets to retrieve information from application messages; this represents a data
dependence for mounting an application message and is a restriction on the use of
MapReduce for DPI.
Because the approach of Lee et al. (2011) processes each packet individually, it cannot
be efficiently used to evaluate more than one packet and reassemble an application
message from a network trace, which makes a new approach necessary for using
MapReduce to perform DPI and evaluate application messages.
To be able to process large amounts of network traffic using commodity hardware,
in order to evaluate the behaviour of distributed systems at runtime, and also because
there is no evaluation of MapReduce effectiveness and processing capacity for DPI, we
developed a MapReduce-based approach to deeply inspect distributed application traffic
and evaluate the behaviour of distributed systems, using Hadoop, an open source
implementation of MapReduce.
In this chapter we evaluate the effectiveness of MapReduce for a DPI algorithm and
its completion time scalability through node addition, measuring a JXTA-based
application using virtual machines of Amazon EC2¹, a cloud computing provider. The
main contributions of this chapter are:
1. To provide an approach for implementing DPI algorithms using MapReduce;
2. To show the effectiveness of MapReduce for DPI;
3. To show the completion time scalability of MapReduce for DPI, using virtual
machines of cloud computing providers.
1 http://aws.amazon.com/ec2/
3.2 Architecture
In this section we present the architecture of the proposed approach for capturing and
processing the network traffic of distributed applications.
To monitor distributed applications through network traffic analysis, specific points
of a data center must be monitored to capture the desired application network traffic.
Also, an approach is needed to process a large amount of network traffic in an acceptable
time. Fresh information enables a faster reaction to production problems, so the
information must be obtained as soon as possible, although a trace analysis system
operating on hours-old data is still valuable for monitoring distributed applications in a
data center (Sigelman et al., 2010).
In this direction, we propose a pipelined process to capture network traffic, store it
locally, transfer it to a distributed file system, and evaluate the network trace to extract
application indicators. We use MapReduce, as implemented by Apache Hadoop, to
process application network traffic, extract application indicators, and provide an
efficient and scalable solution for DPI and for profiling application network traffic in a
production environment, using commodity hardware.
The architecture for network traffic capturing and processing is composed of four
main components: the SnifferServer (shown in Figure 3.1), which captures, splits and
stores network packets into HDFS for batch processing through Hadoop; the Manager,
which orchestrates the collected data and the job executions, and stores the generated
results; the AppParser, which converts network packets into application messages; and
the AppAnalyzer, which implements the Map and Reduce functions to extract the desired
indicators.
Figure 3.1 Architecture of the SnifferServer to capture and store network traffic
Figure 3.1 shows the architecture of the SnifferServer and its placement at the
monitoring points of a data center. The SnifferServer captures network traffic from
specific points and stores it into HDFS, for batch processing through Hadoop. The
Sniffer executes user-defined monitoring plans guided by the specification of places,
time, traffic filters and the amount of data to be captured. According to a user-defined
monitoring plan, the Sniffer starts the capture of the desired network traffic through
Tcpdump, which saves network traffic in binary files, known as PCAP files. The
collected traffic is split into files of a predefined size, saved in the local SnifferServer file
system, and transferred to HDFS only when each file is completely saved in the local file
system of the SnifferServer. The SnifferServer must be connected to the network where
the monitored target nodes are connected, and must be able to communicate with the
other nodes that compose the HDFS cluster.
During the execution of a monitoring plan, the network traffic must initially be
captured, split into even-sized files and stored into HDFS. Through Tcpdump, a widely
used LibPCAP network traffic capture tool, the packets are captured and split into PCAP
files of 64 MB, which is the default block size of HDFS, although this block size may be
configured to different values.
HDFS is optimized to store large files, but internally each file is split into blocks of a
predefined size. Files that are greater than the HDFS block size must be split into blocks
of size equal to or smaller than the adopted block size, and are spread among the
machines in the cluster.
Because LibPCAP, used by Tcpdump, stores the network packets in binary PCAP
files, and due to the complexity of providing HDFS with an algorithm for splitting PCAP
files into packets, PCAP file splitting can be avoided by adopting files smaller than the
HDFS block size; alternatively, Hadoop can be provided with an algorithm to split PCAP
files into packets, in order to store arbitrary PCAP files into HDFS.
We adopted the approach that saves the network trace into PCAP files of the adopted
HDFS block size, using the split functionality provided by Tcpdump, because splitting
PCAP files into packets demands additional computing time and increases the complexity
of the system. Thus, the network traffic is captured by Tcpdump, split into even-sized
PCAP files, stored into the local file system of the SnifferServer, and periodically
transferred to HDFS, which is responsible for replicating the files across the cluster.
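The capture step of this pipeline can be sketched as the commands the SnifferServer would issue (the interface, paths, filter and HDFS upload command are illustrative assumptions; note that Tcpdump's `-C` option counts file size in units of 1,000,000 bytes, so `-C 64` only approximates the 64 MB HDFS block size):

```python
def capture_command(interface, out_prefix, file_size_mb=64, bpf_filter=""):
    """Build a Tcpdump command that rotates the capture into
    even-sized PCAP files (-C), as the SnifferServer does."""
    cmd = ["tcpdump", "-i", interface, "-C", str(file_size_mb), "-w", out_prefix]
    if bpf_filter:
        cmd.append(bpf_filter)
    return cmd

def upload_command(local_file, hdfs_dir):
    """Build the command that moves a finished PCAP file into HDFS."""
    return ["hdfs", "dfs", "-put", local_file, hdfs_dir]

# Hypothetical monitoring plan: capture JXTA TCP traffic (default port 9701).
print(capture_command("eth0", "/data/trace.pcap", bpf_filter="tcp port 9701"))
# ['tcpdump', '-i', 'eth0', '-C', '64', '-w', '/data/trace.pcap', 'tcp port 9701']
```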
In the MapReduce framework, the input data is split into blocks, which are further
split into small pieces, called records, to be used as input for each Map function. We
adopt the use of entire blocks, with size defined by the HDFS block size, as input for
each Map function, instead of using the block divided into records. With this approach, it
is possible to evaluate more than one packet per MapReduce task and to mount an
application message from network traffic. It is also possible to obtain more processing
time per Map function than with the approach where each Map function receives only
one packet as input.
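The contrast between the two record strategies can be sketched as follows (a simplification: a real Hadoop InputFormat operates on byte streams, not in-memory lists):

```python
def records_per_packet(packets):
    """Default-style splitting: each Map call sees a single packet,
    so packets of one application message cannot be combined."""
    for pkt in packets:
        yield [pkt]

def records_per_block(packets):
    """Our approach: each Map call sees the whole block, so the packets
    of one application message can be reassembled inside the Map."""
    yield list(packets)

packets = ["p1", "p2", "p3"]
print(sum(1 for _ in records_per_packet(packets)))  # 3 Map inputs
print(sum(1 for _ in records_per_block(packets)))   # 1 Map input
```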
Differently from the approach presented by Lee et al. (2011), which only permits
evaluating one packet individually per Map function, with our approach it is possible to
evaluate many packets from a PCAP file per Map function and to reassemble application
messages from network traffic whose content was divided into many packets for transfer
over TCP.
Figure 3.2 shows the architecture for processing distributed application traffic through
the Map and Reduce functions implemented by the AppAnalyzer, which is deployed at
the Hadoop nodes and managed by the Manager, with the generated results stored into a
distributed database.
The communication between components was characterized as blocking and
non-blocking; blocking communication was adopted in cases that require high
consistency, and non-blocking communication was adopted in cases where it is possible
to use eventual consistency to obtain better response time and scalability.
A TCP packet can transport one or more application messages, but if an application
message is greater than the MSS, the message is split into several TCP packets, according
to TCP segmentation. Thus, it is necessary to evaluate the full content of some TCP
segments to recognize application messages and their protocols.
If an application message has its packets spread across two or more blocks, it is
possible for the Map function to generate intermediate data for these unevaluated
messages, grouping each message by flow and its individual identification, and to use the
Reduce function to reassemble and evaluate the message.
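This division of work can be sketched in MapReduce style (the flow keys, fragment layout and message format are placeholders, not the actual JXTA structures):

```python
from collections import defaultdict

def map_block(block):
    """Emit fragments of messages that could not be fully parsed within a
    block, keyed by flow and message id, for later reassembly."""
    for flow_id, msg_id, seq, data in block:
        yield (flow_id, msg_id), (seq, data)

def reduce_message(key, fragments):
    """Reassemble one message from fragments emitted by several Maps."""
    payload = b"".join(d for _, d in sorted(fragments))
    return key, payload

# One message whose packets landed in two different blocks:
block_a = [("flowA", 1, 0, b"JXTA-msg-")]
block_b = [("flowA", 1, 1, b"part2")]
groups = defaultdict(list)
for block in (block_a, block_b):
    for k, v in map_block(block):
        groups[k].append(v)
for k, frags in groups.items():
    print(reduce_message(k, frags))  # (('flowA', 1), b'JXTA-msg-part2')
```

The grouping done here by the `groups` dictionary is exactly what the MapReduce shuffle provides for free.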
To evaluate the effectiveness of our approach, we developed a pilot project to extract
application indicators from the traffic of a JXTA-based distributed application, which
implements a distributed backup system based on JXTA Socket. To analyse JXTA-based
network traffic, we developed JNetPCAP-JXTA (Vieira, 2012b), which parses network
traffic into Java JXTA messages, and the JXTAPerfMapper and JXTAPerfReducer,
which extract application indicators from the JXTA Socket communication layer through
Map and Reduce functions.
JNetPCAP-JXTA is written in Java and provides methods to convert byte arrays into
Java JXTA messages, using an extension of the default JXTA library for Java, known as
JXSE³. With JNetPCAP-JXTA we are able to parse all kinds of messages defined by the
JXTA specification. JNetPCAP-JXTA relies on the JNetPCAP library to support the
instantiation and inspection of LibPCAP packets. JNetPCAP was adopted due to its
performance when iterating over packets, the large number of functionalities provided for
handling packet traces, and the recent update activity of this library.
The JXTAPerfMapper implements a Map function that receives as input the path of
a PCAP file stored in HDFS; the content of the specified file is then processed to extract
the number of JXTA connection requests and the number of JXTA message arrivals at a
server peer, and to evaluate the round-trip time of each piece of content transmitted over
a JXTA Socket. If a JXTA message is greater than the TCP PDU size, the message is
split into several TCP segments, due to TCP segmentation. Additionally, in JXTA
network traffic, one TCP packet can transport one or more JXTA messages, due to the
buffer window size used by the Java JXTA Socket implementation to segment its
messages.
Because of the possibility of transporting more than one JXTA message per packet,
and because of TCP segmentation, it is necessary to reassemble more than one packet and
the full content of each TCP segment to recognize all possible JXTA messages, instead
of evaluating only a message header or the signature of individual packets, as is
commonly done in DPI and by widely used traffic analysis tools, such as Wireshark⁴,
which is unable to recognize all JXTA messages in a captured network trace because it
does not identify when two or more JXTA messages are transported in the same TCP
packet.
3 http://jxse.kenai.com/
JXTAPerfMapper implements a DPI algorithm to recognize, sort and reassemble TCP
segments into JXTA messages, as shown in Algorithm 1.
Algorithm 1 JXTAPerfMapper
for all tcpPacket do
    if isJXTA or isWaitingForPendings then
        parsePacket(tcpPacket)
    end if
end for
function PARSEPACKET(tcpPacket)
    parseMessage
    if isMessageParsed then
        updateSavedFlows
        if hasRemain then
            parsePacket(remainPacket)
        end if
    else
        savePendingMessage
        lookForMoreMessages
    end if
end function
For each TCP packet of the PCAP file, it is verified whether it is a JXTA message or
part of a JXTA message that was not fully parsed and is waiting for its complement; if
one of these conditions is true, then a parse attempt is made, using JNetPCAP-JXTA
functionalities, up to the full verification of the packet content. As a TCP packet may
contain one or more JXTA messages, if a message is fully parsed, another parse attempt
is made with the content not consumed by the previous parse. If the content is a JXTA
message and the parse attempt is not successful, then its TCP content is stored with its
TCP flow identification as a key, and all subsequent TCP packets that match the flow
identification will be sorted and used to attempt to mount a new JXTA message, until the
parse is successful.
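Under the assumption of a simplified `try_parse` that returns a parsed message and any leftover bytes, this loop can be sketched as follows (the real JXTAPerfMapper relies on JNetPCAP-JXTA; a toy length-prefixed format stands in for the JXTA message format):

```python
pending = {}   # flow id -> buffered bytes waiting for their complement

def handle_packet(flow, payload, looks_like_jxta, try_parse):
    """One step of Algorithm 1: buffer incomplete messages per flow and
    retry parsing as later segments of the same flow arrive."""
    if not (looks_like_jxta or flow in pending):
        return []
    data = pending.pop(flow, b"") + payload
    messages = []
    while data:
        msg, rest = try_parse(data)
        if msg is None:                 # incomplete: wait for the next segment
            pending[flow] = data
            break
        messages.append(msg)            # a packet may carry several messages
        data = rest
    return messages

def try_parse(data):
    """Toy format: one length byte, then the message body."""
    if not data or len(data) < 1 + data[0]:
        return None, data
    n = data[0]
    return data[1:1 + n], data[1 + n:]

print(handle_packet("f1", b"\x05hel", True, try_parse))    # [] (incomplete)
print(handle_packet("f1", b"lo\x02hi", False, try_parse))  # [b'hello', b'hi']
```

The second call illustrates both behaviours described above: a buffered fragment is completed by a later segment, and the leftover bytes yield a second message from the same packet.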
With these characteristics, inspecting JXTA messages and extracting application
indicators requires more effort than in other cases of DPI. For this kind of traffic analysis,
4 http://www.wireshark.org/
memory requirements become even larger, because it is necessary to take into account
not only the state of the transport session, but also the state of each application layer
session. The processing power required is also the highest, because protocol conformance
analysis requires processing the entire application data (Risso et al., 2008).
As previously shown in Figure 3.2, the AppAnalyzer is composed of Map and Reduce
functions, respectively JXTAPerfMapper and JXTAPerfReducer, to extract performance
indicators from the JXTA Socket communication layer, a JXTA communication
mechanism that implements reliable message exchange and achieves better throughput
than the other communication layers provided by the Java JXTA implementation.
JXTA Socket messages are transported over TCP, but the JXTA Socket also
implements its own control for data delivery, retransmission and acknowledgements.
Each message of a JXTA Socket is part of a Pipe, which represents a connection
established between the sender and the receiver. In a JXTA Socket communication, two
Pipes are established, one from sender to receiver and the other from receiver to sender,
in which content messages and acknowledgement messages are transported, respectively.
To evaluate and extract performance indicators from a JXTA Socket, the messages must
be sorted, grouped and linked with their respective content and acknowledgement Pipes.
The content transmitted through a JXTA Socket is split into byte-array blocks and
stored in a reliability message that is sent to the destination, which is expected to send
back an acknowledgement message of its arrival. The time between the message delivery
and when the acknowledgement is sent back is called the round-trip time (RTT); it may
vary according to the system load and may indicate a possible overload of a peer. In the
Java JXTA implementation, each block received or to be sent is queued until the system
is ready to process a new block. This waiting time to handle messages can impact the
response time of the system, increasing the message RTT.
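The RTT computation described above can be sketched by pairing each content message with the acknowledgement carrying the same sequence number on the reverse Pipe (a simplification of the JXTA reliability layer; timestamps are in seconds):

```python
def message_rtts(content_msgs, ack_msgs):
    """RTT per sequence number: time the acknowledgement was observed
    minus the time the content message was observed."""
    ack_time = {seq: t for seq, t in ack_msgs}
    return {seq: round(ack_time[seq] - t, 3)
            for seq, t in content_msgs if seq in ack_time}

# (sequence number, capture timestamp) pairs from the two Pipes:
content = [(1, 10.000), (2, 10.050)]
acks    = [(1, 10.040), (2, 10.120)]
print(message_rtts(content, acks))  # {1: 0.04, 2: 0.07}
```

A growing RTT for later sequence numbers would be the signature of the queueing delay mentioned above.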
The JXTAPerfMapper and JXTAPerfReducer evaluate the RTT of each content block
transmitted over a JXTA Socket, and also extract information about the number of
connection requests and message arrivals over time. Each Map function evaluates the
packet trace to mount JXTA messages, Pipes and Sockets. The parsed JXTA messages
are sorted by their sequence number and grouped by their Pipe identification, to compose
the Pipes of a JXTA Socket. As soon as the messages are sorted and grouped, the RTT is
obtained, and its value is associated with its key and written as an output of the Map
function.
The Reduce function defined by JXTAPerfReducer receives as input a key and a
collection of values, which are the evaluated indicator and its collected values,
respectively, and then generates individual files with the results of each evaluated
indicator.
The requirements for extending these Map and Reduce functions to address other
application indicators, such as throughput or number of retransmissions, are that each
indicator must be represented by an intermediate key, which is used by MapReduce for
grouping and sorting, and that the collected values must be associated with their key.
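This requirement can be sketched as the shape of the intermediate pairs emitted by the Map side (the event representation is illustrative; the indicator keys correspond to the indicators used in this chapter):

```python
def emit_indicators(parsed_events):
    """Map-side output: one (indicator_key, value) pair per observation,
    so that MapReduce groups and sorts the values of each indicator."""
    for event in parsed_events:
        if event["type"] == "connect":
            yield ("connection_requests", event["time"])
        elif event["type"] == "message":
            yield ("message_arrivals", event["time"])
            yield ("rtt", event["rtt"])

events = [
    {"type": "connect", "time": 1.0},
    {"type": "message", "time": 1.2, "rtt": 0.04},
]
print(sorted(emit_indicators(events)))
# [('connection_requests', 1.0), ('message_arrivals', 1.2), ('rtt', 0.04)]
```

Adding a new indicator, such as throughput, then amounts to emitting pairs under one more key.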
3.3 Evaluation

3.3.1 Evaluation Methodology
Table 3.1 Metrics to evaluate MapReduce effectiveness and completion time scalability for DPI
of a JXTA-based network traffic

Metric                        Description                                     Question
M1: Number of Indicators      Number of application indicators extracted      Q1
                              from a distributed application traffic.
M2: Proportional Scalability  Verify if the completion time decreases         Q2
                              proportionally to the number of worker nodes.

Factor                        Levels
Number of worker nodes        3 up to 19
Input size                    16 GB and 34 GB
Our testing hypotheses are defined in Tables 3.3 and 3.4, which describe the null and
alternative hypotheses for each previously defined question. Table 3.3 describes our
hypotheses and Table 3.4 presents the notation used to evaluate them.
Table 3.3 Hypotheses to evaluate the defined metrics

Q1 - H0num.indct (null): It is not possible to use MapReduce for extracting application
indicators from network traffic.
Q1 - H1num.indct (alternative): It is possible to use MapReduce for extracting application
indicators from network traffic.
Q2 - H0scale.prop (null): The completion time of MapReduce for DPI scales
proportionally to node addition.
Q2 - H1scale.prop (alternative): The completion time of MapReduce for DPI does not
scale proportionally to node addition.
Table 3.4 Notation used to evaluate the hypotheses

Q1 - H1num.indct: num.indct > 0
Q1 - H0num.indct: num.indct <= 0
Q2 - H1scale.prop: there exists n in N* such that sn = s * n and tn != t / n
Q2 - H0scale.prop: for all n in N*, sn = s * n implies tn = t / n
factor, which is the increasing factor for the cluster size evaluated. H0scale.prop states
that, evaluating a specific MapReduce job and input data, for every natural number n
greater than zero, a new cluster size defined by the previous cluster size s multiplied by
the factor n implies a reduction of the previous job time t by the same factor, resulting in
the time tn = t / n.
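H0scale.prop can be expressed directly as a predicate on measured completion times (the sample sizes, times and tolerance below are hypothetical):

```python
def scales_proportionally(base_size, base_time, size_n, time_n, tol=0.05):
    """H0: s_n = s * n  implies  t_n = t / n (within tolerance `tol`)."""
    n = size_n / base_size
    expected = base_time / n
    return abs(time_n - expected) <= tol * expected

# Example: doubling a 3-node cluster; 300 s would need to drop to about 150 s.
print(scales_proportionally(3, 300.0, 6, 150.0))  # True
print(scales_proportionally(3, 300.0, 6, 200.0))  # False
```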
3.3.2 Experiment Setup
To evaluate the MapReduce effectiveness for application traffic analysis and its
completion time scalability, we performed two sets of experiments, grouped by the
analysed input size, with variation in the number of worker nodes.
As input for the MapReduce jobs, we used network traffic captured from a JXTA-based
distributed backup system, which uses the JXTA Socket communication layer for data
transfer between peers. The network traffic was captured from an environment composed
of six peers, where one server peer receives data from five concurrent client peers, to be
stored and replicated to other peers. During the traffic capture, the server peer creates a
JXTA Socket Server to accept JXTA Socket connections and receive data through
established connections.
For each data backup, a client peer establishes a connection with the server peer and
sends messages with the content to be stored; if the content to be stored is bigger than the
JXTA message maximum size, it is transferred through two or more JXTA messages.
For our experiment, we adopted the backup of files with randomly defined content sizes,
with values between 64 KB and 256 KB.
The captured network traffic was saved into datasets of 16 GB and 34 GB, split into
35 and 79 files of 64 MB, respectively, and stored into HDFS, to be processed as
described in Section 3.2, in order to extract from the JXTA Socket communication layer
the selected indicators: round-trip time, number of connection requests per time and
number of messages received by the server peer per time.
For each experiment set, Algorithm 1 was executed, implemented by JXTAPerfMapper
and JXTAPerfReducer, and the completion time and processing capacity for profiling a
JXTA-based distributed application through DPI were measured, over different numbers
of worker nodes. Each experiment was executed 30 times to obtain reliable values (Chen
et al., 2011), within a confidence interval of 95% and a maximum error ratio of 5%. The
experiment was performed using virtual machines of Amazon EC2, with nodes running
Linux kernel 3.0.0-16, Hadoop version 0.20.203, a block size of 64 MB and data
replicated 3 times over HDFS. All the virtual machines used were composed of 2 virtual
cores, 2.5 EC2 Compute Units and 1.7 GB of RAM.
3.4 Results
From the analysed JXTA traffic we extracted three indicators: the number of JXTA
connection requests per time, the number of JXTA messages received per time, and the
round-trip time of JXTA messages, which is defined by the time between the arrival of a
content message from a client peer and the JXTA acknowledgement sent back by the
server peer. The extracted indicators are shown in Figure 3.3.
Figure 3.3 exhibits the measured indicators from the
JXTA Socket communication layer and their behaviour for concurrent data transfers, for
a server peer receiving JXTA Socket connection requests and messages from concurrent
client peers of a distributed backup system.
The three indicators extracted from the network traffic of a JXTA-based distributed
application, using MapReduce to perform the DPI algorithm,
are important indicators to evaluate a JXTA-based application (Halepovic and
Deters, 2005). With these extracted indicators it is possible to evaluate a distributed
system, providing a better understanding of the behaviour of a JXTA-based distributed
application. Through the extracted information it is possible to evaluate important
metrics, such as load distribution, response time, and the negative impact caused by
an increase in the number of messages received by a peer.
Using MapReduce to perform a DPI algorithm, it was possible to extract the three
application indicators from the network traffic, obtaining num.indct = 3, which rejects
the null hypothesis H0num.indct, stating that num.indct <= 0, and confirms the alternative
hypothesis H1num.indct, that num.indct > 0.
Figures 3.4(a) and 3.4(b) illustrate how the addition of worker nodes to a Hadoop
cluster reduces the mean completion time, and the completion time scalability
for profiling 16GB and 34GB of network traffic traces.
[Figure 3.4: Completion time over number of worker nodes, for (a) 16GB and (b) 34GB of network traffic.]
In both graphs, the behaviour of the completion time scalability is similar, not
following a linear function, with more significant scalability gains through node
addition in smaller clusters, and less significant gains from node addition in bigger
clusters.
This scalability behaviour highlights the importance of evaluating the relation between
costs and benefits of node additions in a MapReduce cluster, due to the non-proportional
scalability of the completion time.
Table 3.5 Completion time and processing capacity for profiling 16GB of network traffic

Nodes   Time (s)   Margin of Error   MB/s     (MB/s)/node
3       322.53     0.54              50.80    16.93
4       246.03     0.67              66.59    16.65
6       173.17     0.56              94.61    15.77
8       151.73     1.55              107.98   13.50
10      127.17     1.11              128.84   12.88
Table 3.6 Completion time and processing capacity for profiling 34GB of network traffic

Nodes   Time (s)   Margin of Error   MB/s     (MB/s)/node
4       464.33     0.32              74.98    18.75
8       260.60     0.76              133.60   16.70
12      189.07     1.18              184.14   15.35
16      167.13     0.81              208.32   13.02
19      134.47     1.53              258.91   13.63
which confirms sn ≠ s · n. We also have the measured t2 = 151.73 and the calculated
value 246.03/2 = 123.01 ≠ t2, which rejects tn = t/n and confirms tn ≠ t/n. Therefore,
the measured results reject the null hypothesis H0scale.prop and confirm the
alternative hypothesis H1scale.prop, which states that the completion time of MapReduce
for DPI does not scale proportionally to node addition.
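The proportionality check above can be reproduced numerically. This is a minimal sketch using the measured values from Table 3.5; the class and method names are illustrative, not part of the thesis' tooling.

```java
// Sketch of the scale.prop hypothesis check: if scaling were proportional,
// doubling the node count would halve the completion time.
public class ScalabilityCheck {
    // Completion time expected under proportional scaling when the node
    // count is multiplied by `factor`.
    static double idealTime(double baseTimeSeconds, int factor) {
        return baseTimeSeconds / factor;
    }

    public static void main(String[] args) {
        double t4 = 246.03;  // measured completion time with 4 nodes (s), Table 3.5
        double t8 = 151.73;  // measured completion time with 8 nodes (s)
        double ideal = idealTime(t4, 2); // 123.015 s if scaling were proportional
        // The measured 8-node time exceeds the ideal value, so completion time
        // does not scale proportionally to node addition:
        System.out.println(t8 > ideal); // true
    }
}
```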
3.5 Discussion
In this section, we discuss the measured results and evaluate their meaning, restrictions and
opportunities. We also discuss possible threats to the validity of our experimental results.
3.5.1 Results Discussion
Distributed systems analysis, detection of root causes and error reproduction are challenges that motivate efforts to develop less intrusive mechanisms for profiling and
monitoring distributed applications at runtime. Network traffic analysis is one option to
evaluate distributed systems, although there are limitations on the capacity to process a large
amount of network traffic in a short time, and on the completion time scalability of processing
network traffic when resource demand varies.
According to the evaluated results, when using MapReduce for profiling network traffic
from a JXTA-based distributed backup system through DPI, it is important to analyse the
possible gains of node addition to a MapReduce cluster, because node addition
provides different gains according to the cluster size and input size. For example, Table
3.6 shows that the addition of 4 nodes to a cluster with 12 nodes produces a reduction
of 11% in completion time and an improvement of 13% in processing capacity, while the
addition of the same number of nodes (4) to a cluster with 4 nodes produces a
reduction of 43% in completion time and an improvement of 78% in processing capacity.
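The percentages above follow directly from the completion times and throughputs in Table 3.6. A minimal sketch of that arithmetic, with illustrative class and method names:

```java
// Gains produced by node addition, computed from Table 3.6 values.
public class NodeAdditionGains {
    // Percentage reduction in completion time from tBefore to tAfter (seconds).
    static double timeReductionPct(double tBefore, double tAfter) {
        return 100.0 * (tBefore - tAfter) / tBefore;
    }

    // Percentage improvement in processing capacity (MB/s).
    static double capacityGainPct(double cBefore, double cAfter) {
        return 100.0 * (cAfter - cBefore) / cBefore;
    }

    public static void main(String[] args) {
        // 12 -> 16 nodes (34GB dataset): roughly -11% time, +13% capacity.
        System.out.printf("12->16 nodes: -%.1f%% time, +%.1f%% capacity%n",
                timeReductionPct(189.07, 167.13), capacityGainPct(184.14, 208.32));
        // 4 -> 8 nodes: roughly -43% time, +78% capacity.
        System.out.printf("4->8 nodes:  -%.1f%% time, +%.1f%% capacity%n",
                timeReductionPct(464.33, 260.60), capacityGainPct(74.98, 133.60));
    }
}
```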
The scalability behaviour of MapReduce for DPI highlights the importance of evaluating the relation between costs and benefits of node additions to a MapReduce cluster,
because the gains obtained with node addition are related to the current and future cluster
size and to the input size to be processed.
Growth in the number of nodes in the cluster increases costs due to greater
cluster management, data replication, task allocation to available nodes, and the
management of failures. Also, as the cluster grows, the cost of
merging and sorting the data processed by Map tasks increases (Jiang et al., 2010), since this data can
be spread over a bigger number of nodes.
In smaller clusters, the probability of a node having a replica of the input data is
greater than in bigger clusters adopting the same replication factor (Zaharia et al., 2010).
In bigger clusters there are more candidate nodes to which a task execution can be delegated, but the
replication factor limits the benefits of data locality to the number of nodes that
store a replica of the data. This increases the cost of scheduling and distributing tasks
in the cluster, and also increases the cost of data transfer over the network.
The kind of workload submitted to MapReduce impacts its
behaviour and performance (Tan et al., 2012; Groot, 2012), requiring
specific configuration to obtain optimal performance. Although studies have been
done in order to understand, analyse and improve workload management decisions
in MapReduce (Lu et al., 2012; Groot, 2012), there is no evaluation that characterizes
the MapReduce behaviour or identifies its optimal configuration to achieve the best
performance for packet level analysis and DPI. Thus, it is necessary to deeply understand
the behaviour of MapReduce when processing network traces, and which optimizations can be
done to better explore the potential provided by MapReduce for packet level analysis and
DPI.
3.5.2 Threats to Validity
Due to budget and time restrictions, our experiments were performed with small cluster sizes
and small input sizes, compared with benchmarks that evaluate MapReduce performance and its scalability (Dean and Ghemawat, 2008). However, relevant performance
evaluations and reports of real MapReduce production traces show that the majority of
MapReduce jobs are small and executed on a small number of nodes (Zaharia et al.,
2008; Wang et al., 2009; Lin et al., 2010; Zaharia et al., 2010; Kavulya et al., 2010; Chen
et al., 2011; Guo et al., 2012).
Although MapReduce has been designed to handle big data, the use of input data on
the order of gigabytes has been reported by realistic production traces (Chen et al., 2011),
and this input size has been used in relevant MapReduce performance analyses (Zaharia
et al., 2008; Wang et al., 2009; Lin et al., 2010).
Improvements in MapReduce performance and proposed schedulers have focused
on problems related to small jobs; for example, Facebook's fairness scheduler aims to
provide fast response time for small jobs (Zaharia et al., 2010; Guo et al., 2012). The fair
scheduler attempts to guarantee service levels for production jobs by maintaining job
pools composed of a smaller number of nodes than the total nodes of a data center, to
maintain a minimum share, dividing excess capacity among all jobs or pools (Zaharia
et al., 2010).
According to Zaharia et al. (2010), 78% of Facebook's MapReduce jobs have up to
60 Map tasks. Our evaluated datasets were composed of 35 and 79 files, which implies
the same respective numbers of Map tasks, since our approach evaluates an
entire block per Map task.
3.6 Chapter Summary
In this chapter, we presented an approach for profiling application traffic using MapReduce, and evaluated its effectiveness for profiling applications through DPI and its completion time scalability in a cloud computing environment.
We proposed a solution based on MapReduce for deep inspection of distributed
application traffic, in order to evaluate the behaviour of distributed systems at runtime,
using commodity hardware, in a low intrusive way, through a scalable and fault tolerant
approach based on Hadoop, an open source implementation of MapReduce.
MapReduce was used to implement a DPI algorithm that extracts application indicators
from the JXTA-based traffic of a distributed backup system. We adopted a splitting
approach without dividing blocks into records, using a network trace split into files
whose maximum size is smaller than the HDFS block size, to avoid the cost and complexity of
providing HDFS with an algorithm for splitting the network trace into blocks, and also to
use a whole block as input for each Map function, in order to be able to reassemble two or
more packets, and to reassemble JXTA messages from the packets of network traces, per Map
function.
We evaluated the effectiveness of MapReduce for a DPI algorithm and its completion
time scalability, over different sizes of network traffic used as input and different cluster
sizes. We showed that the MapReduce programming model can express algorithms for
DPI and extract application indicators from application network traffic, using virtual
machines of a cloud computing provider, for DPI of large amounts of network traffic.
We also evaluated its completion time scalability, showing the scalability behaviour, the
processing capacity achieved, and the influence of the number of nodes and the input data
size on the processing capacity for DPI.
It was shown that the MapReduce completion time scalability for DPI does not follow a
linear function, with more significant scalability gains, through the addition of nodes, in
small clusters, and less significant gains in bigger clusters.
According to the results, input size and cluster size have a significant impact on the
processing capacity and completion time of MapReduce jobs for DPI. This highlights
the importance of evaluating the best input size and cluster size to obtain optimal
performance in MapReduce jobs, but it also indicates the need for more evaluation of
the influence of other important factors on MapReduce performance, in order to provide
better configuration, selection of input size and machine allocation in a cluster, and to
provide valuable information for performance tuning and predictions.
All difficult things have their origin in that which is easy, and great
things in that which is small.
LAO TZU
The use of MapReduce for distributed data processing has been growing and achieving benefits from its application to different workloads. MapReduce can be used for
distributed traffic analysis, although network traffic traces present characteristics which
are not similar to the data commonly processed through MapReduce, which in
general is divisible, text-like data, while network traces are binary and may present
restrictions on splitting when processed through distributed approaches.
Due to the lack of evaluation of MapReduce for traffic analysis and the peculiarity of
this kind of data, this chapter deeply evaluates the performance of MapReduce in packet
level analysis and DPI of distributed application traffic, evaluating its scalability, speed-up
and the behaviour followed by MapReduce phases. The experiments provide evidence
for the predominant phases in this kind of MapReduce job, and show the impact of input
size, block size and number of nodes on completion time and scalability.
This chapter is organized as follows. We first describe the motivation for a MapReduce
performance evaluation for network traffic analysis in Section 4.1. Then we present the
evaluation plan and methodology adopted in Section 4.2, and the results are presented in
Section 4.3. Section 4.4 discusses the results and Section 4.5 summarizes the chapter.
4.1 Motivation
4.2 Evaluation
The goal of this evaluation is to characterize the behaviour of MapReduce phases, its
scalability characteristics under node addition, and the speed-up achieved with MapReduce
for packet level analysis and DPI. Thus, we performed a performance measurement and
evaluation of MapReduce jobs that execute packet level analysis and DPI algorithms.
To evaluate MapReduce for DPI, Algorithm 1, implemented by JXTAPerfMapper
and JXTAPerfReducer, was used and applied to new factors and levels. To evaluate
MapReduce for packet level analysis, a port counter algorithm developed by Lee et al.
(2011) was used, which divides a block into packets and processes each packet individually to count the number of occurrences of TCP and UDP port numbers. The same
algorithm was also evaluated using the splitting approach that processes a whole block per
Map function, without the division of a block into records or packets. A comparison was
made between these two approaches for packet level analysis.
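The core of the port-counter algorithm reduces to emitting a count per observed port and summing the counts. The sketch below collapses the map and reduce steps into one local pass over pre-parsed destination ports, with no Hadoop dependencies; the class name and the flat `int[]` input are simplifications for illustration, not the actual structure of the Lee et al. (2011) tool.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, Hadoop-free sketch of the port-counter logic: the "map" step
// would emit (port, 1) per TCP/UDP packet and the "reduce" step would sum
// the counts per port; here both are collapsed into a single local pass.
public class PortCounterSketch {
    static Map<Integer, Integer> countPorts(int[] destPorts) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int port : destPorts) {
            counts.merge(port, 1, Integer::sum); // accumulate one occurrence per packet
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ports = {80, 443, 80, 22, 80};
        // Three HTTP packets, one HTTPS packet, one SSH packet in this toy trace.
        System.out.println(countPorts(ports));
    }
}
```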
4.2.1 Evaluation Methodology
Table 4.1 Evaluated Metrics

Metrics             Description
Completion Time     Completion time of MapReduce jobs
Phases Time         Time consumed by each MapReduce phase in the total completion time of MapReduce jobs
Phases Occupation   Relative time consumed by each MapReduce phase in the total completion time of MapReduce jobs
Scalability         Processing capacity increase obtained with node addition in a MapReduce cluster
Speed-up            Improvement in completion time against the same algorithm implemented without distributed processing
Table 4.1 describes the evaluated metrics: the completion time of MapReduce
jobs, the relative and absolute time of each MapReduce phase within the total job time,
the processing capacity scalability, and the speed-up against non-distributed processing.
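The speed-up and throughput metrics are simple ratios; a minimal sketch of how they are computed, with illustrative names (the thesis' datasets are stated in gigabits, e.g. 90Gb):

```java
// Sketch of the speed-up and throughput metrics from Table 4.1.
public class MetricsSketch {
    // Speed-up: completion time of the non-distributed run divided by the
    // MapReduce job completion time for the same algorithm and dataset.
    static double speedUp(double tMonoSeconds, double tMapReduceSeconds) {
        return tMonoSeconds / tMapReduceSeconds;
    }

    // Throughput in Mbps for an input measured in gigabits (Gb).
    static double throughputMbps(double inputGigabits, double timeSeconds) {
        return inputGigabits * 1000.0 / timeSeconds;
    }

    public static void main(String[] args) {
        // A job 4x faster than its single-machine counterpart:
        System.out.println(speedUp(100.0, 25.0));        // 4.0
        // 90Gb processed in 900s corresponds to 100 Mbps:
        System.out.println(throughputMbps(90.0, 900.0)); // 100.0
    }
}
```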
The experiments adopt the factors and levels described in Table 4.2. The selected
factors were chosen due to their importance for MapReduce performance evaluations and
their adoption by relevant previous research (Jiang et al., 2010; Chen et al., 2011; Shafer
et al., 2010; Wang et al., 2009).
Table 4.2 Factors and Levels

Factors                  Levels
Number of Worker Nodes   2 up to 29
Block Size               32MB, 64MB and 128MB
Input Size               90Gb and 30Gb
Hadoop logs are a valuable source of information about the Hadoop environment and its
job executions; important MapReduce indicators and information about jobs, tasks, attempts,
failures and topology are logged by Hadoop during its execution. The data collected to
perform this performance evaluation was extracted from Hadoop logs.
To extract information from Hadoop logs and to evaluate the selected metrics, we
developed Hadoop-Analyzer (Vieira, 2013), an open source and publicly available
tool to extract and evaluate MapReduce indicators, such as job completion time and
MapReduce phase distribution, from the logs generated by Hadoop job executions.
With Hadoop-Analyzer it is possible to generate graphs of the extracted indicators and
thereby evaluate the desired metrics.
Hadoop-Analyzer relies on Rumen (2012) to extract raw data from Hadoop logs
and generate structured information, which is processed and shown in graphs generated
through R (Eddelbuettel, 2012) and Gnuplot (Janert, 2010), such as the results presented
in Section 4.3.
4.2.2 Experiment Setup
Network traffic traces of distributed applications were captured to be used as input for
the MapReduce jobs of our experiments; these traces were divided into files with size defined
by the block size adopted in each experiment, and then the files were stored into HDFS,
following the process described in the previous chapter. The packets were captured using
Tcpdump and were split into files with sizes of 32MB, 64MB and 128MB.
For the packet level analysis and DPI evaluation, two sizes of datasets were captured
from network traffic transferred between nodes of distributed systems. One dataset
was 30Gb of network traffic, with data divided into 30 files of 128MB, 60 files of 64MB
and 120 files of 32MB. The other dataset was 90Gb of network traffic, split into 90 files of
128MB, 180 files of 64MB and 360 files of 32MB.
For the experiments of DPI through MapReduce, we used network traffic captured
from the same JXTA-based application described in Section 3.3.2, but with different
sizes of traces and files. To evaluate MapReduce for packet level analysis, we processed
network traffic captured from data transferred between 5 clients and one server of a data
storage service provided through the Internet, known as Dropbox1 .
To evaluate MapReduce for packet level analysis and DPI, one driver was developed
for each case of network trafc analysis, with one version using MapReduce and another
without it.
CountUpDriver implements packet level analysis for a port counter of network traces,
which records how many times a port appears in TCP or UDP packets; its implementation
is based on processing a whole block as input for Map functions, without splitting, and with
the block size defined by the HDFS block size. Furthermore, a port counter implemented
with P3 was evaluated; this implementation is a version of the tool presented by Lee et al.
1 http://www.dropbox.com/
(2011), which adopts an approach that divides a block into packets and processes each
packet individually, without dependent information between packets.
JxtaSocketPerfDriver implements DPI to extract, from JXTA (Duigou, 2003) network traffic, the round-trip time of JXTA messages, the number of connection requests
per time, and the number of JXTA Socket messages between JXTA clients and a JXTA Socket
server. JxtaSocketPerfDriver uses whole files as input for each Map function, with size
defined by the HDFS block size, in order to reassemble JXTA messages whose content is
divided among many TCP packets.
One TCP packet can transport one or more JXTA messages at a time, which makes it
necessary to evaluate the full content of TCP segments to recognize all possible JXTA
messages, instead of evaluating only a message header or signature. The round-trip time
of JXTA messages is calculated from the time between a client peer sending a JXTA
message and receiving the JXTA message arrival confirmation. To evaluate the round-trip
time it is necessary to keep the information of requests and which responses correspond to
each request; thus, it is necessary to analyse several packets to retrieve and evaluate
information about the application behaviour and its states.
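The request/response pairing described above requires state that survives across packets. The sketch below illustrates that stateful matching; the class, the method names, and the string message identifiers are hypothetical simplifications, not the actual JxtaSocketPerfDriver data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of round-trip time tracking: pending requests are kept
// until the matching acknowledgement is observed in a later packet.
public class RttTracker {
    private final Map<String, Double> pending = new HashMap<>(); // message id -> send time (s)
    private final List<Double> rtts = new ArrayList<>();

    void onMessageSent(String msgId, double timestamp) {
        pending.put(msgId, timestamp);
    }

    void onAckReceived(String msgId, double timestamp) {
        Double sent = pending.remove(msgId);
        if (sent != null) {
            rtts.add(timestamp - sent); // round-trip time for this message
        }
    }

    List<Double> roundTripTimes() {
        return rtts;
    }

    public static void main(String[] args) {
        RttTracker tracker = new RttTracker();
        tracker.onMessageSent("msg-1", 10.00);  // content message observed at t = 10.00 s
        tracker.onAckReceived("msg-1", 10.25);  // acknowledgement observed at t = 10.25 s
        System.out.println(tracker.roundTripTimes()); // [0.25]
    }
}
```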
To analyse the speed-up provided by MapReduce against a single machine execution,
two drivers were developed that use the same dataset and implement the same algorithms
as CountUpDriver and JxtaSocketPerfDriver, but without distributed processing. These drivers are, respectively, CountUpMono and JxtaSocketPerfMono.
The source code of all implemented drivers, and of the other implementations that support the
use of MapReduce for network traffic analysis, is open source and publicly available at
Vieira (2012a).
The experiments were performed on a 30-node Hadoop-1.0.3 cluster composed of
nodes with four 3.2GHz cores, 8GB RAM and 260GB of available hard disk space,
running Linux kernel 3.2.0-29. Hadoop was used as our MapReduce implementation,
configured to permit a maximum of 4 Map and 1 Reduce tasks per node; also, we
defined the value -Xmx1500m as the JVM child option and 400 as the io.sort.mb value.
For the drivers CountUpDriver and JxtaSocketPerfDriver, the number of Reducers
was defined as a function of the number of Reducer slots per node, given by
numReducers = (0.95)(numNodes)(maxReducersPerNode) (Kavulya et al., 2010). The
driver implemented with P3 (Lee et al., 2011) adopts a fixed number of Reducers, defined
as 10 by the available version of P3. Each experiment was executed 20 times to obtain
reliable values (Chen et al., 2011), within a confidence interval of 95% and a maximum
error ratio of 5%.
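The Reducer-count heuristic above truncates to whole slots in practice; a minimal sketch, with an illustrative class name, using the cluster's configuration of 1 Reduce slot per node:

```java
// Sketch of the reducer-count heuristic numReducers = 0.95 * numNodes * maxReducersPerNode
// (Kavulya et al., 2010), truncated to whole reducer slots.
public class ReducerCount {
    static int numReducers(int numNodes, int maxReducersPerNode) {
        return (int) (0.95 * numNodes * maxReducersPerNode);
    }

    public static void main(String[] args) {
        // With 1 Reduce slot per node, as configured in these experiments:
        System.out.println(numReducers(6, 1));   // 5 (the 5 Reducers cited in Section 4.3)
        System.out.println(numReducers(29, 1));  // 27
    }
}
```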
4.3 Results
Two dataset sizes of network traffic were used during the experiments, 30Gb and
90Gb. Each dataset was processed by MapReduce jobs that implement packet level
analysis and DPI, in Hadoop clusters with the number of worker nodes varying between
2 and 29, and block sizes of 32MB, 64MB and 128MB.
Each dataset was also processed by the same algorithms implemented without distributed processing,
to evaluate the speed-up achieved. Table 4.3 shows
the execution times obtained by the non-distributed processing, implemented and executed through JxtaSocketPerfMono and CountUpMono, using a single machine with the
resource configuration described in Subsection 4.2.2.
Table 4.3 Non-Distributed Execution Time in seconds

Block    JxtaSocketPerfMono     CountUpMono
         90Gb       30Gb        90Gb      30Gb
32MB     1745.35    584.92      872.40    86.71
64MB     1755.40    587.02      571.33    91.76
128MB    1765.50    606.50      745.25    94.82
Figure 4.1 shows the completion time and speed-up of the DPI Algorithm 1 used to extract
indicators from a JXTA-based distributed application. The completion time represents
the job time of JxtaSocketPerfDriver, and the speed-up represents the gains in execution time
of JxtaSocketPerfDriver against JxtaSocketPerfMono to process 90Gb of network traffic.
Figure 4.1 DPI Completion Time and Speed-up of MapReduce for 90Gb of a JXTA-application network traffic
According to Figure 4.1, JxtaSocketPerfDriver performs better than JxtaSocketPerfMono over all factor variations. Initially, we observed a best speed-up of 3.70 times
with 2 nodes and blocks of 128MB; ultimately, we observed a maximum speed-up of 16.19
times with 29 nodes and blocks of 64MB. The speed-up achieved with a block size of 32MB
was initially the worst case, but it increased with node addition and became
better than that of 128MB blocks, and close to the speed-up achieved with blocks of 64MB,
for a cluster with 29 nodes.
The completion time scalability behaviour for the 32MB block size showed a reduction
in completion time for every node addition, although the cases with block sizes of 64MB and
128MB presented no significant reduction in completion time in clusters with more than 25
nodes. According to Figure 4.1, the completion time does not decrease linearly with node
addition, and the improvement in completion time was less significant when the dataset
was processed by more than 14 nodes, especially for the cases that adopted blocks of 64MB
and 128MB.
Figure 4.2 shows the processing capacity of MapReduce applied to DPI of 90Gb
of JXTA-based application traffic, over variations of cluster size and block size. The
processing capacity was evaluated by the throughput of network traffic processed, and
by the relative throughput, defined as the processing capacity achieved per number of
allocated nodes.
[Figure 4.2: Throughput and throughput per node (Mbps) over number of nodes, for block sizes of 32MB, 64MB and 128MB.]
The processing capacity achieved for DPI of 90Gb using a block size of 64MB was
159.89 Mbps with 2 worker nodes, increasing up to 869.43 Mbps with 29 worker nodes.
For the same case, the relative processing capacity achieved was 79.94 Mbps/node with 2
nodes and 29.98 Mbps/node with 29 nodes, showing a decrease in relative processing
capacity as nodes were added.
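The relative throughput above is a plain per-node ratio; a minimal sketch with an illustrative class name, checked against the measured 64MB-block values:

```java
// Relative processing capacity: throughput achieved per allocated worker node.
public class RelativeThroughput {
    static double perNode(double throughputMbps, int nodes) {
        return throughputMbps / nodes;
    }

    public static void main(String[] args) {
        System.out.println(perNode(159.89, 2));   // ~79.94 Mbps/node with 2 nodes
        System.out.println(perNode(869.43, 29));  // ~29.98 Mbps/node with 29 nodes
    }
}
```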
[Figure 4.3: (a) Cumulative time and (b) percentage of completion time of the Map, "Map and Shuffle", Shuffle, Sort, Reduce, Setup and Cleanup phases, over number of nodes, for block sizes of 32MB, 64MB and 128MB.]
MapReduce execution can be divided into the Map, Shuffle, Sort and Reduce phases,
although Shuffle tasks can be executed before the conclusion of all Map tasks, so
Map and Shuffle tasks can overlap. According to the Hadoop default configuration, the
overlap between Map and Shuffle tasks starts after 5% of the Map tasks have concluded;
Shuffle tasks are then started and run until the Map phase ends.
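In the Hadoop 1.x line used in these experiments, this 5% threshold is a configurable property, `mapred.reduce.slowstart.completed.maps`. A minimal `mapred-site.xml` fragment making the default explicit:

```xml
<!-- Fraction of Map tasks that must complete before reduce-side Shuffle
     (map output copying) starts. 0.05 is the Hadoop default, i.e. the 5%
     mentioned above. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.05</value>
</property>
```

Raising this value would delay the start of the Shuffle copy and thereby shrink the overlapping "Map and Shuffle" interval discussed below.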
In Figures 4.3(a) and 4.3(b) we show the overlap between Map and Shuffle
tasks as a specific MapReduce phase, represented as the "Map and Shuffle" phase. The time
consumed by Setup and Cleanup tasks was also considered, for a better visualization of
the division of execution time in Hadoop jobs.
Figure 4.3(a) shows the cumulative time of each MapReduce phase in the total job time.
For DPI, the Map time, which comprises the Map and the "Map and Shuffle" phases, consumes the
major part of a job's execution time and is the phase that varies most with the
number of nodes, but no significant time reduction is achieved with more than
21 nodes and block sizes of 64MB or 128MB.
The Shuffle time, measured after all Map tasks are completed, presented low
variation with node addition. The Sort and Reduce phases required relatively low execution
times and do not appear in some bars of the graph. Setup and Cleanup tasks consumed
an almost constant time, independently of cluster size or block size variation.
Figure 4.3(b) shows the percentage of each MapReduce phase in the total job completion
time. We also considered an additional phase, called "others", which represents the time
consumed by cluster management tasks, like scheduling and task assignment. The
behaviour of phase occupation is similar over all block sizes evaluated, with
the exception of the case where the Map time does not decrease with node addition, in
clusters using a block size of 128MB and more than 21 nodes.
With cluster size variation, a relative reduction in Map time was observed, together with a
relative increase in the time of the Shuffle, Setup and Cleanup phases. During the evaluation
of Figure 4.3(a), it was observed that the Setup and Cleanup phases consume an almost
constant absolute time, independently of cluster size and block size; thereby, with
node addition and decreasing completion time, the time consumed by Setup and Cleanup
tasks became more significant in relation to the total execution time, since the total
job completion time decreased while the time of the Setup and Cleanup tasks remained
almost the same. Therefore, the percentage of time taken by Setup and Cleanup increased and became
more significant as the total job completion time was reduced by node addition to the
MapReduce cluster.
According to Figures 4.3(a) and 4.3(b), the Map phase is predominant in MapReduce
jobs for DPI, and the reduction of the total job completion time with node addition is
related to the decrease of the Map phase time. Thus, improvements in Map phase execution
for DPI workloads can produce the most significant gains in reducing the total job
completion time for DPI.
Figure 4.4 shows the comparison between the completion times of CountUpDriver and
P3 for packet level analysis of 90Gb of network traffic, over variations of cluster size and
block size.
Figure 4.4 Completion time comparison of MapReduce for packet level analysis, evaluating the approach with and without splitting into packets
P3 achieves better completion times than CountUpDriver over all factors, showing
that a divisible-files approach performs better for packet level analysis, and that block size
is a significant factor for both approaches, due to the significant impact on completion time
caused by the adoption of blocks of different sizes.
Varying the number of nodes, it was observed that a block size of
128MB achieved better completion times up to 10 nodes, but that no further improvement in completion time was achieved with node addition in clusters with more than 10
nodes. Blocks of 32MB and 64MB only present a significant completion time difference in
clusters of up to 14 nodes; for clusters bigger than 14 nodes, a similar completion time was
achieved for both block sizes, still better than the
completion time achieved with blocks of 128MB.
Figures 4.5(a) and 4.5(b) show, respectively, the completion time and speed-up of P3
and CountUpDriver against CountUpMono, for packet level analysis, with variation in
the number of nodes and block size. In both cases, the use of a block size of 128MB
provides the best completion time in smaller clusters, up to 10 nodes, but it provides
worse completion times in clusters with more than 21 nodes. For both evaluations,
the speed-up adopting blocks of 128MB scales up to 10 nodes, but for bigger clusters no
speed-up gain was achieved with node addition.
[Figure 4.5: Completion time and speed-up over number of nodes, for block sizes of 32MB, 64MB and 128MB: (a) P3 evaluation; (b) CountUpDriver evaluation.]
Using blocks of 32MB, an improvement in completion time was achieved for every node
addition, which yields an improvement in speed-up for all cluster sizes, although this
block size did not present a better completion time than the other block sizes in any case.
The adoption of 32MB blocks provided better speed-up than the other block sizes in
clusters with more than 14 nodes, because the time consumed by CountUpMono to
process 90Gb divided into 32MB files was bigger than the time consumed
in the cases with other block sizes, as shown in Table 4.3.
Figures 4.6(a) and 4.6(b) show the processing capacity of P3 and CountUpDriver when
performing packet level analysis of 90Gb of network traffic, over variations of cluster size
and block size.
[Figure 4.6: Throughput and throughput per node (Mbps) over number of nodes, for block sizes of 32MB, 64MB and 128MB: (a) P3; (b) CountUpDriver.]
Using a block size of 64MB, P3 achieved a throughput of 413.16 Mbps with 2 nodes and
a maximum of 1606.13 Mbps with 28 nodes, while its relative throughput for the same
configuration was 206.58 Mbps and 55.38 Mbps, respectively. The processing capacity for packet
level analysis, evaluated for P3 and CountUpDriver, follows the same behaviour shown
in Figure 4.2. Additionally, it is possible to observe a convergent decrease of relative
processing capacity for all block sizes evaluated, starting at a cluster size of 14 nodes,
where the relative throughput achieved by all block sizes is quite similar.
Figure 4.6(b) shows an increase in relative processing capacity with the addition of
2 nodes to a cluster with 4 nodes. For packet level analysis of 90Gb, MapReduce
achieved its best processing capacity efficiency per node using 6 nodes, which provides
24 Mappers and 5 Reducers per Map and Reduce wave. With the adopted variation in
the number of Reducers according to cluster size, using 5 Reducers achieved better
processing efficiency and a significant reduction in Reduce time, as shown in Figure 4.7(b).
Figures 4.7(a) and 4.7(b) show the cumulative time per phase during a job execution.
[Figure 4.7: Cumulative time of the Map, "Map and Shuffle", Shuffle, Sort, Reduce, Setup and Cleanup phases, over number of nodes, for block sizes of 32MB, 64MB and 128MB: (a) P3; (b) CountUpDriver.]
The behaviour of MapReduce phases for packet level analysis is similar to the behaviour observed for DPI: Map time is predominant, Map and Shuffle time stops decreasing once the cluster grows beyond a certain size, and the Sort and Reduce phases consume little execution time. The exception is that the Shuffle phase consumes more time in packet level analysis jobs than in DPI, especially in smaller clusters.
For packet level analysis, the amount of intermediate data generated by Map functions is bigger than the amount generated when MapReduce is used for DPI; packet level analysis generates an intermediate record for each packet evaluated, whereas DPI must evaluate more than one packet to generate intermediate data. The Shuffle phase is responsible for sorting and transferring the Map outputs to the Reducers as inputs; thus the amount of intermediate data generated by Map tasks, together with the network transfer cost, impacts the Shuffle phase time.
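This contrast can be illustrated with a toy port counter in the MapReduce style (a hypothetical stand-in for what CountUpDriver computes, not its actual code): the map step emits one intermediate pair per packet, so shuffle volume grows linearly with the packet count, whereas a DPI mapper would emit a pair only after matching a message spanning several packets.

```python
from collections import defaultdict

def map_packets(packets):
    """Packet-level analysis: emit one intermediate (port, 1) pair per packet."""
    for pkt in packets:
        yield pkt["dst_port"], 1

def reduce_counts(pairs):
    """Sum the counts shuffled to each port key."""
    counts = defaultdict(int)
    for port, one in pairs:
        counts[port] += one
    return dict(counts)

packets = [{"dst_port": 80}, {"dst_port": 443}, {"dst_port": 80}]
intermediate = list(map_packets(packets))
assert len(intermediate) == len(packets)  # shuffle volume equals packet count
print(reduce_counts(intermediate))        # {80: 2, 443: 1}
```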
Figures 4.8(a) and 4.8(b) show the percentage of job completion time taken by each phase for P3 and CountUpDriver, respectively.
[Figure 4.8: Percentage of completion time per phase (Map, Map and Shuffle, Shuffle, Sort, Reduce, Setup, Cleanup, Others), for cluster sizes from 2 to 28 nodes and block sizes of 32MB, 64MB and 128MB; (a) P3, (b) CountUpDriver.]
As in the behaviour observed in Figure 4.3(b) and followed by these cases, the Map and Shuffle phases consume more relative time than all other phases, across all factors. For packet level analysis, however, the Map phase share decreases significantly with node addition only when the block size is 32MB or 64MB, following the completion time behaviour observed in Figures 4.5(a) and 4.5(b).
The same experiments were conducted for the dataset of 30Gb of network traffic, and the results of the MapReduce phase evaluation presented behaviour quite similar to the results already presented for the 90Gb experiments, for both DPI and packet level analysis. Relevant differences were identified for speed-up, completion time and scalability, as shown by Figures 4.9(a) and 4.9(b), which exhibit the completion time and processing capacity scalability of MapReduce for DPI of 30Gb of network traffic, with variation in cluster size and block size.
[Figure 4.9: (a) completion time (s) and speed-up, and (b) Throughput (Mbps) and Throughput/Nodes (Mbps), against the number of nodes, for block sizes of 32MB, 64MB and 128MB.]
Figure 4.9 DPI Completion Time and Processing Capacity for 30Gb
The completion time of DPI of 30Gb scales significantly up to 10 nodes; beyond that, the experiment presents no further gains with node addition using a block size of 128MB, and presents a small increase in completion time in the cases using blocks of 32MB and 64MB. This is the same behaviour observed for job completion time for 90Gb, as shown in Figures 4.5(a) and 4.5(b), but with significant scaling only up to 10 nodes for the dataset of 30Gb, whereas scaling up to 25 nodes was achieved for 90Gb. Figure 4.9(a) shows that a completion time of 199.12 seconds was obtained with 2 nodes using blocks of 128MB, scaling down to 87.33 seconds with 10 nodes and the same block size, while the same configuration achieved 474.44 and 147.12 seconds, respectively, for DPI of 90Gb, as shown in Figure 4.1.
Although the 90Gb case (Figure 4.1) processed a dataset 3 times bigger than the 30Gb case (Figure 4.9(a)), the completion time achieved in all 90Gb cases was smaller than 3 times the completion time for processing 30Gb. For the cases with 2 and 10 nodes using blocks of 128MB, it took respectively only 2.38 and 1.68 times more time to process the 90Gb dataset, which is 3 times bigger than the 30Gb dataset.
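These ratios follow directly from the reported completion times; checking the arithmetic:

```python
# Completion times (s) for 128MB blocks, taken from Figures 4.1 and 4.9(a)
t90 = {2: 474.44, 10: 147.12}   # 90Gb dataset
t30 = {2: 199.12, 10: 87.33}    # 30Gb dataset

for nodes in (2, 10):
    ratio = t90[nodes] / t30[nodes]
    # 3x the data costs well under 3x the time
    print(nodes, round(ratio, 2))   # prints: 2 2.38, then 10 1.68
```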
Figure 4.9(b) shows the processing capacity for DPI of 30Gb. The maximum speed-up achieved for DPI of 30Gb was 7.90 times, using blocks of 32MB and 29 worker nodes, while a maximum speed-up of 16.19 times was achieved for DPI of 90Gb with 28 nodes. From these results, it is possible to conclude that MapReduce efficiency is better suited to bigger data, and that in some cases it can be more efficient to accumulate input data and process a bigger amount at once. It is therefore important to analyse the dataset size to be processed, and to quantify the ideal number of allocated nodes for each job, in order to avoid wasting resources.
4.4 Discussion
In this section, we discuss the measured results and evaluate their meaning, restrictions
and opportunities. We also discuss possible threats to the validity of our experiment.
4.4.1 Results Discussion
According to the processing capacity presented in our experimental results for the evaluation of MapReduce for packet level analysis and DPI, the adoption of MapReduce for packet level analysis and DPI provided high processing capacity and speed-up of completion time when compared with a solution without distributed processing, making it possible to evaluate a large amount of network traffic and extract information from distributed applications of an evaluated data center.
The block size adopted and the number of nodes allocated to data processing are important factors for obtaining an efficient job completion time and processing capacity scalability. Some benchmarks show that MapReduce performance can be improved by an optimal block size choice (Jiang et al., 2010), with better performance from the adoption of bigger block sizes. We evaluated the impact of the block size for packet level analysis and DPI workloads; blocks of 128MB provided a better completion time in smaller clusters, but blocks of 64MB performed better in bigger clusters. Thus, in order to obtain an optimal completion time when adopting a bigger block size, it is also necessary to evaluate the node allocation for the MapReduce job, because the variation in block size and cluster size can have a significant impact on completion time.
The different processing capacities achieved for processing the 30Gb and 90Gb datasets highlight the efficiency of MapReduce for dealing with bigger data, and show that it can be more efficient to accumulate input data and process a larger amount at once. Therefore, it is important to analyse the dataset size to be processed, and to quantify the ideal number of allocated nodes for each job, in order to avoid wasting resources.
The evaluation of the dataset size and the optimal number of nodes is important to understand how to schedule MapReduce jobs and allocate resources through specific Hadoop schedulers, such as the Capacity Scheduler and the Fair Scheduler (Zaharia et al., 2010), in order to avoid wasting resources by allocating nodes that will not produce significant gains (Verma et al., 2012a). Thus, the variation of processing capacity achieved in our experiments highlights the importance of evaluating the cost of node allocation and its benefits, and the need to evaluate the ideal size of pools in the Hadoop cluster, to balance the cluster size allocated to process an input size against the resource sharing of a Hadoop cluster.
The MapReduce processing capacity does not scale proportionally to node addition; in some cases there is no significant processing capacity increase with node addition, as shown in Figure 4.1, where jobs using block sizes of 64MB and 128MB in clusters with more than 14 nodes, for DPI of 90Gb, present no significant completion time gain with node addition.
The number of execution waves is a factor that must be evaluated (Kavulya et al., 2010) when MapReduce scalability is analysed, because the decrease in execution time is related to the number of execution waves necessary to process all input data. The number of execution waves is defined by the available slots for execution of Map and Reduce tasks; for example, if a MapReduce job is divided into 10 tasks in a cluster with 5 available slots, then 2 execution waves will be necessary for all tasks to be executed.
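The wave count in this example is just a ceiling division of tasks by available slots:

```python
import math

def execution_waves(num_tasks: int, available_slots: int) -> int:
    """Number of waves needed so that every task eventually gets a slot."""
    return math.ceil(num_tasks / available_slots)

print(execution_waves(10, 5))  # 2 waves, as in the example above
```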
Figure 4.9(a) shows a case of DPI of 30Gb, using a block size of 128MB, in which there was no reduction in completion time with cluster sizes bigger than 10 nodes, because there was no reduction in the number of execution waves. But in our experiments, cases with a reduction of execution waves also presented no significant reduction in completion time, such as the cases using a block size of 128MB in clusters with 21 nodes or more, for DPI and packet level analysis, as shown in Figure 4.1. Thus, node addition and task distribution must be evaluated to optimize resource usage and to avoid additional or unnecessary costs in machines and power consumption.
The comparison of completion time between CountUpDriver and P3 shows that P3, which splits the data into packets, performs better than CountUpDriver, which processes a whole block without splitting. When a whole block is processed as input, the local node parallelism is limited to the number of slots per node, while in the divisible approach each split can be processed by an independent thread, increasing the possible parallelism. Because some cases require data without splitting, such as DPI and video processing cases (Pereira et al., 2010), improvements for this issue must be evaluated, considering better schedulers, data locality and task assignment.
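The parallelism argument can be made concrete with a toy sketch (illustrative values only, not the experiments' code): a whole-block input is a single record handled by one slot, while splittable input lets each record run on its own thread.

```python
from concurrent.futures import ThreadPoolExecutor

def analyse(split):
    """Stand-in for per-split packet analysis."""
    return sum(split)

block = [[1, 2], [3, 4], [5, 6], [7, 8]]   # one HDFS block, four record splits

# Whole-block input: the block is a single record, so one Map slot works alone.
serial = analyse([x for split in block for x in split])

# Divisible input: each split is an independent record, runnable on its own thread.
with ThreadPoolExecutor(max_workers=len(block)) as pool:
    parallel = sum(pool.map(analyse, block))

assert serial == parallel  # same answer, more concurrency in the divisible case
```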
The behavioural evaluation of MapReduce phases showed that the Map phase is predominant in total execution time for packet level analysis and DPI, with Shuffle being the second most expressive phase. Shuffle can overlap the Map phase, and this condition must be considered in MapReduce evaluations, especially in our case, because the overlap of Map and Shuffle represents more than 50% of total execution time.
The long duration of the "Map and Shuffle" phase represents a long time of Shuffle tasks executing in parallel with Map tasks, and a long time of slots allocated to Shuffle tasks that will only conclude after all Map tasks are finished, even though these Shuffle tasks can be held longer than the time required to read and process the generated intermediate data. If there are slots allocated to Shuffle tasks that are only waiting for the Map phase to conclude, these slots could be used for other task executions, which could accelerate the job completion time.
With the increase of cluster size and reduction of job completion time, it was observed that the Map phase showed a proportional decrease, while the Shuffle phase increased with the growth of the number of nodes. With more nodes, the intermediate data generated by Map tasks is placed on more nodes, which are responsible for shuffling the data and sending it to specific Reducers, increasing the amount of remote I/O from Mappers to Reducers and the number of data sources for each Reducer. The Shuffle phase may represent a bottleneck (Zhang et al., 2009) for scalability and could be optimized, due to I/O restrictions (Lee et al., 2012; Akram et al., 2012) and data locality issues in the Reduce phase (Hammoud and Sakr, 2011).
Information extracted from the analysed results, about the performance obtained with specific cluster, block and input sizes, is important for configuring MapReduce resource allocation and specialized schedulers, such as the Fair Scheduler (Zaharia et al., 2008), which defines pool sizes and resource shares for MapReduce jobs. Thus, with information on the performance achieved with specific resources, it is possible to configure MapReduce parameters in order to balance the resource allocation against the expected completion time or resource sharing (Zaharia et al., 2008, 2010).
4.4.2 Threats to Validity
In this chapter we evaluated, for packet level analysis, a port counter implemented with P3. We used a version of this implementation published on the Lee et al. (2011) website2, obtained in February 2012, when a complete binary version was available; this binary version was used in our experiments but is currently no longer available. Part of the P3 source code was published later, but not all the code necessary to compile the binary libraries required to evaluate the P3 implementation of a port counter. It is therefore important to highlight that the results obtained through our evaluation are for the P3 version obtained in February 2012 from the Lee et al. (2011) website.
It is also important to highlight that DPI can present restrictions for evaluating encrypted messages, and that the obtained results are specific to the input datasets, factors, levels and experiment setup used in our evaluation.
4.5 Chapter Summary
In this chapter, we evaluated the performance of MapReduce for packet level analysis and DPI of application traffic. We evaluated how data input, block and cluster sizes impact MapReduce phases, job completion time, processing capacity scalability and the speed-up achieved in comparison with the same algorithm executed by a non-distributed implementation.

2 https://sites.google.com/a/networks.cnu.ac.kr/yhlee/p3
The results show that MapReduce presents high processing capacity for dealing with massive application traffic analysis. The behaviour of MapReduce phases over variations of block size and cluster size was evaluated; we verified that packet level analysis and DPI are Map-intensive jobs, and that the Map phase consumes more than 70% of execution time, with Shuffle being the second predominant phase.
We showed that input size, block size and cluster size are important factors to be considered in order to achieve better job completion time and to explore MapReduce scalability and efficient resource allocation, due to the variation in completion time caused by the block size adopted and, in some cases, due to the processing capacity not increasing with node addition to the cluster. We also showed that using a whole block as input for Map functions achieved poorer performance than using divisible data; thus more evaluation is necessary to understand how this can be handled and improved.
Distributed systems have been adopted for building modern Internet services and cloud computing infrastructures. The detection of error causes, and the diagnosis and reproduction of errors of distributed systems, are challenges that motivate efforts to develop less intrusive mechanisms for monitoring and debugging distributed applications at runtime.

Network traffic analysis is one option for distributed systems measurement, although there are limitations in the capacity to process large amounts of network traffic in a short time, and in the scalability to process network traffic when there is variation in resource demand.

In this dissertation we proposed an approach to perform deep inspection of distributed applications' network traffic, in order to evaluate distributed systems at a data center through network traffic analysis, using commodity hardware and cloud computing services, in a minimally intrusive way. Thus we developed an approach based on MapReduce to evaluate the behavior of a JXTA-based distributed system through DPI.

We evaluated the effectiveness of MapReduce to implement a DPI algorithm and its completion time scalability to measure a JXTA-based application, using virtual machines of a cloud computing provider. We also deeply evaluated the performance of MapReduce for packet-level analysis and DPI, characterizing the behavior of MapReduce phases, its processing capacity scalability and speed-up, over variations of input size, block size and cluster size.
5.1 Conclusion
With our proposed approach, it is possible to measure the network traffic behavior of distributed applications that generate intensive network traffic, through the offline evaluation of information from the production environment of a distributed system, making it possible to use the information from the evaluated indicators to diagnose problems and analyse the performance of distributed systems.
We showed that the MapReduce programming model can express algorithms for DPI, such as Algorithm 1, implemented to extract application indicators from the network traffic of a JXTA-based distributed application. We analysed the completion time scalability achieved for different numbers of nodes in a Hadoop cluster composed of virtual machines, with different sizes of network traffic used as input. We showed the processing capacity and the completion time scalability achieved, and also showed the influence of the number of nodes and the input data size on the processing capacity for DPI using virtual machines of Amazon EC2, for a selected scenario.
We evaluated the performance of MapReduce for packet level analysis and DPI of application traffic, using commodity hardware, and showed how input data size, block size and cluster size have relevant impacts on MapReduce phases, job completion time, processing capacity scalability and the speed-up achieved in comparison with the same execution by a non-distributed implementation.
The results showed that although MapReduce presents good processing capacity for dealing with massive application traffic analysis using cloud services or commodity computers, it is necessary to evaluate the behaviour of MapReduce when processing specific data types, in order to understand its relation with the available resources and the configuration of MapReduce parameters, and to obtain optimal performance for specific environments.
We showed that MapReduce processing capacity scalability is not proportional to the number of allocated nodes, and that the relative processing capacity decreases with node addition. We showed that input size, block size and cluster size are important factors to be considered in order to achieve better job completion time and to explore MapReduce scalability, due to the observed variation in completion time caused by the different block sizes adopted. Also, in some cases, the processing capacity does not scale with node addition to the cluster, which highlights the importance of allocating resources according to the workload and input data, in order to avoid wasting resources.
We verified that packet level analysis and DPI are Map-intensive jobs, since the Map phase consumes more than 70% of the total job completion time, with the Shuffle phase being the second predominant phase. We also showed that using a whole block as input for Map functions achieved a poorer completion time than the approach that splits the block into records.
5.2 Contributions
In attempting to address the processing capacity problem of measuring distributed systems through network traffic analysis, the work presented in this dissertation provides the contributions below:
1. We proposed an approach to implement DPI algorithms through MapReduce, using whole blocks as input for Map functions. We showed the effectiveness of MapReduce for a DPI algorithm that extracts indicators from a distributed application's traffic, and also showed the completion time scalability of MapReduce for DPI, using virtual machines of a cloud provider;
2. We developed JNetPCAP-JXTA (Vieira, 2012b), an open source parser to extract JXTA messages from network traffic traces;
3. We developed Hadoop-Analyzer (Vieira, 2013), an open source tool to extract indicators from Hadoop logs and generate graphs of specified metrics;
4. We characterized the behavior of MapReduce phases for packet level analysis and DPI, showing that this kind of job is Map-intensive and highlighting points that can be improved;
5. We described the processing capacity scalability of MapReduce for packet
level analysis and DPI, evaluating the impact caused by variations in input
size, cluster size and block size;
6. We showed the speed-up obtained with MapReduce for DPI, with variations in
input size, cluster size and block size;
7. We published two papers reporting our results, as follows:
(a) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Evaluating performance of distributed systems with MapReduce and network traffic analysis. In ICSEA 2012, The Seventh International Conference on Software Engineering Advances;
(b) Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. Measuring distributed applications through MapReduce and traffic analysis. In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on.
5.2.1 Lessons Learned
The contributions cited are of scientific and academic scope, with implementations and evaluations little explored in the literature. However, with the development of this work, some important lessons were learned.

During this research, different approaches for evaluating distributed systems of cloud computing providers were studied. In this period, we could see the importance of performance evaluation in a cloud computing environment, and the recent efforts to diagnose and evaluate systems in the production environment of a data center. Also, the growth of the Internet and of resource utilization makes it necessary to develop solutions able to evaluate large amounts of data in a short time, with low performance degradation of the evaluated system.
MapReduce has grown as a general purpose solution for big data processing, but it is not a solution for every kind of problem, and its performance depends on several parameters. Some research has been done to improve MapReduce performance through analytical modelling, simulation and measurement, but the most relevant contributions in this direction were guided by realistic workload evaluations from large MapReduce clusters.
We learned that, despite the facilities provided by MapReduce for distributed processing, its performance is influenced by the environment, network topology, workload, data type and several specific parameter configurations. Therefore, an evaluation of MapReduce behavior using data from a realistic environment will provide more accurate and broader results, while in controlled experiments the results are more restricted and limited to the evaluated metrics and factors.
5.3 Future Work
Because of the time constraints imposed on the master's degree, this dissertation addresses some problems, but others remain open or are emerging from the current results. Thus, the following issues should be investigated as future work:

Evaluation of all components of the proposed approach. This dissertation evaluated JNetPCAP-JXTA, the AppAnalyzer and its implementation to evaluate a JXTA-based distributed application; it is still necessary to evaluate the SnifferServer, the Manager and the whole system working together, analysing their impact on the measured distributed system and the scalability achieved;

Development of a technique for the efficient evaluation of distributed systems through information extracted from network traffic. This dissertation addressed the problem of processing capacity for measuring distributed systems through network traffic analysis, but an efficient approach is still needed to diagnose problems of distributed systems, using information on flows, connections, throughput and response time obtained from network traffic analysis;

Development of an analytic model and simulations, using the information on MapReduce behavior for network traffic analysis measured by this dissertation, to reproduce its characteristics and enable the evaluation and prediction of some cases of MapReduce for network traffic analysis.
Bibliography
Aguilera, M. K., Mogul, J. C., Wiener, J. L., Reynolds, P., and Muthitacharoen, A. (2003).
Performance debugging for distributed systems of black boxes. SIGOPS Oper. Syst.
Rev., 37(5).
Akram, S., Marazakis, M., and Bilas, A. (2012). Understanding scalability and performance requirements of I/O-intensive applications on future multicore servers. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Antonello, R., Fernandes, S., Kamienski, C., Sadok, D., Kelner, J., Gódor, I., Szabó, G., and Westholm, T. (2012). Deep packet inspection tools and techniques in commodity platforms: Challenges and trends. Journal of Network and Computer Applications.
Antoniu, G., Hatcher, P., Jan, M., and Noblet, D. (2005). Performance evaluation of JXTA communication layers. In Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE International Symposium on, volume 1, pages 251–258.
Antoniu, G., Cudennec, L., Jan, M., and Duigou, M. (2007). Performance scalability of the JXTA P2P framework. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, pages 1–10.
Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., and Zaharia, M. (2010). A view of cloud computing. Commun. ACM, 53, 50–58.
Basili, V. R., Caldiera, G., and Rombach, H. D. (1994). The goal question metric approach. In Encyclopedia of Software Engineering. Wiley.
Bhatotia, P., Wieder, A., Akkus, I. E., Rodrigues, R., and Acar, U. A. (2011). Large-scale incremental data processing with change propagation. In Proceedings of the 3rd USENIX conference on Hot topics in cloud computing, HotCloud '11, pages 18–18, Berkeley, CA, USA. USENIX Association.
Callado, A., Kamienski, C., Szabo, G., Gero, B., Kelner, J., Fernandes, S., and Sadok, D. (2009). A survey on internet traffic identification. Communications Surveys & Tutorials, IEEE, 11(3), 37–52.
Chen, Y., Ganapathi, A., Griffith, R., and Katz, R. (2011). The case for evaluating MapReduce performance using workload suites. In Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International Symposium on.
Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. (2010). MapReduce online. In Proceedings of the 7th USENIX conference on Networked systems design and implementation, pages 21–21.
Cox, L. P., Murray, C. D., and Noble, B. D. (2002). Pastiche: making backup cheap and easy. SIGOPS Oper. Syst. Rev., 36, 285–298.
Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Commun. ACM, 51, 107–113.
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., and Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. SIGOPS Oper. Syst. Rev., 41, 205–220.
Duigou, M. (2003). JXTA v2.0 protocols specification. Technical report, IETF Internet Draft.
Eddelbuettel, D. (2012). R in action. Journal of Statistical Software, Book Reviews, 46(2), 1–2.
Fernandes, S., Antonello, R., Lacerda, T., Santos, A., Sadok, D., and Westholm, T. (2009).
Slimming down deep packet inspection systems. In INFOCOM Workshops 2009,
IEEE.
Fonseca, A., Silva, M., Soares, P., Soares-Neto, F., Garcia, V., and Assad, R. (2012). Uma proposta arquitetural para serviços escaláveis de dados em nuvens. In Proceedings of the VIII Workshop de Redes Dinâmicas e Sistemas P2P.
Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., and Stoica, I. (2009). Above the clouds: A Berkeley view of cloud computing. Dept. Electrical Eng. and Comput. Sciences, University of California, Berkeley, Rep. UCB/EECS, 28.
Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. SIGOPS Oper. Syst. Rev.
Jiang, D., Ooi, B. C., Shi, L., and Wu, S. (2010). The performance of MapReduce: an
in-depth study. Proc. VLDB Endow.
Kambatla, K., Pathak, A., and Pucha, H. (2009). Towards optimizing hadoop provisioning
in the cloud. In Proc. of the First Workshop on Hot Topics in Cloud Computing.
Kandula, S., Sengupta, S., Greenberg, A., Patel, P., and Chaiken, R. (2009). The nature of data center traffic: measurements & analysis. In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, IMC '09, pages 202–208, New York, NY, USA. ACM.
Kavulya, S., Tan, J., Gandhi, R., and Narasimhan, P. (2010). An analysis of traces from
a production MapReduce cluster. In Cluster, Cloud and Grid Computing (CCGrid),
2010 10th IEEE/ACM International Conference on.
Lämmel, R. (2007). Google's MapReduce programming model - revisited. Sci. Comput. Program., 68(3), 208–237.
Lee, G. (2012). Resource Allocation and Scheduling in Heterogeneous Cloud Environments. Ph.D. thesis, University of California, Berkeley.
Lee, K.-H., Lee, Y.-J., Choi, H., Chung, Y. D., and Moon, B. (2012). Parallel data processing with MapReduce: a survey. SIGMOD Rec.
Lee, Y., Kang, W., and Son, H. (2010). An internet traffic analysis method with MapReduce. In Network Operations and Management Symposium Workshops (NOMS Wksps), 2010 IEEE/IFIP, pages 357–361.
Lee, Y., Kang, W., and Lee, Y. (2011). A hadoop-based packet trace processing tool. In Proceedings of the Third international conference on Traffic monitoring and analysis, TMA '11.
Lin, H., Ma, X., Archuleta, J., Feng, W., Gardner, M., and Zhang, Z. (2010). MOON: MapReduce on opportunistic environments. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, pages 95–106. ACM.
Lin, J. (2012). MapReduce is good enough? If all you have is a hammer, throw away everything that's not a nail! Big Data.
Loiseau, P., Goncalves, P., Guillier, R., Imbert, M., Kodama, Y., and Primet, P.-B. (2009). Metroflux: A high performance system for analysing flow at very fine-grain. In Testbeds and Research Infrastructures for the Development of Networks & Communities and Workshops, 2009. TridentCom 2009. 5th International Conference on, pages 1–9.
Lu, P., Lee, Y. C., Wang, C., Zhou, B. B., Chen, J., and Zomaya, A. Y. (2012). Workload characteristic oriented scheduler for MapReduce. In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pages 156–163.
Massie, M. L., Chun, B. N., and Culler, D. E. (2004). The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7), 817–840.
Mi, H., Wang, H., Yin, G., Cai, H., Zhou, Q., and Sun, T. (2012). Performance problems
diagnosis in cloud computing systems by mining request trace logs. In Network
Operations and Management Symposium (NOMS), 2012 IEEE.
Nagaraj, K., Killian, C., and Neville, J. (2012). Structured comparative analysis of systems logs to diagnose performance problems. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, NSDI '12.
Oliner, A., Ganapathi, A., and Xu, W. (2012). Advances and challenges in log analysis. Commun. ACM, 55(2), 55–61.
Paul, D. (2010). JXTA-Sim2: A Simulator for the core JXTA protocols. Master's thesis, University of Dublin, Ireland.
Pereira, R., Azambuja, M., Breitman, K., and Endler, M. (2010). An architecture for
distributed high performance video processing in the cloud. In Cloud Computing
(CLOUD), 2010 IEEE 3rd International Conference on.
Piyachon, P. and Luo, Y. (2006). Efficient memory utilization on network processors for deep packet inspection. In Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pages 71–80. ACM.
Risso, F., Baldi, M., Morandi, O., Baldini, A., and Monclus, P. (2008). Lightweight, payload-based traffic classification: An experimental evaluation. In Communications, 2008. ICC '08. IEEE International Conference on, pages 5869–5875. IEEE.
Rumen (2012). Rumen, a tool to extract job characterization data from job tracker logs. http://hadoop.apache.org/docs/MapReduce/r0.22.0/rumen.html. [Accessed December 2012].
Sambasivan, R. R., Zheng, A. X., De Rosa, M., Krevat, E., Whitman, S., Stroucken, M., Wang, W., Xu, L., and Ganger, G. R. (2011). Diagnosing performance changes by comparing request flows. In Proceedings of the 8th USENIX conference on Networked systems design and implementation, NSDI '11.
Shafer, J., Rixner, S., and Cox, A. (2010). The Hadoop distributed filesystem: Balancing portability and performance. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on.
Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P., Plakal, M., Beaver, D.,
Jaspan, S., and Shanbhag, C. (2010). Dapper, a large-scale distributed systems tracing
infrastructure. Technical report, Google, Inc.
Tan, J., Meng, X., and Zhang, L. (2012). Coupling scheduler for MapReduce/hadoop. In
Proceedings of the 21st international symposium on High-Performance Parallel and
Distributed Computing, HPDC '12.
Verma, A., Cherkasova, L., Kumar, V., and Campbell, R. (2012a). Deadline-based
workload management for MapReduce environments: Pieces of the performance
puzzle. In Network Operations and Management Symposium (NOMS), 2012 IEEE.
Verma, A., Cherkasova, L., and Campbell, R. (2012b). Two sides of a coin: Optimizing
the schedule of MapReduce jobs to minimize their makespan and improve cluster
performance. In Modeling, Analysis & Simulation of Computer and Telecommunication
Systems (MASCOTS), 2012 IEEE 20th International Symposium on.
Vieira, T. (2012a). hadoop-dpi. http://github.com/tpbvieira/hadoop-dpi.
Vieira, T. (2012b). jnetpcap-jxta. http://github.com/tpbvieira/jnetpcap-jxta.
Vieira, T. (2013). hadoop-analyzer. http://github.com/tpbvieira/hadoop-analyzer.
Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012a). Evaluating
performance of distributed systems with MapReduce and network trafc analysis. In
ICSEA 2012, The Seventh International Conference on Software Engineering Advances.
Xpert Publishing Services.
Vieira, T., Soares, P., Machado, M., Assad, R., and Garcia, V. (2012b). Measuring
distributed applications through MapReduce and traffic analysis. In Parallel and
Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pages
704–705.
Wang, G., Butt, A., Pandey, P., and Gupta, K. (2009). A simulation approach to evaluating
design decisions in MapReduce setups. In Modeling, Analysis & Simulation of Computer
and Telecommunication Systems, 2009. MASCOTS '09. IEEE International Symposium
on.
Yu, M., Greenberg, A., Maltz, D., Rexford, J., Yuan, L., Kandula, S., and Kim, C. (2011).
Profiling network performance for multi-tier data center applications. In Proceedings
of the 8th USENIX conference on Networked Systems Design and Implementation,
NSDI'11.
Yuan, D., Zheng, J., Park, S., Zhou, Y., and Savage, S. (2011). Improving software
diagnosability via log enhancement. In ACM SIGARCH Computer Architecture News,
volume 39, pages 3–14. ACM.
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and Stoica, I. (2008). Improving
MapReduce performance in heterogeneous environments. In Proceedings of the 8th
USENIX conference on Operating Systems Design and Implementation, OSDI'08.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., and Stoica, I.
(2010). Delay scheduling: a simple technique for achieving locality and fairness
in cluster scheduling. In Proceedings of the 5th European conference on Computer
systems, EuroSys '10.
Zhang, S., Han, J., Liu, Z., Wang, K., and Feng, S. (2009). Accelerating MapReduce
with distributed memory cache. In Parallel and Distributed Systems (ICPADS), 2009
15th International Conference on.
Zheng, Z., Yu, L., Lan, Z., and Jones, T. (2012). 3-dimensional root cause diagnosis
via co-analysis. In Proceedings of the 9th international conference on Autonomic
computing, pages 181–190. ACM.