Abstract—In recent years ad hoc parallel data processing has emerged as one of the killer applications for Infrastructure-as-a-
Service (IaaS) clouds. Major Cloud computing companies have started to integrate frameworks for parallel data processing in their
product portfolio, making it easy for customers to access these services and to deploy their programs. However, the processing
frameworks which are currently used have been designed for static, homogeneous cluster setups and disregard the particular nature of
a cloud. Consequently, the allocated compute resources may be inadequate for big parts of the submitted job and unnecessarily
increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in
clouds and present our research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic
resource allocation offered by today’s IaaS clouds for both task scheduling and execution. Particular tasks of a processing job can be
assigned to different types of virtual machines which are automatically instantiated and terminated during the job execution. Based on
this new framework, we perform extended evaluations of MapReduce-inspired processing jobs on an IaaS cloud system and compare
the results to the popular data processing framework Hadoop.
Index Terms—Many-task computing, high-throughput computing, loosely coupled applications, cloud computing.
1 INTRODUCTION
Most notably, Nephele is the first data processing framework to include the possibility of dynamically allocating/deallocating different compute resources from a cloud in its scheduling and during job execution.

This paper is an extended version of [27]. It includes further details on scheduling strategies and extended experimental results. The paper is structured as follows: Section 2 starts with analyzing the above-mentioned opportunities and challenges and derives some important design principles for our new framework. In Section 3, we present Nephele’s basic architecture and outline how jobs can be described and executed in the cloud. Section 4 provides some first figures on Nephele’s performance and the impact of the optimizations we propose. Finally, our work is concluded by related work (Section 5) and ideas for future work (Section 6).

2 CHALLENGES AND OPPORTUNITIES

Current data processing frameworks like Google’s MapReduce or Microsoft’s Dryad engine have been designed for cluster environments. This is reflected in a number of assumptions they make which are not necessarily valid in cloud environments. In this section, we discuss how abandoning these assumptions raises new opportunities but also challenges for efficient parallel data processing in clouds.

2.1 Opportunities

Today’s processing frameworks typically assume the resources they manage consist of a static set of homogeneous compute nodes. Although designed to deal with individual node failures, they consider the number of available machines to be constant, especially when scheduling the processing job’s execution. While IaaS clouds can certainly be used to create such cluster-like setups, much of their flexibility remains unused.

One of an IaaS cloud’s key features is the provisioning of compute resources on demand. New VMs can be allocated at any time through a well-defined interface and become available in a matter of seconds. Machines which are no longer used can be terminated instantly, and the cloud customer will no longer be charged for them. Moreover, cloud operators like Amazon let their customers rent VMs of different types, i.e., with different computational power, different sizes of main memory, and different amounts of storage. Hence, the compute resources available in a cloud are highly dynamic and possibly heterogeneous.

With respect to parallel data processing, this flexibility leads to a variety of new possibilities, particularly for scheduling data processing jobs. The question a scheduler has to answer is no longer “Given a set of compute resources, how to distribute the particular tasks of a job among them?”, but rather “Given a job, what compute resources match the tasks the job consists of best?”. This new paradigm allows allocating compute resources dynamically and just for the time they are required in the processing workflow. For example, a framework exploiting the possibilities of a cloud could start with a single VM which analyzes an incoming job and then advises the cloud to directly start the required VMs according to the job’s processing phases. After each phase, the machines could be released and would no longer contribute to the overall cost for the processing job.

Facilitating such use cases imposes some requirements on the design of a processing framework and the way its jobs are described. First, the scheduler of such a framework must become aware of the cloud environment a job should be executed in. It must know about the different types of available VMs as well as their cost and be able to allocate or destroy them on behalf of the cloud customer.

Second, the paradigm used to describe jobs must be powerful enough to express dependencies between the different tasks a job consists of. The system must be aware of which task’s output is required as another task’s input. Otherwise, the scheduler of the processing framework cannot decide at what point in time a particular VM is no longer needed and deallocate it. The MapReduce pattern is a good example of an unsuitable paradigm here: Although at the end of a job only a few reducer tasks may still be running, it is not possible to shut down the idle VMs, since it is unclear if they contain intermediate results which are still required.

Finally, the scheduler of such a processing framework must be able to determine which task of a job should be executed on which type of VM and, possibly, how many of those. This information could either be provided externally, e.g., as an annotation to the job description, or deduced internally, e.g., from collected statistics, similarly to the way database systems try to optimize their execution schedule over time [24].

2.2 Challenges

The cloud’s virtualized nature helps to enable promising new use cases for efficient parallel data processing. However, it also imposes new challenges compared to classic cluster setups. The major challenge we see is the cloud’s opaqueness with regard to exploiting data locality:

In a cluster the compute nodes are typically interconnected through a physical high-performance network. The topology of the network, i.e., the way the compute nodes are physically wired to each other, is usually well known and, what is more important, does not change over time. Current data processing frameworks offer to leverage this knowledge about the network hierarchy and attempt to schedule tasks on compute nodes so that data sent from one node to the other has to traverse as few network switches as possible [9]. That way network bottlenecks can be avoided and the overall throughput of the cluster can be improved.

In a cloud this topology information is typically not exposed to the customer [29]. Since the nodes involved in processing a data-intensive job often have to transfer tremendous amounts of data through the network, this drawback is particularly severe; parts of the network may become congested while others are essentially unutilized. Although there has been research on inferring likely network topologies solely from end-to-end measurements (e.g., [7]), it is unclear if these techniques are applicable to IaaS clouds. For security reasons clouds often incorporate network virtualization techniques (e.g., [8]) which can hamper the inference process, in particular when based on latency measurements.

Even if it was possible to determine the underlying network hierarchy in a cloud and use it for topology-aware
WARNEKE AND KAO: EXPLOITING DYNAMIC RESOURCE ALLOCATION FOR EFFICIENT PARALLEL DATA PROCESSING IN THE CLOUD 987
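To make the second requirement from Section 2 concrete, the deallocation decision can be sketched as a simple check over the job’s dependency graph. The task and VM names below are hypothetical, and the function is only an illustration of the idea, not Nephele’s implementation:

```python
def releasable_vms(consumes, finished, vm_of_task):
    """Return VMs that may be shut down without losing required data.

    consumes: maps each task to the set of tasks whose output it reads.
    finished: set of tasks that have completed.
    vm_of_task: maps each task to the VM its output is stored on.
    A VM must stay allocated while any unfinished task either runs on
    it or still needs intermediate results stored on it.
    """
    needed = set()
    for task, inputs in consumes.items():
        if task not in finished:
            needed.add(vm_of_task[task])                   # task itself still pending
            needed.update(vm_of_task[p] for p in inputs)   # its inputs must stay available
    return set(vm_of_task.values()) - needed

# Hypothetical job: two mappers feed a reducer, which feeds a writer.
consumes = {"map1": set(), "map2": set(),
            "reduce": {"map1", "map2"}, "write": {"reduce"}}
vm_of_task = {"map1": "vm_a", "map2": "vm_b", "reduce": "vm_c", "write": "vm_c"}
print(sorted(releasable_vms(consumes, {"map1", "map2", "reduce"}, vm_of_task)))  # ['vm_a', 'vm_b']
```

With only the mappers finished, the same check returns no releasable VMs, which mirrors the MapReduce problem described in Section 2: idle machines may still hold intermediate results that a later task requires.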
Fig. 4. Nephele’s job viewer provides graphical feedback on instance utilization and detected bottlenecks.
• File channels. A file channel allows two subtasks to exchange records via the local file system. The records of the producing task are first entirely written to an intermediate file and afterward read into the consuming subtask. Nephele requires two such subtasks to be assigned to the same instance. Moreover, the consuming Group Vertex must be scheduled to run in a higher Execution Stage than the producing Group Vertex. In general, Nephele only allows subtasks to exchange records across different stages via file channels because they are the only channel types which store the intermediate records in a persistent manner.

3.4 Parallelization and Scheduling Strategies

As mentioned before, constructing an Execution Graph from a user’s submitted Job Graph may leave different degrees of freedom to Nephele. Using this freedom to construct the most efficient Execution Graph (in terms of processing time or monetary cost) is currently a major focus of our research. Discussing this subject in detail would go beyond the scope of this paper. However, we want to outline our basic approaches in this section.

Unless the user provides a job annotation which contains more specific instructions, we currently pursue a simple default strategy: Each vertex of the Job Graph is transformed into one Execution Vertex. The default channel types are network channels. Each Execution Vertex is by default assigned to its own Execution Instance unless the user’s annotations or other scheduling restrictions (e.g., the usage of in-memory channels) prohibit it. The default instance type to be used is the one with the lowest price per time unit available in the IaaS cloud.

One fundamental idea to refine the scheduling strategy for recurring jobs is to use feedback data. We developed a profiling subsystem for Nephele which can continuously monitor running tasks and the underlying instances. Based on the Java Management Extensions (JMX), the profiling subsystem is, among other things, capable of breaking down what percentage of its processing time a task thread actually spends processing user code and what percentage of time it has to wait for data. With the collected data Nephele is able to detect both computational as well as I/O bottlenecks. While computational bottlenecks suggest a higher degree of parallelization for the affected tasks, I/O bottlenecks provide hints to switch to faster channel types (like in-memory channels) and reconsider the instance assignment. Since Nephele calculates a cryptographic signature for each task, recurring tasks can be identified and the previously recorded feedback data can be exploited.

At the moment, we only use the profiling data to detect these bottlenecks and help the user to choose reasonable annotations for his job. Fig. 4 illustrates the graphical job viewer we have devised for that purpose. It provides immediate visual feedback about the current utilization of all tasks and cloud instances involved in the computation. A user can utilize this visual feedback to improve his job annotations for upcoming job executions. In more advanced versions of Nephele, we envision the system to automatically adapt to detected bottlenecks, either between consecutive executions of the same job or even during job execution at runtime.

While the allocation time of cloud instances is determined by the start times of the assigned subtasks, there are different possible strategies for instance deallocation. In order to reflect the fact that most cloud providers charge their customers for instance usage by the hour, we integrated the possibility to reuse instances. Nephele can keep track of the instances’ allocation times. An instance of a particular
type which has become obsolete in the current Execution Stage is not immediately deallocated if an instance of the same type is required in an upcoming Execution Stage. Instead, Nephele keeps the instance allocated until the end of its current lease period. If the next Execution Stage has begun before the end of that period, it is reassigned to an Execution Vertex of that stage; otherwise it is deallocated early enough not to cause any additional cost.

Besides the use of feedback data, we recently complemented our efforts to provide reasonable job annotations automatically by a higher-level programming model layered on top of Nephele. Rather than describing jobs as arbitrary DAGs, this higher-level programming model called PACTs [4] is centered around the concatenation of second-order functions, e.g., like the map and reduce functions from the well-known MapReduce programming model. Developers can write custom first-order functions and attach them to the desired second-order functions. The PACTs programming model is semantically richer than Nephele’s own programming abstraction. For example, it considers aspects like the input/output cardinalities of the first-order functions, which is helpful to deduce reasonable degrees of parallelization. More details can be found in [4].

4 EVALUATION

In this section, we want to present first performance results of Nephele and compare them to the data processing framework Hadoop. We have chosen Hadoop as our competitor because it is open source software and currently enjoys high popularity in the data processing community. We are aware that Hadoop has been designed to run on a very large number of nodes (i.e., several thousand nodes). However, according to our observations, the software is typically used with significantly fewer instances in current IaaS clouds. In fact, Amazon itself limits the number of available instances for their MapReduce service to 20 unless the respective customer passes an extended registration process [2].

The challenge for both frameworks consists of two abstract tasks: Given a set of random integer numbers, the first task is to determine the k smallest of those numbers. The second task subsequently is to calculate the average of these k smallest numbers. The job is a classic representative for a variety of data analysis jobs whose particular tasks vary in their complexity and hardware demands. While the first task has to sort the entire data set, and therefore, can take advantage of large amounts of main memory and parallel execution, the second aggregation task requires almost no main memory and, at least eventually, cannot be parallelized.

We implemented the described sort/aggregate task for three different experiments. For the first experiment, we implemented the task as a sequence of MapReduce programs and executed it using Hadoop on a fixed set of instances. For the second experiment, we reused the same MapReduce programs as in the first experiment but devised a special MapReduce wrapper to make these programs run on top of Nephele. The goal of this experiment was to illustrate the benefits of dynamic resource allocation/deallocation while still maintaining the MapReduce processing pattern. Finally, as the third experiment, we discarded the MapReduce pattern and implemented the task based on a DAG to also highlight the advantages of using heterogeneous instances.

For all three experiments, we chose the data set size to be 100 GB. Each integer number had the size of 100 bytes. As a result, the data set contained about 10^9 distinct integer numbers. The cut-off variable k has been set to 2 × 10^8, so the smallest 20 percent of all numbers had to be determined and aggregated.

4.1 General Hardware Setup

All three experiments were conducted on our local IaaS cloud of commodity servers. Each server is equipped with two Intel Xeon 2.66 GHz CPUs (8 CPU cores) and a total main memory of 32 GB. All servers are connected through regular 1 GBit/s Ethernet links. The host operating system was Gentoo Linux (kernel version 2.6.30) with KVM [15] (version 88-r1) using virtio [23] to provide virtual I/O access.

To manage the cloud and provision VMs on request of Nephele, we set up Eucalyptus [16]. Similar to Amazon EC2, Eucalyptus offers a predefined set of instance types a user can choose from. During our experiments, we used two different instance types: The first instance type was “m1.small”, which corresponds to an instance with one CPU core, one GB of RAM, and a 128 GB disk. The second instance type, “c1.xlarge”, represents an instance with 8 CPU cores, 18 GB of RAM, and a 512 GB disk. Amazon EC2 defines comparable instance types and offers them at a price of about 0.10 $ or 0.80 $ per hour (September 2009), respectively.

The images used to boot up the instances contained Ubuntu Linux (kernel version 2.6.28) with no additional software but a Java runtime environment (version 1.6.0_13), which is required by Nephele’s Task Manager.

The 100 GB input data set of random integer numbers has been generated according to the rules of the Jim Gray sort benchmark [18]. In order to make the data accessible to Hadoop, we started an HDFS [25] data node on each of the allocated instances prior to the processing job and distributed the data evenly among the nodes. Since this initial setup procedure was necessary for all three experiments (Hadoop and Nephele), we have chosen to ignore it in the following performance discussion.

4.2 Experiment 1: MapReduce and Hadoop

In order to execute the described sort/aggregate task with Hadoop, we created three different MapReduce programs which were executed consecutively.

The first MapReduce job reads the entire input data set, sorts the contained integer numbers ascendingly, and writes them back to Hadoop’s HDFS file system. Since the MapReduce engine is internally designed to sort the incoming data between the map and the reduce phase, we did not have to provide custom map and reduce functions here. Instead, we simply used the TeraSort code, which has recently been recognized for being well suited for these kinds of tasks [18]. The result of this first MapReduce job was a set of files containing sorted integer numbers. Concatenating these files yielded the fully sorted sequence of 10^9 numbers.

The second and third MapReduce jobs operated on the sorted data set and performed the data aggregation.
992 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 22, NO. 6, JUNE 2011
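For reference, the abstract task both frameworks implement (determine the k smallest numbers, then average them; see the beginning of Section 4) can be pinned down by a few lines of sequential code. This sketch is purely illustrative and ignores all distribution aspects:

```python
import heapq

def k_smallest_average(numbers, k):
    """Return the average of the k smallest values in `numbers`.

    This is the sequential specification of the benchmark job:
    any distributed implementation must produce the same value.
    """
    smallest = heapq.nsmallest(k, numbers)  # the "sort" part, restricted to k values
    return sum(smallest) / k                # the "aggregate" part

print(k_smallest_average([9, 4, 7, 1, 8, 2, 6], 4))  # (1 + 2 + 4 + 6) / 4 = 3.25
```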
Thereby, the second MapReduce job selected the first output files from the preceding sort job which, just by their file size, had to contain the smallest 2 × 10^8 numbers of the initial data set. The map function was fed with the selected files and emitted the first 2 × 10^8 numbers to the reducer. In order to enable parallelization in the reduce phase, we chose the intermediate keys for the reducer randomly from a predefined set of keys. These keys ensured that the emitted numbers were distributed evenly among the n reducers in the system. Each reducer then calculated the average of the received 2 × 10^8/n integer numbers. The third MapReduce job finally read the n intermediate average values and aggregated them to a single overall average.

Since Hadoop is not designed to deal with heterogeneous compute nodes, we allocated six instances of type “c1.xlarge” for the experiment. All of these instances were assigned to Hadoop throughout the entire duration of the experiment. We configured Hadoop to perform best for the first, computationally most expensive, MapReduce job: In accordance with [18] we set the number of map tasks per job to 48 (one map task per CPU core) and the number of reducers to 12. The memory heap of each map task as well as the in-memory file system have been increased to 1 GB and 512 MB, respectively, in order to avoid unnecessarily spilling transient data to disk.

4.3 Experiment 2: MapReduce and Nephele

For the second experiment, we reused the three MapReduce programs we had written for the previously described Hadoop experiment and executed them on top of Nephele. In order to do so, we had to develop a set of wrapper classes providing limited interface compatibility with Hadoop and sort/merge functionality. These wrapper classes allowed us to run the unmodified Hadoop MapReduce programs with Nephele. As a result, the data flow was controlled by the executed MapReduce programs while Nephele was able to govern the instance allocation/deallocation and the assignment of tasks to instances during the experiment. We devised this experiment to highlight the effects of the dynamic resource allocation/deallocation while still maintaining comparability to Hadoop as well as possible.

Fig. 5 illustrates the Execution Graph we instructed Nephele to create so that the communication paths match the MapReduce processing pattern. For brevity, we omit a discussion on the original Job Graph. Following our experiences with the Hadoop experiment, we pursued the overall idea to also start with a homogeneous set of six
“c1.xlarge” instances, but to reduce the number of allocated instances in the course of the experiment according to the previously observed workload. For this reason, the sort operation should be carried out on all six instances, while the first and second aggregation operation should only be assigned to two instances and to one instance, respectively.

The experiment’s Execution Graph consisted of three Execution Stages. Each stage contained the tasks required by the corresponding MapReduce program. As stages can only be crossed via file channels, all intermediate data occurring between two succeeding MapReduce jobs was completely written to disk, like in the previous Hadoop experiment.

The first stage (Stage 0) included four different tasks, split into several different groups of subtasks and assigned to different instances. The first task, BigIntegerReader, processed the assigned input files and emitted each integer number as a separate record. The tasks TeraSortMap and TeraSortReduce encapsulated the TeraSort MapReduce code which had been executed by Hadoop’s mapper and reducer threads before. In order to meet the setup of the previous experiment, we split the TeraSortMap and TeraSortReduce tasks into 48 and 12 subtasks, respectively, and assigned them to six instances of type “c1.xlarge”. Furthermore, we instructed Nephele to construct network channels between each pair of TeraSortMap and TeraSortReduce subtasks. For the sake of legibility, only a few of the resulting channels are depicted in Fig. 5.

Although Nephele only maintains at most one physical TCP connection between two cloud instances, we devised the following optimization: If two subtasks that are connected by a network channel are scheduled to run on the same instance, their network channel is dynamically converted into an in-memory channel. That way we were able to avoid unnecessary data serialization and the resulting processing overhead. For the given MapReduce communication pattern this optimization accounts for approximately 20 percent fewer network channels.

The records emitted by the BigIntegerReader subtasks were received by the TeraSortMap subtasks. The TeraSort partitioning function, located in the TeraSortMap task, then determined which TeraSortReduce subtask was responsible for the received record, depending on its value. Before being sent to the respective reducer, the records were collected in buffers of approximately 44 MB size and were presorted in memory. Considering that each TeraSortMap subtask was connected to 12 TeraSortReduce subtasks, this added up to a buffer size of 574 MB, similar to the size of the in-memory file system we had used for the Hadoop experiment previously.

Each TeraSortReduce subtask had an in-memory buffer of about 512 MB size, too. The buffer was used to mitigate hard drive access when storing the incoming sets of presorted records from the mappers. Like Hadoop, we started merging the first received presorted record sets using separate background threads while the data transfer from the mapper tasks was still in progress. To improve performance, the final merge step, resulting in one fully sorted set of records, was directly streamed to the next task in the processing chain.

The task DummyTask, the fourth task in the first Execution Stage, simply emitted every record it received. It was used to direct the output of a preceding task to a particular subset of allocated instances. Following the overall idea of this experiment, we used the DummyTask task in the first stage to transmit the sorted output of the 12 TeraSortReduce subtasks to the two instances Nephele would continue to work with in the second Execution Stage. For the DummyTask subtasks in the first stage (Stage 0), we shuffled the assignment of subtasks to instances in a way that both remaining instances received a fairly even fraction of the 2 × 10^8 smallest numbers. Without the shuffle, the 2 × 10^8 smallest numbers would all be stored on only one of the remaining instances with high probability.

In the second and third stage (Stage 1 and 2 in Fig. 5), we ran the two aggregation steps corresponding to the second and third MapReduce program in the previous Hadoop experiment. AggregateMap and AggregateReduce encapsulated the respective Hadoop code.

The first aggregation step was distributed across 12 AggregateMap and four AggregateReduce subtasks, assigned to the two remaining “c1.xlarge” instances. To determine how many records each AggregateMap subtask had to process, so that in total only the 2 × 10^8 numbers would be emitted to the reducers, we had to develop a small utility program. This utility program consisted of two components. The first component ran in the DummyTask subtasks of the preceding Stage 0. It wrote the number of records each DummyTask subtask had eventually emitted to a network file system share which was accessible to every instance. The second component, integrated in the AggregateMap subtasks, read those numbers and calculated what fraction of the sorted data was assigned to the respective mapper. In the previous Hadoop experiment this auxiliary program was unnecessary since Hadoop wrote the output of each MapReduce job back to HDFS anyway.

After the first aggregation step, we again used the DummyTask task to transmit the intermediate results to the last instance, which executed the final aggregation in the third stage. The final aggregation was carried out by four AggregateMap subtasks and one AggregateReduce subtask. Eventually, we used one subtask of BigIntegerWriter to write the final result record back to HDFS.

4.4 Experiment 3: DAG and Nephele

In this third experiment, we were no longer bound to the MapReduce processing pattern. Instead, we implemented the sort/aggregation problem as a DAG and tried to exploit Nephele’s ability to manage heterogeneous compute resources.

Fig. 6 illustrates the Execution Graph we instructed Nephele to create for this experiment. For brevity, we again leave out a discussion on the original Job Graph. Similar to the previous experiment, we pursued the idea that several powerful but expensive instances are used to determine the 2 × 10^8 smallest integer numbers in parallel, while, after that, a single inexpensive instance is utilized for the final aggregation. The graph contained five distinct tasks, again split into different groups of subtasks. However, in contrast to the previous experiment, this one also involved instances of different types.
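The central operation of this experiment, merging several sorted input streams and stopping once the 2 × 10^8 smallest records have been emitted, corresponds to a standard heap-based k-way merge. The following sketch is illustrative only (it is not Nephele’s task code) and uses a small stop limit to show the idea:

```python
import heapq

def merge_smallest(sorted_streams, limit):
    """Lazily merge sorted streams, yielding at most `limit` smallest records.

    Mirrors the merge-then-stop behavior of the experiment: once the limit
    is reached, no further input is requested, so upstream producers can be
    stopped and their instances deallocated.
    """
    merged = heapq.merge(*sorted_streams)  # k-way merge of pre-sorted inputs
    for emitted, record in enumerate(merged):
        if emitted >= limit:
            break  # stop command: remaining records are never requested
        yield record

streams = [[1, 4, 9], [2, 3, 10], [5, 6, 7]]
print(list(merge_smallest(streams, 5)))  # [1, 2, 3, 4, 5]
```

Because `heapq.merge` is lazy, the sketch also reflects the streaming pipelining property exploited in the experiment: records flow through the merge as soon as they arrive, without materializing the full input first.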
In order to feed the initial data from HDFS into Nephele, we reused the BigIntegerReader task. The records emitted by the BigIntegerReader subtasks were received by the second task, BigIntegerSorter, which attempted to buffer all incoming records into main memory. Once it had received all designated records, it performed an in-memory quick sort and subsequently continued to emit the records in an order-preserving manner. Since the BigIntegerSorter task requires large amounts of main memory, we split it into 146 subtasks and assigned these evenly to six instances of type “c1.xlarge”. The preceding BigIntegerReader task was also split into 146 subtasks and set up to emit records via in-memory channels.

The third task, BigIntegerMerger, received records from multiple input channels. Once it has read a record from all available input channels, it sorts the records locally and always emits the smallest number. The BigIntegerMerger task occurred three times in a row in the Execution Graph. The first time, it was split into six subtasks, one subtask assigned to each of the six “c1.xlarge” instances. As described in Section 2, this is currently the only way to ensure data locality between the sort and merge tasks. The second time the BigIntegerMerger task occurred in the Execution Graph, it was split into two subtasks. These two subtasks were assigned to two of the previously used “c1.xlarge” instances. The third occurrence of the task was assigned to a new instance of the type “m1.small”.

Since we abandoned the MapReduce processing pattern, we were able to better exploit Nephele’s streaming pipelining characteristics in this experiment. Consequently, each of the merge subtasks was configured to stop execution after having emitted 2 × 10^8 records. The stop command was propagated to all preceding subtasks of the processing chain, which allowed the Execution Stage to be interrupted as soon as the final merge subtask had emitted the 2 × 10^8 smallest records.

The fourth task, BigIntegerAggregater, read the incoming records from its input channels and summed them up. It was also assigned to the single “m1.small” instance. Since we no longer required the six “c1.xlarge” instances to run once the final merge subtask had determined the 2 × 10^8 smallest numbers, we changed the communication channel between the final BigIntegerMerger and BigIntegerAggregater subtask to a file channel. That way Nephele pushed the aggregation into the next Execution Stage and was able to deallocate the expensive instances.

Finally, the fifth task, BigIntegerWriter, eventually received the calculated average of the 2 × 10^8 integer numbers and wrote the value back to HDFS.

4.5 Results

Figs. 7, 8, and 9 show the performance results of our three experiments, respectively. All three plots illustrate the average instance utilization over time, i.e., the average utilization of all CPU cores in all instances allocated for the job at the given point in time. The utilization of each instance has been monitored with the Unix command “top” and is broken down into the amount of time the CPU cores spent running the respective data processing framework (USR), the kernel and its processes (SYS), and the time waiting for I/O to complete (WAIT). In order to illustrate the impact of network communication, the plots additionally show the average amount of IP traffic flowing between the instances over time.

We begin with discussing Experiment 1 (MapReduce and Hadoop): For the first MapReduce job, TeraSort, Fig. 7
assign specific instance types to specific kinds of tasks becomes apparent: Since the entire sorting can be done in main memory, it only takes several seconds. Three minutes later (c), the first BigIntegerMerger subtasks start to receive the presorted records and transmit them along the processing chain.

Until the end of the sort phase, Nephele can fully exploit the power of the six allocated “c1.xlarge” instances. After that period the computational power is no longer needed for the merge phase. From a cost perspective it is now desirable to deallocate the expensive instances as soon as possible. However, since they hold the presorted data sets, at least 20 GB of records must be transferred to the inexpensive “m1.small” instance first. Here, we identified the network to be the bottleneck, so much computational power remains unused during that transfer phase (from (c) to (d)). In general, this transfer penalty must be carefully considered when switching between different instance types during job execution. For the future we plan to integrate compression for file and network channels as a means to trade CPU load against I/O load. Thereby, we hope to mitigate this drawback.

At point (d), the final BigIntegerMerger subtask has emitted the 2 × 10^8 smallest integer records to the file channel

always consists of a distinct map and reduce program. However, several systems have been introduced to coordinate the execution of a sequence of MapReduce jobs [19], [17]. MapReduce has been clearly designed for large static clusters. Although it can deal with sporadic node failures, the available compute resources are essentially considered to be a fixed set of homogeneous machines.

The Pegasus framework by Deelman et al. [10] has been designed for mapping complex scientific workflows onto grid systems. Similar to Nephele, Pegasus lets its users describe their jobs as a DAG with vertices representing the tasks to be processed and edges representing the dependencies between them. The created workflows remain abstract until Pegasus creates the mapping between the given tasks and the concrete compute resources available at runtime. The authors incorporate interesting aspects like the scheduling horizon which determines at what point in time a task of the overall processing job should apply for a compute resource. This is related to the stage concept in Nephele. However, Nephele’s stage concept is designed to minimize the number of allocated instances in the cloud and clearly focuses on reducing cost. In contrast, Pegasus’ scheduling horizon is used to deal with unexpected changes in the execution environment. Pegasus uses DAGMan and Condor-G [13] as its execution engine. As a result, different
and advises all preceding subtasks in the processing chain to task can only exchange data via files.
stop execution. All subtasks of the first stage have now been Thao et al. introduced the Swift [30] system to reduce the
successfully completed. As a result, Nephele automatically management issues which occur when a job involving
deallocates the six instances of type “c1.xlarge” and numerous tasks has to be executed on a large, possibly
continues the next Execution Stage with only one instance unstructured, set of data. Building upon components like
of type “m1.small” left. In that stage, the BigIntegerAggre- CoG Karajan [26], Falkon [21], and Globus [12], the authors
gater subtask reads the 2 108 smallest integer records from present a scripting language which allows to create
the file channel and calculates their average. Again, since the mappings between logical and physical data structures
six expensive “c1.xlarge” instances no longer contribute to and to conveniently assign tasks to these.
the number of available CPU cores in that period, the The system our approach probably shares most simila-
processing power allocated from the cloud again fits the task rities with is Dryad [14]. Dryad also runs DAG-based jobs
to be completed. At point (e), after 33 minutes, Nephele has and offers to connect the involved tasks through either file,
finished the entire processing job. network, or in-memory channels. However, it assumes an
Considering the short processing times of the presented execution environment which consists of a fixed set of
tasks and the fact that most cloud providers offer to lease an homogeneous worker nodes. The Dryad scheduler is
instance for at least one hour, we are aware that Nephele’s designed to distribute tasks across the available compute
savings in time and cost might appear marginal at first nodes in a way that optimizes the throughput of the overall
glance. However, we want to point out that these savings cluster. It does not include the notion of processing cost for
grow by the size of the input data set. Due to the size of our particular jobs.
test cloud we were forced to restrict data set size to 100 GB. In terms of on-demand resource provising several projects
For larger data sets, more complex processing jobs become arose recently: Dörnemann et al. [11] presented an approach
feasible, which also promises more significant savings. to handle peak-load situations in BPEL workflows using
Amazon EC2. Ramakrishnan et al. [22] discussed how to
provide a uniform resource abstraction over grid and cloud
5 RELATED WORK resources for scientific workflows. Both projects rather aim at
In recent years a variety of systems to facilitate MTC has batch-driven workflows than the data intensive, pipelined
been developed. Although these systems typically share workflows Nephele focuses on. The FOS project [28] recently
common goals (e.g., to hide issues of parallelism or fault presented an operating system for multicore and clouds
tolerance), they aim at different fields of application. which is also capable of on-demand VM allocation.
MapReduce [9] (or the open source version Hadoop [25])
is designed to run data analysis jobs on a large amount of
data, which is expected to be stored across a large set of 6 CONCLUSION
share-nothing commodity servers. MapReduce is high- In this paper, we have discussed the challenges and
lighted by its simplicity: Once a user has fit his program opportunities for efficient parallel data processing in cloud
into the required map and reduce pattern, the execution environments and presented Nephele, the first data
framework takes care of splitting the job into subtasks, processing framework to exploit the dynamic resource
distributing and executing them. A single MapReduce job provisioning offered by today’s IaaS clouds. We have
WARNEKE AND KAO: EXPLOITING DYNAMIC RESOURCE ALLOCATION FOR EFFICIENT PARALLEL DATA PROCESSING IN THE CLOUD 997
described Nephele’s basic architecture and presented a [15] A. Kivity, “Kvm: The Linux Virtual Machine Monitor,” Proc.
Ottawa Linux Symp. (OLS ’07), pp. 225-230, July 2007.
performance comparison to the well-established data [16] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L.
processing framework Hadoop. The performance evalua- Youseff, and D. Zagorodnov, “Eucalyptus: A Technical Report on
tion gives a first impression on how the ability to assign an Elastic Utility Computing Architecture Linking Your Programs
specific virtual machine types to specific tasks of a to Useful Systems,” technical report, Univ. of California, Santa
Barbara, 2008.
processing job, as well as the possibility to automatically [17] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig
allocate/deallocate virtual machines in the course of a job Latin: A Not-So-Foreign Language for Data Processing,” Proc.
execution, can help to improve the overall resource ACM SIGMOD Int’l Conf. Management of Data, pp. 1099-1110, 2008.
[18] O. O’Malley and A.C. Murthy, “Winning a 60 Second Dash with a
utilization and, consequently, reduce the processing cost. Yellow Elephant,” technical report, Yahoo!, 2009.
With a framework like Nephele at hand, there are a [19] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting
variety of open research issues, which we plan to address the Data: Parallel Analysis with Sawzall,” Scientific Programming,
for future work. In particular, we are interested in vol. 13, no. 4, pp. 277-298, 2005.
[20] I. Raicu, I. Foster, and Y. Zhao, “Many-Task Computing for Grids
improving Nephele’s ability to adapt to resource overload
and Supercomputers,” Proc. Workshop Many-Task Computing on
or underutilization during the job execution automatically. Grids and Supercomputers, pp. 1-11, Nov. 2008.
Our current profiling approach builds a valuable basis for [21] I. Raicu, Y. Zhao, C. Dumitrescu, I. Foster, and M. Wilde, “Falkon:
this, however, at the moment the system still requires a A Fast and Light-Weight TasK ExecutiON Framework,” Proc.
reasonable amount of user annotations. ACM/IEEE Conf. Supercomputing (SC ’07), pp. 1-12, 2007.
[22] L. Ramakrishnan, C. Koelbel, Y.-S. Kee, R. Wolski, D. Nurmi, D.
In general, we think our work represents an important Gannon, G. Obertelli, A. YarKhan, A. Mandal, T.M. Huang, K.
contribution to the growing field of Cloud computing Thyagaraja, and D. Zagorodnov, “VGrADS: Enabling e-Science
services and points out exciting new opportunities in the Workflows on Grids and Clouds with Fault Tolerance,” Proc. Conf.
field of parallel data processing. High Performance Computing Networking, Storage and Analysis (SC
’09), pp. 1-12, 2009.
[23] R. Russell, “Virtio: Towards a De-Facto Standard for Virtual I/O
Devices,” SIGOPS Operating System Rev., vol. 42, no. 5, pp. 95-103,
REFERENCES 2008.
[1] Amazon Web Services LLC, “Amazon Elastic Compute Cloud [24] M. Stillger, G.M. Lohman, V. Markl, and M. Kandil, “LEO—DB2’s
(Amazon EC2),” http://aws.amazon.com/ec2/, 2009. LEarning Optimizer,” Proc. 27th Int’l Conf. Very Large Data Bases
[2] Amazon Web Services LLC, “Amazon Elastic MapReduce,” (VLDB ’01), pp. 19-28, 2001.
http://aws.amazon.com/elasticmapreduce/, 2009. [25] The Apache Software Foundation “Welcome to Hadoop!” http://
[3] Amazon Web Services LLC, “Amazon Simple Storage Service,” hadoop.apache.org/, 2009.
http://aws.amazon.com/s3/, 2009. [26] G. von Laszewski, M. Hategan, and D. Kodeboyina, Workflows for
[4] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke, e-Science Scientific Workflows for Grids. Springer, 2007.
“Nephele/PACTs: A Programming Model and Execution Frame- [27] D. Warneke and O. Kao, “Nephele: Efficient Parallel Data
work for Web-Scale Analytical Processing,” Proc. ACM Symp. Processing in the Cloud,” Proc. Second Workshop Many-Task
Cloud Computing (SoCC ’10), pp. 119-130, 2010. Computing on Grids and Supercomputers (MTAGS ’09), pp. 1-10,
[5] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. 2009.
Weaver, and J. Zhou, “SCOPE: Easy and Efficient Parallel [28] D. Wentzlaff, C. Gruenwald III, N. Beckmann, K. Modzelewski, A.
Processing of Massive Data Sets,” Proc. Very Large Database Belay, L. Youseff, J. Miller, and A. Agarwal, “An Operating
Endowment, vol. 1, no. 2, pp. 1265-1276, 2008. System for Multicore and Clouds: Mechanisms and Implementa-
[6] H. chih Yang, A. Dasdan, R.-L. Hsiao, and D.S. Parker, “Map- tion,” Proc. ACM Symp. Cloud Computing (SoCC ’10), pp. 3-14, 2010.
Reduce-Merge: Simplified Relational Data Processing on Large [29] T. White, Hadoop: The Definitive Guide. O’Reilly Media, 2009.
Clusters,” Proc. ACM SIGMOD Int’l Conf. Management of Data, [30] Y. Zhao, M. Hategan, B. Clifford, I. Foster, G. von Laszewski, V.
2007. Nefedova, I. Raicu, T. Stef-Praun, and M. Wilde, “Swift: Fast,
[7] M. Coates, R. Castro, R. Nowak, M. Gadhiok, R. King, and Y. Reliable, Loosely Coupled Parallel Computation,” Proc. Services,
Tsang, “Maximum Likelihood Network Topology Identification ’07 IEEE Congress on, pp. 199-206, July 2007.
from Edge-Based Unicast Measurements,” SIGMETRICS Perfor-
mance Evaluation Rev., vol. 30, no. 1, pp. 11-20, 2002. Daniel Warneke received the Diploma and BS
[8] R. Davoli, “VDE: Virtual Distributed Ethernet,” Proc. Testbeds and degrees in computer science from the University
Research Infrastructures for the Development of Networks and of Paderborn, in 2008 and 2006, respectively. He
Communities, Int’l Conf., pp. 213-220, 2005. is a research assistant at the Berlin University of
[9] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Proces- Technology, Germany. Currently, he is working in
sing on Large Clusters,” Proc. Sixth Conf. Symp. Opearting Systems the DFG-funded research project Stratosphere.
Design and Implementation (OSDI ’04), p. 10, 2004. His research interests centers around massively-
[10] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. parallel, fault-tolerant data processing frame-
Mehta, K. Vahi, G.B. Berriman, J. Good, A. Laity, J.C. Jacob, and works on Infrastructure-as-a-Service platforms.
D.S. Katz, “Pegasus: A Framework for Mapping Complex He is a member of the IEEE.
Scientific Workflows onto Distributed Systems,” Scientific Pro-
gramming, vol. 13, no. 3, pp. 219-237, 2005.
[11] T. Dornemann, E. Juhnke, and B. Freisleben, “On-Demand
Resource Provisioning for BPEL Workflows Using Amazon’s Odej Kao is a full professor at the Berlin
Elastic Compute Cloud,” Proc. Ninth IEEE/ACM Int’l Symp. Cluster University of Technology and director of the IT
Computing and the Grid (CCGRID ’09), pp. 140-147, 2009. center tubIT. He received the PhD degree and
[12] I. Foster and C. Kesselman, “Globus: A Metacomputing Infra- his habilitation from the Clausthal University of
structure Toolkit,” Int’l J. Supercomputer Applications, vol. 11, no. 2, Technology. Thereafter, he moved to the Uni-
pp. 115-128, 1997. versity of Paderborn as an associated professor
[13] J. Frey, T. Tannenbaum, M. Livny, I. Foster, and S. Tuecke, for operating and distributed systems. His
“Condor-G: A Computation Management Agent for Multi- research areas include Grid Computing, service
Institutional Grids,” Cluster Computing, vol. 5, no. 3, pp. 237-246, level agreements, and operation of complex IT
2002. systems. He is a member of many program
[14] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: committees and has published more than 190 papers.
Distributed Data-Parallel Programs from Sequential Building
Blocks,” Proc. Second ACM SIGOPS/EuroSys European Conf.
Computer Systems (EuroSys ’07), pp. 59-72, 2007.