
contributed articles

DOI:10.1145/2934664

Apache Spark: A Unified Engine for Big Data Processing

This open source computing framework unifies streaming, batch, and interactive big data workloads to unlock new applications.

BY MATEI ZAHARIA, REYNOLD S. XIN, PATRICK WENDELL, TATHAGATA DAS, MICHAEL ARMBRUST, ANKUR DAVE, XIANGRUI MENG, JOSH ROSEN, SHIVARAM VENKATARAMAN, MICHAEL J. FRANKLIN, ALI GHODSI, JOSEPH GONZALEZ, SCOTT SHENKER, AND ION STOICA

key insights
• A simple programming model can capture streaming, batch, and interactive workloads and enable new applications that combine them.
• Apache Spark applications range from finance to scientific data processing and combine libraries for SQL, machine learning, and graphs.
• In six years, Apache Spark has grown to 1,000 contributors and thousands of deployments.

THE GROWTH OF data volumes in industry and research poses tremendous opportunities, as well as tremendous computational challenges. As data sizes have outpaced the capabilities of single machines, users have needed new systems to scale out computations to multiple nodes. As a result, there has been an explosion of new cluster programming models targeting diverse computing workloads.1,4,7,10 At first, these models were relatively specialized, with new models developed for new workloads; for example, MapReduce4 supported batch processing, but Google also developed Dremel13 for interactive SQL queries and Pregel11 for iterative graph algorithms. In the open source Apache Hadoop stack, systems like Storm1 and Impala9 are also specialized. Even in the relational database world, the trend has been to move away from "one-size-fits-all" systems.18 Unfortunately, most big data applications need to combine many different processing types. The very nature of "big data" is that it is diverse and messy; a typical pipeline will need MapReduce-like code for data loading, SQL-like queries, and iterative machine learning. Specialized engines can thus create both complexity and inefficiency; users must stitch together disparate systems, and some applications simply cannot be expressed efficiently in any engine.

In 2009, our group at the University of California, Berkeley, started the Apache Spark project to design a unified engine for distributed data processing. Spark has a programming model similar to MapReduce but extends it with a data-sharing abstraction called "Resilient Distributed Datasets," or RDDs.25 Using this simple extension, Spark can capture a wide range of processing workloads that previously needed separate engines, including SQL, streaming, machine learning, and graph processing2,6,26 (see Figure 1). These implementations use the same optimizations as specialized engines (such as column-oriented processing and incremental updates) and achieve similar performance but run as libraries over a common engine, making them easy and efficient to compose.



Analyses of brain activity in a larval zebrafish, performed using Spark: (left) matrix factorization to characterize functionally similar regions (as depicted by different colors) and (right) embedding dynamics of whole-brain activity into lower-dimensional trajectories. Source: Jeremy Freeman and Misha Ahrens, Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA.

Rather than being specific to these workloads, we claim this result is more general; when augmented with data sharing, MapReduce can emulate any distributed computation, so it should also be possible to run many other types of workloads.24

Spark's generality has several important benefits. First, applications are easier to develop because they use a unified API. Second, it is more efficient to combine processing tasks; whereas prior systems required writing the data to storage to pass it to another engine, Spark can run diverse functions over the same data, often in memory. Finally, Spark enables new applications (such as interactive queries on a graph and streaming machine learning) that were not possible with previous systems. One powerful analogy for the value of unification is to compare smartphones to the separate portable devices that existed before them (such as cameras, cellphones, and GPS gadgets). In unifying the functions of these devices, smartphones enabled new applications that combine their functions (such as video messaging and Waze) that would not have been possible on any one device.

Since its release in 2010, Spark has grown to be the most active open source project for big data processing, with more than 1,000 contributors. The project is in use in more than 1,000 organizations, ranging from technology companies to banking, retail, biotechnology, and astronomy.


The largest publicly announced deployment has more than 8,000 nodes.22 As Spark has grown, we have sought to keep building on its strength as a unified engine. We (and others) have continued to build an integrated standard library over Spark, with functions from data import to machine learning. Users find this ability powerful; in surveys, we find the majority of users combine multiple of Spark's libraries in their applications.

As parallel data processing becomes common, the composability of processing functions will be one of the most important concerns for both usability and performance. Much of data analysis is exploratory, with users wishing to combine library functions quickly into a working pipeline. However, for "big data" in particular, copying data between different systems is anathema to performance. Users thus need abstractions that are general and composable. In this article, we introduce the Spark programming model and explain why it is highly general. We also discuss how we leveraged this generality to build other processing tasks over it. Finally, we summarize Spark's most common applications and describe ongoing development work in the project.

Figure 1. Apache Spark software stack, with specialized processing libraries (Streaming, SQL, ML, Graph) implemented over the core engine.

Programming Model
The key programming abstraction in Spark is RDDs, which are fault-tolerant collections of objects partitioned across a cluster that can be manipulated in parallel. Users create RDDs by applying operations called "transformations" (such as map, filter, and groupBy) to their data.

Spark exposes RDDs through a functional programming API in Scala, Java, Python, and R, where users can simply pass local functions to run on the cluster. For example, the following Scala code creates an RDD representing the error messages in a log file, by searching for lines that start with ERROR, and then prints the total number of errors:

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
println("Total errors: " + errors.count())

The first line defines an RDD backed by a file in the Hadoop Distributed File System (HDFS) as a collection of lines of text. The second line calls the filter transformation to derive a new RDD from lines. Its argument is a Scala function literal or closure.a Finally, the last line calls count, another type of RDD operation called an "action" that returns a result to the program (here, the number of elements in the RDD) instead of defining a new RDD.

Spark evaluates RDDs lazily, allowing it to find an efficient plan for the user's computation. In particular, transformations return a new RDD object representing the result of a computation but do not immediately compute it. When an action is called, Spark looks at the whole graph of transformations used to create an execution plan. For example, if there were multiple filter or map operations in a row, Spark can fuse them into one pass, or, if it knows that data is partitioned, it can avoid moving it over the network for groupBy.5 Users can thus build up programs modularly without losing performance.

Finally, RDDs provide explicit support for data sharing among computations. By default, RDDs are "ephemeral" in that they get recomputed each time they are used in an action (such as count). However, users can also persist selected RDDs in memory for rapid reuse. (If the data does not fit in memory, Spark will also spill it to disk.) For example, a user searching through a large set of log files in HDFS to debug a problem might load just the error messages into memory across the cluster by calling

errors.persist()

After this, the user can run a variety of queries on the in-memory data:

// Count errors mentioning MySQL
errors.filter(s => s.contains("MySQL")).count()

// Fetch back the time fields of errors that
// mention PHP, assuming time is field #3:
errors.filter(s => s.contains("PHP"))
  .map(line => line.split('\t')(3))
  .collect()

This data sharing is the main difference between Spark and previous computing models like MapReduce; otherwise, the individual operations (such as map and groupBy) are similar. Data sharing provides large speedups, often as much as 100×, for interactive queries and iterative algorithms.23 It is also the key to Spark's generality, as we discuss later.

a The closures passed to Spark can call into any existing Scala or Python library or even reference variables in the outer program. Spark sends read-only copies of these variables to worker nodes.


Fault tolerance. Apart from providing data sharing and a variety of parallel operations, RDDs also automatically recover from failures. Traditionally, distributed computing systems have provided fault tolerance through data replication or checkpointing. Spark uses a different approach called "lineage."25 Each RDD tracks the graph of transformations that was used to build it and reruns these operations on base data to reconstruct any lost partitions. For example, Figure 2 shows the RDDs in our previous query, where we obtain the time fields of errors mentioning PHP by applying two filters and a map. If any partition of an RDD is lost (for example, if a node holding an in-memory partition of errors fails), Spark will rebuild it by applying the filter on the corresponding block of the HDFS file. For "shuffle" operations that send data from all nodes to all other nodes (such as reduceByKey), senders persist their output data locally in case a receiver fails.

Figure 2. Lineage graph for the third query in our example; boxes represent RDDs, and arrows represent transformations.

lines
  → filter(line.startsWith("ERROR")) → errors
  → filter(line.contains("PHP")) → PHP errors
  → map(line.split('\t')(3)) → time fields

Lineage-based recovery is significantly more efficient than replication in data-intensive workloads. It saves both time, because writing data over the network is much slower than writing it to RAM, and storage space in memory. Recovery is typically much faster than simply rerunning the program, because a failed node usually contains multiple RDD partitions, and these partitions can be rebuilt in parallel on other nodes.
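
Readers can inspect the lineage Spark tracks through the RDD API itself. A minimal sketch, reusing the RDDs from the earlier log-mining example (as in that listing, spark denotes the Spark context); toDebugString prints the chain of transformations Spark would rerun to rebuild lost partitions:

val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
// The same chain of transformations shown in Figure 2:
val timeFields = errors.filter(s => s.contains("PHP"))
                       .map(line => line.split('\t')(3))
// Print the lineage graph Spark tracked for this RDD.
println(timeFields.toDebugString)
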
A longer example. As a longer example, Figure 3 shows an implementation of logistic regression in Spark. It uses batch gradient descent, a simple iterative algorithm that computes a gradient function over the data repeatedly as a parallel sum. Spark makes it easy to load the data into RAM once and run multiple sums. As a result, it runs faster than traditional MapReduce. For example, in a 100GB job (see Figure 4), MapReduce takes 110 seconds per iteration because each iteration loads the data from disk, while Spark takes only one second per iteration after the first load.

Figure 3. A Scala implementation of logistic regression via batch gradient descent in Spark.

// Load data into an RDD
val points = sc.textFile(...).map(readPoint).persist()

// Start with a random parameter vector
var w = DenseVector.random(D)

// On each iteration, update param vector with a sum
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w.dot(p.x)))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

Figure 4. Performance of logistic regression in Hadoop MapReduce vs. Spark for 100GB of data on 50 m2.4xlarge EC2 nodes. [Chart: running time (s) over 1 to 20 iterations; Hadoop stays near 110 seconds per iteration, while Spark drops to about one second per iteration after the first load.]

Integration with storage systems. Much like Google's MapReduce, Spark is designed to be used with multiple external systems for persistent storage. Spark is most commonly used with cluster file systems like HDFS and key-value stores like S3 and Cassandra. It can also connect with Apache Hive as a data catalog. RDDs usually store only temporary data within an application, though some applications (such as the Spark SQL JDBC server) also share RDDs across multiple users.2 Spark's design as a storage-system-agnostic engine makes it easy for users to run computations against existing data and join diverse data sources.
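
Because the engine is agnostic to the storage layer, retargeting a computation is usually just a change of URI. A small sketch (the host name, bucket, and availability of an S3 connector under the s3a scheme are assumptions about the deployment):

// Same computation over two storage systems; only the URI changes.
val fromHdfs = sc.textFile("hdfs://namenode:9000/logs/part-*")
val fromS3 = sc.textFile("s3a://my-bucket/logs/part-*")
println(fromHdfs.count() + fromS3.count())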


Higher-Level Libraries
The RDD programming model provides only distributed collections of objects and functions to run on them. Using RDDs, however, we have built a variety of higher-level libraries on Spark, targeting many of the use cases of specialized computing engines. The key idea is that if we control the data structures stored inside RDDs, the partitioning of data across nodes, and the functions run on them, we can implement many of the execution techniques in other engines. Indeed, as we show in this section, these libraries often achieve state-of-the-art performance on each task while offering significant benefits when users combine them. We now discuss the four main libraries included with Apache Spark.

SQL and DataFrames. One of the most common data processing paradigms is relational queries. Spark SQL2 and its predecessor, Shark,23 implement such queries on Spark, using techniques similar to analytical databases. For example, these systems support columnar storage, cost-based optimization, and code generation for query execution. The main idea behind these systems is to use the same data layout as analytical databases (compressed columnar storage) inside RDDs. In Spark SQL, each record in an RDD holds a series of rows stored in binary format, and the system generates code to run directly against this layout.

Beyond running SQL queries, we have used the Spark SQL engine to provide a higher-level abstraction for basic data transformations called DataFrames,2 which are RDDs of records with a known schema. DataFrames are a common abstraction for tabular data in R and Python, with programmatic methods for filtering, computing new columns, and aggregation. In Spark, these operations map down to the Spark SQL engine and receive all its optimizations. We discuss DataFrames more later. One technique not yet implemented in Spark SQL is indexing, though other libraries over Spark (such as IndexedRDDs3) do use it.

Spark Streaming. Spark Streaming26 implements incremental stream processing using a model called "discretized streams." To implement streaming over Spark, we split the input data into small batches (such as every 200 milliseconds) that we regularly combine with state stored inside RDDs to produce new results. Running streaming computations this way has several benefits over traditional distributed streaming systems. For example, fault recovery is less expensive due to using lineage, and it is possible to combine streaming with batch and interactive queries.
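
For a flavor of the model, here is a minimal streaming word count against the DStream API; this is a sketch, and the one-second batch size and the socket text source on localhost are assumptions:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group the input into 1-second batches; each batch becomes an RDD
// processed by the normal Spark engine.
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)

// Per-batch word counts, written with ordinary RDD-style operators.
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
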
GraphX. GraphX6 provides a graph computation interface similar to Pregel and GraphLab,10,11 implementing the same placement optimizations as these systems (such as vertex partitioning schemes) through its choice of partitioning function for the RDDs it builds.
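
As a taste of the interface, the following sketch loads a graph and runs GraphX's built-in PageRank; the edge-list path and the 0.001 convergence tolerance are assumptions:

import org.apache.spark.graphx.GraphLoader

// Load a graph from a "srcId dstId" edge-list file and compute
// PageRank until the ranks converge within the given tolerance.
val graph = GraphLoader.edgeListFile(sc, "hdfs://.../edges.txt")
val ranks = graph.pageRank(0.001).vertices
ranks.take(5).foreach(println)
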
MLlib. MLlib,14 Spark's machine learning library, implements more than 50 common algorithms for distributed model training. For example, it includes the common distributed algorithms of decision trees (PLANET), Latent Dirichlet Allocation, and Alternating Least Squares matrix factorization.
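
As one example of the API, this sketch trains the Alternating Least Squares recommender mentioned above; the "user,product,rating" file format, the rank of 10, and the 10 iterations are assumptions:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" lines into MLlib's Rating class.
val ratings = sc.textFile("hdfs://.../ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(',')
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Factorize the ratings matrix (rank 10, 10 iterations of ALS),
// then score one (user, product) pair with the learned model.
val model = ALS.train(ratings, 10, 10)
println(model.predict(42, 17))
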
Combining processing tasks. Spark's libraries all operate on RDDs as the data abstraction, making them easy to combine in applications. For example, Figure 5 shows a program that reads some historical Twitter data using Spark SQL, trains a K-means clustering model using MLlib, and then applies the model to a new stream of tweets. The data tasks returned by each library (here the historic tweet RDD and the K-means model) are easily passed to other libraries. Apart from compatibility at the API level, composition in Spark is also efficient at the execution level, because Spark can optimize across processing libraries. For example, if one library runs a map function and the next library runs a map on its result, Spark will fuse these operations into a single map. Likewise, Spark's fault recovery works seamlessly across these libraries, recomputing lost data no matter which libraries produced it.

Performance. Given that these libraries run over the same engine, do they lose performance? We found that by implementing the optimizations we just outlined within RDDs, we can often match the performance of specialized engines. For example, Figure 6 compares Spark's performance on three simple tasks (a SQL query, streaming word count, and Alternating Least Squares matrix factorization) versus other engines. While the results vary across workloads, Spark is generally comparable with specialized systems like Storm, GraphLab, and Impala.b For stream processing, although we show results from a distributed implementation on Storm, the per-node throughput is also comparable to commercial streaming engines like Oracle CEP.26

Even in highly competitive benchmarks, we have achieved state-of-the-art performance using Apache Spark. In 2014, we entered the Daytona GraySort benchmark (http://sortbenchmark.org/) involving sorting 100TB of data on disk, and tied for a new record with a specialized system built only for sorting on a similar number of machines. As in the other examples, this was possible because we could implement both the communication and CPU optimizations necessary for large-scale sorting inside the RDD model.

b One area in which other designs have outperformed Spark is certain graph computations.12,16 However, these results are for algorithms with low ratios of computation to communication (such as PageRank) where the latency from synchronized communication in Spark is significant. In applications with more computation (such as the ALS algorithm), distributing the application on Spark still helps.

Applications
Apache Spark is used in a wide range of applications.


Our surveys of Spark users have identified more than 1,000 companies using Spark, in areas from Web services to biotechnology to finance. In academia, we have also seen applications in several scientific domains. Across these workloads, we find users take advantage of Spark's generality and often combine multiple of its libraries. Here, we cover a few top use cases. Presentations on many use cases are also available on the Spark Summit conference website (http://www.spark-summit.org).

Batch processing. Spark's most common applications are for batch processing on large datasets, including Extract-Transform-Load workloads to convert data from a raw format (such as log files) to a more structured format and offline training of machine learning models. Published examples of these workloads include page personalization and recommendation at Yahoo!; managing a data lake at Goldman Sachs; graph mining at Alibaba; financial Value at Risk calculation; and text mining of customer feedback at Toyota. The largest published use case we are aware of is an 8,000-node cluster at Chinese social network Tencent that ingests 1PB of data per day.22

While Spark can process data in memory, many of the applications in this category run only on disk. In such cases, Spark can still improve performance over MapReduce due to its support for more complex operator graphs.

Interactive queries. Interactive use of Spark falls into three main classes. First, organizations use Spark SQL for relational queries, often through business-intelligence tools like Tableau. Examples include eBay and Baidu. Second, developers and data scientists can use Spark's Scala, Python, and R interfaces interactively through shells or visual notebook environments. Such interactive use is crucial for asking more advanced questions and for designing models that eventually lead to production applications and is common in all deployments. Third, several vendors have developed domain-specific interactive applications that run on Spark. Examples include Tresata (anti-money laundering), Trifacta (data cleaning), and PanTera (large-scale visualization, as in Figure 7).

Stream processing. Real-time processing is also a popular use case, both in analytics and in real-time decision-making applications. Published use cases for Spark Streaming include network security monitoring at Cisco, prescriptive analytics at Samsung SDS, and log mining at Netflix. Many of these applications also combine streaming with batch and interactive queries.

Figure 5. Example combining the SQL, machine learning, and streaming libraries in Spark.

// Load historical data as an RDD using Spark SQL
val trainingData = sql(
  "SELECT location, language FROM old_tweets")

// Train a K-means model using MLlib
val model = new KMeans()
  .setFeaturesCol("location")
  .setPredictionCol("language")
  .fit(trainingData)

// Apply the model to new tweets in a stream
TwitterUtils.createStream(...)
  .map(tweet => model.predict(tweet.location))

Figure 6. Comparing Spark's performance with several widely used specialized systems for SQL, streaming, and machine learning. Data is from Zaharia24 (SQL query and streaming word count) and Sparks et al.17 (alternating least squares matrix factorization). [Charts: SQL response time (sec) for Impala (disk), Impala (mem), Redshift, Spark (disk), and Spark (mem); streaming throughput (records/s) for Storm and Spark; machine learning response time (hours) for Mahout, MATLAB, GraphLab, and Spark.]

Figure 7. PanTera, a visualization application built on Spark that can interactively filter data. Source: PanTera.


For example, video company Conviva uses Spark to continuously maintain a model of content distribution server performance, querying it automatically when it moves clients across servers, in an application that requires substantial parallel work for both model maintenance and queries.

Scientific applications. Spark has also been used in several scientific domains, including large-scale spam detection,19 image processing,27 and genomic data processing.15 One example that combines batch, interactive, and stream processing is the Thunder platform for neuroscience at Howard Hughes Medical Institute, Janelia Farm.5 It is designed to process brain-imaging data from experiments in real time, scaling up to 1TB/hour of whole-brain imaging data from organisms (such as zebrafish and mice). Using Thunder, researchers can apply machine learning algorithms (such as clustering and Principal Component Analysis) to identify neurons involved in specific behaviors. The same code can be run in batch jobs on data from previous runs or in interactive queries during live experiments. Figure 8 shows an example image generated using Spark.

Figure 8. Visualization of neurons in the zebrafish brain created with Spark, where each neuron is colored based on the direction of movement that correlates with its activity. Source: Jeremy Freeman and Misha Ahrens of Janelia Research Campus.

Spark components used. Because Spark is a unified data-processing engine, the natural question is how many of its libraries organizations actually use. Our surveys of Spark users have shown that organizations do, indeed, use multiple components, with over 60% of organizations using at least three of Spark's APIs. Figure 9 outlines the usage of each component in a July 2015 Spark survey by Databricks that reached 1,400 respondents. We list the Spark Core API (just RDDs) as one component and the higher-level libraries as others. We see that many components are widely used, with Spark Core and SQL as the most popular. Streaming is used in 46% of organizations and machine learning in 54%. While not shown directly in Figure 9, most organizations use multiple components; 88% use at least two of them, 60% use at least three (such as Spark Core and two libraries), and 27% use at least four components.

Figure 9. Percent of organizations using each Spark component (Core, SQL, Streaming, MLlib, GraphX), from the Databricks 2015 Spark survey; https://databricks.com/blog/2015/09/24/.

Deployment environments. We also see growing diversity in where Apache Spark applications run and what data sources they connect to. While the first Spark deployments were generally in Hadoop environments, only 40% of deployments in our July 2015 Spark survey were on the Hadoop YARN cluster manager. In addition, 52% of respondents ran Spark on a public cloud.

Why Is the Spark Model General?
While Apache Spark demonstrates that a unified cluster programming model is both feasible and useful, it would be helpful to understand what makes cluster programming models general, along with Spark's limitations. Here, we summarize a discussion on the generality of RDDs from Zaharia.24 We study RDDs from two perspectives. First, from an expressiveness point of view, we argue that RDDs can emulate any distributed computation, and will do so efficiently in many cases unless the computation is sensitive to network latency. Second, from a systems point of view, we show that RDDs give applications control over the most common bottleneck resources in clusters (network and storage I/O) and thus make it possible to express the same optimizations for these resources that characterize specialized systems.

Expressiveness perspective. To study the expressiveness of RDDs, we start by comparing RDDs to the MapReduce model, which RDDs build on. The first question is what computations can MapReduce itself express? Although there have been numerous discussions about the limitations of MapReduce, the surprising answer here is that MapReduce can emulate any distributed computation.

To see this, note that any distributed computation consists of nodes that perform local computation and occasionally exchange messages. MapReduce offers the map operation, which allows local computation, and reduce, which allows all-to-all communication.


Any distributed computation can thus be emulated, perhaps somewhat inefficiently, by breaking down its work into timesteps, running maps to perform the local computation in each timestep, and batching and exchanging messages at the end of each step using a reduce. A series of MapReduce steps will capture the whole result, as in Figure 10. Recent theoretical work has formalized this type of emulation by showing that MapReduce can simulate many computations in the Parallel Random Access Machine model.8 Repeated MapReduce is also equivalent to the Bulk Synchronous Parallel model.20
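
To make the construction concrete, here is a toy sketch of such timesteps written with Spark's MapReduce-like operators: each round does local computation with map and exchanges messages with reduceByKey (a shuffle). The inline three-edge graph is purely illustrative; the loop computes hop distances from vertex 1:

// Directed edges (src, dst) and initial distances (vertex, hops).
val edges = sc.parallelize(Seq((1, 2), (2, 3), (1, 3))).cache()
var dist = sc.parallelize(Seq((1, 0)))

for (step <- 1 to 3) {
  // Local computation: each reached vertex proposes dist+1 to its neighbors.
  val messages = edges.join(dist).map { case (_, (dst, d)) => (dst, d + 1) }
  // Communication step: gather proposals per vertex and keep the minimum.
  dist = dist.union(messages).reduceByKey((a, b) => math.min(a, b))
}
dist.collect().foreach(println) // vertices 1, 2, 3 at distances 0, 1, 1
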
While this line of work shows that MapReduce can emulate arbitrary computations, two problems can make the "constant factor" behind this emulation high. First, MapReduce is inefficient at sharing data across timesteps because it relies on replicated external storage systems for this purpose. Our emulated system may thus become slower due to writing out its state after each step. Second, the latency of the MapReduce steps determines how well our emulation will match a real network, and most MapReduce implementations were designed for batch environments with minutes to hours of latency.

RDDs and Spark address both of these limitations. On the data-sharing front, RDDs make data sharing fast by avoiding replication of intermediate data and can closely emulate the in-memory "data sharing" across time that would happen in a system composed of long-running processes. On the latency front, Spark can run MapReduce-like steps on large clusters with 100ms latency; nothing intrinsic to the MapReduce model prevents this. While some applications need finer-grain timesteps and communication, this 100ms latency is enough to implement many data-intensive workloads, where the amount of computation that can be batched before a communication step is high.

In summary, RDDs build on MapReduce's ability to emulate any distributed computation but make this emulation significantly more efficient. Their main limitation is increased latency due to synchronization in each communication step, but this latency is often not a factor.

Figure 10. Emulating an arbitrary distributed computation with MapReduce. (a) MapReduce provides primitives for local computation (map) and all-to-all communication (reduce). (b) By chaining these steps together, we can emulate any distributed computation. The main costs for this emulation are the latency of the rounds and the overhead of passing state across steps.

Figure 11. Example of Spark's DataFrame API in Python. Unlike Spark's core API, DataFrames have a schema with named columns (such as age and city) and take expressions in a limited language (such as age > 20) instead of arbitrary Python functions.

users.where(users["age"] > 20)
  .groupBy("city")
  .agg(avg("age"), max("income"))

Figure 12. Working with DataFrames in Spark's R API. We load a distributed DataFrame using Spark's JSON data source, then filter and aggregate using standard R column expressions.

people <- read.df(context, "./people.json", "json")

# Filter people by age
adults = filter(people, people$age > 20)

# Count number of people by city
summarize(groupBy(adults, adults$city), count = n(adults$id))
## city count
## 1 Cambridge 1
## 2 San Francisco 6
## 3 Berkeley 4

Systems perspective. Independent of the emulation approach to characterizing Spark's generality, we can take a systems approach. What are the bottleneck resources in cluster computations? And can RDDs use them efficiently? Although cluster applications are diverse, they are all bound by the same properties of the underlying hardware. Current datacenters have a steep storage hierarchy that limits most applications in similar ways. For example, a typical Hadoop cluster might have the following characteristics:

Local storage. Each node has local memory with approximately 50GB/s of bandwidth, as well as 10 to 20 local disks, for approximately 1GB/s to 2GB/s of disk bandwidth;

Links. Each node has a 10Gbps (1.3GB/s) link, or approximately 40× less than its memory bandwidth and 2× less than its aggregate disk bandwidth; and

Racks. Nodes are organized into racks of 20 to 40 machines, with 40Gbps–80Gbps bandwidth out of each rack, or 2×–5× lower than the in-rack network performance.

Given these properties, the most important performance concern in many applications is the placement of data and computation in the network.


Fortunately, RDDs provide the facilities to control this placement; the interface lets applications place computations near input data (through an API for "preferred locations" for input sources25), and RDDs provide control over data partitioning and co-location (such as specifying that data be hashed by a given key). Libraries (such as GraphX) can thus implement the same placement strategies used in specialized systems.6
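
A small sketch of this control at the API level: co-partitioning two pair RDDs so a later join runs without a network shuffle (the inline data and the choice of 100 partitions are illustrative):

import org.apache.spark.HashPartitioner

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"), ("about.html", "3.4.5.6")))
val titles = sc.parallelize(Seq(("index.html", "Home"), ("about.html", "About")))

// Hash both datasets into the same 100 partitions and keep them cached;
// because the two sides share a partitioner, the join is shuffle-free.
val part = new HashPartitioner(100)
val v = visits.partitionBy(part).persist()
val t = titles.partitionBy(part).persist()
val joined = v.join(t)
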
Beyond network and I/O bandwidth, the most common bottleneck tends to be CPU time, especially if data is in memory. In this case, however, Spark can run the same algorithms and libraries used in specialized systems on each node. For example, it uses columnar storage and processing in Spark SQL, native BLAS libraries in MLlib, and so on. As we discussed earlier, the only area where RDDs clearly add a cost is network latency, due to the synchronization at parallel communication steps.

One final observation from a systems perspective is that Spark may incur extra costs over some of today's specialized systems due to fault tolerance. For example, in Spark, the map tasks in each shuffle operation save their output to local files on the machine where they ran, so reduce tasks can re-fetch it later. In addition, Spark implements a barrier at shuffle stages, so the reduce tasks do not start until all the maps have finished. This avoids some of the complexity that would be needed for fault recovery if one "pushed" records directly from maps to reduces in a pipelined fashion. Although removing some of these features would speed up the system, Spark often performs competitively despite them. The main reason is an argument similar to our previous one: many applications are bound by an I/O operation (such as shuffling data across the network or reading it from disk), and beyond this operation, optimizations (such as pipelining) add only a modest benefit. We have kept fault tolerance "on" by default in Spark to make it easy to reason about applications.

Ongoing Work
Apache Spark remains a rapidly evolving project, with contributions from both industry and research. The codebase size has grown by a factor of six since June 2013, with most of the activity in new libraries. More than 200 third-party packages are also available.c In the research community, multiple projects at Berkeley, MIT, and Stanford build on Spark, and many new libraries (such as GraphX and Spark Streaming) came from research groups. Here, we sketch four of the major efforts.

DataFrames and more declarative APIs. The core Spark API was based on functional programming over distributed collections that contain arbitrary types of Scala, Java, or Python objects. While this approach was highly expressive, it also made programs more difficult to automatically analyze and optimize. The Scala/Java/Python objects stored in RDDs could have complex structure, and the functions run over them could include arbitrary code. In many applications, developers could get suboptimal performance if they did not use the right operators; for example, the system on its own could not push filter functions ahead of maps.

To address this problem, we extended Spark in 2015 to add a more declarative API called DataFrames2 based on the relational algebra. Data frames are a common API for tabular data in Python and R. A data frame is a set of records with a known schema, essentially equivalent to a database table, that supports operations like filtering and aggregation using a restricted "expression" API. Unlike working in the SQL language, however, data frame operations are invoked as function calls in a more general programming language (such as Python and R), allowing developers to easily structure their program using abstractions in the host language (such as functions and classes). Figure 11 and Figure 12 show examples of the API.

Spark's DataFrames offer a similar API to single-node packages but automatically parallelize and optimize the computation using Spark SQL's query planner. User code thus receives optimizations (such as predicate pushdown, operator reordering, and join algorithm selection) that were not available under Spark's functional API. To our knowledge, Spark DataFrames are the first library to perform such relational optimizations under a data frame API.d

While DataFrames are still new, they have quickly become a popular API. In our July 2015 survey, 60% of respondents reported using them. Because of the success of DataFrames, we have also developed a type-safe interface over them called Datasetse that lets Java and Scala programmers view DataFrames as statically typed collections of Java objects, similar to the RDD API, and still receive relational optimizations. We expect these APIs to gradually become the standard abstraction for passing data between Spark libraries.
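
As a sketch of what that type-safe interface looks like in Scala (the Person class and JSON path are hypothetical; in Spark 1.6 the entry point is the SQLContext, whose implicits supply the needed encoders):

// Hypothetical record type; Spark derives an encoder for the case class.
case class Person(name: String, age: Long)

import sqlContext.implicits._

// View a DataFrame as a statically typed Dataset[Person]; the filter
// below is checked at compile time yet still optimized relationally.
val people = sqlContext.read.json("people.json").as[Person]
val adults = people.filter(p => p.age > 20)
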
Performance optimizations. Much of the recent work in Spark has been on performance. In 2014, the Databricks team spent considerable effort to optimize Spark's network and I/O primitives, allowing Spark to jointly set a new record for the Daytona GraySort challenge.f Spark sorted 100TB of data 3× faster than the previous record holder based on Hadoop MapReduce, using 10× fewer machines. This benchmark was not executed in memory but rather on (solid-state) disks. In 2015, one major effort was Project Tungsten,g which removes Java Virtual Machine overhead from many of Spark's code paths by using code generation and non-garbage-collected memory. One benefit of doing these optimizations in a general engine is that they simultaneously affect all of Spark's libraries; machine learning, streaming, and SQL all became faster from each change.

R language support. The SparkR project21 was merged into Spark in 2015 to provide a programming interface in R. The R interface is based on DataFrames and uses almost identical syntax to R's built-in data frames. Other Spark libraries (such as MLlib) are also easy to call from R, because they accept DataFrames as input.

c One package index is available at https://spark-packages.org/
d One reason optimization is possible is that Spark's DataFrame API uses lazy evaluation where the content of a DataFrame is not computed until the user asks to write it out. The data frame APIs in R and Python are eager, preventing optimizations like operator reordering.
e https://databricks.com/blog/2016/01/04/introducing-spark-datasets.html
f http://sortbenchmark.org/ApacheSpark2014.pdf
g https://databricks.com/blog/2015/04/28/


Research libraries. Apache Spark continues to be used to build higher-level data processing libraries. Recent projects include Thunder for neuroscience,5 ADAM for genomics,15 and Kira for image processing in astronomy.27 Other research libraries (such as GraphX) have been merged into the main codebase.

Conclusion
Scalable data processing will be essential for the next generation of computer applications but typically involves a complex sequence of processing steps with different computing systems. To simplify this task, the Spark project introduced a unified programming model and engine for big data applications. Our experience shows such a model can efficiently support today's workloads and brings substantial benefits to users. We hope Apache Spark highlights the importance of composability in programming libraries for big data and encourages development of more easily interoperable libraries.

All Apache Spark libraries described in this article are open source at http://spark.apache.org/. Databricks has also made videos of all Spark Summit conference talks available for free at https://spark-summit.org/.

Acknowledgments
Apache Spark is the work of hundreds of open source contributors who are credited in the release notes at https://spark.apache.org. Berkeley's research on Spark was supported in part by National Science Foundation CISE Expeditions Award CCF-1139158, Lawrence Berkeley National Laboratory Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, IBM, The Thomas and Stacey Siebel Foundation, Adobe, Apple, Arimo, Blue Goji, Bosch, C3Energy, Cisco, Cray, Cloudera, EMC2, Ericsson, Facebook, Guavus, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, Samsung, Schlumberger, Splunk, Virdata, and VMware.

References
1. Apache Storm project; http://storm.apache.org
2. Armbrust, M. et al. Spark SQL: Relational data processing in Spark. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
3. Dave, A. IndexedRDD project; http://github.com/amplab/spark-indexedrdd
4. Dean, J. and Ghemawat, S. MapReduce: Simplified data processing on large clusters. In Proceedings of the Sixth OSDI Symposium on Operating Systems Design and Implementation (San Francisco, CA, Dec. 6–8). USENIX Association, Berkeley, CA, 2004.
5. Freeman, J., Vladimirov, N., Kawashima, T., Mu, Y., Sofroniew, N.J., Bennett, D.V., Rosen, J., Yang, C.-T., Looger, L.L., and Ahrens, M.B. Mapping brain activity at scale with cluster computing. Nature Methods 11, 9 (Sept. 2014), 941–950.
6. Gonzalez, J.E. et al. GraphX: Graph processing in a distributed dataflow framework. In Proceedings of the 11th OSDI Symposium on Operating Systems Design and Implementation (Broomfield, CO, Oct. 6–8). USENIX Association, Berkeley, CA, 2014.
7. Isard, M. et al. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the EuroSys Conference (Lisbon, Portugal, Mar. 21–23). ACM Press, New York, 2007.
8. Karloff, H., Suri, S., and Vassilvitskii, S. A model of computation for MapReduce. In Proceedings of the ACM-SIAM SODA Symposium on Discrete Algorithms (Austin, TX, Jan. 17–19). ACM Press, New York, 2010.
9. Kornacker, M. et al. Impala: A modern, open-source SQL engine for Hadoop. In Proceedings of the Seventh Biennial CIDR Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 4–7, 2015).
10. Low, Y. et al. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In Proceedings of the 38th International VLDB Conference on Very Large Databases (Istanbul, Turkey, Aug. 27–31, 2012).
11. Malewicz, G. et al. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD/PODS Conference (Indianapolis, IN, June 6–11). ACM Press, New York, 2010.
12. McSherry, F., Isard, M., and Murray, D.G. Scalability! But at what COST? In Proceedings of the 15th HotOS Workshop on Hot Topics in Operating Systems (Kartause Ittingen, Switzerland, May 18–20). USENIX Association, Berkeley, CA, 2015.
13. Melnik, S. et al. Dremel: Interactive analysis of Web-scale datasets. Proceedings of the VLDB Endowment 3 (Sept. 2010), 330–339.
14. Meng, X., Bradley, J.K., Yavuz, B., Sparks, E.R., Venkataraman, S., Liu, D., Freeman, J., Tsai, D.B., Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M.J., Zadeh, R., Zaharia, M., and Talwalkar, A. MLlib: Machine learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
15. Nothaft, F.A., Massie, M., Danford, T., Zhang, Z., Laserson, U., Yeksigian, C., Kottalam, J., Ahuja, A., Hammerbacher, J., Linderman, M., Franklin, M.J., Joseph, A.D., and Patterson, D.A. Rethinking data-intensive science using scalable analytics systems. In Proceedings of the ACM SIGMOD/PODS Conference (Melbourne, Australia, May 31–June 4). ACM Press, New York, 2015.
16. Shun, J. and Blelloch, G.E. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN PPoPP Symposium on Principles and Practice of Parallel Programming (Shenzhen, China, Feb. 23–27). ACM Press, New York, 2013.
17. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Pan, X., Gonzalez, J.E., Franklin, M.J., Jordan, M.I., and Kraska, T. MLI: An API for distributed machine learning. In Proceedings of the IEEE ICDM International Conference on Data Mining (Dallas, TX, Dec. 7–10). IEEE Press, 2013.
18. Stonebraker, M. and Cetintemel, U. 'One size fits all': An idea whose time has come and gone. In Proceedings of the 21st International ICDE Conference on Data Engineering (Tokyo, Japan, Apr. 5–8). IEEE Computer Society, Washington, D.C., 2005, 2–11.
19. Thomas, K., Grier, C., Ma, J., Paxson, V., and Song, D. Design and evaluation of a real-time URL spam filtering service. In Proceedings of the IEEE Symposium on Security and Privacy (Oakland, CA, May 22–25). IEEE Press, 2011.
20. Valiant, L.G. A bridging model for parallel computation. Commun. ACM 33, 8 (Aug. 1990), 103–111.
21. Venkataraman, S. et al. SparkR; http://dl.acm.org/citation.cfm?id=2903740&CFID=687410325&CFTOKEN=83630888
22. Xin, R. and Zaharia, M. Lessons from running large-scale Spark workloads; http://tinyurl.com/large-scale-spark
23. Xin, R.S., Rosen, J., Zaharia, M., Franklin, M.J., Shenker, S., and Stoica, I. Shark: SQL and rich analytics at scale. In Proceedings of the ACM SIGMOD/PODS Conference (New York, June 22–27). ACM Press, New York, 2013.
24. Zaharia, M. An Architecture for Fast and General Data Processing on Large Clusters. Ph.D. thesis, Electrical Engineering and Computer Sciences Department, University of California, Berkeley, 2014; https://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf
25. Zaharia, M. et al. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the Ninth USENIX NSDI Symposium on Networked Systems Design and Implementation (San Jose, CA, Apr. 25–27, 2012).
26. Zaharia, M. et al. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the 24th ACM SOSP Symposium on Operating Systems Principles (Farmington, PA, Nov. 3–6). ACM Press, New York, 2013.
27. Zhang, Z., Barbary, K., Nothaft, N.A., Sparks, E., Zahn, O., Franklin, M.J., Patterson, D.A., and Perlmutter, S. Scientific computing meets big data technology: An astronomy use case. In Proceedings of the IEEE International Conference on Big Data (Santa Clara, CA, Oct. 29–Nov. 1). IEEE, 2015.

Matei Zaharia (matei@cs.stanford.edu) is an assistant professor of computer science at Stanford University, Stanford, CA, and CTO of Databricks, San Francisco, CA.
Reynold S. Xin (rxin@databricks.com) is the chief architect on the Spark team at Databricks, San Francisco, CA.
Patrick Wendell (patrick@databricks.com) is the vice president of engineering at Databricks, San Francisco, CA.
Tathagata Das (tdas@databricks.com) is a software engineer at Databricks, San Francisco, CA.
Michael Armbrust (michael@databricks.com) is a software engineer at Databricks, San Francisco, CA.
Ankur Dave (ankurd@eecs.berkeley.edu) is a graduate student in the Real-Time, Intelligent and Secure Systems Lab at the University of California, Berkeley.
Xiangrui Meng (meng@databricks.com) is a software engineer at Databricks, San Francisco, CA.
Josh Rosen (josh@databricks.com) is a software engineer at Databricks, San Francisco, CA.
Shivaram Venkataraman (shivaram@cs.berkeley.edu) is a Ph.D. student in the AMPLab at the University of California, Berkeley.
Michael J. Franklin (mjfranklin@uchicago.edu) is the Liew Family Chair of Computer Science at the University of Chicago and Director of the AMPLab at the University of California, Berkeley.
Ali Ghodsi (ali@databricks.com) is the CEO of Databricks and adjunct faculty at the University of California, Berkeley.
Joseph E. Gonzalez (jegonzal@cs.berkeley.edu) is an assistant professor in EECS at the University of California, Berkeley.
Scott Shenker (shenker@icsi.berkeley.edu) is a professor in EECS at the University of California, Berkeley.
Ion Stoica (istoica@cs.berkeley.edu) is a professor in EECS and co-director of the AMPLab at the University of California, Berkeley.

Copyright held by the authors. Publication rights licensed to ACM. $15.00

Watch the authors discuss their work in this exclusive Communications video: http://cacm.acm.org/videos/spark
