
High Level Overview of Apache Spark


What is Spark? Let’s take a look under the hood
Eric Girouard
Apr 22 · 9 min read

In my last post we introduced a problem: copious, never-ending streams of data, and its solution: Apache Spark. Here in Part II we'll focus on Spark's internal architecture and data structures.

. . .

In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a larger
ox. We shouldn’t be trying for bigger computers, but for
more systems of computers — Grace Hopper

With the scale of data growing at a rapid and ominous pace, we needed a way to process potentially petabytes of data quickly, and we simply couldn't make a single computer process that amount of data at a reasonable speed. This problem is solved by creating a cluster of machines to perform the work for you, but how do those machines work together to solve a common problem?

Meet Spark


Spark is a cluster computing framework for large-scale data processing. Spark offers a set of libraries in three languages (Java, Scala, Python) for its unified computing engine. What does this definition actually mean?

Unified: With Spark, there is no need to piece together an application out of multiple APIs or systems. Spark provides you with enough built-in APIs to get the job done.

Computing Engine: Spark handles loading data from various file systems and runs computations on it, but does not store any data itself permanently. Spark keeps data in memory wherever it can, which is a large part of its performance and speed.
Libraries: Spark consists of a series of libraries built for data science tasks. It includes libraries for SQL (SparkSQL), Machine Learning (MLlib), Stream Processing (Spark Streaming and Structured Streaming), and Graph Analytics (GraphX).

. . .
The Spark Application
Every Spark Application consists of a Driver and a set of distributed worker
processes (Executors).

Spark Driver
The Driver runs the main() method of our application and is where the
SparkContext is created. The Spark Driver has the following duties:

• Runs on a node in our cluster, or on a client, and schedules the job execution with a cluster manager

• Responds to the user's program or input

• Analyzes, schedules, and distributes work across the executors

• Stores metadata about the running application and conveniently exposes it in a web UI

Spark Executors
An executor is a distributed process responsible for the execution of tasks.
Each Spark Application has its own set of executors, which stay alive for the
life cycle of a single Spark application.

• Executors perform all data processing of a Spark job

• They store results in memory, only persisting to disk when specifically instructed by the driver program

• They return results to the driver once they have completed

• Each node can have anywhere from one executor per node to one executor per core (see the configuration sketch below)
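How many executors you get, and how large each one is, is controlled through configuration rather than application logic. As a minimal sketch (the values below are illustrative, not taken from the original post; spark.executor.instances applies when running under a cluster manager such as YARN or Kubernetes):

import org.apache.spark.SparkConf;

// Illustrative executor sizing: four executors, each with two cores and 2 GB of memory
SparkConf conf = new SparkConf()
        .setAppName("executor-sizing-example")
        .set("spark.executor.instances", "4")   // total number of executors
        .set("spark.executor.cores", "2")       // cores per executor
        .set("spark.executor.memory", "2g");    // heap memory per executor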

Spark's Application Workflow


When you submit a job to Spark for processing, there is a lot that goes on
behind the scenes.

1. Our standalone application is kicked off and initializes its SparkContext. Only once it has a SparkContext can an app be referred to as a Driver

2. Our Driver program asks the Cluster Manager for resources to launch its
executors

3. The Cluster Manager launches the executors

4. Our Driver runs our actual Spark code

5. Executors run tasks and send their results back to the driver

6. The SparkContext is stopped and all executors are shut down, returning resources back to the cluster (sketched in code below)
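As a minimal sketch of that life cycle (the class name and app name below are placeholders, not code from the original post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class SkeletonDriver {
    public static void main(String[] args) {
        // Step 1: creating the SparkContext is what makes this process the Driver
        SparkConf conf = new SparkConf().setAppName("skeleton-driver");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Steps 2-5: any RDD work defined here is scheduled by the driver,
        // run as tasks on the executors, and the results are sent back

        // Step 6: stop the context, shutting down executors and releasing resources
        sc.stop();
    }
}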

. . .

MaxTemperature, Revisited
Let's take a deeper look at the Spark job we wrote in Part I to find the max temperature by country. That abstraction hid a lot of set-up code, including the initialization of our SparkContext; let's fill in the gaps:

MaxTemperature Spark Setup
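The embedded gist from the original post isn't reproduced here; a rough reconstruction of the setup, assuming the inputPath variable and the weatherData RDD referenced in the walkthrough below, would look something like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MaxTemperature {
    public static void main(String[] args) {
        String inputPath = args[0];   // location of the weather data (assumed argument)

        // Pure framework setup: nothing is sent to the executors yet
        SparkConf conf = new SparkConf().setAppName("MaxTemperature");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The equivalent of "line 16" in the original gist: loading data into an RDD
        JavaRDD<String> weatherData = sc.textFile(inputPath);

        // ... the transformations and the save action walked through later in this post ...

        sc.stop();
    }
}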

Remember that Spark is a framework, in this case implemented in Java. It isn't until line 16 that Spark needs to do any work at all. Sure, we initialized our SparkContext, but loading data into an RDD is the first bit of code that requires work to be sent to our executors.

By now you may have seen the term "RDD" appear multiple times; it's about time we defined it.

. . .

Spark Architecture Overview

Spark has a well-defined, layered architecture with loosely coupled components based on two primary abstractions:

• Resilient Distributed Datasets (RDDs)

• Directed Acyclic Graph (DAG)

Resilient Distributed Datasets

RDDs are essentially the building blocks of Spark: everything is made up of them. Even Spark's higher-level APIs (DataFrames, Datasets) are composed of RDDs under the hood. What does it mean to be a Resilient Distributed Dataset?

• Resilient: Since Spark runs on a cluster of machines, data loss from hardware failure is a very real concern, so RDDs are fault tolerant and can rebuild themselves in the event of failure

• Distributed: A single RDD is stored across a series of different nodes in the cluster, belonging to no single source (and no single point of failure). This way our cluster can operate on our RDD in parallel

• Dataset: A collection of values — this one you should probably have known already

All data we work with in Spark will be stored inside some form of RDD — it is
therefore imperative to fully understand them.

Spark offers a slew of "Higher Level" APIs built on top of RDDs, designed to abstract away complexity, namely the DataFrame and Dataset. With a strong focus on Read-Evaluate-Print Loops (REPLs), Spark-Submit and the Spark-Shell in Scala and Python are targeted toward data scientists, who often want to repeat analysis on a dataset. The RDD is still imperative to understand, as it's the underlying structure of all data in Spark.

An RDD is colloquially equivalent to a "distributed data structure". A JavaRDD<String> is essentially just a List<String> dispersed amongst the nodes in our cluster, with each node getting several different chunks of our List. With Spark, we need to think in a distributed context, always.
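To make that concrete, here is a small sketch (the sample data is invented, and sc is the JavaSparkContext from our setup) that turns an ordinary in-memory List<String> into a JavaRDD<String> spread across the cluster:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

List<String> cities = Arrays.asList("Boston", "Reykjavik", "Cairo", "Sydney");

// parallelize distributes the local list across the executors as an RDD
JavaRDD<String> distributedCities = sc.parallelize(cities);

// The data now lives in partitions on the cluster, not in a single local list
System.out.println(distributedCities.count());   // prints 4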

RDDs work by splitting their data into a series of partitions to be stored on each executor node. Each node then performs its work only on its own partitions. This is what makes Spark so powerful: if an executor dies or a task fails, Spark can rebuild just the partitions it needs from the original source and re-submit the task for completion.

Spark RDD partitioned amongst executors
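As a small illustration (using the weatherData RDD from the setup sketch above; the partition count of 8 is an arbitrary example):

// How many partitions is this RDD split into? For a file-backed RDD this
// typically follows the number of input blocks
System.out.println(weatherData.getNumPartitions());

// Redistribute the data into 8 partitions; this shuffles data across the
// cluster, so it is not free
JavaRDD<String> rebalanced = weatherData.repartition(8);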

RDD Operations
RDDs are immutable, meaning that once they are created they cannot be altered in any way; they can only be transformed. The notion of transforming RDDs is at the core of Spark, and Spark jobs can be thought of as nothing more than some combination of these steps:

• Loading data into an RDD

• Transforming an RDD

• Performing an Action on an RDD

In fact, every Spark job I've written has been composed exclusively of those types of tasks, with vanilla Java for flavour.

Spark defines a set of APIs for working with RDDs that can be broken down into two large groups: Transformations and Actions.

Transformations create a new RDD from an existing one.

Actions return a value, or values, to the Driver program after running a computation on its RDD.

For example, the map function weatherData.map() is a transformation that passes each element of an RDD through a function.

Reduce is an RDD action that aggregates all the elements of an RDD using some function and returns the final result to the driver program.

. . .

Lazy Evaluation
"I choose a lazy person to do a hard job. Because a lazy person will find an easy way to do it." — Bill Gates

All transformations in Spark are lazy. This means that when we tell Spark to create an RDD via transformations of an existing RDD, it won't generate that dataset until a specific action is performed on it or on one of its children. Spark will then perform the transformation together with the action that triggered it. This allows Spark to run much more efficiently.
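A tiny sketch of what that laziness looks like in practice (the file name and RDD names here are just for illustration):

import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> lines = sc.textFile("data.txt");

// map is a transformation: Spark only records it, no computation happens yet
JavaRDD<Integer> lengths = lines.map(s -> s.length());

// count is an action: this is the point where Spark actually runs the work
long numberOfLines = lengths.count();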

Let’s re-examine the function declarations from our earlier Spark example to
identify which functions are actions and which are transformations:

16: JavaRDD<String> weatherData = sc.textFile(inputPath);

Line 16 is neither an action nor a transformation; it's a function of sc, our JavaSparkContext.

17: JavaPairRDD<String, Integer> tempsByCountry = weatherData.mapToPair(new Func.....

Line 17 is a transformation of the weatherData RDD; in it we map each line of weatherData to a pair of (City, Temperature)

26: JavaPairRDD<String, Integer> maxTempByCountry = tempsByCountry.reduceByKey(new Func....

Line 26 is also a transformation, because reduceByKey returns a new RDD: it transforms tempsByCountry by reducing each city's key-value pairs down to its highest recorded temperature.

31: maxTempByCountry.saveAsHadoopFile(destPath, String.class, Integer.class, TextOutputFormat.class);

Finally, on line 31 we trigger a Spark action: saving our RDD to our file system. Since Spark subscribes to the lazy execution model, it isn't until this line that Spark generates weatherData, tempsByCountry, and maxTempByCountry before finally saving our result.

Directed Acyclic Graph

Whenever an action is performed on an RDD, Spark creates a DAG: a finite directed graph with no directed cycles (otherwise our job would run forever). Remember that a graph is nothing more than a series of connected vertices and edges, and this graph is no different. Each vertex in the DAG is a Spark function, some operation performed on an RDD (map, mapToPair, reduceByKey, etc).

In MapReduce, the DAG consists of two vertices: Map → Reduce.

In our above example of MaxTemperatureByCountry, the DAG is a little more involved:

parallelize → map → mapToPair → reduceByKey → saveAsHadoopFile

The DAG allows Spark to optimize its execution plan and minimize shuffling. We'll discuss the DAG in greater depth in later posts, as it's outside the scope of this Spark overview.
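If you want to peek at this graph yourself, Spark can print a textual description of an RDD's lineage; a quick sketch, using the maxTempByCountry RDD from the walkthrough above:

// Prints the chain of parent RDDs and transformations Spark would walk
// to (re)compute this RDD
System.out.println(maxTempByCountry.toDebugString());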

. . .

Evaluation Loops
With our new vocabulary, let's re-examine the problem with MapReduce as I defined it in Part I, quoted below:

"MapReduce excels at batch data processing, however it lags behind when it comes to repeat analysis and small feedback loops. The only way to reuse data between computations is to write it to an external storage system (a la HDFS)"

'Reuse data between computations'? Sounds like an RDD that can have multiple actions performed on it! Let's suppose we have a file "data.txt" and want to accomplish two computations:

• Total length of all lines in the file

• Length of the longest line in the file

In MapReduce, each task would require a separate job or a fancy MultipleOutputFormat implementation. Spark makes this a breeze in just four simple steps:

1. Load contents of data.txt into an RDD

JavaRDD<String> lines = sc.textFile("data.txt");

2. Map each line of ‘lines’ to its length (Lambda functions used for brevity)

JavaRDD<Integer> lineLengths = lines.map(s -> s.length());

3. To solve for total length: reduce lineLengths to find the total line length sum, in this case the sum of every element in the RDD

int totalLength = lineLengths.reduce((a, b) -> a + b);

4. To solve for longest length: reduce lineLengths to find the maximum line length

int maxLength = lineLengths.reduce((a, b) -> Math.max(a,b));

Note that steps 3 and 4 are RDD actions, so they return a result to our Driver program, in this case a Java int. Also recall that Spark is lazy and refuses to do any work until it sees an action; in this case it will not begin any real work until step 3.
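By default each action recomputes lineLengths from its source, so this is also where caching pays off. A small sketch of persisting the RDD in memory before reusing it (cache() is a hint that partitions should be kept in memory after they are first computed):

// Keep lineLengths in executor memory after the first action computes it,
// so the second action reuses it instead of re-reading data.txt
lineLengths.cache();

int totalLength = lineLengths.reduce((a, b) -> a + b);             // triggers computation, fills the cache
int maxLength   = lineLengths.reduce((a, b) -> Math.max(a, b));    // served from the cached partitions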

. . .

Next Steps
So far we've introduced our data problem and its solution: Apache Spark. We reviewed Spark's architecture and workflow, its flagship internal abstraction (the RDD), and its execution model. Next we'll look into Functions and Syntax in Java, getting progressively more technical as we dive deeper into the framework.
