. . .
In pioneer days they used oxen for heavy pulling, and when
one ox couldn’t budge a log, they didn’t try to grow a larger
ox. We shouldn’t be trying for bigger computers, but for
more systems of computers — Grace Hopper
With the scale of data growing at a rapid and ominous pace, we needed a
way to process potentially petabytes of data quickly, and we simply couldn't
make a single computer handle that amount of data at a reasonable pace.
This problem is solved by creating a cluster of machines to perform the work
for you, but how do those machines work together to solve the common
problem?
Meet Spark
Computing Engine: Spark handles loading data from various file systems and
runs computations on it, but does not store any data itself permanently. Spark
performs as much of its work as possible in memory, allowing unparalleled
performance and speed.
. . .
The Spark Application
Every Spark Application consists of a Driver and a set of distributed worker
processes (Executors).
Spark Driver
The Driver runs the main() method of our application and is where the
SparkContext is created. The Spark Driver has the following duties:
Spark Executors
An executor is a distributed process responsible for the execution of tasks.
Each Spark Application has its own set of executors, which stay alive for the
life cycle of a single Spark application.
• Each node can have anywhere from 1 executor per node to 1 executor per
core
2. Our Driver program asks the Cluster Manager for resources to launch its
executors
5. Executors run tasks and send their results back to the driver
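As a rough illustration of that sizing decision, the number of executors an application requests, and the cores and memory each one gets, are usually expressed as configuration on the driver before it asks the Cluster Manager for resources. The property keys below are standard Spark configuration options; the values are placeholders, not recommendations:

import org.apache.spark.SparkConf;

public class ExecutorSizingSketch {
    public static void main(String[] args) {
        // Placeholder sizing: 4 executors, each with 2 cores and 2 GB of heap
        SparkConf conf = new SparkConf()
                .setAppName("ExecutorSizingSketch")
                .set("spark.executor.instances", "4")  // how many executors to request
                .set("spark.executor.cores", "2")      // cores per executor
                .set("spark.executor.memory", "2g");   // heap per executor

        // Print the resulting configuration for inspection
        System.out.println(conf.toDebugString());
    }
}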
. . .
MaxTemperature, Revisited
Let's take a deeper look at the Spark Job we wrote in Part I to find the max
temperature by country. This abstraction hid a lot of set-up code, including
the initialization of our SparkContext; let's fill in the gaps:
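A minimal sketch of that set-up code in Java, assuming a local master and a hypothetical input path (neither is necessarily what Part I used), looks roughly like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class MaxTemperature {
    public static void main(String[] args) {
        // The Driver's main() method: configure the application and create the SparkContext
        SparkConf conf = new SparkConf()
                .setAppName("MaxTemperature")
                .setMaster("local[*]");   // placeholder master URL
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the raw temperature records into an RDD of lines (path is hypothetical)
        JavaRDD<String> lines = sc.textFile("data/temperatures.csv");

        // ... the transformations and action that compute max temperature by country
        //     would follow here, exactly as written in Part I ...

        sc.stop();
    }
}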
By now you may have seen the term “RDD” appear multiple times; it's about
time we define it.
. . .
All data we work with in Spark will be stored inside some form of RDD — it is
therefore imperative to fully understand them.
Spark offers a slew of “Higher Level” APIs built on top of RDDs designed to
abstract away complexity, namely the DataFrame and Dataset. With a strong
focus on Read-Evaluate-Print Loops (REPLs), Spark-Submit and the Spark-
Shell in Scala and Python are targeted toward Data Scientists, who often
desire repeat analysis on a dataset. The RDD is still imperative to understand,
as it’s the underlying structure of all data in Spark.
RDD Operations
RDDs are Immutable, meaning that once they are created, they cannot be
altered in any way; they can only be transformed. The notion of
transforming RDDs is at the core of Spark, and Spark Jobs can be thought of
as nothing more than any combination of these steps:
• Transforming an RDD
In fact, every Spark job I've written has consisted exclusively of those types of
tasks, with vanilla Java for flavour.
Spark defines a set of APIs for working with RDDs that can be broken down
into two large groups: Transformations and Actions.
Transformations create a new RDD from an existing one.
Reduce is an RDD action that aggregates all the elements of an RDD using
some function and returns the final result to the driver program.
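As a quick illustration (the data here is a made-up list of integers, not the temperature dataset), a transformation such as filter hands back a new RDD, while an action such as count sends a result to the driver:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TransformationVsAction {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("TransformationVsAction").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: returns a new RDD; the original is untouched (immutability)
        JavaRDD<Integer> evens = numbers.filter(n -> n % 2 == 0);

        // Action: computes a value and returns it to the driver
        long evenCount = evens.count();
        System.out.println("Even numbers: " + evenCount);

        sc.stop();
    }
}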
. . .
Lazy Evaluation
“I choose a lazy person to do a hard job. Because a lazy person will find an easy
way to do it.” — Bill Gates
All transformations in Spark are lazy. This means that when we tell Spark
to create an RDD via transformations of an existing RDD, it won't generate
that dataset until a specific action is performed on it or on one of its children.
Spark will then perform the transformation and the action that triggered it.
This allows Spark to run much more efficiently.
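A small sketch of this behaviour, again with made-up data: the map transformation below does no work at all until the reduce action asks for a result.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyEvaluationDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyEvaluationDemo").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Transformation: Spark only records the lineage here, nothing is computed yet
        JavaRDD<Integer> squares = numbers.map(n -> {
            System.out.println("squaring " + n); // does not print until an action runs
            return n * n;
        });

        System.out.println("No work has happened yet.");

        // Action: this triggers execution of the recorded transformation
        int sumOfSquares = squares.reduce((a, b) -> a + b);
        System.out.println("Sum of squares: " + sumOfSquares);

        sc.stop();
    }
}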
Let’s re-examine the function declarations from our earlier Spark example to
identify which functions are actions and which are transformations:
The DAG (Directed Acyclic Graph) of transformations that Spark builds up allows it to optimize its execution plan and minimize shuffling.
We’ll discuss the DAG in greater depth in later posts, as it’s outside the scope
of this Spark overview.
. . .
Evaluation Loops
With our new vocabulary, let us re-examine the problem with MapReduce as I
defined it in Part I, quoted below:
“MapReduce excels at batch data processing, however it lags behind when it comes
to repeat analysis and small feedback loops. The only way to reuse data between
computations is to write it to an external storage system (a la HDFS).”
‘Re-use data between computations’? Sounds like an RDD that can have
multiple actions performed on it! Let's suppose we have a file “data.txt” and
want to accomplish two computations:
2. Map each line of ‘lines’ to its length (Lambda functions used for brevity)
3. To solve for total length: reduce lineLengths to find the total line length
sum, in this case the sum of every element in the RDD
Note that steps 3 and 4 are RDD actions, so they return a result to our Driver
program, in this case a Java int. Also recall that Spark is lazy and refuses to do
any work until it sees an action; in this case, it will not begin any real work until
step 3.
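Put together in Java (lambdas for brevity, per step 2), the flow described above might look like the sketch below; the second computation would simply be another action on lineLengths, with nothing written out to external storage in between:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LineLengths {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LineLengths").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load data.txt into an RDD of lines
        JavaRDD<String> lines = sc.textFile("data.txt");

        // Step 2: map each line to its length (a transformation, so still lazy)
        JavaRDD<Integer> lineLengths = lines.map(line -> line.length());

        // Step 3: reduce to the total length (an action, so the real work starts here)
        int totalLength = lineLengths.reduce((a, b) -> a + b);
        System.out.println("Total length: " + totalLength);

        // A second action could now reuse lineLengths for the other computation
        sc.stop();
    }
}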
. . .
Next Steps
So far we’ve introduced our data problem and its solution: Apache Spark. We
reviewed Spark’s architecture and workflow, its flagship internal abstraction
(the RDD), and its execution model. Next we’ll look into Functions and Syntax in
Java, getting progressively more technical as we dive deeper into the
framework.