RDD_programming

How to Use the Low-Level APIs?

·0 A SparkContext is the entry point for low-level API functionality

·1 You access it through the SparkSession, which is the tool you use to
perform computation across a Spark cluster

·2 You can access it as spark.sparkContext (see the example below)
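For example, in a Spark shell (where a SparkSession named spark is already
created for you), a minimal sketch of reaching the low-level entry point:

// in Scala
val sc = spark.sparkContext // the SparkContext behind the SparkSession
sc.defaultParallelism // e.g. inspect the default number of partitions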

·3 RDDs were the primary API in the Spark 1.X series and are still
available in 2.X, but they are not as commonly used

·4 virtually all Spark code you run, whether DataFrames or Datasets,
compiles down to an RDD

·5 In short, an RDD represents an immutable, partitioned collection of
records that can be operated on in parallel

·6 In RDDs, the records are just Java, Scala, or Python objects of the
programmer’s choosing.

·7 You can store anything you want in these objects, in any format you
want

But

·8 Spark does not understand the inner structure of the records, so
optimizations have to be done manually (we need to do them ourselves)

·9 Spark’s Structured APIs automatically store data in an optimized,
compressed binary format, so to achieve the same space-efficiency
and performance, you’d also need to implement this type of format
inside your objects and all the low-level operations to compute over
it

·10 optimizations like reordering filters and aggregations that occur
automatically in Spark SQL need to be implemented by hand

Types of RDDs

·11 you will notice that there are lots of subclasses of RDD. For the most
part, these are internal representations that the DataFrame API uses
to create optimized physical execution plans

·12 As a user, you will likely only be creating two types of RDDs: the
“generic” RDD type or a key-value RDD that provides additional
functions

·13 Both just represent a collection of objects, but key-value RDDs have
special operations as well as a concept of custom partitioning by key.

Partitioner

The Partitioner is probably one of the core reasons why you might want to
use RDDs in your code. Specifying your own custom Partitioner can give you
significant performance and stability improvements if you use it correctly

Note: there is no concept of “rows” in RDDs; individual records are just raw
Java/Scala/Python objects

Compared to Java/Scala, Python RDDs are slow. We serialize the data to the
Python process, operate on it in Python, and then serialize it back to the
Java Virtual Machine (JVM). This causes a high overhead for Python RDD
manipulations. In the Structured APIs, however, performance is the same
across languages.

Creating RDDs

From a Local Collection

·14 you will need to use the parallelize method on a SparkContext (within
a SparkSession).

·15 When creating this parallel collection, you can also explicitly state the
number of partitions into which you would like to distribute this
array. In this case, we are creating two partitions:

// in Scala
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple"
  .split(" ")

val words = spark.sparkContext.parallelize(myCollection, 2)

val lines = sc.parallelize(List("pandas", "i like pandas"))

An additional feature is that you can then name this RDD to show up in
the Spark UI according to a given name:

// in Scala

words.setName("myWords")

words.name // myWords

From Data Sources/External datasets

·16 Although you can create RDDs from data sources or text files, it’s
often preferable to use the Data Source APIs. RDDs do not have a
notion of “Data Source APIs” like DataFrames do; they primarily
define their dependency structures and lists of partitions.

·17 you can also read data as RDDs using sparkContext. For example, let’s
read a text file line by line:

spark.sparkContext.textFile("/some/path/withTextFiles")

·18 This creates an RDD for which each record in the RDD represents a
line in that text file or files.

you can read in data for which each text file should become a single record.
The use case here would be where each file consists of a large
JSON object or some document that you will operate on as an individual record.

spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")

In this RDD, the name of the file is the first object and the value of the text
file is the second string object.
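As a rough sketch (assuming a directory of text files at the path above), you
can take the (fileName, fileContents) pairs apart like this:

// in Scala
val files = spark.sparkContext.wholeTextFiles("/some/path/withTextFiles")
// each record is (fileName, fileContents); here we keep the name and the content length
files.map { case (fileName, contents) => (fileName, contents.length) }.take(5)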
Interoperating Between DataFrames, Datasets, and RDDs

·19 One of the easiest ways to get RDDs is from an existing DataFrame or
Dataset

·20 Converting these to an RDD is simple: just use the rdd method on any
of these data types.

·21 You’ll notice that if you do a conversion from a Dataset[T] to an RDD,
you’ll get the appropriate native type T back (remember this applies
only to Scala and Java):

// in Scala: converts a Dataset[Long] to RDD[Long]

spark.range(500).rdd

Because Python doesn’t have Datasets—it has only DataFrames—you will
get an RDD of type Row:

# in Python

spark.range(10).rdd

To operate on this data, you will need to convert this Row object to the
correct data type or extract values out of it, as shown in the example that
follows. This is now an RDD of type Row:

// in Scala

spark.range(10).toDF().rdd.map(rowObject => rowObject.getLong(0))

DataFrame or Dataset from an RDD (reverse)

·22 All you need to do is call the toDF method on the RDD

// in Scala

spark.range(10).rdd.toDF()

This command creates an RDD of type Row. This row is the internal Catalyst
format that Spark uses to represent data in the Structured APIs. This
functionality makes it possible for you to jump between the Structured and
low-level APIs as it suits your use case

Manipulating RDDs

Map

The map transformation evaluates a function for each input record and
emits a transformed output record.
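A small sketch using the words RDD created earlier (the tuple produced here
is just for illustration):

// in Scala
// map each word to (word, first character, whether it starts with "S")
words.map(word => (word, word(0), word.startsWith("S"))).take(3)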

FILTER()

Filtering is equivalent to creating a SQL-like where clause

Apply a user function: keep the item if the function returns true.

The resultant RDD has only the elements for which the condition is true, as
in the sketch below.
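For example, keeping only the words from our title that start with the
letter “S” (reusing the words RDD from before):

// in Scala
def startsWithS(individual: String) = individual.startsWith("S")
words.filter(word => startsWithS(word)).collect() // Array(Spark, Simple)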
FLATMAP

Return a new RDD by first applying a function to all elements of
this RDD, and then flattening the results.
Similar to map, but each input item can be mapped to 0 or more
output items (so func should return a Seq rather than a single
item).
Difference between map() and flatMap()
Map and flatMap are similar in the way that they take a line from
input RDD and apply a function on that line. The key difference
between map() and flatMap() is map() returns only one element,
while flatMap() can return a list of elements.
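A short sketch of the difference, again with the words RDD: map keeps one
output element per word, while flatMap flattens each word into its
individual characters:

// in Scala
words.map(word => word.toSeq).take(2) // one Seq of characters per word
words.flatMap(word => word.toSeq).take(5) // S, p, a, r, k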
Grouping, Sorting, and Distinct Functions

The groupBy transformation returns an RDD of items grouped by a specified
function. The function can simply nominate a key by which to
group all elements or specify an expression to be evaluated
against elements to determine a group (such as when grouping
elements by odd or even values of a numeric field in the data).

RDD.groupBy(<function>, numPartitions=None)

The numPartitions argument can be used to create a specified
number of partitions created automatically by computing hashes
from the output of the grouping function. For instance, if you
wanted to group an RDD by the days in a week and process each
day separately, you would specify numPartitions=7. You will see
numPartitions specified in numerous Spark transformations,
where its behavior is exactly the same.
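A minimal sketch of groupBy in Scala, grouping a small RDD of integers into
odd and even buckets and asking for two partitions (output order may vary):

// in Scala
val numbers = spark.sparkContext.parallelize(1 to 10)
val grouped = numbers.groupBy(n => n % 2 == 0, 2) // key: true for even, false for odd
grouped.collect()
// e.g. Array((false, CompactBuffer(1, 3, 5, 7, 9)), (true, CompactBuffer(2, 4, 6, 8, 10)))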

Caution: Consider Other Functions If You Are Grouping to Aggregate

If your ultimate intention in using groupBy is to aggregate values
(such as when performing a sum or count), there are more efficient
operators for this purpose in Spark, including aggregateByKey and
reduceByKey (which I will discuss in the section on Key Value Pair
Transformations).

The groupBy transformation does not perform any aggregation
prior to shuffling data, resulting in more data being shuffled.
Furthermore, groupBy requires that all the values for a given key
fit into memory. The groupBy transformation is useful in some
cases, but you should consider these factors before deciding to
use this function.
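As an illustration of that advice, a word count over the words RDD can be
expressed with reduceByKey, which combines values within each partition
before the shuffle, instead of grouping first and counting afterwards:

// in Scala
val pairs = words.map(word => (word.toLowerCase, 1))

// less efficient: every (word, 1) pair is shuffled, then counted per group
pairs.groupByKey().mapValues(iter => iter.size).collect()

// more efficient: partial sums are computed before the shuffle
pairs.reduceByKey(_ + _).collect()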

sortBy()

RDD.sortBy(<keyfunc>, ascending=True, numPartitions=None)

·23 The sortBy transformation sorts an RDD by the function that
nominates the key for a given dataset.

·24 It sorts according to the sort order of the key object type
(for instance, int values would be sorted numerically,
whereas string values would be sorted in lexicographical
order).

·25 The ascending argument is a Boolean argument defaulting to
True that specifies the sort order to be used. A descending
sort order is specified by setting ascending=False.

// in Scala

words.sortBy(word => word.length() * -1).take(2)

Random Splits (need to practice this one)

We can also randomly split an RDD into an Array of RDDs by using
the randomSplit method, which accepts an Array of weights and a
random seed:

// in Scala
val fiftyFiftySplit = words.randomSplit(Array[Double](0.5, 0.5))
This returns an array of RDDs that you can manipulate individually.
distinct()
The distinct transformation returns a new RDD containing distinct
elements from the input RDD. It is used to remove duplicates
where duplicates are defined as all elements or fields within a
record that are the same as another record in the dataset.
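A quick sketch: every word in our title is unique, so distinct leaves the count
unchanged, while a collection with duplicates is reduced to its unique values
(result order may vary):

// in Scala
words.distinct().count() // 10
spark.sparkContext.parallelize(Seq(1, 2, 2, 3, 3, 3)).distinct().collect() // Array(1, 2, 3)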
Set Operations (transformations)
·26 Set operations are conceptually similar to mathematical set
operations. Set functions operate against two RDDs and
result in one RDD
union()
·27 The union transformation takes one RDD and appends
another RDD to it, resulting in a combined output RDD
·28 The RDDs are not required to have the same schema or
structure. (For instance, the first RDD can have five fields
whereas the second can have more or less than five fields.)
·29 The union transformation does not filter duplicates from the
output RDD, in the case that two unioned RDDs have records
that are identical to one another. To filter duplicates, you
could follow the union transformation with the distinct
function discussed previously.
·30 The resultant RDD from a union operation is not sorted
either, but this could be accomplished by following union
with a sortBy function.
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.union(fibonacci).collect()
# [1, 3, 5, 7, 9, 0, 1, 2, 3, 5, 8]
intersection()
Syntax:
RDD.intersection(<otherRDD>)
·31 The intersection transformation returns elements that are
present in both RDDs. In other words, it returns the overlap
between two sets
·32 The elements or records must be identical in both sets, with
each respective record’s data structure and all of its fields
being identical in both RDDs
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.intersection(fibonacci).collect()
# [1, 3, 5]
subtract()
RDD.subtract(<otherRDD>, numPartitions=None)
·33 Returns all elements from the first RDD that are not present
in the second RDD. This is an implementation of
mathematical set subtraction.
odds = sc.parallelize([1,3,5,7,9])
fibonacci = sc.parallelize([0,1,2,3,5,8])
odds.subtract(fibonacci).collect()
# [7, 9]
Spark Actions
·34 Actions in Spark return values to the Spark driver program.
·35 They are typically the final step in a Spark program. Recall
that with lazy evaluation, the complete set of Spark
transformations in a program are only processed when an
action is requested.
Example RDD to work with:
val myCollection = "Spark The Definitive Guide : Big Data Processing Made Simple".split(" ")
val words = spark.sparkContext.parallelize(myCollection, 2)
The reduce and fold Actions
·36 The reduce and fold actions are aggregate actions, each of
which executes a commutative and associative operation
against an RDD.
commutative ⇒ x + y = y + x
associative ⇒ (x + y) + z = x + (y + z)
·37 This makes the operations independent of the order in which
they run, and this is integral to distributed processing
because the order cannot be guaranteed.

reduce
·38 You can use the reduce method to specify a function to
“reduce” an RDD of any kind of value to one value
·39 For instance, given a set of numbers, you can reduce this to
its sum by specifying a function that takes as input two
values and reduces them into one.
// in Scala
spark.sparkContext.parallelize(1 to 20).reduce(_ + _) // 210
You can also use this to get something like the longest word in our
set of words that we defined a moment ago. The key is just to
define the correct function:
// in Scala
def wordLengthReducer(leftWord: String, rightWord: String): String = {
  if (leftWord.length > rightWord.length)
    return leftWord
  else
    return rightWord
}

words.reduce(wordLengthReducer)
Example 2 (in Python):
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9])
numbers.reduce(lambda x, y: x + y)
# 45
fold()
RDD.fold(zeroValue, <function>)
·40 The fold action aggregates the elements of each partition of
an RDD, and then performs the aggregate operation against
the results from all the partitions, using a given associative
and commutative function and a zeroValue
The following example is a fold action with zeroValue=0:
numbers = sc.parallelize([1,2,3,4,5,6,7,8,9])
numbers.fold(0, lambda x, y: x + y)
# 45
·41 the fold looks exactly the same as the reduce action.
·42 The fold action provides a zeroValue that will be added to
the beginning and the end of the commutative and
associative function supplied as input to the fold action
·43 This allows fold to operate on an empty RDD, whereas
reduce will produce an exception with an empty RDD
result = zeroValue + (1 + 2 + 3 + ... + 9) + zeroValue
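A small sketch of the empty-RDD behaviour described above (the exact
exception message comes from Spark and may differ between versions):

// in Scala
val empty = spark.sparkContext.parallelize(Seq.empty[Int])
empty.fold(0)(_ + _) // 0: fold falls back to the zeroValue
// empty.reduce(_ + _) // would throw an exception ("empty collection")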
count()
·44 The count action takes no arguments.
·45 Note that with actions that take no arguments, you will need
to include empty parentheses, “()”, after the action name.
words.count()
COUNTAPPROX
·46 This is an approximation of the count method we just looked
at, but it must execute within a timeout (and can return
incomplete results if it exceeds the timeout).
·47 The confidence is the probability that the error bounds of
the result will contain the true value.
·48 If countApprox were called repeatedly with confidence 0.9,
we would expect 90% of the results to contain the true
count. The confidence must be in the range [0,1], or an
exception will be thrown:
val confidence = 0.95
val timeoutMilliseconds = 400
words.countApprox(timeoutMilliseconds, confidence)
COUNTAPPROXDISTINCT (not fully understood)
·49 There are two implementations of this, both based on
streamlib’s implementation of “HyperLogLog in Practice:
Algorithmic Engineering of a State-of-the-Art Cardinality
Estimation Algorithm.”
1) In the first implementation, the argument we pass into the
function is the relative accuracy. Smaller values create counters
that require more space. The value must be greater than
0.000017:
words.countApproxDistinct(0.05)
·50 With the other implementation you have a bit more control;
you specify the relative accuracy based on two parameters:
one for “regular” data and another for a sparse
representation.
·51 The two arguments are p and sp, where p is precision and sp
is sparse precision. The relative accuracy is approximately
1.054 / sqrt(2^p). Setting a nonzero sp (sp > p) can reduce the
memory consumption and increase accuracy when the
cardinality is small. Both values are integers:
words.countApproxDistinct(4, 10)
COUNTBYVALUE
·52 This method counts the number of values in a given RDD
·53 You should use this method only if the resulting map is
expected to be small because the entire thing is loaded into
the driver’s memory
·54 this method makes sense only in a scenario in which either
the total number of rows is low or the number of distinct
items is low
words.countByValue()
COUNTBYVALUEAPPROX
·55 This does the same thing as the previous function, but it
does so as an approximation. This must execute within the
specified timeout (first parameter) (and can return
incomplete results if it exceeds the timeout).
·56 The confidence is the probability that the error bounds of
the result will contain the true value
·57 if countApprox were called repeatedly with confidence 0.9,
we would expect 90% of the results to contain the true
count. The confidence must be in the range [0,1], or an
exception will be thrown:
words.countByValueApprox(1000, 0.95)
take()
·58 The take action returns the first n elements of an RDD
·59 The elements taken are not in any particular order; the
elements returned from a take action are non-deterministic,
meaning they can differ if the same action is run again
(because of the fully distributed environment)
·60 A similar Spark function, takeOrdered, takes the first n
elements ordered based upon a key supplied by a key
function.
·61 If the RDD has more than one partition, take scans one partition and
uses the results from that partition to estimate the number
of additional partitions needed to satisfy the number
requested.
# in Python
lorem = sc.textFile('file:///opt/spark/data/lorem.txt')
words = lorem.flatMap(lambda x: x.split())
words.take(3)
top()
·62 The top action returns the top n elements from an RDD, but
unlike take, the elements are ordered and returned in
descending order.
·63 Order is determined by the object type(numerical order for
integers or dictionary order for strings.)
·64 The key argument specifies the key by which to order the
results to return the top n elements.
RDD.top(n, key=None)
lorem = sc.textFile('file:///opt/spark/data/lorem.txt')
words = lorem.flatMap(lambda x: x.split())
words.top(3)
first() (need to check)
·65 The first action returns the first element in this RDD. Similar
to take and collect and unlike top, first does not consider the
order of elements and is a non-deterministic operation
·66 the primary difference between first and take(1) is that first
returns an atomic data element, take (even if n = 1) returns a
list of data elements.
·67 The first action is useful for inspecting the output of an RDD
as part of development or data exploration
words.first()
max and min
max and min return the maximum values and minimum values,
respectively:
spark.sparkContext.parallelize(1 to 20).max()
spark.sparkContext.parallelize(1 to 20).min()
Data Sampling with Spark
sample():Transformation
RDD.sample(withReplacement, fraction, seed=None)
·68 The sample transformation is used to create a sampled
subset RDD from an original RDD based upon a percentage
of the overall dataset.
·69 The withReplacement argument is a Boolean value
specifying whether elements in an RDD can be sampled
multiple times.
·70 The fraction argument is a double value between 0 and 1
that represents the probability that an element will be
chosen. Effectively, this represents the approximate
percentage of the dataset you wish to be returned to the
resultant sampled RDD. Note that if you specify a value
larger than 1 for this argument it will default back to 1.
·71 The optional seed argument is an integer representing a
seed for the random number generator that is used to
determine whether to include an element in the return RDD.
E.g., the sample transformation used to create an approximate 10 percent
subset of web log events from a corpus of web logs:
logs = sc.textFile('file:///opt/spark/data/weblogs')
logs.count()
# returns 1,051,105
sampled_logs = logs.sample(False, 0.1, seed=None)
sampled_logs.count()
# returns 106,020 (10.09% of the original RDD)
takeSample():Action
RDD.takeSample(withReplacement, num, seed=None)
·72 The takeSample action is used to return a random list of
values (elements or records) from the RDD being sampled.
·73 The num argument is the number of randomly selected
records to be returned.
dataset = sc.parallelize([1,2,3,4,5,6,7,8,9,10])
dataset.takeSample(False, 3)
# returned [6, 10, 2]
Foreach()
·74 The foreach action applies a function (anonymous or named)
to all elements of an RDD.
·75 Because foreach is an action rather than a transformation,
you can perform functions otherwise not possible (or
intended) in transformations, such as a print function
·76 Unlike other actions, foreach does not return any value to the
driver. It simply operates on all the elements in the RDD.
foreach() can be used in situations where we do not want to
return any result, but want to initiate a computation
·77 Keep in mind that foreach is an action, so it will trigger
evaluation of the RDD’s entire lineage. If this is not what you
intend, then map may be a better option
rdd.foreach(func)
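A minimal sketch; note that on a real cluster the println output appears in
the executors’ logs, not on the driver console:

// in Scala
words.foreach(word => println(word)) // side effect only; nothing is returned to the driver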
collect()
·78 The simplest and most common operation that returns data
to our driver program is collect()
·79 which returns the entire RDD’s contents. collect() is
commonly used in unit tests where the entire contents of
the RDD are expected to fit in memory, as that makes it easy
to compare the value of our RDD with our expected result.
·80 collect() suffers from the restriction that all of your data
must fit on a single machine, as it all needs to be copied to
the driver.

Saving Files(Actions)
·81 Saving files means writing to plain-text files. With RDDs, you
cannot actually “save” to a data source in the conventional
sense
·82 You must iterate over the partitions in order to save the
contents of each partition to some external database.
·83 This is a low-level approach that reveals the underlying
operation that is being performed in the higher-level APIs.
Spark will take each partition, and write that out to the
destination.
saveAsTextFile
To save to a text file, you just specify a path and optionally a
compression codec:
words.saveAsTextFile("file:/tmp/bookTitle")
To set a compression codec, we must import the proper codec
from Hadoop. You can find these in the
org.apache.hadoop.io.compress library:
// in Scala
import org.apache.hadoop.io.compress.BZip2Codec
words.saveAsTextFile("file:/tmp/bookTitleCompressed",
classOf[BZip2Codec])
SequenceFiles
A sequenceFile is a flat file consisting of binary key–value pairs. It
is extensively used in MapReduce as input/output formats.
Spark can write to sequenceFiles using the saveAsObjectFile
method or by explicitly writing key–value pairs
words.saveAsObjectFile("/tmp/my/sequenceFilePath")
There are a variety of different Hadoop file formats to which you
can save. These allow you to specify classes, output formats,
Hadoop configurations, and compression schemes. (For
information on these formats, read Hadoop: The Definitive Guide
[O’Reilly, 2015].) These formats are largely irrelevant except if
you’re working deeply in the Hadoop ecosystem or with some
legacy mapReduce jobs.
Checkpointing
·84 One feature not available in the DataFrame API is the
concept of checkpointing
·85 Checkpointing is the act of saving an RDD to disk so that
future references to this RDD point to those intermediate
partitions on disk rather than recomputing the RDD from its
original source.
·86 This is similar to caching except that it’s not stored in
memory, only disk. This can be helpful when performing
iterative computation, similar to the use cases for caching:
spark.sparkContext.setCheckpointDir("/some/path/for/checkpointing")
words.checkpoint()
Now, when we reference this RDD, it will derive from the
checkpoint instead of the source data. This can be a helpful
optimization.
Pipe RDDs to System Commands (this needs more study)
With pipe you can return an RDD created by piping each partition of the
RDD to an external command: each element is passed to the process as a
line of input, and the lines the process prints become the elements of the
resulting RDD. Here, wc -l returns one line count per partition:
words.pipe("wc -l").collect()
mapPartitions
·87 The return signature of a map function on an RDD is actually
MapPartitionsRDD
·88 This is because map is just a row-wise alias for
mapPartitions, which makes it possible for you to map an
individual partition (represented as an iterator).
·89 That’s because physically on the cluster we operate on each
partition individually (and not a specific row).
·90 Asimple example creates the value “1” for every partition in
our data, and the sum of the following expression will count
the number of partitions we have:
words.mapPartitions(part => Iterator[Int](1)).sum() // 2
·91 This means that we operate on a per-partition basis and can
perform an operation on that entire partition.
·92 This is valuable for performing something on an entire
subdataset of your RDD. You can gather all values of a
partition class or group into one partition and then operate
on that entire group using arbitrary functions and controls, as
shown below.
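As a sketch of working on a whole partition at once (assuming the
two-partition words RDD from earlier), this joins all the words of each
partition into a single string, producing one record per partition:

// in Scala
words.mapPartitions(part => Iterator(part.mkString(" "))).collect()
// e.g. Array("Spark The Definitive Guide :", "Big Data Processing Made Simple")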
mapPartitionsWithIndex.
·93 With this you specify a function that accepts an index (within
the partition) and an iterator that goes through all items
within the partition.
·94 The partition index is the partition number in your RDD,
which identifies where each record in our dataset sits (and
potentially allows you to debug). You might use this to test
whether your map functions are behaving correctly:
// in Scala
def indexedFunc(partitionIndex: Int, withinPartIterator: Iterator[String]) = {
  withinPartIterator.toList.map(
    value => s"Partition: $partitionIndex => $value").iterator
}
words.mapPartitionsWithIndex(indexedFunc).collect()
foreachPartition
·95 Although mapPartitions needs a return value to work
properly, this next function does not.
·96 foreachPartition simply iterates over all the partitions of the
data. The difference is that the function has no return value.
·97 This makes it great for doing something with each partition,
like writing it out to a database
·98 In fact, this is how many data source connectors are written.
You can create your own text file source if you want by
specifying outputs to the temp directory with a random ID:
words.foreachPartition { iter =>
  import java.io._
  import scala.util.Random
  val randomFileName = new Random().nextInt()
  val pw = new PrintWriter(new File(s"/tmp/random-file-${randomFileName}.txt"))
  while (iter.hasNext) {
    pw.write(iter.next())
  }
  pw.close()
}
You’ll find these two files if you scan your /tmp directory.
glom
·99 glom takes every partition in your dataset and converts it to
an array
·100 This can be useful if you’re going to collect the data to
the driver and want to have an array for each partition.
·101 However, this can cause serious stability issues because if you
have large partitions or a large number of partitions, it’s
simple to crash the driver.
·102 In the following example, you can see that we get two
partitions and each word falls into one partition each:
// in Scala
spark.sparkContext.parallelize(Seq("Hello", "World"), 2).glom().collect()
// Array(Array(Hello), Array(World))
Passing Functions to Spark (need to know about this)
·103 Most of Spark’s transformations, and some of its
actions, depend on passing in functions that are used by
Spark to compute data.
·104 Each of the core languages(python/java/scala) has a
slightly different mechanism for passing functions to Spark.
Scala
·105 There are two recommended ways to do this:
1) Anonymous function syntax, which can be used for short pieces
of code, for example:
(x: Int) => x + 1
2) Static methods in a global singleton object. For example, you
can define object MyFunctions and then pass MyFunctions.func1,
as follows:
object MyFunctions {
  def func1(s: String): String = { ... }
}
myRdd.map(MyFunctions.func1)

Note
it is also possible to pass a reference to a method in a class
instance (as opposed to a singleton object), this requires sending
the object that contains that class along with the method
class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}
Here, if we create a new MyClass and call doStuff on it, the map
inside there references the func1 method of that MyClass
instance, so the whole object needs to be sent to the cluster. It is
similar to writing rdd.map(x => this.func1(x)).
In a similar way, accessing fields of the outer object will
reference the whole object:
class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
is equivalent to writing rdd.map(x => this.field + x), which
references all of this. To avoid this issue, the simplest way is to
copy field into a local variable instead of accessing it externally:
def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
