
Big Data Analysis Workshop:
From Hadoop to Spark,
Introduction of Zeppelin Notebook

Drs. Weijia Xu, Ruizhu Huang and Amit Gupta
Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin

Sept. 28-29, 2017
Atlanta, GA
From Hadoop to Spark
•  Shortcomings of Hadoop and MapReduce
   •  Disk-based operations
      •  All intermediate results are written to disk
      •  Why?
         •  Shuffle-sort between the map and reduce phases
      •  This often becomes the bottleneck of an execution
   •  Lacks flexibility for fine-grained parallelism
      •  Each map and reduce task assumes one core
      •  High memory requirements may prevent using all the cores available on a host
   •  Best suited for simple, batch, disk-based processing
      •  Hard to use for interactive analysis, streaming analysis, …
From Hadoop to Spark
•  Apache Spark is a cluster computing platform designed to be fast and general purpose.
•  Speed:
   •  runs computations in memory (see the sketch after this list),
   •  faster than MapReduce running on disk
•  Generality:
   •  supports batch, interactive and streaming workloads
•  Accessibility:
   •  simple APIs in Python, Java, Scala and SQL, plus built-in libraries
•  Usability:
   •  the Spark programming model is more expressive than the MapReduce model.
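As a minimal sketch of the speed point (reusing the book.txt file that appears in the WordCount example later, in a Zeppelin %spark paragraph): caching an RDD keeps it in executor memory, so repeated actions reuse it instead of re-reading it from disk the way chained MapReduce jobs must.

%spark
// Read the file once and mark it for in-memory caching
val lines = sc.textFile("book.txt").cache()

// The first action materializes the RDD (reads from disk); later actions
// reuse the cached, in-memory copy instead of writing and re-reading
// intermediate results on disk.
val totalLines = lines.count()
val longLines  = lines.filter(_.length > 80).count()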
A Core Concept in Spark: RDD
•  A Resilient Distributed Dataset (RDD)
   •  is the basic data abstraction in Spark,
   •  and represents an immutable, partitioned collection.
•  An RDD contains a set of partitions
   •  whose elements can be operated on in parallel (illustrated below).
•  An RDD is immutable
   •  Immutable: its value cannot be changed once created.
   •  Why?
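A minimal sketch of these two properties, assuming a Zeppelin %spark paragraph and small illustrative values: the collection is split into partitions that can be processed in parallel, and a transformation returns a new RDD instead of modifying the original.

%spark
// An RDD is a partitioned collection: 8 elements spread over 4 partitions
val rdd = sc.parallelize(1 to 8, numSlices = 4)
println(rdd.getNumPartitions)            // 4

// Transformations never change an RDD in place; they return a new RDD
val doubled = rdd.map(_ * 2)
println(rdd.collect().mkString(","))     // 1,2,3,4,5,6,7,8  (unchanged)
println(doubled.collect().mkString(",")) // 2,4,6,8,10,12,14,16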
A Core Concept in Spark: RDD
•  Why is an RDD immutable?
   •  Safe to share across multiple processes
   •  Easy to move around among resources and to cache
   •  Simplifies development
•  An RDD can be recreated at any time from
   •  its list of dependencies, and
   •  the function that computes each partition from its parent.
•  Think of an RDD as a deterministic function rather than a data object (as sketched below).
   •  Why?
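A minimal sketch of that view, in the same %spark context: defining squared only records its parent RDD and the function to apply to each partition; the values are computed, and can be recomputed at any time, when an action asks for them.

%spark
val nums = sc.parallelize(1 to 8, 4)

// Nothing is computed here: squared is just a recipe, i.e. a dependency
// on nums plus the function applied to each partition
val squared = nums.map(n => n * n)

// An action triggers the computation; re-running it re-evaluates the same
// deterministic lineage and always produces the same result
println(squared.sum())   // 204.0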
RDD: From Centralized to Distributed Data Structure

List(1,2,3,4,5,6,7,8)
   →  List(1,2)   List(3,4)   List(5,6)   List(7,8)   (split across four machines)

RDD: From Centralized to Distributed Data Structure

List(1,2,3,4,5,6,7,8)
   →  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
   =  RDD(1,2,3,4,5,6,7,8)

Courtesy image from https://cdn0.iconfinder.com/data/icons/hardware-outline-icons/60/Hardware_Hardware-01-512.p

RDD Processing

List(1,2,3,4,5,6,7,8)
   RDD:  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
             | sum             | sum             | sum             | sum
   RDD:  Partition(3)     Partition(7)     Partition(11)    Partition(15)
List(3, 7, 11, 15)
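The diagram above can be reproduced in a few lines (a sketch, assuming the %spark interpreter): the list is parallelized into four partitions and each partition is summed independently, in parallel.

%spark
// List(1..8) distributed over 4 partitions: (1,2) (3,4) (5,6) (7,8)
val rdd = sc.parallelize(1 to 8, 4)

// Sum each partition on its own worker
val partSums = rdd.mapPartitions(it => Iterator(it.sum))
println(partSums.collect().mkString(", "))   // 3, 7, 11, 15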


From Centralized to Distributed

   RDD:  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
             | sum             | sum             | sum             | sum
   RDD:  Partition(3)     Partition(7)     Partition(11)    Partition(15)

Fault tolerance: if the data on one computer is lost, the system should be able to recover it (see the sketch below).
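A hedged illustration of the mechanism behind this (reusing book.txt from the WordCount example): every RDD records its lineage, i.e. its parent RDDs and the function that produces its partitions, and Spark uses that lineage, rather than data replication, to recompute a lost partition. The lineage can be inspected with toDebugString.

%spark
val counts = sc.textFile("book.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs; if a partition of counts is lost,
// Spark re-runs exactly this chain for that partition
println(counts.toDebugString)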
The Spark Stack
Introduction of Zeppelin
•  A web-based notebook that enables interactive data analytics and visualization.
•  Multiple language backends (example below)
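As an illustrative sketch (hypothetical fruit data; the %table output convention is the same one used by the WordCount examples later): each notebook paragraph starts with an interpreter directive such as %spark, %pyspark, %sql or %md, and output beginning with %table is rendered as an interactive table or chart in the browser.

%spark
val fruits = sc.parallelize(Seq(("apple", 3), ("pear", 5), ("plum", 2)))

// Output starting with %table is displayed as a table/chart in Zeppelin
print("%table Fruit\tCount\n" +
  fruits.map { case (f, c) => f + "\t" + c }.collect().mkString("\n"))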
Start Zeppelin on Wrangler
login1.wrangler(7)$ sbatch --reservation=hadoop+TRAINING-OPEN+2375 /data/apps/zeppelin_user/job.zeppelin
login1.wrangler(8)$ cat zeppelin.out

Wait 5 minutes for the Zeppelin UI to start, then copy and paste
http://wrangler.tacc.utexas.edu:XXXXX into your web browser.

Use your TACC credentials to log in to Zeppelin.

Note: wait 8 minutes and refresh your web browser

Use your TACC credentials.

Using Spark with Scala

WordCount - Scala
%spark
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Load the text and the stop-word list as RDDs
val textFile  = sc.textFile("book.txt")
val stopWords = sc.textFile("stopwords.txt")

// Collect the stop words to the driver and broadcast them to every executor
val stopWordSet   = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)

// Alternative (not used here): drop stop words with an RDD subtract
//   textFile.flatMap(_.toLowerCase.split(" ")).subtract(stopWords).take(100)

// Split lines into lower-case words, drop stop words and empty strings,
// then count each word with map/reduceByKey
val wordCounts = textFile.flatMap(_.toLowerCase.split(" "))
  .filter(w => !stopWordSetBC.value.contains(w) && w != "")
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

// The 50 most frequent words, formatted as tab-separated rows
val top50 = wordCounts.sortBy(_._2, ascending = false)
  .map { case (word: String, count: Int) => word + "\t" + count }
  .take(50)

// Output starting with %table is rendered as a table/chart by Zeppelin
print("%table Word\t Count\n" + top50.mkString("\n"))
Using Spark with Python

WordCount - Python
%pyspark
# Load the text and the stop-word list as RDDs
textFile = sc.textFile("book.txt")
stopWords = sc.textFile("stopwords.txt")

# Split lines into lower-case words, remove stop words with an RDD subtract,
# then count each word with map/reduceByKey
wordCounts = textFile.flatMap(lambda line: line.lower().split()) \
    .subtract(stopWords) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# The 50 most frequent words, formatted as tab-separated rows
top50 = wordCounts.sortBy(lambda a: a[1], ascending=False) \
    .map(lambda x: x[0] + "\t" + str(x[1])) \
    .take(50)

# Output starting with %table is rendered as a table/chart by Zeppelin
print("%table Word\t Count\n" + "\n".join(top50))


Using Spark with R

Try Zeppelin
Import Zeppelin_Intro.json
