
Big Data Analysis Workshop:
From Hadoop to Spark,
Introduction of Zeppelin Notebook

Drs. Weijia Xu, Ruizhu Huang and Amit Gupta
Data Mining & Statistics Group
Texas Advanced Computing Center
University of Texas at Austin

Sept. 28-29, 2017
Atlanta, GA
From Hadoop to Spark
•  Shortcomings of Hadoop and MapReduce
   •  Disk-based operations
      •  All intermediate results are written to disk
      •  Why?
         •  Shuffle-sort between the map and reduce phases
      •  This often becomes the bottleneck of an execution
   •  Lacks flexibility for fine-grained parallelism
      •  Each map and reduce task assumes one core
      •  High memory requirements may prevent using all the cores available on a host
   •  Best suited for simple, batch, disk-based processing
      •  Hard to use for interactive analysis, streaming analysis, …
From Hadoop to Spark
•  Apache Spark is a cluster computing platform designed to be fast and general purpose.
•  Speed:
   •  runs computations in memory (see the sketch after this list),
   •  faster than MapReduce running on disk
•  Generality:
   •  supports batch, interactive and streaming workloads
•  Accessibility:
   •  simple APIs in Python, Java, Scala and SQL, plus built-in libraries
•  Usability:
   •  the Spark programming model is more expressive than the MapReduce model.
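As a minimal sketch of the speed point (reusing the book.txt file that appears in the WordCount example later, in a Zeppelin %spark paragraph): caching an RDD keeps it in executor memory, so repeated actions reuse it instead of re-reading it from disk the way chained MapReduce jobs must.

%spark
// Read the file once and mark it for in-memory caching
val lines = sc.textFile("book.txt").cache()

// The first action materializes the RDD (reads from disk); later actions
// reuse the cached, in-memory copy instead of writing and re-reading
// intermediate results on disk.
val totalLines = lines.count()
val longLines  = lines.filter(_.length > 80).count()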
A Core Concept in Spark: RDD
•  A Resilient Distributed Dataset (RDD)
   •  is the basic data abstraction in Spark,
   •  and represents an immutable, partitioned collection.
•  An RDD contains a set of partitions
   •  whose elements can be operated on in parallel (illustrated below).
•  An RDD is immutable
   •  Immutable: its value cannot be changed once created.
   •  Why?
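A minimal sketch of these two properties, assuming a Zeppelin %spark paragraph and small illustrative values: the collection is split into partitions that can be processed in parallel, and a transformation returns a new RDD instead of modifying the original.

%spark
// An RDD is a partitioned collection: 8 elements spread over 4 partitions
val rdd = sc.parallelize(1 to 8, numSlices = 4)
println(rdd.getNumPartitions)            // 4

// Transformations never change an RDD in place; they return a new RDD
val doubled = rdd.map(_ * 2)
println(rdd.collect().mkString(","))     // 1,2,3,4,5,6,7,8  (unchanged)
println(doubled.collect().mkString(",")) // 2,4,6,8,10,12,14,16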
A Core Concept in Spark: RDD
•  Why is an RDD immutable?
   •  Safe to share across multiple processes
   •  Easy to move around among resources and to cache
   •  Simplifies development
•  An RDD can be recreated at any time from
   •  its list of dependencies, and
   •  the function that computes each partition from its parent.
•  Think of an RDD as a deterministic function rather than a data object (as sketched below).
   •  Why?
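A minimal sketch of that view, in the same %spark context: defining squared only records its parent RDD and the function to apply to each partition; the values are computed, and can be recomputed at any time, when an action asks for them.

%spark
val nums = sc.parallelize(1 to 8, 4)

// Nothing is computed here: squared is just a recipe, i.e. a dependency
// on nums plus the function applied to each partition
val squared = nums.map(n => n * n)

// An action triggers the computation; re-running it re-evaluates the same
// deterministic lineage and always produces the same result
println(squared.sum())   // 204.0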
RDD: From Centralized to Distributed Data Structure

List(1,2,3,4,5,6,7,8)
   →  List(1,2)   List(3,4)   List(5,6)   List(7,8)   (split across four machines)

RDD: From Centralized to Distributed Data Structure

List(1,2,3,4,5,6,7,8)
   →  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
   =  RDD(1,2,3,4,5,6,7,8)

Courtesy image from https://cdn0.iconfinder.com/data/icons/hardware-outline-icons/60/Hardware_Hardware-01-512.p

RDD Processing

List(1,2,3,4,5,6,7,8)
   RDD:  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
             | sum             | sum             | sum             | sum
   RDD:  Partition(3)     Partition(7)     Partition(11)    Partition(15)
List(3, 7, 11, 15)
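The diagram above can be reproduced in a few lines (a sketch, assuming the %spark interpreter): the list is parallelized into four partitions and each partition is summed independently, in parallel.

%spark
// List(1..8) distributed over 4 partitions: (1,2) (3,4) (5,6) (7,8)
val rdd = sc.parallelize(1 to 8, 4)

// Sum each partition on its own worker
val partSums = rdd.mapPartitions(it => Iterator(it.sum))
println(partSums.collect().mkString(", "))   // 3, 7, 11, 15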


From Centralized to Distributed

   RDD:  Partition(1,2)   Partition(3,4)   Partition(5,6)   Partition(7,8)
             | sum             | sum             | sum             | sum
   RDD:  Partition(3)     Partition(7)     Partition(11)    Partition(15)

Fault tolerance: if the data on one computer is lost, the system should be able to recover it (see the sketch below).
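A hedged illustration of the mechanism behind this (reusing book.txt from the WordCount example): every RDD records its lineage, i.e. its parent RDDs and the function that produces its partitions, and Spark uses that lineage, rather than data replication, to recompute a lost partition. The lineage can be inspected with toDebugString.

%spark
val counts = sc.textFile("book.txt")
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .reduceByKey(_ + _)

// Prints the chain of parent RDDs; if a partition of counts is lost,
// Spark re-runs exactly this chain for that partition
println(counts.toDebugString)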
The Spark Stack
Introduction of Zeppelin
•  A web-based notebook that enables interactive data analytics and visualization.
•  Multiple language backends (example below)
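As an illustrative sketch (hypothetical fruit data; the %table output convention is the same one used by the WordCount examples later): each notebook paragraph starts with an interpreter directive such as %spark, %pyspark, %sql or %md, and output beginning with %table is rendered as an interactive table or chart in the browser.

%spark
val fruits = sc.parallelize(Seq(("apple", 3), ("pear", 5), ("plum", 2)))

// Output starting with %table is displayed as a table/chart in Zeppelin
print("%table Fruit\tCount\n" +
  fruits.map { case (f, c) => f + "\t" + c }.collect().mkString("\n"))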
Start Zeppelin on Wrangler
login1.wrangler(7)$ sbatch --reservation=hadoop+TRAINING-OPEN+2375 /data/apps/zeppelin_user/job.zeppelin
login1.wrangler(8)$ cat zeppelin.out

Wait 5 minutes for the Zeppelin UI to start, then copy and paste
http://wrangler.tacc.utexas.edu:XXXXX into your web browser.

Use your TACC credentials to log in to Zeppelin.

Note: wait 8 minutes and refresh your web browser

Use your TACC credentials.

Using Spark with Scala

WordCount - Scala
%spark
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Load the text and the stop-word list as RDDs
val textFile  = sc.textFile("book.txt")
val stopWords = sc.textFile("stopwords.txt")

// Collect the stop words to the driver and broadcast them to every executor
val stopWordSet   = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)

// Alternative (not used here): drop stop words with an RDD subtract
//   textFile.flatMap(_.toLowerCase.split(" ")).subtract(stopWords).take(100)

// Split lines into lower-case words, drop stop words and empty strings,
// then count each word with map/reduceByKey
val wordCounts = textFile.flatMap(_.toLowerCase.split(" "))
  .filter(w => !stopWordSetBC.value.contains(w) && w != "")
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)

// The 50 most frequent words, formatted as tab-separated rows
val top50 = wordCounts.sortBy(_._2, ascending = false)
  .map { case (word: String, count: Int) => word + "\t" + count }
  .take(50)

// Output starting with %table is rendered as a table/chart by Zeppelin
print("%table Word\t Count\n" + top50.mkString("\n"))
Using Spark with Python

WordCount - Python
%pyspark
# Load the text and the stop-word list as RDDs
textFile = sc.textFile("book.txt")
stopWords = sc.textFile("stopwords.txt")

# Split lines into lower-case words, remove stop words with an RDD subtract,
# then count each word with map/reduceByKey
wordCounts = textFile.flatMap(lambda line: line.lower().split()) \
    .subtract(stopWords) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

# The 50 most frequent words, formatted as tab-separated rows
top50 = wordCounts.sortBy(lambda a: a[1], ascending=False) \
    .map(lambda x: x[0] + "\t" + str(x[1])) \
    .take(50)

# Output starting with %table is rendered as a table/chart by Zeppelin
print("%table Word\t Count\n" + "\n".join(top50))


Using Spark with R

Try Zeppelin
Import Zeppelin_Intro.json
