• Speed:
• runs computations in memory,
• faster than MapReduce on disk
• Generality:
• supports batch, interactive, and streaming workloads
• Accessibility:
• simple APIs in Python, Java, Scala, and SQL, plus built-in libraries
• Usability:
• the Spark programming model is more expressive than the MapReduce model
A Core Concept in Spark:
RDD
• A Resilient Distributed Dataset (RDD),
• the basic data abstraction in Spark,
• represents an immutable, partitioned collection of elements
• An RDD is immutable
• Immutable: its value cannot be changed once created.
• Why?
A Core Concept in Spark:
RDD
• Why is an RDD immutable?
• Safe to share across multiple processes
• Easy to move among resources and to cache
• Simplifies development
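Because RDDs are immutable, transformations never modify an existing RDD; each one returns a new RDD, which is what makes sharing and caching safe. A plain-Python analogy of the same idea (an immutable tuple, not the Spark API):

```python
# Plain-Python analogy of RDD immutability (not the Spark API).
base = (3, 11, 7, 15)                  # an immutable collection, like an RDD
doubled = tuple(x * 2 for x in base)   # a "transformation" builds a NEW collection

print(base)     # the original is untouched: (3, 11, 7, 15)
print(doubled)  # (6, 22, 14, 30)
```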
[Figure: computing a sum over an RDD — each partition produces a local partial sum (Partition(3), Partition(11), Partition(7), Partition(15)), and the partial sums are then combined into the final sum.]
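The per-partition summing shown above can be sketched in plain Python (the partition contents below are hypothetical, chosen so the local sums match the figure's values 3, 11, 7, and 15; on a cluster, step 1 runs in parallel across executors):

```python
from functools import reduce

# Hypothetical partition contents; each partition's local sum matches the figure.
partitions = [[1, 2], [5, 6], [3, 4], [7, 8]]

# Step 1: each partition computes a local partial sum (in parallel on a cluster).
partial_sums = [sum(p) for p in partitions]
print(partial_sums)  # [3, 11, 7, 15]

# Step 2: the driver combines the partial sums into the final result.
total = reduce(lambda a, b: a + b, partial_sums)
print(total)  # 36
```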
login1.wrangler(8)$cat zeppelin.out
Note: wait 8 minutes and refresh your web browser
Use your TACC credential
Using Spark with Scala
WordCount - Scala
%spark
import org.apache.spark.rdd.RDD

val textFile = sc.textFile("book.txt")
val stopWords = sc.textFile("stopwords.txt")

// Collect the (small) stop-word list to the driver and broadcast it to all executors.
val stopWordSet = stopWords.collect.toSet
val stopWordSetBC = sc.broadcast(stopWordSet)

// textFile.flatMap(_.toLowerCase.split(" ")).subtract(stopWords).take(100)
val wordCounts = textFile
  .flatMap(_.toLowerCase.split(" "))
  .filter(w => !stopWordSetBC.value.contains(w) && w != "")
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)
WordCount - Python
%pyspark
textFile = sc.textFile("book.txt")
stopWords = sc.textFile("stopwords.txt")
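The Python snippet is truncated here; the remaining steps mirror the Scala pipeline (flatMap to words, filter out stop words and empty strings, map to (word, 1), reduceByKey). A plain-Python sketch of what that pipeline computes, using hypothetical in-memory inputs in place of the Spark files:

```python
# Hypothetical stand-ins for book.txt and stopwords.txt.
lines = ["The cat sat on the mat", "the mat was flat"]
stop_words = {"the", "on", "was"}

# flatMap: split every line into lowercase words.
words = [w for line in lines for w in line.lower().split(" ")]

# filter: drop stop words and empty strings.
words = [w for w in words if w not in stop_words and w != ""]

# map + reduceByKey: count occurrences per word.
word_counts = {}
for word, n in [(w, 1) for w in words]:
    word_counts[word] = word_counts.get(word, 0) + n

print(word_counts)  # {'cat': 1, 'sat': 1, 'mat': 2, 'flat': 1}
```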
Try Zeppelin
Import Zeppelin_Intro.json