
Presentation on Apache Spark

By
B V S Mridula (1039060)
Milind Baluni (1003192)
G. Ganesh ()
Dilip Payra ()
Sameer Nayak ()

Introduction to Apache Spark

It is a framework for performing general data analytics on a distributed
computing cluster such as Hadoop.
It provides in-memory computation for increased speed and faster data
processing compared to MapReduce.
It runs on top of an existing Hadoop cluster and accesses the Hadoop data
store, HDFS (see the sketch below).
It can also process structured data in Hive and streaming data from
HDFS, Flume, Kafka, and Twitter.
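
A minimal sketch of this setup, assuming Spark is deployed against a Hadoop/YARN cluster (the application name and HDFS path below are illustrative, not from the slides); it creates the SparkContext sc that the later code snippets rely on and reads a file straight out of HDFS:

import org.apache.spark.{SparkConf, SparkContext}

// "yarn" assumes Spark runs on top of an existing Hadoop/YARN cluster;
// use "local[*]" instead to experiment on a single machine.
val conf = new SparkConf().setAppName("SparkIntro").setMaster("yarn")
val sc = new SparkContext(conf)

// Read data directly from the Hadoop data store (the path is illustrative).
val logs = sc.textFile("hdfs:///user/demo/logs/app.log")
println(logs.count())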

High-Productivity Language Support

Native support for multiple languages with identical APIs
Use of closures, iterations, and other common language constructs to minimize code
Unified API for batch and streaming (see the streaming sketch after the Java example below)

Python

lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala

val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()

Java

JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
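
As a hedged illustration of the unified API for batch and streaming (the socket source, host/port, and 10-second batch interval are assumptions made for the example), the same filter-and-count logic can be applied to a live stream with Spark Streaming:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Wrap the existing SparkContext in a StreamingContext with 10-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(10))

// Illustrative source: lines of text arriving on a local socket.
val stream = ssc.socketTextStream("localhost", 9999)

// Same filter/count logic as the batch examples, applied to each batch of the stream.
stream.filter(_.contains("ERROR")).count().print()

ssc.start()
ssc.awaitTermination()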

Why is Apache Spark used?

Apache Spark is best suited for iterative jobs such as machine learning.
Spark keeps the data a job works on in memory as RDDs (Resilient
Distributed Datasets).
Once the data has been loaded into this in-memory copy, it no longer has to
be reloaded from disk on every pass, which gives a tremendous increase in
speed.
The in-memory copy holds the frequently used data of the job (see the
caching sketch below).
Spark runs programs up to 100x faster than Hadoop MapReduce in memory, or
10x faster on disk.
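
A minimal sketch of this idea (the HDFS path and the number of iterations are assumptions for illustration): caching an RDD keeps it in memory across repeated actions, so an iterative job reads the data from disk only once.

// Load the data once from HDFS and cache the RDD in memory (path is illustrative).
val points = sc.textFile("hdfs:///user/demo/points.txt").cache()

// Each iteration reuses the in-memory copy instead of re-reading from disk.
for (i <- 1 to 10) {
  val errors = points.filter(_.contains("ERROR")).count()
  println(s"iteration $i: $errors error lines")
}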

Spark libraries

Spark SQL:
Spark's module for working with structured data (see the sketch after this list).
Spark Streaming:
Makes it easy to build scalable, fault-tolerant streaming applications.
MLlib:
Apache Spark's scalable machine learning library.
GraphX:
Apache Spark's API for graphs and graph-parallel computation.
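
As a hedged example of the Spark SQL module (the JSON file, table name, and query are assumptions for illustration; newer Spark versions expose the same functionality through SparkSession):

import org.apache.spark.sql.SQLContext

// Create a SQLContext on top of the existing SparkContext.
val sqlContext = new SQLContext(sc)

// Load structured data (an illustrative JSON file) as a DataFrame.
val people = sqlContext.read.json("hdfs:///user/demo/people.json")

// Query it with SQL after registering a temporary table.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()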

Is Apache Spark going to replace Hadoop?

Hadoop essentially consists of a MapReduce engine and a file system (HDFS),
whereas Spark is a framework that executes jobs, so in practice Spark can
only replace the MapReduce phase of the Hadoop ecosystem.
Spark was mainly designed to run on top of Hadoop so as to minimize job
execution time.
Spark is an alternative to the traditional MapReduce model, which works only
in batch mode; Spark supports both batch and real-time processing.
Spark mainly uses the primary memory (RAM) of the system to produce results
efficiently, so it needs higher-end machines to execute jobs, whereas Hadoop
can easily run on commodity hardware.
Spark handles fault tolerance much faster than Hadoop: lost RDD partitions
are recomputed from their lineage rather than restored from replicated data,
which minimizes network I/O while still guaranteeing fault tolerance.

Thank You!