
Apache Spark

Name: Shweta Patnaik & Madhusmita Dey
Regd. No: 160303110004 & 160303110008
Branch: M.Tech (CSE)
Contents
 What is Spark?
 Spark Architecture
 Why Spark when Hadoop is already there?
 Spark vs. Hadoop MapReduce
 Apache Spark Installation
 Operating or Deploying a Spark Cluster Manually
 Running Spark Application
 Spark Environment
What is Spark?
 Spark was introduced by the Apache Software Foundation to speed up the
Hadoop computational process.
 Apache Spark is an open-source cluster computing framework for real-time
processing.
 Spark is being adopted by major players like Amazon, eBay, and Yahoo!
 It builds on the MapReduce model and extends it to efficiently support
more types of computations.
 Hadoop is just one of the ways to deploy Spark.
 Spark can use Hadoop in two ways – one for storage and the other
for processing.
 Spark also has its own cluster management, so it does not depend on Hadoop.
Spark Architecture
Why Spark when Hadoop is already
there?
 Hadoop is based on batch processing of big data. This means that the
data is stored over a period of time and is then processed using Hadoop.

 Whereas in Spark, processing can take place in real time. This real-time
processing power helps us solve use cases of Real-Time Analytics.

 Spark can also perform batch processing up to 100 times faster than
Hadoop MapReduce (the processing framework in Apache Hadoop).
Spark vs. Hadoop MapReduce
1. Introduction
 Apache Spark – It is an open-source big data framework. It provides a
faster and more general-purpose data processing engine. Spark is
designed primarily for fast computation.
 Hadoop MapReduce – It is also an open-source framework for writing
applications. It processes structured and unstructured data stored
in HDFS. MapReduce processes data in batch mode.

2. Speed
 Apache Spark – Spark is a lightning-fast cluster computing tool. Apache
Spark runs applications up to 100x faster in memory and 10x faster on
disk than Hadoop MapReduce.
 Hadoop MapReduce – MapReduce reads from and writes to disk between
steps, which slows down the processing speed.
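A minimal sketch of where that in-memory advantage shows up (the file path and log contents are hypothetical): a reused RDD is cached after its first computation, so later actions avoid re-reading the disk.

import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CacheDemo").setMaster("local[*]"))
    val logs = sc.textFile("E:/data/logs.txt")              // hypothetical input path
    logs.cache()                                            // keep the RDD in memory after first use
    val errors = logs.filter(_.contains("ERROR")).count()   // first action: reads from disk
    val warns  = logs.filter(_.contains("WARN")).count()    // second action: served from memory
    println(s"errors=$errors, warns=$warns")
    sc.stop()
  }
}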
3. Real-time analysis
 Apache Spark – It can process real-time data, i.e. data coming from real-time event
streams at the rate of millions of events per second, e.g. Twitter data or
Facebook sharing/posting. Spark’s strength is the ability to process live streams efficiently.
 Hadoop MapReduce – MapReduce fails when it comes to real-time data processing, as it
was designed to perform batch processing on voluminous amounts of data.

4. Latency
 Apache Spark – Spark provides low-latency computing.
 Hadoop MapReduce – MapReduce is a high latency computing framework.

5. Interactive mode
 Apache Spark – Spark can process data interactively.
 Hadoop MapReduce – MapReduce doesn’t have an interactive mode.

6. Streaming
 Apache Spark – Spark can process real-time data through Spark Streaming
(see the sketch below).
 Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
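A minimal Spark Streaming sketch, assuming text lines arrive on a local socket (the host localhost, port 9999, and the 1-second batch interval are illustrative choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one to receive data, one to process it
    val conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second micro-batches

    val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical live source
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                        // print each batch's counts

    ssc.start()
    ssc.awaitTermination()
  }
}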
7. Fault tolerance
 Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart
the application from scratch in case of any failure.
 Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so
there is no need to restart the application from scratch in case of any failure.

8. Cost
 Apache Spark – Spark requires a lot of RAM to run in memory, which
increases the size of the cluster and hence its cost.
 Hadoop MapReduce – MapReduce is the cheaper option when compared
in terms of cost.

9. Implementation language
 Apache Spark – Spark is developed in Scala.
 Hadoop MapReduce – Hadoop MapReduce is developed in Java.
10. OS support
 Apache Spark – Spark runs cross-platform.
 Hadoop MapReduce – Hadoop MapReduce also runs cross-platform.

11. Programming Language support


 Apache Spark – Scala, Java, Python, R, SQL.
 Hadoop MapReduce – Primarily Java; other languages like C, C++, Ruby, Groovy,
Perl, and Python are also supported via Hadoop Streaming.

12. SQL support

 Apache Spark – It enables the user to run SQL queries using Spark SQL (see
the sketch below).
 Hadoop MapReduce – It enables users to run SQL queries using Apache Hive (HQL).
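A minimal Spark SQL sketch, matching the Spark 1.6-era API installed later in these slides (the JSON file and its fields are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqlDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input: one JSON object per line, e.g. {"name":"a","age":30}
    val people = sqlContext.read.json("E:/data/people.json")
    people.registerTempTable("people")   // expose the DataFrame to SQL

    sqlContext.sql("SELECT name, age FROM people WHERE age >= 18").show()
    sc.stop()
  }
}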

13. Scalability
 Apache Spark – Spark is highly scalable; we can keep adding nodes to the
cluster. The largest known Spark cluster has 8,000 nodes.
 Hadoop MapReduce – MapReduce is also highly scalable; we can keep adding
nodes to the cluster. The largest known Hadoop cluster has 14,000
nodes.
14. Lines of code
 Apache Spark – Apache Spark is implemented in merely 20,000 lines of
code.
 Hadoop MapReduce – Hadoop 2.0 has 120,000 lines of code.

15. Machine Learning
 Apache Spark – Spark has its own machine learning library, MLlib (see the
sketch below).
 Hadoop MapReduce – Hadoop requires an external machine learning tool, for
example Apache Mahout.
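A minimal MLlib sketch using the RDD-based API of Spark 1.x (the data points and the choice of k = 2 are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansDemo").setMaster("local[*]"))

    // Illustrative 2-D points forming two rough clusters
    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
      Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
    ))

    val model = KMeans.train(points, 2, 20)   // k = 2 clusters, at most 20 iterations
    model.clusterCenters.foreach(println)     // print the learned cluster centers
    sc.stop()
  }
}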

16. Hardware requirements
 Apache Spark – Spark needs mid- to high-level hardware.
 Hadoop MapReduce – MapReduce runs very well on commodity
hardware.
Apache Spark Installation
Step-1 Install Scala
 Download Scala from the link:
http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi

 Install Scala under the C drive, in a Scala folder (C:\scala).


Step-2 Set environment variables

 My Computer → Properties → Advanced system settings → Environment
Variables → User variables → New
 User variable:

Variable: SCALA_HOME
Value: C:\scala
Click the OK button
 System variable:

Variable: PATH
Value: C:\scala\bin
Then click OK → OK → OK
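Equivalently, the same variables can be set from a Command Prompt. A sketch (setx takes effect in newly opened consoles only, and %PATH% here expands to the combined system and user values, so the GUI route above is cleaner for PATH):

setx SCALA_HOME "C:\scala"
setx PATH "%PATH%;C:\scala\bin"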
Step-3 Check the installation on cmd, for example:
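A typical check (the exact version string depends on the build you downloaded):

C:\Users\you> scala -version
Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL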
Step-4 Install Spark 1.6.1
 Download it from the following link:
http://spark.apache.org/downloads.html and extract it into the E drive, e.g.
E:\spark (you can extract it to another drive as well).
Step-5 Set environment variables

 User variable:
Variable: SPARK_HOME
Value: E:\spark\spark-1.6.1-bin-hadoop2.3

 System variable:
Variable: PATH
Value: E:\spark\spark-1.6.1-bin-hadoop2.3\bin
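The same can be done from cmd, analogous to Step-2 (a sketch; the paths must match the directory where you actually extracted Spark):

setx SPARK_HOME "E:\spark\spark-1.6.1-bin-hadoop2.3"
setx PATH "%PATH%;E:\spark\spark-1.6.1-bin-hadoop2.3\bin"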
Step-6 Download Windows utilities

 Download winutils.exe from the link:
https://github.com/prabaprakash/Hadoop-2.3/tree/master/bin
and paste it into E:\spark\spark-1.6.1-bin-hadoop2.3\bin
Step-7 Execute Spark on cmd

 Run spark-shell from the command prompt, for example:
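A typical session start (the banner varies with the build):

C:\Users\you> spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Spark context available as sc.
SQL context available as sqlContext.

scala>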
Operating or Deploying a Spark
Cluster Manually

 Let’s look at the classic Hadoop MapReduce example, Word Count, implemented
in Apache Spark.
 The input file input.txt contains the text whose words we want to count.
 Below is a sketch of the source code for the Word Count program in Apache
Spark.
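A minimal sketch of such a program, assuming illustrative paths E:/data/input.txt and E:/data/output:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local mode for a single machine; point the master at a cluster in production
    val sc = new SparkContext(new SparkConf().setAppName("WordCount").setMaster("local[*]"))

    sc.textFile("E:/data/input.txt")       // illustrative input path
      .flatMap(line => line.split(" "))    // split each line into words
      .map(word => (word, 1))              // pair each word with an initial count of 1
      .reduceByKey(_ + _)                  // sum the counts per word
      .saveAsTextFile("E:/data/output")    // illustrative output directory

    sc.stop()
  }
}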
Running Spark Application

 We will run the word count example in Apache Spark step by step using the
Spark shell, instead of submitting the whole program at once.
 Let’s start the Spark shell and create a Spark RDD from the input file that we
want to run our first Spark program on. You should specify the absolute path
of the input file, for example:
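A sketch of these two steps, assuming Spark 1.6.1 and an illustrative input path of E:/data/input.txt:

C:\Users\you> spark-shell
...
scala> val textFile = sc.textFile("E:/data/input.txt")
textFile: org.apache.spark.rdd.RDD[String] = E:/data/input.txt MapPartitionsRDD[1] at textFile at <console>:27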

 On executing the above command, the shell echoes the type of the newly
created RDD, as in the sketch above.

 Now comes the step to count the number of words, for example:
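A sketch of the counting step, splitting lines on spaces as in the classic example:

scala> val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> counts.collect()
res0: Array[(String, Int)] = Array(...)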

 You will get an Array of (word, count) pairs as output; the exact pairs
depend on the contents of input.txt.

 The next step is to store the output in a text file, for example:
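A sketch; note that saveAsTextFile refuses to overwrite an existing directory, so the output path must be new:

scala> counts.saveAsTextFile("E:/data/output")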

 Go to the output directory (the location you passed to saveAsTextFile, named
output here); Spark writes the results there as part-xxxxx files.
Spark Environment
Thank You
