Spark can also perform batch processing up to 100 times faster than
Hadoop MapReduce (the processing framework in Apache Hadoop).
Spark vs. Hadoop MapReduce
1. Introduction
Apache Spark – It is an open-source big data framework that provides a
faster, more general-purpose data processing engine. Spark is designed
primarily for fast computation.
Hadoop MapReduce – It is also an open-source framework for writing
applications. It processes structured and unstructured data stored in
HDFS, in batch mode.
2. Speed
Apache Spark – Spark is a lightning-fast cluster computing tool. Apache
Spark runs applications up to 100x faster in memory and 10x faster on
disk than Hadoop MapReduce.
Hadoop MapReduce – MapReduce reads from and writes to disk between
stages, which slows down processing.
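As an illustration of the in-memory model, here is a minimal Spark-shell sketch (the file path is an example; sc is the shell's built-in SparkContext):

    val logs = sc.textFile("E:/data/input.txt")            // example input path
    logs.cache()                                           // keep the RDD in memory after first use
    val total = logs.count()                               // first action reads from disk, then caches
    val errors = logs.filter(_.contains("ERROR")).count()  // served from the in-memory copy

MapReduce would pay a full disk round-trip between the two counting jobs; Spark serves the second one from memory.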
3. Real-time analysis
Apache Spark – It can process real-time data, i.e. data coming from real-time event
streams at a rate of millions of events per second, such as Twitter feeds or
Facebook posts. Spark's strength is its ability to process live streams efficiently.
Hadoop MapReduce – MapReduce falls short for real-time data processing, as it
was designed to perform batch processing on voluminous amounts of data.
4. Latency
Apache Spark – Spark provides low-latency computing.
Hadoop MapReduce – MapReduce is a high latency computing framework.
5. Interactive mode
Apache Spark – Spark can process data interactively.
Hadoop MapReduce – MapReduce doesn’t have an interactive mode.
6. Streaming
Apache Spark – Spark can process real-time data through Spark Streaming.
Hadoop MapReduce – With MapReduce, you can only process data in batch mode.
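As a rough sketch of how Spark Streaming is used (run inside the Spark shell, which exposes its SparkContext as sc; the host, port, and batch interval are example values):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Count words arriving on a socket in 10-second micro-batches.
    // The shell needs at least two local threads: one to receive, one to process.
    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)  // example host and port
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()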
7. Fault tolerance
Apache Spark – Spark is fault-tolerant. As a result, there is no need to restart
the application from scratch in case of any failure.
Hadoop MapReduce – Like Apache Spark, MapReduce is also fault-tolerant, so
there is no need to restart the application from scratch in case of any failure.
8. Cost
Apache Spark – Since Spark requires a lot of RAM to run in-memory, it
demands a larger cluster and therefore a higher cost.
Hadoop MapReduce – MapReduce is the cheaper option when the two are
compared in terms of cost.
9. Language Developed
Apache Spark – Spark is developed in Scala.
Hadoop MapReduce – Hadoop MapReduce is developed in Java.
10. OS support
Apache Spark – Spark runs cross-platform.
Hadoop MapReduce – Hadoop MapReduce also runs cross-platform.
11. Scalability
Apache Spark – Spark is highly scalable: nodes can keep being added to the
cluster. The largest known Spark cluster has 8,000 nodes.
Hadoop MapReduce – MapReduce is also highly scalable; nodes can likewise
keep being added to the cluster. The largest known Hadoop cluster has
14,000 nodes.
12. Lines of code
Apache Spark – Apache Spark is implemented in merely 20,000 lines of
code.
Hadoop MapReduce – Hadoop 2.0 has 120,000 lines of code.
13. Hardware requirements
Apache Spark – Spark needs mid- to high-end hardware.
Hadoop MapReduce – MapReduce runs very well on commodity
hardware.
Apache Spark Installation
Step-1 Install Scala
Download Scala from the link:
http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
Step-2 Set environment variables
User variable:
Variable: SCALA_HOME
Value: C:\scala
Click the OK button.
System variable:
Variable: PATH
Value: C:\scala\bin
Then click OK.
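The user variable can also be set from a command prompt; a sketch, assuming the same install location (open a new command prompt afterwards for the change to take effect, and edit PATH through the dialog above):

    setx SCALA_HOME "C:\scala"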
Step-3 Verify the installation from the command prompt (cmd); see below.
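For example (the exact version string may differ):

    C:\> scala -version
    Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL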
Step-4 Install Spark 1.6.1.
Download it from the following link:
http://spark.apache.org/downloads.html and extract it into the E drive, such
as E:\Spark (you can extract it into another drive as well).
Step-5 Set environment variables
User variable:
Variable: SPARK_HOME
Value: E:\spark\spark-1.6.1-bin-hadoop2.3
System variable:
Variable: PATH
Value: E:\spark\spark-1.6.1-bin-hadoop2.3\bin
Step-6 Download the Windows utilities (winutils.exe), which Spark needs
to run against the Hadoop binaries on Windows; see below.
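A common arrangement (the folder name here is an assumption): place winutils.exe under C:\winutils\bin and point HADOOP_HOME at that folder:

    rem Assumes winutils.exe was saved to C:\winutils\bin
    setx HADOOP_HOME "C:\winutils"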
Operating or Deploying a Spark Cluster Manually
We will run the word count example in Apache Spark using the Spark
shell, instead of submitting the word count program as a whole application.
Let's start the Spark shell.
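Since Step-5 put the Spark bin folder on PATH, the shell can be launched from any command prompt; when it starts, it exposes a SparkContext as the variable sc:

    C:\> spark-shell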
Let's create a Spark RDD from the input file on which we want to run our
first Spark program. You should specify the absolute path of the input file:
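A minimal word-count sketch to type at the shell prompt; the input path and output directory are example names, so substitute your own:

    val input = sc.textFile("E:/data/input.txt")        // absolute path to the input file
    val words = input.flatMap(line => line.split(" "))  // split each line into words
    val pairs = words.map(word => (word, 1))            // pair each word with a count of 1
    val counts = pairs.reduceByKey(_ + _)               // sum the counts per word
    counts.saveAsTextFile("E:/data/output")             // write results to the output directory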
Go to the output directory (the location where the directory named
output was created).
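For the example paths above, the output directory would contain something like the following (the number of part files depends on the number of partitions):

    E:/data/output/
        part-00000
        part-00001
        _SUCCESS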
Spark Environment
Thank You