KATHIRAVAN R 218003048
SHERWIN A 218003080
SRIKANTH S 218003091
SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India
BONAFIDE CERTIFICATE
Examiner I Examiner II
Department of Computer Science and Engineering
SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India
DECLARATION
Date  :
Place : Kumbakonam

Signature :
1.
2.
3.

Name :
1.
2.
3.
ACKNOWLEDGMENT
We pay our sincere obeisance to God ALMIGHTY, who in His infinite mercy
showered His blessings to make us what we are today.
First, we would like to express our sincere thanks to our Vice Chancellor
Prof. R. Sethuraman and our Registrar Dr. G. Bhalachandran for having given us the
opportunity to be students of this esteemed institution.
We would like to place on record the benevolent approach and painstaking efforts
of guidance and correction of Dr. V. Kalaichelvi, M.E., Ph.D., and Smt. R. Sujarani, M.C.A.,
M.Tech., (Ph.D.), the project coordinators, and all the department staff, to whom we owe our
heartfelt thanks forever.
Without the support of our parents and friends this project would never have become
a reality.
We dedicate this work to all our well-wishers with love and affection.
TABLE OF CONTENTS
ABSTRACT
This project proposes a new perspective on tackling the skew problem in Hadoop
applications. Rather than redistributing data among skewed tasks, this project tries to
balance the processing time of tasks even in the presence of data imbalance. Specifically,
tasks with more data or more expensive data records are accelerated by giving them more
resources.
KEY WORDS: Stragglers, Name Node, Data Node, Hadoop Distributed File System, Speculative
Execution
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Hadoop Environment
Hadoop is built around the MapReduce programming model, in which a
computation is expressed as a pair of Map and Reduce functions. Google started
using the MapReduce framework for its large-scale indexing from its early days.
As the data generation rate keeps increasing, data centers use commodity
hardware for data processing. The commodity hardware used for this processing
is not reliable and is more prone to failure, so the framework itself must
tolerate such failures.
Fig 1.1 High level Hadoop architecture
A heterogeneous environment contains nodes of differing capability, whereas a
homogeneous environment contains equally capable nodes. In a heterogeneous
environment, some of the compute nodes are faster than others. The slower
compute nodes are called stragglers, and the others are called fast nodes.
These fast nodes will finish their tasks early and wait for the stragglers to
finish theirs.
Advantages:
Job completion time signifies the time a job takes from its start to its
completion. The main reason to speculate a task is that it might reduce the job
completion time. Enabling speculation will reduce the job completion time only
if the speculative copy finishes before the original task does.
For instance, consider two slow tasks, one with 30% and the other with 80%
progress. The Job Tracker identifies these tasks as slow and schedules
speculative copies on fast nodes. As the first slow task is still in its
starting phase, the possibility that its speculative copy completes first is
higher than for the second. Hence, early detection of slow tasks is very
necessary for a speculative task to overtake the original task. This experiment
notes down the start and end times of a job to measure job completion time,
which is expressed in seconds.
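As a rough illustration of this detection logic (a minimal sketch, not the
project's code and not Hadoop's actual scheduler, which is more elaborate), a
task can be flagged for speculation when its progress rate falls well below the
average rate of its peers:

import java.util.List;

// Minimal sketch: flag a task as a straggler when its progress rate
// (progress fraction per second) is well below the mean rate of all
// running tasks. The 0.5 threshold is an assumed tuning constant.
public class StragglerDetector {
    /** Each entry of allTasks is {progressFraction, runtimeSeconds}. */
    public static boolean isStraggler(double progress, double runtimeSec,
                                      List<double[]> allTasks) {
        double rate = progress / runtimeSec;
        double mean = 0;
        for (double[] t : allTasks) {
            mean += t[0] / t[1];
        }
        mean /= allTasks.size();
        return rate < 0.5 * mean;
    }
}

On the example above, the 30% task has the lower progress rate and is therefore
the better candidate for speculation.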
PROBLEM STATEMENT
Existing System:
Stragglers increase the overall job completion time and also affect the
performance of the system. One of the reasons for a straggler is data skew: an
imbalance in the amount or cost of the data records assigned to individual tasks.
Proposed System:
If a straggler task has been identified, then a copy of that task will be run
on a fast node; the proposed solution may decrease the overall job completion time.
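To see why, here is a small back-of-the-envelope sketch (hypothetical numbers,
not measured results): a job finishes when its slowest task finishes, and a
speculated task finishes at the earlier of its two attempts.

// Hypothetical timings in seconds; illustrates why speculation can help.
public class SpeculationEffect {
    public static void main(String[] args) {
        long slowestFastTask = 45;  // slowest of the normal tasks
        long stragglerAlone = 120;  // straggler without speculation
        long backupCopy = 50;       // fresh copy of the straggler on a fast node

        // Job time = max over tasks; the speculated task ends at the
        // earlier of its original and backup attempts.
        long withoutSpeculation = Math.max(slowestFastTask, stragglerAlone);
        long withSpeculation = Math.max(slowestFastTask,
                Math.min(stragglerAlone, backupCopy));
        System.out.println(withoutSpeculation + " s vs " + withSpeculation + " s");
    }
}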
Various reasons for a slow task:
1. A task takes more time than expected. Since Hadoop doesn't try to
diagnose the slow task, it instead tries to detect it and run a backup copy
of it.
2. In a cluster with hundreds of nodes, problems with hardware failure are common.
CHAPTER 2
PHASE      DESCRIPTION      DURATION
mapper
reducer
Hadoop
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION
3.1 Introduction
This chapter presents the software requirements of the system, which detects
straggler tasks and handles their speculative scheduling.
3.1.1 Purpose
The SRS document also contains some general information about the system, like
its functions, constraints, and interface information.
The user uploads the data into a directory as a CSV or text file, where the
special characters are tokenized and the records read in are segmented.
Once the pre-processing is completed, the system asks the user to choose the
number of task splits so that the job runs in a parallel manner and the overall
job completion time can be analyzed and verified.
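A simplified sketch of such a split is shown below (the aspect/aspectN.txt
naming follows the file names used in the source code of Chapter 6; the actual
splitting logic is an assumption):

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: distribute the lines of the pre-processed input
// round-robin over N task files named aspect/aspect1.txt .. aspectN.txt.
public class TaskSplitter {
    public static void split(String inputFile, int n) throws IOException {
        new File("aspect").mkdirs();
        List<PrintWriter> outs = new ArrayList<PrintWriter>();
        for (int i = 1; i <= n; i++) {
            outs.add(new PrintWriter(new FileWriter("aspect/aspect" + i + ".txt")));
        }
        BufferedReader in = new BufferedReader(new FileReader(inputFile));
        String line;
        int k = 0;
        while ((line = in.readLine()) != null) {
            outs.get(k++ % n).println(line);
        }
        in.close();
        for (PrintWriter out : outs) {
            out.close();
        }
    }
}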
Since the project works under the Hadoop framework, MapReduce plays the vital
role of completing the split tasks and running them in a parallel manner. When
the records are mapped and reduced, the system notifies the user of the job
completion time. After the parallel jobs complete, the system alerts the user
about which task completed last, identifying it as the higher-end task. The
user is then prompted with the speculative task; that particular task gets
executed and its completion time is denoted.
This project is intended to facilitate the user and to give a great start in
the world of Big Data. The main functions of the system are:
Data upload
Pre-processing
Running the tasks in a parallel manner
Executing MapReduce
Visualizing the completion times in a performance chart
User : The user collects a large volume of data and uploads it for analysis
and segmentation, to count the essential words.
Once the pre-processing is over, the system asks the user to split the task
for parallel execution, since the application works under the Hadoop framework.
a) Mapper Phase
The mapper job involves three sub-tasks: Mapper, Combiner, and Partitioner. The
Combiner combines the generated mapped pairs, and the Partitioner splits the
data into small partitions, after which the (key, value) pairs are shuffled to
the reducers.
The text from the input text file is tokenized into words to form key-value
pairs for all the words present in the input text file. The key is the word
itself and the value is a count of 1.
The mapper phase in the WordCount example will split the string into
individual tokens, i.e., words. In this case, the entire sentence will be split
into 5 tokens.
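For instance, assuming the example sentence is "An elephant is an animal"
(which matches the 5-token count), the mapper emits one (key, value) pair per
token:

(An, 1)  (elephant, 1)  (is, 1)  (an, 1)  (animal, 1)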
b) Shuffler Phase
After the map phase execution is completed successfully, the shuffle phase
begins: the (key, value) pairs generated in the map phase are taken as input
and then sorted in alphabetical order of their keys.
c) Reducer Phase
In the reduce phase, all identical keys are grouped together and the values for
similar keys are added up to find the number of occurrences of each word
generated by the map phase. The reducer phase thus takes the output of the
shuffle phase and produces unique keys with their values added up. In our
example, "An elephant is an animal", all the counts for a given word are added up.
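Putting the three phases together for the same sentence (and assuming the words
were lower-cased during pre-processing, as described in Chapter 5):

after map     : (an, 1) (elephant, 1) (is, 1) (an, 1) (animal, 1)
after shuffle : (an, [1, 1]) (animal, [1]) (elephant, [1]) (is, [1])
after reduce  : (an, 2) (animal, 1) (elephant, 1) (is, 1)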
All the data entries will be correct and up to date. This software is developed
using Java as the front end and Hadoop as the back end. The user can only
upload the files; the maintenance and scheduling of the jobs are handled by the
system.
Processor : i3
RAM : 4 GB
Software : Hadoop (installed)
The SRS describes the system's responses in sequence, i.e., what services it
will provide to the user and the information needed to produce the required
outputs. Here, the system segments the input data so that the scheduler can run
the resulting tasks in parallel.
3.4.2.1 Use Case Diagram
3.4.2.2 Use Case Description
Usecase: Data Upload
Brief Description: The user uploads the dataset into the file storage to
analyze and segment it.
Initial step-by-step description:
The user has to start the Hadoop services, such as the NameNode and
DataNode.
Step 1: The user collects a dataset of Twitter tweets.
Step 2: Upload the dataset to the storage for analysis.
Usecase: Pre-processing
Brief Description: The uploaded input gets processed in several steps, as
explained in the product functions.
Initial step-by-step description:
The user pre-processes the data to analyze and count the essential words and
remove the non-essential words.
Step 1: The user pre-processes the data by clicking the pre-processing option
provided in the interface.
Step 2: The user normalizes the words to ease processing.
Step 3: The admin stores the selected data in a .txt file.
Step 4: If the data has been segmented successfully, the user gets into the next
part of the data analysis.
Brief Description: Once the segmentation process has been completed, the mapper
phase begins.
Usecase: Identify straggler and speculate
Brief Description: The system scheduler executes the tasks in parallel and
tells the user which task ended first; the results are reported in first-come,
first-out order.
Brief Description: The MapReduce job has completed the mapper phase by
shuffling the tweets in the input data, which are then split and reduced with
respect to their (key, value) pairs.
3.4.3 Other Nonfunctional Requirements
3.4.3.1 Performance
The system runs the tasks in a parallel manner and compares the performance of
each task with the help of a performance chart.
1. Reliability
2. Availability
3. Maintainability
4. Portability
CHAPTER 4
SYSTEM ANALYSIS
4.2 USE CASE DIAGRAM
4.3 CLASS DIAGRAM
4.4 Data Flow Diagram
Level 0:
Level 1:
4.5 CASE tools for Analysis
Coding : Java
IDE : Eclipse
CHAPTER 5
DESIGN
5.1.1 PRE-PROCESSING
The data for our work is collected from various social media services like
Twitter and Wikipedia. Unwanted data is removed at this stage so that the posts
are ready for the subsequent tasks.
Initially, the data is processed to remove hash tags, punctuation, other special
characters, URLs, and numbers, and each word is converted to lower case.
Then, the data is further processed to convert each word to its base form by the
technique called stemming. At the end of this step, the data set contains only
clean, normalized words.
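A minimal Java sketch of this cleaning stage is given below (the regex patterns
and the naive suffix-stripping stemmer are simplified assumptions; the project
itself uses POS tagging and a fuller stemming step):

import java.util.Locale;

// Simplified sketch of the pre-processing described above. Real stemming
// (e.g., the Porter algorithm) is more involved than suffix stripping.
public class Preprocess {
    public static String clean(String post) {
        String s = post.toLowerCase(Locale.ROOT)
                .replaceAll("https?://\\S+", " ")  // remove URLs
                .replaceAll("#\\w+", " ")          // remove hash tags
                .replaceAll("[^a-z\\s]", " ");     // remove punctuation and digits
        StringBuilder out = new StringBuilder();
        for (String w : s.trim().split("\\s+")) {
            w = w.replaceAll("(ing|ed|es|s)$", ""); // naive stemming
            if (!w.isEmpty()) {
                out.append(w).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Loving #Hadoop! See https://example.com for 10 demos"));
        // prints: lov see for demo
    }
}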
Hybrid Segmentation
5.2 INTERFACE DESIGN
Figure 5.2: Pre-processing phase
Figure 5.4: POS tagging
Figure 5.5: Stemming Words
Figure 5.8: Execution of MapReduce
Figure 5.10: Final Output
5.3 BACK END DESIGN
CHAPTER 6
CODING
6.1 ALGORITHM
Step 2: Remove special characters and convert long sentences into tokens.
Step 3: Perform stemming to convert every word to its base form.
Step 5: Enter the number of tasks to run on the single-node cluster and pass
the resulting splits to MapReduce.
Step 6: The system identifies the higher-end task and denotes it as the
speculative task.
Step 7: Pass the mapper output into the reducer, where the key represents a word
and the value its count. Meanwhile, the system reports the parallel completion
time of the tasks.
6.2 SOURCE CODE
Map.java

package tweet;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: emits (word, 1) for every token in the input line.
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String str = tokenizer.nextToken();
            word.set(str);
            context.write(word, ONE);
        }
    }
}
Reduce.java

package tweet;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: sums the counts emitted for each word.
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Main.java

package tweet;

import java.awt.EventQueue;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.*;
import java.util.Vector;
import javax.swing.*;

public class Main extends JFrame {

    JPanel panel = new JPanel();
    JEditorPane editorPane = new JEditorPane();
    JButton btnStopRemoval = new JButton("Stop Word Removal");
    File file;

    public static void main(String[] args) {
        EventQueue.invokeLater(new Runnable() {
            public void run() {
                try {
                    Main frame = new Main();
                    frame.setVisible(true);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    public Main() {
        btnStopRemoval.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                try {
                    // Part-of-speech tags to keep; every other word is dropped.
                    Vector<String> vs = new Vector<String>();
                    vs.add("NN");  vs.add("VBZ"); vs.add("JJ");
                    vs.add("NNS"); vs.add("VB");  vs.add("CC");
                    vs.add("VBD"); vs.add("JJR"); vs.add("VBG");
                    vs.add("RB");  vs.add("VBN"); vs.add("WRB");
                    vs.add("VBP"); vs.add("JJS"); vs.add("NNP");

                    BufferedReader fin = new BufferedReader(
                            new InputStreamReader(new FileInputStream("stem.txt")));
                    FileOutputStream fout = new FileOutputStream("stop.txt");
                    String t = null;
                    while ((t = fin.readLine()) != null) {
                        // Each line is "id#word/TAG word/TAG ...".
                        String tt[] = t.split("#");
                        StringBuffer sb = new StringBuffer();
                        String ss[] = tt[1].split(" ");
                        for (int i = 0; i < ss.length; i++) {
                            ss[i] = ss[i].trim();
                            String ww[] = ss[i].split("/");
                            if (ww.length > 1 && vs.contains(ww[1])) {
                                ss[i] = ww[0] + "/" + ww[1];
                                sb.append(ss[i] + " ");
                            }
                        }
                        fout.write((tt[0] + "#" + sb.toString() + "\n").getBytes());
                    }
                    fin.close();
                    fout.close();
                    // Show the filtered file in the editor pane.
                    editorPane.setPage(new File("stop.txt").toURI().toURL());
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        });
        panel.add(btnStopRemoval);
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        getContentPane().add(panel);
        setSize(600, 400);
    }
}
HDProcess.java

package tweet;

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import java.util.Collections;
import java.util.Vector;
import javax.swing.*;
import javax.swing.table.DefaultTableModel;

public class HDProcess extends JFrame {

    JPanel contentPane = new JPanel();
    JPanel panel = new JPanel();
    JButton btnMapreduce = new JButton("MapReduce");
    JTextArea textArea = new JTextArea();
    JTable table = new JTable();
    PerformanceChart chart;   // the project's chart panel; its class is not shown in this listing
    String inpath, outpath;
    int nodescnt;             // number of task splits chosen by the user

    public HDProcess() {
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setContentPane(contentPane);
        contentPane.setLayout(null);
        contentPane.add(panel);
        panel.setLayout(null);

        btnMapreduce.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                try {
                    long st1 = System.currentTimeMillis();
                    long ptime = 0;
                    long stime = 0;

                    // Run word count over the full input first.
                    WordCount wc = new WordCount(inpath, outpath);
                    wc.doMapReduce();
                    textArea.setText(wc.log);

                    Vector<Long> numbers = new Vector<Long>();   // times, later sorted
                    Vector<Long> numbers1 = new Vector<Long>();  // times in arrival order

                    // Run every split (aspect file) and record its completion time.
                    for (int i = 1; i <= nodescnt; i++) {
                        long st = System.currentTimeMillis();
                        File f1 = new File("aspect/aspect" + i + ".txt");
                        inpath = f1.getAbsolutePath();
                        outpath = f1.getParent() + "/output";
                        wc = new WordCount(inpath, outpath);
                        wc.doMapReduce();
                        textArea.append(wc.log);
                        long en = System.currentTimeMillis();
                        ptime = (en - st);
                        textArea.append("Task " + i + " : " + ptime + " Seconds\n");
                        numbers.add(ptime);
                        numbers1.add(ptime);
                    }

                    // Rank the tasks by completion time.
                    Collections.sort(numbers);
                    int c = 1;
                    String rank = "";
                    for (int j = 0; j < numbers.size(); j++) {
                        rank += "Rank " + c + " : " + numbers.get(j) + " Seconds\n";
                        c++;
                    }
                    JOptionPane.showMessageDialog(new JFrame(), rank);

                    // The last task to finish is treated as the straggler.
                    int ind = numbers.indexOf(numbers1.get(nodescnt - 1)) + 1;
                    JOptionPane.showMessageDialog(new JFrame(),
                            "Speculative task : Task " + ind);

                    // Re-run the straggler as the speculative task and time it.
                    long st = System.currentTimeMillis();
                    File f1 = new File("aspect/aspect" + ind + ".txt");
                    inpath = f1.getAbsolutePath();
                    outpath = f1.getParent() + "/output";
                    wc = new WordCount(inpath, outpath);
                    wc.doMapReduce();
                    textArea.append(wc.log);
                    long en = System.currentTimeMillis();
                    ptime = (en - st);
                    textArea.append("Speculative task : " + ptime + " Seconds\n");
                    numbers.add(ptime);
                    numbers1.add(ptime);

                    long en1 = System.currentTimeMillis();
                    stime = (en1 - st1);
                    textArea.append("Overall completion : " + stime + " Seconds\n");

                    // Plot the per-task completion times in a separate frame.
                    chart.createChart();
                    JFrame app = new JFrame("Performance Chart");
                    app.setContentPane(chart);
                    app.setDefaultCloseOperation(JFrame.HIDE_ON_CLOSE);
                    app.setSize(800, 450);
                    app.setVisible(true);

                    // List the generated output files in the table.
                    Vector<String> vhead = new Vector<String>();
                    vhead.add("File Name");
                    vhead.add("File Size");
                    vhead.add("File path");
                    Vector<Vector<String>> vbody = new Vector<Vector<String>>();
                    File ff[] = new File(outpath).listFiles();
                    for (int i = 0; i < ff.length; i++) {
                        Vector<String> vr = new Vector<String>();
                        vr.add(ff[i].getName());
                        vr.add(ff[i].length() + "");
                        vr.add(ff[i].getAbsolutePath());
                        vbody.add(vr);
                    }
                    table.setModel(new DefaultTableModel(vbody, vhead));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        panel.add(btnMapreduce);
    }
}
Wordcount.java

package tweet;

import java.io.File;
import java.io.IOException;
import java.util.*;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    JobConf conf;
    String inFile, outDirPath;
    String log = "";   // collected status messages shown in the GUI

    WordCount(String inFile, String outDirPath) {
        this.inFile = inFile;
        this.outDirPath = outDirPath;
    }

    public void doMapReduce() throws IOException, ClassNotFoundException {
        try {
            conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Hadoop refuses to overwrite an existing output directory,
            // so remove a stale one first.
            File dir = new File(outDirPath);
            if (dir.exists()) {
                FileUtil.fullyDelete(dir);
            }
            FileInputFormat.setInputPaths(conf, inFile);
            FileOutputFormat.setOutputPath(conf, new Path(outDirPath));
            JobClient.runJob(conf);
            log += "wordcount completed for " + inFile + "\n";
        } catch (Exception ex) {
            JOptionPane.showMessageDialog(new JFrame(), ex.getMessage());
        }
    }

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
CHAPTER 7
TESTING
Testing is performed on intermediate work products and/or a finished product. It
is the process of exercising software with the intent of ensuring that the
software system meets its requirements and user expectations and does not fail
in an unacceptable manner. There are various types of tests; the validation
testing performed on this system is described below.
7.1 Validation Testing:
Test ID     : T06
Module      : MapReduce
Description : Analyze the input record with (key, value) pairs
Input       : Dataset required
Action      : Click the button
Expected    : Maps and reduces the input record
Status      : Pass
CHAPTER 8
IMPLEMENTATION
The main steps involved in the implementation are:
Correct planning
Collection of datasets
Setting up the virtual machine
Before starting the project, we must have a proper plan about it.
We should not jump into the code directly; first, the project should be
analyzed thoroughly. After installing the virtual-machine software, the user
needs to set the Java home path so that Hadoop can locate the JDK.
CHAPTER 9
9.1 CONCLUSION
Hadoop MapReduce is a widely used large-scale data processing framework, but its
design poses challenges to performance due to data skew. Apache Hadoop does not
fix or diagnose slow-running tasks; this project tries to detect when a task is
running slower than expected and speculatively launches a copy of that task,
chosen according to completion time, on a single-node cluster.
9.2 FUTURE PLAN
In the future, we plan to implement this work on a multi-node Hadoop setup,
since it is cost-effective and scalable. Then we will work on giving input
dynamically and connecting two systems for parallel execution.
CHAPTER 10
REFERENCES
1. Y. Guo, J. Rao, C. Jiang, and X. Zhou, "Moving Hadoop into the cloud with
flexible slot management and speculative execution," IEEE Transactions on
Parallel and Distributed Systems.
6. F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar,
"… interleaving and map task backfilling," in Proc. 9th Eur. Conf. Comput. Syst.
10. E. Coppa and I. Finocchi, "On data skewness, stragglers, and MapReduce
progress indicators," in Proc. ACM Symposium on Cloud Computing.