
PROJECT REPORT ON

SPECULATIVE EXECUTION IN HADOOP SCHEDULING

Submitted in partial fulfilment of the requirements for the award of the


Degree of Bachelor of Technology in Computer Science & Engineering
Submitted by

KATHIRAVAN R 218003048
SHERWIN A 218003080
SRIKANTH S 218003091

Under the Guidance of


Smt. M.Vanitha M.Tech
Assistant Professor-II, Dept. of CSE

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam - 612 001
Tamil Nadu, India
MAY 2018

SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India

BONAFIDE CERTIFICATE

Certified that this project work entitled “SPECULATIVE EXECUTION IN HADOOP SCHEDULING”, submitted to the Shanmugha Arts, Science, Technology & Research Academy (SASTRA Deemed to be University), Kumbakonam – 612 001 by Kathiravan R (218003048), Sherwin A (218003080) and Srikanth S (218003091) in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, is an original and independent work carried out under my guidance during the period December 2017 – April 2018.

Guide HOD (CSE)

Submitted for University Examination held on: …………… 2018

Examiner I Examiner II

Department of Computer Science and Engineering
SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India

DECLARATION

We submit this project entitled “SPECULATIVE EXECUTION IN HADOOP SCHEDULING” to SASTRA Deemed to be University, Srinivasa Ramanujan Centre, Kumbakonam – 612 001, in partial fulfilment of the requirements for the award of the degree of Bachelor of Technology in Computer Science and Engineering, and declare that this is our original work carried out under the guidance of Smt. M. Vanitha, M.Tech.

Date : Signature:
Place : Kumbakonam 1.
2.
3.

Name :
1.
2.
3.

ACKNOWLEDGMENT

We pay our sincere obeisance to God ALMIGHTY who, in His infinite mercy, has showered His blessings to make us what we are today.

First, we would like to express our sincere thanks to our Vice Chancellor
Prof. R. Sethuraman and our Registrar Dr. G. Bhalachandran for having given us an
opportunity to be a student of this esteemed institution.

We express our deepest thanks to our revered Dr.V.Ramaswamy, Dean, Srinivasa


Ramanujan Centre for his moral support and suggestions when required without any
reservations.

We express our gratitude to Dr.S.Sivagurunathan, HOD, CSE for his constant


support and valuable suggestions for the completion of the project.

We exhibit our pleasure in expressing our thanks to Smt. M. Vanitha, M.Tech, Assistant Professor-II, Department of CSE, our guide, for her ever-encouraging spirit and meticulous guidance towards the completion of the project.

We would like to place on record the benevolent approach and painstaking guidance and corrections of Dr. V. Kalaichelvi, M.E., Ph.D. and Smt. R. Sujarani, M.C.A., M.Tech., (Ph.D.), the project coordinators, and all the department staff, to whom we owe our heartfelt thanks.

Without the support of our parents and friends this project would never have become
a reality.

We dedicate this work to all our well-wishers with love and affection.

Table of contents

Chapter No.    Content
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
1.0 INTRODUCTION
2.0 SOFTWARE PROJECT PLAN
3.0 SOFTWARE REQUIREMENTS SPECIFICATION
3.1 Introduction
3.2 Overall Description
3.3 External Interface Requirements
3.4 System Features
4.0 SYSTEM ANALYSIS
4.1 Architecture Diagram
4.2 Use Case Diagram
4.3 Class Diagram
4.4 Dataflow Diagram
4.5 Case Tools for Analysis
5.0 DESIGN
5.1 Methodology and Approach
5.2 Interface Design
5.3 Back end Design
6.0 CODING
6.1 Algorithm
6.2 Source code
7.0 TESTING
7.1 Validation Testing
7.2 Unit Testing
8.0 IMPLEMENTATION
8.1 Problems Faced
8.2 Lessons Learnt
9.0 CONCLUSION AND FUTURE PLAN
9.1 Conclusion
9.2 Future Plan
10.0 REFERENCES

ABSTRACT

Hadoop is an open-source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. Hadoop consists of MapReduce for computation and HDFS for storage. The Hadoop Distributed File System (HDFS) is a distributed storage system with a master/slave architecture: the master is the NameNode, which is responsible for metadata, and the slaves are the DataNodes, which are responsible for the actual data storage.
Due to uneven distribution of the input data, tasks with more data become stragglers and delay overall job completion. Such tasks can take more than five times longer to complete than the fastest task, so they slow down the whole job. Hadoop speculatively runs a backup copy of a slow task on a different node, and the backup may run on a better-performing node. However, the differences between nodes are usually not significant enough to mitigate the skew, and job performance is still bottlenecked by stragglers.

This project proposes a new perspective on tackling the skew problem in Hadoop applications. Rather than mitigating skew among tasks after the fact, this project tries to balance the processing time of tasks even in the presence of data imbalance. Specifically, tasks with more data or more expensive data records are accelerated by giving them more resources.

KEY WORDS: Stragglers, NameNode, DataNode, Hadoop Distributed File System, Speculative Execution

LIST OF TABLES

Table No.    Table Name

Table 2.1    Various phases of the project

Table 7.1    Represents the validation testing

Table 7.2    Represents the unit testing

LIST OF FIGURES

Figure No.    Figure Name

Figure 1.1    High level Hadoop architecture
Figure 3.1    Represents the Use case diagram
Figure 4.1    Depicts the System Architecture
Figure 4.2    Represents the Use case diagram
Figure 4.3    Represents the class diagram
Figure 4.4    Represents the data flow diagram
Figure 5.2    Pre-Processing Phase
Figure 5.3    Removal of Special Characters
Figure 5.4    Parts of Speech Tagging
Figure 5.5    Stemming Words
Figure 5.6    Hybrid Segmentation
Figure 5.7    Splitting task to run in parallel
Figure 5.8    Execution of MapReduce
Figure 5.9    Implementing speculative task
Figure 5.10   Final Output

CHAPTER 1

INTRODUCTION

Hadoop Environment

Hadoop is an open-source implementation of the MapReduce environment. MapReduce is a framework that allows users to specify a computation in terms of Map and Reduce functions. Google has used the MapReduce framework for its large-scale indexing from its early days. As the rate of data generation increases year by year through sources such as scientific data processing and social media, we need a platform to process such large-scale data. Data centres generally use commodity hardware for data processing; such hardware is not very reliable and is prone to failure, so we need a fault-tolerant framework to process large-scale data.

Fig 1.1 High level Hadoop architecture

Speculative Execution in Hadoop

Hadoop executes jobs on commodity hardware in a data centre. This hardware has different computing power depending on its specification. A heterogeneous environment comprises nodes with varying computing power, whereas a homogeneous environment contains equally capable nodes. In a heterogeneous environment, some of the compute nodes are faster than others. The slower compute nodes are called stragglers and the others are called fast nodes. The fast nodes finish their tasks early and wait for the stragglers to finish theirs.

The salient features of the proposed system are:

 Ability to handle a large volume of Twitter data.

 Efficient and scalable for practical use.

 Ability to capture real-time events from social media services.

Advantages:

Job completion time signifies the time a job takes from its start to its completion. The main reason to speculate a task is that it might reduce the job completion time; enabling speculation reduces the job completion time only if the speculative task finishes earlier than the original task.

For instance, consider two slow tasks, one at 30% and the other at 80% progress. The JobTracker identifies both as slow tasks and schedules speculative copies on fast nodes. Since the first slow task is still in its starting phase, the possibility that its speculative copy completes first is higher than for the second task, which is near completion.

Hence, early detection of slow tasks is necessary for a speculative task to overtake the original task. This experiment notes down the start and end times of a job to measure the job completion time, which is expressed in seconds.
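To make the 30%/80% example concrete, the sketch below estimates each task's remaining time from its progress rate, in the style of the LATE heuristic; Hadoop's own speculator uses its own runtime estimator, so the class and field names here are purely illustrative assumptions, not part of the Hadoop API.

    // Illustrative straggler heuristic (not Hadoop's exact implementation):
    // estimate the remaining time of each task from its progress rate and
    // speculate the task that is expected to finish last.
    class TaskEstimate {
        final String taskId;
        final double progress;      // fraction complete, 0.0 - 1.0
        final long elapsedMillis;   // time since the task started

        TaskEstimate(String taskId, double progress, long elapsedMillis) {
            this.taskId = taskId;
            this.progress = progress;
            this.elapsedMillis = elapsedMillis;
        }

        // LATE-style estimate: timeLeft = (1 - progress) / progressRate
        long estimatedTimeLeftMillis() {
            double rate = progress / Math.max(elapsedMillis, 1); // progress per ms
            return (long) ((1.0 - progress) / Math.max(rate, 1e-9));
        }
    }

With this estimate, a task at 30% progress after 60 seconds has roughly 140 seconds left, while a task at 80% progress after the same 60 seconds has roughly 15 seconds left, so the former is the better candidate for speculation.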

PROBLEM STATEMENT

Existing System:

Uneven distribution of the input data may produce straggler tasks, which increase the overall job completion time and affect the performance of the system. One of the reasons for a straggler is data skew, an imbalance in the distribution of the input data.

Proposed System:

If a straggler task is identified, then a copy of that task is run (speculative execution) on another node while preserving data locality. Thus, the proposed solution may decrease the overall job completion time.
Various reasons for a slow task:

1. Hardware degradation and software misconfiguration.

2. Taking more time than expected; Hadoop does not try to diagnose slow tasks, it only tries to detect them and run backup copies of them.

Benefits of Speculative Execution:

1. Improves efficiency and hence reduces the overall job completion time.

2. In clusters with hundreds of nodes, hardware failures are common.

3. Running a parallel, duplicate task is better than waiting for the problematic task to complete.
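For reference, speculative execution can be switched on or off per job. A minimal sketch using the classic mapred API (JobConf) that the source code in Chapter 6 already uses; the helper class name is illustrative, and the exact property names may differ between Hadoop versions:

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        // Returns a job configuration with speculative execution explicitly enabled.
        static JobConf withSpeculation(Class<?> jobClass) {
            JobConf conf = new JobConf(jobClass);
            conf.setMapSpeculativeExecution(true);     // allow backup copies of slow map tasks
            conf.setReduceSpeculativeExecution(true);  // allow backup copies of slow reduce tasks
            return conf;
        }
    }

JobConf also provides setSpeculativeExecution(boolean) to toggle both map and reduce speculation at once.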

Five lesser-known facts about Hadoop

 Hadoop – Future of Raw Data

 Hadoop database is an affordable solution for enterprises

 Security challenges of Hadoop

 Turnkey solutions supporting Hadoop

 Evolution of Hadoop analytics

CHAPTER 2

SOFTWARE PROJECT PLAN

Phase     Description                    Duration

Phase 1   Data set collection            4-5 weeks
Phase 2   Data pre-processing            1-2 weeks
Phase 3   Implementation of mapper       0-1 week
Phase 4   Implementation of reducer      1-2 weeks
Phase 5   Implementation in Hadoop       2-3 weeks

Table 2.1: Various phases of the project

CHAPTER 3

SOFTWARE REQUIREMENTS SPECIFICATION

3.1 Introduction

This software requirements specification provides a complete description of all the functions and specifications of Speculative Execution in Hadoop Scheduling.

3.1.1 Purpose

The purpose of this document is to present a detailed description of the wordcount application running on the Hadoop framework and of how straggler tasks are speculatively executed.

3.1.2 Document Conventions

This document is written in the following style.

Font family : Times New Roman
Font face   : Bold for titles and subsections
Font size   : Title & subsection: 14, Description: 12

3.1.3 Intended Audience and Reading Suggestions

This document is intended for users and administrators. The SRS document also contains information about the system, such as its scope, system features, assumptions and dependencies, and other useful information.

3.1.4 Product Scope

The user uploads the data into a directory as a CSV or text file to run the job.

The application pre-processes the file to extract the essential keywords: special characters are handled during tokenization and the records that are read are segmented.

Once pre-processing is complete, the system asks the user to choose the number of tasks the job should be split into so that it can run in parallel, in order to analyse how long it takes to complete, i.e. to verify the overall job completion time.

Since the project works on the Hadoop framework, MapReduce plays the vital role in completing the split tasks and running them in parallel. When the records are mapped and reduced, the timing of the job completion is reported. After the parallel split tasks complete, the system tells the user which task is the higher-end (slowest) one. The user is then prompted to run the speculative copy of that particular task, which is executed, and the overall performance is shown in a graph.

3.2. Overall Description

3.2.1 Product Perspective

This project is to facilitate the user to give a great start in the world of

MapReduce programming by giving them a hands-on experience in developing

their first Hadoop based Wordcount application. Hadoop MapReduce

Wordcount example is a standard example where Hadoop developers begin

their hands-on programming with.

3.2.2 Product Functions

 Data upload
 Pre-Processing

 Running the task in parallel manner
 Execute MapReduce
 Visualize the completion time in performance chart

3.2.3 User Classes and Characteristics

User : The user collects a large volume of data and uploads it for analysis and segmentation in order to count the essential words.

3.2.4 Operating Environment

Operating System : Windows XP and above (VMware)
Processor type   : 32-bit or 64-bit
RAM              : 512 MB
Hard disk        : 40 GB

3.2.5 Design and Implementation Constraints

Once pre-processing is over, the system asks the user to split the task for parallel execution. Since the application works on the Hadoop framework, the tasks are executed using MapReduce. The MapReduce approach is used to process the data in parallel effectively and to aggregate the data to obtain the final result. It can be used to process large volumes of data, in the terabyte to zettabyte range.
a) Mapper Phase

The mapper job involves three sub-tasks: Mapper, Combiner and Partitioner. The Mapper converts the data into (key, value) pairs, the Combiner locally combines the generated mapped pairs, and the Partitioner splits the data into small groups, after which the (key, value) pairs are shuffled to their respective reducer jobs.

The text from the input file is tokenized into words to form a key-value pair for every word present in the input file. The key is the word from the input file and the value is ‘1’.

For instance, consider the sentence “An elephant is an animal”. The mapper phase in the wordcount example will split the string into individual tokens, i.e. words. In this case, the entire sentence is split into 5 tokens.

b) Shuffler Phase

After the map phase completes successfully, the shuffle phase is executed automatically, wherein the key-value pairs generated in the map phase are taken as input and sorted in alphabetical order of keys.

c) Reducer Phase

In the reduce phase, all the keys are grouped together and the values for similar keys are added up to find the number of occurrences of each particular word. It acts as an aggregation phase for the keys generated by the map phase. The reducer phase takes the output of the shuffle phase as input and reduces the key-value pairs to unique keys with their values added up. In our example “An elephant is an animal”, the word “an” is the only one that appears twice in the sentence (counting “An” and “an” as the same word once the text is lower-cased).
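For the same sentence, and assuming the words are lower-cased during pre-processing, the data flows through the three phases roughly as follows:

Map output         : (an, 1), (elephant, 1), (is, 1), (an, 1), (animal, 1)
After shuffle/sort : (an, [1, 1]), (animal, [1]), (elephant, [1]), (is, [1])
Reduce output      : (an, 2), (animal, 1), (elephant, 1), (is, 1)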

3.2.6 Assumptions and Dependencies

All the data entries will be correct and up to date. This software is developed using Java as the front end and Hadoop as the back end. The user can only upload the files; the maintenance of the jobs and the scheduling of jobs are done by Hadoop.

3.3 External Interface Requirements

3.3.1 Hardware Interfaces

Processor : Intel Core i3
RAM       : 4 GB
Hard disk : 300 GB

3.3.2 Software Interfaces

The software requires the Eclipse IDE to run the application. Hadoop should be properly installed and configured.

Operating System : Windows OS
Front End        : Java
Back End         : HDFS
Framework        : Hadoop framework (plug-in should be installed)

3.4 System Features

3.4.1 Segmentation Sequences

Sequence : Response

Removing special characters : Removes symbols such as @, :, ., …
Tag removal                 : Eliminates HTML and URL links
POS tagging                 : Denotes the grammar (part of speech) of the data present
Stemming words              : Converts the data into its normal (base) form
Hybrid segmentation         : Produces the dataset used to run MapReduce

3.4.2 Functional Requirements

Functional requirements are those that refer to the functionality of the system, i.e. the services it will provide, together with the information needed to produce the correct system; they are detailed separately.

Here, the system segments the input data so that the scheduler can identify the straggler and speculatively execute the task.

3.4.2.1 Use Case Diagram

Figure 3.1: Represents the Use case diagram

3.4.2.2 Use Case Description
Use case: Data Upload
Brief description: The user uploads the dataset into the file storage for analysis and segmentation.
Initial step-by-step description:
The user has to start the Hadoop services such as the NameNode and DataNode.
Step 1: The user collects a dataset of Twitter tweets.
Step 2: Upload the dataset to the storage for analysis.

Use case: Pre-processing
Brief description: The uploaded input is processed through several steps, as explained in the product functions.
Initial step-by-step description:
The user pre-processes the data to analyse and count the essential words and remove the non-essential words.
Step 1: The user pre-processes the data by clicking the pre-processing option provided in the interface.
Step 2: The user normalizes the words to ease the process.
Step 3: The admin stores the selected data into a .txt file.
Step 4: If the data has been segmented successfully, the user moves on to the next part of the data analysis.

Use case: Mapping with key value
Brief description: Once the segmentation process has been completed, the mapper starts the job with respect to (key, value) pairs.
Use case: Identify straggler and speculate
Brief description: The system scheduler executes the tasks in parallel and tells the user which task finished first; tasks are handled on a first-come-first-out basis.

Use case: Reduce with all key values
Brief description: The MapReduce job completes the mapper phase by shuffling the tweets in the input data, and the splits are then reduced with respect to their key-value pairs.

3.4.3 Other Nonfunctional Requirements

3.4.3.1 Performance

The system functions by splitting the segmented tweets to run in parallel and compares the performance of each task with the help of its completion time, which is measured in milliseconds.

3.4.3.2 Software Quality Attributes

1. Reliability

The identification of the slower task is done by the scheduler.

2. Availability

The system is available 24×7.

3. Maintainability

The admin shall provide a capability to back up the data.

4. Portability

The output file can be downloaded anywhere.

CHAPTER 4

SYSTEM ANALYSIS

4.1 SYSTEM ARCHITECTURE

Fig 4.1 Depicts the system architecture

4.2 USE CASE DIAGRAM

Fig 4.2: Represents the Usecase diagram

4.3 CLASS DIAGRAM

Fig 4.3: represents the class diagram

4.4 Data Flow Diagram

Level 0:

Level 1:

Fig 4.4: Represents the Data Flow Diagram

4.5 CASE tools for Analysis

UML tool    : StarUML
Coding      : Java
IDE         : Eclipse
Environment : Virtual Machine

CHAPTER 5

DESIGN

5.1 METHODOLOGY AND APPROACH


The objective of our system is to decrease the overall job completion time on a single-node cluster. The application used in our system is wordcount over Twitter tweets containing a large amount of data.

5.1.1 PRE-PROCESSING

The data for our work is collected from various social media services such as Twitter and Wikipedia. Unwanted data is removed at this stage. The posts collected from the various social media sources undergo several pre-processing tasks.

Initially, the data are processed to remove hash tags, punctuation, other special characters, URLs and numbers, and each word is converted to lower case. The processed data is then tokenized on white space.

Next, the data is further processed to convert each word to its base form by a technique called stemming. At the end of this step, the data set contains only necessary and useful information for further processing.

The various processing steps performed at this stage can be summarized as follows (a simplified sketch follows the list):

 Remove special characters.

 Tag Removal (HTML links and URL)

 Parts of speech tagging.

 Perform stemming.

 Hybrid Segmentation
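A minimal sketch, assuming a plain-Java implementation of the first few steps (URL/tag removal, special-character removal, lower-casing and whitespace tokenization); POS tagging and stemming require an NLP library and are only indicated by a comment. The class and method names are illustrative, not the project's actual ones:

    import java.util.ArrayList;
    import java.util.List;

    public class PreProcessSketch {
        // Removes URLs/HTML tags and special characters, lower-cases the text,
        // and tokenizes it on whitespace.
        static List<String> clean(String tweet) {
            String text = tweet
                    .replaceAll("https?://\\S+", " ")      // strip URLs
                    .replaceAll("<[^>]+>", " ")            // strip HTML tags
                    .replaceAll("[^A-Za-z0-9\\s]", " ")    // strip special characters (@, #, punctuation)
                    .toLowerCase();
            List<String> tokens = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);                         // POS tagging and stemming would follow here
                }
            }
            return tokens;
        }
    }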

5.2 INTERFACE DESIGN

Figure 5.2: Pre-processing phase

Figure 5.3: Removal of Special Characters

Figure 5.4: POS tagging

Figure 5.5: Stemming Words

Figure 5.6: Hybrid Segmentation

Fig 5.7: Splitting up tasks

Figure 5.8 Execution of MapReduce

Figure 5.9: Implementing Speculative task

Figure 5.10: Final Output
5.3 BACK END DESIGN:

Figure 5.11: Hadoop Framework

CHAPTER 6

CODING

6.1 ALGORITHM / PSEUDO CODE

Step 1: Collect data from social media services.

Step 2: Remove special characters and convert long sentences into tokens.

Step 3: Perform stemming to convert every word to its base form.

Step 4: Remove stop words from the output of the previous step.

Step 5: Enter the number of tasks to run on the single-node cluster and pass the pre-processed data to the MapReduce tasks.

Step 6: The system identifies the higher-end (slowest) task and designates it as the speculative task to be run again in parallel with the completed tasks.

Step 7: Pass the mapper output to the reducer, where the key represents a word and the value its count. Meanwhile, the system reports the parallel completion time of each task.

6.2 SOURCE CODE

Map.java

package tweet;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper (new MapReduce API): emits (word, 1) for every space-separated token.
public class Map extends Mapper<Object, Text, Text, IntWritable> {

    private final IntWritable ONE = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] csv = value.toString().split(" ");
        for (String str : csv) {
            word.set(str);
            context.write(word, ONE);
        }
    }
}

Reduce.java

package tweet;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer (new MapReduce API): sums the counts emitted for each word.
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text text, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(text, new IntWritable(sum));
    }
}

Main.java

public class Main extends JFrame {

    private JPanel contentPane;
    File file;
    private JEditorPane editorPane;

    public static void main(String[] args) {
        EventQueue.invokeLater(new Runnable() {
            public void run() {
                try {
                    Main frame = new Main();
                    frame.setVisible(true);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    // ... constructor and remaining GUI setup omitted; the listener below belongs to
    // the "Stop Removal" button, which reads POS-tagged text from stem.txt and
    // writes the processed result to stop.txt ...

        btnStopRemoval.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                try {
                    // POS tags of interest (nouns, verbs, adjectives, adverbs, ...)
                    Vector<String> vs = new Vector<String>();
                    vs.add("NN");  vs.add("VBZ"); vs.add("JJ");  vs.add("NNS");
                    vs.add("VB");  vs.add("CC");  vs.add("VBD"); vs.add("JJR");
                    vs.add("VBG"); vs.add("RB");  vs.add("VBN"); vs.add("WRB");
                    vs.add("VBP"); vs.add("JJS"); vs.add("NNP");

                    DataInputStream fin = new DataInputStream(new FileInputStream("stem.txt"));
                    FileOutputStream fout = new FileOutputStream("stop.txt");
                    String t = null;
                    while ((t = fin.readLine()) != null) {
                        String tt[] = t.split("#");
                        String ss[] = tt[1].trim().split(" ");
                        StringBuffer sb = new StringBuffer();
                        for (int i = 0; i < ss.length; i++) {
                            ss[i] = ss[i].trim();
                            String ww[] = ss[i].split("/");   // token/POS-tag
                            System.out.println(ww[1]);
                            if (vs.contains(ww[1]))
                                ss[i] = ww[0] + "/" + ww[1];
                            sb.append(ss[i] + " ");
                        }
                        // Write one processed line per input line.
                        fout.write((tt[0] + "#" + sb.toString() + "\n").getBytes());
                    }
                    fin.close();
                    fout.close();

                    // Display the result in the editor pane, hiding the '#' separator.
                    editorPane.setPage(new File("stop.txt").toURL());
                    String temp = editorPane.getText().replaceAll("#", " ");
                    editorPane.setText(temp);
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        });
        btnStopRemoval.setBounds(236, 63, 145, 31);
        panel.add(btnStopRemoval);

HDProcess.java

// HDProcess (excerpt): Swing window that runs the word-count job on each split,
// times the runs, reports a ranking of the tasks by completion time, re-runs one
// split as the speculative task, and plots all timings in a bar chart.
public HDProcess() {
    setTitle("Text Job - Map / Reduce");
    setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    setBounds(100, 100, 906, 600);
    contentPane = new JPanel();
    contentPane.setBorder(new EmptyBorder(5, 5, 5, 5));
    setContentPane(contentPane);
    contentPane.setLayout(null);

    JPanel panel = new JPanel();
    panel.setBounds(10, 11, 376, 544);
    contentPane.add(panel);
    panel.setLayout(null);

    JButton btnMapreduce = new JButton("Map / Reduce");
    btnMapreduce.addActionListener(new ActionListener() {
        public void actionPerformed(ActionEvent arg0) {
            try {
                System.out.println(inpath + " , " + outpath);
                long st1 = System.currentTimeMillis();
                long ptime = 0;
                long stime = 0;
                ApplicationFrame app = new ApplicationFrame("");
                BarChart chart = new BarChart("Performance Chart", "", "Execution Time (ms)");
                List<Long> numbers = new ArrayList<Long>();
                List<Long> numbers1 = new ArrayList<Long>();

                // Run the whole (un-split) job once.
                WordCount wc = new WordCount(inpath, outpath);
                wc.doMapReduce();
                textArea.setText(wc.log);

                // Run each split task and record its completion time.
                for (int i = 1; i <= nodescnt; i++) {
                    long st = System.currentTimeMillis();
                    File f1 = new File("aspect/aspect" + i + ".txt");
                    inpath = f1.getAbsolutePath();
                    outpath = f1.getParent() + "/output";
                    wc = new WordCount(inpath, outpath);
                    wc.doMapReduce();
                    textArea.append(wc.log);
                    long en = System.currentTimeMillis();
                    ptime = (en - st);
                    System.out.println("Parallel Timing : " + (en - st) + " ms");
                    textArea.append("Execution Time : " + (en - st) + " ms");
                    chart.addValue(ptime, " Pro R" + (i), "");
                    numbers.add(ptime);
                    numbers1.add(ptime);
                }

                // Rank the split tasks by completion time and report the ranking.
                Collections.sort(numbers);
                String mess = "";
                int c = 1;
                for (int j = 0; j < numbers.size(); j++) {
                    int ind = numbers.indexOf(numbers1.get(j)) + 1;
                    mess += c + " Higher end Node is " + ind + "\n";
                    c++;
                }
                JOptionPane.showMessageDialog(new JFrame("Find Higher End Node"), mess);

                // The rank of the last task's time determines which split is re-run
                // as the speculative task.
                int ind = numbers.indexOf(numbers1.get(nodescnt - 1)) + 1;
                JOptionPane.showMessageDialog(new JFrame("Speculative Node"), "Speculative Node :" + ind);
                long st = System.currentTimeMillis();
                File f1 = new File("aspect/aspect" + ind + ".txt");
                inpath = f1.getAbsolutePath();
                outpath = f1.getParent() + "/output";
                wc = new WordCount(inpath, outpath);
                wc.doMapReduce();
                textArea.append(wc.log);
                long en = System.currentTimeMillis();
                ptime = (en - st);
                System.out.println("Parallel Timing : " + (en - st) + " ms");
                textArea.append("Execution Time : " + (en - st) + " ms");
                chart.addValue(ptime, " Spec Node", "");
                numbers.add(ptime);
                numbers1.add(ptime);

                // Total elapsed time of the whole run, compared with the existing approach.
                long en1 = System.currentTimeMillis();
                stime = (en1 - st1);
                System.out.println("Single Timing : " + (en1 - st1) + " ms");
                chart.addValue(stime, "Existing", "");
                chart.createChart();
                app.setContentPane(chart);
                app.setDefaultCloseOperation(app.HIDE_ON_CLOSE);
                app.setSize(800, 450);
                app.setVisible(true);

                // List the output files of the last job in the results table.
                File ff[] = new File(outpath).listFiles();
                Vector<String> vhead = new Vector<String>();
                vhead.add("File Name");
                vhead.add("File Size");
                vhead.add("File path");
                Vector<Vector<String>> vbody = new Vector<Vector<String>>();
                for (int i = 0; i < ff.length; i++) {
                    Vector<String> vr = new Vector<String>();
                    vr.add(ff[i].getName());
                    vr.add(ff[i].length() + "");
                    vr.add(ff[i].getAbsolutePath());
                    vbody.add(vr);
                }
                table.setModel(new DefaultTableModel(vbody, vhead));
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    });
    btnMapreduce.setBounds(10, 11, 130, 30);
    panel.add(btnMapreduce);

WordCount.java

package tweet;

import java.io.File;
import java.io.IOException;
import java.util.*;

import javax.swing.JFrame;
import javax.swing.JOptionPane;

import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    String inFile, outDirPath;
    JobConf conf;
    String log = "";

    WordCount(String inFile, String outDirPath) {
        this.inFile = inFile;
        this.outDirPath = outDirPath;
    }

    // Configures and runs the word-count job, collecting progress details in the log.
    void doMapReduce() throws IOException, InterruptedException, ClassNotFoundException {
        try {
            conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Remove any previous output directory so the job can run again.
            File dir = new File(outDirPath);
            if (dir.exists()) {
                FileUtil.fullyDelete(dir);
            }

            FileInputFormat.setInputPaths(conf, inFile);
            FileOutputFormat.setOutputPath(conf, new Path(outDirPath));

            RunningJob rj = JobClient.runJob(conf);
            log = log + rj.getJobName().toString() + "\n";
            log = log + rj.getJobState() + "\n";
            log = log + rj.getTrackingURL().toString() + "\n";
            log = log + rj.getClass().toString() + "\n";
            log = log + rj.mapProgress() + "\n";
            log = log + rj.reduceProgress() + "\n";
            log = log + rj.getCounters().toString() + "\n";
        } catch (Exception ex) {
            JOptionPane.showMessageDialog(new JFrame(), ex.getMessage());
        }
    }

    // Mapper (classic mapred API): emits (word, 1) for every whitespace-separated token.
    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer (classic mapred API): sums the counts for each word.
    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}

CHAPTER 7

TESTING

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or the finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test, and each type addresses a specific testing requirement.

7.1 Validation Testing:

Test case ID | Test case title | Purpose | Pre-requirements | Test data | Expected output | Pass/Fail

T01 | Remove special characters | To extract the special characters such as @, !, … | Dataset | Click the required button | All special characters are removed | Pass

T02 | Tag removal | Remove the URL and HTML links | Dataset | Click the required button | Produces data without any HTML link | Pass

T03 | POS tagging | To analyse the grammar of the input | Dataset | Click the required button | Reproduces the dataset with parts of speech denoted | Pass

T04 | Stemming words | Normalize the data into its base form | Dataset | Click the required button | Converts participle words into their normal form | Pass

T05 | Hybrid segmentation | Eliminates the numbering of the records read | Dataset | Click the required button | Shuffles the records | Pass

T06 | MapReduce | Analyse the input records with (key, value) pairs | Dataset | Click the required button | Maps and reduces the input records | Pass

Table 7.1: Represents the validation testing

7.2 Unit Testing:

Test case ID | Test case title | Purpose | Pre-requirements | Test data | Expected output | Pass/Fail

T01 | Splitting task | For parallel execution | Enabled in the scheduler | Text field to enter the number of tasks | The entered number of tasks is accepted | Pass

T02 | Identify straggler | Main objective of the project | Processed in coding | Data processing timings | Alerts the user in a message box | Pass

Table 7.2: Represents the unit testing

CHAPTER 8

IMPLEMENTATION

The implementation stage involves the following tasks.

 Correct planning.

 Collection of Datasets

 Design of methods to achieve the changeover.

 Evaluation of the methods.

8.1 Problems Faced

 Configuring secure shell (SSH) for resource development.

 A high-end system configuration is required to run the software in a virtual machine.

 Installing the Hadoop framework plug-in.

 Connecting the DFS to the Eclipse IDE.

8.2 Lessons Learnt

While developing this project, we came across many experiences:

 Before starting the project, we must have a proper plan for it.

 We should not jump into the code directly; the project should first be analysed thoroughly.

 If the Hadoop framework is installed on Windows directly with any virtualization software, the user needs to set the JAVA_HOME path correctly in the environment settings through the Control Panel.

CHAPTER 9

CONCLUSION AND FUTURE PLANS

9.1 CONCLUSION

Hadoop provides an open-source implementation of the MapReduce framework, but its design poses performance challenges due to data skew. Apache Hadoop does not fix or diagnose a slow-running task; this project tries to detect when a task is running slower than expected and launches an equivalent task as a backup, called a speculative task. We have implemented a wordcount application running on the Hadoop framework which speculatively re-runs tasks according to their completion time on a single-node cluster. Our performance is measured in milliseconds, and it varies with the number of tasks the job is split into.

9.2 FUTURE PLAN

This project was done on the Hadoop framework and implemented on a single-node Hadoop cluster setup. In future it may be extended, and we will try to implement it on a multi-node Hadoop setup, since that is cost-effective and scalable. We will then work on supplying input dynamically and on connecting two systems for parallel execution.

CHAPTER 10

REFERENCES

1. Y. Guo, J. Rao, C. Jiang, and X. Zhou, “Moving Hadoop into the Cloud with Flexible Slot Management and Speculative Execution,” IEEE.

2. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia, “Skew-resistant parallel processing of feature-extracting scientific user-defined functions,” in Proc. 1st ACM Symp. Cloud Comput., 2010.

3. A. Verma, L. Cherkasova, and R. H. Campbell, “Resource provisioning framework for MapReduce jobs with performance goals,” in Proc. ACM/IFIP/USENIX 12th Int. Middleware Conf., 2011.

4. R. C. Chiang and H. H. Huang, “TRACON: Interference-aware scheduling for data-intensive applications in virtualized environments,” in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011.

5. A. Gulati, A. Holler, M. Ji, G. Shanmuganathan, C. Waldspurger, and X. Zhu, “VMware distributed resource management: Design, implementation and lessons learned,” VMware Tech. J., vol. 1, pp. 45–64, 2012.

6. F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar, “Tarazu: Optimizing MapReduce on heterogeneous clusters,” in Proc. 17th Int. Conf. Architectural Support Program. Lang. Operating Syst., 2012.

7. R. Gandhi, D. Xie, and Y. C. Hu, “PIKACHU: How to rebalance load in optimizing MapReduce on heterogeneous clusters,” in Proc. USENIX Annu. Tech. Conf., 2013.

8. Z. Zhang, L. Cherkasova, and B. T. Loo, “AutoTune: Optimizing execution concurrency and resource usage in MapReduce workflows,” in Proc. 10th USENIX/ACM Int. Conf. Autonomic Comput., 2013.

9. J. Tan, et al., “DynMR: Dynamic MapReduce with reduce task interleaving and map task backfilling,” in Proc. 9th Eur. Conf. Comput. Syst., 2014, Art. no. 2.

10. E. Coppa and I. Finocchi, “On data skewness, stragglers, and MapReduce progress indicators,” in Proc. 6th ACM Symp. Cloud Comput., 2015.
