KATHIRAVAN R 218003048
SHERWIN A 218003080
SRIKANTH S 218003091
SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India
BONAFIDE CERTIFICATE
Examiner I Examiner II
Department of Computer Science and Engineering
SHANMUGHA ARTS, SCIENCE, TECHNOLOGY & RESEARCH ACADEMY
SASTRA DEEMED TO BE UNIVERSITY
(A University Under Section 3 of the UGC Act, 1956)
SRINIVASA RAMANUJAN CENTRE
Kumbakonam – 612 001
Tamil Nadu, India
DECLARATION
Date  :
Place : Kumbakonam

Signature :
1.
2.
3.

Name :
1.
2.
3.
ACKNOWLEDGMENT
We pay our sincere obeisance to God ALMIGHTY, who in His infinite mercy
showered His blessings to make us what we are today.
First, we would like to express our sincere thanks to our Vice Chancellor
Prof. R. Sethuraman and our Registrar Dr. G. Bhalachandran for having given us the
opportunity to be students of this esteemed institution.
We would like to place on record the benevolent approach and painstaking efforts
of guidance and correction of Dr. V. Kalaichelvi, M.E., Ph.D., and Smt. R. Sujarani, M.C.A.,
M.Tech., (Ph.D.), the project coordinators, and all the department staff, to whom we owe our
heartfelt thanks forever.
Without the support of our parents and friends this project would never have become
a reality.
We dedicate this work to all our well-wishers with love and affection.
TABLE OF CONTENTS
ABSTRACT
This project proposes a new perspective on tackling the skew problem in Hadoop
applications. Rather than redistributing data among skewed tasks, this project tries to
balance the processing time of tasks even in the presence of data imbalance. Specifically,
tasks with more data or more expensive data records are accelerated by giving them more
resources.
KEY WORDS: Stragglers, Name Node, Data Node, Hadoop Distributed File System, Speculative
Execution
LIST OF TABLES
LIST OF FIGURES
CHAPTER 1
INTRODUCTION
Hadoop Environment
Hadoop is built around the MapReduce programming model, in which a
computation is expressed as a pair of Map and Reduce functions. Google started
using the MapReduce framework for its large-scale indexing from its early days.
As the data generation rate keeps increasing, data centers use commodity
hardware for data processing. The commodity hardware used for this processing
is not reliable and is more prone to failure, so the framework itself must
tolerate such failures.
Fig 1.1 High level Hadoop architecture
A heterogeneous environment contains nodes of differing capability, whereas a
homogeneous environment contains equally capable nodes. In a heterogeneous
environment, some of the compute nodes are faster than others. The slower
compute nodes are called stragglers, and the others are called fast nodes.
These fast nodes will finish their tasks early and wait for the stragglers to
finish theirs.
Advantages:
Job completion time signifies the time a job takes from its start to its
completion. The main reason to speculate a task is that it might reduce the job
completion time. Enabling speculation will reduce the job completion time only
if the speculative copy finishes before the original task does.
For instance, consider two slow tasks, one with 30% and the other with 80%
progress. The Job Tracker identifies these tasks as slow and schedules
speculative copies on fast nodes. As the first slow task is still in its
starting phase, the possibility that its speculative copy completes first is
higher than for the second. Hence, early detection of slow tasks is very
necessary for a speculative task to overtake the original task. This experiment
notes down the start and end times of a job to measure job completion time,
which is expressed in seconds.
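As a rough illustration of this detection logic (a minimal sketch, not the
project's code and not Hadoop's actual scheduler, which is more elaborate), a
task can be flagged for speculation when its progress rate falls well below the
average rate of its peers:

import java.util.List;

// Minimal sketch: flag a task as a straggler when its progress rate
// (progress fraction per second) is well below the mean rate of all
// running tasks. The 0.5 threshold is an assumed tuning constant.
public class StragglerDetector {
    /** Each entry of allTasks is {progressFraction, runtimeSeconds}. */
    public static boolean isStraggler(double progress, double runtimeSec,
                                      List<double[]> allTasks) {
        double rate = progress / runtimeSec;
        double mean = 0;
        for (double[] t : allTasks) {
            mean += t[0] / t[1];
        }
        mean /= allTasks.size();
        return rate < 0.5 * mean;
    }
}

On the example above, the 30% task has the lower progress rate and is therefore
the better candidate for speculation.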
PROBLEM STATEMENT
Existing System:
Stragglers increase the overall job completion time and also affect the
performance of the system. One of the reasons for a straggler is data skew: an
imbalance in the amount or cost of the data records assigned to individual tasks.
Proposed System:
If a straggler task has been identified, then a copy of that task will be run
on a fast node; the proposed solution may decrease the overall job completion time.
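To see why, here is a small back-of-the-envelope sketch (hypothetical numbers,
not measured results): a job finishes when its slowest task finishes, and a
speculated task finishes at the earlier of its two attempts.

// Hypothetical timings in seconds; illustrates why speculation can help.
public class SpeculationEffect {
    public static void main(String[] args) {
        long slowestFastTask = 45;  // slowest of the normal tasks
        long stragglerAlone = 120;  // straggler without speculation
        long backupCopy = 50;       // fresh copy of the straggler on a fast node

        // Job time = max over tasks; the speculated task ends at the
        // earlier of its original and backup attempts.
        long withoutSpeculation = Math.max(slowestFastTask, stragglerAlone);
        long withSpeculation = Math.max(slowestFastTask,
                Math.min(stragglerAlone, backupCopy));
        System.out.println(withoutSpeculation + " s vs " + withSpeculation + " s");
    }
}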
Various reasons for a slow task:
1. A task takes more time than expected. Since Hadoop doesn't try to
diagnose the slow task, it instead tries to detect it and run a backup copy
of it.
2. In a cluster with hundreds of nodes, problems with hardware failure are common.
CHAPTER 2
PHASE      DESCRIPTION      DURATION
mapper
reducer
Hadoop
CHAPTER 3
SOFTWARE REQUIREMENTS SPECIFICATION
3.1 Introduction
This chapter presents the software requirements of the system, which detects
straggler tasks and handles their speculative scheduling.
3.1.1 Purpose
The SRS document also contains some general information about the system, like
its functions, constraints, and interface information.
The user uploads the data into a directory as a CSV or text file, where the
special characters are tokenized and the records read in are segmented.
Once the pre-processing is completed, the system asks the user to choose the
number of task splits so that the job runs in a parallel manner and the overall
job completion time can be analyzed and verified.
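A simplified sketch of such a split is shown below (the aspect/aspectN.txt
naming follows the file names used in the source code of Chapter 6; the actual
splitting logic is an assumption):

import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: distribute the lines of the pre-processed input
// round-robin over N task files named aspect/aspect1.txt .. aspectN.txt.
public class TaskSplitter {
    public static void split(String inputFile, int n) throws IOException {
        new File("aspect").mkdirs();
        List<PrintWriter> outs = new ArrayList<PrintWriter>();
        for (int i = 1; i <= n; i++) {
            outs.add(new PrintWriter(new FileWriter("aspect/aspect" + i + ".txt")));
        }
        BufferedReader in = new BufferedReader(new FileReader(inputFile));
        String line;
        int k = 0;
        while ((line = in.readLine()) != null) {
            outs.get(k++ % n).println(line);
        }
        in.close();
        for (PrintWriter out : outs) {
            out.close();
        }
    }
}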
Since the project works under the Hadoop framework, MapReduce plays the vital
role of completing the split tasks and running them in a parallel manner. When
the records are mapped and reduced, the system notifies the user of the job
completion time. After the parallel jobs complete, the system alerts the user
about which task completed last, identifying it as the higher-end task. The
user is then prompted with the speculative task; that particular task gets
executed and its completion time is denoted.
This project is intended to facilitate the user and to give a great start in
the world of Big Data. The main functions of the system are:
Data upload
Pre-processing
Running the tasks in a parallel manner
Executing MapReduce
Visualizing the completion times in a performance chart
User : The user collects a large volume of data and uploads it for analysis
and segmentation, to count the essential words.
Once the pre-processing is over, the system asks the user to split the task
for parallel execution, since the application works under the Hadoop framework.
a) Mapper Phase
The mapper job involves three sub-tasks: Mapper, Combiner, and Partitioner. The
Combiner combines the generated mapped pairs, and the Partitioner splits the
data into small partitions, after which the (key, value) pairs are shuffled to
the reducers.
The text from the input text file is tokenized into words to form key-value
pairs for all the words present in the input text file. The key is the word
itself and the value is a count of 1.
The mapper phase in the WordCount example will split the string into
individual tokens, i.e., words. In this case, the entire sentence will be split
into 5 tokens.
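For instance, assuming the example sentence is "An elephant is an animal"
(which matches the 5-token count), the mapper emits one (key, value) pair per
token:

(An, 1)  (elephant, 1)  (is, 1)  (an, 1)  (animal, 1)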
b) Shuffler Phase
After the map phase execution is completed successfully, the shuffle phase
begins: the (key, value) pairs generated in the map phase are taken as input
and then sorted in alphabetical order of their keys.
c) Reducer Phase
In the reduce phase, all identical keys are grouped together and the values for
similar keys are added up to find the number of occurrences of each word
generated by the map phase. The reducer phase thus takes the output of the
shuffle phase and produces unique keys with their values added up. In our
example, "An elephant is an animal", all the counts for a given word are added up.
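Putting the three phases together for the same sentence (and assuming the words
were lower-cased during pre-processing, as described in Chapter 5):

after map     : (an, 1) (elephant, 1) (is, 1) (an, 1) (animal, 1)
after shuffle : (an, [1, 1]) (animal, [1]) (elephant, [1]) (is, [1])
after reduce  : (an, 2) (animal, 1) (elephant, 1) (is, 1)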
All the data entries will be correct and up to date. This software is developed
using Java as the front end and Hadoop as the back end. The user can only
upload the files; the maintenance and scheduling of the jobs are handled by the
system.
Processor : i3
RAM : 4 GB
Software : Hadoop (installed)
The SRS describes the system's responses in sequence, i.e., what services it
will provide to the user and the information needed to produce the required
outputs. Here, the system segments the input data so that the scheduler can run
the resulting tasks in parallel.
3.4.2.1 Use Case Diagram
3.4.2.2 Use Case Description
Usecase: Data Upload
Brief Description: The user uploads the dataset into the file storage to
analyze and segment it.
Initial step-by-step description:
The user has to start the Hadoop services, such as the NameNode and
DataNode.
Step 1: The user collects a dataset of Twitter tweets.
Step 2: Upload the dataset to the storage for analysis.
Usecase: Pre-processing
Brief Description: The uploaded input gets processed in several steps, as
explained in the product functions.
Initial step-by-step description:
The user pre-processes the data to analyze and count the essential words and
remove the non-essential words.
Step 1: The user pre-processes the data by clicking the pre-processing option
provided in the interface.
Step 2: The user normalizes the words to ease processing.
Step 3: The admin stores the selected data in a .txt file.
Step 4: If the data has been segmented successfully, the user gets into the next
part of the data analysis.
Brief Description: Once the segmentation process has been completed, the mapper
phase begins.
Usecase: Identify straggler and speculate
Brief Description: The system scheduler executes the tasks in parallel and
tells the user which task ended first; the results are reported in first-come,
first-out order.
Brief Description: The MapReduce job has completed the mapper phase by
shuffling the tweets in the input data, which are then split and reduced with
respect to their (key, value) pairs.
3.4.3 Other Nonfunctional Requirements
3.4.3.1 Performance
The system runs the tasks in a parallel manner and compares the performance of
each task with the help of a performance chart.
1. Reliability
2. Availability
3. Maintainability
4. Portability
CHAPTER 4
SYSTEM ANALYSIS
4.2 USE CASE DIAGRAM
4.3 CLASS DIAGRAM
4.4 Data Flow Diagram
Level 0:
Level 1:
4.5 CASE tools for Analysis
Coding : Java
IDE : Eclipse
CHAPTER 5
DESIGN
5.1.1 PRE-PROCESSING
The data for our work is collected from various social media services like
Twitter and Wikipedia. Unwanted data is removed at this stage so that the posts
are ready for the subsequent tasks.
Initially, the data is processed to remove hash tags, punctuation, other special
characters, URLs, and numbers, and each word is converted to lower case.
Then, the data is further processed to convert each word to its base form by the
technique called stemming. At the end of this step, the data set contains only
clean, normalized words.
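A minimal Java sketch of this cleaning stage is given below (the regex patterns
and the naive suffix-stripping stemmer are simplified assumptions; the project
itself uses POS tagging and a fuller stemming step):

import java.util.Locale;

// Simplified sketch of the pre-processing described above. Real stemming
// (e.g., the Porter algorithm) is more involved than suffix stripping.
public class Preprocess {
    public static String clean(String post) {
        String s = post.toLowerCase(Locale.ROOT)
                .replaceAll("https?://\\S+", " ")  // remove URLs
                .replaceAll("#\\w+", " ")          // remove hash tags
                .replaceAll("[^a-z\\s]", " ");     // remove punctuation and digits
        StringBuilder out = new StringBuilder();
        for (String w : s.trim().split("\\s+")) {
            w = w.replaceAll("(ing|ed|es|s)$", ""); // naive stemming
            if (!w.isEmpty()) {
                out.append(w).append(' ');
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(clean("Loving #Hadoop! See https://example.com for 10 demos"));
        // prints: lov see for demo
    }
}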
Hybrid Segmentation
5.2 INTERFACE DESIGN
Figure 5.2: Pre-processing phase
Figure 5.4: POS tagging
Figure 5.5: Stemming Words
Figure 5.8: Execution of MapReduce
Figure 5.10: Final Output
5.3 BACK END DESIGN
CHAPTER 6
CODING
6.1 ALGORITHM
Step 2: Remove special characters and convert long sentences into tokens.
Step 3: Perform stemming to convert every word to its base form.
Step 5: Enter the number of tasks to run on the single-node cluster and pass
the resulting splits to MapReduce.
Step 6: The system identifies the higher-end task and denotes it as the
speculative task.
Step 7: Pass the mapper output into the reducer, where the key represents a word
and the value its count. Meanwhile, the system reports the parallel completion
time of the tasks.
6.2 SOURCE CODE
Map.java

package tweet;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Word-count mapper: emits (word, 1) for every token in the input line.
public class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String str = tokenizer.nextToken();
            word.set(str);
            context.write(word, ONE);
        }
    }
}
Reduce.java

package tweet;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count reducer: sums the counts emitted for each word.
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
Main.java

package tweet;

import java.awt.EventQueue;
import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.*;
import java.util.Vector;
import javax.swing.*;

public class Main extends JFrame {

    JPanel panel = new JPanel();
    JEditorPane editorPane = new JEditorPane();
    JButton btnStopRemoval = new JButton("Stop Word Removal");
    File file;

    public static void main(String[] args) {
        EventQueue.invokeLater(new Runnable() {
            public void run() {
                try {
                    Main frame = new Main();
                    frame.setVisible(true);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
    }

    public Main() {
        btnStopRemoval.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                try {
                    // Part-of-speech tags to keep; every other word is dropped.
                    Vector<String> vs = new Vector<String>();
                    vs.add("NN");  vs.add("VBZ"); vs.add("JJ");
                    vs.add("NNS"); vs.add("VB");  vs.add("CC");
                    vs.add("VBD"); vs.add("JJR"); vs.add("VBG");
                    vs.add("RB");  vs.add("VBN"); vs.add("WRB");
                    vs.add("VBP"); vs.add("JJS"); vs.add("NNP");

                    BufferedReader fin = new BufferedReader(
                            new InputStreamReader(new FileInputStream("stem.txt")));
                    FileOutputStream fout = new FileOutputStream("stop.txt");
                    String t = null;
                    while ((t = fin.readLine()) != null) {
                        // Each line is "id#word/TAG word/TAG ...".
                        String tt[] = t.split("#");
                        StringBuffer sb = new StringBuffer();
                        String ss[] = tt[1].split(" ");
                        for (int i = 0; i < ss.length; i++) {
                            ss[i] = ss[i].trim();
                            String ww[] = ss[i].split("/");
                            if (ww.length > 1 && vs.contains(ww[1])) {
                                ss[i] = ww[0] + "/" + ww[1];
                                sb.append(ss[i] + " ");
                            }
                        }
                        fout.write((tt[0] + "#" + sb.toString() + "\n").getBytes());
                    }
                    fin.close();
                    fout.close();
                    // Show the filtered file in the editor pane.
                    editorPane.setPage(new File("stop.txt").toURI().toURL());
                } catch (Exception ex) {
                    ex.printStackTrace();
                }
            }
        });
        panel.add(btnStopRemoval);
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        getContentPane().add(panel);
        setSize(600, 400);
    }
}
HDProcess.java

package tweet;

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import java.io.File;
import java.util.Collections;
import java.util.Vector;
import javax.swing.*;
import javax.swing.table.DefaultTableModel;

public class HDProcess extends JFrame {

    JPanel contentPane = new JPanel();
    JPanel panel = new JPanel();
    JButton btnMapreduce = new JButton("MapReduce");
    JTextArea textArea = new JTextArea();
    JTable table = new JTable();
    PerformanceChart chart;   // the project's chart panel; its class is not shown in this listing
    String inpath, outpath;
    int nodescnt;             // number of task splits chosen by the user

    public HDProcess() {
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setContentPane(contentPane);
        contentPane.setLayout(null);
        contentPane.add(panel);
        panel.setLayout(null);

        btnMapreduce.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                try {
                    long st1 = System.currentTimeMillis();
                    long ptime = 0;
                    long stime = 0;

                    // Run word count over the full input first.
                    WordCount wc = new WordCount(inpath, outpath);
                    wc.doMapReduce();
                    textArea.setText(wc.log);

                    Vector<Long> numbers = new Vector<Long>();   // times, later sorted
                    Vector<Long> numbers1 = new Vector<Long>();  // times in arrival order

                    // Run every split (aspect file) and record its completion time.
                    for (int i = 1; i <= nodescnt; i++) {
                        long st = System.currentTimeMillis();
                        File f1 = new File("aspect/aspect" + i + ".txt");
                        inpath = f1.getAbsolutePath();
                        outpath = f1.getParent() + "/output";
                        wc = new WordCount(inpath, outpath);
                        wc.doMapReduce();
                        textArea.append(wc.log);
                        long en = System.currentTimeMillis();
                        ptime = (en - st);
                        textArea.append("Task " + i + " : " + ptime + " Seconds\n");
                        numbers.add(ptime);
                        numbers1.add(ptime);
                    }

                    // Rank the tasks by completion time.
                    Collections.sort(numbers);
                    int c = 1;
                    String rank = "";
                    for (int j = 0; j < numbers.size(); j++) {
                        rank += "Rank " + c + " : " + numbers.get(j) + " Seconds\n";
                        c++;
                    }
                    JOptionPane.showMessageDialog(new JFrame(), rank);

                    // The last task to finish is treated as the straggler.
                    int ind = numbers.indexOf(numbers1.get(nodescnt - 1)) + 1;
                    JOptionPane.showMessageDialog(new JFrame(),
                            "Speculative task : Task " + ind);

                    // Re-run the straggler as the speculative task and time it.
                    long st = System.currentTimeMillis();
                    File f1 = new File("aspect/aspect" + ind + ".txt");
                    inpath = f1.getAbsolutePath();
                    outpath = f1.getParent() + "/output";
                    wc = new WordCount(inpath, outpath);
                    wc.doMapReduce();
                    textArea.append(wc.log);
                    long en = System.currentTimeMillis();
                    ptime = (en - st);
                    textArea.append("Speculative task : " + ptime + " Seconds\n");
                    numbers.add(ptime);
                    numbers1.add(ptime);

                    long en1 = System.currentTimeMillis();
                    stime = (en1 - st1);
                    textArea.append("Overall completion : " + stime + " Seconds\n");

                    // Plot the per-task completion times in a separate frame.
                    chart.createChart();
                    JFrame app = new JFrame("Performance Chart");
                    app.setContentPane(chart);
                    app.setDefaultCloseOperation(JFrame.HIDE_ON_CLOSE);
                    app.setSize(800, 450);
                    app.setVisible(true);

                    // List the generated output files in the table.
                    Vector<String> vhead = new Vector<String>();
                    vhead.add("File Name");
                    vhead.add("File Size");
                    vhead.add("File path");
                    Vector<Vector<String>> vbody = new Vector<Vector<String>>();
                    File ff[] = new File(outpath).listFiles();
                    for (int i = 0; i < ff.length; i++) {
                        Vector<String> vr = new Vector<String>();
                        vr.add(ff[i].getName());
                        vr.add(ff[i].length() + "");
                        vr.add(ff[i].getAbsolutePath());
                        vbody.add(vr);
                    }
                    table.setModel(new DefaultTableModel(vbody, vhead));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        panel.add(btnMapreduce);
    }
}
Wordcount.java

package tweet;

import java.io.File;
import java.io.IOException;
import java.util.*;
import javax.swing.JFrame;
import javax.swing.JOptionPane;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {

    JobConf conf;
    String inFile, outDirPath;
    String log = "";   // collected status messages shown in the GUI

    WordCount(String inFile, String outDirPath) {
        this.inFile = inFile;
        this.outDirPath = outDirPath;
    }

    public void doMapReduce() throws IOException, ClassNotFoundException {
        try {
            conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            // Hadoop refuses to overwrite an existing output directory,
            // so remove a stale one first.
            File dir = new File(outDirPath);
            if (dir.exists()) {
                FileUtil.fullyDelete(dir);
            }
            FileInputFormat.setInputPaths(conf, inFile);
            FileOutputFormat.setOutputPath(conf, new Path(outDirPath));
            JobClient.runJob(conf);
            log += "wordcount completed for " + inFile + "\n";
        } catch (Exception ex) {
            JOptionPane.showMessageDialog(new JFrame(), ex.getMessage());
        }
    }

    public static class Map extends MapReduceBase implements
            Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}
CHAPTER 7
TESTING
Testing is performed on intermediate work products and/or a finished product. It
is the process of exercising software with the intent of ensuring that the
software system meets its requirements and user expectations and does not fail
in an unacceptable manner. There are various types of tests; the validation
testing performed on this system is described below.
7.1 Validation Testing:
Test ID     : T06
Module      : MapReduce
Description : Analyze the input record with (key, value) pairs
Input       : Dataset required
Action      : Click the button
Expected    : Maps and reduces the input record
Status      : Pass
CHAPTER 8
IMPLEMENTATION
The main steps involved in the implementation are:
Correct planning
Collection of datasets
Setting up the virtual machine
Before starting the project, we must have a proper plan about it.
We should not jump into the code directly; first, the project should be
analyzed thoroughly. After installing the virtual-machine software, the user
needs to set the Java home path so that Hadoop can locate the JDK.
CHAPTER 9
9.1 CONCLUSION
Hadoop MapReduce is a widely used large-scale data processing framework, but its
design poses challenges to performance due to data skew. Apache Hadoop does not
fix or diagnose slow-running tasks; this project tries to detect when a task is
running slower than expected and speculatively launches a copy of that task,
chosen according to completion time, on a single-node cluster.
9.2 FUTURE PLAN
In the future, we plan to implement this work on a multi-node Hadoop setup,
since it is cost-effective and scalable. Then we will work on giving input
dynamically and connecting two systems for parallel execution.
CHAPTER 10
REFERENCES
1. Y. Guo, J. Rao, C. Jiang, and X. Zhou, "Moving Hadoop into the cloud with
flexible slot management and speculative execution," IEEE Transactions on
Parallel and Distributed Systems.
6. F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. Vijaykumar,
"… interleaving and map task backfilling," in Proc. 9th Eur. Conf. Comput. Syst.
10. E. Coppa and I. Finocchi, "On data skewness, stragglers, and MapReduce
progress indicators," in Proc. ACM Symposium on Cloud Computing.