MapReduce is the core processing component of the distributed framework Hadoop and is
written in Java. A MapReduce job runs in two phases:

Map Phase – the data transformation and pre-processing step. Data is read as key-value
pairs and, after processing, is sent on to the reduce phase.

Reduce Phase – the data is aggregated and the business logic is applied in this phase; the
result is then passed to the next big data tool in the data pipeline for further processing.

The standard Hadoop MapReduce model has Mappers, Reducers, Combiners, a Partitioner,
and a sort step, all of which manipulate the structure of the data to fit the business requirements.
To manipulate the structure of the data, the map and reduce phases make use of data
structures such as arrays to perform the various transformation operations.
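
As a worked illustration of this key-value flow for a word count (the sample line below is only
for illustration and is not part of the exercise input):

Input line:          big data big
Map output:          (big, 1) (data, 1) (big, 1)
After shuffle/sort:  (big, [1, 1]) (data, [1])
Reduce output:       (big, 2) (data, 1)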

Ex. No. 2:

AIM:
Word count program to demonstrate the use of Map and Reduce tasks

STEPS:
1. Analyze the input file content
2. Develop the code
a. Writing a map function
b. Writing a reduce function
c. Writing the Driver class
3. Compiling the source
4. Building the JAR file
5. Starting the DFS
6. Creating Input path in HDFS and moving the data into Input path
7. Executing the program

$ cd ~
$ sudo mkdir wordcount
$ cd wordcount
$ sudo nano WordCount.java

Wordcount program:
import java.io.IOException;
import java.util.*;                  // StringTokenizer, Iterator
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;     // job configuration classes
import org.apache.hadoop.io.*;       // Writable types (Text, IntWritable, ...)
import org.apache.hadoop.mapred.*;   // classic (org.apache.hadoop.mapred) MapReduce API
import org.apache.hadoop.util.*;

public class WordCount
{
    // Mapper: reads one line at a time and emits (word, 1) for every token in the line
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
    // The same class also serves as the combiner because summation is associative.
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
        {
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job (key/value classes, mapper, combiner, reducer,
    // input/output formats and paths) and submits it
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // runs the job and blocks until it completes
    }
}

Start the HDFS and YARN daemons


$ start-all.sh
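
To confirm that the daemons started, the jps command should list processes such as NameNode,
DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact set depends on the
installation):

$ jps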

To check the hadoop classpath


$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop
/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoo
p/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduc
e/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/
hadoop/share/hadoop/yarn/*

Grant full permissions on the ‘wordcount’ folder so the current user can write to it


$ sudo chmod 777 wordcount/

Run javac
$ cd ~
$ cd wordcount
$ javac WordCount.java -cp $(hadoop classpath)

Three class files are created: the driver class WordCount plus the nested Map and Reduce classes. To check, type
$ ls
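
If the compilation succeeded, the listing should show something like the following (wc.jar will
appear only after the next step):

WordCount$Map.class  WordCount$Reduce.class  WordCount.class  WordCount.java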

Create a JAR file to package these class files


$ jar cf wc.jar WordCount*.class
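
To verify what went into the jar (optional), list its contents:

$ jar tf wc.jar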

Create the input and output directories in the Hadoop file system (HDFS)


$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /output
$ hadoop fs -ls /

Create a text file and copy it into the input folder in HDFS
$ nano hello.txt
Data transformation and pre-processing step. Data is input in terms of key value pairs and
after processing is sent to the reduce phase.

$ hadoop fs -put hello.txt /input
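
To confirm that the file reached HDFS, its contents can be printed back (optional):

$ hadoop fs -cat /input/hello.txt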

To list all files and folders recursively


$ hadoop fs -lsr /

Execute the program

$ hadoop jar wc.jar WordCount /input /output/out1

The last two arguments are the input directory (which holds the input file) and the output
location where the result will be generated in the ‘out1’ folder. The ‘out1’ directory must
not already exist; Hadoop creates it and fails if it is already present.
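
If a previous run has already created the output directory, remove it first, for example:

$ hadoop fs -rm -r /output/out1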

The output is written to the part-00000 file inside the output directory

To check your output


$ hadoop fs -cat /output/out1/part-00000
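
For the hello.txt created above, the output should look roughly like the following (Text keys
are sorted in byte order, so the capitalised “Data” comes first, and tokens keep their
punctuation because StringTokenizer splits only on whitespace):

Data    2
after   1
and     2
in      1
input   1
is      2
...
value   1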

In case of any error, the following fixes may help:
1. Edit the file hadoop-env.sh in /usr/local/hadoop/etc/hadoop

Add the Hadoop native library directory to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH

2. Include the native library path in the .bashrc file:


export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

3. Java-related errors


In the .bashrc file, include the following line after the case statement
“case ${HADOOP_OS_TYPE}”:
export HADOOP_OPTS="--add-modules java.activation"

4. Any error in YARN: edit the following file


/usr/local/hadoop/etc/hadoop/yarn-site.xml

$ sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml


Add the following inside the <configuration> tag:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
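
After editing yarn-site.xml, restart the YARN daemons so that the change takes effect
(assuming the standard sbin scripts are on the PATH):

$ stop-yarn.sh
$ start-yarn.sh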

Steps to rectify SSH errors


If the SSH connection fails, these general tips might help:

● Enable debugging with ssh -vvv localhost and investigate the error in
detail.
● Check the SSH server configuration in /etc/ssh/sshd_config, in
particular the options PubkeyAuthentication (which should be set to yes)
and AllowUsers (if this option is active, add the hduser user to it). If you
made any changes to the SSH server configuration file, you can force a
configuration reload with sudo /etc/init.d/ssh reload.
