MapReduce is the core processing component of the distributed framework Hadoop and is
written in Java. A MapReduce job runs in two phases:

Map Phase – the data transformation and pre-processing step. Data is read as key-value
pairs and, after processing, is sent on to the reduce phase.

Reduce Phase – the data is aggregated and the business logic is applied in this phase; the
result is then passed to the next big data tool in the data pipeline for further processing.

The standard Hadoop MapReduce model has Mappers, Reducers, Combiners, a Partitioner,
and a sort step, all of which manipulate the structure of the data to fit the business requirements.
To manipulate the structure of the data, the map and reduce phases make use of data
structures such as arrays to perform the various transformation operations.
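
As a worked illustration of this key-value flow for a word count (the sample line below is only
for illustration and is not part of the exercise input):

Input line:          big data big
Map output:          (big, 1) (data, 1) (big, 1)
After shuffle/sort:  (big, [1, 1]) (data, [1])
Reduce output:       (big, 2) (data, 1)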

Ex. No. 2:

AIM:
Word count program to demonstrate the use of Map and Reduce tasks

STEPS:
1. Analyze the input file content
2. Develop the code
a. Writing a map function
b. Writing a reduce function
c. Writing the Driver class
3. Compiling the source
4. Building the JAR file
5. Starting the DFS
6. Creating Input path in HDFS and moving the data into Input path
7. Executing the program

$ cd ~
$ sudo mkdir wordcount
$ cd wordcount
$ sudo nano WordCount.java

Wordcount program:
import java.io.IOException;
import java.util.*;                  // StringTokenizer, Iterator
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;     // job configuration classes
import org.apache.hadoop.io.*;       // Writable types (Text, IntWritable, ...)
import org.apache.hadoop.mapred.*;   // classic (org.apache.hadoop.mapred) MapReduce API
import org.apache.hadoop.util.*;

public class WordCount
{
    // Mapper: reads one line at a time and emits (word, 1) for every token in the line
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
        {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens())
            {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    // Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
    // The same class also serves as the combiner because summation is associative.
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable>
    {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException
        {
            int sum = 0;
            while (values.hasNext())
            {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    // Driver: configures the job (key/value classes, mapper, combiner, reducer,
    // input/output formats and paths) and submits it
    public static void main(String[] args) throws Exception
    {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);   // runs the job and blocks until it completes
    }
}

Start the HDFS and YARN daemons


$ start-all.sh
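
To confirm that the daemons started, the jps command should list processes such as NameNode,
DataNode, SecondaryNameNode, ResourceManager and NodeManager (the exact set depends on the
installation):

$ jps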

To check the hadoop classpath


$ hadoop classpath
/usr/local/hadoop/etc/hadoop:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop
/share/hadoop/common/*:/usr/local/hadoop/share/hadoop/hdfs:/usr/local/hadoop/share/hadoo
p/hdfs/lib/*:/usr/local/hadoop/share/hadoop/hdfs/*:/usr/local/hadoop/share/hadoop/mapreduc
e/*:/usr/local/hadoop/share/hadoop/yarn:/usr/local/hadoop/share/hadoop/yarn/lib/*:/usr/local/
hadoop/share/hadoop/yarn/*

Grant full permissions on the ‘wordcount’ folder so the current user can write to it


$ sudo chmod 777 wordcount/

Run javac
$ cd ~
$ cd wordcount
$ javac WordCount.java -cp $(hadoop classpath)

Three class files are created: the driver class WordCount plus the nested Map and Reduce classes. To check, type
$ ls
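
If the compilation succeeded, the listing should show something like the following (wc.jar will
appear only after the next step):

WordCount$Map.class  WordCount$Reduce.class  WordCount.class  WordCount.java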

Create a JAR file to package these class files


$ jar cf wc.jar WordCount*.class
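
To verify what went into the jar (optional), list its contents:

$ jar tf wc.jar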

Create the input and output directories in the Hadoop file system (HDFS)


$ hadoop fs -mkdir /input
$ hadoop fs -mkdir /output
$ hadoop fs -ls /

Create a text file and copy it into the input folder in HDFS
$ nano hello.txt
Data transformation and pre-processing step. Data is input in terms of key value pairs and
after processing is sent to the reduce phase.

$ hadoop fs -put hello.txt /input
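
To confirm that the file reached HDFS, its contents can be printed back (optional):

$ hadoop fs -cat /input/hello.txt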

To list all files and folders recursively


$ hadoop fs -lsr /

Execute the program

$ hadoop jar wc.jar WordCount /input /output/out1

The last two arguments are the input directory (which holds the input file) and the output
location where the result will be generated in the ‘out1’ folder. The ‘out1’ directory must
not already exist; Hadoop creates it and fails if it is already present.
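
If a previous run has already created the output directory, remove it first, for example:

$ hadoop fs -rm -r /output/out1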

The output is written to the part-00000 file inside the output directory

To check your output


$ hadoop fs -cat /output/out1/part-00000
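
For the hello.txt created above, the output should look roughly like the following (Text keys
are sorted in byte order, so the capitalised “Data” comes first, and tokens keep their
punctuation because StringTokenizer splits only on whitespace):

Data    2
after   1
and     2
in      1
input   1
is      2
...
value   1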

In case of any error, the following fixes may help:
1. Edit the file hadoop-env.sh in /usr/local/hadoop/etc/hadoop

Add the Hadoop native library directory to LD_LIBRARY_PATH:

export LD_LIBRARY_PATH=/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH

2. Include the native library path in the .bashrc file:


export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib/native"

3. Java-related errors


In the .bashrc file, include the following line after the case statement
“case ${HADOOP_OS_TYPE}”:
export HADOOP_OPTS="--add-modules java.activation"

4. Any error in YARN: edit the following file


/usr/local/hadoop/etc/hadoop/yarn-site.xml

$ sudo gedit /usr/local/hadoop/etc/hadoop/yarn-site.xml


Add the following inside the <configuration> tag:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
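
After editing yarn-site.xml, restart the YARN daemons so that the change takes effect
(assuming the standard sbin scripts are on the PATH):

$ stop-yarn.sh
$ start-yarn.sh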

Steps to rectify SSH errors


If the SSH connection fails, these general tips might help:

● Enable debugging with ssh -vvv localhost and investigate the error in
detail.
● Check the SSH server configuration in /etc/ssh/sshd_config, in
particular the options PubkeyAuthentication (which should be set to yes)
and AllowUsers (if this option is active, add the hduser user to it). If you
made any changes to the SSH server configuration file, you can force a
configuration reload with sudo /etc/init.d/ssh reload.
