
Hadoop Installation Documentation

by Vanshish Mehra CEG CoE Hadoop

Contents
1. Prerequisites for Pseudo-mode Cluster Installation
   1. Sun Java 6
   2. Adding a dedicated Hadoop system user
   3. Configuring SSH
2. Pseudo-mode Cluster Installation
   1. Configuration
   2. Formatting the name node
   3. Starting your single-node cluster
   4. Running a MapReduce job
3. Prerequisites for Fully-distributed Cluster Installation
   1. Networking
   2. SSH access
4. Fully-distributed Cluster Installation
   1. Configuration
   2. Formatting the name node
   3. Starting your cluster
   4. Running a MapReduce job
5. Hadoop Web Interfaces
   1. HDFS Name Node Web Interface
   2. MapReduce Job Tracker Web Interface
   3. Task Tracker Web Interface
6. Points to remember

Prerequisites for Pseudo-mode Cluster Installation


Sun Java 6
Hadoop requires a working Java 1.5.x (aka 5.0.x) installation. However, using Java 1.6.x is recommended for running Hadoop.

Adding a dedicated Hadoop system user


We will use a dedicated Hadoop user account for running Hadoop. While that's not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine.

Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hadoop user we created in the previous section.

First, generate an SSH key for the hadoop user:
$ ssh-keygen -t rsa -P ""

This command creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed to unlock the key without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).

Second, enable SSH access to your local machine with this newly created key:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Also set the permissions of the authorized_keys file in the .ssh folder to 644:
$ chmod 644 $HOME/.ssh/authorized_keys

The final step is to test the SSH setup by connecting to your local machine as the hadoop user. This step is also needed to save your local machine's host key fingerprint to the hadoop user's known_hosts file.
$ ssh localhost
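The key-authorization step can be wrapped in a small helper so that it is safe to run more than once. This is a minimal sketch, not part of the original procedure: the function name authorize_key is hypothetical, and setting the .ssh directory to mode 700 is an assumption based on common sshd expectations.

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys (once) and set permissions.
# authorize_key is a hypothetical helper name, not a Hadoop or OpenSSH command.
authorize_key() {
    pub_key="$1"    # path to the public key, e.g. $HOME/.ssh/id_rsa.pub
    ssh_dir="$2"    # target .ssh directory, e.g. $HOME/.ssh
    auth_file="$ssh_dir/authorized_keys"

    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"    # assumption: the usual restrictive mode for .ssh

    # Append the key only if an identical line is not already present.
    if ! grep -qxF -f "$pub_key" "$auth_file" 2>/dev/null; then
        cat "$pub_key" >> "$auth_file"
    fi
    chmod 644 "$auth_file"
}
```

A typical call, after ssh-keygen has produced the key pair, would be: authorize_key "$HOME/.ssh/id_rsa.pub" "$HOME/.ssh"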

Pseudo-mode Cluster Installation


Download Hadoop 0.20.2 from http://hadoop.apache.org/common/releases.html and extract the hadoop tar.gz file as follows:
$ tar -xvzf hadoop-0.20.2.tar.gz

Configuration
Open conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory. For example, under Cygwin:

export JAVA_HOME=/cygdrive/C/"Program Files"/Java/jdk1.6.0_20

Set the PATH and CLASSPATH variables appropriately in your .profile file and load it.
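A possible set of .profile entries is sketched below. The install location ($HOME/hadoop-0.20.2) and the use of a HADOOP_HOME variable are assumptions made for illustration; adjust them to wherever the tarball was extracted.

```shell
# Sketch of .profile additions (assumed paths, adjust to your system).
# HADOOP_HOME defaults to the extracted tarball in the hadoop user's home.
HADOOP_HOME="${HADOOP_HOME:-$HOME/hadoop-0.20.2}"
export HADOOP_HOME

# Put the Hadoop launcher scripts (hadoop, start-all.sh, ...) on the PATH.
PATH="$HADOOP_HOME/bin:$PATH"
export PATH

# Add the Hadoop core jar to the CLASSPATH, keeping any existing entries.
CLASSPATH="$HADOOP_HOME/hadoop-0.20.2-core.jar${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH
```

After editing, the file can be loaded into the current shell with: . $HOME/.profile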

Edit the core-site.xml file as follows

Edit the hdfs-site.xml file as follows

Edit the mapred-site.xml file as follows
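The three site files can also be generated from the shell. The sketch below is illustrative only: the NameNode address hdfs://localhost:9000 is taken from the commands used later in this document, while the JobTracker port 9001, the replication factor 1, and pointing hadoop.tmp.dir at the pseudo directory are assumptions. The function name write_pseudo_conf is hypothetical.

```shell
#!/bin/sh
# Sketch: write minimal pseudo-mode configuration files.
# Assumed values (not from the original text): port 9001, replication 1,
# hadoop.tmp.dir set to the "pseudo" directory.
write_pseudo_conf() {
    conf_dir="$1"    # e.g. /home/hadoop/hadoop-0.20.2/conf
    tmp_dir="$2"     # e.g. /home/hadoop/hadoop-0.20.2/pseudo
    mkdir -p "$conf_dir"

    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$tmp_dir</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
}
```

For example: write_pseudo_conf conf "$PWD/pseudo" when run from inside the hadoop-0.20.2 folder. With hadoop.tmp.dir set this way, the default dfs.name.dir and dfs.data.dir resolve to dfs/name and dfs/data beneath it.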

Now create a directory named pseudo in the hadoop-0.20.2 folder.

Formatting the name node


To format the namenode:
$ bin/hadoop namenode -format

Starting your single-node cluster


To start all the daemons:
$ bin/start-all.sh

To view the started daemons:
$ jps
This should list the started daemons.
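On a pseudo-mode node, five daemons are normally expected: NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker. The sketch below reads jps output on standard input and prints any of these that are missing; the function name missing_daemons is hypothetical.

```shell
#!/bin/sh
# Sketch: report which expected Hadoop daemons are absent from jps output.
# Reads lines of the form "<pid> <name>" on stdin, prints missing names.
missing_daemons() {
    running=$(awk '{print $2}')    # keep only the process names
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        echo "$running" | grep -qx "$daemon" || echo "$daemon"
    done
}
```

Usage: jps | missing_daemons (no output means all five daemons were found).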

To see files in HDFS:
$ bin/hadoop fs -lsr hdfs://localhost:9000/

To create a directory in HDFS:
$ bin/hadoop fs -mkdir hdfs://localhost:9000/input

To copy files to HDFS:
$ bin/hadoop fs -copyFromLocal input/* hdfs://localhost:9000/input/

Running a MapReduce job


Copy local data to HDFS

Run the MapReduce job

Now we run the WordCount example job:
$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount hdfs://localhost:9000/input hdfs://localhost:9000/output

To copy the result from HDFS to the local file system, use the command below:
$ bin/hadoop fs -copyToLocal hdfs://localhost:9000/output/part-r-00000 .

The output file contains one line per word: the word, a tab, and its count.
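For small inputs, the job's result can be cross-checked locally. The sketch below is a local stand-in written for illustration, not the Hadoop example itself: it splits text on whitespace, counts each word, and prints word, tab, count in byte order. The function name local_wordcount is hypothetical.

```shell
#!/bin/sh
# Sketch: local approximation of the WordCount result for small inputs.
# Reads text on stdin, prints "<word><TAB><count>" sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' |      # one token per line
        grep -v '^$' |            # drop empty lines
        LC_ALL=C sort |           # byte-order sort, independent of locale
        uniq -c |                 # count identical adjacent tokens
        awk '{print $2 "\t" $1}'  # reorder to word<TAB>count
}
```

Usage: cat input/* | local_wordcount > expected.txt, then compare with the copied part-r-00000 file using diff.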

Prerequisites for Fully-distributed Cluster Installation


Assume that the IP address of the master machine is x.x.x.x and that of the slave machine is y.y.y.y. Complete the prerequisites for the pseudo-mode cluster on both machines.

Networking
Both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network with regard to hardware and software configuration, for example by connecting both machines via a single hub or switch and configuring the network interfaces to use a common network. To make it simple, we will assign the IP address x.x.x.x to the master machine and y.y.y.y to the slave machine.

Update /etc/hosts on both machines. For example:
x.x.x.x master
y.y.y.y slave

You need to have root access to update the hosts file.

SSH access
The hadoop user on the master must be able to connect:
1. To its own user account on the master, i.e. ssh master in this context and not ssh localhost.
2. To the hadoop user account on the slave via a password-less SSH login.

If you followed the pseudo-mode cluster prerequisites, you just have to add the hadoop@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hadoop@slave (in this user's $HOME/.ssh/authorized_keys). You can do this manually or use the following command:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave

This command will prompt you for the login password for user hadoop on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.

Finally, test the setup by connecting from master to master:
$ ssh master

and from master to slave:
$ ssh slave
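Once each host has been reached interactively at least once (so that its host key is already in known_hosts), the password-less logins can be re-checked without prompts. This is a sketch under assumptions: the function name check_ssh_hosts and the SSH_CMD override are hypothetical, and BatchMode is used so that ssh fails instead of asking for a password.

```shell
#!/bin/sh
# Sketch: verify non-interactive SSH logins to a list of hosts.
# SSH_CMD is a hypothetical override, useful for dry runs.
check_ssh_hosts() {
    ssh_cmd="${SSH_CMD:-ssh}"
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password.
        if $ssh_cmd -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
        fi
    done
}
```

Usage on the master: check_ssh_hosts master slave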

Fully-distributed Cluster Installation


Configuration
conf/masters
Update conf/masters so that it looks like this:
master

conf/slaves
The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers) will run. We want both the master box and the slave box to act as Hadoop slaves because we want both of them to store and process data. Update conf/slaves so that it looks like this:
master
slave

If you have additional slave nodes, just add them to the conf/slaves file, one per line.

Make the following configuration entries on both the master and slave machines.
$ vi core-site.xml

$ vi hdfs-site.xml

$ vi mapred-site.xml
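The entries for the fully-distributed setup can be sketched with a small generator. The NameNode address hdfs://master:9000 follows the commands in this section, and the hadoop.tmp.dir value is inferred from the paths under "Points to remember"; the JobTracker port 9001 and the replication factor 2 are assumptions. The function names site_xml and write_full_conf are hypothetical.

```shell
#!/bin/sh
# Sketch: generate site files for a two-node cluster.
# Assumed values (not from the original text): port 9001, replication 2.

# Print a Hadoop configuration file for the given name/value pairs.
site_xml() {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
        printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
        shift 2
    done
    echo '</configuration>'
}

# Write the three site files into the given conf directory.
write_full_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir"
    site_xml fs.default.name hdfs://master:9000 \
             hadoop.tmp.dir /home/hadoop/hadoop-0.20.2/full > "$conf_dir/core-site.xml"
    site_xml dfs.replication 2 > "$conf_dir/hdfs-site.xml"
    site_xml mapred.job.tracker master:9001 > "$conf_dir/mapred-site.xml"
}
```

Running write_full_conf conf inside the hadoop-0.20.2 folder on each machine yields identical files on master and slave, which is what the step above asks for.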

Formatting the name node


To format the namenode:
$ bin/hadoop namenode -format

Starting your cluster


To start all the daemons:
$ bin/start-all.sh

To view the started daemons:
$ jps
This should list the started daemons.
Master: NameNode, SecondaryNameNode, JobTracker, DataNode and TaskTracker.
Slave: DataNode and TaskTracker.

To see files in HDFS:
$ bin/hadoop fs -lsr hdfs://master:9000/

To create a directory in HDFS:
$ bin/hadoop fs -mkdir hdfs://master:9000/input

To copy files to HDFS:
$ bin/hadoop fs -copyFromLocal input/* hdfs://master:9000/input/

Running a MapReduce job


Copy local data to HDFS

Run the MapReduce job

Now we run the WordCount example job:
$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount hdfs://master:9000/input hdfs://master:9000/output1

To copy the result from HDFS to the local file system, use the command below:
$ bin/hadoop fs -copyToLocal hdfs://master:9000/output1/part-r-00000 .

The output file contains one line per word: the word, a tab, and its count.

Hadoop Web Interfaces

Hadoop comes with several web interfaces, which are by default available at these locations:
http://localhost:50070/ - web UI for HDFS name node(s)
http://localhost:50030/ - web UI for MapReduce job tracker(s)
http://localhost:50060/ - web UI for task tracker(s)

These web interfaces provide concise information about what's happening in your Hadoop cluster.

HDFS Name Node Web Interface


The name node web UI shows you a cluster summary including information about total/remaining capacity and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine's Hadoop log files.

MapReduce Job Tracker Web Interface


The job tracker web UI provides general job statistics for the Hadoop cluster, lists of running, completed and failed jobs, and a job history log file. It also gives access to the local machine's Hadoop log files (the machine on which the web UI is running).

Task Tracker Web Interface


The task tracker web UI shows you running and non-running tasks. It also gives access to the local machine's Hadoop log files.

Points to remember
If there is any problem in the cluster, go through the logs. They provide the details of the errors.

When starting a fully distributed cluster, if the daemons do not start on the slave machine and the slave's DataNode log shows connection-refused errors for the master, the firewall might be the problem. Turn off the firewall and try again. On a RHEL machine you can turn the firewall off with the following commands (root access is required):
$ /etc/init.d/iptables save
$ /etc/init.d/iptables stop

Ensure that the sshd service is running on your machine. You can start the sshd service on RHEL as follows:
$ /etc/init.d/sshd start

While configuring SSH on the machine, ensure that the user directory is chmod 700 and that the authorized_keys file in the .ssh directory is chmod 644.

If you see the error java.io.IOException: Incompatible namespaceIDs in the logs of a DataNode, chances are you are affected by issue HDFS-107 (formerly known as HADOOP-1212). At the moment, there seem to be two options, as described below.

Option 1: Start from scratch

1. Stop the cluster.
2. Delete the data directory on the problematic DataNode. The directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /home/hadoop/hadoop-0.20.2/full/dfs/data.
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!).
4. Restart the cluster.

If deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be OK during the initial setup/testing), you might give the second approach a try.

Option 2: Updating namespaceID of problematic DataNodes

1. Stop the DataNode.
2. Edit the value of namespaceID in <dfs.data.dir>/current/VERSION to match the value of the current NameNode.
3. Restart the DataNode.

If you followed the instructions in this tutorial, the full paths of the relevant files are:
NameNode: /home/hadoop/hadoop-0.20.2/full/dfs/name/current/VERSION
DataNode: /home/hadoop/hadoop-0.20.2/full/dfs/data/current/VERSION
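Step 2 of Option 2 can be scripted. The sketch below assumes the VERSION files contain a line of the form namespaceID=<number>; the function name sync_namespace_id is hypothetical, and a .bak copy of the DataNode file is kept as a precaution. Run it only while the DataNode is stopped.

```shell
#!/bin/sh
# Sketch: copy the NameNode's namespaceID into a DataNode VERSION file.
# Assumes a "namespaceID=<number>" line in both files.
sync_namespace_id() {
    nn_version="$1"    # NameNode VERSION file
    dn_version="$2"    # DataNode VERSION file

    ns_id=$(sed -n 's/^namespaceID=//p' "$nn_version")
    if [ -z "$ns_id" ]; then
        echo "no namespaceID found in $nn_version" >&2
        return 1
    fi

    cp "$dn_version" "$dn_version.bak"    # keep a backup before editing
    sed "s/^namespaceID=.*/namespaceID=$ns_id/" "$dn_version.bak" > "$dn_version"
}
```

When the NameNode and DataNode are on different machines, read the NameNode's VERSION file on the master first and copy it (or just its namespaceID line) to the DataNode host before calling the function there.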