These are the instructions to set up a fully distributed Hadoop cluster on Amazon EC2.
The main steps are as follows.
A) Create an Amazon instance and connect to it
B) Install Java JDK 1.6 update 45 (jdk-6u45)
C) Copy the Hadoop tarball to the instance and extract it
D) Set up userids and .profile files
E) Specify config files
F) Format name node
G) Start name node and data node
H) Create specific directories in HDFS for mapred userid
I) Start job tracker and task tracker
J) Run word count program
K) Add data nodes to cluster (as many as necessary)
*******************************************************************************************
1) Create an Amazon instance (preferably CentOS) and connect to it as root.
Some Amazon instances will not allow you to connect directly as root.
In that case, connect as ec2-user and switch to root (for example with sudo su -); see the example after this step.
For example:
bash_local_machine$>
ssh -i k1_sqldata.pem root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com
where
k1_sqldata.pem = the file containing key value ssh pair for the instance
root = userid used
ec2-54-219-52-64.us-west-1.compute.amazonaws.com = public DNS of the instance
Make sure the permissions of the .pem file are 400, i.e. readable only by the owner.
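If direct root login is disabled, connect as ec2-user with the same key file and public DNS, then switch to root:
bash_local_machine$>
ssh -i k1_sqldata.pem ec2-user@ec2-54-219-52-64.us-west-1.compute.amazonaws.com
bash_ec2_user$> sudo su -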
2) See if the command "which" is installed. If not, install it.
bash_root$> which file1.txt
(If it says command not found, then run the following command).
bash_root$> yum install which
3) Copy jdk-6u45-linux-x64.bin by scp-ing it from the local filesystem to the Amazon instance
bash_local$>
scp -i k1_sqldata.pem jdk-6u45-linux-x64.bin root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com:~/
4) Move the jdk binary to /usr/local/ directory after connecting to the instance
bash_root$> mv /root/jdk-6u45-linux-x64.bin /usr/local/
5) Make the binary executable by all
bash_root$> chmod 755 /usr/local/jdk-6u45-linux-x64.bin
6) Change over to /usr/local directory
bash_root$> cd /usr/local
7) Run the binary as root and accept the prompts to install the JDK.
bash_usr_local$> ./jdk-6u45-linux-x64.bin
8) Change the name of the extracted JDK directory to something simple (like java). The self-extracting binary typically creates a directory named jdk1.6.0_45.
bash_usr_local$> mv jdk1.6.0_45 java
9) Add the "hdfs" user as "root"
bash_usr_local$> adduser hdfs
10) Make the "hdfs" user part of the root group
bash_root$ usermod -a -G root hdfs
11) Add the "mapred" user as root
bash_root$> adduser mapred
12) Make the "mapred" user part of the root group
bash_root$ usermod -a -G root mapred
13) Add hdfs to the sudoers list for convenience (it can be removed later). A sample entry is shown after this step.
bash_usr_local$> visudo
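Inside visudo, one common entry that grants hdfs full sudo rights (add it below the existing root line) is:
hdfs    ALL=(ALL)       ALL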
14) Switch to the "hdfs" user.
bash_usr_local$> su - hdfs
15) Extract the hadoop tarball after scp-ing it from the local machine to the Amazon instance (a sample scp is shown after this step).
bash_home_hdfs$> tar -xvzf hadoop-0.20.2-cdh3u1.tar.gz
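If the tarball is not on the instance yet, it can be copied the same way as the JDK in step 3 (same key file and public DNS; copying via root and then moving the file into the hdfs home directory is one option):
bash_local$> scp -i k1_sqldata.pem hadoop-0.20.2-cdh3u1.tar.gz root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com:~/
bash_root$> mv /root/hadoop-0.20.2-cdh3u1.tar.gz /home/hdfs/
bash_root$> chown hdfs:root /home/hdfs/hadoop-0.20.2-cdh3u1.tar.gz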
16) Rename the directory
bash_home_hdfs$> mv hadoop-0.20.2-cdh3u1 hadoop
17) Configure the .profile for each one of the "root", "hdfs", and "mapred" userids.
# This is the .profile file
export JAVA_HOME=/usr/local/java
export PATH=$JAVA_HOME/bin:"$PATH"
export HADOOP_HOME=/home/hdfs/hadoop
export PATH=$HADOOP_HOME/bin:"$PATH"
# Ensure that this file is executable (chmod 755 .profile)
# end of .profile file
# Make an entry in .bash_profile file as below
source .profile
18) Change the ownership of the files and directories under the hdfs home directory
bash_home_hdfs$> sudo chown -R hdfs:root *
19) Change the ownership of the home directory itself
bash_home_hdfs$> chown hdfs:root .
20) Change the permissions of the files and directories under the HDFS home directory
bash_home_hdfs$> sudo chmod -R 750 .
21) Specify the config files in the hadoop conf directory
bash_home_hdfs_hadoop_conf$> ls -ltr core-site.xml hdfs-site.xml mapred-site.xml
(Total 3 files)
These files will be provided by the instructor in the class.
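For reference, a minimal sketch of what these three files typically contain is shown below. The property names are the standard ones for Hadoop 0.20 / CDH3; the host names, ports, and values here are placeholders, so use whatever the instructor's files specify.
bash_home_hdfs_hadoop_conf$> cat core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode-public-dns>:9000</value>
  </property>
</configuration>
bash_home_hdfs_hadoop_conf$> cat hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
bash_home_hdfs_hadoop_conf$> cat mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value><jobtracker-public-dns>:9001</value>
  </property>
</configuration>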
22) Format the namenode
bash_home_hdfs$> hadoop namenode -format
23) Start the namenode and datanode as hdfs userid
bash_home_hdfs$> hadoop-daemon.sh start namenode
bash_home_hdfs$> hadoop-daemon.sh start datanode
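To confirm that both daemons came up, list the running Java processes with jps (it ships with the JDK installed earlier). NameNode and DataNode should both appear; if they do not, check the files under the hadoop logs directory.
bash_home_hdfs$> jps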
24) Start the jobtracker as the mapred userid. You will observe that it does NOT start (it cannot yet write to the logs and pids directories).
bash_home_mapred$> hadoop-daemon.sh start jobtracker
25) The mapred process also needs to write to the logs and pid directories so change the
permissions as follows. Switch back to hdfs userid.
bash_home_hdfs_hadoop_logs$> chmod -R 775 *
26) Make the logs directory itself 775 as well (run this from the hadoop directory).
bash_home_hdfs_hadoop$> chmod 775 logs/
27) Make the pid files 775 as well.
bash_home_hdfs_hadoop_pids$> chmod -R 775 *
28) Make the pids directory itself 775 as well (again from the hadoop directory).
bash_home_hdfs_hadoop$> chmod 775 pids/
29) Create a directory called /mapred in HDFS (the jobtracker creates its system directory under it)
bash_home_hdfs$> hadoop fs -mkdir /mapred
30) Make it owned by mapred:supergroup
bash_home_hdfs$> hadoop fs -chown mapred:supergroup /mapred
31) Make the HDFS root directory group-writable
bash_home_hdfs$> hadoop fs -chmod 775 /
32) Create a /tmp directory in HDFS and make it 777
bash_home_hdfs$> hadoop fs -mkdir /tmp
bash_home_hdfs$> hadoop fs -chmod 777 /tmp
33) Start the jobtracker as mapred again. This time it should start.
bash_home_mapred$> hadoop-daemon.sh start jobtracker
34) Start tasktracker
bash_home_mapred$> hadoop-daemon.sh start tasktracker
35) Run the sample word count program (a sample invocation is shown below)
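As a sketch (the examples jar name varies with the Hadoop distribution, and the input/output paths here are only placeholders):
bash_home_hdfs$> hadoop fs -mkdir /wordcount-in
bash_home_hdfs$> hadoop fs -put /etc/hosts /wordcount-in/
bash_home_hdfs$> hadoop jar ~/hadoop/hadoop-examples-*.jar wordcount /wordcount-in /wordcount-out
bash_home_hdfs$> hadoop fs -cat /wordcount-out/part-*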
=========================================================================
Follow the same procedure above to create additional data nodes and add them to the Hadoop cluster (a sample is shown below).
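For example, on each additional node (assuming it has the same userids, the same hadoop layout under /home/hdfs, and copies of the same three config files so that it points at the master's namenode and jobtracker), start only the worker daemons:
bash_home_hdfs$> hadoop-daemon.sh start datanode
bash_home_mapred$> hadoop-daemon.sh start tasktracker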