These are the instructions to set up a fully distributed Hadoop cluster on Amazon EC2.
The main steps are as follows.
A) Create an Amazon instance and connect to it
B) Install Java JDK 1.6 update 45 (jdk-6u45)
C) Copy the Hadoop tarball to the instance and extract it
D) Set up userids and .profile files
E) Specify config files
F) Format name node
G) Start name node and data node
H) Create specific directories in HDFS for mapred userid
I) Start job tracker and task tracker
J) Run word count program
K) Add data nodes to cluster (as many as necessary)
*******************************************************************************************
1) Create an Amazon instance (preferably CentOS) and connect to it as root.
Some Amazon instances will not allow you to connect directly as root.
In that case, connect as ec2-user and switch to root (for example with sudo su -); see the example after this step.
For example:
bash_local_machine$>
ssh -i k1_sqldata.pem root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com
where
k1_sqldata.pem = the file containing key value ssh pair for the instance
root = userid used
ec2-54-219-52-64.us-west-1.compute.amazonaws.com = public DNS of the instance
Make sure the permissions of the .pem file are 400, i.e. readable only by the owner.
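If direct root login is disabled, connect as ec2-user with the same key file and public DNS, then switch to root:
bash_local_machine$>
ssh -i k1_sqldata.pem ec2-user@ec2-54-219-52-64.us-west-1.compute.amazonaws.com
bash_ec2_user$> sudo su -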
2) See if the command "which" is installed. If not, install it.
bash_root$> which file1.txt
(If it says command not found, then run the following command).
bash_root$> yum install which
3) Copy jdk-6u45-linux-x64.bin by scp-ing it from the local filesystem to the Amazon instance
bash_local$>
scp -i k1_sqldata.pem jdk-6u45-linux-x64.bin root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com:~/
4) Move the jdk binary to /usr/local/ directory after connecting to the instance
bash_root$> mv /root/jdk-6u45-linux-x64.bin /usr/local/
5) Make the binary executable by all
bash_root$> chmod 755 /usr/local/jdk-6u45-linux-x64.bin
6) Change over to /usr/local directory
bash_root$> cd /usr/local
7) Run the binary as root and accept the prompts to install the JDK.
bash_usr_local$> ./jdk-6u45-linux-x64.bin
8) Change the name of the extracted JDK directory to something simple (like java). The self-extracting binary typically creates a directory named jdk1.6.0_45.
bash_usr_local$> mv jdk1.6.0_45 java
9) Add the "hdfs" user as "root"
bash_usr_local$> adduser hdfs
10) Make the "hdfs" user part of the root group
bash_root$ usermod -a -G root hdfs
11) Add the "mapred" user as root
bash_root$> adduser mapred
12) Make the "mapred" user part of the root group
bash_root$ usermod -a -G root mapred
13) Add hdfs to the sudoers list for convenience (it can be removed later). A sample entry is shown after this step.
bash_usr_local$> visudo
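Inside visudo, one common entry that grants hdfs full sudo rights (add it below the existing root line) is:
hdfs    ALL=(ALL)       ALL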
14) Switch to the "hdfs" user.
bash_usr_local$> su - hdfs
15) Extract the hadoop tarball after scp-ing it from the local machine to the Amazon instance (a sample scp is shown after this step).
bash_home_hdfs$> tar -xvzf hadoop-0.20.2-cdh3u1.tar.gz
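If the tarball is not on the instance yet, it can be copied the same way as the JDK in step 3 (same key file and public DNS; copying via root and then moving the file into the hdfs home directory is one option):
bash_local$> scp -i k1_sqldata.pem hadoop-0.20.2-cdh3u1.tar.gz root@ec2-54-219-52-64.us-west-1.compute.amazonaws.com:~/
bash_root$> mv /root/hadoop-0.20.2-cdh3u1.tar.gz /home/hdfs/
bash_root$> chown hdfs:root /home/hdfs/hadoop-0.20.2-cdh3u1.tar.gz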
16) Rename the directory
bash_home_hdfs$> mv hadoop-0.20.2-cdh3u1 hadoop
17) Configure the .profile for each one of the "root", "hdfs", and "mapred" userids.
# This is the .profile file
export JAVA_HOME=/usr/local/java
export PATH=$JAVA_HOME/bin:"$PATH"
export HADOOP_HOME=/home/hdfs/hadoop
export PATH=$HADOOP_HOME/bin:"$PATH"
# Ensure that this file is executable (chmod 755 .profile)
# end of .profile file
# Make an entry in .bash_profile file as below
source .profile
18) Change the ownership of the files and directories under the hdfs home directory
bash_home_hdfs$> sudo chown -R hdfs:root *
19) Change the ownership of the home directory itself
bash_home_hdfs$> chown hdfs:root .
20) Change the permissions of the files and directories under the HDFS home directory
bash_home_hdfs$> sudo chmod -R 750 .
21) Specify the config files in the hadoop conf directory
bash_home_hdfs_hadoop_conf$> ls -ltr core-site.xml hdfs-site.xml mapred-site.xml
(Total 3 files)
These files will be provided by the instructor in the class.
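For reference, a minimal sketch of what these three files typically contain is shown below. The property names are the standard ones for Hadoop 0.20 / CDH3; the host names, ports, and values here are placeholders, so use whatever the instructor's files specify.
bash_home_hdfs_hadoop_conf$> cat core-site.xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<namenode-public-dns>:9000</value>
  </property>
</configuration>
bash_home_hdfs_hadoop_conf$> cat hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
bash_home_hdfs_hadoop_conf$> cat mapred-site.xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value><jobtracker-public-dns>:9001</value>
  </property>
</configuration>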
22) Format the namenode
bash_home_hdfs$> hadoop namenode -format
23) Start the namenode and datanode as hdfs userid
bash_home_hdfs$> hadoop-daemon.sh start namenode
bash_home_hdfs$> hadoop-daemon.sh start datanode
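To confirm that both daemons came up, list the running Java processes with jps (it ships with the JDK installed earlier). NameNode and DataNode should both appear; if they do not, check the files under the hadoop logs directory.
bash_home_hdfs$> jps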
24) Start the jobtracker as the mapred userid. You will observe that it does NOT start (it cannot yet write to the logs and pids directories).
bash_home_mapred$> hadoop-daemon.sh start jobtracker
25) The mapred process also needs to write to the logs and pid directories so change the
permissions as follows. Switch back to hdfs userid.
bash_home_hdfs_hadoop_logs$> chmod -R 775 *
26) Make the logs directory itself 775 as well (run this from the hadoop directory).
bash_home_hdfs_hadoop$> chmod 775 logs/
27) Make the pid files 775 as well.
bash_home_hdfs_hadoop_pids$> chmod -R 775 *
28) Make the pids directory itself 775 as well (again from the hadoop directory).
bash_home_hdfs_hadoop$> chmod 775 pids/
29) Create a directory called /mapred in HDFS (the jobtracker creates its system directory under it)
bash_home_hdfs$> hadoop fs -mkdir /mapred
30) Make it owned by mapred:supergroup
bash_home_hdfs$> hadoop fs -chown mapred:supergroup /mapred
31) Make the HDFS root directory group-writable
bash_home_hdfs$> hadoop fs -chmod 775 /
32) Create a /tmp directory in HDFS and make it 777
bash_home_hdfs$> hadoop fs -mkdir /tmp
bash_home_hdfs$> hadoop fs -chmod 777 /tmp
33) Start the jobtracker as mapred again. This time it should start.
bash_home_mapred$> hadoop-daemon.sh start jobtracker
34) Start tasktracker
bash_home_mapred$> hadoop-daemon.sh start tasktracker
35) Run the sample word count program (a sample invocation is shown below)
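As a sketch (the examples jar name varies with the Hadoop distribution, and the input/output paths here are only placeholders):
bash_home_hdfs$> hadoop fs -mkdir /wordcount-in
bash_home_hdfs$> hadoop fs -put /etc/hosts /wordcount-in/
bash_home_hdfs$> hadoop jar ~/hadoop/hadoop-examples-*.jar wordcount /wordcount-in /wordcount-out
bash_home_hdfs$> hadoop fs -cat /wordcount-out/part-*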
=========================================================================
Follow the same procedure above to create additional data nodes and add them to the Hadoop cluster (a sample is shown below).
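For example, on each additional node (assuming it has the same userids, the same hadoop layout under /home/hdfs, and copies of the same three config files so that it points at the master's namenode and jobtracker), start only the worker daemons:
bash_home_hdfs$> hadoop-daemon.sh start datanode
bash_home_mapred$> hadoop-daemon.sh start tasktracker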