#133

Get More Refcardz! Visit refcardz.com

Apache Hadoop Deployment: A Blueprint for Reliable Distributed Computing

By Eugene Ciurana

CONTENTS INCLUDE:
• Introduction
• Which Hadoop Distribution?
• Apache Hadoop Installation
• Hadoop Monitoring Ports
• Apache Hadoop Production Deployment
• Hot Tips and more...
INTRODUCTION

This Refcard presents a basic blueprint for deploying Apache Hadoop HDFS and MapReduce in development and production environments. Check out Refcard #117, Getting Started with Apache Hadoop, for basic terminology and for an overview of the tools available in the Hadoop Project.

Minimum Prerequisites

• Java 1.6 from Oracle, version 1.6 update 8 or later; identify your current JAVA_HOME
• sshd and ssh for managing Hadoop daemons across multiple systems
• rsync for file and directory synchronization across the nodes in the cluster
• Create a service account for user hadoop where $HOME=/home/hadoop
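The prerequisites above can be confirmed on a candidate node with a short shell check. This is only a sketch: the check_prereqs name, the /usr/sbin fallback for sshd, and the exact messages are choices of this example, not part of the Refcard.

```shell
#!/bin/sh
# check_prereqs: confirm the minimum Hadoop node prerequisites.
check_prereqs() {
  # Java 1.6u8+ must be installed and JAVA_HOME must point at it.
  if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "java: OK ($JAVA_HOME)"
  else
    echo "java: JAVA_HOME not set or invalid"
  fi
  # ssh/sshd control the daemons; rsync synchronizes files across nodes.
  for tool in ssh sshd rsync; do
    if command -v "$tool" >/dev/null 2>&1 || [ -x "/usr/sbin/$tool" ]; then
      echo "$tool: OK"
    else
      echo "$tool: MISSING"
    fi
  done
}
check_prereqs
```

Run it as the hadoop service user on every machine you intend to add to the cluster.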
WHICH HADOOP DISTRIBUTION?

There are two basic Hadoop distributions:

• Apache Hadoop is the main open-source, bleeding-edge distribution from the Apache foundation.
• The Cloudera Distribution for Apache Hadoop (CDH) is an open-source, enterprise-class distribution for production-ready environments.

The decision to use one distribution or the other depends on the organization's desired objective.

• The Apache distribution is fine for experimental learning exercises and for becoming familiar with how Hadoop is put together.
• CDH removes the guesswork and offers an almost turnkey product for robustness and stability; it also offers some tools not available in the Apache distribution.

Listing 1 - Hadoop SSH Prerequisites

keyFile=$HOME/.ssh/id_rsa.pub
pKeyFile=$HOME/.ssh/id_rsa
authKeys=$HOME/.ssh/authorized_keys
if ! ssh localhost -C true ; then \
  if [ ! -e "$keyFile" ]; then \
    ssh-keygen -t rsa -b 2048 -P '' \
      -f "$pKeyFile"; \
  fi; \
  cat "$keyFile" >> "$authKeys"; \
  chmod 0640 "$authKeys"; \
  echo "Hadoop SSH configured"; \
else echo "Hadoop SSH OK"; fi

The key's passphrase in this example is left blank (-P ''). If this were to run on a public network, it could be a security hole. Distribute the public key from the master node to all other nodes for data exchange. All nodes are assumed to run in a secure network behind the firewall.
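Distributing the master's public key can be sketched with ssh-copy-id. The node names hadoop-node1 through hadoop-node3 are hypothetical placeholders for your cluster's hosts; the function echoes each command for review instead of running it, so remove the leading echo to execute for real.

```shell
#!/bin/sh
# distribute_key: push the master's public key to every worker node so
# the hadoop service account can ssh to them without a password.
# NOTE: the node names are hypothetical placeholders.
keyFile=$HOME/.ssh/id_rsa.pub
nodes="hadoop-node1 hadoop-node2 hadoop-node3"

distribute_key() {
  for node in $nodes; do
    # ssh-copy-id appends the key to ~hadoop/.ssh/authorized_keys on
    # each node; echoed here as a dry run for review before executing.
    echo ssh-copy-id -i "$keyFile" "hadoop@$node"
  done
}
distribute_key
```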
Hot Tip: Cloudera offers professional services and puts out an enterprise distribution of Apache Hadoop. Their toolset complements Apache's. Documentation about Cloudera's CDH is available from http://docs.cloudera.com.

The Apache Hadoop distribution assumes that the person installing it is comfortable with configuring a system manually. CDH, on the other hand, is designed as a drop-in component for all major Linux distributions.
Hot Tip: Linux is the supported platform for production systems. Windows is adequate but is not supported as a development platform.
APACHE HADOOP INSTALLATION

This Refcard is a reference for development and production deployment of the components shown in Figure 1. It includes the components available in the basic Hadoop distribution and the enhancements that Cloudera released.

[Figure 1 - Hadoop components]

Hot Tip: All the bash shell commands in this Refcard are available for cutting and pasting from: http://ciurana.eu/DeployingHadoopDZone

Enterprise: CDH Prerequisites

Cloudera simplified the installation process by offering packages for Ubuntu Server and Red Hat Linux distributions.

Hot Tip: CDH packages have names like CDH2, CDH3, and so on, corresponding to the CDH version.
Hot Tip: Whether the user intends to run Hadoop in non-distributed or distributed modes, it's best to install every required component on every machine in the computational network. Any computer may assume any role thereafter.

Listing 2 - Ubuntu Pre-Install Setup

DISTRO=$(lsb_release -c | cut -f 2)
REPO=/etc/apt/sources.list.d/cloudera.list
echo "deb \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" > "$REPO"
echo "deb-src \
http://archive.cloudera.com/debian \
$DISTRO-cdh3 contrib" >> "$REPO"
apt-get update

CDH on Red Hat Pre-Install Setup

Run these commands as root or through sudo to add the Cloudera yum repository:

Listing 3 - Red Hat Pre-Install Setup

curl -sL http://is.gd/3ynKY7 | tee \
/etc/yum.repos.d/cloudera-cdh3.repo | \
awk '/^name/'
yum update yum

Ensure that all the pre-required software and configuration are installed on every machine intended to be a Hadoop node. Don't mix and match operating systems, distributions, Hadoop, or Java versions!

A non-trivial, basic Hadoop installation includes at least these components:

• Hadoop Common: the basic infrastructure necessary for running all components and applications
• HDFS: the Hadoop Distributed File System
• MapReduce: the framework for large data set distributed processing
• Pig: an optional, high-level language for parallel computation and data flow

Enterprise users often choose CDH because of:

• Flume: a distributed service for efficient large data transfers in real time
• Sqoop: a tool for importing relational databases into Hadoop clusters

Hadoop for Development

• Hadoop runs as a single Java process, in non-distributed mode, by default. This configuration is optimal for development and debugging.
• Hadoop also offers a pseudo-distributed mode, in which every Hadoop daemon runs in a separate Java process. This configuration is optimal for development and will be used for the examples in this guide.

Hot Tip: If you have an OS X or a Windows development workstation, consider using a Linux distribution hosted on VirtualBox for running Hadoop. It will help prevent support or compatibility headaches.

Hadoop for Production

• Production environments are deployed across a group of machines that make up the computational network. Hadoop must be configured to run in fully distributed, clustered mode.

Apache Hadoop Development Deployment

The steps in this section must be repeated for every node in a Hadoop cluster. Downloads, installation, and configuration could be automated with shell scripts. All these steps are performed as the service user hadoop, defined in the prerequisites section.

http://hadoop.apache.org/common/releases.html has the latest version of the common tools. This guide used version 0.20.2.

1. Download Hadoop from a mirror and unpack it in the /home/hadoop work directory.
2. Set the JAVA_HOME environment variable.
3. Set the run-time environment:
Listing 4 - Set the Hadoop Runtime Environment

version=0.20.2  # change if needed
identity="hadoop-dev"
runtimeEnv="runtime/conf/hadoop-env.sh"
ln -s hadoop-"$version" runtime
ln -s runtime/logs .
export HADOOP_HOME="$HOME"
cp "$runtimeEnv" "$runtimeEnv".org
echo "export HADOOP_SLAVES=$HADOOP_HOME/slaves" >> "$runtimeEnv"
mkdir "$HADOOP_HOME"/slaves
echo "export HADOOP_IDENT_STRING=$identity" >> "$runtimeEnv"
echo "export JAVA_HOME=$JAVA_HOME" >> "$runtimeEnv"
export PATH=$PATH:"$HADOOP_HOME"/runtime/bin
unset version; unset identity; unset runtimeEnv

Configuration

Pseudo-distributed operation (each daemon runs in a separate Java process) requires updates to core-site.xml, hdfs-site.xml, and mapred-site.xml. These files configure the master, the file system, and the MapReduce framework, and they live in the runtime/conf directory.

Listing 5 - Pseudo-Distributed Operation Config

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

These files are documented in the Apache Hadoop Clustering reference at http://is.gd/E32L4s; some parameters are discussed in this Refcard's production deployment section.

Test the Hadoop Installation

Hadoop requires a formatted HDFS cluster to do its work:

hadoop namenode -format

The HDFS volume lives on top of the standard file system. The format command will show this upon successful completion:

/tmp/dfs/name has been successfully formatted.

Start the Hadoop processes and perform these operations to validate the installation:

• Use the contents of runtime/conf as known input
• Use Hadoop for finding all text matches in the input
• Check the output directory to ensure it works

Listing 6 - Testing the Hadoop Installation

start-all.sh ; sleep 5
hadoop fs -put runtime/conf input
hadoop jar runtime/hadoop-*-examples.jar \
grep input output 'dfs[a-z.]+'

Hot Tip: You may ignore any warnings or errors about a missing slaves file.

• View the output files in the HDFS volume and stop the Hadoop daemons to complete testing the install

Listing 7 - Job Completion and Daemon Termination

hadoop fs -cat output/*
stop-all.sh

That's it! Apache Hadoop is installed in your system and ready for development.

CDH Development Deployment

CDH removes a lot of grueling work from the Hadoop installation process by offering ready-to-go packages for mainstream Linux server distributions. Compare the instructions in Listing 8 against the previous section. CDH simplifies installation and configuration for huge time savings.

Listing 8 - Installing CDH

ver="0.20"
command="/usr/bin/aptitude"
if [ ! -e "$command" ];
then command="/usr/bin/yum"; fi
"$command" install \
hadoop-"$ver"-conf-pseudo
unset command ; unset ver

Leveraging some or all of the extra components in Hadoop or CDH is another good reason for using it over the Apache version. Install Pig, Flume, or Sqoop with the instructions in Listing 9.

Listing 9 - Adding Optional Components

apt-get install hadoop-pig
apt-get install flume
apt-get install sqoop

Test the CDH Installation

The CDH daemons are ready to be executed as services. There is no need to create a service account for executing them. They can be started or stopped like any other Linux service, as shown in Listing 10.

Listing 10 - Starting the CDH Daemons

for s in /etc/init.d/hadoop* ; do \
"$s" start; done

CDH will create an HDFS partition when its daemons start; this is another convenience it offers over regular Hadoop. Listing 11 shows how to validate the installation by:

• Listing the HDFS module
• Moving files to the HDFS volume
• Running an example job
• Validating the output
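A validation sequence along the lines of these bullets might look like the following dry-run sketch. It echoes each hadoop command instead of executing it (remove the leading echo to run against a live cluster); the exact example jar name and paths follow the development layout used earlier in this Refcard.

```shell
#!/bin/sh
# validate_install: dry-run sketch of the installation validation steps.
# Each line echoes the hadoop command to run; drop the echo to execute.
validate_install() {
  echo "hadoop fs -ls /"                       # list the HDFS volume
  echo "hadoop fs -put runtime/conf input"     # known input data
  echo "hadoop jar runtime/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'"
  echo "hadoop fs -cat output/*"               # inspect the job output
}
validate_install
```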
[Figure 2 - NameNode status web UI]

The web interface can be used for monitoring the JobTracker, which dispatches tasks to specific nodes in a cluster, the namespaces, and the file nodes in the file system.
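A quick way to confirm that the status web UIs are up is to probe them from a shell. Ports 50070 (NameNode) and 50030 (JobTracker) are the classic Hadoop 0.20 defaults; adjust them if your site configuration overrides the web UI addresses. The check_ui helper below is an illustration, not part of the Refcard.

```shell
#!/bin/sh
# check_ui: probe the Hadoop status web UIs on a given host.
# 50070 (NameNode) and 50030 (JobTracker) are the Hadoop 0.20 defaults.
check_ui() {
  host=${1:-localhost}
  for port in 50070 50030; do
    # A 2-second timeout keeps the probe fast on unreachable hosts.
    if curl -s -m 2 -o /dev/null "http://$host:$port/"; then
      echo "$host:$port responding"
    else
      echo "$host:$port not reachable"
    fi
  done
}
check_ui localhost
```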
Listing 20 - Minimal MapReduce Config Update

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>
    /data/1/mapred/local,
    /data/2/mapred/local
  </value>
  <final>true</final>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>
    /mapred/system
  </value>
  <final>true</final>
</property>

Start the JobTracker and all other nodes. You now have a working Hadoop cluster. Use the commands in Listing 11 to validate that it's operational.

The instructions in this Refcard result in a working development or production Hadoop cluster. Hadoop is a complex framework and requires attention to configure and maintain. Review the Apache Hadoop and Cloudera CDH documentation. Pay particular attention to the sections on:

• How to write MapReduce, Pig, or Hive applications
• Multi-node cluster management with ZooKeeper
• Hadoop ETL with Sqoop and Flume

Happy Hadoop computing!

STAYING CURRENT

Do you want to know about specific projects and use cases where Hadoop and data scalability are the hot topics? Join the scalability newsletter: http://ciurana.eu/scalablesystems

Thank You!

Thanks to all the technical reviewers, especially to Pavel Dovbush at http://dpp.su
DZone, Inc.
140 Preston Executive Dr., Suite 100
Cary, NC 27513
888.678.0399 / 919.678.0300

ISBN-13: 978-1-936502-03-5
ISBN-10: 1-936502-03-8
$7.95

DZone communities deliver over 6 million pages each month to more than 3.3 million software developers, architects and decision makers. DZone offers something for everyone, including news, tutorials, cheat sheets, blogs, feature articles, source code and more. "DZone is a developer's dream," says PC Magazine.

Refcardz Feedback Welcome: refcardz@dzone.com
Sponsorship Opportunities: sales@dzone.com

Copyright © 2011 DZone, Inc. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher. Version 1.0