
Introduction to Hadoop

Apache Hadoop is a free, distributed software framework, written primarily in Java (with some native C++ components and shell scripts), that supports data-intensive distributed applications. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, built and used by a community of contributors from all over the world. Yahoo! has been the largest contributor to the project and uses Hadoop extensively in its web search and advertising businesses.

Running Hadoop 0.20 Pseudo Distributed Mode on Mac OS X


Although Hadoop is developed for running distributed computing applications (MapReduce) on commodity hardware, it is possible to run Hadoop on a single machine in pseudo-distributed mode. Running Hadoop in pseudo-distributed mode is the first step towards running Hadoop in fully distributed mode. To set up and run Hadoop in pseudo-distributed mode you need Java 6 installed on your system; also make sure that JAVA_HOME is set in your environment. Download Hadoop 0.20 from the Apache Hadoop releases page.

Download and Extract Hadoop

Download and save hadoop-0.20.0.tar.gz. To extract the Hadoop tarball, execute:

tar xvf hadoop-0.20.0.tar.gz

This should extract the Hadoop binaries and source into the hadoop-0.20.0 directory.

By default Hadoop is configured to run in standalone mode. To view Hadoop commands and options, execute bin/hadoop from the Hadoop root directory. You will see the basic commands and options as below.

matrix:Hadoop rphulari$ bin/hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  fs            run a generic filesystem user client
  version       print the version
  jar           run a jar file
  distcp        copy file or directories recursively
  archive -archiveName NAME   create a hadoop archive
  daemonlog     get/set the log level for each daemon
 or
  CLASSNAME     run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Configuration

We are 5 steps away from running Hadoop in pseudo-distributed mode.

Step 1 - Configure conf/hadoop-env.sh

Update JAVA_HOME to point to your system's Java home directory. On Mac OS X it should point to /System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/.

Step 2 - Configure conf/hdfs-site.xml

Add the following to conf/hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Step 3 - Configure conf/core-site.xml

Add the following to conf/core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Step 4 - Configure conf/mapred-site.xml

Add the following to conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

Step 5 - Format HDFS and start Hadoop

Now you are all set to start Hadoop in pseudo-distributed mode. You can either start all Hadoop processes (HDFS and MapReduce daemons) using bin/start-all.sh from the Hadoop root directory, or start only HDFS with bin/start-dfs.sh, or only the MapReduce daemons with bin/start-mapred.sh. Before starting the Hadoop DFS (distributed file system) we need to format it using the namenode format command:

matrix:Hadoop rphulari$ bin/hadoop namenode -format

This will print a lot of information on screen, including the Hadoop version, host name and IP address, and the namenode storage directory, which by default is set to /tmp/hadoop-$username. Once HDFS is formatted and ready for use, we execute bin/start-all.sh to start all processes. If you execute bin/start-all.sh, all Hadoop processes will start and you will see the startup log of the JobTracker, TaskTracker, NameNode and DataNode on screen.

You can also make sure all processes are running by executing the jps command:

matrix-lm:Hadoop rphulari$ jps
12543 DataNode
12776 Jps
12677 JobTracker
12755 TaskTracker
12619 SecondaryNameNode

Playing with the HDFS shell

HDFS, the Hadoop Distributed File System, is very similar to a Unix / POSIX file system. HDFS provides the same kind of shell commands for file system operations such as mkdir, ls, du, etc.

HDFS - ls

ls is part of hadoop fs (file system) and can be executed as follows, showing the contents of the root ( / ) directory:

matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 1 items
drwxr-xr-x   - rphulari supergroup          0 2009-05-13 22:04 /tmp

NOTE - By default the user who starts HDFS and formats the name node is the superuser of HDFS.

HDFS - mkdir

To create a directory on HDFS use fs -mkdir:

matrix:Hadoop rphulari$ bin/hadoop fs -mkdir user
matrix:Hadoop rphulari$ bin/hadoop fs -ls /
Found 2 items
drwxr-xr-x   - rphulari supergroup          0 2009-05-13 22:04 /tmp
drwxr-xr-x   - rphulari supergroup          0 2009-05-13 22:06 /user
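If you prefer to drive HDFS from code rather than the shell, the same mkdir/ls operations are available through Hadoop's Java FileSystem API. The sketch below is illustrative only; it assumes the pseudo-distributed setup above (fs.default.name pointing at hdfs://localhost:9000), and the class name and paths are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: the same mkdir/ls operations via the Java FileSystem API.
public class HdfsShellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes the pseudo-distributed setup above; adjust if your core-site.xml differs.
        conf.set("fs.default.name", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        fs.mkdirs(new Path("/user"));                              // like: bin/hadoop fs -mkdir user
        for (FileStatus status : fs.listStatus(new Path("/"))) {   // like: bin/hadoop fs -ls /
            System.out.println(status.getPath() + "\t" + status.getOwner());
        }
        fs.close();
    }
}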

You can find the complete list of Hadoop shell commands in the Hadoop file system shell documentation. In the next posts we will execute our first MapReduce program on Hadoop.

How to use Java member accessibility modifiers


In Java, a class can use accessibility modifiers to control what information is accessible to other classes. The accessibility of a member can be one of the following:

public
protected
default (also known as package accessibility)
private

public members

Public access is the least restrictive of all access modifiers. A public member is accessible everywhere, both in its class's package and in other packages where its class is visible. This is true for both instance and static members.

protected members

Protected members are accessible in the package containing the class and by all subclasses of the class in any package where the class is visible. Protected access is less restrictive than default accessibility.

Default accessibility members

When no access modifier is specified for a member, it is only accessible by other classes in the package where its class is defined. Even if its class is visible in another package, the member is not accessible there.

private members

private is the most restrictive of all the access modifiers. Private members are not accessible from any other class. This also applies to subclasses, whether they are in the same package or not. It is a good design strategy to make all member variables private and provide public access methods for them.
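To make the four levels concrete, here is a small illustrative sketch; the class, package and field names are made up for this example.

// Illustrative sketch of the four accessibility levels (names are invented).
package banking;

public class Account {
    public String owner;        // public: accessible wherever the class is visible
    protected double balance;   // protected: this package plus subclasses in any package
    int branchCode;             // default (package): only classes in package banking
    private String pin;         // private: only code inside Account itself

    // Good practice: keep fields private and expose public accessor methods.
    public String maskedPin() {
        return "****";
    }
}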

Java Clone method


To clone something is to make a duplicate of it. The clone() method in Java makes an exact duplicate of an object. Why would someone need cloning? Java passes object references by value, which allows a called method to modify the state of an object that is passed into it. Cloning the input object before calling the method lets you pass a copy of the object, keeping the original safe. Cloning is not enabled by default in classes that you write. clone() is a protected method, which means that your code cannot simply call it; only the defining class can clone its own objects.

Foo f = new Foo();
Foo f2 = (Foo) f.clone();

If you try clone() without any special preparation, as in the code written above, you will encounter errors. You must do two things to make your class cloneable:

1. Override Object's clone() method.
2. Implement the empty Cloneable interface.

Example:

public class FooClone implements Cloneable {
    @Override
    public FooClone clone() throws CloneNotSupportedException {
        return (FooClone) super.clone();
    }
    // more code
}
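A short usage sketch of the class above; the demo class and the mutating method it alludes to are hypothetical, but the cloning call itself works because FooClone overrides clone() and implements Cloneable.

// Hypothetical usage: clone before handing the object to code that might mutate it.
public class FooCloneDemo {
    public static void main(String[] args) throws CloneNotSupportedException {
        FooClone original = new FooClone();
        FooClone copy = original.clone();   // legal because FooClone overrides clone()
        // pass 'copy' to any method that modifies its argument; 'original' stays untouched
    }
}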

JPS - Java Process Status tool


You are running your Java program and wondering what processes are running in the JVM. Ever wondered how to see Java processes? Use jps to view Java Virtual Machine status.

The jps tool lists the instrumented HotSpot Java Virtual Machines (JVMs) on the target system. The tool is limited to reporting information on JVMs for which it has access permissions. If jps is run without specifying a hostid, it will look for instrumented JVMs on the local host. If started with a hostid, it will look for JVMs on the indicated host, using the specified protocol and port; a jstatd process is assumed to be running on the target host. The jps command will report the local VM identifier, or lvmid, for each instrumented JVM found on the target system. The lvmid is typically, but not necessarily, the operating system's process identifier for the JVM process. With no options, jps will list each Java application's lvmid followed by the short form of the application's class name or jar file name. The short form of the class name or JAR file name omits the class's package information or the JAR file's path information. The jps command uses the java launcher to find the class name and arguments passed to the main method. If the target JVM is started with a custom launcher, the class name (or JAR file name) and the arguments to the main method will not be available. In this case, the jps command will output the string Unknown for the class name or JAR file name and for the arguments to the main method. The list of JVMs produced by the jps command may be limited by the permissions granted to the principal running the command. The command will only list the JVMs for which the principal has access rights as determined by operating system specific access control mechanisms.

Options

The jps command supports a number of options that modify the output of the command. These options are subject to change or removal in the future.

-q    Suppress the output of the class name, JAR file name, and arguments passed to the main method, producing only a list of local VM identifiers.

-m    Output the arguments passed to the main method. The output may be null for embedded JVMs.
-l    Output the full package name for the application's main class or the full path name to the application's JAR file.
-v    Output the arguments passed to the JVM.
-V    Output the arguments passed to the JVM through the flags file (the .hotspotrc file or the file specified by the -XX:Flags=<filename> argument).
-Joption    Pass option to the java launcher called by jps. For example, -J-Xms48m sets the startup memory to 48 megabytes. It is a common convention for -J to pass options to the underlying VM executing applications written in Java.
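To see jps in action you can run any long-lived Java program and then list it. The tiny class below is a made-up example, not part of the JDK or Hadoop; it only exists so there is a JVM for jps to report.

// Hypothetical helper: a JVM that stays alive for a minute so jps has something to list.
public class SleepyApp {
    public static void main(String[] args) throws InterruptedException {
        System.out.println("Sleeping; run 'jps -lm' in another terminal to see this JVM.");
        Thread.sleep(60000L); // keep the process alive for 60 seconds
    }
}

Running jps -lm while SleepyApp sleeps should show its lvmid followed by the fully qualified class name and the (empty) arguments to main.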

Beginner's view of Hadoop MiniDFSCluster


If you are new to the Hadoop source code and you want to write test-driven development code, then MiniDFSCluster is what you can use as your first step. There are many Hadoop developers who will argue that using MiniDFSCluster is not an excellent way to write unit tests for Hadoop, and there are other, more efficient ways (e.g. using mock objects with Mockito) of writing unit tests for Hadoop; we will discuss this in another post. The MiniDFSCluster class creates a single-process DFS cluster for JUnit testing, which can include non-simulated and simulated DFS data nodes. The data directories for non-simulated DFS are under the testing directory ( build/test/data ); for simulated data nodes, no underlying fs storage is used. MiniDFSCluster is mostly used in the following four ways:

1. public MiniDFSCluster() {}

This null constructor is used only when wishing to start a data node cluster without a name node (i.e. when the name node is started elsewhere).

2. public MiniDFSCluster(Configuration conf, int numDataNodes, StartupOption nameNodeOperation)

Modify the config and start up the servers with the given operation. Servers will be started on free ports. The caller must manage the creation of NameNode and DataNode directories and have already set dfs.name.dir and dfs.data.dir in the given conf. Here conf is the base configuration to use in starting the servers (it will be modified as necessary), numDataNodes is the number of DataNodes to start (may be zero), and nameNodeOperation is the operation with which to start the servers; if null or StartupOption.FORMAT, then StartupOption.REGULAR will be used.

3. public MiniDFSCluster(Configuration conf, int numDataNodes, boolean format, String[] racks)

Modify the config and start up the servers. The rpc and info ports for the servers are guaranteed to use free ports. NameNode and DataNode directory creation and configuration will be managed by this class. Here conf is the base configuration to use in starting the servers (it will be modified as necessary), numDataNodes is the number of DataNodes to start (may be zero), format indicates whether to format the NameNode and DataNodes before starting up, and racks is an array of strings indicating the rack each DataNode is on.

4. public MiniDFSCluster(Configuration conf, int numDataNodes, boolean format, String[] racks, String[] hosts)

Same as the previous constructor, with an additional hosts parameter: an array of strings indicating the hostname for each DataNode. Below is a simple example in which we configure and start a MiniDFSCluster.

import java.io.IOException;
import java.util.Random;

import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.DFSTestUtil;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.namenode.FSNamesystem;
import org.apache.hadoop.hdfs.server.namenode.NameNode;

// JUnit 3 style test case; extending TestCase is assumed since setUp/tearDown are overridden.
public class StartMiniDFSCluster extends TestCase {

  private static final Configuration CONF = new HdfsConfiguration();
  private static final int DFS_REPLICATION_INTERVAL = 1;
  private static final Path TEST_ROOT_DIR_PATH =
      new Path(System.getProperty("test.build.data", "build/test/data"));
  // Number of datanodes in the cluster
  private static final int DATANODE_COUNT = 3;

  static {
    CONF.setLong(DFSConfigKeys.DFS_BLOCK_SIZE_KEY, 100);
    CONF.setInt(DFSConfigKeys.DFS_BYTES_PER_CHECKSUM_KEY, 1);
    CONF.setLong(DFSConfigKeys.DFS_HEARTBEAT_INTERVAL_KEY, DFS_REPLICATION_INTERVAL);
    CONF.setInt(DFSConfigKeys.DFS_NAMENODE_REPLICATION_INTERVAL_KEY, DFS_REPLICATION_INTERVAL);
  }

  private final Random rand = new Random();
  private MiniDFSCluster cluster;
  private DistributedFileSystem fs;
  private FSNamesystem namesystem;
  private NameNode nn;

  private static Path getTestPath(String fileName) {
    return new Path(TEST_ROOT_DIR_PATH, fileName);
  }

  @Override
  protected void setUp() throws Exception {
    cluster = new MiniDFSCluster(CONF, DATANODE_COUNT, true, null);
    cluster.waitActive();
    namesystem = cluster.getNamesystem();
    fs = (DistributedFileSystem) cluster.getFileSystem();
    nn = cluster.getNameNode();
  }

  @Override
  protected void tearDown() throws Exception {
    cluster.shutdown();
  }

  /** Create a file with a length of fileLen. */
  private void createFile(Path file, long fileLen, short replicas) throws IOException {
    DFSTestUtil.createFile(fs, file, fileLen, replicas, rand.nextLong());
  }
}
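A hypothetical test method you might add to the class above (the method name and file size are made up) to exercise the mini cluster:

  // Hypothetical test: write a small file into the mini cluster and wait until it
  // is replicated to all data nodes started by setUp().
  public void testCreateSmallFile() throws Exception {
    Path file = getTestPath("smallFile");              // lands under build/test/data
    createFile(file, 1024L, (short) DATANODE_COUNT);
    assertTrue(fs.exists(file));                       // the file is visible through DFS
    DFSTestUtil.waitReplication(fs, file, (short) DATANODE_COUNT);
  }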

Most useful VI commands.

Primary Commands
d           delete
y           copy (yank)
c           change
r           replace
p           paste
u           undo
a           append
o           new line
x           delete char
control-R   redo

Edit
dd          delete line
dw          delete word
d4j         delete 5 lines (current line plus 4 below)
yy          copy line
yw          copy word
p           paste a line
:s/a/b/     replace a with b

Navigation
h j k l     left, down, up, right (one character/line at a time)
w b e       next word, back word, end of word
W B E       same, but ignores punctuation
gg          beginning of file
G           end of file
0           first column in a line
^           beginning of line
$           end of line
/string     search forward for string
control-U   scroll up half a page
control-B   back (up) one page
control-F   forward (down) one page
100G        go to line 100

Modes
Escape      command mode
i           insert mode
v           visual mode
V           visual (line) mode

File
:u                  undo
:q                  quit
:q!                 force quit
:wq                 write and quit
/str                search for next occurrence of str
?str                search for previous occurrence of str
:%s/str_a/str_b/g   replace all str_a with str_b in current buffer

How much data is generated on the internet every year/month/day?
According to Nielsen Online there are currently more than 1,733,993,741 internet users. How much data are these users generating? A few numbers to understand how much data is generated every year.

Email
* 90 trillion - The number of emails sent on the Internet in 2009.
* 247 billion - Average number of email messages per day.
* 1.4 billion - The number of email users worldwide.
* 100 million - New email users since the year before.
* 81% - The percentage of emails that were spam.
* 92% - Peak spam levels late in the year.
* 24% - Increase in spam since last year.
* 200 billion - The number of spam emails per day (assuming 81% are spam).

Websites
* 234 million - The number of websites as of December 2009.
* 47 million - Added websites in 2009.

Web servers
* 13.9% - The growth of Apache websites in 2009.
* -22.1% - The growth of IIS websites in 2009.
* 35.0% - The growth of Google GFE websites in 2009.
* 384.4% - The growth of Nginx websites in 2009.
* -72.4% - The growth of Lighttpd websites in 2009.

Domain names

* 81.8 million - .COM domain names at the end of 2009.
* 12.3 million - .NET domain names at the end of 2009.
* 7.8 million - .ORG domain names at the end of 2009.
* 76.3 million - The number of country code top-level domains (e.g. .CN, .UK, .DE, etc.).
* 187 million - The number of domain names across all top-level domains (October 2009).
* 8% - The increase in domain names since the year before.

Internet users
* 1.73 billion - Internet users worldwide (September 2009).
* 18% - Increase in Internet users since the previous year.
* 738,257,230 - Internet users in Asia.
* 418,029,796 - Internet users in Europe.
* 252,908,000 - Internet users in North America.
* 179,031,479 - Internet users in Latin America / Caribbean.
* 67,371,700 - Internet users in Africa.
* 57,425,046 - Internet users in the Middle East.
* 20,970,490 - Internet users in Oceania / Australia.

Social media
* 126 million - The number of blogs on the Internet (as tracked by BlogPulse).
* 84% - Percent of social network sites with more women than men.
* 27.3 million - Number of tweets on Twitter per day (November 2009).
* 57% - Percentage of Twitter's user base located in the United States.
* 4.25 million - People following @aplusk (Ashton Kutcher, Twitter's most followed user).
* 350 million - People on Facebook.
* 50% - Percentage of Facebook users that log in every day.
* 500,000 - The number of active Facebook applications.

Images
* 4 billion - Photos hosted by Flickr (October 2009).
* 2.5 billion - Photos uploaded each month to Facebook.
* 30 billion - At the current rate, the number of photos uploaded to Facebook per year.

Videos
* 1 billion - The total number of videos YouTube serves in one day.
* 12.2 billion - Videos viewed per month on YouTube in the US (November 2009).
* 924 million - Videos viewed per month on Hulu in the US (November 2009).
* 182 - The number of online videos the average Internet user watches in a month (USA).
* 82% - Percentage of Internet users that view videos online (USA).

* 39.4% - YouTube's share of the online video market (USA).
* 81.9% - Percentage of embedded videos on blogs that are YouTube videos.

Web browsers
* 62.7% - Internet Explorer
* 24.6% - Firefox
* 4.6% - Chrome
* 4.5% - Safari
* 2.4% - Opera
* 1.2% - Other

Malicious software
* 148,000 - New zombie computers created per day (used in botnets for sending spam, etc.).
* 2.6 million - Amount of malicious code threats at the start of 2009 (viruses, trojans, etc.).
* 921,143 - The number of new malicious code signatures added by Symantec in Q4 2009.

Data is abundant, Information is useful, Knowledge is precious.


Data - Data is raw and abundant. It simply exists and has no significance beyond its existence. It can exist in any form, usable or not, and it has no meaning of itself. Collecting a user's activity log produces data.

Information - Information is data that has been given meaning by way of relational connection.

Knowledge - Knowledge is the appropriate collection of information, such that its intent is to be useful.

Internet users are generating petabytes of data every day. Millions of users access billions of web pages every millisecond, creating hundreds of server logs with every keystroke and mouse click. Having only user log data is not useful. To give better service to users and generate money for the business, raw data must be processed into information which can then be used to provide knowledge to users and advertisers.

Open source solutions for processing big data.


Following are some of the open source solutions for processing big data.

Hadoop - The Hadoop project develops open-source software for reliable, scalable, distributed computing. The Hadoop ecosystem consists of these sub-projects:

HDFS - The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

MapReduce - MapReduce is a software framework introduced by Google to support distributed computing on large data sets on clusters of computers.

Pig - Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Hive - Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, ad hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data and a simple query language called Hive QL, which is based on SQL and enables users familiar with SQL to query this data. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers to do more sophisticated analysis which may not be supported by the built-in capabilities of the language.

HBase - HBase is the Hadoop database. Use it when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.

Voldemort - Voldemort is a distributed key-value storage system.

Cassandra - The Apache Cassandra project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model.

Sources for the internet statistics quoted earlier: website and web server stats from Netcraft; domain name stats from Verisign and Webhosting.info; internet user stats from Internet World Stats; web browser stats from Net Applications; email stats from Radicati Group; spam stats from McAfee; malware stats from Symantec and McAfee; online video stats from Comscore, Sysomos and YouTube; photo stats from Flickr and Facebook; social media stats from BlogPulse, Pingdom, Twittercounter, Facebook and GigaOm.

Netflix + (Hadoop & Hive ) = Your Favorite Movie


I love watching movies and I am a huge fan of Netflix; I love the way Netflix suggests new movies based on my previous watching history and ratings. Established in 1997 and headquartered in Los Gatos, California, Netflix (NASDAQ: NFLX) is a service offering online flat-rate DVD and Blu-ray disc rental-by-mail and video streaming in the United States. It has a collection of 100,000 titles and more than 10 million subscribers. The company has more than 55 million discs and, on average, ships 1.9 million DVDs to customers each day. On April 2, 2009, the company announced that it had mailed its two billionth DVD. Netflix offers Internet video streaming, enabling the viewing of films directly on a PC or TV at home. According to Comscore, in the month of December 2008 Netflix streamed 127,016,000 videos, which makes Netflix a top-20 online video site. Netflix's movie recommendation algorithm uses Hive (with Hadoop, HDFS and MapReduce underneath) for query processing and BI. Hive is used for scalable log analysis to gain business

intelligence. Netflix collects all streaming logs from its website using Chukwa (soon to be replaced by Hunu, an alternative to Chukwa, which will be open sourced very soon). Currently Netflix processes these streaming logs on Amazon S3.

How Much Data?

1. Parsing 0.6 TB of logs per day.
2. Running 50 persistent nodes on an Amazon EC2 cluster.

How Often?

Hadoop jobs are run every hour to parse the last hour's logs and reconstruct sessions. Then the small files from each reducer are merged and loaded into Hive.

How is the data processed?

Netflix web usage logs generated by web applications are collected by the Chukwa collector. These logs are dumped onto HDFS instances running on Amazon S3. Later these logs are consumed by Hive and Hadoop running in the cloud. For Netflix, processing data on S3 is not cheaper, but it is less risky because they can recover the data.

Useful data generated using log analysis:

1. Streaming summary data.
2. CDN performance.
3. Number of streams per day.
4. Number of errors per session.
5. Test cell analysis.
6. Ad hoc queries for further analysis.
7. BI processing using MicroStrategy.

How to build Hadoop with my custom patch?


Problem: How do I build my own version of Hadoop with my custom patch?

Solution: Apply the patch and build Hadoop.

You will need: the Hadoop source code, your custom patch, Java 6, Apache Ant, Java 5 (for generating documents) and Apache Forrest (for generating documents).

Steps

Check out the Hadoop source code:

> svn co https://svn.apache.org/repos/asf/hadoop/common/tags/release-X.Y.Z-rcR

Apply your patch to check its functionality using the following command:

> patch -p0 -E < ~/Path/To/Patch.patch

Compile and test the source code with the patch applied:

> ant -Djava5.home=/System/Library/Frameworks/JavaVM.framework/Versions/1.5/Home/ -Dforrest.home=/Path/to/forrest/apache-forrest-0.8 -Dfindbugs.home=/Path/to/findbugs/latest compile-core tar

To build the documents:

> ant -Dforrest.home=$FORREST_HOME -Djava5.home=$JAVA5 docs

How to transfer data between different HDFS clusters.


Problem: You have multiple Hadoop clusters running and you want to transfer several terabytes of data from one cluster to another.

Solution: DistCp - distributed copy.

It is common for Hadoop clusters to be loaded with terabytes of data (not all clusters are petabytes in size), and it would take forever to transfer terabytes of data from one cluster to another serially. Distributed, parallel copying of data is a good solution for this, and that is what DistCp does: it runs a MapReduce job to transfer your data from one cluster to another. To transfer data using DistCp you need to specify the HDFS path names of the source and destination, as shown below.

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo

You can also specify multiple source directories on the command line:

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo

Or, equivalently, from a file using the -f option:

bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo

where srclist contains

hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b

How to run multiple hadoop data nodes on one machine.


Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on a single machine to develop and test distributed features, data node behavior, name node interaction with data nodes, and for other reasons.

If you want to experience Hadoop's distributed data node / name node behavior and you have only one machine, you can run multiple data nodes on that single machine. You can see how the name node stores its metadata (fsimage, edits, fstime) and how data nodes store data blocks on the local file system.

Steps

To start multiple data nodes on a single node, first download or build the Hadoop binary.

1. Download the Hadoop binary or build it from the Hadoop source.
2. Prepare the Hadoop configuration to run on a single node (change the Hadoop default tmp dir location from /tmp to some other reliable location).
3. Add the run-additionalDN.sh script below to the $HADOOP_HOME/bin directory and chmod it to 744, then format HDFS: bin/hadoop namenode -format (for Hadoop 0.20 and below) or bin/hdfs namenode -format (for versions > 0.21).
4. Start HDFS with bin/start-dfs.sh (this will start the NameNode and 1 data node), which can be viewed at http://localhost:50070.
5. Start additional data nodes using bin/run-additionalDN.sh.

run-additionalDN.sh
#!/bin/sh
# This is used for starting multiple datanodes on the same machine.
# Run it from hadoop-dir/ just like 'bin/hadoop'.
# Usage: run-additionalDN.sh [start|stop] dnnumber
# e.g.: run-additionalDN.sh start 2

DN_DIR_PREFIX="/path/to/store/data_and_log_of_additionalDN/"

if [ -z $DN_DIR_PREFIX ]; then
    echo $0: DN_DIR_PREFIX is not set. set it to something like "/hadoopTmp/dn"
    exit 1
fi

run_datanode () {
    DN=$2
    export HADOOP_LOG_DIR=$DN_DIR_PREFIX$DN/logs
    export HADOOP_PID_DIR=$HADOOP_LOG_DIR
    DN_CONF_OPTS="\
    -Dhadoop.tmp.dir=$DN_DIR_PREFIX$DN\
    -Ddfs.datanode.address=0.0.0.0:5001$DN \
    -Ddfs.datanode.http.address=0.0.0.0:5008$DN \
    -Ddfs.datanode.ipc.address=0.0.0.0:5002$DN"
    bin/hadoop-daemon.sh --script bin/hdfs $1 datanode $DN_CONF_OPTS
}

cmd=$1
shift

for i in $*
do
    run_datanode $cmd $i
done

Use jps or the NameNode web UI to verify that the additional data nodes have started. I started a total of 3 data nodes (2 additional data nodes) on my single machine, running on ports 50010, 50011 and 50012.

Hadoop Cookbook : How to create/write-to HDFS files directly from map/reduce tasks?
You can use ${mapred.output.dir} to get this done. ${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath). ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0), and a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).

With speculative execution on, one could face issues with 2 instances of the same TIP (running simultaneously) trying to open or write to the same file (path) on HDFS. Hence the application writer will have to pick unique names (e.g. using the complete taskid, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)

To get around this, the framework helps the application writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-directory for each task-attempt on HDFS where the output of the task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce task, and the framework will move them out similarly; thus you don't have to pick unique paths per task-attempt. Fine print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature. The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.
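A minimal sketch of this using the old org.apache.hadoop.mapred API; the class name and the side-file naming scheme are made up for illustration. FileOutputFormat.getWorkOutputPath() resolves to ${mapred.output.dir}/_${taskid} while the task-attempt runs, so anything written there is promoted automatically when the attempt succeeds.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Illustrative reducer that writes a side-file per key into the task-attempt's work dir.
public class SideFileReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private JobConf conf;

    @Override
    public void configure(JobConf job) {
        this.conf = job;
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        // Create a side-file inside the task-attempt's private work directory;
        // the framework moves it to ${mapred.output.dir} if this attempt succeeds.
        Path workDir = FileOutputFormat.getWorkOutputPath(conf);
        Path sideFile = new Path(workDir, "side-" + key.toString()); // hypothetical naming scheme
        FileSystem fs = sideFile.getFileSystem(conf);
        FSDataOutputStream out = fs.create(sideFile);
        try {
            out.writeBytes(key.toString() + "\t" + sum + "\n");
        } finally {
            out.close();
        }

        output.collect(key, new IntWritable(sum));
    }
}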

Hadoop Cookbook : How to configure hadoop data nodes to store data on multiple volumes/disks?
Data nodes can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories, specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir in hdfs-site.xml. Data nodes will attempt to place equal amounts of data in each of the directories. Example: add the following to hdfs-site.xml.

<property>
  <name>dfs.data.dir</name>
  <value>/data/data01/hadoop/hdfs/data,/data/data02/hadoop/hdfs/data,/data/data03/hadoop/hdfs/data,/data/data04/hadoop/hdfs/data,/data/data05/hadoop/hdfs/data,/data/data06/hadoop/hdfs/data</value>
  <final>true</final>
</property>

Hadoop Cookbook : How to configure hadoop Name Node to store data on multiple volumes/disks?
The name node supports storing its metadata (the namespace image and the edits log) in multiple directories. The directories are specified via the dfs.name.dir configuration parameter in hdfs-site.xml. The name node directories are used for namespace data replication, so that the image and the log can be restored from the remaining volumes if one of them fails. Example: add this to hdfs-site.xml.

<property>
  <name>dfs.name.dir</name>
  <value>/data/data01/hadoop/hdfs/name,/data/data02/hadoop/hdfs/name</value>
  <final>true</final>
</property>

Hadoop Cookbook : How to decommission data nodes gracefully?

You want to take some data nodes out of your cluster; what is the graceful way to remove nodes without corrupting the file system? On a large cluster, removing one or two data nodes will not lead to any data loss, because the name node will re-replicate their blocks as soon as it detects that the nodes are dead. With a large number of nodes getting removed or dying, the probability of losing data is higher. Hadoop offers the decommission feature to retire a set of existing data nodes. The nodes to be retired should be included in the exclude file, and the exclude file name should be specified as the configuration parameter dfs.hosts.exclude. This file should have been specified during namenode startup. It could be a zero length file. You must use the full hostname, ip or ip:port format in this file.

Then the shell command

bin/hadoop dfsadmin -refreshNodes

should be called, which forces the name node to re-read the exclude file and start the decommission process. Decommission does not happen momentarily, since it requires replication of potentially a large number of blocks and we do not want the cluster to be overwhelmed with just this one job. The decommission progress can be monitored on the name node web UI. Until all blocks are replicated the node will be in the "Decommission In Progress" state; when decommission is done the state will change to "Decommissioned". The nodes can be removed whenever decommission is finished. The decommission process can be terminated at any time by editing the configuration or the exclude files and repeating the -refreshNodes command.

Hadoop Cookbook : How to test Hadoop cluster setup?


You just finished your Hadoop cluster setup; how will you verify that the setup was successful? Perform the following steps.

1. Copy/put some test data onto HDFS using the following commands:

> hadoop fs -put
or
> hadoop fs -copyFromLocal

Now perform hadoop fs -ls to verify the copied files and make sure that the copy/put operation was successful. To make sure that there were no errors in data node connections, tail the name node and data node logs for any errors. Possible connection errors or permission problems will be exposed by the copy operation.

2. Run a word count job. To make sure that Hadoop MapReduce is working properly, run the word count example, e.g.:

> hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount

Check that the word count job completes successfully and that the output dir is created, and make sure that there are no error messages while the job is running.

3. Run a Teragen job. You can run a Teragen job to write a large amount of data to the cluster, verifying that all data nodes, task trackers and network connections are working correctly, e.g.:

> hadoop jar $HADOOP_HOME/hadoop-examples.jar teragen 1000000000

4. Run TestDFSIO or DFSCIOTest. Finally, to get the throughput of the cluster you can run TestDFSIO or DFSCIOTest.

Hadoop Cookbook : How to interview for a Hadoop admin job?


These are a few problems whose solutions a good Hadoop admin should know.

List 3 hadoop fs shell commands to perform a copy operation
- fs -copyToLocal
- fs -copyFromLocal
- fs -put

How to decommission nodes from the HDFS cluster?
- Add the nodes to the exclude file (dfs.hosts.exclude) and execute hadoop dfsadmin -refreshNodes.

How to add new nodes to the HDFS cluster?
- Add the new node's hostname to the slaves file and start the data node and task tracker on the new node.

How to perform a copy across multiple HDFS clusters?
- Use distcp to copy files across multiple clusters.

How to verify if HDFS is corrupt?
- Execute hadoop fsck to check for missing blocks.

What are the default configuration files that are used in Hadoop?
- As of the 0.20 release, Hadoop supports the following read-only default configurations:
  - src/core/core-default.xml
  - src/hdfs/hdfs-default.xml
  - src/mapred/mapred-default.xml

How will you make changes to the default configuration files?
- Hadoop does not recommend changing the default configuration files; instead it recommends making all site-specific changes in the following files:
  - conf/core-site.xml
  - conf/hdfs-site.xml
  - conf/mapred-site.xml

Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order from the classpath:
- core-default.xml: read-only defaults for Hadoop.
- core-site.xml: site-specific configuration for a given Hadoop installation.

Hence if the same configuration property is defined in both core-default.xml and core-site.xml, the value from core-site.xml overrides the default (and the same is true for the other two file pairs).

Consider a scenario where you have set the property mapred.output.compress to true to ensure that all output files are compressed for efficient space usage on the cluster. If a cluster user does not want to compress data for a specific job, what will you recommend he do?
- Ask him to create his own configuration file, set mapred.output.compress to false in it, and load this file as a resource in his job.
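A minimal sketch of that recommendation using Hadoop's Configuration API; the overrides file name is made up, and the file is assumed to be on the job's classpath.

import org.apache.hadoop.conf.Configuration;

// Sketch: per-job override of a cluster-wide default (file name is illustrative).
public class JobConfigOverride {
    public static void main(String[] args) {
        Configuration conf = new Configuration();      // loads *-default.xml, then *-site.xml
        conf.addResource("my-job-overrides.xml");      // hypothetical file setting mapred.output.compress=false
        System.out.println(conf.get("mapred.output.compress"));
    }
}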

Which of the following is the only required variable that needs to be set in conf/hadoop-env.sh for Hadoop to work?
- HADOOP_LOG_DIR
- JAVA_HOME
- HADOOP_CLASSPATH
The only required variable is JAVA_HOME, which needs to point to the Java installation directory.

List all the daemons required to run the Hadoop cluster
- NameNode
- DataNode
- JobTracker
- TaskTracker

What's the default port that the JobTracker web UI listens on? 50030

What's the default port that the DFS NameNode web UI listens on? 50070
