Hadoop
Introduction & Setup
Prepared By, Nishant M Gandhi.
Certified Network Manager by Nettech. Diploma in Cyber Security (pursuing).
What is Hadoop?
Hadoop is a framework for running applications on large clusters built of commodity hardware.
Open source, written in Java. Google's MapReduce inspired Yahoo!'s Hadoop, which is now part of the Apache group. (Source: Hadoop Wiki)
Other Hadoop-related projects:
- Avro: a data serialization system.
- Cassandra: a scalable multi-master database.
- Chukwa: a data collection system for managing large distributed systems.
- HBase: a scalable, distributed database that supports structured data storage for large tables.
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: a scalable machine learning and data mining library.
- Pig: a high-level data-flow language and execution framework for parallel computation.
- ZooKeeper: coordination services for distributed applications.
On February 19, 2008, Yahoo! Inc. launched what it claimed was the world's largest Hadoop production application: the Yahoo! Search Webmap, a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.
Here are the cluster statistics from the HDFS cluster at Facebook:
- 21 PB of storage in a single HDFS cluster
- 2000 machines
- 12 TB per machine (a few machines have 24 TB each)
- 1200 machines with 8 cores each + 800 machines with 16 cores each
- 32 GB of RAM per machine
- 15 map-reduce tasks per machine
That's a total of more than 21 PB of configured storage capacity! This is larger than the previously known Yahoo! cluster of 14 PB.
Map Reduce
A programming model developed at Google, based on sort/merge distributed computing. It was initially intended for Google's internal search/indexing application, but is now used extensively by other organizations (e.g., Yahoo!, Amazon.com, IBM). It is a functional style of programming (cf. LISP) that is naturally parallelizable across a large cluster of workstations or PCs. The underlying system takes care of partitioning the input data, scheduling the program's execution across several machines, handling machine failures, and managing the required inter-machine communication. (This is the key to Hadoop's success.)
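The map -> sort/merge -> reduce flow described above can be sketched locally with standard Unix tools, no Hadoop required. This is only an illustration of the model (a word count, the classic MapReduce example), not how Hadoop itself executes jobs:

```shell
#!/bin/sh
# Local sketch of the MapReduce flow using a word count:
#   map     - emit one record (word) per line
#   shuffle - sort groups identical keys together
#   reduce  - aggregate (count) the records for each key
printf 'the quick fox\nthe lazy dog\n' |
  tr ' ' '\n' |          # map: split lines into one word per line
  sort |                 # shuffle: bring identical words together
  uniq -c |              # reduce: count occurrences of each word
  awk '{print $2, $1}'   # format as "word count"
```

In real Hadoop, the same three phases run in parallel across many machines, with the framework handling the partitioning, shuffling, and failure recovery.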
Map Reduce
At Google, MapReduce operations are run on a special file system called the Google File System (GFS), which is highly optimized for this purpose. GFS is not open source. Doug Cutting and others at Yahoo! reverse-engineered GFS and called the result the Hadoop Distributed File System (HDFS).
Goals of HDFS
- Very large distributed file system: 10K nodes, 100 million files, 10 PB
- Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
- Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
- Runs in user space, on heterogeneous operating systems
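HDFS achieves these goals partly by storing every file as fixed-size blocks (64 MB by default in Hadoop of this era) replicated across DataNodes. The block-splitting idea can be sketched locally with the standard `split` tool; the 8-byte block size here is purely for illustration on a tiny file:

```shell
#!/bin/sh
# Sketch: chop a file into fixed-size "blocks", as HDFS does
# conceptually. Real HDFS blocks default to 64 MB; 8 bytes is
# used here so the effect is visible on a 24-byte file.
tmp=$(mktemp -d)
printf 'abcdefghijklmnopqrstuvwx' > "$tmp/file"   # 24-byte file
split -b 8 "$tmp/file" "$tmp/blk_"                # 8-byte blocks
ls "$tmp" | grep -c blk_                          # prints 3
rm -r "$tmp"
```

HDFS then replicates each such block (per `dfs.replication`) on different DataNodes, which is what lets it survive the loss of commodity machines.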
DFShell
The HDFS shell is invoked by: bin/hadoop dfs <args>
core-site.xml:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>TEMPORARY-DIR-FOR-HADOOPDATASTORE</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:54310</value>
  </property>
</configuration>
hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
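A single-node (pseudo-distributed) Hadoop setup of this generation typically also configured mapred-site.xml so that MapReduce jobs find the local JobTracker. A minimal sketch, assuming the port 54311 commonly used in single-node tutorials:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:54311</value>
  </property>
</configuration>
```

With `dfs.replication` set to 1 (as above) and the JobTracker on localhost, the whole cluster runs on one machine, which is convenient for learning and testing.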
Using Hadoop
1) How to start Hadoop?
   cd hadoop/bin
   ./start-all.sh
2) How to stop Hadoop?
   cd hadoop/bin
   ./stop-all.sh
3) How to copy a file from the local machine to HDFS?
   cd hadoop
   bin/hadoop dfs -put local_machine_path hdfs_path
4) How to list files in HDFS?
   cd hadoop
   bin/hadoop dfs -ls
Thank You..