CAP2956-Inside The Hadoop Machine - Final - US PDF

APP-CAP2956
Inside the Hadoop Machine
Jeff Buell, VMware, Inc. Richard McDougall, VMware, Inc. Sanjay Radia, Hortonworks
#vmworldapps
Disclaimer
This session may contain product features that are currently under development.
This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.
Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Technical feasibility and market demand will affect final delivery.
Pricing and packaging for any new technologies or features discussed or presented have not been determined.
Broad Application of Hadoop technology

Horizontal Use Cases
- Log Processing / Click Stream Analytics
- Machine Learning / sophisticated data mining
- Web crawling / text processing
- Extract Transform Load (ETL) replacement
- Image / XML message processing
- General archiving / compliance
Vertical Use Cases
- Financial Services
- Internet Retailer
- Pharmaceutical / Drug Discovery
- Mobile / Telecom
- Scientific Research
- Social Media
Hadoop's ability to handle large volumes of unstructured data affordably and efficiently makes it a valuable toolkit for enterprises across a number of applications and fields.
How does Hadoop enable parallel processing?
A framework for distributed processing of large data sets across clusters of computers using a simple programming model.
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
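
The "simple programming model" mentioned above can be illustrated with a minimal, single-process sketch of the map, shuffle, and reduce phases. This is not Hadoop code: the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are hypothetical, and a real cluster would run the map and reduce calls in parallel on many nodes.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: combine all values that share a key."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values per key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(
    ["big data big clusters", "data locality"], map_words, reduce_counts
)
print(word_counts)
```

Because each mapper call sees only one record and each reducer call sees only one key, the framework is free to spread both phases across machines.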
Hadoop System Architecture
MapReduce: Programming framework for highly parallel data processing
Hadoop Distributed File System (HDFS): Distributed data storage
Hadoop Map-Reduce Framework (Runtime Layer)
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
Hadoop Distributed File System
Hadoop Data Locality and Replication
Hadoop Virtualization Extensions: Topology Awareness
Why Virtualize Hadoop?
Simple to Operate
- Rapid deployment
- Unified operations across enterprise
- Easy clone of cluster
Highly Available
- No more single point of failure
- One click to set up
- High availability for MR jobs
Elastic Scaling
- Shrink and expand cluster on demand
- Resource guarantee
- Independent scaling of compute and data
Enterprise Challenges with Using Hadoop

Deployment
- Slow to provision
- Complex to keep running/tune
Single Points of Failure
- Single point of failure with NameNode and JobTracker
- No HA for Hadoop framework components (Hive, HCatalog, etc.)
Low Utilization
- Dedicated clusters to run Hadoop with low CPU utilization
- No easy way to share resources between Hadoop and non-Hadoop workloads
- Noisy neighbor, lack of resource containment
Need Multi-tenant Isolation, Resource Management, etc.
- Noisy neighbor: no performance or security isolation between different tenants/users
- Lack of configuration isolation: can't run multiple versions on the same cluster
Virtualization enables a Common Infrastructure for Big Data
Cluster Sprawling
- Single-purpose clusters for various business applications lead to cluster sprawl.
Cluster Consolidation
- Simplify: single hardware infrastructure, unified operations
- Optimize: shared resources = higher utilization; elastic resources = faster on-demand access
[Figure: separate Hadoop, HBase, and MPP DB clusters (cluster sprawl) consolidated onto a shared virtualization platform.]
Deploy a Hadoop Cluster in under 30 Minutes

Step 1: Deploy Serengeti virtual appliance on vSphere.
- Deploy vHelperOVF to vSphere
Step 2: A few simple commands to stand up Hadoop Cluster.
- Select compute, memory, storage and network
- Select configuration template
- Automate deployment
- Done
A Tour Through Serengeti
$ ssh serengeti@serengeti-vm
$ serengeti
serengeti>
serengeti> cluster create --name myElephant
serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status: RUNNING
NAME    ROLES                                  INSTANCE  CPU  MEM(MB)  TYPE
----------------------------------------------------------------------------
master  [hadoop_NameNode, hadoop_jobtracker]   1         2    7500     LOCAL  50
master  [hive, hadoop_client, pig]             1         1    3700     LOCAL  50

NAME                HOST                             IP
----------------------------------------------------------------
myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184
$ ssh rmc@rmc-elephant-009.eng.vmware.com
$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
Serengeti Spec File

[
  "distro": "apache",                                   <-- Choice of Distro
  {
    "name": "master",
    "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha": true                                          <-- HA Option
  },
  {
    "name": "worker",
    "roles": [ "hadoop_datanode", "hadoop_tasktracker" ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {                                        <-- Choice of Shared Storage or Local Disk
      "type": "LOCAL",
      "sizeGB": 10
    }
  }
]
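
As a rough illustration of how such a spec could be read programmatically, here is a minimal sketch that parses a cleaned-up version of the slide's example and summarizes it. The slide's snippet is not valid JSON as shown, so the wrapper object and its `nodeGroups` key are assumptions made for this sketch, as is the `summarize_spec` helper; neither is taken from the slide.

```python
import json

# Cleaned-up version of the slide's spec. The outer object and the
# "nodeGroups" key are assumptions added so the text parses as JSON.
SPEC_TEXT = """
{
  "distro": "apache",
  "nodeGroups": [
    {"name": "master",
     "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
     "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instanceNum": 5, "instanceType": "SMALL",
     "storage": {"type": "LOCAL", "sizeGB": 10}}
  ]
}
"""


def summarize_spec(spec):
    """Return node count, HA-protected groups, and total local disk in GB."""
    groups = spec["nodeGroups"]
    return {
        "total_nodes": sum(g["instanceNum"] for g in groups),
        "ha_groups": [g["name"] for g in groups if g.get("ha", False)],
        "local_storage_gb": sum(
            g["instanceNum"] * g["storage"]["sizeGB"]
            for g in groups
            if g.get("storage", {}).get("type") == "LOCAL"
        ),
    }


summary = summarize_spec(json.loads(SPEC_TEXT))
print(summary)
```

For the example above this reports one master plus five workers, with HA requested only for the master group.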
Configuring Distros
{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},
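
The distro entry above maps roles to the tarball that provides them. A minimal sketch of that lookup follows, using the entry exactly as shown on the slide (minus the trailing comma). The `tarball_for_role` helper is hypothetical and is not part of Serengeti.

```python
import json

# The distro entry from the slide, without the trailing comma.
DISTRO_TEXT = """
{
  "name": "cdh",
  "version": "3u3",
  "packages": [
    {"roles": ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker",
               "hadoop_datanode", "hadoop_client"],
     "tarball": "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"},
    {"roles": ["hive"], "tarball": "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"},
    {"roles": ["pig"], "tarball": "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"}
  ]
}
"""


def tarball_for_role(distro, role):
    """Return the tarball path that provides a role, or None if no package does."""
    for package in distro["packages"]:
        if role in package["roles"]:
            return package["tarball"]
    return None


distro = json.loads(DISTRO_TEXT)
print(tarball_for_role(distro, "hive"))
print(tarball_for_role(distro, "hadoop_datanode"))
```

A role that no package lists, such as a hypothetical `hbase` role, returns `None`, which is one way a deployment tool could detect an incomplete distro definition.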
Open Source of Serengeti, Spring Hadoop, Hadoop Extensions
- Support major distributions and multiple projects
- Contribute Hadoop Virtualization Extensions (HVE) to the open source community
[Figure: logos of commercial vendors and community projects.]
Use Local Disk Where It's Needed

Storage type    Cost per GB   $1M gets: capacity   IOPS      Bandwidth
SAN Storage     $2 - $10      0.5 PB               200,000   8 GB/s
NAS Filers      $1 - $5       1 PB                 200,000   10 GB/s
Local Storage   $0.05         10 PB                400,000   250 GB/s
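
To compare the three tiers on equal terms, the "$1M gets" figures can be converted into an implied cost per gigabyte and a bandwidth per petabyte. The sketch below does only that arithmetic; it assumes decimal units (1 PB = 1,000,000 GB), which the slide does not state, and the variable names are illustrative.

```python
GB_PER_PB = 1_000_000  # assumption: decimal units, not stated on the slide
BUDGET_USD = 1_000_000

# (capacity in PB, bandwidth in GB/s) that $1M buys, as quoted on the slide
TIERS = {
    "SAN": (0.5, 8),
    "NAS": (1.0, 10),
    "Local": (10.0, 250),
}

implied_cost_per_gb = {
    name: BUDGET_USD / (capacity_pb * GB_PER_PB)
    for name, (capacity_pb, _) in TIERS.items()
}
bandwidth_per_pb = {
    name: bandwidth / capacity_pb
    for name, (capacity_pb, bandwidth) in TIERS.items()
}

for name in TIERS:
    print(f"{name}: ${implied_cost_per_gb[name]:.2f}/GB, "
          f"{bandwidth_per_pb[name]:.0f} GB/s per PB")
```

Under this assumption the SAN and NAS capacities match the low end of their quoted price ranges, while the local-disk capacity implies about $0.10/GB rather than the quoted $0.05/GB. The slide does not explain the difference.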
Extend Virtual Storage Architecture to Include Local Disk
Shared Storage: SAN or NAS
- Easy to provision
- Automated cluster rebalancing
Hybrid Storage
- SAN for boot images, VMs, other workloads
- Local disk for Hadoop & HDFS
- Scalable bandwidth, lower cost/GB
[Figure: hosts running Hadoop VMs alongside other VMs; Hadoop VMs use local disk while other VMs use shared storage.]
Virtualized Hadoop Performance
Issues of interest
- Native vs various virtual configurations
- Local disks vs Fibre Channel SAN
- Effect of protecting Hadoop master daemons with Fault Tolerance
- Public cloud (renting) vs private cloud (buying)
Hardware
- Arista 7124SX 10 GbE switch
- 24x HP DL380 G7: 2x X5687, 72 GB, 16x SAS 146 GB, Broadcom 10 GbE adapter, Qlogic 8 Gb/s HBA
- EMC VNX7500
Configuration
Software
- vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
- RHEL 6.1 x86_64
- Cloudera CDH3u4
- Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)
Hadoop VMs
- Processors (16 logical threads), memory (72 GB), and disks (12) partitioned among 1, 2, or 4 VMs per host
- Separate VMs for NameNode and JobTracker for storage and FT tests
Hadoop configuration
- One map and one reduce task per vCPU (= logical thread)
- Machines are highly loaded
- 256 MB block size
- FT tests: block sizes from 8 to 256 MB to vary load on NN and JT
Native versus Virtual Platforms, 24 hosts, 12 disks/host

[Figure: bar chart of elapsed time in seconds (lower is better, axis 0 to 450) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs.]
Local vs Various SAN Storage Configurations

16 x HP DL380G7, EMC VNX 7500, 96 physical disks
[Figure: bar chart of elapsed time ratio to local disks (lower is better, axis 0 to 4.5) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5.]
Performance Effect of FT for Master Daemons
- NameNode and JobTracker placed in separate UP VMs
- Small overhead: enabling FT causes a 2-4% slowdown for TeraSort
- The 8 MB case places a similar load on NN & JT as >200 hosts with 256 MB
[Figure: TeraSort elapsed time ratio to FT off (axis 1 to 1.04) versus HDFS block size of 256, 64, 16, and 8 MB.]
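
Shrinking the HDFS block size raises master-daemon load because the same data set is split into more blocks for the NameNode to track and, with one map task per block by default, more tasks for the JobTracker to schedule. The sketch below estimates the block count for each block size in the FT tests. It assumes a 1 TB data set of 10^12 bytes stored as if it were one file; the `block_count` helper is illustrative and ignores per-file rounding.

```python
DATASET_BYTES = 10**12  # assumption: "1 TB" means 10^12 bytes
MIB = 2**20


def block_count(block_size_mb, dataset_bytes=DATASET_BYTES):
    """Approximate number of HDFS blocks, treating the data as one file."""
    block_bytes = block_size_mb * MIB
    return -(-dataset_bytes // block_bytes)  # integer ceiling division


counts = {size: block_count(size) for size in (256, 64, 16, 8)}
for size, blocks in counts.items():
    print(f"{size:>3} MB blocks: {blocks} blocks")
```

Going from 256 MB to 8 MB blocks multiplies the block count by about 32 on the same hardware, which is how a small test cluster can approximate the master-daemon load of a much larger one.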
Different Clouds for Different Folks
- Yahoo! Hadoop 2009: classic benchmark test, 1460 hosts
- Google/MapR: SaaS on Google Compute Engine
- vSphere 5.1: 24-host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4
Vastly different cluster sizes
- Compare throughput (MB sorted per second) normalized by resources
- Cost: rental or estimate of running continuously for 3 years

                              #cores   #disks   TeraSort, s   MB/s/core   MB/s/disk   cost
Yahoo!                        11680    5840     62            1.3         2.6         ~$7
Google/MapR                   5024     1256     80            2.4         9.5         $16
vSphere 5.1 (8 disks/host)    192      192      442           11.2        11.2        ~$2
vSphere 5.1 (12 disks/host)   192      288      359           13.8        9.2         ~$2
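
The normalized throughput columns can be recomputed from the core counts, disk counts, and TeraSort times. The sketch below assumes a 1 TB sort of 10^12 bytes and 1 MB = 2^20 bytes; those unit choices are not stated on the slide, but they reproduce its figures to one decimal place. The `normalized_throughput` helper and the row labels are illustrative.

```python
DATASET_MB = 10**12 / 2**20  # assumption: 10^12 bytes, expressed in MiB

# (label, cores, disks, TeraSort elapsed seconds) as quoted on the slide
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized_throughput(cores, disks, seconds, dataset_mb=DATASET_MB):
    """Return (MB/s per core, MB/s per disk) for one sort run."""
    mb_per_second = dataset_mb / seconds
    return mb_per_second / cores, mb_per_second / disks


results = {
    label: normalized_throughput(cores, disks, seconds)
    for label, cores, disks, seconds in RUNS
}
for label, (per_core, per_disk) in results.items():
    print(f"{label}: {per_core:.1f} MB/s/core, {per_disk:.1f} MB/s/disk")
```

Normalizing this way lets a 24-host cluster be compared with a 1460-host one: the large clusters finish sooner in absolute terms but deliver less throughput per core.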
Highly Available

VMware-Hortonworks Joint Engineering
Hortonworks goal
- Expand Hadoop ecosystem
- Provide first-class support for various platforms
- Hadoop should run well on VMs
- VMs offer several advantages, as presented earlier
- Take advantage of vSphere for HA
First-class support for VMs
- Topology plugins (HADOOP-8468): 2 VMs can be on the same host
  - Pick closer data
  - Schedule tasks closer
  - Don't put two replicas on the same host
- MR-tmp on HDFS using block pools
  - Elastic compute VMs will not need local disk
- Fast communications within VMs
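
The rule "don't put two replicas on the same host" matters once several DataNode VMs share one physical server, since a host failure would otherwise take out multiple copies of a block. The sketch below is a deliberately simplified greedy placement that enforces only that rule. It is not the HDFS or HVE placement policy (which also weighs racks and load), and `place_replicas` and the VM/host names are hypothetical.

```python
def place_replicas(writer_vm, vm_to_host, replicas=3):
    """Pick replica VMs so that no two share a physical host.

    The first replica stays on the writer's VM; the rest are chosen
    greedily in name order from VMs on hosts not yet used.
    """
    chosen = [writer_vm]
    used_hosts = {vm_to_host[writer_vm]}
    for vm in sorted(vm_to_host):
        if len(chosen) == replicas:
            break
        host = vm_to_host[vm]
        if host not in used_hosts:
            chosen.append(vm)
            used_hosts.add(host)
    if len(chosen) < replicas:
        raise ValueError("not enough distinct physical hosts for all replicas")
    return chosen


# Two VMs per host on hostA and hostB, one VM on hostC (illustrative names).
vm_to_host = {
    "vm1": "hostA", "vm2": "hostA",
    "vm3": "hostB", "vm4": "hostB",
    "vm5": "hostC",
}
placement = place_replicas("vm1", vm_to_host)
print(placement)
```

Without knowledge of the VM-to-host mapping, a placement policy would see vm1 and vm2 as independent nodes and could put two replicas on hostA.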
Hadoop Full-Stack High Availability
[Figure: slave nodes of the Hadoop cluster running jobs, plus apps running outside the cluster; on failover the JT goes into safemode. The master daemons (NN, JT) run on an HA cluster of servers with N+K failover.]
HA is in HDP 1.0
Using Total System Availability Architecture
HA in Hadoop 1 with HDP1
Full Stack High Availability
- NameNode
  - Clients pause automatically
  - JobTracker pauses automatically
- Other Hadoop master services (JT, ...) coming
Use industry-proven HA framework
- VMware vSphere HA
- Failover, fencing, ...: corner cases are tricky and, if not addressed, lead to corruption
Additional benefits
- N-N & N+K failover
- Migration for maintenance
Hadoop NN/JT HA with vSphere
Namenode Failover Times
60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
- Failure detection and failover: 0.5 to 2 minutes
- NameNode startup (exit safemode): 30 sec
180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
- Failure detection and failover: 0.5 to 2 minutes
- NameNode startup (exit safemode): 110 sec
For vSphere, an OS bootup is needed; its 10-20 seconds are included above.
Cold failover is good enough for small/medium clusters.
Failure detection and automatic failover dominate.
Summary
Advantages of Hadoop on VMs
- Cluster management
- Cluster consolidation
- Greater elasticity in mixed environments
- Alternate multi-tenancy to the capacity scheduler's offerings
HA for Hadoop Master Daemons
- vSphere-based HA for NN, JT, ... in Hadoop 1
- Total System Availability Architecture
Highly Available

Elastic Scaling and Multi-tenancy of Hadoop on vSphere
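Current Hadoop: combined storage/compute
1. Hadoop in VM: single tenant, fixed resources
2. Separate compute and data: single tenant, elastic compute
3. Multiple clusters: multiple tenants, elastic compute
[Figure: progression from one VM holding combined storage and compute, to a compute VM separated from a storage VM, to per-tenant compute VMs (T1, T2) sharing a storage layer.]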
Separated Compute and Data

[Figure: a virtualization host running several virtual Hadoop nodes, each with a Task Tracker and task slots, alongside a virtual Hadoop node running the Datanode (backed by VMDKs) and another, non-Hadoop workload.]
Truly Elastic Hadoop: Scalable through virtual nodes
References
- www.projectserengeti.org
- www.hortonworks.com
- www.cloudera.com
- Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
- MapR/Google blog: www.mapr.com/blog/google-mapr