
APP-CAP2956

Inside the Hadoop Machine

Jeff Buell, VMware, Inc. Richard McDougall, VMware, Inc. Sanjay Radia, Hortonworks

#vmworldapps

Disclaimer

This session may contain product features that are currently under development.

This session/overview of the new technology represents no commitment from VMware to deliver these features in any generally available product.

Features are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features discussed or presented have not been determined.

Broad Application of Hadoop technology


Horizontal Use Cases
  Log Processing / Click Stream Analytics
  Machine Learning / sophisticated data mining
  Web crawling / text processing
  Extract Transform Load (ETL) replacement
  Image / XML message processing
  General archiving / compliance

Vertical Use Cases
  Financial Services
  Internet Retailer
  Pharmaceutical / Drug Discovery
  Mobile / Telecom
  Scientific Research
  Social Media

Hadoop's ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.

How does Hadoop enable parallel processing?

A framework for distributed processing of large data sets across clusters of computers using a simple programming model.

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
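
The "simple programming model" can be illustrated with a single-process word count. This is only a sketch of the map, shuffle, and reduce steps in plain Python, not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical. On a real cluster the map and reduce calls would run in parallel on many machines, with the framework performing the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["big data big cluster", "data locality"])
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across machines without changing the user's code.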

Hadoop System Architecture

MapReduce: Programming framework for highly parallel data processing

Hadoop Distributed File System (HDFS): Distributed data storage

Hadoop Map-Reduce Framework (Runtime Layer)

Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works

Hadoop Distributed File System

Hadoop Data Locality and Replication

Hadoop Virtualization Extensions: Topology Awareness

Why Virtualize Hadoop?

Simple to Operate
  Rapid deployment
  Unified operations across enterprise
  Easy clone of cluster

Highly Available
  No more single point of failure
  One click to set up
  High availability for MR jobs

Elastic Scaling
  Shrink and expand cluster on demand
  Resource guarantee
  Independent scaling of compute and data

Enterprise Challenges with Using Hadoop


Deployment
  Slow to provision
  Complex to keep running/tune

Single Points of Failure
  Single point of failure with NameNode and JobTracker
  No HA for Hadoop framework components (Hive, HCatalog, etc.)

Low Utilization
  Dedicated clusters to run Hadoop with low CPU utilization
  No easy way to share resources between Hadoop and non-Hadoop workloads
  Noisy neighbor, lack of resource containment

Need Multi-tenant Isolation, Resource Management, etc.
  Noisy neighbor: no performance or security isolation between different tenants/users
  Lack of configuration isolation: can't run multiple versions on the cluster

Virtualization enables a Common Infrastructure for Big Data

[Figure: separate MPP DB, HBase, and Hadoop clusters (cluster sprawl) consolidated onto a shared virtualization platform (cluster consolidation)]

Cluster Sprawl
  Single-purpose clusters for various business applications lead to cluster sprawl.

Cluster Consolidation
  Simplify
    Single hardware infrastructure
    Unified operations
  Optimize
    Shared resources = higher utilization
    Elastic resources = faster on-demand access

Deploy a Hadoop Cluster in under 30 Minutes


Step 1: Deploy the Serengeti virtual appliance on vSphere.
  Deploy vHelperOVF to vSphere

Step 2: A few simple commands to stand up a Hadoop cluster.
  Select compute, memory, storage and network
  Select configuration template
  Automate deployment
  Done

A Tour Through Serengeti

$ ssh serengeti@serengeti-vm

$ serengeti
serengeti>


A Tour Through Serengeti

serengeti> cluster create --name myElephant

serengeti> cluster list --name myElephant

name: myElephant, distro: cdh, status: RUNNING

NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
--------------------------------------------------------------------------
master  [hadoop_NameNode, hadoop_jobtracker]  1         2    7500     LOCAL  50

name: myElephant, distro: cdh, status: RUNNING

NAME    ROLES                                 INSTANCE  CPU  MEM(MB)  TYPE
--------------------------------------------------------------------------
master  [hive, hadoop_client, pig]            1         1    3700     LOCAL  50

NAME                HOST                             IP
----------------------------------------------------------------
myElephant-client0  rmc-elephant-009.eng.vmware.com  10.0.20.184

A Tour Through Serengeti

$ ssh rmc@rmc-elephant-009.eng.vmware.com

$ hadoop jar hadoop-examples.jar teragen 1000000000 tera-data
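
As a quick sizing check for the command above: TeraGen's first argument is a row count, and each generated row is 100 bytes. The sketch below works out the data volume for a given row count and the row count needed for a target size. The function names are hypothetical and decimal units (1 GB = 10^9 bytes) are assumed.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_bytes(rows):
    """Total bytes written by TeraGen for the given number of rows."""
    return rows * ROW_BYTES


def rows_for(total_bytes):
    """Number of rows to request in order to generate total_bytes of data."""
    return total_bytes // ROW_BYTES


# The command on this slide requests 1,000,000,000 rows.
size_gb = teragen_bytes(1_000_000_000) / 10**9

# Rows needed for a 1 TB (10^12 byte) data set.
rows_1tb = rows_for(10**12)
```

Under these assumptions the command on the slide generates about 100 GB, and a 1 TB data set needs ten times as many rows.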


Serengeti Spec File


[
  "distro": "apache",
  {
    "name": "master",
    "roles": [
      "hadoop_NameNode",
      "hadoop_jobtracker"
    ],
    "instanceNum": 1,
    "instanceType": "MEDIUM",
    "ha": true
  },
  {
    "name": "worker",
    "roles": [
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instanceNum": 5,
    "instanceType": "SMALL",
    "storage": {
      "type": "LOCAL",
      "sizeGB": 10
    }
  }
]

Callouts:
  "distro": Choice of distro
  "ha": HA option
  "storage": Choice of shared storage or local disk
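
Node-group entries like the ones in this spec can also be generated and sanity-checked from a script. The following is a minimal sketch: the field names (`name`, `roles`, `instanceNum`, `instanceType`, `ha`, `storage`) are copied from the slide, while the helpers `node_group` and `total_instances` are hypothetical, and the exact top-level layout that Serengeti expects around the list of groups is not assumed here.

```python
import json


def node_group(name, roles, instance_num, instance_type, **extra):
    """Build one node-group entry using the field names shown on the slide."""
    group = {
        "name": name,
        "roles": list(roles),
        "instanceNum": instance_num,
        "instanceType": instance_type,
    }
    group.update(extra)  # optional fields such as "ha" or "storage"
    return group


def total_instances(groups):
    """Number of VMs the spec would ask for across all node groups."""
    return sum(g["instanceNum"] for g in groups)


groups = [
    node_group("master", ["hadoop_NameNode", "hadoop_jobtracker"],
               1, "MEDIUM", ha=True),
    node_group("worker", ["hadoop_datanode", "hadoop_tasktracker"],
               5, "SMALL", storage={"type": "LOCAL", "sizeGB": 10}),
]

# Serialize the node groups; json.dumps guarantees well-formed output.
spec_json = json.dumps(groups, indent=2)
```

Generating the entries this way avoids hand-editing mistakes such as unquoted keys or trailing commas.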

Configuring Distros

{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
},

Open Source of Serengeti, Spring Hadoop, Hadoop Extensions

Commercial Vendors

Community Projects

Support major distributions and multiple projects

Contribute Hadoop Virtualization Extensions (HVE) to the open source community

Use Local Disk Where It's Needed

                 SAN Storage         NAS Filers         Local Storage
Cost             $2 - $10/Gigabyte   $1 - $5/Gigabyte   $0.05/Gigabyte
$1M gets:
  Capacity       0.5 Petabytes       1 Petabyte         10 Petabytes
  IOPS           200,000             200,000            400,000
  Bandwidth      8 Gbytes/sec        10 Gbytes/sec      250 Gbytes/sec
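
The capacity figures above can be checked by dividing the budget by the price per gigabyte. The sketch below does that arithmetic, assuming decimal units (1 PB = 10^6 GB) and using the low end of each price range; the function name `capacity_pb` is hypothetical. The SAN and NAS results match the slide. For local storage the plain division gives 20 PB, so the slide's 10 PB presumably accounts for costs beyond the raw per-gigabyte price; that is an assumption, since the slide does not say.

```python
BUDGET_USD = 1_000_000
GB_PER_PB = 1_000_000  # decimal units assumed: 1 PB = 10^6 GB


def capacity_pb(usd_per_gb, budget_usd=BUDGET_USD):
    """Petabytes that budget_usd buys at a flat price per gigabyte."""
    return budget_usd / usd_per_gb / GB_PER_PB


# Low end of each price range quoted on the slide.
san_pb = capacity_pb(2.00)    # SAN storage at $2/GB
nas_pb = capacity_pb(1.00)    # NAS filers at $1/GB
local_pb = capacity_pb(0.05)  # local disk at $0.05/GB
```

Even on the slide's more conservative figure, local disk buys roughly ten times the capacity of NAS and twenty times that of SAN for the same budget.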

Extend Virtual Storage Architecture to Include Local Disk

Shared Storage: SAN or NAS
  Easy to provision
  Automated cluster rebalancing

Hybrid Storage
  SAN for boot images, VMs, other workloads
  Local disk for Hadoop & HDFS
  Scalable bandwidth, lower cost/GB

[Figure: hosts running Hadoop VMs alongside other VMs, with shared storage for the other VMs and local disk for Hadoop]

Virtualized Hadoop Performance

Issues of interest
  Native vs various virtual configurations
  Local disks vs Fibre Channel SAN
  Effect of protecting Hadoop master daemons with Fault Tolerance
  Public cloud (renting) vs private cloud (buying)

Arista 7124SX 10 GbE switch

24x HP DL380 G7
  2x X5687, 72 GB
  16x SAS 146 GB
  Broadcom 10 GbE adapter
  Qlogic 8 Gb/s HBA

EMC VNX7500

Configuration

Software
  vSphere 5.0 U1 (storage tests), 5.1 (Native/Virtual, FT)
  RHEL 6.1 x86_64
  Cloudera CDH3u4
  Hadoop applications: TeraGen, TeraSort, TeraValidate (1 TB)

Hadoop VMs
  Processors (16 logical threads), memory (72 GB), disks (12) partitioned among 1, 2, or 4 VMs per host
  Separate VMs for NameNode and JobTracker for storage and FT tests

Hadoop configuration
  One map and one reduce task per vCPU (= logical thread)
    Machines are highly loaded
  256 MB block size
  FT tests: 8 to 256 MB block sizes to vary load on NN and JT
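
The partitioning described above can be sketched as simple arithmetic: each host's 16 logical threads and 12 disks are divided evenly among 1, 2, or 4 VMs, and each vCPU gets one map slot and one reduce slot. The function names and the `HOSTS = 24` value (taken from the 24-host native/virtual tests) are assumptions for illustration; memory is left out because the slides do not say how much of the 72 GB each VM received.

```python
HOST_THREADS = 16  # logical threads per host
HOST_DISKS = 12    # data disks per host
HOSTS = 24         # hosts in the native/virtual comparison (assumed here)


def per_vm_resources(vms_per_host):
    """Evenly split one host's threads and disks among its VMs."""
    if HOST_THREADS % vms_per_host or HOST_DISKS % vms_per_host:
        raise ValueError("resources do not divide evenly among VMs")
    vcpus = HOST_THREADS // vms_per_host
    return {
        "vcpus": vcpus,
        "disks": HOST_DISKS // vms_per_host,
        "map_slots": vcpus,     # one map task per vCPU
        "reduce_slots": vcpus,  # one reduce task per vCPU
    }


def cluster_slots(vms_per_host):
    """Total (map, reduce) slots across the whole cluster."""
    vm = per_vm_resources(vms_per_host)
    vm_count = HOSTS * vms_per_host
    return vm_count * vm["map_slots"], vm_count * vm["reduce_slots"]


layouts = {n: per_vm_resources(n) for n in (1, 2, 4)}
```

The total slot count is the same in every layout; only the number of Hadoop nodes and the resources per node change.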

Native versus Virtual Platforms, 24 hosts, 12 disks/host


[Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs]

Local vs Various SAN Storage Configurations


16 x HP DL380G7, EMC VNX 7500, 96 physical disks

[Chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0, SAN RAID-0 with 16 KB page size, SAN RAID-5]

Performance Effect of FT for Master Daemons

NameNode and JobTracker placed in separate UP VMs
Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
8 MB case places similar load on NN & JT as >200 hosts with 256 MB

[Chart: TeraSort elapsed time ratio to FT off (axis from 1 to 1.04) versus HDFS block size in MB (256, 64, 16, 8)]

Different Clouds for Different Folks

Yahoo! Hadoop 2009: Classic benchmark test, 1460 hosts
Google/MapR: SaaS on Google Compute Engine
vSphere 5.1: 24 host cluster, 2 VMs/host, 8 or 12 disks/host, CDH3u4

Vastly different cluster sizes
  Compare throughput (MB sorted per second) normalized with resources

Cost: rental or estimate of running continuously for 3 years

              #cores  #disks  TeraSort, s  MB/s/core  MB/s/disk  cost
Yahoo!        11680   5840    62           1.3        2.6        ~$7
Google/MapR   5024    1256    80           2.4        9.5        $16
vSphere 5.1   192     192     442          11.2       11.2       ~$2
vSphere 5.1   192     288     359          13.8       9.2        ~$2
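
The normalized throughput columns can be recomputed from the core counts, disk counts, and TeraSort times. The sketch below assumes every run sorted the same 1 TB data set, with 1 TB = 10^12 bytes and 1 MB = 2^20 bytes. The slide does not state its unit convention; this combination is simply the one that reproduces its rounded figures. The function and variable names are hypothetical.

```python
DATA_BYTES = 10**12  # 1 TB sorted (assumed decimal terabyte)
MB = 2**20           # bytes per MB (assumed binary megabyte)

# (label, cores, disks, TeraSort elapsed seconds), taken from the table
RUNS = [
    ("Yahoo!", 11680, 5840, 62),
    ("Google/MapR", 5024, 1256, 80),
    ("vSphere 5.1, 8 disks/host", 192, 192, 442),
    ("vSphere 5.1, 12 disks/host", 192, 288, 359),
]


def normalized_throughput(cores, disks, seconds):
    """Return (MB/s per core, MB/s per disk) for one TeraSort run."""
    mb_per_s = DATA_BYTES / MB / seconds
    return mb_per_s / cores, mb_per_s / disks


results = {
    label: tuple(round(x, 1) for x in normalized_throughput(c, d, s))
    for label, c, d, s in RUNS
}
```

Normalizing by cores and by disks is what makes clusters of such different sizes comparable: raw elapsed time favors the largest cluster, while per-resource throughput shows how efficiently each one uses its hardware.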


VMware-Hortonworks Joint Engineering

Hortonworks goal
  Expand Hadoop ecosystem
  Provide first class support of various platforms
  Hadoop should run well on VMs
  VMs offer several advantages as presented earlier
  Take advantage of vSphere for HA

First class support for VMs
  Topology plugins (HADOOP-8468)
    2 VMs can be on same host
    Pick closer data
    Schedule tasks closer
    Don't put two replicas on same host
  MR-tmp on HDFS using block pools
    Elastic Compute-VMs will not need local disk
  Fast communications within VMs
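
The topology points above can be illustrated with a small sketch. When two VMs share a physical host, a topology-aware layer can prefer data on the same host and refuse to place two replicas on one host. This is a deliberately simplified illustration, not HDFS's actual block placement policy or the HVE implementation; the `Node` type, the rank values, and the function names are all hypothetical.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str  # VM running a DataNode / TaskTracker
    host: str  # physical host the VM runs on
    rack: str  # rack containing that host


def locality_rank(reader, holder):
    """Lower is closer: same VM, same host, same rack, then off-rack."""
    if reader.name == holder.name:
        return 0
    if reader.host == holder.host:
        return 1
    if reader.rack == holder.rack:
        return 2
    return 3


def pick_closest(reader, holders):
    """Pick the replica holder that is closest to the reader."""
    return min(holders, key=lambda node: locality_rank(reader, node))


def choose_replicas(writer, candidates, replication=3):
    """Place replicas so that no two land on the same physical host."""
    chosen = [writer]
    used_hosts = {writer.host}
    for node in candidates:
        if len(chosen) == replication:
            break
        if node.host in used_hosts:
            continue  # a second replica here would share a failure domain
        chosen.append(node)
        used_hosts.add(node.host)
    return chosen


vm1 = Node("vm1", "hostA", "rack1")
vm2 = Node("vm2", "hostA", "rack1")
vm3 = Node("vm3", "hostB", "rack1")
vm4 = Node("vm4", "hostB", "rack1")
vm5 = Node("vm5", "hostC", "rack2")

replicas = choose_replicas(vm1, [vm2, vm3, vm4, vm5])
closest = pick_closest(vm2, replicas)
```

Without the host level, vm1 and vm2 would look like independent nodes, and two replicas could end up on the same physical machine.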

Hadoop Full-Stack High Availability

[Figure: slave nodes of the Hadoop cluster running jobs, plus apps running outside the cluster; on failover the JT goes into safemode; the NN and JT master daemons run on an HA cluster of servers with N+K failover]

HA is in HDP 1.0
Using Total System Availability Architecture


HA in Hadoop 1 with HDP1

Full Stack High Availability


  NameNode
    Clients pause automatically
    JobTracker pauses automatically
  Other Hadoop master services (JT, ...) coming

Use industry proven HA framework
  VMware vSphere HA
  Failover, fencing, ...
  Corner cases are tricky; if not addressed, they lead to corruption

Additional benefits:
  N-N & N+K failover
  Migration for maintenance

Hadoop NN/JT HA with vSphere


Namenode Failover Times

60 nodes, 60K files, 6 million blocks, 300 TB raw storage: 1-3.5 minutes
  Failure detection and failover: 0.5 to 2 minutes
  NameNode startup (exit safemode): 30 sec

180 nodes, 200K files, 18 million blocks, 900 TB raw storage: 2-4.5 minutes
  Failure detection and failover: 0.5 to 2 minutes
  NameNode startup (exit safemode): 110 sec

For vSphere, an OS bootup is needed; its 10-20 seconds are included above.
Cold failover is good enough for small/medium clusters.
Failure detection and automatic failover dominate.

Summary

Advantages of Hadoop on VMs
  Cluster management
  Cluster consolidation
  Greater elasticity in mixed environment
  Alternate multi-tenancy to the capacity scheduler's offerings

HA for Hadoop Master Daemons
  vSphere based HA for NN, JT, ... in Hadoop 1
  Total System Availability Architecture


Elastic Scaling and Multi-tenancy of Hadoop on vSphere

[Figure: progression from current Hadoop to three virtualized layouts, showing compute VMs, storage, and tenant VMs T1 and T2]

Current Hadoop: Combined storage/compute

1. Hadoop in VM
  Single tenant
  Fixed resources

2. Separate Compute and Data
  Single tenant
  Elastic compute

3. Multiple Clusters
  Multiple tenants
  Elastic compute

Separated Compute and Data


[Figure: a virtualization host running several virtual Hadoop nodes, each with a Task Tracker and task slots, alongside other workloads; a Datanode VM is backed by VMDKs]

Truly Elastic Hadoop: Scalable through virtual nodes

References
  www.projectserengeti.org
  www.hortonworks.com
  www.cloudera.com
  Fault Tolerance performance whitepaper: www.vmware.com/resources/techresources/10301
  MapR/Google blog: www.mapr.com/blog/google-mapr
