
InfiniBand and 10-Gigabit Ethernet for Dummies

A Tutorial at Supercomputing 09 by
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda
Matthew Koop, NASA Goddard
E-mail: matthew.koop@nasa.gov, http://www.cse.ohio-state.edu/~koop
Pavan Balaji, Argonne National Laboratory
E-mail: balaji@mcs.anl.gov, http://www.mcs.anl.gov/~balaji

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE, their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

Current and Next Generation Applications and Computing Systems


Big demand for
High Performance Computing (HPC): file systems, multimedia, databases, visualization
Enterprise multi-tier datacenters

Processor performance continues to grow


Chip density doubling every 18 months (multi-cores)

Commodity networking also continues to grow


Increase in speed and features & affordable pricing

Clusters are increasingly becoming popular to design next generation computing systems
Scalability, Modularity and Upgradeability with compute and network technologies
Supercomputing '09
3

Trends for Computing Clusters in the Top 500 List


Top 500 list of Supercomputers (www.top500.org)
Jun. 2001: 33/500 (6.6%) Nov. 2001: 43/500 (8.6%) Jun. 2002: 80/500 (16%) Nov. 2002: 93/500 (18.6%) Jun. 2003: 149/500 (29.8%) Nov. 2003: 208/500 (41.6%) Jun. 2004: 291/500 (58.2%) Nov. 2004: 294/500 (58.8%) Jun. 2005: 304/500 (60.8%) Nov. 2005: 360/500 (72.0%) Jun. 2006: 364/500 (72.8%) Nov. 2006: 361/500 (72.2%) Jun. 2007: 373/500 (74.6%) Nov. 2007: 406/500 (81.2%) Jun. 2008: 400/500 (80.0%) Nov. 2008: 410/500 (82.0%) Jun. 2009: 410/500 (82.0%) Nov. 2009: To be announced
Supercomputing '09
4

Integrated High-End Computing Environments


[Figure: an integrated environment consisting of a compute cluster (compute nodes and a frontend on a LAN), a storage cluster (meta-data manager and I/O server nodes), and an enterprise multi-tier datacenter for visualization and mining (Tier 1: routers/servers, Tier 2: application servers, Tier 3: database servers), interconnected over LAN/WAN]

Supercomputing '09

Networking and I/O Requirements


Good System Area Network with excellent performance (low latency and high bandwidth) for inter-processor communication (IPC) and I/O
Good Storage Area Network with high-performance I/O
Good WAN connectivity in addition to intra-cluster SAN/LAN connectivity
Quality of Service (QoS) for interactive applications
RAS (Reliability, Availability, and Serviceability)
All at low cost
Supercomputing '09
6

Major Components in Computing Systems


[Figure: major components in a computing system: two multi-core processors (P0, P1) with memory, an I/O bus, network adapters and network switches]

Hardware components: processing cores and memory sub-system, I/O bus, network adapter, network switch
Software components: communication software
Potential bottlenecks: processing, I/O bus, network

Supercomputing '09

Processing Bottlenecks in Traditional Protocols


Ex: TCP/IP, UDP/IP
Generic architecture for all network interfaces
Host handles almost all aspects of communication:
Data buffering (copies on the sender and receiver)
Data integrity (checksum); a sketch of this per-byte host work follows below
Routing aspects (IP routing)

Signaling between different layers:
Hardware interrupt whenever a packet arrives or is sent
Software signals between different layers to handle protocol processing at different priority levels
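As a rough illustration of the per-byte host processing a traditional stack performs, the minimal sketch below computes an RFC 1071 style Internet checksum entirely on the CPU. This is not code from the tutorial; the function name is illustrative.

```c
/* Minimal sketch of host-side per-byte protocol work (RFC 1071 style
 * Internet checksum). In a traditional TCP/IP stack the CPU touches
 * every byte of every packet for work like this, plus data copies. */
#include <stdint.h>
#include <stddef.h>

uint16_t inet_checksum(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t sum = 0;

    while (len > 1) {                 /* sum 16-bit words */
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len)                          /* odd trailing byte */
        sum += (uint32_t)p[0] << 8;

    while (sum >> 16)                 /* fold carries */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```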
Supercomputing '09
8

Bottlenecks in Traditional I/O Interfaces and Networks


Traditionally relied on bus-based technologies
E.g., PCI, PCI-X, shared Ethernet
One bit per wire
Performance increase through:
Increasing clock speed
Increasing bus width

Not scalable:
Cross talk between bits
Skew between wires
Signal integrity makes it difficult to increase bus width significantly, especially at high clock speeds
Supercomputing '09
9

InfiniBand (Infinite Bandwidth) and 10-Gigabit Ethernet


Industry Networking Standards Processing Bottleneck:
Hardware offloaded protocol stacks with user-level communication access

Network Bottleneck:
Bit serial differential signaling
Independent pairs of wires to transmit independent data (called a lane) Scalable to any number of lanes Easy to increase clock speed of lanes (since each lane consists only of a pair of wires)
Supercomputing '09
10

Interplay with I/O Technologies


InfiniBand initially intended to replace I/O bus technologies with networking-like technology
That is, bit serial differential signaling With enhancements in I/O technologies that use a similar architecture (HyperTransport, PCI Express), this has become mostly irrelevant now

Both IB and 10GE today come as network adapters that plug into existing I/O technologies

Supercomputing '09

11

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE, their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

12

Trends in I/O Interfaces with Servers


Network performance depends on:
Networking technology (adapter + switch)
Network interface (the last-mile bottleneck)

PCI: 1990; 33 MHz/32 bit: 1.05 Gbps (shared, bidirectional)
PCI-X: 1998 (v1.0), 2003 (v2.0); 133 MHz/64 bit: 8.5 Gbps, 266-533 MHz/64 bit: 17 Gbps (shared, bidirectional)
HyperTransport (HT) by AMD: 2001 (v1.0), 2004 (v2.0), 2006 (v3.0), 2008 (v3.1); 102.4 Gbps (v1.0), 179.2 Gbps (v2.0), 332.8 Gbps (v3.0), 409.6 Gbps (v3.1)
PCI-Express (PCIe) by Intel: 2003 (Gen1), 2007 (Gen2), 2009 (Gen3 standard); Gen1: 4X (8 Gbps), 8X (16 Gbps), 16X (32 Gbps); Gen2: 4X (16 Gbps), 8X (32 Gbps), 16X (64 Gbps); Gen3: 4X (~32 Gbps), 8X (~64 Gbps), 16X (~128 Gbps)
Intel QuickPath: 2009; 153.6-204.8 Gbps per link

Supercomputing '09

13

Growth in Commodity Network Technology


Representative commodity networks; their entries into the market
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
ATM (1995 - ): 155/622/1024 Mbit/sec
Myrinet (1993 - ): 1 Gbit/sec
Fibre Channel (1994 - ): 1 Gbit/sec
InfiniBand (2001 - ): 2 Gbit/sec (1X SDR)
10-Gigabit Ethernet (2001 - ): 10 Gbit/sec
InfiniBand (2003 - ): 8 Gbit/sec (4X SDR)
InfiniBand (2005 - ): 16 Gbit/sec (4X DDR), 24 Gbit/sec (12X SDR)
InfiniBand (2007 - ): 32 Gbit/sec (4X QDR)
InfiniBand (2011 - ): 64 Gbit/sec (4X EDR)
A 16X increase in the last 9 years
Supercomputing '09
14


Capabilities of High-Performance Networks


Intelligent Network Interface Cards
Support entire protocol processing completely in hardware (hardware protocol offload engines)
Provide a rich communication interface to applications:
User-level communication capability
Gets rid of intermediate data buffering requirements

No software signaling between communication layers:
All layers are implemented on a dedicated hardware unit, not on the shared host CPU
Supercomputing '09
15

Previous High-Performance Network Stacks


Virtual Interface Architecture
Standardized by Intel, Compaq, Microsoft

Fast Messages (FM)


Developed by UIUC

Myricom GM
Proprietary protocol stack from Myricom

These network stacks set the trend for high-performance communication requirements:
Hardware-offloaded protocol stack
Support for fast and secure user-level access to the protocol stack
Supercomputing '09
16

IB Trade Association
The IB Trade Association was formed by seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
Goal: to design a scalable and high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
Many other industry partners participated in the effort to define the IB architecture specification
The IB Architecture (Volume 1, Version 1.0) was released to the public on Oct 24, 2000; the latest version, 1.2.1, was released in January 2008
http://www.infinibandta.org
Supercomputing '09
17

IB Hardware Acceleration
Some IB models have multiple hardware accelerators
E.g., Mellanox IB adapters

Protocol Offload Engines


Completely implement layers 2-4 in hardware

Additional hardware supported features also present


RDMA, Multicast, QoS, Fault Tolerance, and many more

Supercomputing '09

18

10-Gigabit Ethernet Consortium


The 10GE Alliance was formed by several industry leaders to take the Ethernet family to the next speed step
Goal: to achieve a scalable and high-performance communication architecture while maintaining backward compatibility with Ethernet
http://www.ethernetalliance.org
Upcoming 40-Gbps (servers) and 100-Gbps Ethernet (backbones, switches, routers): IEEE 802.3 WG
Energy-efficient and power-conscious protocols:
On-the-fly link speed reduction for under-utilized links
Supercomputing '09
19

Ethernet Hardware Acceleration


Interrupt Coalescing
Improves throughput, but degrades latency

Jumbo Frames
No latency impact; incompatible with existing switches

Hardware Checksum Engines
Checksum performed in hardware is significantly faster
Shown to have minimal benefit on its own

Segmentation Offload Engines
Supported by most 10GE products; because it maintains backward compatibility, it is considered regular Ethernet
Heavily used in the server-on-steroids model
Supercomputing '09
20

TOE and iWARP Accelerators


TCP Offload Engines (TOE)
Hardware acceleration for the entire TCP/IP stack
Initially patented by Tehuti Networks
Strictly refers to the IC on the network adapter that implements TCP/IP; in practice, usually refers to the entire network adapter

Internet Wide-Area RDMA Protocol (iWARP)


Standardized by the IETF and the RDMA Consortium
Supports acceleration features (like IB) for Ethernet

http://www.ietf.org & http://www.rdmaconsortium.org


Supercomputing '09
21

Converged Enhanced Ethernet


Popularly known as Datacenter Ethernet Combines a number of Ethernet (optional) standards into one umbrella; sample enhancements include:
Priority-based Flow Control: link-level flow control for each Class of Service (CoS)
Enhanced Transmission Selection: bandwidth assignment to each CoS
Datacenter Bridging Exchange Protocols: congestion notification, priority classes
End-to-end Congestion Notification: per-flow end-to-end congestion control to supplement per-link flow control
Supercomputing '09
22

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

23

IB, 10GE and their Convergence


InfiniBand Architecture and Basic Hardware Components Novel Features IB Verbs Interface Management and Services 10-Gigabit Ethernet Family Architecture and Components Existing Implementations of 10GE/iWARP InfiniBand/Ethernet Convergence Technologies Virtual Protocol Interconnect InfiniBand over Ethernet RDMA over Converged Enhanced Ethernet
Supercomputing '09
24

IB Overview
InfiniBand
Architecture and Basic Hardware Components
Communication Model and Semantics:
Memory registration and protection
Channel and memory semantics

Novel Features:
Hardware Protocol Offload
Link, network and transport layer features

Management and Services:
Subnet Management
Hardware support for scalable network management
Supercomputing '09
25

A Typical IB Network

Three primary components


Channel Adapters Switches/Routers Links and connectors

Supercomputing '09

26

Components: Channel Adapters


Used by processing and I/O units to connect to fabric Consume & generate IB packets Programmable DMA engines with protection features May have multiple ports
Independent buffering channeled through Virtual Lanes

Host Channel Adapters (HCAs)

Supercomputing '09

27

Components: Switches and Routers

Relay packets from a link to another Switches: intra-subnet Routers: inter-subnet May support multicast
Supercomputing '09
28

Components: Links & Repeaters


Network Links
Copper, Optical, Printed Circuit wiring on Back Plane Not directly addressable

Traditional adapters built for copper cabling


Restricted by cable length (signal integrity)

Intel Connects: Optical cables with Copper-to-optical conversion hubs (acquired by Emcore)
Up to 100m length 550 picoseconds copper-to-optical conversion latency Available from other vendors (Luxtera) Repeaters (Vol. 2 of InfiniBand specification)
Supercomputing '09
29

(Courtesy Intel)

IB Overview
InfiniBand
Architecture and Basic Hardware Components
Communication Model and Semantics:
Memory registration and protection
Channel and memory semantics

Novel Features:
Hardware Protocol Offload
Link, network and transport layer features

Management and Services:
Subnet Management
Hardware support for scalable network management
Supercomputing '09
30

InfiniBand Communication Model


Basic InfiniBand Communication Semantics

Supercomputing '09

31

Queue Pair Model


Communication in InfiniBand uses a Queue Pair (QP) model for all data transfer
Each QP has two queues: a Send Queue (SQ) and a Receive Queue (RQ)
A QP must be linked to a Completion Queue (CQ), which gives notification of operation completion from QPs
InfiniBand Device

QP
Send Recv

CQ

Supercomputing '09

32

Queue Pair Model: WQEs and CQEs


Entries used for QP communication are data structures called Work Queue Requests (WQEs)
Called Wookies
QP
Send Recv

CQ

WQEs

CQEs

Completed WQEs are placed in the CQ with additional information They are now called CQEs (Cookies)
InfiniBand Device

Supercomputing '09

33

WQEs and CQEs


Send WQEs contain data about which buffer to send from, how much to send, etc.
Receive WQEs contain data about which buffer to receive into, how much to receive, etc.
CQEs contain data about which QP the completed WQE was posted on and how much data actually arrived (see the sketch below)
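The minimal sketch below shows what posting a send WQE and reaping its CQE looks like with the libibverbs API. It assumes the QP, CQ and memory region were created and connected elsewhere; the function and buffer names are illustrative, and error handling is trimmed.

```c
/* Sketch: post one send WQE and busy-poll the CQ for its CQE. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *buf, size_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,            /* local key from registration */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,               /* echoed back in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,     /* channel (send/recv) semantics */
        .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;
    struct ibv_wc wc;                  /* the CQE ("cookie") */
    int n;

    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;
    do {                               /* busy-poll for the completion */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```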
34

Supercomputing '09

Memory Registration

Before we do any communication, all memory used for communication must be registered:
1. Registration request: the process sends the virtual address and length to the kernel
2. The kernel handles the virtual-to-physical mapping and pins the region into physical memory; a process cannot map memory that it does not own (security!)
3. The HCA caches the virtual-to-physical mapping and issues a handle, which includes an l_key and an r_key
4. The handle is returned to the application (see the sketch below)
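A minimal sketch of this registration step with libibverbs follows. It assumes a protection domain has already been allocated; the function name is illustrative and error handling is minimal.

```c
/* Sketch: registering (pinning) a buffer before communication. */
#include <infiniband/verbs.h>
#include <stdlib.h>
#include <stdio.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);
    if (!buf)
        return NULL;

    /* The kernel pins the pages; the HCA caches the virtual->physical
     * mapping and hands back a handle containing lkey and rkey. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);
    return mr;
}
```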


Supercomputing '09
35

Memory Protection

For security, keys are required for all operations that touch buffers
To send or receive data, the l_key must be provided to the HCA; the HCA verifies access to local memory
For RDMA, the initiator must have the r_key for the remote virtual address
The r_key is possibly exchanged with an earlier send/recv; it is not encrypted in IB


Supercomputing '09
36

Communication in the Channel Semantics (Send/Receive Model)


[Figure: sender and receiver, each with memory (send/receive buffer), processor, QP (send/recv queues), CQ and InfiniBand device; the devices exchange the data and a hardware ACK]

The processor is involved only to:
1. Post a receive WQE
2. Post a send WQE
3. Pull completed CQEs out of the CQ

The send WQE contains information about the send buffer
The receive WQE contains information about the receive buffer; incoming messages have to be matched to a receive WQE to know where to place the data
Supercomputing '09
37

Communication in the Memory Semantics (RDMA Model)


[Figure: initiator and target, each with memory, processor, QP and CQ; the InfiniBand devices transfer the data and a hardware ACK with no target-side processor involvement]

The initiator processor is involved only to:
1. Post a send WQE
2. Pull the completed CQE out of the send CQ
No involvement from the target processor

The send WQE contains information about both the send buffer and the receive buffer (a sketch of an RDMA write follows below)
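The minimal sketch below shows a one-sided RDMA write with libibverbs. It assumes the initiator already knows the target's virtual address and r_key (for example exchanged earlier over send/recv); the function and parameter names are illustrative.

```c
/* Sketch: one-sided RDMA write (memory semantics). */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                    void *local_buf, uint32_t len,
                    uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,     /* no receive WQE needed */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;    /* target buffer address */
    wr.wr.rdma.rkey        = remote_rkey;    /* permission to touch it */

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);  /* completion reaped from CQ */
}
```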
Supercomputing '09
38

IB Overview
InfiniBand
Architecture and Basic Hardware Components
Communication Model and Semantics:
Memory registration and protection
Channel and memory semantics

Novel Features:
Hardware Protocol Offload
Link, network and transport layer features

Management and Services:
Subnet Management
Hardware support for scalable network management
Supercomputing '09
39

Hardware Protocol Offload

Complete Hardware Implementations Exist

Supercomputing '09

40

Link Layer Capabilities


CRC-based Data Integrity
Buffering and Flow Control
Virtual Lanes, Service Levels and QoS
Switching and Multicast
IB WAN Capability

Supercomputing '09

41

CRC-based Data Integrity


Two forms of CRC to achieve both early error detection and end-to-end reliability
Invariant CRC (ICRC) covers fields that do not change per link (per network hop)
E.g., routing headers (if there are no routers), transport headers, data payload
32-bit CRC (compatible with the Ethernet CRC)
Provides end-to-end reliability (does not include the I/O bus)

Variant CRC (VCRC) covers everything

Erroneous packets do not have to reach the destination before being discarded
Provides early error detection
Supercomputing '09
42

Buffering and Flow Control


IB provides absolute credit-based flow control:
The receiver guarantees that it has enough space allotted for N blocks of data
Available credits are occasionally updated by the receiver

Credits have no relation to the number of messages, only to the total amount of data being sent:
One 1 MB message is equivalent to 1024 1 KB messages (except for rounding off at message boundaries); a toy sketch follows below
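The toy sketch below models this absolute, credit-based accounting: the sender tracks blocks consumed against the receiver's advertised credit total. The structure and function names are illustrative, not part of the IB specification.

```c
/* Toy sketch of IB-style absolute, credit-based link flow control:
 * the sender may transmit only while the receiver has advertised
 * buffer space, counted in data blocks rather than messages. */
#include <stdint.h>
#include <stdbool.h>

struct link_fc {
    uint64_t credits_advertised;  /* total blocks receiver has granted */
    uint64_t blocks_sent;         /* total blocks sender has consumed  */
};

/* Receiver occasionally re-advertises its absolute credit limit. */
void receiver_update(struct link_fc *fc, uint64_t freed_blocks)
{
    fc->credits_advertised += freed_blocks;
}

/* Sender checks credits (data amount, not message count) before sending. */
bool sender_can_send(const struct link_fc *fc, uint64_t msg_blocks)
{
    return fc->blocks_sent + msg_blocks <= fc->credits_advertised;
}
```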

Supercomputing '09

43

Link Layer Capabilities


CRC-based Data Integrity
Buffering and Flow Control
Virtual Lanes, Service Levels and QoS
Switching and Multicast
IB WAN Capability

Supercomputing '09

44

Virtual Lanes
Multiple virtual links within same physical link
Between 2 and 16

Separate buffers and flow control per VL

Avoids Head-of-Line Blocking

VL15 is reserved for management
Each port supports one or more data VLs
Supercomputing '09
45

Service Levels and QoS


Service Level (SL):
Packets may operate at one of 16 different SLs Meaning not defined by IB

SL-to-VL mapping:
The SL determines which VL on the next link is to be used
Each port (switches, routers, end nodes) has an SL-to-VL mapping table configured by subnet management (see the sketch after this list)

Partitions:
Fabric administration (through Subnet Manager) may assign specific SLs to different partitions to isolate traffic flows
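The toy sketch below shows the shape of such a per-port SL-to-VL table lookup. The type and function names are illustrative; the real table is programmed by the subnet manager through management datagrams.

```c
/* Toy sketch of per-port SL-to-VL mapping: a packet's 4-bit Service
 * Level indexes a 16-entry table that selects the Virtual Lane used
 * on the next link. */
#include <stdint.h>

#define NUM_SLS 16

struct ib_port_cfg {
    uint8_t sl2vl[NUM_SLS];          /* configured by subnet management */
};

uint8_t select_vl(const struct ib_port_cfg *port, uint8_t sl)
{
    return port->sl2vl[sl & 0x0f];   /* SL is a 4-bit field */
}
```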
46

Supercomputing '09

Traffic Segregation Benefits


Segregation of server, network and storage traffic on the same physical network (IPC, load balancing, web caches, ASP)

[Figure: servers attached to an InfiniBand fabric; virtual lanes carry traffic to an IP network (routers, switches, VPNs, DSLAMs) and a storage area network (RAID, NAS, backup) over the same InfiniBand network]

InfiniBand Virtual Lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link, providing the benefits of independent, separate networks while eliminating the cost and difficulties associated with maintaining two or more networks

Courtesy: Mellanox Technologies

Supercomputing '09

Switching (Layer-2 Routing) and Multicast


Each port has one or more associated LIDs (Local Identifiers)
Switches look up which port to forward a packet to based on its destination LID (DLID); this information is maintained at the switch

For multicast packets, the switch needs to maintain multiple output ports to forward the packet to:
The packet is replicated to each appropriate output port
Ensures at-most-once delivery and loop-free forwarding
There is an interface for a group management protocol:
Create, join/leave, prune, delete group
Supercomputing '09
48

Destination-based Switching/Routing

[Figure: an example IB switch block diagram (Mellanox 144-port) with spine blocks and leaf blocks]

Switching: IB supports Virtual Cut-Through (VCT)

Routing: unspecified by the IB specification
Up*/Down* and Shift are popular routing engines supported by OFED

Fat-tree is a popular topology for IB clusters
Different over-subscription ratios may be used
Supercomputing '09
49

IB Switching/Routing: An Example

[Figure: a small two-level fat-tree with spine and leaf blocks; end ports P1 and P2 with LIDs 2 and 4, and an example forwarding table mapping DLIDs to output ports]

Someone has to set up these tables and give every port an LID
The Subnet Manager does this work (more discussion on this later)
Different routing algorithms may give different paths (a toy lookup sketch follows below)
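The toy sketch below shows the shape of the destination-based lookup an IB switch performs: a linear forwarding table, filled in by the subnet manager, indexed by the packet's DLID. Names and sizes are illustrative.

```c
/* Toy sketch of DLID-based forwarding inside an IB switch. */
#include <stdint.h>

#define LFT_SIZE 49152            /* illustrative cap on unicast LIDs */

struct ib_switch {
    uint8_t lft[LFT_SIZE];        /* DLID -> output port, set by the SM */
};

uint8_t forward(const struct ib_switch *sw, uint16_t dlid)
{
    return sw->lft[dlid];         /* one table lookup per packet */
}
```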


Supercomputing '09
50

IB Multicast Example

Supercomputing '09

51

IB WAN Capability
Getting increased attention for:
Remote Storage, Remote Visualization Cluster Aggregation (Cluster-of-clusters)

IB-Optical switches by multiple vendors:
Obsidian Research Corporation: www.obsidianresearch.com
Network Equipment Technology (NET): www.net.com
Layer 1 changes from copper to optical; everything else stays the same
Low-latency copper-optical-copper conversion

Large link-level buffers for flow control:
Data messages do not have to wait for round-trip hops
Important in the wide-area network
Supercomputing '09
52

Hardware Protocol Offload

Complete Hardware Implementations Exist

Supercomputing '09

53

IB Network Layer Capabilities


Most capabilities are similar to those of the link layer, but as applied to IB routers:
Routers can send packets across subnets (subnets are management domains, not administrative domains)
Subnet management packets are consumed by routers, not forwarded to the next subnet

Several additional features as well


E.g., routing and flow labels

Supercomputing '09

54

Routing and Flow Labels


Routing follows the IPv6 packet format
Easy interoperability with wide-area translations
The link layer might still need to be translated to the appropriate layer-2 protocol (e.g., Ethernet, SONET)

Flow labels allow routers to specify which packets belong to the same connection:
Switches can optimize communication by sending packets with the same label in order
Flow labels can change in the router, but all packets belonging to one label change together
Supercomputing '09
55

Hardware Protocol Offload

Complete Hardware Implementations Exist

Supercomputing '09

56

IB Transport Services
Service Type          | Connection Oriented | Acknowledged | Transport
Reliable Connection   | Yes                 | Yes          | IBA
Unreliable Connection | Yes                 | No           | IBA
Reliable Datagram     | No                  | Yes          | IBA
Unreliable Datagram   | No                  | No           | IBA
RAW Datagram          | No                  | No           | Raw

Each transport service can have zero or more QPs associated with it
e.g., you can have 4 QPs based on RC and one based on UD
Supercomputing '09
57

Trade-offs in Different Transport Types

Supercomputing '09

58

Shared Receive Queue (SRQ)


[Figure: without SRQ, each of a process's n-1 connections has its own RQ with m receive buffers; with SRQ, all connections share one SRQ with p buffers]

An SRQ is a hardware mechanism for a process to share receive resources (memory) across multiple connections
Introduced in specification v1.2
Buffer requirement drops from m*(n-1) to p, where 0 < p << m*(n-1) (see the creation sketch below)
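The minimal sketch below shows SRQ creation with libibverbs. It assumes an existing protection domain; the limits and function name are illustrative.

```c
/* Sketch: creating a Shared Receive Queue so many connections draw
 * receive WQEs from one shared pool. */
#include <infiniband/verbs.h>

struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr = {
        .attr = {
            .max_wr  = 4096,   /* receive WQEs shared by all QPs */
            .max_sge = 1,
        },
    };
    /* Later, pass this SRQ in ibv_qp_init_attr.srq when creating each
     * QP so they all share the same receive resources. */
    return ibv_create_srq(pd, &attr);
}
```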


Supercomputing '09
59

eXtended Reliable Connection (XRC)


M = # of nodes, N = # of processes/node
RC connections: one QP per remote process, roughly (M-1)*N^2 per node
XRC connections: one QP per remote node, (M-1)*N per node

Each QP takes at least one page of memory

Connections between all processes are very costly for RC

New IB transport added: eXtended Reliable Connection
Allows connections between nodes instead of processes (a toy count follows below)
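The toy sketch below works through this connection-count arithmetic for an example cluster, assuming intra-node traffic goes over shared memory (as MPI libraries typically arrange). The cluster sizes are illustrative.

```c
/* Toy sketch of the per-node QP counts behind XRC, for M nodes with
 * N processes per node. */
#include <stdio.h>

int main(void)
{
    long M = 64, N = 8;                      /* illustrative cluster size */
    long rc_per_node  = (M - 1) * N * N;     /* one QP per remote process */
    long xrc_per_node = (M - 1) * N;         /* one QP per remote node    */
    printf("per node: RC=%ld QPs, XRC=%ld QPs\n",
           rc_per_node, xrc_per_node);
    return 0;
}
```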
Supercomputing '09
60

IB Overview
InfiniBand
Architecture and Basic Hardware Components
Communication Model and Semantics:
Memory registration and protection
Channel and memory semantics

Novel Features:
Hardware Protocol Offload
Link, network and transport layer features

Management and Services:
Subnet Management
Hardware support for scalable network management
Supercomputing '09
61

Concepts in IB Management
Agents
Processes or hardware units running on each adapter, switch, router (everything on the network) Provide capability to query and set parameters

Managers
Make high-level decisions and implement it on the network fabric using the agents

Messaging schemes
Used for interactions between the manager and agents (or between agents)

Messages
Supercomputing '09
62

Subnet Manager

[Figure: a subnet of compute nodes and switches; the Subnet Manager activates inactive links and performs multicast setup on switches in response to multicast joins from compute nodes]

Supercomputing '09

63

10GE Overview
10-Gigabit Ethernet Family
Architecture and Components
Stack Layout
Out-of-Order Data Placement
Dynamic and Fine-grained Data Rate Control

Existing Implementations of 10GE/iWARP

Supercomputing '09

64

IB and 10GE: Commonalities and Differences


Feature               | IB                        | iWARP/10GE
Hardware Acceleration | Supported                 | Supported (for TOE and iWARP)
RDMA                  | Supported                 | Supported (for iWARP)
Atomic Operations     | Supported                 | Not supported
Multicast             | Supported                 | Supported
Data Placement        | Ordered                   | Out-of-order (for iWARP)
Data Rate Control     | Static and Coarse-grained | Dynamic and Fine-grained (for TOE and iWARP)
QoS                   | Prioritization            | Prioritization and Fixed Bandwidth QoS

Supercomputing '09

65

iWARP Architecture and Components


iWARP Offload Engines

[Figure: iWARP stack on the network adapter: RDMAP over RDDP over MPA over TCP (or SCTP) over IP, beneath the user application or library and the device driver, on a 10GigE network adapter]

RDMA Protocol (RDMAP): feature-rich interface, security management
Remote Direct Data Placement (RDDP): data placement and delivery, multi-stream semantics, connection management
Marker PDU Aligned (MPA): middle-box fragmentation, data integrity (CRC)
Supercomputing '09
66

(Courtesy iWARP Specification)

Decoupled Data Placement and Data Delivery


Place data as it arrives, whether in order or out of order
If data is out of order, place it at the appropriate offset
Issues from the application's perspective:
The second half of a message having been placed does not mean that the first half of the message has arrived as well
If one message has been placed, it does not mean that the previous messages have been placed

Supercomputing '09

67

Protocol Stack Issues with Out-of-Order Data Placement


The receiver network stack has to understand each frame of data
If the frame is unchanged during transmission, this is easy!

Issues to consider:
Can we guarantee that the frame will be unchanged? What happens when intermediate switches segment data?

Supercomputing '09

68

Switch Splicing
[Figure: a switch splicing one frame into multiple segments and coalescing multiple segments into one]

Intermediate Ethernet switches (e.g., those which support splicing) can segment a frame to multiple segments or coalesce multiple segments to a single segment

Supercomputing '09

69

Marker PDU Aligned (MPA) Protocol


A deterministic approach to finding segment boundaries:
Places marker strips at regular intervals (based on the TCP data sequence number)
The interval is set to 512 bytes (small enough to ensure that each Ethernet frame has at least one)
Minimum IP packet size is 536 bytes (RFC 879)

Each strip points to the RDDP header

Each segment independently has enough information about where it needs to be placed
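The toy sketch below illustrates the marker arithmetic: markers sit every 512 bytes of the TCP sequence space and point back to the start of the current RDDP/FPDU, so any segment can be placed independently. It is a simplification of the MPA specification; the function names are illustrative.

```c
/* Toy sketch of MPA marker placement arithmetic. */
#include <stdint.h>

#define MPA_MARKER_INTERVAL 512u

/* Offset (from connection start) of the next marker at or after the
 * given stream offset. */
uint32_t next_marker_offset(uint32_t stream_offset)
{
    return (stream_offset + MPA_MARKER_INTERVAL - 1)
           / MPA_MARKER_INTERVAL * MPA_MARKER_INTERVAL;
}

/* Value a marker carries: how far back the enclosing RDDP/FPDU header
 * starts relative to the marker position. */
uint16_t marker_back_pointer(uint32_t marker_offset,
                             uint32_t fpdu_start_offset)
{
    return (uint16_t)(marker_offset - fpdu_start_offset);
}
```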
Supercomputing '09
70

MPA Frame Format

[Figure: MPA frame format: segment length, RDDP header, data payload (if any), pad, CRC]

Supercomputing '09

71

Dynamic and Fine-grained Rate Control


Part of the Ethernet standard, not iWARP
Network vendors use a separate interface to support it

Dynamic bandwidth allocation to flows based on interval between two packets in a flow
E.g., one stall for every packet sent on a 10 Gbps network corresponds to a bandwidth allocation of 5 Gbps (see the arithmetic sketch below)
Complicated because of TCP windowing behavior

Important for high-latency/high-bandwidth networks


Large windows exposed on the receiver side
Receiver overflow controlled through rate control
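The runnable sketch below reproduces the stall arithmetic from the example above: inserting stall slots between packets reduces the effective rate proportionally. The function name is illustrative.

```c
/* Toy sketch of packet-pacing arithmetic: each packet occupies one slot
 * and each stall wastes one slot, so one stall per packet on a 10 Gbps
 * link yields an effective allocation of 5 Gbps. */
#include <stdio.h>

static double effective_gbps(double link_gbps, unsigned stalls_per_packet)
{
    return link_gbps / (1.0 + stalls_per_packet);
}

int main(void)
{
    printf("%.1f Gbps\n", effective_gbps(10.0, 0));  /* 10.0 Gbps */
    printf("%.1f Gbps\n", effective_gbps(10.0, 1));  /*  5.0 Gbps */
    printf("%.1f Gbps\n", effective_gbps(10.0, 3));  /*  2.5 Gbps */
    return 0;
}
```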
Supercomputing '09
72

Prioritization vs. Fixed Bandwidth QoS


Can allow for simple prioritization:
E.g., connection 1 performs better than connection 2 8 classes provided (a connection can be in any class)
Similar to SLs in InfiniBand

Two priority classes for high-priority traffic


E.g., management traffic or your favorite application

Or can allow for specific bandwidth requests:


E.g., can request 3.62 Gbps of bandwidth
Packet pacing and stalls are used to achieve this

Query functionality to find out the remaining bandwidth


Supercomputing '09
73

10GE Overview
10-Gigabit Ethernet Family
Architecture and Components
Stack Layout
Out-of-Order Data Placement
Dynamic and Fine-grained Data Rate Control

Existing Implementations of 10GE/iWARP

Supercomputing '09

74

Current Usage of Ethernet

[Figure: regular Ethernet adapters and TOEs used within system area network / cluster environments; regular Ethernet and iWARP clusters connected over a wide area network in distributed cluster environments]


Supercomputing '09
75

Software iWARP-based Compatibility


Regular Ethernet adapters and TOEs are fully compatible with each other; compatibility with iWARP is also required

Software iWARP emulates the functionality of iWARP on the host:
Fully compatible with hardware iWARP
Internally utilizes the host TCP/IP stack

[Figure: an Ethernet environment mixing regular Ethernet adapters, TOEs and iWARP adapters, with software iWARP providing compatibility]

Supercomputing '09

76

Different iWARP Implementations


[Figure: iWARP implementation options.
Regular Ethernet adapters: software iWARP (OSU, OSC, ANL), either user-level iWARP above sockets/TCP/IP or kernel-level iWARP inside the host stack (TCP modified with MPA).
TCP Offload Engines: sockets over host IP with TCP offloaded to the network adapter.
iWARP-compliant adapters (Chelsio, NetEffect, Ammasso): applications and high-performance sockets over iWARP, TCP and IP offloaded onto the network adapter]

Supercomputing '09

77

IB and 10GE Convergence


InfiniBand/Ethernet Convergence Technologies
Virtual Protocol Interconnect InfiniBand over Ethernet RDMA over Converged Enhanced Ethernet

Supercomputing '09

78

Virtual Protocol Interconnect (VPI)


Applications

Single network firmware supports both IB and Ethernet
Autosensing of the layer-2 protocol:
Can be configured to automatically work with either IB or Ethernet networks

[Figure: VPI hardware exposes two parallel stacks beneath the applications: IB verbs over the IB transport, network and link layers to an IB port, and sockets over TCP/IP over the Ethernet link layer to an Ethernet port]

TCP/IP support

Multi-port adapters can use one port for IB and another for Ethernet
Multiple use modes:
Datacenters with IB inside the cluster and Ethernet outside
Clusters with an IB network and Ethernet management

Supercomputing '09

79

(InfiniBand) RDMA over Ethernet (IBoE)

[Figure: IBoE stack: application and IB verbs over the IB transport and IB network layers in hardware, over an Ethernet link layer]

Native convergence of the IB network and transport layers with the Ethernet link layer
IB packets are encapsulated in Ethernet frames
The IB network layer already uses IPv6 frames

Pros:
Works natively in Ethernet environments
Has all the benefits of IB verbs

Cons:
Network bandwidth limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps
Some native IB link-layer features are optional in (regular) Ethernet
Supercomputing '09
80

(InfiniBand) RDMA over Converged Enhanced Ethernet (RoCEE)

[Figure: RoCEE stack: application and IB verbs over the IB transport and IB network layers in hardware, over a Converged Enhanced Ethernet link layer]

Very similar to IB over Ethernet:
Often used interchangeably with IBoE
Can be used to explicitly specify that the link layer is Converged Enhanced Ethernet (CEE)

Pros:
Works natively in Ethernet environments
Has all the benefits of IB verbs
CEE is very similar to the link layer of native IB, so there are no missing features

Cons:
Network bandwidth limited to Ethernet switches (currently 10 Gbps), even though IB can provide 32 Gbps
Supercomputing '09
81

IB and 10GE: Feature Comparison


Feature                | IB                | iWARP/10GE   | IBoE     | RoCEE
Hardware Acceleration  | Yes               | Yes          | Yes      | Yes
RDMA                   | Yes               | Yes          | Yes      | Yes
Atomic Operations      | Yes               | No           | Yes      | Yes
Multicast              | Optional          | No           | Optional | Optional
Data Placement         | Ordered           | Out-of-order | Ordered  | Ordered
Prioritization         | Optional          | Optional     | Optional | Yes
Fixed BW QoS (ETS)     | No                | Optional     | Optional | Yes
Ethernet Compatibility | No                | Yes          | Yes      | Yes
TCP/IP Compatibility   | Yes (using IPoIB) | Yes          | No       | No

Supercomputing '09
82

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE, their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

83

IB Hardware Products
Many IB vendors: Mellanox, Voltaire and Qlogic
Aligned with many server vendors: Intel, IBM, SUN, Dell
And many integrators: Appro, Advanced Clustering, Microway,

Broadly two kinds of adapters


Offloading (Mellanox) and Onloading (Qlogic)

Adapters with different interfaces:


Dual-port 4X with PCI-X (64 bit/133 MHz), PCIe x8, PCIe 2.0 and HT

MemFree Adapter
No memory on HCA Uses System memory (through PCIe) Good for LOM designs (Tyan S2935, Supermicro 6015T-INFB)

Different speeds
SDR (8 Gbps), DDR (16 Gbps) and QDR (32 Gbps)

Some 12X SDR adapters exist as well (24 Gbps each way)
Supercomputing '09
84

Tyan Thunder S2935 Board

(Courtesy Tyan)

Similar boards from Supermicro with LOM features are also available
Supercomputing '09
85

IB Hardware Products (contd.)


Customized adapters to work with IB switches: Cray XD1 (formerly by Octigabay), Cray CX1
Switches: 4X SDR and DDR (8-288 ports); 12X SDR (small sizes)
3456-port Magnum switch from SUN
72-port nano magnum

used at TACC

36-port Mellanox InfiniScale IV QDR switch silicon in early 2008


Up to 648-port QDR switch by SUN

New IB switch silicon from Qlogic introduced at SC '08
Up to 846-port QDR switch by Qlogic
Switch routers with gateways: IB-to-FC, IB-to-IP
Supercomputing '09
86

IB Software Products
Low-level software stacks
VAPI (Verbs-Level API) from Mellanox Modified and customized VAPI from other vendors New initiative: Open Fabrics (formerly OpenIB)
http://www.openfabrics.org Open-source code available with Linux distributions Initially IB; later extended to incorporate iWARP

High-level software stacks


MPI, SDP, IPoIB, SRP, iSER, DAPL, NFS, PVFS on various stacks (primarily VAPI and OpenFabrics)
Supercomputing '09
87

10G, 40G and 100G Ethernet Products


10GE adapters: Intel, Myricom, Mellanox (ConnectX)
10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
10GE switches: Fulcrum Microsystems
Low-latency switch based on 24-port silicon
FM4000 switch with IP routing and TCP/UDP support

Fujitsu, Woven Systems (144 ports), Myricom (512 ports), Quadrics (96 ports), Force10, Cisco, Arista (formerly Arastra)
40GE and 100GE switches: Nortel Networks
10GE downlinks with 40GE and 100GE uplinks
Supercomputing '09
88

Mellanox ConnectX Architecture


Early adapter supporting IB/10GE convergence
Support for VPI and IBoE

Includes other features as well:
Hardware support for virtualization
Quality of Service
Stateless offloads

(Courtesy Mellanox)
89

Supercomputing '09

OpenFabrics
Open source organization (formerly OpenIB)
www.openfabrics.org

Incorporates both IB and iWARP in a unified manner


Support for Linux and Windows
Design of a complete stack with 'best of breed' components
Gen1 and Gen2 (current focus)

Users can download the entire stack and run


The latest release is OFED 1.4.2; OFED 1.5 is underway (a device-enumeration sketch follows below)
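The runnable sketch below shows how the unified libibverbs interface in OpenFabrics enumerates whatever RDMA devices are present (IB HCAs or iWARP R-NICs) through the same calls; vendor differences live in the provider libraries.

```c
/* Sketch: enumerating RDMA devices through the unified verbs API. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list)
        return 1;

    for (int i = 0; i < num; i++)
        printf("device %d: %s\n", i, ibv_get_device_name(list[i]));

    ibv_free_device_list(list);
    return 0;
}
```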
90

Supercomputing '09

OpenFabrics Stack with Unified Verbs Interface


Verbs Interface (libibverbs)
User-level provider libraries: Mellanox (libmthca), QLogic (libipathverbs), IBM (libehca), Chelsio (libcxgb3)
Kernel-level drivers: Mellanox (ib_mthca), QLogic (ib_ipath), IBM (ib_ehca), Chelsio (ib_cxgb3)
Hardware: Mellanox, QLogic, IBM and Chelsio adapters

Supercomputing '09

91

OpenFabrics on Convergent IB/10GE


[Figure: Verbs Interface (libibverbs) over the ConnectX user-level library (libmlx4) and kernel driver (ib_mlx4), on ConnectX adapters with both IB and 10GE ports]

For IBoE and RoCEE, the upper-level stacks remain completely unchanged
Within the hardware:
The transport and network layers remain completely unchanged
Both IB and Ethernet (or CEE) link layers are supported on the network adapter

Note: the OpenFabrics stack is not valid for the Ethernet path in VPI; that path still uses sockets and TCP/IP
Supercomputing '09
92

OpenFabrics Software Stack


[Figure: OpenFabrics software stack. Application level: diagnostic tools, OpenSM, IP-based applications, sockets-based access, various MPIs, block storage access, clustered DB access, access to file systems. User space: user-level MAD API, uDAPL, OpenFabrics user-level verbs/API for InfiniBand and iWARP R-NICs (kernel bypass). Kernel space: upper-layer protocols (IPoIB, SDP, SRP, iSER, RDS, NFS-RDMA RPC, cluster file systems), mid-layer (connection manager abstraction, SA/MAD clients, SMA, connection managers), OpenFabrics kernel-level verbs/API, hardware-specific drivers. Hardware: InfiniBand HCAs and iWARP R-NICs.

Key: SA = Subnet Administrator; MAD = Management Datagram; SMA = Subnet Manager Agent; PMA = Performance Manager Agent; IPoIB = IP over InfiniBand; SDP = Sockets Direct Protocol; SRP = SCSI RDMA Protocol (initiator); iSER = iSCSI RDMA Protocol (initiator); RDS = Reliable Datagram Service; UDAPL = User Direct Access Programming Library; HCA = Host Channel Adapter; R-NIC = RDMA NIC]

Supercomputing '09

93

InfiniBand in the Top500


[Figure: number of systems and aggregate performance of InfiniBand systems in the Top500 over time]

The percentage share of InfiniBand is steadily increasing


94

Supercomputing '09

Large-scale InfiniBand Installations


151 IB clusters (30.2%) in the June 09 TOP500 list (www.top500.org) Installations in the Top 30 (15 of them):
129,600 cores (RoadRunner) at LANL (1st)
51,200 cores (Pleiades) at NASA Ames (4th)
62,976 cores (Ranger) at TACC (8th)
26,304 cores (Juropa) at FZ Jülich, Germany (10th)
30,720 cores (Dawning) at Shanghai (15th)
14,336 cores at New Mexico (17th)
14,384 cores at Tata CRL, India (18th)
18,224 cores at LLNL (19th)
12,288 cores at GENCI-CINES, France (20th)
8,320 cores in UK (25th)
8,320 cores in UK (26th)
8,064 cores (DKRZ) in Germany (27th)
12,032 cores at JAXA, Japan (28th)
10,240 cores at TEP, France (29th)
13,728 cores in Sweden (30th)
More are getting installed!

Supercomputing '09

95

10GE Installations
Several Enterprise Computing Domains
Enterprise datacenters (HP, Intel) and financial markets
Animation firms (e.g., Universal Studios created The Hulk and many new movies using 10GE)

Scientific Computing Installations


5,600-core installation at Purdue with Chelsio/iWARP
640-core installation at the University of Heidelberg, Germany
512-core installation at Sandia National Laboratory (SNL) with Chelsio/iWARP and a Woven Systems switch
256-core installation at Argonne National Lab with Myri-10G

Integrated Systems
BG/P uses 10GE for I/O (ranks 3, 7, 9, 14, 24 in the Top 25)
ESnet to install a $62M 100GE infrastructure for US DOE
Supercomputing '09
96

Dual IB/10GE Systems


Such systems are being integrated
E.g., the T2K-Tsukuba system (300 TFlop system); T2K systems are at three sites (Tsukuba, Tokyo, Kyoto)

(Courtesy Taisuke Boku, University of Tsukuba)

Internal connectivity: Quad-rail IB ConnectX network External connectivity: 10GE


Supercomputing '09
97

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE, their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

98

Case Studies
Low-level Performance
Message Passing Interface (MPI)
File Systems

Supercomputing '09

99

Low-level Latency Measurements


[Figure: small-message and large-message latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
Supercomputing '09
100

Low-level Uni- and Bi-directional Bandwidth Measurements


[Figure: uni-directional bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
Supercomputing '09
101

Case Studies
Low-level Performance
Message Passing Interface (MPI)
File Systems

Supercomputing '09

102

Message Passing Interface (MPI)


De-facto message passing standard
Point-to-point communication Collective communication (broadcast, multicast, reduction, barrier) MPI-1 and MPI-2 available; MPI-3 under discussion

Has been implemented for various past commodity networks (Myrinet, Quadrics)
How can it be designed and efficiently implemented for InfiniBand and iWARP? (a minimal MPI example follows below)
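The minimal, runnable sketch below shows MPI point-to-point communication at the application level; over IB or iWARP the library (for example MVAPICH2) typically maps such calls onto verbs send/recv or RDMA without any change to the application code.

```c
/* Minimal MPI point-to-point sketch: rank 0 sends an integer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Finalize();
    return 0;
}
```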

Supercomputing '09

103

MVAPICH/MVAPICH2 Software
High Performance MPI Library for IB and 10GE
MVAPICH (MPI-1) and MVAPICH2 (MPI-2) Used by more than 975 organizations in 51 countries More than 34,000 downloads from OSU site directly Empowering many TOP500 clusters
8th ranked 62,976-core cluster (Ranger) at TACC

Available with software stacks of many IB, 10GE and server vendors including Open Fabrics Enterprise Distribution (OFED) Also supports uDAPL device to work with any network supporting uDAPL http://mvapich.cse.ohio-state.edu/
Supercomputing '09
104

MPICH2 Software Stack


High-performance and Widely Portable MPI
Supports MPI-1, MPI-2 and MPI-2.1
Supports multiple networks (TCP, IB, iWARP, Myrinet)
Commercial support by many vendors:
IBM (integrated stack distributed by Argonne) Microsoft, Intel (in process of integrating their stack)

Used by many derivative implementations


E.g., MVAPICH2, IBM, Intel, Microsoft, SiCortex, Cray, Myricom MPICH2 and its derivatives support many Top500 systems (estimated at more than 90%)

Available with many software distributions Integrated with the ROMIO MPI-IO implementation and the MPE profiling library http://www.mcs.anl.gov/research/projects/mpich2
Supercomputing '09
105

One-way Latency: MPI over IB


[Figure: small- and large-message one-way MPI latency (us) vs. message size for MVAPICH with InfiniHost III DDR, Qlogic SDR, ConnectX DDR, ConnectX QDR PCIe2 and Qlogic DDR PCIe2 adapters; small-message latencies range from about 1.06 to 2.77 us]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch

ConnectX-QDR-PCIe2: 2.83 GHz Quad-core (Harpertown) Intel with back-to-back


106

Supercomputing '09

Bandwidth: MPI over IB


[Figure: unidirectional and bidirectional MPI bandwidth (million bytes/sec) vs. message size for the same five MVAPICH adapter configurations; peak unidirectional bandwidths of about 936.5, 1389.4, 1399.8, 1952.9 and 3022.1 MB/s, and peak bidirectional bandwidths of about 1519.8, 2457.4, 2718.3, 3621.4 and 5553.5 MB/s]

InfiniHost III and ConnectX-DDR: 2.33 GHz Quad-core (Clovertown) Intel with IB switch

ConnectX-QDR-PCIe2: 2.4 GHz Quad-core (Nehalem) Intel with IB switch


107

Supercomputing '09

One-way Latency: MPI over iWARP


[Figure: one-way MPI latency (us) vs. message size over Chelsio 10GE with the host TCP/IP stack vs. iWARP; small-message latency about 15.47 us for TCP/IP vs. 6.88 us for iWARP]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch


Supercomputing '09
108

Bandwidth: MPI over iWARP


[Figure: unidirectional and bidirectional MPI bandwidth (million bytes/sec) vs. message size over Chelsio 10GE with TCP/IP vs. iWARP; peak values shown include 839.8 and 1231.8 MB/s (unidirectional) and 855.3 and 2260.8 MB/s (bidirectional)]

2.0 GHz Quad-core Intel with 10GE (Fulcrum) Switch


Supercomputing '09
109

Convergent Technologies: MPI Latency


[Figure: small- and large-message MPI latency (us) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
Supercomputing '09
110

Convergent Technologies: MPI Uni- and Bi-directional Bandwidth


[Figure: uni- and bi-directional MPI bandwidth (MBps) vs. message size for Native IB, VPI-IB, VPI-Eth and IBoE]

ConnectX-DDR: 2.4 GHz Quad-core (Nehalem) Intel with IB and 10GE switches
Supercomputing '09
111


Case Studies
Low-level Performance
Message Passing Interface (MPI)
File Systems

Supercomputing '09

112

Sample Diagram of State-of-the-Art File Systems


Sample file systems:
Lustre, Panasas, GPFS, Sistina/Redhat GFS, PVFS, Google File System, Oracle Cluster File System (OCFS2)

[Figure: computing nodes connected over a network to metadata servers and I/O servers]
Supercomputing '09
113

Lustre Performance
[Figure: Lustre write and read throughput (MBps) with 4 OSSs vs. number of clients, comparing native IB and IPoIB transports]

Lustre over Native IB


Write: 1.38X faster than IPoIB; Read: 2.16X faster than IPoIB

Memory copies in IPoIB and Native IB


Reduced throughput and high overhead; I/O servers are saturated
Supercomputing '09
114

CPU Utilization
[Figure: CPU utilization (%) vs. number of clients for IPoIB and native IB, reads and writes]

4 OSS nodes, IOzone record size 1MB Offers potential for greater scalability
Supercomputing '09
115

NFS/RDMA Performance
[Figure: NFS/RDMA write and read throughput (MB/s) on tmpfs vs. number of threads, comparing Read-Read and Read-Write designs]

IOzone read bandwidth up to 913 MB/s (Sun x2200s with x8 PCIe)
Read-Write design by OSU, available with the latest OpenSolaris
NFS/RDMA is being added into OFED 1.4

R. Noronha, L. Chai, T. Talpey and D. K. Panda, Designing NFS With RDMA For Security, Performance and Scalability, ICPP 07
Supercomputing '09
116

Summary of Design and Performance Results


Current-generation IB adapters, 10GE/iWARP adapters and software environments are already delivering competitive performance
IB and 10GE/iWARP hardware, firmware, and software are going through rapid changes
Convergence between IB and 10GigE is emerging
Significant performance improvement is expected in the near future

Supercomputing '09

117

Presentation Overview
Introduction Why InfiniBand and 10-Gigabit Ethernet? Overview of IB, 10GE, their Convergence and Features IB and 10GE HW/SW Products and Installations Sample Case Studies and Performance Numbers Conclusions and Final Q&A

Supercomputing '09

118

Concluding Remarks
Presented network architectures & trends in Clusters Presented background and details of IB and 10GE
Highlighted the main features of IB and 10GE and their convergence Gave an overview of IB and 10GE hw/sw products Discussed sample performance numbers in designing various high-end systems with IB and 10GE

IB and 10GE are emerging as new architectures leading to a new generation of networked computing systems, opening many research issues needing novel solutions
Supercomputing '09
119

Funding Acknowledgments
Our research is supported by the following organizations Funding support by

Equipment support by

Supercomputing '09

120

Personnel Acknowledgments
Current Students
M. Kalaiya (M.S.), K. Kandalla (M.S.), P. Lai (Ph.D.), M. Luo (Ph.D.), G. Marsh (Ph.D.), X. Ouyang (Ph.D.), S. Potluri (M.S.), H. Subramoni (Ph.D.)

Past Students
P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), P. Lai (Ph.D.), J. Liu (Ph.D.)
Supercomputing '09
121

A. Mamidala (Ph.D.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), G. Santhanaraman (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Current Post-Doc
E. Mancini

Current Programmer
J. Perkins

Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.mcs.anl.gov/~balaji
http://www.cse.ohio-state.edu/~koop
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu

panda@cse.ohio-state.edu balaji@mcs.anl.gov matthew.koop@nasa.gov


Supercomputing '09
122
