
PROGRAMMING THE INFINIBAND NETWORK ARCHITECTURE FOR HIGH PERFORMANCE MESSAGE PASSING SYSTEMS

Vijay Velusamy, Changzheng Rao
Department of Computer Science and Engineering, Mississippi State University
Box 9637, Mississippi State, MS 39762, USA
{vijay, cr90}@hpcl.cs.msstate.edu

Srigurunath Chakravarthi, Jothi Neelamegam, Weiyi Chen, Sanjay Verma
MPI Software Technology, Inc.
101 S. Lafayette St., Suite 33, Starkville, MS 39759, USA
{ecap, jothi, wychen, sanjay}@mpi-softtech.com

Anthony Skjellum (corresponding author, to whom inquiries should be sent)
Department of Computer and Information Sciences, University of Alabama at Birmingham
115A Campbell Hall, 1300 University Boulevard, Birmingham, Alabama 35294, USA
tony@cis.uab.edu

Abstract

The InfiniBand Architecture provides availability, reliability, scalability, and performance for server I/O and inter-server communication. The InfiniBand specification describes not only the physical interconnect and the high-level management functions but also a wide range of functions, from simple unreliable communication to network partitioning, with many options. The specification defines line protocols for message passing as a collection of proto verbs. This paper discusses the various programming options that are available to benefit from the capabilities of the InfiniBand Network Architecture. An analysis is presented of the strategies and programming models that would be enabled and/or enhanced by the InfiniBand infrastructure for message passing systems. Conclusions are offered about benefits, trends, and further opportunities for low-level InfiniBand programming.

Keywords: InfiniBand, Verbs, Message Passing Interface, Middleware, RDMA

1. INTRODUCTION

The InfiniBand Architecture is an industry standard developed by the InfiniBand Trade Association [3] to provide availability, reliability, scalability, and performance for server I/O and inter-server communication. Design principles enlisting communication strategies to virtualize resources among an arbitrary number of processes form the core of the standard. The InfiniBand specification defines only the line protocols for message passing, and not any APIs. This paper discusses the InfiniBand Verb APIs, the VI Provider Library API [1], and certain other portable network APIs, while offering perspective on trends, convergence, and enhancement opportunities. Using these APIs to create a fast MPI-1.2 implementation is also discussed. It is envisioned that features in MPI-2, including MPI-I/O, will be able to benefit from the support the InfiniBand Architecture provides for storage networks, and that this, in aggregation with its benefits for message passing, could be used for high performance computing systems.

The remainder of this paper is organized as follows. Section 2 briefly describes the various programming interfaces that are available to utilize InfiniBand. Programming models applicable to InfiniBand are described in Section 3. Section 4 encapsulates some experiences in adapting InfiniBand to high performance message passing, and Section 5 summarizes the paper.

2. PROGRAMMING INTERFACES

The HCA Verbs Provider Driver is the lowest level of software in the operating system and interfaces with the hardware directly. The HCA driver consists of two parts, a loadable kernel mode driver and a user mode library (most often a dynamically loadable library module), as shown in Figure 1.

Programming interfaces for InfiniBand may be broadly classified into two main categories: Channel Access Interfaces and Portable Interfaces. Channel Access Interfaces are generally provided by the adapter vendor and communicate directly with the adapter drivers. Interfaces that layer on top of these channel access interfaces, thereby hiding implementation specifics, belong to the class of Portable Interfaces.
2.1 Channel Access Interfaces

The InfiniBand Architecture incorporates many of the concepts of the Virtual Interface Architecture [1], which in turn drew from Shrimp [11] and Cornell University's U-Net [12]. The Virtual Interface (VI) Architecture is a specification for high-performance networking interfaces that reduces the overhead of ubiquitous transports such as TCP/IP. VI supports both the traditional two-sided send/receive model of communication and a one-sided remote memory access model. Some of the characteristics of VI have been found to have significant impact on the scalability and performance of systems that support the Message Passing Interface (MPI) [6,7]. These include the number of connections required by a large application, the time taken to dynamically open and close connections, and support for unexpected messages. The VI model also does not support sharing a single pool of buffers between multiple VIs, which limits the scalability of VI-based systems.

The IBM InfiniBlue Host Channel Adapter Access Application Programming Interface is a vendor-independent interface that allows the development of both kernel-space and user-space applications based on the InfiniBand transport model [4]. It has the ability to work with different verbs interfaces, and supports multiple channel adapters from the same or different vendors. It also recommends that the implementation ensure that multiple threads may safely use the APIs provided they do not access the same InfiniBand entity (such as a queue pair or completion queue).

Other verbs APIs include those defined by Mellanox and VIEO. The Mellanox IB-Verbs API (VAPI) is Mellanox's software programmer's interface for InfiniBand verbs [5]. The Mellanox VAPI interface provides a set of operations that closely parallel the proto verbs of the InfiniBand standard, plus additional extension functionality in the areas of enhanced memory management and adapter property specification.

The VIEO InfiniBand Channel Abstraction Layer (CAL) Application Programming Interface provides a vendor-independent interface for InfiniBand channel adapter hardware [9]. The CAL API lies under the verbs abstraction, between the Channel Interface driver software and the adapter hardware. It isolates specific hardware implementation details, providing both a common function call interface and a common data structure interface for the supported InfiniBand chipsets. An advantage of such an abstraction is simultaneous support for heterogeneous channel adapters, which potentially enhances path selection. However, CAL is not widely used at present.

Figure 1: High Level Architecture (Adapted from Linux InfiniBand Project [10])

2.2 Portable Interfaces

In addition to the above interfaces, which communicate with the Host Channel Adapters' provider drivers directly, there also exist portable interfaces that hide the channel access interface from the user. These include the Direct Access Transport and the Sandia Portals interface.

Direct Access Transport (DAT) [2] is a standard set of transport-independent, platform-independent APIs designed to benefit from the remote direct memory access (RDMA) capabilities of interconnect technologies such as InfiniBand, the Virtual Interface Architecture, and RDMA over IP. DAT defines both user-level (uDAPL) and kernel-level (kDAPL) provider libraries (DAPL). The current version 1.0 of DAT provides minimal functionality to exploit RDMA performance capabilities. It falls short at present of completely supporting IB functionality such as datagrams and multicast groups. This short-term limitation has adverse effects on the scalability of large applications. However, DAT also targets systems other than InfiniBand, which enhances its long-term value.
The Interconnect Software Consortium Transport API (ICSC Transport API) [14] defines specifications related to programming interfaces and protocols that enable direct interaction with RDMA-capable transports. The specifications include extensions to the UNIX Sockets API, an API that provides direct user application access to the interconnect transport, and APIs that provide application access to the interconnect fabric management infrastructure. The Phase I Specification covers VI Architecture networks and the Reliable Connection and Unreliable Datagram services of IB.

The Sandia Portals Interface [8] is an interface for message passing between nodes in a system area network. The Portals interface is primarily designed for scalability, with the ability to support a scalable implementation of the MPI standard. To support scalability, the Portals interface maintains a minimal amount of state. Portals supports reliable, ordered delivery of messages between pairs of processes, without explicit point-to-point connection establishment between them. Portals combines the characteristics of both one-sided operations and two-sided send/receive operations. Since this interface was designed originally to support MPI, implementations of Portals over the IB infrastructure could enable rapid deployment of high performance message passing systems; they could also provide an alternative option for other communication paradigms.

2.3 Interesting Scenarios

Though the IBM HCA access API and the Mellanox Verbs API are primarily targeted at adapters based on IBM and Mellanox chipsets respectively, an interesting scenario would be to use IBM verbs on Mellanox adapters or vice versa. Evidence of such planned cross-support exists.

Another interesting scenario would be a port of the VI Provider Library (VIPL) over the DAT library. Even though DAT provides a portable layer over the VI architecture, this scenario is ideal for applications that have been developed over the VI Provider Library and need to be ported to a layer that is already supported by DAT. Current studies that incorporate such an idea are geared towards enumerating the shortcomings of DAT for supporting the VI Provider Library. The various layering strategies possible, along with their advantages, are listed in Table 1.

Table 1. Layering strategies

  Layering             Advantage/Potential Use(s)
  VIPL over DAPL       Makes legacy VIPL applications usable on IB, RDMA over IP
  DAPL over VIPL       Useful to architect DAPL applications on legacy VIPL based hardware (such as Giganet)
  DAPL over IB Verbs   Portable programming across multi-vendor IB HCAs
  Portals over IB      Portable programming across multi-vendor IB HCAs, particularly suitable for MPI applications
  Portals over DAPL    Makes Portals applications usable on VIPL, IB, RDMA over IP; particularly suitable for MPI applications
  Portals over VIPL    Suitable for Portals based applications on legacy VIPL based hardware (such as Giganet)

2.4 Interface Comparison

The various verbs APIs discussed above are based on the Verbs specification of the InfiniBand Architecture. Almost all of them have many similarities in the semantics used. The functions are generally classified as HCA Access, Memory Management, Address Management, Completion Queue and Completion Queue Domains, Queue Pair related, End-to-End Contexts, Multicast Groups, Work Request Processing, and Event Handling. However, they have some subtle differences in the parameters associated with certain calls. For example, in the Query_hca function provided in the IBM Verbs API, the caller is not required to pre-allocate memory for the attributes structure. The caller calls the function twice: the first time the function is invoked with attr_size=0, and the routine returns the actual size of the buffer. The caller is then able to allocate the buffer and pass it to the routine. This enables efficient memory utilization, as the caller cannot otherwise estimate how much memory should be allocated. (A sketch of this two-call pattern appears at the end of this subsection.)

There is also certain additional functionality provided by some programming interfaces. For example, the Mellanox API supports querying the HCA's gid and pkey tables. The Mellanox API also has a single interface for both simple data transfer and RDMA based data transfer functions; in the Post Send interface, the opcode parameter specifies the type of operation to be performed. It also supports a set_ack bit flag, which causes the last packet to have its ack_required bit set; this can be used to indicate to the corresponding end point whether the initiator expects an acknowledgement or not.

DAPL provides many features that make it attractive to applications. Since the API hides the internals of the implementation, it provides a user-friendly platform for application developers. DAPL provides a common event model for notifications, connection management events, data transfer completions, asynchronous errors, and all other notifications. It also supports combining OS wait objects and DAT events. However, in certain cases this prevents the application from benefiting from features that may be available to developers using the underlying Verbs/VIPL interface directly. Also, the presence of an additional layer has an adverse effect on the performance of the system.
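The two-call size-query idiom described above is common across the verbs providers, although the exact signatures differ. The following C sketch illustrates only the generic pattern; the query_hca() prototype, its argument order, and its return codes are hypothetical stand-ins for the vendor's actual Query_hca verb, not the IBM API itself.

    #include <stdlib.h>

    /* Hypothetical prototype standing in for the vendor's Query_hca verb. */
    int query_hca(int hca_handle, void *attr_buf, size_t *attr_size);

    void *get_hca_attributes(int hca_handle, size_t *size_out)
    {
        size_t attr_size = 0;
        void  *attr_buf;

        /* First call: attr_size == 0, so the verb only reports the required size. */
        if (query_hca(hca_handle, NULL, &attr_size) != 0 || attr_size == 0)
            return NULL;

        /* Allocate exactly what the provider asked for, then call again. */
        attr_buf = malloc(attr_size);
        if (attr_buf == NULL)
            return NULL;

        if (query_hca(hca_handle, attr_buf, &attr_size) != 0) {
            free(attr_buf);
            return NULL;
        }

        *size_out = attr_size;
        return attr_buf;   /* caller frees */
    }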
3. PROGRAMMING MODELS

InfiniBand supports the following transport models: Reliable Datagram (RD); Remote Operations (RO); Unreliable Datagram (UD); Reliable Connection (RC); Unreliable Connection (UC); and Unreliable Multicast data transfer paradigms (Fig. 2). Though InfiniBand provides support only for Unreliable Multicast, applications requiring reliable multicast, such as MPI implementations, could use a model that couples Reliable Datagrams with a selective retransmission strategy to achieve Reliable Multicast (RMC) capability. The Reliable Datagram, Remote Operations, and Unreliable Datagram models may be used to implement the Sandia Portals API. Implementations of the MPI-1 and MPI-2 standards may use the Portals implementation to benefit from InfiniBand. DAT 1.0 and VIPL will benefit from Reliable Connection oriented data transfers. Implementations of the MPI-1 and MPI-2 standards may be layered over DAT 1.0 or VIPL. MPI-2 One-Sided Communications will benefit from the Remote Operations supported by InfiniBand. Unreliable Connection, Multicast, and Reliable Multicast may be used to implement MPI Collective Communications.

Networks like Myrinet offer non-deadlocking routes through the Myrinet mapper [13]. Fully connected networks composed from Myrinet, InfiniBand fabrics, or similar technologies should offer access to such non-interfering routes in order to allow programs to exploit full bisection bandwidth and to avoid unexpected bandwidth collapse in cases that the fabric should be able to handle without performance compromise. While this is generally well understood in older technologies such as Myrinet, it is an issue to be resolved for InfiniBand. Issues include access to appropriate sets of routes for static/clique programs that assume all-to-all virtual topologies (such as MPI-1 programs), and access to appropriate sets of routes for dynamic programs that add and remove cliques during computation and may or may not need full virtual topology support (such as MPI-2 programs that use DPM). There are several kinds of issues that arise, which we raise here rather than solve. First, there has to be an API that gives access to static routes, if they are precomputable for a fabric. Second, there has to be an API that allows modification of such routes; and, if multiple route options are available in more complex InfiniBand networks, it should be possible to use these multiple routes. Finally, if there is a resource limitation that prevents precomputed routes from being stored for large systems, a systematic understanding of the costs of dynamically caching such routes is important. While formally requiring all-to-all virtual connectivity, most real applications will use rather more limited runtime graphs, except those that actually do data transposition (corner turns), which might be effectively implemented with a fully connected graph. The coupling of such routing information to the services of InfiniBand is also an important question that impacts overall scalability. For instance, if RD is exploited as a means to enhance scalability (either directly or used in connection with Portals), then any hidden unscalability from routing information must also be identified and removed for large-scale networks, or else the cost of higher scalability in terms of dynamic caching must be estimated. Brightwell et al. identify in [8] reasons why connection-oriented protocols such as VIPL limit scalability: each connection uses fixed resources in the NIC, rather than each session. Similarly, the lower-level routing information may effectively reproduce such unscalable growth in resource requirements. One then has to reason whether these limitations are binding on systems of the size to be created practically (e.g., up to 65,536 nodes in current generation superclusters). From the emphases of this paper, it is important to mention that there may be coupling issues to existing verb layers that require new verbs, both for message passing and for connection management.

MPI I/O, the I/O chapter of the MPI-2 standard, is a potential beneficiary of the RC and UD capabilities offered by InfiniBand. Collective communication and non-blocking I/O calls could take advantage of the low-latency, high-bandwidth capabilities offered by InfiniBand to offer extended overlapping of not only computation and communication, but also I/O.

Figure 2. Layering Options for Network Programming Models (MPI-1/2, Portals, DAT 1.0, VIPL, MPI-2 One Sided, and MPI Collective Communication layered over the RD, Remote Operations, UD, RMC, MC, RC, and UC services)

4. HIGH PERFORMANCE MESSAGE PASSING

The following are some of our experiences in porting MPI/Pro (a commercial implementation of the MPI-1 standard) and ChaMPIon/Pro (a commercial implementation of the MPI-2 standard) to benefit from InfiniBand.
4.1 Impacts on efficient and scalable MPI implementations

Some of the features provided by InfiniBand include the Reliable Connection and Reliable Datagram services. In Reliable Connection oriented services, programmers using IB Verbs need to explicitly change Queue Pair (QP) states to initiate the handshake between two QPs. Applications could use this fine-grained control to increase performance by overlapping handshaking and computation. It would also increase the scalability of applications in a large-scale InfiniBand network, as QPs could be multiplexed/reused, reducing the number of connections required. (Perhaps making and breaking QP connections can be used in MPI communicator creation and destruction to help use QPs efficiently, especially when the number of QPs is limited. When the number of QPs is limited, all QPs can be created in MPI_Init, and lightweight state change operations can be used in MPI_Comm_create/dup and MPI_Comm_free.)
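As a minimal sketch of this fine-grained control: the names below (qp_t, modify_qp, the state constants, and the helper routines) are hypothetical, provider-neutral stand-ins, since each verbs API spells these calls differently. The point is only the RESET to INIT to RTR to RTS progression of a Reliable Connection QP and the opportunity to overlap computation with the connection handshake.

    /* Hypothetical, provider-neutral QP interface; real verb names differ per vendor. */
    typedef enum { QP_RESET, QP_INIT, QP_RTR, QP_RTS } qp_state_t;

    typedef struct qp qp_t;                       /* opaque queue pair handle  */
    int  modify_qp(qp_t *qp, qp_state_t next);    /* drives one state change   */
    void exchange_qp_info_with_peer(qp_t *qp);    /* out-of-band address swap  */
    void do_useful_computation(void);

    /* Bring a Reliable Connection QP up while overlapping work with the handshake. */
    int connect_qp(qp_t *qp)
    {
        if (modify_qp(qp, QP_INIT) != 0)   /* RESET -> INIT */
            return -1;

        /* Peer QP number / LID exchange happens out of band (e.g., over sockets). */
        exchange_qp_info_with_peer(qp);

        do_useful_computation();           /* overlap: handshake not yet complete */

        if (modify_qp(qp, QP_RTR) != 0)    /* INIT -> Ready To Receive */
            return -1;
        if (modify_qp(qp, QP_RTS) != 0)    /* RTR -> Ready To Send */
            return -1;
        return 0;
    }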
4.2 Exploiting HCA Parallelism

IB HCAs typically exploit parallelism in hardware. A direct consequence is that throughput can be maximized by pipelining messages across multiple QPs whenever possible. MPI requires message ordering within communicator and tag scope. This may limit unrestricted use of multiple QPs between MPI ranks, because message ordering is guaranteed by IB Reliable Connections only within the same QP connection. Naive MPI implementations would use a single QP for each MPI rank pair. However, there are several ways an MPI library can leverage concurrent messaging bandwidth through the use of multiple QPs; they are as follows (a sketch of model (b) appears at the end of this subsection):

a. Use one QP for each process pair for each communicator. This is a conservative exploitation of parallelism that benefits MPI applications using multiple communicators. No special measures are required by the MPI library to enforce message ordering, since ordering is limited to a communicator.

b. Preallocate a fixed number (more than one) of QP connections between each rank pair, and multiplex large message transfers across multiple QPs. This exploits HCA parallelism by pipelining message transfers better. As bus limitations are removed over time, such methods may increase in value.

c. Use separate QPs for control messages (handshake messages, flow control messages) and data. This insulates control messages from being blocked behind long data transfers. An extension of this model would be to use one QP each for small, medium, large, and very large sized data.

d. Use multiple QPs, one for each {tag, communicator} pair. This model extends model (a) to use even more QPs. It would benefit applications using several communicators and/or several tags, and inherently maps well to the tag+communicator ordering rule for MPI. However, handling the MPI_ANY_TAG wild card will require the MPI implementation to add extra ordering logic.

e. Use separate QP(s) for collective communication and point-to-point communication. This model should be easy to implement because MPI dictates no ordering constraints between the point-to-point and collective message spaces.

On the other hand, when the number of QPs is restricted, liberal use of QPs is limited. Performance has to be traded off against scalability in light of this resource constraint.

Developers of high performance cluster middleware thus face scalability vs. bandwidth trade-offs. However, as the Reliable Datagram (RD) communication service type is widely supported by mainstream IB hardware vendors, the conflict may be resolved by providing a dynamic mapping between application communication channels and End-to-End (EE) Contexts. This strategy requires sophisticated design and may need to be application-oriented to avoid introducing additional overhead and/or latency, but it is widely believed to be a promising solution for high performance clustering middleware.
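The following C sketch illustrates model (b): a large buffer is split into chunks that are posted round-robin across a small pool of preallocated QPs so the HCA can work on several transfers concurrently. The qp_t type and post_send() helper are hypothetical; a real implementation would use the provider's post-send verb and would also have to restore MPI's per-(communicator, tag) ordering at the receiver.

    #include <stddef.h>

    typedef struct qp qp_t;
    /* Hypothetical wrapper around the provider's post-send verb. */
    int post_send(qp_t *qp, const char *buf, size_t len);

    #define NUM_QPS    4            /* QPs preallocated per rank pair    */
    #define CHUNK_SIZE (256*1024)   /* stripe unit, tuned to the HCA/bus */

    /* Stripe one large message across the QP pool, round-robin by chunk. */
    int send_striped(qp_t *qps[NUM_QPS], const char *buf, size_t len)
    {
        size_t offset  = 0;
        int    next_qp = 0;

        while (offset < len) {
            size_t chunk = (len - offset < CHUNK_SIZE) ? (len - offset) : CHUNK_SIZE;
            if (post_send(qps[next_qp], buf + offset, chunk) != 0)
                return -1;
            offset  += chunk;
            next_qp  = (next_qp + 1) % NUM_QPS;   /* pipeline across QPs */
        }
        return 0;   /* completions are reaped separately from the CQ */
    }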

4.3 Timeouts and retries

IB allows a programmer to specify timeouts and numbers of retries (the RNR_TIMEOUT and RNR_RETRY attributes). This can help the MPI library avoid adding explicit flow control mechanisms, which typically add overhead. Using an appropriate combination of timeout and retry values can achieve low- or no-overhead flow control. Applications that inherently do not require flow control (applications that never exhaust system buffers) will never incur extra overhead, whereas this is not true for even simple credit-based flow control mechanisms.
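A sketch of how an MPI library might lean on the transport instead of its own credit scheme follows. The qp_flow_attr structure and set_qp_attributes() call are hypothetical placeholders for the provider's QP-modify verb, and the timer code shown is only an assumed tuning value; the two fields correspond to the Receiver-Not-Ready timer and retry count discussed above.

    /* Hypothetical QP attribute structure; real providers expose these
     * fields through their QP-modify verbs. */
    struct qp_flow_attr {
        unsigned rnr_timeout;   /* encoded Receiver-Not-Ready timer value     */
        unsigned rnr_retry;     /* retries before the send fails; 7 means
                                 * "retry indefinitely" in the IB encoding    */
    };

    typedef struct qp qp_t;
    int set_qp_attributes(qp_t *qp, const struct qp_flow_attr *attr);

    /* Let the hardware retry when the receiver has no posted buffer, so the
     * MPI library can omit an explicit credit-based flow control protocol. */
    int enable_hw_flow_control(qp_t *qp)
    {
        struct qp_flow_attr attr = {
            .rnr_timeout = 14,   /* an intermediate encoded timer; tuned empirically */
            .rnr_retry   = 7,    /* retry indefinitely rather than erroring out      */
        };
        return set_qp_attributes(qp, &attr);
    }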
4.4 Selective event completion notification

IB allows QPs to be configured so that notification of event completion can be specified when posting a send or receive work queue entry. Generating an event queue entry is typically rather expensive, as it involves programmed I/O (PIO) across the PCI bus by the driver. The MPI library can therefore request notification for only every Nth send work queue entry, rather than for each one.
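A sketch of this every-Nth-send policy follows. The send_wr structure, the signaled flag, and post_send_wr() are hypothetical stand-ins for the provider's work request interface. Because send completions on a QP are reported in order, once a signaled request completes the library can also treat all earlier unsignaled sends on that QP as complete.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct qp qp_t;

    /* Hypothetical work-request descriptor and post-send wrapper. */
    struct send_wr {
        const void *buf;
        size_t      len;
        bool        signaled;   /* request a completion-queue entry for this WR? */
    };
    int post_send_wr(qp_t *qp, const struct send_wr *wr);

    #define SIGNAL_INTERVAL 16   /* generate one CQ entry per 16 sends */

    /* Post a send, requesting a completion event only every Nth work request. */
    int post_send_selective(qp_t *qp, const void *buf, size_t len)
    {
        static unsigned posted = 0;
        struct send_wr wr = {
            .buf      = buf,
            .len      = len,
            .signaled = (++posted % SIGNAL_INTERVAL) == 0,
        };
        return post_send_wr(qp, &wr);
    }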
4.5 Fine-grained control over IB enabling efficient communications

Traditional flow control, such as credit-based flow control, is built on top of implicit message ordering requirements, often utilizing features such as fencing, immediate data, etc. The receive side needs to send messages carrying credit to the send side as flow control notifications. In cases where data is constantly transferred one way, this credit-based flow control is unnecessary, and credit notifications may also waste bandwidth.

The Work Request Entries of IB verbs contain fields such as retry, timeout, etc. These provide more fine-grained control of communication. By specifying a proper retry count and timeout value, flow control can be made more efficient or even removed in some cases.
4.6 Atomic operations for MPI-2 implementations

The shared memory window paradigm of MPI, with its window-based communicator, matches well the memory region model offered by InfiniBand. The semantics for updating such windows, versus the required MPI synchronization semantics, is an attractive match.

InfiniBand supports two atomic operations: Compare & Swap and Fetch & Add. Compare & Swap instructs the remote QP to read a 64-bit value, compare it with the compare data provided, and, if equal, replace it with the swap data provided in the QP. The Fetch & Add operation, on the other hand, instructs the remote QP to read a 64-bit value and replace it with the sum of the 64-bit value and the add data value provided in the QP. MPI-2 offers a rich, high-level distributed shared memory paradigm with one-sided semantics in addition to two-sided message passing. These one-sided semantics appear to work well with the availability of both the RDMA constructs of InfiniBand and its atomic operations. While the latter have not been explored extensively at all, we find that both the read-modify-write and fetch-and-add operations can be exploited in support of certain one-sided operations. The MPI accumulate operation appears to be supported, at least in part, by such atomic operations. The direct use of Distributed Shared Memory (DSM) for collective data reorganization in third-party memory appears promising. Finally, the use of DSM and RDMA together with MPI-2's parallel I/O chapter appears extremely promising, offering to allow the removal of intermediate TCP/IP connections between parallel I/O clients and parallel I/O servers, and also allowing flexible choices for data reorganization at either the (clique/parallel) client or server side of the parallel I/O.
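As a sketch of how the Fetch & Add verb could back part of MPI_Accumulate (an MPI_SUM on a single 64-bit element): the atomic_fetch_add() wrapper and its remote-address/rkey parameters are hypothetical, standing in for the provider's atomic work request. A full accumulate implementation would still need RDMA or send/receive paths for other datatypes and reduction operations.

    #include <stdint.h>

    typedef struct qp qp_t;

    /* Hypothetical wrapper around the provider's Fetch & Add work request:
     * atomically adds 'addend' to the 64-bit word at (remote_addr, rkey) on
     * the target and returns the value that was there before the addition. */
    int atomic_fetch_add(qp_t *qp, uint64_t remote_addr, uint32_t rkey,
                         uint64_t addend, uint64_t *old_value);

    /* One element of an MPI_Accumulate(MPI_SUM) on a remote window location,
     * expressed as a single InfiniBand atomic rather than get-modify-put. */
    int accumulate_sum_u64(qp_t *qp, uint64_t win_addr, uint32_t win_rkey,
                           uint64_t contribution)
    {
        uint64_t previous;   /* prior contents, returned by the verb but unused here */
        return atomic_fetch_add(qp, win_addr, win_rkey, contribution, &previous);
    }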
4.7 Some experimental results

Some of our initial experiments with porting MPI/Pro over the IBM Verbs API, uDAPL, and Mellanox VAPI included ping-pong latency and bandwidth measurements. Back-to-back configurations were used for the IBM Verbs testbed. For experiments with the IBM Verbs API and uDAPL (an implementation over the IBM Verbs API), the configuration included Supermicro motherboards with Intel E7500 chipsets, Intel Xeon processors, and Paceline 2100 HCAs. Experiments with the Mellanox VAPI were performed on InfiniHost MTEK23108 HCAs connected by a Topspin 360 switch; that configuration included an Intel Server Board SHG2 with a ServerWorks Grand Champion LE chipset and Intel Xeon processors. The IBM Verbs based tests were performed with two microcode versions (1.5.1 and 1.6.4), while uDAPL tests were performed only with HCA firmware version 1.5.1.

Figure 3 shows the latencies achieved with MPI/Pro for two processes in the back-to-back configuration, and Figure 4 shows the one-way bandwidths achieved with a similar configuration.

Figure 3: MPI/Pro Latency over IBM Verbs (latency in us vs. message length in bytes, for DAPL, IBMV microcode 1.5.1, and IBMV microcode 1.6.4)

Figure 4: MPI/Pro Bandwidth over IBM Verbs (bandwidth in MB/s vs. message length in bytes, for DAPL, IBMV microcode 1.5.1, and IBMV microcode 1.6.4)
The latencies and bandwidths achieved with IBM Verbs microcode version 1.6.4 were considerably better than those with microcode version 1.5.1. Although uDAPL bandwidths were comparable to those of IBM Verbs (microcode version 1.5.1), the observed latencies were slightly higher; this is attributed to the overhead associated with the uDAPL layer. Also, RDMA writes were used for long messages, as opposed to RDMA reads, which would have yielded better performance. Peak bandwidth of up to 500 MB/s was observed.

Latency and bandwidth measurements for two processes using MPI/Pro over Mellanox VAPI are shown in Figures 5 and 6 respectively. MPI/Pro over VAPI supports two modes, namely polling and blocking; the blocking mode is implemented using the VAPI event handling verbs. Polling latencies of 8.5 us and 11.8 us were observed for the VAPI layer and MPI/Pro respectively. Event (blocking) latencies of 33 us and 38 us were observed for VAPI and MPI/Pro respectively. Peak bandwidth of 827 MB/s was observed for the VAPI layer. For MPI/Pro, peak bandwidth of 802 MB/s was observed when the buffers were aligned on the page size; when the buffers were aligned and the same buffers were used for sending and receiving in the ping-pong tests, peak bandwidth of 815 MB/s was observed. It was interesting to note that the best results were obtained with an MTU of 1024, whereas the maximum bandwidth was only 560 MB/s using an MTU of 2048.

Figure 5: MPI/Pro Latency over Mellanox VAPI (latency in microseconds vs. message size in bytes, blocking and polling modes)

Figure 6: MPI/Pro Bandwidth over Mellanox VAPI, long messages (bandwidth vs. message size in bytes, for polling and blocking modes with aligned and non-aligned buffers)
5. CONCLUSIONS

This paper compares, contrasts, and maps APIs to 4X InfiniBand hardware. Commentary on porting between APIs, and on simultaneously using multiple APIs, is given. The paper analyzes the various programming models that might be used by applications to benefit from features provided by InfiniBand, while assessing potential convergence of low-level APIs. Outstanding scalability concerns are also raised, notably that RC is not as scalable as desired for large clusters and that underlying route computation must be utilized. Opportunities to use RD and other InfiniBand modes together with the Portals API appear to be excellent paths to high scalability, but routing issues will also be important here.

InfiniBand introduces many new programming options and many new ways to build high performance clusters. In the next few months, convergence of programming models, programming options, and hardening of low-level implementations will help identify how fully commercial offerings for production computing will evolve. The opportunity to explore the maximum scalability of InfiniBand (APIs, protocols, and routing issues) remains for immediate work. Effective scalability up to commonly used sizes is obviously possible, with issues at massive scale remaining to be explored in great depth.

InfiniBand is a useful technology for implementing parallel clusters, and several of the verbs APIs are particularly efficient for supporting the requirements of MPI-1 point-to-point message passing via RDMA, while offering the potential to enhance implementations of MPI-2 one-sided communication. Opportunities for enhancing parallel I/O system implementations using InfiniBand also exist.

6. ACKNOWLEDGEMENTS
Work at both Mississippi State University and MPI Software Technology, Inc. was enabled through loaner hardware from the respective commercial vendors mentioned herein and from Los Alamos National Laboratory, as well as superb technical support from both Paceline Systems and Mellanox concerning low-level software issues. Work at MPI Software Technology, Inc. and Mississippi State University was funded, in part, under a grant from the National Science Foundation SBIR Program, Phase II/IIb; grants DMI-9983413 and DMI-0222804 are acknowledged. MPI Software Technology, Inc. was also enabled through a subcontract with the US Department of Energy, Los Alamos National Laboratory. Dr. Gary Grider of Los Alamos is specifically acknowledged as well.
7. REFERENCES

[1] Compaq, Microsoft, and Intel, "Virtual Interface Architecture Specification, Version 1.0," Technical report, Compaq, Microsoft, and Intel, December 1997.
[2] DAT Collaborative, "uDAPL API Specification Version 1.0," http://www.datcollaborative.org.
[3] InfiniBand Trade Association, "InfiniBand Architecture Specification, Release 1.0," http://www.InfiniBandta.org.
[4] International Business Machines (IBM) Corporation, "InfiniBlue Host Channel Adapter Access Application Programming Interfaces Programmer's Guide," October 2002.
[5] Mellanox Technologies Inc., "Mellanox IB-Verbs API (VAPI)," 2001.
[6] Message Passing Interface Forum, "MPI: A Message Passing Interface Standard," The International Journal of Supercomputer Applications and High Performance Computing, vol. 8, 1994.
[7] Ron Brightwell and Arthur B. Maccabe, "Scalability Limitations of VIA-Based Technologies in Supporting MPI," Proceedings of the Fourth MPI Developer's and User's Conference, March 2000.
[8] Ron Brightwell, Arthur B. Maccabe, and Rolf Riesen, "The Portals 3.2 Message Passing Interface Revision 1.1," Sandia National Laboratories, November 2002.
[9] Vieo Inc., "Channel Abstraction Layer (CAL) API Reference Manual V2.0," January 2002.
[10] Intel Corporation, "Linux InfiniBand Project," http://InfiniBand.sourceforge.net.
[11] C. Dubnicki et al., "Shrimp Project Update: Myrinet Communication," IEEE Micro, Jan.-Feb. 1998, pp. 50-52.
[12] T. von Eicken et al., "U-Net: A User-Level Network Interface for Parallel and Distributed Computing," Proc. 15th Symp. Operating System Principles, ACM Press, New York, 1995, pp. 40-53.
[13] Nanette J. Boden, Danny Cohen, Robert E. Felderman, Alan E. Kulawik, Charles L. Seitz, Jakov N. Seizovic, and Wen-King Su, "Myrinet: A Gigabit-per-second Local Area Network," IEEE Micro, 15(1):29-36, February 1995.
[14] Interconnect Software Consortium, "ICSC Transport API Specification," http://www.opengroup.org/icsc.
