2.1 Channel Access Interfaces

The InfiniBand Architecture incorporates many of the concepts of the Virtual Interface Architecture [1], which in turn drew from Shrimp [11] and Cornell University's U-Net [12]. The Virtual Interface (VI) Architecture is a specification for high-performance networking interfaces that reduces the overhead of ubiquitous transports such as TCP/IP. VI supports both the traditional two-sided send/receive model of communication and a one-sided remote memory access model. Some of the characteristics of VI have been found to have a significant impact on the scalability and performance of systems that support the Message Passing Interface (MPI) [6,7]. These include the number of connections required by a large application, the time taken to dynamically open and close connections, and support for unexpected messages. The VI model also does not support the sharing of a single pool of buffers among multiple VIs, which limits the scalability of VI-based systems.

The IBM InfiniBlue Host Channel Adapter Access Application Programming Interface is a vendor-independent interface that allows the development of both kernel-space and user-space applications based on the InfiniBand transport model [4]. It has the ability to work with different verbs interfaces, and it supports multiple channel adapters from the same or different vendors. It also recommends that implementations ensure that multiple threads may safely use the APIs provided they do not access the same InfiniBand entity (such as a queue pair or completion queue).

Other verbs APIs include those defined by Mellanox and VIEO. The Mellanox IB-Verbs API (VAPI) is Mellanox's software programmer's interface for InfiniBand verbs [5].

2.2 Portable interfaces

In addition to the above interfaces, which can be used to communicate with the Host Channel Adapters' provider drivers directly, there also exist portable interfaces that hide the channel access interface from the user. These include the Direct Access Transport and the Sandia Portals interface.

Direct Access Transport (DAT) [2] is a standard set of transport-independent, platform-independent APIs designed to benefit from the remote direct memory access (RDMA) capabilities of interconnect technologies such as InfiniBand, the Virtual Interface Architecture, and RDMA over IP. DAT defines both user-level (uDAPL) and kernel-level (kDAPL) provider libraries (DAPL). The current version 1.0 of DAT provides minimal functionality to exploit RDMA performance capabilities. It falls short at present of completely supporting IB functionality such as datagrams and multicast groups. This short-term limitation has adverse effects on the scalability of large applications. However, DAT also targets systems other than InfiniBand, which enhances its long-term value.

The Interconnect Software Consortium Transport API (ICSC-Transport API) defines specifications related to programming interfaces and protocols that enable direct interaction with RDMA-capable transports. The specifications include extensions to the UNIX Sockets API, an API that provides direct user-application access to the interconnect transport, and APIs that provide application access to the interconnect fabric management infrastructure. The Phase I Specification covers VI
Networks like Myrinet offer non-deadlocking routes through the Myrinet mapper [13]. Fully connected networks composed from Myrinet, InfiniBand fabrics, or similar technologies should offer access to such non-interfering routes in order to allow programs to exploit full bisection bandwidth and to avoid unexpected bandwidth collapse in cases that the fabric should be able to handle without performance compromise. While this is generally well understood in older technologies such as Myrinet, it is an issue still to be resolved for InfiniBand. Issues include access to appropriate sets of routes for static/clique programs that assume all-to-all virtual topologies (such as MPI-1 programs), and access to appropriate sets of routes for dynamic programs that add and remove cliques during computation and may or may not need full virtual topology support (such as MPI-2 programs that use DPM). There are several kinds of issues that arise, which we raise here rather than solve. First, there has to be an API that gives access to static routes, if they are precomputable for a fabric. Second, there has to be an API that allows modification of such routes, and if multiple route options are available for more complex InfiniBand networks, it should be possible to use them. Finally, if a resource limitation prevents precomputed routes from being stored for large systems, a systematic understanding of the costs of dynamically caching such routes is important. While formally requiring all-to-all virtual connectivity, most real applications will use rather more limited runtime graphs, except those that actually do data transposition (corner turns), which might

MPI I/O, the I/O chapter of the MPI-2 standard, is a potential beneficiary of the RC and UD capabilities offered by InfiniBand. Collective communication and non-blocking I/O calls could take advantage of the low-latency, high-bandwidth capabilities offered by InfiniBand to offer extended overlapping of not only computation and communication, but also I/O.

Figure 2. Layering Options for Network Programming Models

4. HIGH PERFORMANCE MESSAGE PASSING

The following are some of our experiences in porting MPI/Pro (a commercial implementation of the MPI-1 standard) and ChaMPIon/Pro (a commercial implementation of the MPI-2 standard) to benefit from InfiniBand.
4.1 Impacts on efficient and scalable MPI implementations

Some of the features provided by InfiniBand include the Reliable Connection and Reliable Datagram services. In the Reliable Connection oriented service, programmers using IB Verbs need to explicitly change Queue Pair (QP) states to initiate the handshake between two QPs. Applications could use this fine-grained control to increase performance by overlapping handshaking with computation. It would also improve the scalability of applications on a large-scale InfiniBand network, since QPs could be multiplexed and reused, reducing the number of connections required. (Perhaps making and breaking QP connections can be tied to MPI communicator creation and destruction to help use QPs efficiently, especially when the number of QPs is limited. In that case, all QPs can be created in MPI_Init, and lightweight state-change operations can be used in MPI_Comm_create/dup and MPI_Comm_free.)
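A sketch of this QP-pool idea is shown below. It uses the OpenFabrics libibverbs API as a stand-in for the vendor verbs interfaces discussed in this paper; the pool size, helper name, and capability values are illustrative assumptions rather than MPI/Pro internals. All QPs are created once in MPI_Init, and only the lightweight state transitions are left for communicator creation.

#include <infiniband/verbs.h>
#include <string.h>

/* Illustrative sketch only: libibverbs stands in for the vendor verbs APIs
 * discussed in this paper; pool size and helper name are assumptions. */

#define QP_POOL_SIZE 64                 /* assumed limit on usable QPs */

static struct ibv_qp *qp_pool[QP_POOL_SIZE];

/* Called once from MPI_Init: create every QP up front.  The QPs remain in
 * the RESET state; the cheaper INIT -> RTR -> RTS transitions are deferred
 * to MPI_Comm_create/dup, and MPI_Comm_free returns a QP to the pool
 * instead of destroying it. */
static int create_qp_pool(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;
    int i;

    memset(&attr, 0, sizeof(attr));
    attr.send_cq          = cq;
    attr.recv_cq          = cq;
    attr.qp_type          = IBV_QPT_RC; /* Reliable Connection service */
    attr.cap.max_send_wr  = 128;
    attr.cap.max_recv_wr  = 128;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    for (i = 0; i < QP_POOL_SIZE; i++) {
        qp_pool[i] = ibv_create_qp(pd, &attr);
        if (qp_pool[i] == NULL)
            return -1;                  /* HCA resources exhausted */
    }
    return 0;
}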
4.2 Exploiting HCA Parallelism

IB HCAs typically exploit parallelism in hardware. A direct consequence is that throughput can be maximized by pipelining messages across multiple QPs whenever possible. MPI, however, requires message ordering within communicator and tag scope. This may limit unrestricted use of multiple QPs between MPI ranks, because message ordering is guaranteed by IB Reliable Connections only within the same QP connection. A naive MPI implementation would use a single QP for each MPI rank pair. However, there are several ways an MPI library can exploit concurrent messaging bandwidth through the use of multiple QPs; a channel-selection sketch illustrating models (a) and (d) follows this discussion. The options are as follows:

a. Use one QP for each process pair for each communicator. This is a conservative exploitation of parallelism that benefits MPI applications using multiple communicators. No special measures are required by the MPI library to enforce message ordering, since ordering is limited to a communicator.

b. Preallocate a fixed number (more than one) of QP connections between each rank pair, and multiplex large message transfers across multiple QPs. This exploits HCA parallelism by pipelining message transfers better. As bus limitations are removed over time, such methods may increase in value.

c. Use separate QPs for control messages (handshake messages, flow control messages) and data. This insulates control messages from being blocked behind long data transfers. An extension of this model would be to use one QP each for small, medium, large, and very large sized data.

d. Use multiple QPs, one for each {tag, communicator} pair. This model extends model (a) to use even more QPs. It would benefit applications using several communicators and/or several tags, and it inherently maps well to the tag+communicator ordering rule of MPI. However, handling the MPI_ANY_TAG wildcard will require the MPI implementation to add extra ordering logic.

e. Use separate QP(s) for collective communication and point-to-point communication. This model should be easy to implement because MPI dictates no ordering constraints between the point-to-point and collective message spaces.

On the other hand, when the number of QPs is restricted, liberal use of QPs is limited, and performance has to be traded off against scalability in light of this resource constraint.

Developers of high-performance cluster middleware thus still face a scalability versus bandwidth trade-off. However, as the Reliable Datagram (RD) communication service type is widely supported by mainstream IB hardware vendors, the conflict may be resolved by providing a dynamic mapping between application communication channels and End-to-End (EE) Contexts. This strategy requires sophisticated design, and it may need to be application-oriented to avoid introducing additional overhead and/or latency, but it is widely believed to be a promising approach for high-performance clustering middleware.
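The channel-selection logic implied by models (a) and (d) can be sketched as follows. The structure, helper name, and QPS_PER_PEER constant are hypothetical and not taken from MPI/Pro; the point is that messages sharing a {communicator, tag} pair must map to the same QP to preserve MPI ordering, while MPI_ANY_TAG forces a fallback to a single channel (or to extra reordering logic in the library).

#include <mpi.h>
#include <infiniband/verbs.h>

#define QPS_PER_PEER 4          /* assumed number of QPs opened to each peer */

struct peer {
    struct ibv_qp *qp[QPS_PER_PEER];    /* connected RC QPs to one rank */
};

/* Models (a)/(d): choose the QP for a message from its communicator context
 * id and tag.  Traffic sharing a {communicator, tag} pair always lands on
 * the same QP, so the in-order delivery of an RC QP preserves MPI matching
 * order, while distinct pairs can flow in parallel over different QPs. */
static struct ibv_qp *select_qp(const struct peer *p, int context_id, int tag)
{
    /* A receive posted with MPI_ANY_TAG may be matched by traffic on any of
     * the per-tag QPs; without extra reordering logic in the library, such
     * traffic must fall back to a single, conservatively ordered channel. */
    if (tag == MPI_ANY_TAG)
        return p->qp[0];

    return p->qp[(unsigned)(context_id ^ tag) % QPS_PER_PEER];
}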
4.3 Timeouts and retries

IB allows a programmer to specify timeouts and the number of retries (the RNR_TIMEOUT and RNR_RETRY flags). This can help the MPI library avoid adding explicit flow-control mechanisms, which typically add overhead. Using an appropriate combination of timeout and retry values can achieve low-overhead or even zero-overhead flow control. Applications that inherently do not require flow control (applications that never exhaust system buffers) will never incur extra overhead, whereas this is not true for even simple credit-based flow-control mechanisms.
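The flag names above are those of the vendor verbs interfaces; the sketch below shows the equivalent knobs in the OpenFabrics libibverbs API, used here only as an illustrative stand-in with assumed attribute values. The receiver-side min_rnr_timer is set on the transition to RTR, and the sender-side timeout, retry count, and RNR retry count are set on the transition to RTS; the QP is assumed to have already been moved from RESET to INIT.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: let the HCA absorb receiver-not-ready (RNR) conditions instead of
 * running an explicit credit protocol in the MPI library.  All values are
 * examples, not the configuration used in MPI/Pro. */
static int bring_qp_to_rts(struct ibv_qp *qp, uint16_t dlid, uint32_t dest_qpn,
                           uint32_t psn, uint8_t port)
{
    struct ibv_qp_attr a;

    /* RTR: min_rnr_timer tells the remote sender how long to back off before
     * retrying a send for which no receive buffer was posted here. */
    memset(&a, 0, sizeof(a));
    a.qp_state           = IBV_QPS_RTR;
    a.path_mtu           = IBV_MTU_1024;
    a.dest_qp_num        = dest_qpn;
    a.rq_psn             = psn;
    a.max_dest_rd_atomic = 1;
    a.min_rnr_timer      = 12;          /* encoded value, roughly 0.64 ms */
    a.ah_attr.dlid       = dlid;
    a.ah_attr.port_num   = port;
    if (ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                      IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                      IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
        return -1;

    /* RTS: rnr_retry = 7 means "retry indefinitely" on RNR NAKs, so a sender
     * that briefly outruns its receiver is throttled by the HCA rather than
     * by MPI-level credits; timeout/retry_cnt cover transport-level loss. */
    memset(&a, 0, sizeof(a));
    a.qp_state      = IBV_QPS_RTS;
    a.sq_psn        = psn;
    a.timeout       = 14;               /* local ACK timeout (encoded) */
    a.retry_cnt     = 7;                /* transport retransmit limit */
    a.rnr_retry     = 7;                /* infinite RNR retries */
    a.max_rd_atomic = 1;
    return ibv_modify_qp(qp, &a, IBV_QP_STATE | IBV_QP_SQ_PSN |
                         IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_MAX_QP_RD_ATOMIC);
}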
4.4 Selective event completion notification

IB allows QPs to be configured so that notification of event completion can be requested when posting a send or receive work queue entry. Generating an event queue entry is typically reasonably expensive, as it involves programmed I/O (PIO) across the PCI bus by the driver. The MPI library can therefore request notification for only every Nth send work queue entry, rather than for each one.
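A minimal sketch of selective completion signaling, again using the OpenFabrics libibverbs API as a stand-in; SIGNAL_INTERVAL and the helper name are assumptions. The QP must be created with sq_sig_all set to zero; only every Nth send work request then carries IBV_SEND_SIGNALED, and on a Reliable Connection the completion of that request implies completion of the unsignaled requests posted before it on the same send queue.

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define SIGNAL_INTERVAL 64   /* assumption: one completion per 64 sends */

/* Post a send, asking for a completion-queue entry only every Nth request.
 * The QP must have been created with ibv_qp_init_attr.sq_sig_all = 0.
 * Buffers of the unsignaled requests may be reclaimed once the following
 * signaled request completes.  (A real library would keep a per-QP counter;
 * a single static counter is used here for brevity.) */
static int post_send_batched(struct ibv_qp *qp, void *buf, uint32_t len,
                             uint32_t lkey)
{
    static unsigned posted;
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr;

    sge.addr   = (uintptr_t)buf;
    sge.length = len;
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = (uintptr_t)buf;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = (++posted % SIGNAL_INTERVAL == 0) ? IBV_SEND_SIGNALED : 0;

    return ibv_post_send(qp, &wr, &bad_wr);
}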
4.5 Fine-grained control over IB enabling efficient communications

Traditional flow control, such as credit-based flow control, is built on top of implicit message ordering requirements, often utilizing features such as fencing, immediate data, etc. The receive side needs to send messages carrying credits to the send side as flow-control notifications. In cases where data is constantly transferred one way, this credit-based flow control is unnecessary, and credit notifications may also waste bandwidth.

The Work Request Entry of IB verbs contains fields such as retry, timeout, etc. These provide more fine-grained control over communication. By specifying a proper retry count and timeout value, flow control can be made more efficient or even removed in some cases.

4.6 Atomic operations for MPI-2 implementations

The shared memory window paradigm of MPI, with its window-based communicators, matches well the memory region model offered by InfiniBand. The semantics for updating such windows, versus required MPI

The parallel I/O chapter appears extremely promising, offering to allow the removal of intermediate TCP/IP connections between parallel I/O clients and parallel I/O servers, and also allowing flexible choices for data reorganization at either the (clique/parallel) client or server side of the parallel I/O.
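For reference, the MPI-2 one-sided window interface discussed in Section 4.6 is illustrated below. This is a standard MPI-2 usage example rather than ChaMPIon/Pro code; MPI_Put on such a window can translate directly into an RDMA write into the registered region, and simple MPI_Accumulate operations are candidates for the IB atomic verbs.

#include <mpi.h>
#include <stdlib.h>

/* Standard MPI-2 one-sided example: each rank exposes a window that the MPI
 * library can back with a registered, RDMA-accessible memory region. */
int main(int argc, char **argv)
{
    int rank, nprocs, right;
    long value, *base;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    right = (rank + 1) % nprocs;

    base = calloc(nprocs, sizeof(long));
    MPI_Win_create(base, nprocs * sizeof(long), sizeof(long),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);               /* open an access epoch */

    value = rank;
    /* One-sided update of the right neighbour's window: maps naturally onto
     * an RDMA write into the remote registered region. */
    MPI_Put(&value, 1, MPI_LONG, right, rank, 1, MPI_LONG, win);

    MPI_Win_fence(0, win);               /* complete the Put epoch */

    /* Remote accumulation in a separate epoch: for a single 64-bit MPI_SUM
     * update this is a candidate for the IB atomic (fetch-and-add) verbs. */
    MPI_Accumulate(&value, 1, MPI_LONG, right, nprocs - 1, 1, MPI_LONG,
                   MPI_SUM, win);

    MPI_Win_fence(0, win);               /* complete all outstanding RMA */

    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}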
4.7 Some experimental results

Some of our initial experiments with porting MPI/Pro over the IBM Verbs API, uDAPL, and Mellanox VAPI included ping-pong latency and bandwidth measurements. Back-to-back configurations were used for the IBM Verbs testbed. For the experiments with the IBM Verbs API and uDAPL (implemented over the IBM Verbs API), the configuration included Supermicro motherboards with Intel E7500 chipsets, Intel Xeon processors, and Paceline 2100 HCAs. Experiments with the Mellanox VAPI were performed on InfiniHost MTEK23108 HCAs connected by a Topspin 360 switch; this configuration included Intel Server Board SHG2 with the ServerWorks Grand Champion LE chipset and Intel Xeon processors. The IBM Verbs based tests were performed with two microcode versions (1.5.1 and 1.6.4), while the uDAPL tests were performed only with HCA firmware version 1.5.1.

Figure 3 shows the latencies achieved with MPI/Pro for two processes in a back-to-back configuration, and Figure 4 shows the one-way bandwidths achieved with a similar configuration. The latencies and bandwidths achieved with IBM Verbs (microcode version 1.6.4) were considerably better than those with IBM Verbs (microcode version 1.5.1). Although bandwidths for uDAPL were comparable to IBM Verbs (microcode version 1.5.1), the observed latencies were slightly higher; this is attributed to the overhead associated with the uDAPL layer. Also, RDMA writes were used for long messages, as opposed to RDMA reads, which would have yielded better performance. Peak bandwidth of up to 500 MB/s was observed.

Figure 3: MPI/Pro Latency over IBM Verbs (DAPL and IBMV microcode ver. 1.5.1, versus message length in bytes)

Figure 4: MPI/Pro Bandwidth over IBM Verbs

Blocking mode was implemented by using the VAPI event handling verbs. Polling latencies of 8.5 us and 11.8 us were observed for the VAPI layer and MPI/Pro, respectively. Event latencies of 33 us and 38 us were observed for VAPI and MPI/Pro, respectively. Peak bandwidth of 827 MB/s was observed for the VAPI layer. For MPI/Pro, peak bandwidth of 802 MB/s was observed when the buffers were aligned on the page size. However, when the buffers were aligned and the same buffers were used for sending and receiving in the ping-pong tests, peak bandwidth of 815 MB/s was observed. It was interesting to note that the best results were obtained with MTU 1024, whereas the maximum bandwidth was only 560 MB/s using MTU 2048.

Figure 5: MPI/Pro Latency over Mellanox VAPI

[Figure: MPI/Pro VAPI Bandwidth (Long Messages) — Polling and Blocking modes, with aligned and unaligned buffers, versus message size in bytes]

5. CONCLUSIONS

This paper compares, contrasts, and maps APIs to 4X InfiniBand. Concerns are also raised, notably that RC is not as scalable as desired for large clusters, with underlying route management at scale remaining to be explored in great depth.

InfiniBand is a useful technology for implementing parallel clusters, and several of the verbs APIs are particularly efficient for supporting the requirements of MPI-1 point-to-point message passing via RDMA, while offering the potential to enhance implementations of MPI-2 one-sided communication. Opportunities for enhancing parallel I/O system implementations using InfiniBand also exist.