Charlie Russel
Microsoft MVP for Windows Server; author, Microsoft Windows Server 2008 Administrator's Companion (Microsoft Press, 2008)
Eric Lantz
Microsoft Program Manager, High Performance Computing Team
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. It is up to each reader to carefully consider the statements and recommendations in this White Paper and determine if the guidance is appropriate for his or her unique technical situation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. 2008 Microsoft Corporation. All rights reserved. Microsoft, Internet Explorer, SQL Server, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Contents

Introduction
Criteria for Evaluating Performance
Categories of Parallel Applications
    Message Intensive
    Embarrassingly Parallel
    SOA Applications
    Other Considerations
General Performance Tuning Tools and Methods
    NetworkDirect vs. TCP/IP
    GigE vs. Specialty Networking
    InfiniBand and the Open Fabrics Alliance
    Other IB Tricks
    Memory and CPU Constraints
    Affinity
    Large Clusters
Tuning for Messaging-Intensive Applications
    Heavy Messaging
    Latency-Sensitive Messaging
    Setting the MPI Network
    Performance Tips for Message-Intensive Applications
    MS-MPI Shared Memory
    Simple Multipurpose Daemon
    Typical Microbenchmark Network Performance
Tuning for Embarrassingly Parallel Applications
Performance Tuning and Measurement Tools
    Using Built-In Diagnostics in the Admin Console
    Using MPIPingPong
    MPIPingPong Examples
    Using NetworkDirect Pingpong
Performance Measurement and Tuning Procedure
Conclusion
References
Appendix A: Performance Counters for Windows HPC Server 2008
    Recommended Counters to Monitor When Assessing Cluster Performance
    Example of Collecting Data from Multiple Compute Nodes
Appendix B: Performance Console Launcher HTML Example
Introduction
This white paper discusses the factors that can affect the performance of your Windows HPC cluster. Its goal is to help you identify the interconnect hardware to choose for your cluster and to explain how to tune that interconnect and software stack for your application and needs. It focuses on the specifics of tuning and configuring Windows HPC Server 2008. The paper doesn't cover every possible scenario (that would require a book, not a white paper), but it will help you identify what kind of HPC cluster you have and concentrate your performance-tuning efforts (and resources) on the areas likely to be most effective for your type of cluster. This paper does not tell you what specific models, processors, or brands of computer to buy. It does, however, make specific recommendations about:
- Criteria to include in your purchasing decisions.
- Performance testing tools and methods.
- Specific configurations for general execution cases.

Even though the price of astonishing amounts of computing power has come down substantially, it is still important to spend your computing resources wisely. For example, it makes little sense to spend extra money on high-speed networking gear when your application doesn't do much network communication but is RAM-constrained. By understanding your application and where the performance bottleneck is, you can make intelligent decisions about how to improve performance.
Criteria for Evaluating Performance

The best predictor of cluster performance is testing that matches your own workload:

- Your application. Testing with the actual application or suite of applications that you use is the best way to know what the actual performance is. The computing usage pattern of your application is a critical component of performance.
- Your data. Using the actual data, or a subset of that data, is the best way to measure your application's performance. By testing your application's performance with data that closely resembles the data sets in actual cluster usage, you can better predict the ultimate real-world performance.
- Your equipment. The best test of whether performance meets your needs is to test against your equipment, preferably in your own environment. Network topology, driver choices, and varying hardware models all affect the performance you see in actual cluster usage.

Finally, performance evaluation is not a one-time event, but an ongoing process. Your initial evaluation serves as a baseline for evaluating the state of your system and the effects of hardware and software changes. So, does all this mean that benchmarks are useless? Of course not. By carefully choosing the correct benchmarks for your type of application and data, and then intelligently interpreting the results, you can get very useful preliminary indications of what your system's performance is and how to spend your money to improve it. The key is to use the right benchmarks for your type of system and application.
Categories of Parallel Applications

Among applications that are message intensive, there are those that pass lots of small messages and those that pass large messages or data sets. Windows HPC Server 2008 also supports a new type of application: service-oriented architecture (SOA) applications, which use Windows Communication Foundation (WCF) brokers to communicate. This leaves us with four types of parallel applications to tune for performance, as described below.

Message Intensive

- Latency bound. Each node's assigned chore is highly dependent on the work being done on other nodes, and messages between nodes pass control and data often, using lots of small messages for synchronization. The end-to-end latency of the network, rather than data capacity, is the critical factor.
- Bandwidth bound. Large messages pass between nodes during runtime, or large data sets or result sets are staged to nodes before and after runtime. Here, the absolute bandwidth of the networking is the critical factor.

Embarrassingly Parallel

Each node processes data independently of other nodes, and little message passing is required. The total number of nodes and the efficiency of each node are the critical factors affecting performance.
SOA Applications

One or more WCF brokers control communications between the cluster, the job scheduler, and the interactive user. SOA applications tend to be CPU-constrained on the WCF broker nodes.

Other Considerations

In addition to these broad categories of cluster applications, you should take two additional considerations into account when choosing your cluster hardware and software:

- Application scaling. The ability of your application to scale to the number of processors available.
- Scheduler performance. The ability of the scheduler to handle the number of jobs required by your application.

Application Scaling

Application scaling refers to the maximum number of processors you can apply to a single job or problem and still reduce the solution time. Applying more processors than the application's scaling limit can actually increase the solution time. The ability of an application to scale to many processors is affected by the cluster hardware and the operating system; however, many applications in which units of work are interdependent (applications that pass messages) also have scaling limits imposed by software architecture and implementation, mathematics, physics, or all of the above. Work with your application's vendor to determine whether there are practical limits on your application's ability to scale to many processors. If so, get guidance on scaling limits related to your use of the application.

Scheduler Performance

Some HPC applications run multiple jobs, each of very short duration. In applications such as these (often found in the financial sector, for example), the performance of the job scheduler becomes quite important, and the new Windows HPC Server 2008 Job Scheduler is designed to handle this demanding usage. For more information on factors that affect Job Scheduler performance, see Windows HPC Server 2008 Head Node Performance Tuning, available at http://go.microsoft.com/fwlink/?LinkID=132962.
General Performance Tuning Tools and Methods

The following list summarizes the major cluster design considerations, the factor that typically limits performance in each case, and recommendations for addressing it.

Number of cores per CPU
- Limiting factor: Saturation of the memory bus. (Cores waiting for a memory response aren't doing anyone any good.)
- Recommendation: Depends on the specific characteristics of your application's memory access patterns, the hardware limitations of your CPU architecture, and your memory architecture, cache size, and speed. There are so many variables related to an application's hardware utilization that testing is the best way to gauge whether your application scales well to multiple cores on specific hardware. Consider purchasing two or three compute nodes to test your application with different process placement patterns, in addition to testing your application on various compute node configurations. Examining solution times for different process placement patterns on a single node will provide insight into the processor and memory needs of your specific application. Process placement patterns should include loading all the cores, loading one core per processor chip, loading one core per NUMA unit, and so forth.

RAM per node
- Limiting factor: Paging to disk when the working data set exceeds available RAM.
- Recommendation: Size your RAM to the working data set you expect the cluster to handle by the time of your next major upgrade, to avoid paging to disk. If that is not possible, ensure the out-of-core solver/computations in your commercial application are using the correct RAM size.

Network latency (embarrassingly parallel applications)
- Limiting factor: Not applicable.
- Recommendation: Gigabit Ethernet (GigE) minimum.

Network latency (message-intensive applications)
- Limiting factor: The delay in the delivery of messages from one node to another (network latency) is so great that it causes your application's computations to wait for data or control messages.
- Recommendation: Higher speed networking (InfiniBand, Myrinet, 10 GigE, and others) for latency-sensitive applications.

Network bandwidth
- Limiting factor: Large data sets staged to each node before execution; large data sets collected from each node after execution; large messages (>16 kilobits per message) in a message-intensive application.
- Recommendation: Higher bandwidth networking: Ethernet or GigE with offload capability (TCP/IP Offload Engine and the like); InfiniBand, Myrinet, and others.

Storage subsystem
- Limiting factor: When using or creating large data sets, the speed of cluster node local disks can be a limiting factor. On the head node, disk access for Microsoft SQL Server can be the primary limiting factor for clusters with high rates of job submission or configuration changes.
- Recommendation: Parallel file system attached to one or more compute nodes. Add additional drives to compute nodes to host local copies of data sets. Use a dedicated staging server with wide, fast RAID arrays to reduce load on the head node. Use Distributed File System (DFS) across multiple staging servers. Use SAS, Fibre Channel, or SCSI arrays; more small disks are faster than a few large disks.
NetworkDirect vs. TCP/IP

The NetworkDirect architecture allows a Message Passing Interface (MPI) application to use the most efficient network stack for the networking hardware available in the cluster, with performance comparable to custom hardware-specific MPI libraries, while keeping the application independent of the details of the hardware. With NetworkDirect, the Microsoft Message Passing Interface (MS-MPI) library uses Remote Direct Memory Access (RDMA) to bypass Windows Sockets and the traditional TCP/IP stack, and interfaces directly with the hardware-specific NetworkDirect provider, which writes directly to the underlying network hardware. Not only is the NetworkDirect architecture much faster than conventional TCP-based networking, but it also has the distinct advantage of hiding hardware differences from the MS-MPI library. Applications that write to the MS-MPI stack are shielded from changes in
the underlying network hardware, making it easier and more cost-effective to upgrade your networking hardware after the initial cluster purchase, if necessary for performance reasons. Many other MPI stacks write directly to the application programming interface (API) of the networking hardware. This approach limits the ability to change networking hardware later and provides little or no performance advantage over NetworkDirect. Windows HPC Server 2008 will work with any Windows-compatible MPI stack. For extremely network-sensitive applications you should choose a NetworkDirect solution, or, if hardware independence is not an issue, you can choose a hardware-specific MPI stack such as the Myricom MPI stack for Myrinet (source code):
- For Myrinet GM: http://www.myri.com/scs/download-mpichgm.html
- For Myrinet MX (newest): http://www.myri.com/scs/download-mpichmx.html

Alternative MPI stacks that work on multiple network fabrics include:
- Hewlett-Packard HP-MPI: http://h21007.www2.hp.com/dspp/tech/tech_TechDocumentDetailPage_IDX/1,1701,1238,00.html#new2
- Intel MPI: http://www.intel.com/cd/software/products/asmo-na/eng/cluster/mpi/308295.htm
GigE vs. Specialty Networking

Note: Applications tend to become more sensitive to network latency as the number of compute nodes increases. As you increase the size of your cluster, you might need to improve your networking performance.

The baseline networking option for typical off-the-shelf Windows HPC Server 2008-based clusters is a basic GigE network adapter. Some systems include network adapters that have TCP or other offload options that perform network operations on the network adapter rather than on the host computer's CPU. Others include network adapters that support RDMA, enabling one computer to directly manipulate memory on a second computer, thus reducing the workload for the host's CPU. Several steps can help improve GigE networking performance:
- Experiment with disabling CPU interrupt moderation if your GigE network adapter includes this feature.
- Tune performance parameters for GigE networking, including TCP Chimney Offload, receive-side scaling (RSS), and jumbo frames (see the commands after this list). For details, see the Performance Tuning Guidelines for Windows Server 2008 white paper at http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx.
- Use high-quality network switches or routers that maintain full line speed even when many switch ports are in simultaneous use. Poor-quality switches and cabling can significantly slow an application's solution time, as the networking stack repeatedly retries erroneous transmissions.
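As a rough illustration of the second item, the global TCP parameters mentioned above can be inspected and changed from an elevated command prompt with netsh. Which settings apply depends on your adapter and driver, and jumbo frames are usually configured in the adapter driver's Advanced properties rather than through netsh, so treat the following as a sketch, not a recommendation:

rem Show the current global TCP settings (includes the chimney offload and RSS state)
netsh int tcp show global
rem Example only: disable TCP Chimney Offload and enable receive-side scaling
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=enabled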
Many GigE network adapters have a CPU-interrupt moderation feature in their drivers that batches network interrupts to reduce the load on the CPU. This feature does reduce CPU overhead, thereby improving overall throughput for some workloads, but it can cause large variances in effective network bandwidth and, as anecdotal evidence suggests, may increase network latency by 50 to 1,000 percent. CPU interrupt moderation, if supported, can be enabled or disabled on the Advanced tab of the Properties sheet of the adapter, as shown in Figure 3.

Note: This particular vendor's driver calls the feature Interrupt Moderation. Each vendor has a slightly different name for the feature, but all perform similar functions.

Tip: To change the settings for identical network cards across a cluster, identify the registry changes used by the network card, and then use clusrun with setx to read and set those values across the cluster.
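For example, a minimal sketch of pushing such a change cluster-wide might use clusrun with reg.exe (shown here instead of setx because the setting lives under the adapter driver's registry key). The key path, instance number, and value name below are placeholders only; look up the actual key and value your adapter driver uses before changing anything:

rem Placeholder path and value name; substitute the key your adapter driver actually uses
clusrun /all reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0001" /v *InterruptModeration /t REG_SZ /d 0 /f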
InfiniBand and the Open Fabrics Alliance

For details on tuning Winsock Direct, see the version of this white paper for Windows Compute Cluster Server 2003 at http://www.microsoft.com/Downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&displaylang=en.

InfiniBand (IB) can be used in three modes:
- IB with NetworkDirect provider. Provides the lowest possible latency for MS-MPI applications and lowers CPU utilization by bypassing the operating system's TCP stack.
- IB with Winsock Direct (WSD) provider. Provides low latency for sockets-based applications and lowers CPU utilization by bypassing the operating system's TCP stack.
- IP-over-IB. Uses the TCP/IP stack and provides lower latency and higher bandwidth than GigE, but with reduced performance when compared to NetworkDirect. Use IP-over-IB when you need network traffic routed over dissimilar networks or when you need TCP functionality for your cluster.

No matter which mode you use, you need one or more drivers. Although IB drivers for Windows are available from several sources, we suggest drivers based on those of the OpenFabrics Alliance (http://www.openfabrics.org). Microsoft has worked closely with Mellanox Technologies (http://www.mellanox.com/content/pages.php?pg=windows_hpc) and other OpenFabrics members as they developed and tuned the open-source code used in these drivers for Windows operating systems. The OpenFabrics Windows drivers (WinOF) include a miniport driver for use with IP-over-IB, a Winsock Direct provider for maximum performance for sockets-based applications, and a NetworkDirect provider for optimal MS-MPI performance, in addition to support for the native IB interface called InfiniBand Access Layer (IBAL). The OpenFabrics drivers for Windows are posted at http://www.openfabrics.org/downloads/WinOF/v2.0/. Driver bugs can be reported at http://openib.org/bugzilla/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=OpenFabrics+Windows&content.
Other IB Tricks
Other useful IB tips for improving performance and tuning the IB stack include the following:

Determine your brand of IB network adapter (HCA) without opening the computer. Use the vstat utility and check the first 3 bytes of the node_guid against the list of vendor organizationally unique identifiers (OUIs) at http://standards.ieee.org/regauth/oui/index.shtml.

Note: The vstat command also gives you lots of other useful information about the IB configuration.

Ensure the IB adapter is enabled on all nodes. When you count the adapters across the cluster (for example, with the devcon commands below), the count should match the number of nodes.
Determine the number of PCI devices found on each ready node as follows:
clusrun /readynodes \\headnode\share\devcon findall pci\* | find "matching"
Determine the number of Mellanox cards found across all nodes as follows:
clusrun /all \\headnode\share\devcon findall pci\* | find /c "Mellanox"
Check IB driver and firmware versions with the firmware tools from Mellanox:
- Firmware download: http://www.mellanox.com/content/pages.php?pg=firmware_download
- Firmware tools for Windows: http://www.mellanox.com/content/pages.php?pg=firmware_HCA_FW_update
If your network runs slower than expected, the first thing to check is the link speed. IP-over-IB reports the link speed and, if you have a bad cable, the link speed will show up as 2.5 gigabits per second rather than 10 gigabits per second. If it's 2.5 gigabits per second, switch cables. A second possibility, though less likely, is that the switch port is having problems and can't support 10 gigabits per second. To verify, try changing to another port. If that solves the problem, contact your switch vendor.
Memory and CPU Constraints

Two techniques that can help when memory or CPU is the constraint are:

- Switching off hyperthreading for CPU-intensive applications. Hyperthreading enables utilization of unused CPU cycles by pushing several processes or threads through a single processor. However, HPC processes typically consume all available CPU cycles, so pushing many of them through a single processor often lengthens solution time as the processes contend for time and the CPU context-switches between them. Hyperthreading can usually be disabled in the BIOS.
- Setting processor affinity for applications. Processor affinity causes the operating system scheduler to always assign a particular process to a particular CPU (or CPU core).

A useful white paper on monitoring memory usage, written by a Microsoft MVP, can be found at http://members.shaw.ca/bsanders/WindowsGeneralWeb/RAMVirtualMemoryPageFileEtc.htm.

Affinity

The scenarios where setting processor affinity can significantly improve performance are fairly easy to identify: programs doing large amounts of parallel computation with relatively little MPI communication. The version of MS-MPI included in Windows HPC Server 2008 enables setting processor affinity for each of the ranks. Use either the -affinity mpiexec command-line switch or set the mpiexec environment variable MPIEXEC_AFFINITY=[0|1] (a sketch follows at the end of this section).

Note: Because the MPIEXEC_AFFINITY environment variable must be read before mpiexec starts, the preferred method is to set it as an environment variable on the task.

Using the -affinity switch, the affinity for each process is set to a single core. If not all cores on a node are used, msmpi.exe will automatically assign cores that are the furthest away (cache-wise). This means that when allocating two processes on a dual-core, two-socket computer, msmpi.exe will set the affinity mask for each process to a single core on a different socket. For a task using four processes on a quad-core, dual-socket node where two cores share an L2 cache, msmpi.exe will set the affinity mask for each process to a single core on a different L2 cache boundary. Thus each process gets the whole L2 cache to itself. In some situations, this approach results in reduced solution times, presumably from less memory paging. How effective this is depends on your application and node hardware. However, there are other situations where setting the affinity can actually lead to suboptimal utilization of shared memory and degraded performance. Some customers with OpenMP or threaded sections in their applications have seen reduced performance with the -affinity switch. Two applications where affinity can create problems are Abaqus and ANSYS Mechanical; for these, turning on affinity is the wrong choice.
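A minimal sketch of both approaches follows. Here myApp.exe is a placeholder for your own MPI application, and the batch-file variant assumes that the batch file is what the job's task launches, so the variable is set before mpiexec starts:

rem Set affinity with the mpiexec switch
mpiexec -affinity myApp.exe
rem Or set the environment variable in the batch file the task runs, before mpiexec starts
set MPIEXEC_AFFINITY=1
mpiexec myApp.exe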
Large Clusters
When deploying large clusters of hundreds of nodes, you can run into connection failures in MPI jobs or have problems when a large number of nodes connect to the same share. To help avoid these problems, configure your cluster as follows:
- The head node is not a compute node.
- Compute nodes are isolated from the public network.
- The file share used by compute nodes to start executables is not hosted on the head node.
- Use Microsoft SQL Server Standard Edition instead of Microsoft SQL Server Express Edition.

For more details on planning and implementing large clusters, see the white paper Windows HPC Server 2008 Head Node Performance Tuning at http://go.microsoft.com/fwlink/?LinkID=132962. Also of interest is the Windows HPC Server 2008 Top 500 white paper at http://go.microsoft.com/fwlink/?LinkID=132964.
Tuning for Messaging-Intensive Applications

Heavy Messaging
Because heavy messaging applications transfer large amounts of data with each message, it's important to design your cluster for maximum bandwidth on the MPI network. You can get 10 GigE over either fiber or copper cabling, and specialty networks such as InfiniBand (10 gigabits, 20 gigabits, and soon higher), Myrinet (10 gigabits), and others meet very high bandwidth requirements. Regardless of the networking type used, selecting routers and switches that can sustain full port-to-port bandwidth when all ports are in use generally shortens application solution times. As described earlier, many GigE drivers include a CPU interrupt moderation feature to reduce the CPU load. This feature should be turned off to provide additional bandwidth at the expense of CPU usage on the compute nodes.
Latency-Sensitive Messaging
Latency-sensitive applications are where high-performance networking interfaces become critical for cluster performance. Low-latency communications on the Windows Server 2008 operating system are achieved with a NetworkDirect provider and specialty networking gear such as InfiniBand or Myrinet.

Setting the MPI Network

Use the Windows HPC Server 2008 Cluster Manager to assign the cluster's highest-performing network as the MPI network. Doing so sends all MPI messaging traffic over that network, though some MPI management traffic still goes over the cluster's private network adapter. You can accomplish the same task at the command prompt by setting the MPICH_NETMASK environment variable in the mpiexec command for the job. A typical example might be:
mpiexec -env MPICH_NETMASK 172.16.0.0/255.255.0.0 myApp.exe
The netmask value takes the form network_address/subnet_mask. In the example above, the network addresses are 172.16.yyy.zzz and the subnet mask is 255.255.0.0.
Performance Tips for Message-Intensive Applications

You can tune MS-MPI for message-intensive applications by setting MPICH environment variables as described in Table 1 and by setting processor affinity using MPI variables as described in Table 2. The MPICH environment variables are visible to the launched application and affect its execution. They are passed with mpiexec using the -env, -genv, or -genvlist command-line options as follows:

mpiexec -env VARIABLE SETTING -env OTHERVARIABLE OTHERSETTING <command line>
Useful MPICH environment variables include the following (a usage sketch follows the list):

- MPICH_DISABLE_SOCK
- MPICH_NETMASK
- MPICH_PROGRESS_SPIN_LIMIT
- MPICH_SHM_EAGER_LIMIT
- MPICH_ND_EAGER_LIMIT
- MPICH_ND_ENABLE_FALLBACK
- MPICH_ND_ZCOPY_THRESHOLD. Sets the message size above which to perform zcopy transfers. The default of 0 uses the threshold indicated by the NetworkDirect provider; a value of -1 disables zcopy transfers.
- MPICH_ND_MR_CACHE_SIZE. Sets the size in megabytes of the NetworkDirect memory registration cache. The default is half of physical memory divided by the number of cores.
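As a sketch only (the threshold value and myApp.exe below are arbitrary placeholders, not recommendations), any of these variables can be passed on the mpiexec command line with -env, as described above:

rem Example: override the zcopy threshold for a single run
mpiexec -env MPICH_ND_ZCOPY_THRESHOLD 65536 myApp.exe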
MS-MPI Shared Memory

MS-MPI uses shared memory to communicate between processes on the same node, and network queues for communication with other nodes. Disabling the use of shared memory can improve bandwidth consistency when an application runs across many nodes. Doing so prevents MS-MPI from alternating between shared memory (used for communication between processes on the same node) and network queues (used for communication among nodes) as it polls for incoming data. Thus, MS-MPI polls a single incoming queue, increasing the consistency of measured bandwidth. This results in a smoother bandwidth-versus-message-size curve, such as those produced by the Pallas microbenchmark. However, shared memory latency is an order of magnitude lower than that of network communications; therefore, in most situations it is best to leave it enabled on systems with many processors. Ultimately, the only way to know for certain whether disabling shared memory is a good choice for you is to experiment with your particular application, problem size, and hardware. The Microsoft HPC Server team has made significant strides in improving this part of MS-MPI's design. This is another example of the sometimes tenuous connection between microbenchmark data and actual application solution times.
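For such an experiment, a minimal sketch follows. It assumes your MS-MPI build honors an MPICH_DISABLE_SHM environment variable to switch shared-memory communication off; that variable name is an assumption here, not something this paper confirms, so check your MS-MPI documentation, and treat myApp.exe as a placeholder:

rem Baseline run with shared memory enabled (the default)
mpiexec myApp.exe
rem Comparison run with shared memory disabled, assuming MPICH_DISABLE_SHM is supported
mpiexec -env MPICH_DISABLE_SHM 1 myApp.exe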
Simple Multipurpose Daemon
The simple multipurpose daemon (SMPD) also uses environment variables to control the behavior of spawned applications. Table 3 describes some useful variables you can use in batch or script files that start your HPC applications on the compute nodes. See the Setting Affinity at the Command Line section below for an example of using SMPD environment variables.
This variable is automatically set by Windows for each process and indicates the total number of processor cores on the compute node.

Other MS-MPI Tips

MS-MPI is identical to MPICH2 in most respects, but MS-MPI provides more secure execution: it does not pass unencrypted user credentials over the network, and it integrates with the Windows HPC Server 2008 Job Scheduler. For some helpful tips on running MPI jobs using MS-MPI, see the Microsoft HPC Team applications blog at http://blogs.technet.com/WindowsHPC/.
Typical Microbenchmark Network Performance

Important: The performance data in Table 4 represents typical ranges only and is not maximum performance data. It also is not necessarily representative of how your application will perform on your cluster.
Notes The measurements in Table 4 were taken using the Pallas ping-pong benchmark on midrange servers with the network adapters attached through a PCIe bus. Nodes were connected by a network switch, not cross-over cabling.
Performance Tuning and Measurement Tools

Using MPIPingPong
MPIPingPong can automatically ping between each node and its neighbor in a ring, or over every link between every node. The output presents average latency and bandwidth, provides individual latency and bandwidth, and identifies unresponsive links between nodes. The mpipingpong command-line tool gives detailed information and summaries of the cluster's performance, and can generate full latency and throughput curve data for packet sizes up to 4 MB. The basic usage scenario for mpipingpong is to submit a job with nodes as the requested resources, and run mpiexec mpipingpong on every node. For example, if the cluster has N nodes, the command line is:

job submit /numnodes:N mpiexec mpipingpong

The output is printed to stdout in XML format, giving the overall, per-node, and per-link information. You can run mpipingpong on any set of nodes in the cluster by specifying the appropriate resources at job submission time. Make sure that you allocate only one ping-pong process per node; otherwise, mpipingpong will complain and exit. You can, however, use the -m command-line option to allow mpipingpong to use multiple cores on the same node if desired. By default, the test runs in tournament mode. Given N nodes, the test is divided into N - 1 rounds. During each round, nodes are paired off against each other and send packets to each other simultaneously. This is fast and complete, but is likely to provide inconsistent
results for larger packet sizes due to network switch oversubscription. Two additional test modes are supported:
- Serial mode: all pairs of nodes are tested, and while each pair talks, all others remain silent.
- Ring mode: each node talks only to its neighbors. This means node i will exchange packets with nodes i + s and i - s, where s is a step-size parameter with s = 1 by default, while the rest remain silent.

Serial tests are slow, particularly for large clusters and packet sizes, whereas ring tests are incomplete because they don't test all the links. To run the test in serial mode, use the -s command-line option; to run it in ring mode, use the -r option, and optionally add -rs s to specify the ring step size s. By default, mpipingpong tests network latency only, using 4-byte packets. You can also test network throughput using 4-MB packets via the -pt command-line option, test both latency and throughput in the same run via the -pb option, or generate a full set of latency and throughput data as a function of packet size (using packet sizes ranging from 4 bytes to 4 MB, incremented by powers of two) via the -pc option. You can also ignore the default packet sizes and supply a set of your own by specifying the -p option multiple times on the command line. Each -p option must be followed by an integer p or a pair of colon-separated integers p:n, where p is the packet size in bytes and n is the number of iterations for that packet size (that is, the number of packets of that size that will be exchanged between each pair of nodes during the test). If n is not specified, the test will pick an appropriate number of iterations based on packet size p. By default, mpipingpong outputs the complete XML results to standard output at the end of the run, and prints progress information (as a percentage of test completed) to standard error as the run takes place. Use the -oa option to print abbreviated XML test results to standard output: this prints the overall summary and per-node summaries, but skips the per-link details. Use the -ob option to print an even briefer XML version of the output, showing the overall summary only. The -op option displays full progress information, printing real-time latency and throughput results to standard error, while the -oq option suppresses the printing of progress information altogether. Finally, you can explicitly print test results to an XML file, rather than standard output, by providing a file name parameter at the end of the mpipingpong command line. If a file name is provided in conjunction with the -oa or -ob switches, the full test results will be printed to the file, and the abbreviated version will go to standard output.
MPIPingPong Examples
All the examples below assume that the resources for the job have been specified, and stdout and stderr have been redirected to desired locations.
Run mpipingpong in serial mode, testing both latency and throughput, sending abbreviated XML results to stdout, real-time plain-text results to stderr, and full results to \\mymachine\myshare\pingpong.xml:

mpiexec mpipingpong -s -pb -oa -op \\mymachine\myshare\pingpong.xml
Run mpipingpong in ring mode, with ring step size 4, generating data for packet sizes ranging from 4 bytes to 4 MB, and printing full XML output to stdout and real-time, plain-text output to stderr:
mpiexec mpipingpong -r -rs 4 -pc -op
Run mpipingpong in tournament mode (the default), using a set of custom packet sizes and iteration counts: 8 bytes for the default number of iterations, 4 KB for 20 iterations, and 512 KB for 6 iterations. Do not print output to stdout or progress information to stderr; output everything to \\mymachine\myshare\pingpong2.xml:
mpiexec mpipingpong -p 8 -p 4096:20 -p 524288:6 -oq \\mymachine\myshare\pingpong2.xml
Using NetworkDirect Pingpong

Important: The server continues to run after the client finishes. Use CTRL+C to cancel the server.
Performance Measurement and Tuning Procedure

3. Verify that the scheduler can address all the cluster's compute nodes by running this command:
clusrun /readynodes hostname & ipconfig /all
This command returns a list of all the active nodes on the cluster and their network configurations. 4. If the cluster has an InfiniBand network, verify the NetworkDirect provider is activated on all compute nodes by running this command:
clusrun ndinstall -l
5. Verify that you can run a simple MPI job across the cluster using a simple test application such as this one:
job submit /numprocessors:128 /workdir:\\headnode\share\ mpiexec batchpi.exe 100000
The command assumes you have 128 processors in your cluster, a file share on the head node named \\headnode\share\, and a simple test application like Batchpi.exe (available from the HPC Community site at http://www.windowshpc.net/Resources/Programs/batchpi.zip).
6. Verify point-to-point MPI performance for the cluster. Use a test program like mpipingpong that can automatically ping between each node and its neighbor in a ring (use the allnodes argument) or over every link between every node (use the alllinks argument). The output presents average latency and bandwidth, provides individual latency and bandwidth, and identifies unresponsive links between nodes.
7. When measuring latency and bandwidth on GigE:
a. To help reduce latency and bandwidth variation and dramatically reduce the minimum latency, switch off CPU interrupt moderation on the network adapter driver. This ensures incoming and outgoing network data is processed immediately.
b. Make sure the two test nodes are using the same network switch to avoid hops between switches, and use high-quality switches and cabling.
c. Tune performance parameters for GigE networking, including TCP Chimney Offload, RSS, and jumbo frames. For details, see the Performance Tuning Guidelines for Windows Server 2008 white paper at
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx.
8. If and only if the network gear is using an RDMA Winsock Direct provider, set the MS-MPI send buffer size to 0 (by setting the environment variable MPICH_SOCKET_SBUFFER_SIZE=0). Doing so eliminates an extra data copy on send operations so you get good latency numbers, but you overburden the CPU in the process. If you use this setting with a non-Winsock Direct or NetworkDirect driver, the computer will likely stop responding.
9. Measure total system performance in addition to microbenchmarks like latency and bandwidth.
10. Use Linpack (or similar) to get a total system measurement that is comparable to other systems. Although a useful comparison, Linpack (or similar) does not necessarily model the behavior of your HPC application.
11. Use a set of commercial HPC applications (and their standard benchmark problems) to measure and compare solution times with other systems. Most end users are not interested in microbenchmarks; they're focused on wall-clock solution times.
12. When testing a user's application, complete the following steps (a sketch of steps b and c follows this list):
a. On GigE networking, if the application sends lots of small messages between nodes (a latency-sensitive application), switch off CPU interrupt moderation and additional advanced features of GigE networking, including TCP Chimney Offload, RSS, and jumbo frames.
b. If the application is very bandwidth-bound and the networking gear is using a Winsock Direct provider, experiment with setting the MS-MPI send buffer size to 0. This eliminates an extra data copy on send operations, which increases bandwidth at the expense of higher CPU utilization. Do not attempt this with non-Winsock Direct drivers: the node will stop responding.
c. For applications that are both CPU-intensive and network-intensive, experiment with running fewer processes on each compute node to leave some spare CPU cycles on the node for faster networking. Most HPC applications load the CPU heavily. It makes more sense to try this on a GigE cluster than on InfiniBand, because GigE tends to use more CPU. You can use the MS-MPI -hosts argument to specify the number of processes to run on each node. Windows HPC has example scripts to help you at http://www.microsoft.com/technet/scriptcenter/hubs/HPCS.mspx.
d. Some applications benefit from reduced context switching between processors on a compute node. You can experiment with assigning an application to a specific processor (setting its affinity) on the compute nodes to see if it reduces your application's solution time. Processor affinity can be tagged onto an executable file or set at the Windows command line when an application is started with the Windows Server 2008 start command. See the Affinity section (above) for more detail.
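The following sketch illustrates steps 12b and 12c. The node names, process counts, and myApp.exe are placeholders; the -hosts syntax shown (a node count followed by node-name and process-count pairs) follows the MPICH2-style mpiexec that MS-MPI is derived from, so verify it with mpiexec -help on your own cluster before relying on it:

rem Step 12b: experiment with a zero-length send buffer, on a Winsock Direct network only
mpiexec -env MPICH_SOCKET_SBUFFER_SIZE 0 myApp.exe
rem Step 12c: run only 2 of the 4 cores on each of two nodes to leave CPU headroom for networking
mpiexec -hosts 2 NODE01 2 NODE02 2 myApp.exe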
Conclusion
Windows HPC Server 2008 extends and expands the gains of Windows Compute Cluster Server 2003, which brought supercomputing power to the Windows platform, providing a solution that is easy to use and deploy for building high-performance computing (HPC) clusters on commodity x64 computers. By understanding the type of application you run and how it processes data across the cluster, you can make informed decisions about the specific hardware to specify for your cluster. Understanding where the bottlenecks are in your cluster lets you spend wisely to improve performance.
References
Microsoft HPC Web Site
http://www.microsoft.com/hpc
Microsoft HPC Community Site
http://windowshpc.net/default.aspx
Windows HPC Server 2008 Partners
http://www.microsoft.com/hpc/en/us/partners.aspx
Tutorial from Lawrence Livermore National Lab
http://www.llnl.gov/computing/tutorials/mpi/
Tuning MPI Programs for Peak Performance
http://mpc.uci.edu/wget/www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/index.htm
Winsock Direct Stability Update
http://support.microsoft.com/kb/927620
DevCon Command-Line Utility
http://support.microsoft.com/?kbid=311272
New Networking Features in Windows Server 2003 Service Pack 1
http://www.microsoft.com/technet/community/columns/cableguy/cg1204.mspx
Mellanox MPI Stack and Tools for InfiniBand on Windows
http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=32&menu_section=34
Myricom MPI stack for Myrinet
http://www.verari.com/software_products.asp#mpi
Intel MPI Library
http://www.intel.com/cd/software/products/asmo-na/eng/cluster/mpi/308295.htm
Designing and Building Parallel Programs
http://www-unix.mcs.anl.gov/dbpp/
Parallel Programming Workshop
http://www.mhpcc.edu/training/workshop/parallel_intro/MAIN.html
What Every Dev Must Know About Multithreaded Apps
http://msdn.microsoft.com/msdnmag/issues/05/08/Concurrency/
RAM, Virtual Memory, Page File, and Monitoring
http://members.shaw.ca/bsanders/WindowsGeneralWeb/RAMVirtualMemoryPageFileEtc.htm
Performance Visualization for Parallel Programs
http://www-unix.mcs.anl.gov/perfvis/download/index.htm
Windows HPC Team Blog
http://blogs.technet.com/WindowsHPC
Building and Measuring the Performance of Windows HPC Server 2008Based Clusters for TOP500 Runs
http://download.microsoft.com/download/1/7/c/17c274a3-50a9-4a55-a5b1-29fe42d955d8/Top%20500%20White%20Paper_FINAL.DOCX
Performance Tuning Guidelines for Windows Server 2008
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fded599bac8184a/Perf-tun-srv.docx
Windows Performance Tools Kit
http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
Tuning Parallel MPI Programs with Vampir
http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/software_werkzeuge_zur_unterstuetzung_von_programmierung_und_optimierung/vampirwindows/dateien/Tuning_Parallel_MPI_Programs_with_Vampir.pdf
Appendix A: Performance Counters for Windows HPC Server 2008

Figure 1A. The Reliability and Performance Monitor user interface

Microsoft TechNet provides extensive information on using the Performance Monitor. An article that provides details on the Reliability and Performance Monitor, Performance and Reliability Monitoring Step-by-Step Guide for Windows Server 2008, is available at http://technet.microsoft.com/en-us/library/cc771692.aspx.
Recommended Counters to Monitor When Assessing Cluster Performance

\Memory\Available MBytes

\Memory\Pages/sec
Pages/sec is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. Pages/sec includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.

\Network Interface(MyNetworkConnectionName)\Bytes Total/sec
Bytes Total/sec is the rate at which bytes are sent and received over each network adapter, including framing characters, where MyNetworkConnectionName is the name of your network adapter. Network Interface\Bytes Total/sec is the sum of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec. Use these counters to watch all network adapters to ensure the traffic you expect is on each one, such as to ensure that MPI traffic really is on the MPI network.

\Network Interface(MyNetworkConnectionName)\Packets Outbound Errors
Packets Outbound Errors is the number of outbound packets that could not be transmitted because of errors, where MyNetworkConnectionName is the name of your network adapter.

\Network Interface(MyNetworkConnectionName)\Packets Received Errors
Packets Received Errors is the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol, where MyNetworkConnectionName is the name of your network adapter.

\Network Interface(MyNetworkConnectionName)\Packets/sec
Packets/sec is the rate at which packets are sent and received on the network adapter, where MyNetworkConnectionName is the name of your network adapter.

\Paging File(_Total)\% Usage
The amount of the Page File instance in use, in percent. See also Process\Page File Bytes.

\Paging File(_Total)\% Usage Peak
The peak usage of the Page File instance, in percent. See also Process\Page File Bytes Peak.

\Process(MyHpcAppName)\% Processor Time
% Processor Time is the percentage of elapsed time that all threads of the process MyHpcAppName used the processor to execute instructions, where MyHpcAppName is the name of your HPC application.

\Processor(_Total)\% Interrupt Time
% Interrupt Time (_Total) is the time all processors on the system spend receiving and servicing hardware interrupts during sample intervals. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network adapters, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity.

\Processor(_Total)\% Processor Time
% Processor Time (_Total) is the percentage of elapsed time that the processor spends executing all non-idle threads on the system. This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval.

\Processor(_Total)\Interrupts/sec
Interrupts/sec (_Total) is the average rate, in incidents per second, at which all processors received and serviced hardware interrupts. It does not include deferred procedure calls (DPCs), which are counted separately.

\Processor(0)\% Interrupt Time
Same as % Interrupt Time above, except only for processor 0, where 0 represents the processor being tested.

\Processor(0)\% Processor Time
Same as % Processor Time above, except only for processor 0, where 0 represents the processor being tested.

\Processor(0)\Interrupts/sec
Same as Interrupts/sec above, except only for processor 0, where 0 represents the processor being tested.

\Processor(1)\% Interrupt Time
Same as % Interrupt Time above, except only for processor 1, where 1 represents the processor being tested.

\Processor(1)\% Processor Time
Same as % Processor Time above, except only for processor 1, where 1 represents the processor being tested.

\Processor(1)\Interrupts/sec
Same as Interrupts/sec above, except only for processor 1, where 1 represents the processor being tested.

\System\Context Switches/sec
Context Switches/sec is the combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user mode and privileged (kernel) mode to use an Executive or subsystem service.

\System\Processes
Processes is the number of processes in the computer at the time of data collection. This is an instantaneous count, not an average over the time interval. Each process represents the running of a program.

\System\Processor Queue Length
Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of fewer than 10 threads per processor is normally acceptable, depending on the workload.

\System\System Up Time
System Up Time is the elapsed time (in seconds) that the computer has been running since it was last started. This counter displays the difference between the start time and the current time. Monitor this counter for reductions over time, which could indicate that the system is periodically restarting due to a problem in the configuration, the application, the operating system, or elsewhere.

\System\Threads
Threads is the number of threads in the computer at the time of data collection. This is an instantaneous count, not an average over the time interval. A thread is the basic executable entity that can execute instructions in a processor.
Example of Collecting Data from Multiple Compute Nodes

To collect data from several nodes into one Performance Monitor counter log, you can specify the node names like this:
<PARAM NAME="Counter00001.Path" VALUE="\\Node1\Processor(_Total)\% Processor Time"/> <PARAM NAME="Counter00002.Path" VALUE="\\Node2\Processor(_Total)\% Processor Time"/>
This example assumes the computer names of the two nodes are Node1 and Node2.
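Alternatively, as a sketch that is not part of the original HTML example, the same remote counters can be collected from the head node with the built-in logman tool; the node names, collection name, sample interval, and output path below are placeholders:

rem Create and start a counter log that samples both nodes every 2 seconds
logman create counter HpcNodeCpu -c "\\Node1\Processor(_Total)\% Processor Time" "\\Node2\Processor(_Total)\% Processor Time" -si 2 -o C:\PerfLogs\HpcNodeCpu
logman start HpcNodeCpu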
<PARAM NAME="Counter00009.Path" VALUE="\Network Interface(Intel[R] PRO_1000 PM Network Connection - Virtual Machine Network Services Driver)\Packets/sec"/> <PARAM NAME="Counter00010.Path" VALUE="\Paging File(_Total)\% Usage"/> <PARAM NAME="Counter00011.Path" VALUE="\Paging File(_Total)\% Usage Peak"/> <PARAM NAME="Counter00012.Path" VALUE="\Process(vulcan_solver)\% Processor Time"/> <PARAM NAME="Counter00013.Path" VALUE="\Processor(_Total)\% Interrupt Time"/> <PARAM NAME="Counter00014.Path" VALUE="\Processor(_Total)\% Processor Time"/> <PARAM NAME="Counter00015.Path" VALUE="\Processor(_Total)\Interrupts/sec"/> <PARAM NAME="Counter00016.Path" VALUE="\Processor(0)\% Interrupt Time"/> <PARAM NAME="Counter00017.Path" VALUE="\Processor(0)\% Processor Time"/> <PARAM NAME="Counter00018.Path" VALUE="\Processor(0)\Interrupts/sec"/> <PARAM NAME="Counter00019.Path" VALUE="\Processor(1)\% Interrupt Time"/> <PARAM NAME="Counter00020.Path" VALUE="\Processor(1)\% Processor Time"/> <PARAM NAME="Counter00021.Path" VALUE="\Processor(1)\Interrupts/sec"/> <PARAM NAME="Counter00022.Path" VALUE="\System\Context Switches/sec"/> <PARAM NAME="Counter00023.Path" VALUE="\System\Processes"/> <PARAM NAME="Counter00024.Path" VALUE="\System\Processor Queue Length"/> <PARAM NAME="Counter00025.Path" VALUE="\System\System Up Time"/> <PARAM NAME="Counter00026.Path" VALUE="\System\Threads"/> <PARAM NAME="CounterCount" VALUE="26"/> <PARAM NAME="UpdateInterval" VALUE="2"/> <PARAM NAME="SampleIntervalUnitType" VALUE="1"/> <PARAM NAME="SampleIntervalValue" VALUE="2"/> </OBJECT> </BODY> </HTML>