Charlie Russel
Microsoft MVP for Windows Server; author, Microsoft Windows Server 2008 Administrator's Companion (Microsoft Press, 2008)
Eric Lantz
Microsoft Program Manager, High Performance Computing Team
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication. This White Paper is for informational purposes only. It is up to each reader to carefully consider the statements and recommendations in this White Paper and determine if the guidance is appropriate for his or her unique technical situation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT. Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property. 2008 Microsoft Corporation. All rights reserved. Microsoft, Internet Explorer, SQL Server, Windows, and Windows Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Contents

Introduction
Criteria for Evaluating Performance
Categories of Parallel Applications
    Message Intensive
    Embarrassingly Parallel
    SOA Applications
    Other Considerations
General Performance Tuning Tools and Methods
    NetworkDirect vs. TCP/IP
    GigE vs. Specialty Networking
    InfiniBand and the Open Fabrics Alliance
    Other IB Tricks
    Memory and CPU Constraints
    Affinity
    Large Clusters
Tuning for Messaging-Intensive Applications
    Heavy Messaging
    Latency-Sensitive Messaging
    Setting the MPI Network
    Performance Tips for Message-Intensive Applications
    MS-MPI Shared Memory
    Simple Multipurpose Daemon
    Typical Microbenchmark Network Performance
Tuning for Embarrassingly Parallel Applications
Performance Tuning and Measurement Tools
    Using Built-In Diagnostics in the Admin Console
    Using MPIPingPong
    MPIPingPong Examples
    Using NetworkDirect Pingpong
Performance Measurement and Tuning Procedure
Conclusion
References
Appendix A: Performance Counters for Windows HPC Server 2008
    Recommended Counters to Monitor When Assessing Cluster Performance
    Example of Collecting Data from Multiple Compute Nodes
Appendix B: Performance Console Launcher HTML Example
Introduction
This white paper discusses the factors that can affect the performance of your Windows HPC cluster. Its goal is to help you identify the interconnect hardware to choose for your cluster and to explain how to tune that interconnect and software stack for your application and needs. It focuses on the specifics of tuning and configuring Windows HPC Server 2008. The paper doesn't cover every possible scenario (that would require a book, not a white paper), but it will help you identify what kind of HPC cluster you have and concentrate your performance-tuning efforts (and resources) on the areas likely to be most effective for your type of cluster. This paper does not tell you what specific models, processors, or brands of computer to buy. It does, however, make specific recommendations about:
- Criteria to include in your purchasing decisions.
- Performance testing tools and methods.
- Specific configurations for general execution cases.

Even though the price of astonishing amounts of computing power has come down substantially, it is still important to spend your computing resources wisely. For example, it makes little sense to spend extra money on high-speed networking gear when your application doesn't do much network communication but is RAM-constrained. By understanding your application and where the performance bottleneck is, you can make intelligent decisions about how to improve performance.
Criteria for Evaluating Performance

The best predictor of cluster performance is testing that matches your own workload:

- Your application. Testing with the actual application or suite of applications that you use is the best way to know what the actual performance is. The computing usage pattern of your application is a critical component of performance.
- Your data. Using the actual data, or a subset of that data, is the best way to measure your application's performance. By testing your application's performance with data that closely resembles the data sets in actual cluster usage, you can better predict the ultimate real-world performance.
- Your equipment. The best test of whether performance meets your needs is to test against your equipment, preferably in your own environment. Network topology, driver choices, and varying hardware models all affect the performance you see in actual cluster usage.

Finally, performance evaluation is not a one-time event, but an ongoing process. Your initial evaluation serves as a baseline for evaluating the state of your system and the effects of hardware and software changes. So, does all this mean that benchmarks are useless? Of course not. By carefully choosing the correct benchmarks for your type of application and data, and then intelligently interpreting the results, you can get very useful preliminary indications of what your system's performance is and how to spend your money to improve it. The key is to use the right benchmarks for your type of system and application.
Categories of Parallel Applications

Among applications that are message intensive, there are those that pass lots of small messages and those that pass large messages or data sets. Windows HPC Server 2008 also supports a new type of application: service-oriented architecture (SOA) applications, which use Windows Communication Foundation (WCF) brokers to communicate. This leaves us with four types of parallel applications to tune for performance, as described below.

Message Intensive

- Latency bound. Each node's assigned chore is highly dependent on the work being done on other nodes, and messages between nodes pass control and data often, using lots of small messages for synchronization. The end-to-end latency of the network, rather than data capacity, is the critical factor.
- Bandwidth bound. Large messages pass between nodes during runtime, or large data sets or result sets are staged to nodes before and after runtime. Here, the absolute bandwidth of the networking is the critical factor.

Embarrassingly Parallel

Each node processes data independently of other nodes, and little message passing is required. The total number of nodes and the efficiency of each node are the critical factors affecting performance.
SOA Applications

One or more WCF brokers control communications between the cluster, the job scheduler, and the interactive user. SOA applications tend to be CPU-constrained on the WCF broker nodes.

Other Considerations

In addition to these broad categories of cluster applications, you should take two additional considerations into account when choosing your cluster hardware and software:

- Application scaling. The ability of your application to scale to the number of processors available.
- Scheduler performance. The ability of the scheduler to handle the number of jobs required by your application.

Application Scaling

Application scaling refers to the maximum number of processors you can apply to a single job or problem and still reduce the solution time. Applying more processors than the application's scaling limit can actually increase the solution time. The ability of an application to scale to many processors is affected by the cluster hardware and the operating system; however, many applications in which units of work are interdependent (applications that pass messages) also have scaling limits imposed by software architecture and implementation, mathematics, physics, or all of the above. Work with your application's vendor to determine whether there are practical limits on your application's ability to scale to many processors. If so, get guidance on scaling limits related to your use of the application.

Scheduler Performance

Some HPC applications run multiple jobs, each of very short duration. In applications such as these (often found in the financial sector, for example), the performance of the job scheduler becomes quite important, and the new Windows HPC Server 2008 Job Scheduler is designed to handle this demanding usage. For more information on factors that affect Job Scheduler performance, see Windows HPC Server 2008 Head Node Performance Tuning, available at http://go.microsoft.com/fwlink/?LinkID=132962.
General Performance Tuning Tools and Methods

The following list summarizes the major cluster design considerations, the factor that typically limits performance in each case, and recommendations for addressing it.

Number of cores per CPU
- Limiting factor: Saturation of the memory bus. (Cores waiting for a memory response aren't doing anyone any good.)
- Recommendation: Depends on the specific characteristics of your application's memory access patterns, the hardware limitations of your CPU architecture, and your memory architecture, cache size, and speed. There are so many variables related to an application's hardware utilization that testing is the best way to gauge whether your application scales well to multiple cores on specific hardware. Consider purchasing two or three compute nodes to test your application with different process placement patterns, in addition to testing your application on various compute node configurations. Examining solution times for different process placement patterns on a single node will provide insight into the processor and memory needs of your specific application. Process placement patterns should include loading all the cores, loading one core per processor chip, loading one core per NUMA unit, and so forth.

RAM per node
- Limiting factor: Paging to disk when the working data set exceeds available RAM.
- Recommendation: Size your RAM to the working data set you expect the cluster to handle by the time of your next major upgrade, to avoid paging to disk. If that is not possible, ensure the out-of-core solver/computations in your commercial application are using the correct RAM size.

Network latency (embarrassingly parallel applications)
- Limiting factor: Not applicable.
- Recommendation: Gigabit Ethernet (GigE) minimum.

Network latency (message-intensive applications)
- Limiting factor: The delay in the delivery of messages from one node to another (network latency) is so great that it causes your application's computations to wait for data or control messages.
- Recommendation: Higher speed networking (InfiniBand, Myrinet, 10 GigE, and others) for latency-sensitive applications.

Network bandwidth
- Limiting factor: Large data sets staged to each node before execution; large data sets collected from each node after execution; large messages (>16 kilobits per message) in a message-intensive application.
- Recommendation: Higher bandwidth networking: Ethernet or GigE with offload capability (TCP/IP Offload Engine and the like); InfiniBand, Myrinet, and others.

Storage subsystem
- Limiting factor: When using or creating large data sets, the speed of cluster node local disks can be a limiting factor. On the head node, disk access for Microsoft SQL Server can be the primary limiting factor for clusters with high rates of job submission or configuration changes.
- Recommendation: Parallel file system attached to one or more compute nodes. Add additional drives to compute nodes to host local copies of data sets. Use a dedicated staging server with wide, fast RAID arrays to reduce load on the head node. Use Distributed File System (DFS) across multiple staging servers. Use SAS, Fibre Channel, or SCSI arrays; more small disks are faster than a few large disks.
NetworkDirect vs. TCP/IP

The NetworkDirect architecture allows a Message Passing Interface (MPI) application to use the most efficient network stack for the networking hardware available in the cluster, with performance comparable to custom hardware-specific MPI libraries, while keeping the application independent of the details of the hardware. With NetworkDirect, the Microsoft Message Passing Interface (MS-MPI) library uses Remote Direct Memory Access (RDMA) to bypass Windows Sockets and the traditional TCP/IP stack, and interfaces directly with the hardware-specific NetworkDirect provider, which writes directly to the underlying network hardware. Not only is the NetworkDirect architecture much faster than conventional TCP-based networking, but it also has the distinct advantage of hiding hardware differences from the MS-MPI library. Applications that write to the MS-MPI stack are shielded from changes in
the underlying network hardware, making it easier and more cost-effective to upgrade your networking hardware after the initial cluster purchase, if necessary for performance reasons. Many other MPI stacks write directly to the application programming interface (API) of the networking hardware. This approach limits the ability to change networking hardware later and provides little or no performance advantage over NetworkDirect. Windows HPC Server 2008 will work with any Windows-compatible MPI stack. For extremely network-sensitive applications you should choose a NetworkDirect solution, or, if hardware independence is not an issue, you can choose a hardware-specific MPI stack such as the Myricom MPI stack for Myrinet (source code):
- For Myrinet GM: http://www.myri.com/scs/download-mpichgm.html
- For Myrinet MX (newest): http://www.myri.com/scs/download-mpichmx.html

Alternative MPI stacks that work on multiple network fabrics include:
- Hewlett-Packard HP-MPI: http://h21007.www2.hp.com/dspp/tech/tech_TechDocumentDetailPage_IDX/1,1701,1238,00.html#new2
- Intel MPI: http://www.intel.com/cd/software/products/asmo-na/eng/cluster/mpi/308295.htm
GigE vs. Specialty Networking

Note: Applications tend to become more sensitive to network latency as the number of compute nodes increases. As you increase the size of your cluster, you might need to improve your networking performance.

The baseline networking option for typical off-the-shelf Windows HPC Server 2008-based clusters is a basic GigE network adapter. Some systems include network adapters that have TCP or other offload options that perform network operations on the network adapter rather than on the host computer's CPU. Others include network adapters that support RDMA, enabling one computer to directly manipulate memory on a second computer, thus reducing the workload for the host's CPU. Several steps can help improve GigE networking performance:
- Experiment with disabling CPU interrupt moderation if your GigE network adapter includes this feature.
- Tune performance parameters for GigE networking, including TCP Chimney Offload, receive-side scaling (RSS), and jumbo frames (see the commands after this list). For details, see the Performance Tuning Guidelines for Windows Server 2008 white paper at http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx.
- Use high-quality network switches or routers that maintain full line speed even when many switch ports are in simultaneous use. Poor-quality switches and cabling can significantly slow an application's solution time, as the networking stack repeatedly retries erroneous transmissions.
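As a rough illustration of the second item, the global TCP parameters mentioned above can be inspected and changed from an elevated command prompt with netsh. Which settings apply depends on your adapter and driver, and jumbo frames are usually configured in the adapter driver's Advanced properties rather than through netsh, so treat the following as a sketch, not a recommendation:

rem Show the current global TCP settings (includes the chimney offload and RSS state)
netsh int tcp show global
rem Example only: disable TCP Chimney Offload and enable receive-side scaling
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=enabled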
Many GigE network adapters have a CPU-interrupt moderation feature in their drivers that batches network interrupts to reduce the load on the CPU. This feature does reduce CPU overhead, thereby improving overall throughput for some workloads, but it can cause large variances in effective network bandwidth and, as anecdotal evidence suggests, may increase network latency by 50 to 1,000 percent. CPU interrupt moderation, if supported, can be enabled or disabled on the Advanced tab of the Properties sheet of the adapter, as shown in Figure 3.

Note: This particular vendor's driver calls the feature Interrupt Moderation. Each vendor has a slightly different name for the feature, but all perform similar functions.

Tip: To change the settings for identical network cards across a cluster, identify the registry changes used by the network card, and then use clusrun with setx to read and set those values across the cluster.
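For example, a minimal sketch of pushing such a change cluster-wide might use clusrun with reg.exe (shown here instead of setx because the setting lives under the adapter driver's registry key). The key path, instance number, and value name below are placeholders only; look up the actual key and value your adapter driver uses before changing anything:

rem Placeholder path and value name; substitute the key your adapter driver actually uses
clusrun /all reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002BE10318}\0001" /v *InterruptModeration /t REG_SZ /d 0 /f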
InfiniBand and the Open Fabrics Alliance

For details on tuning Winsock Direct, see the version of this white paper for Windows Compute Cluster Server 2003 at http://www.microsoft.com/Downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&displaylang=en.

InfiniBand (IB) can be used in three modes:
- IB with NetworkDirect provider. Provides the lowest possible latency for MS-MPI applications and lowers CPU utilization by bypassing the operating system's TCP stack.
- IB with Winsock Direct (WSD) provider. Provides low latency for sockets-based applications and lowers CPU utilization by bypassing the operating system's TCP stack.
- IP-over-IB. Uses the TCP/IP stack and provides lower latency and higher bandwidth than GigE, but with reduced performance when compared to NetworkDirect. Use IP-over-IB when you need network traffic routed over dissimilar networks or when you need TCP functionality for your cluster.

No matter which mode you use, you need one or more drivers. Although IB drivers for Windows are available from several sources, we suggest drivers based on those of the OpenFabrics Alliance (http://www.openfabrics.org). Microsoft has worked closely with Mellanox Technologies (http://www.mellanox.com/content/pages.php?pg=windows_hpc) and other OpenFabrics members as they developed and tuned the open-source code used in these drivers for Windows operating systems. The OpenFabrics Windows drivers (WinOF) include a miniport driver for use with IP-over-IB, a Winsock Direct provider for maximum performance for sockets-based applications, and a NetworkDirect provider for optimal MS-MPI performance, in addition to support for the native IB interface called InfiniBand Access Layer (IBAL). The OpenFabrics drivers for Windows are posted at http://www.openfabrics.org/downloads/WinOF/v2.0/. Driver bugs can be reported at http://openib.org/bugzilla/buglist.cgi?query_format=specific&order=relevance+desc&bug_status=__open__&product=OpenFabrics+Windows&content.
Other IB Tricks
Other useful IB tips for improving performance and tuning the IB stack include the following:

Determine your brand of IB network adapter (HCA) without opening the computer. Use the vstat utility and check the first 3 bytes of the node_guid against the list of vendor organizationally unique identifiers (OUIs) at http://standards.ieee.org/regauth/oui/index.shtml.

Note: The vstat command also gives you lots of other useful information about the IB configuration.

Ensure the IB adapter is enabled on all nodes. When you count the adapters across the cluster (for example, with the devcon commands below), the count should match the number of nodes.
Determine the number of PCI devices found on each ready node as follows:
clusrun /readynodes \\headnode\share\devcon findall pci\* | find "matching"
Determine the number of Mellanox cards found across all nodes as follows:
clusrun /all \\headnode\share\devcon findall pci\* | find /c "Mellanox"
Check IB driver and firmware versions with the firmware tools from Mellanox:
- Firmware download: http://www.mellanox.com/content/pages.php?pg=firmware_download
- Firmware tools for Windows: http://www.mellanox.com/content/pages.php?pg=firmware_HCA_FW_update
If your network runs slower than expected, the first thing to check is the link speed. IP-over-IB reports the link speed and, if you have a bad cable, the link speed will show up as 2.5 gigabits per second rather than 10 gigabits per second. If it's 2.5 gigabits per second, switch cables. A second possibility, though less likely, is that the switch port is having problems and can't support 10 gigabits per second. To verify, try changing to another port. If that solves the problem, contact your switch vendor.
Memory and CPU Constraints

Two techniques that can help when memory or CPU is the constraint are:

- Switching off hyperthreading for CPU-intensive applications. Hyperthreading enables utilization of unused CPU cycles by pushing several processes or threads through a single processor. However, HPC processes typically consume all available CPU cycles, so pushing many of them through a single processor often lengthens solution time as the processes contend for time and the CPU context-switches between them. Hyperthreading can usually be disabled in the BIOS.
- Setting processor affinity for applications. Processor affinity causes the operating system scheduler to always assign a particular process to a particular CPU (or CPU core).

A useful white paper on monitoring memory usage, written by a Microsoft MVP, can be found at http://members.shaw.ca/bsanders/WindowsGeneralWeb/RAMVirtualMemoryPageFileEtc.htm.

Affinity

The scenarios where setting processor affinity can significantly improve performance are fairly easy to identify: programs doing large amounts of parallel computation with relatively little MPI communication. The version of MS-MPI included in Windows HPC Server 2008 enables setting processor affinity for each of the ranks. Use either the -affinity mpiexec command-line switch or set the mpiexec environment variable MPIEXEC_AFFINITY=[0|1] (a sketch follows at the end of this section).

Note: Because the MPIEXEC_AFFINITY environment variable must be read before mpiexec starts, the preferred method is to set it as an environment variable on the task.

Using the -affinity switch, the affinity for each process is set to a single core. If not all cores on a node are used, msmpi.exe will automatically assign cores that are the furthest away (cache-wise). This means that when allocating two processes on a dual-core, two-socket computer, msmpi.exe will set the affinity mask for each process to a single core on a different socket. For a task using four processes on a quad-core, dual-socket node where two cores share an L2 cache, msmpi.exe will set the affinity mask for each process to a single core on a different L2 cache boundary. Thus each process gets the whole L2 cache to itself. In some situations, this approach results in reduced solution times, presumably from less memory paging. How effective this is depends on your application and node hardware. However, there are other situations where setting the affinity can actually lead to suboptimal utilization of shared memory and degraded performance. Some customers with OpenMP or threaded sections in their applications have seen reduced performance with the -affinity switch. Two applications where affinity can create problems are Abaqus and ANSYS Mechanical; for these, turning on affinity is the wrong choice.
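A minimal sketch of both approaches follows. Here myApp.exe is a placeholder for your own MPI application, and the batch-file variant assumes that the batch file is what the job's task launches, so the variable is set before mpiexec starts:

rem Set affinity with the mpiexec switch
mpiexec -affinity myApp.exe
rem Or set the environment variable in the batch file the task runs, before mpiexec starts
set MPIEXEC_AFFINITY=1
mpiexec myApp.exe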
Large Clusters
When deploying large clusters of hundreds of nodes, you can run into connection failures in MPI jobs or have problems when a large number of nodes connect to the same share. To help avoid these problems, configure your cluster as follows:
- The head node is not a compute node.
- Compute nodes are isolated from the public network.
- The file share used by compute nodes to start executables is not hosted on the head node.
- Use Microsoft SQL Server Standard Edition instead of Microsoft SQL Server Express Edition.

For more details on planning and implementing large clusters, see the white paper Windows HPC Server 2008 Head Node Performance Tuning at http://go.microsoft.com/fwlink/?LinkID=132962. Also of interest is the Windows HPC Server 2008 Top 500 white paper at http://go.microsoft.com/fwlink/?LinkID=132964.
Tuning for Messaging-Intensive Applications

Heavy Messaging
Because heavy messaging applications transfer large amounts of data with each message, it's important to design your cluster for maximum bandwidth on the MPI network. You can get 10 GigE over either fiber or copper cabling, and specialty networks such as InfiniBand (10 gigabits, 20 gigabits, and soon higher), Myrinet (10 gigabits), and others meet very high bandwidth requirements. Regardless of the networking type used, selecting routers and switches that can sustain full port-to-port bandwidth when all ports are in use generally shortens application solution times. As described earlier, many GigE drivers include a CPU interrupt moderation feature to reduce the CPU load. This feature should be turned off to provide additional bandwidth at the expense of CPU usage on the compute nodes.
Latency-Sensitive Messaging
Latency-sensitive applications are where high-performance networking interfaces become critical for cluster performance. Low-latency communications on the Windows Server 2008 operating system are achieved with a NetworkDirect provider and specialty networking gear such as InfiniBand or Myrinet.

Setting the MPI Network

Use the Windows HPC Server 2008 Cluster Manager to assign the cluster's highest-performing network as the MPI network. Doing so sends all MPI messaging traffic over that network, though some MPI management traffic still goes over the cluster's private network adapter. You can accomplish the same task at the command prompt by setting the MPICH_NETMASK environment variable in the mpiexec command for the job. A typical example might be:
mpiexec -env MPICH_NETMASK 172.16.0.0/255.255.0.0 myApp.exe
The netmask value takes the form network_address/subnet_mask. In the example above, the network addresses are 172.16.yyy.zzz and the subnet mask is 255.255.0.0.
Performance Tips for Message-Intensive Applications

You can tune MS-MPI for message-intensive applications by setting MPICH environment variables as described in Table 1 and by setting processor affinity using MPI variables as described in Table 2. The MPICH environment variables are visible to the launched application and affect its execution. They are passed with mpiexec using the -env, -genv, or -genvlist command-line options as follows:

mpiexec -env VARIABLE SETTING -env OTHERVARIABLE OTHERSETTING <command line>
Useful MPICH environment variables include the following (a usage sketch follows the list):

- MPICH_DISABLE_SOCK
- MPICH_NETMASK
- MPICH_PROGRESS_SPIN_LIMIT
- MPICH_SHM_EAGER_LIMIT
- MPICH_ND_EAGER_LIMIT
- MPICH_ND_ENABLE_FALLBACK
- MPICH_ND_ZCOPY_THRESHOLD. Sets the message size above which to perform zcopy transfers. The default of 0 uses the threshold indicated by the NetworkDirect provider; a value of -1 disables zcopy transfers.
- MPICH_ND_MR_CACHE_SIZE. Sets the size in megabytes of the NetworkDirect memory registration cache. The default is half of physical memory divided by the number of cores.
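As a sketch only (the threshold value and myApp.exe below are arbitrary placeholders, not recommendations), any of these variables can be passed on the mpiexec command line with -env, as described above:

rem Example: override the zcopy threshold for a single run
mpiexec -env MPICH_ND_ZCOPY_THRESHOLD 65536 myApp.exe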
MS-MPI Shared Memory

MS-MPI uses shared memory to communicate between processes on the same node, and network queues for communication with other nodes. Disabling the use of shared memory can improve bandwidth consistency when an application runs across many nodes. Doing so prevents MS-MPI from alternating between shared memory (used for communication between processes on the same node) and network queues (used for communication among nodes) as it polls for incoming data. Thus, MS-MPI polls a single incoming queue, increasing the consistency of measured bandwidth. This results in a smoother bandwidth-versus-message-size curve, such as those produced by the Pallas microbenchmark. However, shared memory latency is an order of magnitude lower than that of network communications; therefore, in most situations it is best to leave it enabled on systems with many processors. Ultimately, the only way to know for certain whether disabling shared memory is a good choice for you is to experiment with your particular application, problem size, and hardware. The Microsoft HPC Server team has made significant strides in improving this part of MS-MPI's design. This is another example of the sometimes tenuous connection between microbenchmark data and actual application solution times.
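For such an experiment, a minimal sketch follows. It assumes your MS-MPI build honors an MPICH_DISABLE_SHM environment variable to switch shared-memory communication off; that variable name is an assumption here, not something this paper confirms, so check your MS-MPI documentation, and treat myApp.exe as a placeholder:

rem Baseline run with shared memory enabled (the default)
mpiexec myApp.exe
rem Comparison run with shared memory disabled, assuming MPICH_DISABLE_SHM is supported
mpiexec -env MPICH_DISABLE_SHM 1 myApp.exe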
Simple Multipurpose Daemon
The simple multipurpose daemon (SMPD) also uses environment variables to control the behavior of spawned applications. Table 3 describes some useful variables you can use in batch or script files that start your HPC applications on the compute nodes. See the Setting Affinity at the Command Line section below for an example of using SMPD environment variables.
This variable is automatically set by Windows for each process and indicates the total number of processor cores on the compute node.

Other MS-MPI Tips

MS-MPI is identical to MPICH2 in most respects, but MS-MPI provides more secure execution: it does not pass unencrypted user credentials over the network, and it integrates with the Windows HPC Server 2008 Job Scheduler. For some helpful tips on running MPI jobs using MS-MPI, see the Microsoft HPC Team applications blog at http://blogs.technet.com/WindowsHPC/.
Typical Microbenchmark Network Performance

Important: The performance data in Table 4 represents typical ranges only and is not maximum performance data. It also is not necessarily representative of how your application will perform on your cluster.
Notes The measurements in Table 4 were taken using the Pallas ping-pong benchmark on midrange servers with the network adapters attached through a PCIe bus. Nodes were connected by a network switch, not cross-over cabling.
Performance Tuning and Measurement Tools

Using MPIPingPong
MPIPingPong can automatically ping between each node and its neighbor in a ring, or over every link between every node. The output presents average latency and bandwidth, provides individual latency and bandwidth, and identifies unresponsive links between nodes. The mpipingpong command-line tool gives detailed information and summaries of the cluster's performance, and can generate full latency and throughput curve data for packet sizes up to 4 MB. The basic usage scenario for mpipingpong is to submit a job with nodes as the requested resources, and run mpiexec mpipingpong on every node. For example, if the cluster has N nodes, the command line is:

job submit /numnodes:N mpiexec mpipingpong

The output is printed to stdout in XML format, giving the overall, per-node, and per-link information. You can run mpipingpong on any set of nodes in the cluster by specifying the appropriate resources at job submission time. Make sure that you allocate only one ping-pong process per node; otherwise, mpipingpong will complain and exit. You can, however, use the -m command-line option to allow mpipingpong to use multiple cores on the same node if desired. By default, the test runs in tournament mode. Given N nodes, the test is divided into N - 1 rounds. During each round, nodes are paired off against each other and send packets to each other simultaneously. This is fast and complete, but is likely to provide inconsistent
results for larger packet sizes due to network switch oversubscription. Two additional test modes are supported:
- Serial mode: all pairs of nodes are tested, and while each pair talks, all others remain silent.
- Ring mode: each node talks only to its neighbors. This means node i will exchange packets with nodes i + s and i - s, where s is a step-size parameter with s = 1 by default, while the rest remain silent.

Serial tests are slow, particularly for large clusters and packet sizes, whereas ring tests are incomplete because they don't test all the links. To run the test in serial mode, use the -s command-line option; to run it in ring mode, use the -r option, and optionally add -rs s to specify the ring step size s. By default, mpipingpong tests network latency only, using 4-byte packets. You can also test network throughput using 4-MB packets via the -pt command-line option, test both latency and throughput in the same run via the -pb option, or generate a full set of latency and throughput data as a function of packet size (using packet sizes ranging from 4 bytes to 4 MB, incremented by powers of two) via the -pc option. You can also ignore the default packet sizes and supply a set of your own by specifying the -p option multiple times on the command line. Each -p option must be followed by an integer p or a pair of colon-separated integers p:n, where p is the packet size in bytes and n is the number of iterations for that packet size (that is, the number of packets of that size that will be exchanged between each pair of nodes during the test). If n is not specified, the test will pick an appropriate number of iterations based on packet size p. By default, mpipingpong outputs the complete XML results to standard output at the end of the run, and prints progress information (as a percentage of test completed) to standard error as the run takes place. Use the -oa option to print abbreviated XML test results to standard output: this prints the overall summary and per-node summaries, but skips the per-link details. Use the -ob option to print an even briefer XML version of the output, showing the overall summary only. The -op option displays full progress information, printing real-time latency and throughput results to standard error, while the -oq option suppresses the printing of progress information altogether. Finally, you can explicitly print test results to an XML file, rather than standard output, by providing a file name parameter at the end of the mpipingpong command line. If a file name is provided in conjunction with the -oa or -ob switches, the full test results will be printed to the file, and the abbreviated version will go to standard output.
MPIPingPong Examples
All the examples below assume that the resources for the job have been specified, and stdout and stderr have been redirected to desired locations.
Run mpipingpong in serial mode, testing both latency and throughput, sending abbreviated XML results to stdout, real-time plain-text results to stderr, and full results to \\mymachine\myshare\pingpong.xml:

mpiexec mpipingpong -s -pb -oa -op \\mymachine\myshare\pingpong.xml
Run mpipingpong in ring mode, with ring step size 4, generating data for packet sizes ranging from 4 bytes to 4 MB, and printing full XML output to stdout and real-time, plain-text output to stderr:
mpiexec mpipingpong -r -rs 4 -pc -op
Run mpipingpong in tournament mode (the default), using a set of custom packet sizes and iteration counts: 8 bytes for the default number of iterations, 4 KB for 20 iterations, and 512 KB for 6 iterations. Do not print output to stdout or progress information to stderr; output everything to \\mymachine\myshare\pingpong2.xml:
mpiexec mpipingpong -p 8 -p 4096:20 -p 524288:6 -oq \\mymachine\myshare\pingpong2.xml
Using NetworkDirect Pingpong

Important: The server continues to run after the client finishes. Use CTRL+C to cancel the server.
Performance Measurement and Tuning Procedure

3. Verify that the scheduler can address all the cluster's compute nodes by running this command:
clusrun /readynodes hostname & ipconfig /all
This command returns a list of all the active nodes on the cluster and their network configurations. 4. If the cluster has an InfiniBand network, verify the NetworkDirect provider is activated on all compute nodes by running this command:
clusrun ndinstall -l
5. Verify that you can run a simple MPI job across the cluster using a simple test application such as this one:
job submit /numprocessors:128 /workdir:\\headnode\share\ mpiexec batchpi.exe 100000
The command assumes you have 128 processors in your cluster, a file share on the head node named \\headnode\share\, and a simple test application like Batchpi.exe (available from the HPC Community site at http://www.windowshpc.net/Resources/Programs/batchpi.zip).
6. Verify point-to-point MPI performance for the cluster. Use a test program like mpipingpong that can automatically ping between each node and its neighbor in a ring (use the allnodes argument) or over every link between every node (use the alllinks argument). The output presents average latency and bandwidth, provides individual latency and bandwidth, and identifies unresponsive links between nodes.
7. When measuring latency and bandwidth on GigE:
a. To help reduce latency and bandwidth variation and dramatically reduce the minimum latency, switch off CPU interrupt moderation on the network adapter driver. This ensures incoming and outgoing network data is processed immediately.
b. Make sure the two test nodes are using the same network switch to avoid hops between switches, and use high-quality switches and cabling.
c. Tune performance parameters for GigE networking, including TCP Chimney Offload, RSS, and jumbo frames. For details, see the Performance Tuning Guidelines for Windows Server 2008 white paper at
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fde-d599bac8184a/Perf-tun-srv.docx.
8. If and only if the network gear is using an RDMA Winsock Direct provider, set the MS-MPI send buffer size to 0 (by setting the environment variable MPICH_SOCKET_SBUFFER_SIZE=0). Doing so eliminates an extra data copy on send operations so you get good latency numbers, but you overburden the CPU in the process. If you use this setting with a non-Winsock Direct or NetworkDirect driver, the computer will likely stop responding.
9. Measure total system performance in addition to microbenchmarks like latency and bandwidth.
10. Use Linpack (or similar) to get a total system measurement that is comparable to other systems. Although a useful comparison, Linpack (or similar) does not necessarily model the behavior of your HPC application.
11. Use a set of commercial HPC applications (and their standard benchmark problems) to measure and compare solution times with other systems. Most end users are not interested in microbenchmarks; they're focused on wall-clock solution times.
12. When testing a user's application, complete the following steps (a sketch of steps b and c follows this list):
a. On GigE networking, if the application sends lots of small messages between nodes (a latency-sensitive application), switch off CPU interrupt moderation and additional advanced features of GigE networking, including TCP Chimney Offload, RSS, and jumbo frames.
b. If the application is very bandwidth-bound and the networking gear is using a Winsock Direct provider, experiment with setting the MS-MPI send buffer size to 0. This eliminates an extra data copy on send operations, which increases bandwidth at the expense of higher CPU utilization. Do not attempt this with non-Winsock Direct drivers: the node will stop responding.
c. For applications that are both CPU-intensive and network-intensive, experiment with running fewer processes on each compute node to leave some spare CPU cycles on the node for faster networking. Most HPC applications load the CPU heavily. It makes more sense to try this on a GigE cluster than on InfiniBand, because GigE tends to use more CPU. You can use the MS-MPI -hosts argument to specify the number of processes to run on each node. Windows HPC has example scripts to help you at http://www.microsoft.com/technet/scriptcenter/hubs/HPCS.mspx.
d. Some applications benefit from reduced context switching between processors on a compute node. You can experiment with assigning an application to a specific processor (setting its affinity) on the compute nodes to see if it reduces your application's solution time. Processor affinity can be tagged onto an executable file or set at the Windows command line when an application is started with the Windows Server 2008 start command. See the Affinity section (above) for more detail.
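The following sketch illustrates steps 12b and 12c. The node names, process counts, and myApp.exe are placeholders; the -hosts syntax shown (a node count followed by node-name and process-count pairs) follows the MPICH2-style mpiexec that MS-MPI is derived from, so verify it with mpiexec -help on your own cluster before relying on it:

rem Step 12b: experiment with a zero-length send buffer, on a Winsock Direct network only
mpiexec -env MPICH_SOCKET_SBUFFER_SIZE 0 myApp.exe
rem Step 12c: run only 2 of the 4 cores on each of two nodes to leave CPU headroom for networking
mpiexec -hosts 2 NODE01 2 NODE02 2 myApp.exe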
Conclusion
Windows HPC Server 2008 extends and expands the gains of Windows Compute Cluster Server 2003, which brought supercomputing power to the Windows platform, providing a solution that is easy to use and deploy for building high-performance computing (HPC) clusters on commodity x64 computers. By understanding the type of application you run and how it processes data across the cluster, you can make informed decisions about the specific hardware to specify for your cluster. Understanding where the bottlenecks are in your cluster lets you spend wisely to improve performance.
References
Microsoft HPC Web Site
http://www.microsoft.com/hpc
Microsoft HPC Community Site
http://windowshpc.net/default.aspx
Windows HPC Server 2008 Partners
http://www.microsoft.com/hpc/en/us/partners.aspx
Tutorial from Lawrence Livermore National Lab
http://www.llnl.gov/computing/tutorials/mpi/
Tuning MPI Programs for Peak Performance
http://mpc.uci.edu/wget/www-unix.mcs.anl.gov/mpi/tutorial/perf/mpiperf/index.htm
Winsock Direct Stability Update
http://support.microsoft.com/kb/927620
DevCon Command-Line Utility
http://support.microsoft.com/?kbid=311272
New Networking Features in Windows Server 2003 Service Pack 1
http://www.microsoft.com/technet/community/columns/cableguy/cg1204.mspx
Mellanox MPI Stack and Tools for InfiniBand on Windows
http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family=32&menu_section=34
Myricom MPI stack for Myrinet
http://www.verari.com/software_products.asp#mpi
Intel MPI Library
http://www.intel.com/cd/software/products/asmo-na/eng/cluster/mpi/308295.htm
Designing and Building Parallel Programs
http://www-unix.mcs.anl.gov/dbpp/
Parallel Programming Workshop
http://www.mhpcc.edu/training/workshop/parallel_intro/MAIN.html
What Every Dev Must Know About Multithreaded Apps
http://msdn.microsoft.com/msdnmag/issues/05/08/Concurrency/
RAM, Virtual Memory, Page File, and Monitoring
http://members.shaw.ca/bsanders/WindowsGeneralWeb/RAMVirtualMemoryPageFileEtc.htm
Performance Visualization for Parallel Programs
http://www-unix.mcs.anl.gov/perfvis/download/index.htm
Windows HPC Team Blog
http://blogs.technet.com/WindowsHPC
Building and Measuring the Performance of Windows HPC Server 2008Based Clusters for TOP500 Runs
http://download.microsoft.com/download/1/7/c/17c274a3-50a9-4a55-a5b1-29fe42d955d8/Top%20500%20White%20Paper_FINAL.DOCX
Performance Tuning Guidelines for Windows Server 2008
http://download.microsoft.com/download/9/c/5/9c5b2167-8017-4bae-9fded599bac8184a/Perf-tun-srv.docx
Windows Performance Tools Kit
http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
Tuning Parallel MPI Programs with Vampir
http://tu-dresden.de/die_tu_dresden/zentrale_einrichtungen/zih/forschung/software_werkzeuge_zur_unterstuetzung_von_programmierung_und_optimierung/vampirwindows/dateien/Tuning_Parallel_MPI_Programs_with_Vampir.pdf
Appendix A: Performance Counters for Windows HPC Server 2008

Figure 1A. The Reliability and Performance Monitor user interface

Microsoft TechNet provides extensive information on using the Performance Monitor. An article that provides details on the Reliability and Performance Monitor, Performance and Reliability Monitoring Step-by-Step Guide for Windows Server 2008, is available at http://technet.microsoft.com/en-us/library/cc771692.aspx.
Recommended Counters to Monitor When Assessing Cluster Performance

\Memory\Available MBytes

\Memory\Pages/sec
Pages/sec is the sum of Memory\Pages Input/sec and Memory\Pages Output/sec. It is counted in numbers of pages, so it can be compared to other counts of pages, such as Memory\Page Faults/sec, without conversion. Pages/sec includes pages retrieved to satisfy faults in the file system cache (usually requested by applications) and in non-cached mapped memory files.

\Network Interface(MyNetworkConnectionName)\Bytes Total/sec
Bytes Total/sec is the rate at which bytes are sent and received over each network adapter, including framing characters, where MyNetworkConnectionName is the name of your network adapter. Network Interface\Bytes Total/sec is the sum of Network Interface\Bytes Received/sec and Network Interface\Bytes Sent/sec. Use these counters to watch all network adapters to ensure the traffic you expect is on each one, such as to ensure that MPI traffic really is on the MPI network.

\Network Interface(MyNetworkConnectionName)\Packets Outbound Errors
Packets Outbound Errors is the number of outbound packets that could not be transmitted because of errors, where MyNetworkConnectionName is the name of your network adapter.

\Network Interface(MyNetworkConnectionName)\Packets Received Errors
Packets Received Errors is the number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol, where MyNetworkConnectionName is the name of your network adapter.

\Network Interface(MyNetworkConnectionName)\Packets/sec
Packets/sec is the rate at which packets are sent and received on the network adapter, where MyNetworkConnectionName is the name of your network adapter.

\Paging File(_Total)\% Usage
The amount of the Page File instance in use, in percent. See also Process\Page File Bytes.

\Paging File(_Total)\% Usage Peak
The peak usage of the Page File instance, in percent. See also Process\Page File Bytes Peak.

\Process(MyHpcAppName)\% Processor Time
% Processor Time is the percentage of elapsed time that all threads of the process MyHpcAppName used the processor to execute instructions, where MyHpcAppName is the name of your HPC application.

\Processor(_Total)\% Interrupt Time
% Interrupt Time (_Total) is the time all processors on the system spend receiving and servicing hardware interrupts during sample intervals. This value is an indirect indicator of the activity of devices that generate interrupts, such as the system clock, the mouse, disk drivers, data communication lines, network adapters, and other peripheral devices. These devices normally interrupt the processor when they have completed a task or require attention. Normal thread execution is suspended during interrupts. Most system clocks interrupt the processor every 10 milliseconds, creating a background of interrupt activity.

\Processor(_Total)\% Processor Time
% Processor Time (_Total) is the percentage of elapsed time that the processor spends executing all non-idle threads on the system. This counter is the primary indicator of processor activity, and displays the average percentage of busy time observed during the sample interval.

\Processor(_Total)\Interrupts/sec
Interrupts/sec (_Total) is the average rate, in incidents per second, at which all processors received and serviced hardware interrupts. It does not include deferred procedure calls (DPCs), which are counted separately.

\Processor(0)\% Interrupt Time
Same as % Interrupt Time above, except only for processor 0, where 0 represents the processor being tested.

\Processor(0)\% Processor Time
Same as % Processor Time above, except only for processor 0, where 0 represents the processor being tested.

\Processor(0)\Interrupts/sec
Same as Interrupts/sec above, except only for processor 0, where 0 represents the processor being tested.

\Processor(1)\% Interrupt Time
Same as % Interrupt Time above, except only for processor 1, where 1 represents the processor being tested.

\Processor(1)\% Processor Time
Same as % Processor Time above, except only for processor 1, where 1 represents the processor being tested.

\Processor(1)\Interrupts/sec
Same as Interrupts/sec above, except only for processor 1, where 1 represents the processor being tested.

\System\Context Switches/sec
Context Switches/sec is the combined rate at which all processors on the computer are switched from one thread to another. Context switches occur when a running thread voluntarily relinquishes the processor, is preempted by a higher priority ready thread, or switches between user mode and privileged (kernel) mode to use an Executive or subsystem service.

\System\Processes
Processes is the number of processes in the computer at the time of data collection. This is an instantaneous count, not an average over the time interval. Each process represents the running of a program.

\System\Processor Queue Length
Processor Queue Length is the number of threads in the processor queue. Unlike the disk counters, this counter shows ready threads only, not threads that are running. There is a single queue for processor time even on computers with multiple processors. Therefore, if a computer has multiple processors, you need to divide this value by the number of processors servicing the workload. A sustained processor queue of fewer than 10 threads per processor is normally acceptable, depending on the workload.

\System\System Up Time
System Up Time is the elapsed time (in seconds) that the computer has been running since it was last started. This counter displays the difference between the start time and the current time. Monitor this counter for reductions over time, which could indicate that the system is periodically restarting due to a problem in the configuration, the application, the operating system, or elsewhere.

\System\Threads
Threads is the number of threads in the computer at the time of data collection. This is an instantaneous count, not an average over the time interval. A thread is the basic executable entity that can execute instructions in a processor.
Example of Collecting Data from Multiple Compute Nodes

To collect data from several nodes into one Performance Monitor counter log, you can specify the node names like this:
<PARAM NAME="Counter00001.Path" VALUE="\\Node1\Processor(_Total)\% Processor Time"/> <PARAM NAME="Counter00002.Path" VALUE="\\Node2\Processor(_Total)\% Processor Time"/>
This example assumes the computer names of the two nodes are Node1 and Node2.
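Alternatively, as a sketch that is not part of the original HTML example, the same remote counters can be collected from the head node with the built-in logman tool; the node names, collection name, sample interval, and output path below are placeholders:

rem Create and start a counter log that samples both nodes every 2 seconds
logman create counter HpcNodeCpu -c "\\Node1\Processor(_Total)\% Processor Time" "\\Node2\Processor(_Total)\% Processor Time" -si 2 -o C:\PerfLogs\HpcNodeCpu
logman start HpcNodeCpu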
<PARAM NAME="Counter00009.Path" VALUE="\Network Interface(Intel[R] PRO_1000 PM Network Connection - Virtual Machine Network Services Driver)\Packets/sec"/> <PARAM NAME="Counter00010.Path" VALUE="\Paging File(_Total)\% Usage"/> <PARAM NAME="Counter00011.Path" VALUE="\Paging File(_Total)\% Usage Peak"/> <PARAM NAME="Counter00012.Path" VALUE="\Process(vulcan_solver)\% Processor Time"/> <PARAM NAME="Counter00013.Path" VALUE="\Processor(_Total)\% Interrupt Time"/> <PARAM NAME="Counter00014.Path" VALUE="\Processor(_Total)\% Processor Time"/> <PARAM NAME="Counter00015.Path" VALUE="\Processor(_Total)\Interrupts/sec"/> <PARAM NAME="Counter00016.Path" VALUE="\Processor(0)\% Interrupt Time"/> <PARAM NAME="Counter00017.Path" VALUE="\Processor(0)\% Processor Time"/> <PARAM NAME="Counter00018.Path" VALUE="\Processor(0)\Interrupts/sec"/> <PARAM NAME="Counter00019.Path" VALUE="\Processor(1)\% Interrupt Time"/> <PARAM NAME="Counter00020.Path" VALUE="\Processor(1)\% Processor Time"/> <PARAM NAME="Counter00021.Path" VALUE="\Processor(1)\Interrupts/sec"/> <PARAM NAME="Counter00022.Path" VALUE="\System\Context Switches/sec"/> <PARAM NAME="Counter00023.Path" VALUE="\System\Processes"/> <PARAM NAME="Counter00024.Path" VALUE="\System\Processor Queue Length"/> <PARAM NAME="Counter00025.Path" VALUE="\System\System Up Time"/> <PARAM NAME="Counter00026.Path" VALUE="\System\Threads"/> <PARAM NAME="CounterCount" VALUE="26"/> <PARAM NAME="UpdateInterval" VALUE="2"/> <PARAM NAME="SampleIntervalUnitType" VALUE="1"/> <PARAM NAME="SampleIntervalValue" VALUE="2"/> </OBJECT> </BODY> </HTML>