Table of Contents
Executive summary
Background and motivation
  Structure of HP servers
  Interleaved memory
  Local memory
LORA scope
  Hardware platforms
  Operating system
  Virtual partitioning
  Application workload
  Variability in hardware resources
  When to use LORA
LORA configuration rules
  nPartitions
  vPars
  Integrity Virtual Machines
LORA system administration
  Server Tunables product in the Tune-N-Tools bundle
  loratune command
Benefits
  Performance
  Cost
  Power management
Summary
Glossary
Technical details
  Configuring nPartitions for LORA
    Converting an existing nPartition for LORA
    Creating a new nPartition for LORA
      Determine the size of the nPartition
      Create the nPartition
  Configuring vPars for LORA
    Considerations for memory granule size
    Creating new vPars instances
      Dividing the nPartition into vPars instances of equal size
      Establishing vPars instances by processor and memory requirements
    Modifying existing vPars instances
    Summary of vPars configuration rules
  Advanced tuning
    numa_mode kernel tunable parameter
    numa_sched_launch kernel tunable parameter
    numa_policy kernel tunable parameter
    mpsched command
      Using mpsched to enforce good alignment for Java
  For more information
Executive summary
Locality-Optimized Resource Alignment (LORA) is a framework for increasing system performance by exploiting the locality domains in HP servers with a Non-Uniform Memory Architecture. LORA consists of configuration rules and tuning recommendations, plus HP-UX 11i v3 enhancements and tools to support them. LORA introduces a new mode to supplement the Symmetric Multiprocessing (SMP) mode originally implemented in HP-UX: where the SMP approach treats memory resources in a symmetric manner, LORA turns the locality inherent in NUMA platforms to advantage. For application workloads that exhibit locality of memory reference, systems configured in accordance with LORA will typically see a 20% performance improvement compared to the SMP mode used with interleaved memory.

The advanced power controls in HP servers offer the opportunity for substantial power savings when platform hardware is not fully utilized. Because the power domains generally correspond to the locality domains, LORA configurations naturally mesh with a power conservation strategy.

The body of this white paper contains sections describing background and motivation, scope, configuration rules, and system administration recommendations. The technical details behind these topics appear in the appendices.

LORA was first introduced in September 2008 with Update 3 to HP-UX 11i v3. The major improvements delivered in September 2009 with Update 5 are:

- The new parconfig command makes configuring nPartitions much simpler.
- The procedure for creating well-aligned vPars instances is simpler, and those instances are fully compatible with gWLM dynamic processor migration operations.
- HP now recommends deploying Integrity Virtual Machines in LORA mode.
- LORA mode is now recommended for more application classes.
- There is less need for system administrators to perform explicit tuning, because HP-UX implements heuristics to perform resource alignment automatically.
- There is a new command, loratune, to restore good resource alignment.
Interleaved memory
Interleaved memory (ILM) is a technique for masking the NUMA properties of a system. Successive cache lines in the memory address space are drawn from different localities, making the average memory access latency time more-or-less uniform.
Sometimes interleaved memory is the best technique. ILM yields good performance when memory references are spread across the entire address space with equal probability. This is the case for applications using large global data sets with no spatial locality.
Local memory
If memory is not interleaved, then the natural localities inherent in the structure of the server complex are evident. The processor cores in each locality enjoy fast access to their local memory. The counterpoint is that access to memory in a different locality is slower. When the memory reference pattern places the majority of accesses in local memory, LORA gives a significant performance advantage relative to interleaved memory.
LORA scope
Hardware platforms
LORA applies only to those HP servers that have a Non-Uniform Memory Architecture. Those are the servers built around the sx1000 and sx2000 chip sets. For these servers, the localities are oriented around cells. Local memory is referred to as Cell Local Memory (CLM). Memory performance is better when all the cells in an nPartition have the same amount of memory installed. This is good advice for the 100% interleaved case, because the deepest interleave is possible only when each cell contributes the same amount of memory. Having the same amount of memory on each cell is even more important for LORA: the memory symmetry promotes balanced utilization of the processing resources. Asymmetry of local memory can cause a slight degradation in overall system performance. The Integrity platform is the design center for LORA, and the architecture exploits features specific to that platform. LORA is not supported on HP 9000 (PA-RISC) platforms.
Operating system
The earliest versions of HP-UX assumed a uniform memory architecture and implemented only the SMP mode. HP-UX 11i v2 was the first version to provide support for local memory, but it placed the entire burden of managing local memory on the system administrator and on applications, so we do not recommend LORA with HP-UX 11i v2. Update 3 to HP-UX 11i v3, released in September 2008, was the first version to provide a rich set of mechanisms to support local memory. We recommend that LORA be used with this update or its successors, Update 4 and Update 5.
Virtual partitioning
The two virtual partitioning solutions provided by HP are Virtual Partitions (vPars) and Integrity Virtual Machines. Because these solutions subdivide the physical resources of an nPartition, they present opportunities to exploit locality. The virtualization model of vPars is particularly well-suited to LORA. Version A.05.05 of the T1335DC product, first delivered in Update 4 to HP-UX 11i v3, contains many optimizations to gain additional benefit from local memory. The binding of virtual resources to physical resources in Integrity Virtual Machines is flexible and fluid, yet there are still opportunities to gain performance advantage through resource alignment. HP recommends deploying Integrity Virtual Machines with LORA starting with Update 4.
Application workload
Memory reference patterns are key to system performance on NUMA platforms. Some applications exhibit strong locality of memory reference, others scatter accesses to global data, and some fall in between. Most commercial applications exhibit sufficient reference locality to gain significant benefit by running with LORA. Technical applications that access extremely large data sets in a uniform manner are better suited to interleaved memory and SMP mode. In some cases, the type of the application is not as important as the size of the data set that it references. If an nPartition is devoted to running one single application, and the working set of that application consumes the bulk of the available physical memory, then there are few opportunities to exploit locality. Database management systems are often run in this manner, and so will generally exhibit more predictable behavior when run in SMP mode. By contrast, if an nPartition is running multiple applications or application instances each of which has a working set much smaller than the amount of physical memory, then LORA mode will usually be able to show a benefit by aligning processing resources.
LORA configuration rules

HP-UX 11i supports a variety of partitioning models. The sections that follow explain which of these models are sensitive to the LORA configuration rules.
nPartitions
At the nPartition level, each base cell should be configured with 7/8ths local memory to comply with the LORA configuration rules. The floating cells, if any, always have 100% local memory. The appendix Configuring nPartitions for LORA contains more detail. For nPartitions containing exactly one locality, there is no difference between interleaved memory and local memory, and hence no difference between SMP mode and LORA mode. If a second cell were added via Dynamic nPartitions, then there would be a difference between the two modes.
vPars
Each vPars instance should be composed with 7/8ths local memory and 1/8th ILM. Since the underlying nPartition has this memory ratio, it is straightforward to reflect the same ratio into the vPars instances. It is important that the processor and memory resources assigned to each vPars instance span the minimal set of localities. If a vPars instance must span multiple localities, then the processor and memory resources should be distributed symmetrically across those localities. Aligning I/O resources with the processors and memory is helpful, but it is a second-order effect. The appendix Configuring vPars for LORA explains these points in more detail. HP recommends that vPars instances configured with 7/8ths local memory be operated in LORA mode; this is achieved most easily by leaving the numa_mode parameter at its default value. When vPars is operating in LORA mode, the system manages any CPU migration operations so as to adhere as closely as possible to the configuration rules given above.
loratune command
The loratune command is valuable in LORA mode. The command can be used to restore good resource alignment if it has been disturbed by an event such as:

- terminating a major application
- completing a backup
- a dynamic platform operation such as online cell activation

The simplest way to use the command is to invoke it with no arguments:

loratune

More details are available in the man page.
Benefits
Performance
LORA reduces average memory access latency times in comparison to the interleaved memory mode. The magnitude of the reduction depends on the memory reference pattern and the number of localities in the partition. When processors spend less time waiting for memory references to be satisfied, all aspects of application performance improve. Typically, response times decrease and throughput increases at the same time that processor utilization drops. A rough estimate of the performance benefit is 20%. As Table 1 indicates, the benefit is greater for larger partitions.
Cost
LORA makes processors operate more efficiently, which can be realized as an increase in performance. Alternatively, the increased efficiency can be used to reduce the number of processor cores allocated to an application workload. This reduces the hardware provisioning cost and may also save on the cost of software licenses. Under the rough 20% estimate given above, for example, a workload that requires 12 cores in SMP mode might be served by 10 cores in LORA mode. The LORA configuration guidelines sometimes recommend an increase in the amount of memory, which may offset some of the cost savings due to increased processor efficiency.
Power management
Power management has strong synergy with LORA. By its nature, LORA groups hardware components by their physical locality. These localities often match power domains, which gives opportunities for power savings at times of low hardware utilization. The newest Itanium processors have multiple cores per socket, and have low-power modes. The greatest power savings are realized when all cores in a socket enter the low-power mode, as compared to having single cores in multiple sockets in the low-power mode. LORA tends to group cores by their proximity, increasing the chances that an entire socket can enter low-power mode when an application is experiencing a light load.
Summary
Locality-Optimized Resource Alignment is a framework for improving performance on HP servers with a Non-Uniform Memory Architecture, introduced with HP-UX 11i v3 Update 3 and enhanced in subsequent updates. This paper explains the circumstances in which LORA is beneficial and gives guidelines for deployment in those cases. When LORA is used with commercial applications, performance is about 20% better than the SMP interleaved memory configuration. LORA simplifies server configuration by presenting uniform configuration guidelines. LORA dovetails nicely with power management strategies.
Glossary
Socket: Receptacle on a motherboard for the physical package of processing resources.

Processor core: If the physical package of processing resources includes multiple independent functional entities, each of them is called a processor core.

Core: Same as processor core. The keyword cpu in the vPars commands refers to a core.

CPU: Acronym for Central Processing Unit. The term processor core is preferred.

Cell: The basic physical building block of a system complex. A cell contains processor sockets, memory, and I/O components.

Crossbar: A component of the interconnect fabric that allows the cells in a system complex to communicate with each other.

Locality domain: A set of processors, memory, and I/O system bus adapters identified by the operating system for resource alignment purposes.

Locality: Same as locality domain.

SMP: Acronym for Symmetric Multiprocessor. A model in which all of the processors in a system are equivalent to and interchangeable with each other.

NUMA: Acronym for Non-Uniform Memory Architecture. A hardware platform in which system memory is separated into localities based on memory access latency times for processors.

ILM: Acronym for Interleaved Memory. A technique in which successive cache lines of memory are drawn from different localities.
Technical details
The appendices that follow contain the technical details to supplement the general information presented in the opening sections of this paper. The appendix Configuring nPartitions for LORA explains the steps needed for every deployment of LORA. The appendix Configuring vPars for LORA gives the additional steps needed when vPars is used in a LORA nPartition. Recommendations for fine-tuning workloads are given in the appendix Advanced tuning.
Configuring nPartitions for LORA

Converting an existing nPartition for LORA

The parmodify command is used to change the local memory ratio. For example, for an nPartition built from cells 1, 2, and 3:

parmodify -p 2 -m 1:base:::87.5% -m 2:base:::87.5% -m 3:base:::87.5%

It is necessary to reboot the partition for the change to take effect. The man page for the parmodify(1m) command has the full details.
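After the reboot, one way to confirm the new cell-local memory ratio is with parstatus; a minimal sketch, using cell 1 of this example (the exact output fields vary by firmware and OS version):

# Show detailed information for cell 1, including how much of its
# memory is configured as cell-local (CLM) versus interleaved
parstatus -V -c 1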
Creating a new nPartition for LORA

Determine the size of the nPartition

[Table: recommended memory increase by partition size, with rows for 1 to 12, 13 to 24, 25 to 48, and 49 to 64 sockets]
Note: If memory utilization is below 75%, it is not necessary to increase the amount of memory at all.
Create the nPartition

The parconfig command is used to build the nPartition. For example, the following command creates a partition that requires 24 processor cores:

parconfig -x -O HPUX_11iV3 -X LORA -n mincpus:24 -Z high_performance

The man page for the parconfig(1m) command has the full details.
Configuring vPars for LORA
Please note that the creation commands specify the cell identifiers for the local memory but only a count for the number of processors. The enhancements added to vPars version A.05.05 (the version delivered starting with Update 4) automatically allocate processor cores in close proximity to the memory. For this example, the processor and memory allocations would be as shown in the following diagram:
[Figure: the processor and memory resources of each cell divided in half between two vPars instances]
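For concreteness, here is a sketch of creation commands consistent with dividing each cell in half; the instance names other than vp5 and vp6 (which appear in the I/O examples below) are hypothetical:

# Each instance gets half a cell: 4 cores, 4 GB of interleaved memory
# (mem::4096), and 28 GB of cell-local memory (cell:N:mem::28672)
vparcreate -p vp1 -a cpu::4 -a mem::4096 -a cell:1:mem::28672
vparcreate -p vp2 -a cpu::4 -a mem::4096 -a cell:1:mem::28672
vparcreate -p vp3 -a cpu::4 -a mem::4096 -a cell:2:mem::28672
vparcreate -p vp4 -a cpu::4 -a mem::4096 -a cell:2:mem::28672
vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:3:mem::28672
vparcreate -p vp6 -a cpu::4 -a mem::4096 -a cell:3:mem::28672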
To achieve alignment between processors and memory with the vPars version delivered in Update 3, it was necessary to specify the cell identifiers for the processors to match the local memory. This had the drawback of disallowing gWLM from performing processor deletion operations on the cell-local processors. With the enhancements added in Update 4, gWLM is free to delete any core, and online processor addition operations will select the cores that give the best possible resource alignment.

The creation commands did not specify any I/O allocations. The needed I/O could be added with subsequent commands. For example:

vparmodify -p vp5 -a io:1/0/0/2/0.6.0:BOOT -a io:1/0/0/1/0
vparmodify -p vp6 -a io:1/0/4/0/0.6.0:BOOT -a io:1/0/6/1/0

As a general rule, overall system performance is determined by processor-to-memory alignment, and is less sensitive to I/O location. Most I/O operations are asynchronous, and the operating system and applications employ a wide variety of latency-hiding techniques to make progress while I/O operations are pending. If a particular workload is sensitive to processor-to-I/O alignment, the alignment can be achieved by specifying the I/O location with the create command. For example:

vparcreate -p vp5 -a cpu::4 -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0
vparcreate -p vp6 -a cpu::4 -a io:3/0/4/0/0.6.0:BOOT -a io:3/0/6/1/0

It is also possible to specify both local memory and I/O in the vparcreate command, in which case the processor cores are allocated across the set of localities spanned by the memory and I/O. The following command would allocate processor cores preferentially from cells 2 and 3:

vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:2:mem::28672 -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0
This first example was extremely simple: the resources on each cell were divided in half. But it shows quite clearly the power behind the LORA concept: the alignment between processor cores and local memory guarantees that the preponderance of memory references will be satisfied through the fastest hardware path. A slightly more complicated example involves dividing the nPartition into four equal vPars instances. Each instance must have 6 cores and 48 GB of memory, of which 42 GB is local memory and 6 GB is interleaved memory. The first three instances can be built easily, with each taking three quarters of the resources on a cell. The fourth instance must be built from the remaining resources and is therefore split across all three cells. The splintering of a vPars instance reduces performance and so should be avoided when possible, but sometimes it is inevitable. For this example, the processor and memory allocations would be as shown in the following diagram:

[Figure: three instances each occupying three quarters of a cell, with the fourth instance split across the remaining quarter of all three cells]
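For concreteness, here is a sketch of creation commands consistent with the allocations just described; the instance names vp1 through vp4 are hypothetical:

# 6144 MB = 6 GB of interleaved memory; 43008 MB = 42 GB of cell-local memory
vparcreate -p vp1 -a cpu::6 -a mem::6144 -a cell:1:mem::43008
vparcreate -p vp2 -a cpu::6 -a mem::6144 -a cell:2:mem::43008
vparcreate -p vp3 -a cpu::6 -a mem::6144 -a cell:3:mem::43008

# The fourth instance is split across the remaining quarter of each cell:
# 14336 MB = 14 GB of cell-local memory from each of the three cells
vparcreate -p vp4 -a cpu::6 -a mem::6144 -a cell:1:mem::14336 -a cell:2:mem::14336 -a cell:3:mem::14336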
The division would not have been so smooth had the request been to create 7 equal vPars instances from the nPartition of 3 cells. The arithmetic implies that each instance would have 3.4 cores, but only an integral number of cores can be allocated. To solve this problem, some instances could have 3 cores and others 4 (for example, four instances with 3 cores and three instances with 4 cores account for all 24 cores), or the entire allocation could be handled by the technique for creating custom instances described in the next section.
Establishing vPars instances by processor and memory requirements

Table 5. vPars instance requirements

Instance   Number of cores   Amount of memory
cow        12                48 GB
dog        4                 88 GB
elk        2                 16 GB
fox        2                 16 GB
The first vPars instance in the table, cow, requires 12 cores, so it will not fit within a single locality. It will, however, fit within two localities, so it should be confined to just two localities: spreading its resources across a third locality would incur additional memory latency overhead. It is best to split the resources evenly across the two localities: this symmetry promotes the most balanced utilization of resources. The 48 GB of memory would be allocated in the ratio 7/8ths local memory and 1/8th interleaved memory. The 42 GB of local memory should be distributed evenly across the two localities, with 21 GB from each of cells 1 and 2.

The second vPars instance in the table, dog, requires 88 GB of memory, so it also will not fit within a single locality. Once again, the strategy is to confine it within the minimum number of localities, which is two, and divide the resources as evenly as possible between them. In this case, cells 2 and 3 each contribute 2 cores, and 35 GB and 42 GB of local memory respectively, with 11 GB of memory coming from ILM.

Here are commands that could be used to establish the configuration:

vparcreate -p cow -a cpu::12 -a mem::6144 -a cell:1:mem::21504 -a cell:2:mem::21504
vparcreate -p dog -a cpu::4 -a mem::11264 -a cell:2:mem::35840 -a cell:3:mem::43008
Alternatively, a configuration of the same size including 1 GB of floating memory on each cell could be established with slightly different commands:

vparcreate -p cow -a cpu::12 -a mem::6144 -a cell:1:mem::20480 -a cell:1:mem::1024:floating -a cell:2:mem::20480 -a cell:2:mem::1024:floating
vparcreate -p dog -a cpu::4 -a mem::11264 -a cell:2:mem::34816 -a cell:2:mem::1024:floating -a cell:3:mem::41984 -a cell:3:mem::1024:floating
The commands for adding I/O to the vPars instances are not shown, because they would depend on the I/O devices needed in each instance and the location of the available I/O in the partition. Once again, the placement of the processors and memory usually has a greater impact on system performance than the location of the I/O.
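Once the instances exist, their actual placement can be verified with vparstatus; a minimal sketch (the verbose output format varies by vPars version):

# Show the cell assignments of CPUs and memory for each instance
vparstatus -v -p cow
vparstatus -v -p dog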
The diagram in Figure 4 shows the example rx8640 with processor and memory resources allocated to the first two vPars instances from Table 5:
[Figure 4: cow occupies six cores and 21 GB of local memory on each of cells 1 and 2; dog occupies two cores on each of cells 2 and 3, with 35 GB of local memory on cell 2 and 42 GB on cell 3]
The next two vPars instances, elk and fox, are easy to lay out: the processor and memory resources needed by each one are available within one single locality. The resource allocations would be distributed as shown in the following table:
Instance   Cores cell 1   Cores cell 2   Cores cell 3   GB CLM cell 1   GB CLM cell 2   GB CLM cell 3   GB ILM
cow        6              6              -              21              21              -               6
dog        -              2              2              -               35              42              11
elk        2              -              -              14              -               -               2
fox        -              -              2              -               -               14              2
[Figure: the complete layout, with elk on cell 1, cow spanning cells 1 and 2, dog spanning cells 2 and 3, and fox on cell 3]
Modifying existing vPars instances

It is also possible to migrate memory along with the processor cores, but only for memory that was specifically designated as floating memory. If 1 GB of memory from cell 2 in the vPars instance cow had been designated as floating memory, it could be migrated to instance dog with the following pair of commands:

vparmodify -p cow -d cell:2:mem::1024:floating
vparmodify -p dog -a cell:2:mem::1024:floating
Advanced tuning
An important part of the LORA value proposition is to deliver ease-of-use along with performance. Our goal is that LORA should work out of the box, without the need for system administrators to perform explicit tuning. Several factors make that goal impossible to reach in every single case. The range of applications deployed across the HP-UX customer base is extremely diverse. So is the capacity of the servers: an application could be deployed in a virtual partition with two processor cores and 3 GB of memory, or in a hard partition with 128 cores and 2 TB of memory. In addition, workloads can exhibit transient spikes in demand many times greater than the steady-state average. The LORA philosophy for coping with this dilemma is to provide out-of-the-box behavior that is solid in most circumstances, while implementing mechanisms that allow system administrators to adjust the behavior to suit the idiosyncrasies of their particular workloads. This section discusses some possibilities for explicit tuning to override the automatic LORA heuristics.
numa_mode kernel tunable parameter

The numa_mode kernel tunable parameter controls the mode of the kernel with respect to NUMA platform characteristics. Because of the close coupling between memory configuration and kernel mode, it is recommended to accept the default value of numa_mode, which is 0, meaning to auto-sense the mode at boot time. Systems configured in accordance with the LORA guidelines will be auto-sensed into LORA mode; otherwise they will operate in SMP mode. As described in the numa_mode man page, the tunable can be adjusted to override the auto-sensing logic.

In LORA mode, HP-UX implements a number of heuristics for automatic workload placement to establish good alignment between the processes executing an application and the memory that they reference. Every process and every thread is assigned a home locality. Processes and threads may temporarily be moved away from their home localities to balance the system load, but they are returned home as soon as is practical. For process memory allocations, when the allocation policy stipulates the closest locality, the home locality of the process is used. For shared memory objects too large to fit within a single locality, the allocation is distributed evenly across the smallest number of localities that can accommodate the object. Any processes attaching to that shared memory object are then re-launched so as to be distributed evenly across the localities containing the memory.
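As a concrete illustration, the tunable can be examined and set with kctune; a minimal sketch, assuming the partition should stay at the auto-sensing default:

# Display the current value of numa_mode
kctune numa_mode

# Explicitly set the default (0 = auto-sense the mode at boot time);
# per the description above, the mode is established at boot
kctune numa_mode=0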
numa_sched_launch kernel tunable parameter

The numa_sched_launch parameter controls the default process launch policy. The launch policy refers to the preferred locality for processes forked as children of a parent process. In LORA mode, the default launch policy is PACKED, which places child processes in the same locality as their parent. Setting the parameter to 0 forces the default launch policy to be the same as it is in SMP mode. Individual processes can be launched with a custom policy by using the mpsched command.
numa_policy kernel tunable parameter

The numa_policy kernel tunable parameter governs the way HP-UX 11i v3 performs memory allocation on NUMA platforms. When the parameter is at its default value of 0, the kernel chooses the allocation policy at boot time based on the platform memory configuration. The system administrator can override the default choice at any time by changing the value of the parameter. The numa_policy man page contains the full details; a brief summary appears below.
Value 0, the default, lets the kernel select the memory allocation policy automatically at boot time; it is recommended for all common workloads and configurations, whether in LORA mode (or with lots of local memory configured) or in SMP mode (or with lots of interleaved memory configured). Among the explicit settings, one policy allocates text and library data segments from interleaved memory and everything else from the locality closest to the allocating processor, which suits highly threaded applications exhibiting spatial locality; another allocates private objects from the closest locality and shared objects from interleaved memory, which suits applications that access lots of global data.
mpsched command
In LORA mode, the kernel attempts to provide good alignment between the processors executing an application and the memory that they reference. To offer direct control over process placement, HP-UX 11i provides the mpsched command. The command reveals information about localities in the system and controls the processor or locality on which a specific process executes. The man page has the full details on how to use the command.

The mpsched command can be used to bind processes to a particular locality. In the absence of this binding, the operating system might schedule processes in different localities as the workload ebbs and flows. The binding ensures that the processes are always executing in the same locality as the memory they allocate and hence will experience low memory latency. The command can also be used to specify a launch policy to control the scheduling of children forked by a process.

Using mpsched to enforce good alignment for Java

HP-UX is usually able to achieve good alignment when running Java. Here are some guidelines for explicit tuning to enforce good alignment when running multiple instances of the Java virtual machine. We assume that each Java instance is small enough in terms of processor and memory resources to fit into a single locality; it is best to subdivide a Java instance if it exceeds the capacity of a single locality. Use the command mpsched -s to determine the number of localities in the partition and the locality identifier corresponding to each locality. It is usually not necessary to use any launch policy if there are fewer than three localities in the partition. If there are many localities, use commands similar to the following to launch Java, distributing four Java virtual machine instances evenly across two localities:

mpsched -l 2 java
mpsched -l 2 java
mpsched -l 3 java
mpsched -l 3 java
Since Java is highly threaded, it may be beneficial to set the numa_policy parameter to 3.
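A sketch of that tune using kctune; as noted in the numa_policy section above, the parameter can be changed at any time:

# Select the allocation policy suited to highly threaded applications
# exhibiting spatial locality, such as Java
kctune numa_policy=3

# Confirm the setting
kctune numa_policy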