Table of Contents
Executive summary
Background and motivation
  Structure of HP servers
  Interleaved memory
  Local memory
LORA scope
  Hardware platforms
  Operating system
  Virtual partitioning
  Application workload
  Variability in hardware resources
  When to use LORA
LORA configuration rules
  nPartitions
  vPars
  Integrity Virtual Machines
LORA system administration
  Server Tunables product in the Tune-N-Tools bundle
  loratune command
Benefits
  Performance
  Cost
  Power management
Summary
Glossary
Technical details
  Configuring nPartitions for LORA
    Converting an existing nPartition for LORA
    Creating a new nPartition for LORA
      Determine the size of the nPartition
      Create the nPartition
  Configuring vPars for LORA
    Considerations for memory granule size
    Creating new vPars instances
      Dividing the nPartition into vPars instances of equal size
      Establishing vPars instances by processor and memory requirements
    Modifying existing vPars instances
    Summary of vPars configuration rules
  Advanced tuning
    numa_mode kernel tunable parameter
    numa_sched_launch kernel tunable parameter
    numa_policy kernel tunable parameter
    mpsched command
      Using mpsched to enforce good alignment for Java
  For more information
Executive summary
Locality-Optimized Resource Alignment (LORA) is a framework for increasing system performance by exploiting the locality domains in HP servers with a Non-Uniform Memory Architecture. LORA consists of configuration rules and tuning recommendations, plus HP-UX 11i v3 enhancements and tools to support them. LORA introduces a new mode to supplement the Symmetric Multiprocessing (SMP) mode originally implemented in HP-UX: where the SMP approach treats memory resources in a symmetric manner, LORA turns the locality inherent in NUMA platforms to advantage. For application workloads that exhibit locality of memory reference, systems configured in accordance with LORA will typically see a 20% performance improvement compared to the SMP mode used with interleaved memory.

The advanced power controls in HP servers offer the opportunity for substantial power savings when platform hardware is not fully utilized. Because the power domains generally correspond to the locality domains, LORA configurations naturally mesh with a power conservation strategy.

The body of this white paper contains sections describing background and motivation, scope, configuration rules, and system administration recommendations. The technical details behind these topics appear in the appendices.

LORA was first introduced in September 2008 with Update 3 to HP-UX 11i v3. The major improvements delivered in September 2009 with Update 5 are:

- The new parconfig command makes configuring nPartitions much simpler.
- The procedure for creating well-aligned vPars instances is simpler, and those instances are fully compatible with gWLM dynamic processor migration operations.
- HP now recommends deploying Integrity Virtual Machines in LORA mode.
- LORA mode is now recommended for more application classes.
- There is less need for system administrators to perform explicit tuning, because HP-UX implements heuristics to perform resource alignment automatically.
- There is a new command, loratune, to restore good resource alignment.
Interleaved memory
Interleaved memory (ILM) is a technique for masking the NUMA properties of a system. Successive cache lines in the memory address space are drawn from different localities, making the average memory access latency time more-or-less uniform.
Sometimes interleaved memory is the best technique. ILM yields good performance when memory references are spread across the entire address space with equal probability. This is the case for applications using large global data sets with no spatial locality.
Local memory
If memory is not interleaved, then the natural localities inherent in the structure of the server complex are evident. The processor cores in each locality enjoy fast access to their local memory. The counterpoint is that access to memory in a different locality is slower. When the memory reference pattern places the majority of accesses in local memory, LORA gives a significant performance advantage relative to interleaved memory.
LORA scope
Hardware platforms
LORA applies only to those HP servers that have a Non-Uniform Memory Architecture. Those are the servers built around the sx1000 and sx2000 chip sets. For these servers, the localities are oriented around cells. Local memory is referred to as Cell Local Memory (CLM). Memory performance is better when all the cells in an nPartition have the same amount of memory installed. This is good advice for the 100% interleaved case, because the deepest interleave is possible only when each cell contributes the same amount of memory. Having the same amount of memory on each cell is even more important for LORA: the memory symmetry promotes balanced utilization of the processing resources. Asymmetry of local memory can cause a slight degradation in overall system performance. The Integrity platform is the design center for LORA, and the architecture exploits features specific to that platform. LORA is not supported on HP 9000 (PA-RISC) platforms.
Operating system
The earliest versions of HP-UX assumed a uniform memory architecture and implemented only the SMP mode. HP-UX 11i v2 was the first version to provide support for local memory, but it placed the entire burden of managing local memory on the system administrator and on applications, so we do not recommend LORA with HP-UX 11i v2. Update 3 to HP-UX 11i v3, released in September 2008, was the first version to provide a rich set of mechanisms to support local memory. We recommend that LORA be used with this update or its successors, Update 4 and Update 5.
Virtual partitioning
The two virtual partitioning solutions provided by HP are Virtual Partitions (vPars) and Integrity Virtual Machines. Because these solutions subdivide the physical resources of an nPartition, they present opportunities to exploit locality. The virtualization model of vPars is particularly well-suited to LORA. Version A.05.05 of the T1335DC product, first delivered in Update 4 to HP-UX 11i v3, contains many optimizations to gain additional benefit from local memory. The binding of virtual resources to physical resources in Integrity Virtual Machines is flexible and fluid, yet there are still opportunities to gain performance advantage through resource alignment. HP recommends deploying Integrity Virtual Machines with LORA starting with Update 4.
Application workload
Memory reference patterns are key to system performance on NUMA platforms. Some applications exhibit strong locality of memory reference, others scatter accesses to global data, and some fall in between. Most commercial applications exhibit sufficient reference locality to gain significant benefit by running with LORA. Technical applications that access extremely large data sets in a uniform manner are better suited to interleaved memory and SMP mode. In some cases, the type of the application is not as important as the size of the data set that it references. If an nPartition is devoted to running one single application, and the working set of that application consumes the bulk of the available physical memory, then there are few opportunities to exploit locality. Database management systems are often run in this manner, and so will generally exhibit more predictable behavior when run in SMP mode. By contrast, if an nPartition is running multiple applications or application instances each of which has a working set much smaller than the amount of physical memory, then LORA mode will usually be able to show a benefit by aligning processing resources.
LORA configuration rules

HP-UX 11i supports a variety of partitioning models. The sections that follow explain which of these models are sensitive to the LORA configuration rules.
nPartitions
At the nPartition level, each base cell should be configured with 7/8ths local memory to comply with the LORA configuration rules. The floating cells, if any, always have 100% local memory. The appendix Configuring nPartitions for LORA contains more detail. For nPartitions containing exactly one locality, there is no difference between interleaved memory and local memory, and hence no difference between SMP mode and LORA mode. If a second cell were added via Dynamic nPartitions, then there would be a difference between the two modes.
vPars
Each vPars instance should be composed with 7/8ths local memory and 1/8th ILM. Since the underlying nPartition has this memory ratio, it is straightforward to reflect the same ratio into the vPars instances. It is important that the processor and memory resources assigned to each vPars instance span the minimal set of localities. If a vPars instance must span multiple localities, then the processor and memory resources should be distributed symmetrically across those localities. Aligning I/O resources with the processors and memory is helpful, but it is a second-order effect. The appendix Configuring vPars for LORA explains these points in more detail. HP recommends that vPars instances configured with 7/8ths local memory be operated in LORA mode; this is achieved most easily by leaving the numa_mode parameter at its default value. When vPars is operating in LORA mode, the system manages any CPU migration operations so as to adhere as closely as possible to the configuration rules given above.
loratune command
The loratune command is valuable in LORA mode. The command can be used to restore good resource alignment if it has been disturbed by an event such as:

- terminating a major application
- completing a backup
- a dynamic platform operation such as online cell activation

The simplest way to use the command is to invoke it with no arguments:

loratune

More details are available in the man page.
Benefits
Performance
LORA reduces average memory access latency times in comparison to the interleaved memory mode. The magnitude of the reduction depends on the memory reference pattern and the number of localities in the partition. When processors spend less time waiting for memory references to be satisfied, all aspects of application performance improve. Typically, response times decrease and throughput increases at the same time that processor utilization drops. A rough estimate of the performance benefit is 20%. As Table 1 indicates, the benefit is greater for larger partitions.
Cost
LORA makes processors operate more efficiently, which can be realized as an increase in performance. Alternatively, the increased efficiency can be used to reduce the number of processor cores allocated to an application workload. This reduces the hardware provisioning cost and may also save on the cost of software licenses. Under the rough 20% estimate given above, for example, a workload that requires 12 cores in SMP mode might be served by 10 cores in LORA mode. The LORA configuration guidelines sometimes recommend an increase in the amount of memory, which may offset some of the cost savings due to increased processor efficiency.
Power management
Power management has strong synergy with LORA. By its nature, LORA groups hardware components by their physical locality. These localities often match power domains, which gives opportunities for power savings at times of low hardware utilization. The newest Itanium processors have multiple cores per socket, and have low-power modes. The greatest power savings are realized when all cores in a socket enter the low-power mode, as compared to having single cores in multiple sockets in the low-power mode. LORA tends to group cores by their proximity, increasing the chances that an entire socket can enter low-power mode when an application is experiencing a light load.
Summary
Locality-Optimized Resource Alignment is a framework for improving performance on HP servers with a Non-Uniform Memory Architecture, introduced with HP-UX 11i v3 Update 3 and enhanced in subsequent updates. This paper explains the circumstances in which LORA is beneficial and gives guidelines for deployment in those cases. When LORA is used with commercial applications, performance is about 20% better than the SMP interleaved memory configuration. LORA simplifies server configuration by presenting uniform configuration guidelines. LORA dovetails nicely with power management strategies.
Glossary
Socket: Receptacle on a motherboard for the physical package of processing resources.

Processor core: If the physical package of processing resources includes multiple independent functional entities, each of them is called a processor core.

Core: Same as processor core. The keyword cpu in the vPars commands refers to a core.

CPU: Acronym for Central Processing Unit. The term processor core is preferred.

Cell: The basic physical building block of a system complex. A cell contains processor sockets, memory, and I/O components.

Crossbar: A component of the interconnect fabric that allows the cells in a system complex to communicate with each other.

Locality domain: A set of processors, memory, and I/O system bus adapters identified by the operating system for resource alignment purposes.

Locality: Same as locality domain.

SMP: Acronym for Symmetric Multiprocessor. A model in which all of the processors in a system are equivalent to and interchangeable with each other.

NUMA: Acronym for Non-Uniform Memory Architecture. A hardware platform in which system memory is separated into localities based on memory access latency times for processors.

ILM: Acronym for Interleaved Memory. A technique in which successive cache lines of memory are drawn from different localities.
Technical details
The appendices that follow contain the technical details to supplement the general information presented in the opening sections of this paper. The appendix Configuring nPartitions for LORA explains the steps needed for every deployment of LORA. The appendix Configuring vPars for LORA gives the additional steps needed when vPars is used in a LORA nPartition. Recommendations for fine-tuning workloads are given in the appendix Advanced tuning.
Configuring nPartitions for LORA

Converting an existing nPartition for LORA

The parmodify command is used to change the local memory ratio. For example, for an nPartition built from cells 1, 2, and 3:

parmodify -p 2 -m 1:base:::87.5% -m 2:base:::87.5% -m 3:base:::87.5%

It is necessary to reboot the partition for the change to take effect. The man page for the parmodify(1m) command has the full details.
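After the reboot, one way to confirm the new cell-local memory ratio is with parstatus; a minimal sketch, using cell 1 of this example (the exact output fields vary by firmware and OS version):

# Show detailed information for cell 1, including how much of its
# memory is configured as cell-local (CLM) versus interleaved
parstatus -V -c 1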
Creating a new nPartition for LORA

Determine the size of the nPartition

[Table: recommended memory increase by partition size, with rows for 1 to 12, 13 to 24, 25 to 48, and 49 to 64 sockets]
Note: If memory utilization is below 75%, it is not necessary to increase the amount of memory at all.
Create the nPartition

The parconfig command is used to build the nPartition. For example, the following command creates a partition that requires 24 processor cores:

parconfig -x -O HPUX_11iV3 -X LORA -n mincpus:24 -Z high_performance

The man page for the parconfig(1m) command has the full details.
Configuring vPars for LORA
Please note that the creation commands specify the cell identifiers for the local memory but only a count for the number of processors. The enhancements added to vPars version A.05.05 (the version delivered starting with Update 4) automatically allocate processor cores in close proximity to the memory. For this example, the processor and memory allocations would be as shown in the following diagram:
[Figure: the processor and memory resources of each cell divided in half between two vPars instances]
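For concreteness, here is a sketch of creation commands consistent with dividing each cell in half; the instance names other than vp5 and vp6 (which appear in the I/O examples below) are hypothetical:

# Each instance gets half a cell: 4 cores, 4 GB of interleaved memory
# (mem::4096), and 28 GB of cell-local memory (cell:N:mem::28672)
vparcreate -p vp1 -a cpu::4 -a mem::4096 -a cell:1:mem::28672
vparcreate -p vp2 -a cpu::4 -a mem::4096 -a cell:1:mem::28672
vparcreate -p vp3 -a cpu::4 -a mem::4096 -a cell:2:mem::28672
vparcreate -p vp4 -a cpu::4 -a mem::4096 -a cell:2:mem::28672
vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:3:mem::28672
vparcreate -p vp6 -a cpu::4 -a mem::4096 -a cell:3:mem::28672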
To achieve alignment between processors and memory with the vPars version delivered in Update 3, it was necessary to specify the cell identifiers for the processors to match the local memory. This had the drawback of disallowing gWLM from performing processor deletion operations on the cell-local processors. With the enhancements added in Update 4, gWLM is free to delete any core, and online processor addition operations will select the cores that give the best possible resource alignment.

The creation commands did not specify any I/O allocations. The needed I/O could be added with subsequent commands. For example:

vparmodify -p vp5 -a io:1/0/0/2/0.6.0:BOOT -a io:1/0/0/1/0
vparmodify -p vp6 -a io:1/0/4/0/0.6.0:BOOT -a io:1/0/6/1/0

As a general rule, overall system performance is determined by processor-to-memory alignment, and is less sensitive to I/O location. Most I/O operations are asynchronous, and the operating system and applications employ a wide variety of latency-hiding techniques to make progress while I/O operations are pending. If a particular workload is sensitive to processor-to-I/O alignment, the alignment can be achieved by specifying the I/O location with the create command. For example:

vparcreate -p vp5 -a cpu::4 -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0
vparcreate -p vp6 -a cpu::4 -a io:3/0/4/0/0.6.0:BOOT -a io:3/0/6/1/0

It is also possible to specify both local memory and I/O in the vparcreate command, in which case the processor cores are allocated across the set of localities spanned by the memory and I/O. The following command would allocate processor cores preferentially from cells 2 and 3:

vparcreate -p vp5 -a cpu::4 -a mem::4096 -a cell:2:mem::28672 -a io:3/0/0/2/0.6.0:BOOT -a io:3/0/0/1/0
This first example was extremely simple: the resources on each cell were divided in half. But it shows quite clearly the power behind the LORA concept: the alignment between processor cores and local memory guarantees that the preponderance of memory references will be satisfied through the fastest hardware path. A slightly more complicated example involves dividing the nPartition into four equal vPars instances. Each instance must have 6 cores and 48 GB of memory, of which 42 GB is local memory and 6 GB is interleaved memory. The first three instances can be built easily, with each taking three quarters of the resources on a cell. The fourth instance must be built from the remaining resources and is therefore split across all three cells. The splintering of a vPars instance reduces performance and so should be avoided when possible, but sometimes it is inevitable. For this example, the processor and memory allocations would be as shown in the following diagram:

[Figure: three instances each occupying three quarters of a cell, with the fourth instance split across the remaining quarter of all three cells]
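For concreteness, here is a sketch of creation commands consistent with the allocations just described; the instance names vp1 through vp4 are hypothetical:

# 6144 MB = 6 GB of interleaved memory; 43008 MB = 42 GB of cell-local memory
vparcreate -p vp1 -a cpu::6 -a mem::6144 -a cell:1:mem::43008
vparcreate -p vp2 -a cpu::6 -a mem::6144 -a cell:2:mem::43008
vparcreate -p vp3 -a cpu::6 -a mem::6144 -a cell:3:mem::43008

# The fourth instance is split across the remaining quarter of each cell:
# 14336 MB = 14 GB of cell-local memory from each of the three cells
vparcreate -p vp4 -a cpu::6 -a mem::6144 -a cell:1:mem::14336 -a cell:2:mem::14336 -a cell:3:mem::14336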
The division would not have been so smooth had the request been to create 7 equal vPars instances from the nPartition of 3 cells. The arithmetic implies that each instance would have 3.4 cores, but only an integral number of cores can be allocated. To solve this problem, some instances could have 3 cores and others 4 (for example, four instances with 3 cores and three instances with 4 cores account for all 24 cores), or the entire allocation could be handled by the technique for creating custom instances described in the next section.
Establishing vPars instances by processor and memory requirements

Table 5. vPars instance requirements

Instance   Number of cores   Amount of memory
cow        12                48 GB
dog        4                 88 GB
elk        2                 16 GB
fox        2                 16 GB
The first vPars instance in the table, cow, requires 12 cores, so it will not fit within a single locality. It will, however, fit within two localities, so it should be confined to just two localities: spreading its resources across a third locality would incur additional memory latency overhead. It is best to split the resources evenly across the two localities: this symmetry promotes the most balanced utilization of resources. The 48 GB of memory would be allocated in the ratio 7/8ths local memory and 1/8th interleaved memory. The 42 GB of local memory should be distributed evenly across the two localities, with 21 GB from each of cells 1 and 2.

The second vPars instance in the table, dog, requires 88 GB of memory, so it also will not fit within a single locality. Once again, the strategy is to confine it within the minimum number of localities, which is two, and divide the resources as evenly as possible between them. In this case, cells 2 and 3 each contribute 2 cores, and 35 GB and 42 GB of local memory respectively, with 11 GB of memory coming from ILM.

Here are commands that could be used to establish the configuration:

vparcreate -p cow -a cpu::12 -a mem::6144 -a cell:1:mem::21504 -a cell:2:mem::21504
vparcreate -p dog -a cpu::4 -a mem::11264 -a cell:2:mem::35840 -a cell:3:mem::43008
Alternatively, a configuration of the same size including 1 GB of floating memory on each cell could be established with slightly different commands:

vparcreate -p cow -a cpu::12 -a mem::6144 -a cell:1:mem::20480 -a cell:1:mem::1024:floating -a cell:2:mem::20480 -a cell:2:mem::1024:floating
vparcreate -p dog -a cpu::4 -a mem::11264 -a cell:2:mem::34816 -a cell:2:mem::1024:floating -a cell:3:mem::41984 -a cell:3:mem::1024:floating
The commands for adding I/O to the vPars instances are not shown, because they would depend on the I/O devices needed in each instance and the location of the available I/O in the partition. Once again, the placement of the processors and memory usually has a greater impact on system performance than the location of the I/O.
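Once the instances exist, their actual placement can be verified with vparstatus; a minimal sketch (the verbose output format varies by vPars version):

# Show the cell assignments of CPUs and memory for each instance
vparstatus -v -p cow
vparstatus -v -p dog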
The diagram in Figure 4 shows the example rx8640 with processor and memory resources allocated to the first two vPars instances from Table 5:
[Figure 4: cow occupies six cores and 21 GB of local memory on each of cells 1 and 2; dog occupies two cores on each of cells 2 and 3, with 35 GB of local memory on cell 2 and 42 GB on cell 3]
The next two vPars instances, elk and fox, are easy to lay out: the processor and memory resources needed by each one are available within one single locality. The resource allocations would be distributed as shown in the following table:
Instance   Cores cell 1   Cores cell 2   Cores cell 3   GB CLM cell 1   GB CLM cell 2   GB CLM cell 3   GB ILM
cow        6              6              -              21              21              -               6
dog        -              2              2              -               35              42              11
elk        2              -              -              14              -               -               2
fox        -              -              2              -               -               14              2
[Figure: the complete layout, with elk on cell 1, cow spanning cells 1 and 2, dog spanning cells 2 and 3, and fox on cell 3]
Modifying existing vPars instances

It is also possible to migrate memory along with the processor cores, but only for memory that was specifically designated as floating memory. If 1 GB of memory from cell 2 in the vPars instance cow had been designated as floating memory, it could be migrated to instance dog with the following pair of commands:

vparmodify -p cow -d cell:2:mem::1024:floating
vparmodify -p dog -a cell:2:mem::1024:floating
Advanced tuning
An important part of the LORA value proposition is to deliver ease-of-use along with performance. Our goal is that LORA should work out of the box, without the need for system administrators to perform explicit tuning. Several factors make that goal impossible to reach in every single case. The range of applications deployed across the HP-UX customer base is extremely diverse. So is the capacity of the servers: an application could be deployed in a virtual partition with two processor cores and 3 GB of memory, or in a hard partition with 128 cores and 2 TB of memory. In addition, workloads can exhibit transient spikes in demand many times greater than the steady-state average. The LORA philosophy for coping with this dilemma is to provide out-of-the-box behavior that is solid in most circumstances, while implementing mechanisms that allow system administrators to adjust the behavior to suit the idiosyncrasies of their particular workloads. This section discusses some possibilities for explicit tuning to override the automatic LORA heuristics.
numa_mode kernel tunable parameter

The numa_mode kernel tunable parameter controls the mode of the kernel with respect to NUMA platform characteristics. Because of the close coupling between memory configuration and kernel mode, it is recommended to accept the default value of numa_mode, which is 0, meaning to auto-sense the mode at boot time. Systems configured in accordance with the LORA guidelines will be auto-sensed into LORA mode; otherwise they will operate in SMP mode. As described in the numa_mode man page, the tunable can be adjusted to override the auto-sensing logic.

In LORA mode, HP-UX implements a number of heuristics for automatic workload placement to establish good alignment between the processes executing an application and the memory that they reference. Every process and every thread is assigned a home locality. Processes and threads may temporarily be moved away from their home localities to balance the system load, but they are returned home as soon as is practical. For process memory allocations, when the allocation policy stipulates the closest locality, the home locality of the process is used. For shared memory objects too large to fit within a single locality, the allocation is distributed evenly across the smallest number of localities that can accommodate the object. Any processes attaching to that shared memory object are then re-launched so as to be distributed evenly across the localities containing the memory.
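As a concrete illustration, the tunable can be examined and set with kctune; a minimal sketch, assuming the partition should stay at the auto-sensing default:

# Display the current value of numa_mode
kctune numa_mode

# Explicitly set the default (0 = auto-sense the mode at boot time);
# per the description above, the mode is established at boot
kctune numa_mode=0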
numa_sched_launch kernel tunable parameter

The numa_sched_launch parameter controls the default process launch policy. The launch policy refers to the preferred locality for processes forked as children of a parent process. In LORA mode, the default launch policy is PACKED, which places child processes in the same locality as their parent. Setting the parameter to 0 forces the default launch policy to be the same as it is in SMP mode. Individual processes can be launched with a custom policy by using the mpsched command.
numa_policy kernel tunable parameter

The numa_policy kernel tunable parameter governs the way HP-UX 11i v3 performs memory allocation on NUMA platforms. When the parameter is at its default value of 0, the kernel chooses the allocation policy at boot time based on the platform memory configuration. The system administrator can override the default choice at any time by changing the value of the parameter. The numa_policy man page contains the full details; a brief summary appears below.
Value 0, the default, lets the kernel select the memory allocation policy automatically at boot time; it is recommended for all common workloads and configurations, whether in LORA mode (or with lots of local memory configured) or in SMP mode (or with lots of interleaved memory configured). Among the explicit settings, one policy allocates text and library data segments from interleaved memory and everything else from the locality closest to the allocating processor, which suits highly threaded applications exhibiting spatial locality; another allocates private objects from the closest locality and shared objects from interleaved memory, which suits applications that access lots of global data.
mpsched command
In LORA mode, the kernel attempts to provide good alignment between the processors executing an application and the memory that they reference. To offer direct control over process placement, HP-UX 11i provides the mpsched command. The command reveals information about localities in the system and controls the processor or locality on which a specific process executes. The man page has the full details on how to use the command.

The mpsched command can be used to bind processes to a particular locality. In the absence of this binding, the operating system might schedule processes in different localities as the workload ebbs and flows. The binding ensures that the processes are always executing in the same locality as the memory they allocate and hence will experience low memory latency. The command can also be used to specify a launch policy to control the scheduling of children forked by a process.

Using mpsched to enforce good alignment for Java

HP-UX is usually able to achieve good alignment when running Java. Here are some guidelines for explicit tuning to enforce good alignment when running multiple instances of the Java virtual machine. We assume that each Java instance is small enough in terms of processor and memory resources to fit into a single locality; it is best to subdivide a Java instance if it exceeds the capacity of a single locality. Use the command mpsched -s to determine the number of localities in the partition and the locality identifier corresponding to each locality. It is usually not necessary to use any launch policy if there are fewer than three localities in the partition. If there are many localities, use commands similar to the following to launch Java, distributing four Java virtual machine instances evenly across two localities:

mpsched -l 2 java
mpsched -l 2 java
mpsched -l 3 java
mpsched -l 3 java
Since Java is highly threaded, it may be beneficial to set the numa_policy parameter to 3.
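A sketch of that tune using kctune; as noted in the numa_policy section above, the parameter can be changed at any time:

# Select the allocation policy suited to highly threaded applications
# exhibiting spatial locality, such as Java
kctune numa_policy=3

# Confirm the setting
kctune numa_policy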