
Region-based Techniques for Modeling and Enhancing Cluster OpenMP Performance

Jie Cai

August 2011

A thesis submitted for the degree of Doctor of Philosophy of the Australian National University

© Jie Cai 2011

This document was produced using TeX, LaTeX and BibTeX.

For my wife, Ruru, who greatly supported my PhD research...
...and my loving parents.

Declaration
I declare that the work in this thesis is entirely my own and that to the best of my knowledge it does not contain any materials previously published or written by another person except where otherwise indicated.

Jie Cai

Acknowledgements
During my PhD, many people have offered kind help and generous support, and I would like to thank them here.

Supervisors: Dr. Peter Strazdins, for his guidance and advice throughout my doctoral research; Dr. Alistair Rendell, for the time we spent together before paper deadlines; Dr. Eric McCreath, for his useful comments.

Readers: For reading my thesis and providing valuable feedback, thank you. Warren Armstrong, Muhammad Atif, Michael Chapman, Pete Janes, Josh Milthorpe, Peter Strazdins, and Jin Wong.

Computer System Group Members: For the cheerful four years of my PhD, thank you, geeks. Joseph Anthony, Ting Cao, Elton Tian, Xi Yang, Fangzhou Xiao and more...

Industry Partners: For their generous financial contributions supporting my research. Australian Research Council, Intel, Sun Microsystems (Oracle).

Last but definitely not least: For being so supportive, NCI NF colleagues. Ben Evans, Robin Humble, Judy Jenkinson and David Singleton.

Abstract
Cluster OpenMP enables the use of the OpenMP shared memory programming model on distributed memory cluster environments. Intel has released a cluster OpenMP implementation called Intel Cluster OpenMP (CLOMP). While this offers better programmability than message passing alternatives such as the Message Passing Interface (MPI), the convenience comes with the overhead of maintaining the consistency of the underlying shared memory abstraction, and CLOMP is no exception. This thesis introduces models for understanding the overheads of cluster OpenMP implementations like CLOMP and proposes techniques for enhancing their performance.

Cluster OpenMP systems are usually implemented using page-based software distributed shared memory (sDSM) systems, which create and maintain a virtual global shared memory space in units of pages. A key issue for such systems is maintaining the consistency of the shared memory space. This is a major source of overhead, and it is driven by detecting and servicing page faults. To investigate and understand these systems, we evaluate their performance with different OpenMP applications, and we also develop a benchmark, called MCBENCH, to characterize the memory consistency costs. Using MCBENCH, we find that this overhead is proportional to the number of writers to the same shared page and to the number of shared pages.

Furthermore, we divide an OpenMP program into separate parallel and serial regions. Based on these regions, we develop two region-based models that account for the number and types of page faults and their associated performance costs. The models highlight that the major overhead is servicing the type of page fault that requires data (a page or its modifications, known as diffs) to be transferred across the network.

With this understanding, we have developed three region-based prefetch (ReP) techniques based on the execution history of each parallel and serial region. The first ReP technique (TReP) considers the temporal page faulting behaviour between consecutive executions of the same region. The second technique (HReP) considers both the temporal page faulting behaviour between consecutive executions of the same region and the spatial paging behaviour within an execution of a region. The last technique (DReP) utilizes our proposed novel stride-augmented run-length encoding (sRLE) method to address both the temporal and spatial page faulting behaviour between consecutive executions of the same region. These techniques effectively reduce the number of page faults and aggregate data (pages and diffs) into larger transfers, which better exploits the network bandwidth provided by the interconnect.

All three ReP techniques are implemented in the CLOMP runtime libraries to enhance its performance. Both the original and the enhanced CLOMP are evaluated using the NAS Parallel Benchmark OpenMP (NPB-OMP) suite and two LINPACK OpenMP benchmarks on different hardware platforms, including two clusters connected with Ethernet and InfiniBand interconnects. The performance data is quantitatively analyzed and modeled. MCBENCH is also used to evaluate the impact of the ReP techniques on memory consistency cost.

The evaluation results demonstrate that, on average, CLOMP spends 75% and 55% of the overall elapsed time of the NPB-OMP benchmarks on overheads over Gigabit Ethernet and double data rate InfiniBand networks respectively. These ratios are reduced by 60% and 40% respectively after implementing the ReP techniques in the CLOMP runtime. For the LINPACK benchmarks, with the assistance of sRLE, DReP significantly outperforms the other ReP techniques, reducing page fault handling costs by 50% and 58% on the Gigabit Ethernet and InfiniBand networks respectively.
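To make the sRLE idea concrete, the following minimal C sketch shows how the first level of the encoding described in Section 5.5.1 could compress a sorted list of faulted page IDs into (StartPageID, CommonStride, RunLength) records. The type and function names (srle_record_t, srle_encode) are hypothetical, and the code is only an illustration of the encoding idea under simplified assumptions, not the CLOMP runtime implementation.

#include <stdio.h>
#include <stddef.h>

/* Illustrative sketch only: first-level stride-augmented run-length
 * encoding (sRLE) of a sorted list of faulted page IDs, following the
 * (StartPageID, CommonStride, RunLength) format. Names are assumptions. */
typedef struct {
    long start_page;  /* first page ID of the run             */
    long stride;      /* common stride between adjacent pages */
    long run_length;  /* number of pages in the run           */
} srle_record_t;

/* Greedily compress a sorted array of page IDs into first-level sRLE
 * records; returns the number of records written into out[]. */
static size_t srle_encode(const long *pages, size_t n, srle_record_t *out)
{
    size_t nrec = 0, i = 0;
    while (i < n) {
        srle_record_t rec = { pages[i], 0, 1 };
        if (i + 1 < n) {
            rec.stride = pages[i + 1] - pages[i];
            /* extend the run while the stride stays the same */
            while (i + rec.run_length < n &&
                   pages[i + rec.run_length] - pages[i + rec.run_length - 1]
                       == rec.stride)
                rec.run_length++;
        }
        out[nrec++] = rec;
        i += (size_t)rec.run_length;
    }
    return nrec;
}

int main(void)
{
    /* e.g. pages faulted in one region execution (already sorted) */
    long pages[] = { 10, 11, 12, 13, 20, 22, 24, 40 };
    size_t n = sizeof pages / sizeof pages[0];
    srle_record_t recs[8];

    size_t nrec = srle_encode(pages, n, recs);
    for (size_t r = 0; r < nrec; r++)
        printf("(start=%ld, stride=%ld, len=%ld)\n",
               recs[r].start_page, recs[r].stride, recs[r].run_length);
    return 0;
}

For the page ID sequence 10, 11, 12, 13, 20, 22, 24, 40 the sketch produces the records (10, 1, 4), (20, 2, 3) and (40, 0, 1); a second encoding level, as described in Figure 5.7, would further compress runs of first-level records that themselves share a common stride.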

Contents
Declaration
Acknowledgements
Abstract

I  Introduction and Background

1  Introduction
   1.1  Motivation
        1.1.1  Research Objectives
   1.2  Contributions
        1.2.1  Performance Evaluation of CLOMP
        1.2.2  Region-based Performance Models
        1.2.3  Region-based Prefetch Techniques
   1.3  Thesis Structure

2  Background
   2.1  OpenMP
        2.1.1  OpenMP Directives
        2.1.2  Synchronization Operations
   2.2  Cluster OpenMP Systems
        2.2.1  Relaxed Memory Consistency
        2.2.2  Software Distributed Shared Memory Systems
        2.2.3  Intel Cluster OpenMP
        2.2.4  Alternative Approaches to sDSMs
   2.3  Related Work
        2.3.1  Performance Models
        2.3.2  Prefetch Techniques for sDSM Systems
        2.3.3  Run-Length Encoding Methods
   2.4  Summary

II  Performance Issues of Intel Cluster OpenMP

3  Performance of Original Intel Cluster OpenMP System
   3.1  Hardware and Software Setup
   3.2  Performance of CLOMP
        3.2.1  NPB OpenMP Benchmarks Sequential Performance
        3.2.2  Comparison of CLOMP and Intel Native OpenMP on a Single Node
        3.2.3  CLOMP with Single Thread per Compute Node
        3.2.4  CLOMP with Multiple Threads per Compute Node
        3.2.5  Elapsed Time Breakdown for NPB-OMP Benchmarks
   3.3  Memory Consistency Cost of CLOMP
        3.3.1  Memory Consistency Cost Micro-Benchmark MCBENCH
        3.3.2  MCBENCH Evaluation of CLOMP
   3.4  Summary

4  Region-Based Performance Models
   4.1  Regions of OpenMP Programs
   4.2  SIGSEGV Driven Performance (SDP) Models
        4.2.1  Critical Path Model
        4.2.2  Aggregated Model
        4.2.3  Coefficient Measurement
   4.3  SDP Model Verification
        4.3.1  Critical Path Model Estimates
        4.3.2  Aggregate Model Estimates
   4.4  Summary

III  Optimizations: Design, Implementation and Evaluation

5  Region-Based Prefetch Techniques
   5.1  Limitations of Current Prefetch Techniques for sDSM Systems
        5.1.1  Parallel Application Examples
        5.1.2  Limitations
        5.1.3  Prefetch Technique Design Assumptions
   5.2  Evaluation Metrics of Prefetch Techniques
   5.3  Temporal ReP (TReP) Technique
   5.4  Hybrid ReP (HReP) Technique
   5.5  ReP Technique for Dynamic Memory Accessing Applications (DReP)
        5.5.1  Stride-augmented Run-length Encoded Page Fault Records
        5.5.2  Page Miss Prediction
   5.6  Offline Simulation
        5.6.1  Simulation Setup
        5.6.2  Simulation Results and Discussions
   5.7  Summary

6  Implementation and Evaluation
   6.1  ReP Prefetch Techniques Implementation Issues
        6.1.1  Data Structures
        6.1.2  New Region Notification
        6.1.3  Record Encoding and Flush Filter enabled Decoding
        6.1.4  Prefetch Page Prediction
        6.1.5  Prefetch Request and Event Handling
        6.1.6  Page State Transition
        6.1.7  Garbage Collection Mechanism
   6.2  Theoretical Performance of the ReP Enhanced CLOMP
   6.3  Performance Evaluation of the ReP Enhanced CLOMP
        6.3.1  MCBENCH
        6.3.2  NPB OpenMP Benchmarks
        6.3.3  LINPACK Benchmarks
        6.3.4  ReP Techniques with Multiple Threads per Process
   6.4  Summary

IV  Conclusions and Future Work

7  Conclusions and Future Work
   7.1  Conclusions
        7.1.1  Performance Evaluation of CLOMP
        7.1.2  SIGSEGV Driven Performance Models
        7.1.3  Performance Enhancement by RePs
   7.2  Future Directions
        7.2.1  Performance Evaluation
        7.2.2  Performance Optimizations
        7.2.3  Adapting ReP Techniques to the Latest Technologies
        7.2.4  Potential Use of sRLE

Appendices

A  Algorithms Used in DReP
   A.1  Stride-augmented Run-length Encoding Algorithms
        A.1.1  Algorithm 1: Page Fault Record Reconstruction Step (a)
        A.1.2  Algorithm 2: Page Fault Record Reconstruction Step (b)
        A.1.3  Algorithm 3: Page Fault Record Reconstruction Step (c)
   A.2  Algorithm 4: DReP Predictor

B  $T_{segv,local}$ and $N^f_{total}$ for Theoretical ReP Speedup Calculation
   B.1  NPB-OMP Benchmarks Datasheet
   B.2  LINPACK Benchmarks Datasheet

C  TReP and DReP Performance Results of the NPB-OMP benchmarks on a 4-node Intel Cluster
   C.1  Experimental Setup
   C.2  Sequential Elapsed Time
   C.3  TReP and DReP Evaluation
        C.3.1  Elapsed Time over Gigabit Ethernet
        C.3.2  Elapsed Time over DDR InfiniBand

D  MultiRail Networks Optimization for the Communication Layer
   D.1  Introduction
   D.2  Micro-Benchmarks
        D.2.1  Design Issues
        D.2.2  Single-Rail Benchmark
        D.2.3  Multirail Benchmark
   D.3  Bandwidth and Latency Experiments
        D.3.1  Experimental Setup
        D.3.2  Latency
        D.3.3  Uni-directional Bandwidth
        D.3.4  Bi-directional Bandwidth
        D.3.5  Elapsed Time Breakdown
   D.4  Related Work on Multirail InfiniBand Network
   D.5  Challenge and Conclusion

E  Performance of CAL
   E.1  Bandwidth and Latency of CAL
   E.2  Comparison Between OpenMPI and CAL

Bibliography

List of Figures
2.1  OpenMP fork-join multi-threading parallelism mechanism [93].
2.2  OpenMP parallel directives and associated clauses in C and C++.
2.3  OpenMP for directives and associated clauses in C and C++.
2.4  An example OpenMP program in C using parallel for directives.
2.5  OpenMP synchronization directives in C and C++ languages: (a) barrier, and (b) flush.
2.6  OpenMP threadprivate directive in C and C++ languages.
2.7  Processes and threads in CLOMP.
2.8  State machine of CLOMP (derived from [47], [38], and experimental observation).
2.9  Illustration of two prefetch modes for Adaptive++ techniques.
3.1  Comparison of performance between native Intel OpenMP and CLOMP on a XE compute node.
3.2  Comparison of performance between native Intel OpenMP and CLOMP on a VAYU compute node.
3.3  Performance of CLOMP on XE with a single thread per compute node.
3.4  Performance of CLOMP on VAYU with a single thread per compute node.
3.5  Performance of CLOMP on XE with multi-threads per compute node.
3.6  Performance of CLOMP on VAYU with multi-threads per compute node.
3.7  MCBENCH: An array of size a bytes is divided into chunks of c bytes. The benchmark consists of Change and Read phases that can be repeated for multiple iterations. Entering the Change phase of the first iteration, the chunks are distributed to the available threads (four in this case) in a round-robin fashion. In the Read phase after the barrier, each thread reads from the chunk that its neighbour had written to. This is followed by a barrier which ends the first iteration. For the subsequent iteration, the chunks to Change are the same as in the previous Read phase. That is, the shifting of the chunk distribution only takes place when moving from the Change to Read phases.
3.8  MCBENCH evaluation results of CLOMP on XE with both Ethernet and InfiniBand interconnects: 64KB, 4MB and 8MB array sizes are used in these three figures respectively; comparison among different chunk sizes 4B, 2KB and 4KB is illustrated in each figure for both Ethernet and InfiniBand.
4.1  Illustration of regions in an OpenMP parallel program.
4.2  Schematic illustration of timing breakdown for a parallel region using the SDP model.
4.3  The algorithm used to determine the SDP coefficients. The code shown is in a parallel region. R is a private array while S is a shared one. Variables Dw and Dr represent reference times for accessing private array R.
5.1  Pseudo code to demonstrate the memory access patterns of the naive LINPACK OpenMP benchmark implementation for an n × n column-major matrix A with blocking factor nb.
5.2  Naive OpenMP LINPACK program with n × n matrix: (a) memory access areas for different iterations. (b) page fault areas for different iterations.
5.3  Pseudo code to demonstrate the memory access patterns of the optimized LINPACK OpenMP benchmark implementation for an n × n column-major matrix A with blocking factor nb.
5.4  Optimized OpenMP LINPACK program: (a) memory access areas for different iterations illustrated on an n × n matrix panel. (b) page fault areas for different iterations illustrated on the n × n matrix panel.
5.5  The page fault record entry for TReP and HReP prefetch techniques.
5.6  A flowchart of the HReP predictor.
5.7  Two levels of the stride-augmented run-length encoding (sRLE) method: (a) Based on strides between consecutive pages, sorted missed pages are broken into small sub-arrays, and those consecutive pages with the same stride are stored in the same array. (b) The sub-arrays are compressed into the first level sRLE records in a (StartPageID, CommonStride, RunLength) format. (c) Based on the stride between the start pages of consecutive first level sRLE records, they are further compressed into the second level sRLE format, (FirstLevelRecord, CommonStride, RunLength) (more details in Section 5.5.1).
5.8  Page fault record of region execution reconstructed via run-length encoding method.
5.9  The effective page miss rate reduction for different prefetch techniques on 2 threads (a), 4 threads (b) and 8 threads (c).
6.1  Intel Cluster OpenMP runtime structure.
6.2  Data structure for stride-augmented run-length encoded page fault records.
6.3  ReP prefetch record data structure.
6.4  User interactive interface of new region notification.
6.5  The round-robin prefetch request communication pattern.
6.6  New page state machine after introducing the Prefetched diff and Prefetched page states.
6.7  RePs vs. Original CLOMP: MCBENCH with 4B chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size.
6.8  RePs vs. Original CLOMP: MCBENCH with 2048 bytes chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size.
6.9  RePs vs. Original CLOMP: MCBENCH with 4KB chunk size over both the GigE and IB networks. (a) 64KB array size, (b) 4MB array size, (c) 8MB array size.
6.10 RePs vs. Original CLOMP: BT speedup comparison on both GigE and IB networks.
6.11 RePs vs. Original CLOMP: the naive LINPACK evaluation results comparison using an N × N matrix (N = 4096) with blocking factor NB = 64 via both GigE and IB.
6.12 RePs vs. Original CLOMP: the optimized LINPACK evaluation results comparison using an N × N matrix (N = 8192) with blocking factor NB = 64 via both GigE and IB.
6.13 DReP vs. Original CLOMP: the optimized LINPACK benchmark (N = 8192 and NB = 64) results comparison with multiple threads per process via both GigE and IB. (a) 2 threads per process, (b) 4 threads per process.
C.1  Speedup of the BT and CG benchmarks over Gigabit Ethernet.
C.2  Speedup of IS and LU benchmarks over Gigabit Ethernet.
C.3  Speedup of BT and CG benchmarks over DDR InfiniBand.
C.4  Speedup of IS and LU benchmarks over DDR InfiniBand.
D.1  Single-rail bandwidth benchmark.
D.2  Multirail communication memory access pattern.
D.3  Non-threaded multirail bandwidth benchmark.
D.4  Threaded multirail benchmark design.
D.5  RDMA write latency comparison.
D.6  Uni-directional multi-port bandwidth.
D.7  Uni-directional multi-HCA bandwidth.
D.8  Bi-directional multi-port bandwidth.
D.9  Bi-directional multi-HCA bandwidth.
D.10 Benchmarks elapsed time breakdown for 512 bytes message.
D.11 Benchmarks elapsed time breakdown for 4KB message.
D.12 Different ways to configure an InfiniBand multirail network [62].

List of Tables
2.1  OpenMP synchronization operations.
3.1  Evaluation experimental hardware platforms.
3.2  Sequential elapsed time (sec) of NPB with CLOMP.
3.3  Page faults handling cost (SEGV Cost) of CLOMP for NPB benchmarks as a ratio to corresponding elapsed time with single thread per process on XE.
3.4  Page faults handling cost breakdown for CLOMP for class A NPB benchmarks with multiple threads per process on XE. SEGV represents the ratio of page faults handling cost to the corresponding elapsed time; SEGV Lock in turn represents a ratio of pthread mutex cost within SEGV.
4.1  Critical path page fault counts for the NPB-OMP benchmarks run using CLOMP.
4.2  Comparison between observed and estimated speedup for running NPB class A and C on the AMD cluster with CLOMP.
4.3  Average relative errors for the predicted NPB speedups evaluated using the critical path and aggregate (f = 0) SDP models and data from Table 4.2.
5.1  Threshold effects of ReP techniques for naive LINPACK benchmark.
5.2  Simulation prefetch efficiency ($E$) and coverage ($N_u/N_f$) for Adaptive++, TODFCM (1 page), TReP, HReP and DReP techniques.
5.3  Breakdown of prefetches issued by different prefetch modes and chosen list deployed in HReP.
5.4  Comparison of F-HReP and HReP with the LU benchmark.
6.1  Bandwidth and latency measured by the communication layer (CAL) of CLOMP.
6.2  ReP techniques prefetch efficiency and coverage for MCBENCH with 4MB array.
6.3  Message transfer counts (×1000) comparison between RePs enhanced CLOMP and the original CLOMP for MCBENCH with 4B chunk.
6.4  Message transfer counts (×1000) comparison between RePs enhanced CLOMP and the original CLOMP for MCBENCH with 2KB chunk.
6.5  Message transfer counts (×1000) comparison between RePs enhanced CLOMP and the original CLOMP for MCBENCH with 4KB chunk.
6.6  Page fault handling costs comparison for the BT benchmark among the original CLOMP, the theoretical, and the ReP techniques enhanced CLOMP. The computation part of elapsed time is common to all compared items. The page fault handling costs of the original CLOMP are presented in seconds, and those of the others are presented as a reduction ratio (e.g. $(Orig - TReP)/Orig$).
6.7  Page fault handling costs reduction ratio ($(T^{orig}_{segv} - T_{segv})/T^{orig}_{segv}$) comparison for other NPB benchmarks.
6.8  Detailed $T_{segv}$ breakdown analysis of the IS Class A Benchmark for the ReP techniques. Overall $T_{segv}$ stands for overall CLOMP overhead. TMK Comm stands for the communication time spent by TMK for data transfer. TMK local stands for the local software overhead of the TMK layer. ReP Comm stands for the communication time spent on prefetching data. ReP local stands for the local software overhead introduced by using the ReP prefetch techniques. $T_{segv}$ is presented in seconds and its components are presented as a ratio to the overall $T_{segv}$.
6.9  Detailed $T_{segv}$ breakdown analysis of the IS Class C Benchmark for the ReP techniques. Overall $T_{segv}$ stands for overall CLOMP overhead. TMK Comm stands for the communication time spent by TMK for data transfer. TMK local stands for the local software overhead of the TMK layer. ReP Comm stands for the communication time spent on prefetching data. ReP local stands for the local software overhead introduced by using the ReP prefetch techniques. $T_{segv}$ is presented in seconds and its components are presented as a ratio to the overall $T_{segv}$.
6.10 Sequential elapsed time for LINPACK benchmarks.
6.11 Page fault handling costs comparison for LINPACK benchmarks among the original CLOMP, the theoretical, and the ReP techniques enhanced CLOMP. The computation part of elapsed time is common to all compared items. The page fault handling costs of the original CLOMP are presented in seconds, and those of the others are presented as a reduction ratio (e.g. $(Orig - TReP)/Orig$).
6.12 Page faults handling cost comparison between DReP and the original CLOMP for the optimized LINPACK benchmark with multiple threads per process. SEGV represents the ratio of page faults handling cost to the corresponding elapsed time; SEGV Lock in turn represents a ratio of pthread mutex cost within SEGV.
B.1  $T_{segv,local}$ (sec) for some NPB-OMP benchmarks with different number of processes.
B.2  $N^f_{total}$ for some NPB-OMP benchmarks with different number of processes.
B.3  $T^{segv}_{total}$ (sec) for LINPACK benchmarks with different number of processes.
B.4  $N^f_{total}$ for LINPACK benchmarks with different number of processes.
C.1  Elapsed Time (sec) of some NPB-OMP Benchmarks on one thread.
E.1  Complete bandwidth and latency measured by the communication layer (CAL) of CLOMP on XE.
E.2  Comparison of CAL and OpenMPI: bandwidth and latency measured on XE via GigE.
E.3  Comparison of CAL and OpenMPI: bandwidth and latency measured on XE via DDR IB.
