
Contents Acknowledgements Motivation Variational Monte Carlo (VMC) Prior Work CUDA Programming model CPU Implementation GPU

Implementation Results

GPU Acceleration of the Variational Monte Carlo Method for Many Body Physics
Rajagopalan, Kaushik Ragavan
Louisiana State University, Department of Electrical and Computer Engineering

04/09/2013


GPU Acceleration of VMC Method for Many Body Physics


1 Acknowledgements
2 Motivation
3 Variational Monte Carlo (VMC)
    VMC Algorithm
    Trial Wavefunction
    Difficulty of VMC
    Faster VMC
4 Prior Work
5 CUDA Programming model
6 CPU Implementation
    Algorithm
    Implementation
7 GPU Implementation
    Initial Steps
    Optimization
8 Results


Acknowledgements
Thesis Committee
- Dr. David Koppelman (Chair)
- Dr. Xin Li
- Dr. Juana Moreno

LA-Sigma Research group
- Dr. Mark Jarrell
- Dr. Ramanujam
- Dr. Ka Ming Tam
- Dr. Zhifeng Yun
- Dr. Sandeep Pathak
- Niladri Sengupta


Motivation
High-Performance computing
High-performance computing is one of the major areas shaping the future of large-scale simulation. Applications such as 3D nuclear test simulation, molecular dynamics, and quantum Monte Carlo are now developed on supercomputers using the latest computing technologies. Most of today's supercomputers are heterogeneous: multi-core CPUs paired with massively parallel Graphics Processing Units (GPUs).

Variational Monte Carlo (VMC)


- The VMC method is used in many-body physics to study the ground state properties of a system.
- Its accuracy depends on the variational parameters, which are tuned to give a better prediction of the wavefunction.
- Drawbacks: computationally expensive; requires many Monte Carlo sweeps to reach convergence; direct parallelization on CPU clusters does not scale with the system size.
- Solution: port the VMC method to a GPU using NVIDIA CUDA.
- Results: nearly 3.85x speedup compared to the MPI Fortran version and 19x speedup compared to the C++ version.


Definition

VMC is a direct application of Monte Carlo integration to strongly correlated systems. The variational approach has been used widely in different areas of condensed matter physics, in particular for the d-wave superconducting state of the high-$T_c$ cuprates at $T = 0$.


VMC Algorithm

1 Equilibration:
- Generate an initial configuration by using a random number generator for the electrons.
- For each electron in the configuration:
    1 Propose a move from $m_r$ to $m_r'$
    2 Compute the ratio $R = |\Psi(m_r')/\Psi(m_r)|^2$
    3 Perform the Metropolis acceptance comparison $\min(1, R)$
    4 Update the configuration if the move is accepted
    5 Else restore the configuration
- Repeat the above steps until the system equilibrates.

2 Accumulation:
- Repeat the same procedure as in Equilibration.
- Accumulate the local energy and other observable parameters at $m_r$ and $m_r'$.
- Perform the Metropolis acceptance comparison $\min(1, R)$.
- Repeat the above steps until the energies are accumulated.
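The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis code: the names `metropolis_step`, `sample`, `toy_psi` and `toy_propose` are hypothetical, and the "system" is a single particle hopping on a ring of four sites with made-up amplitudes, chosen so that the expected visit frequencies are easy to check by hand.

```python
import random

def metropolis_step(config, propose, psi, rng):
    """One trial move: propose, compute R = |psi(new)/psi(old)|^2, accept with min(1, R)."""
    trial = propose(config, rng)
    ratio = abs(psi(trial) / psi(config)) ** 2
    if rng.random() < min(1.0, ratio):
        return trial, True       # move accepted: configuration is updated
    return config, False         # move rejected: old configuration is kept

def sample(psi, propose, start, n_equil, n_accum, seed=0):
    """Equilibrate, then accumulate how often each configuration is visited."""
    rng = random.Random(seed)
    config = start
    for _ in range(n_equil):                 # equilibration: nothing is recorded
        config, _ = metropolis_step(config, propose, psi, rng)
    visits = {}
    for _ in range(n_accum):                 # accumulation: record the visited states
        config, _ = metropolis_step(config, propose, psi, rng)
        visits[config] = visits.get(config, 0) + 1
    return visits

# Hypothetical toy model: amplitudes (1, 2, 1, 2) on a 4-site ring, so |psi|^2
# puts weight 4/10 on each odd site and 1/10 on each even site.
def toy_psi(site):
    return (1.0, 2.0, 1.0, 2.0)[site]

def toy_propose(site, rng):
    return (site + rng.choice((-1, 1))) % 4

visits = sample(toy_psi, toy_propose, start=0, n_equil=1000, n_accum=20000)
odd_fraction = (visits.get(1, 0) + visits.get(3, 0)) / 20000
print("fraction of time on odd sites:", odd_fraction)
```

The chain spends about 80% of its time on the two high-amplitude sites, which is the behaviour the acceptance rule is designed to produce.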


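Trial Wavefunction
In quantum mechanics, the variational principle can be derived by expanding a normalized trial wavefunction, $\Psi_T$, in terms of the exact normalized eigenstates $\psi_i$ of the Hamiltonian:

$$\Psi_T = \sum_{i=0} c_i \psi_i, \qquad (1)$$

where the coefficients $c_i$ satisfy

$$\sum_{i=0} |c_i|^2 = 1.$$

The expectation value of the Hamiltonian ($\hat{H}$) with the trial wavefunction is given by

$$\langle \Psi_T | \hat{H} | \Psi_T \rangle = \Big\langle \sum_i c_i \psi_i \Big| \hat{H} \Big| \sum_j c_j \psi_j \Big\rangle = \sum_i |c_i|^2 \varepsilon_i, \qquad (2)$$

where $\varepsilon_i = \langle \psi_i | \hat{H} | \psi_i \rangle$.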
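The step from this expansion to the variational bound can be written out explicitly. The block below is a short worked derivation, assuming the eigenvalues are ordered so that $\varepsilon_0$ is the ground state energy (an ordering the slide does not state but which is the usual convention).

```latex
% Assumption: \varepsilon_0 \le \varepsilon_i for every i (ground state labelled i = 0).
\begin{align}
\langle \Psi_T | \hat{H} | \Psi_T \rangle
  &= \sum_i |c_i|^2 \varepsilon_i \\
  &\ge \varepsilon_0 \sum_i |c_i|^2 \\
  &= \varepsilon_0 .
\end{align}
```

The energy of any normalized trial wavefunction is therefore an upper bound on the ground state energy, which is why lowering the energy with respect to the variational parameters improves the trial state.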


Difficulty of VMC
The wavefunction in real space is given by $\Psi(r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$, where $r_{i\sigma}$ are the coordinates of the electrons on a lattice and $C \equiv (r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$ is a configuration of electrons.

To sum over all configurations, for example on a lattice with 100 sites and 50 $\uparrow$ and 50 $\downarrow$ electrons, we need to visit over $10^{60}$ configurations. To overcome this difficulty, we use the Monte Carlo method to perform the sum [2].

The energy function is given by

$$E(\alpha_i) = \frac{\sum_C \Psi^*(C)\, H\, \Psi(C)}{\sum_C \Psi^*(C)\, \Psi(C)}, \qquad (3)$$

where $P(C)$, the probability of the configuration, is given by

$$P(C) = \frac{|\Psi(C)|^2}{\sum_C \Psi^*(C)\, \Psi(C)}.$$

Obviously $\sum_C P(C) = 1$. We visit the high-probability configurations and add up their contributions. This is called importance sampling.
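The link between the energy expression and the probability $P(C)$ can be made explicit with one line of algebra. In the sketch below, the symbol $E_L(C)$ for the local energy is introduced only for this illustration; the slides refer to the local energy by name but do not assign it a symbol.

```latex
% E_L(C) is a label introduced here for the local energy H\Psi(C)/\Psi(C).
\begin{align}
\frac{\sum_C \Psi^*(C)\, H\, \Psi(C)}{\sum_{C'} \Psi^*(C')\, \Psi(C')}
  &= \sum_C \frac{|\Psi(C)|^2}{\sum_{C'} |\Psi(C')|^2}\;
     \frac{H\Psi(C)}{\Psi(C)} \\
  &= \sum_C P(C)\, E_L(C),
  \qquad E_L(C) \equiv \frac{H\Psi(C)}{\Psi(C)} .
\end{align}
```

Written this way, the energy is an average of the local energy over configurations drawn with probability $P(C)$, which is the quantity a Monte Carlo walk can estimate.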

Using importance sampling, accurate results for various quantities can be obtained with a smaller number of Monte Carlo sweeps, $N_{MC}$. The energy after $N_{MC}$ sweeps is given by

$$E(\alpha_i) = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{H\Psi(m_r)}{\Psi(m_r)}. \qquad (4)$$

For other operators, $O$:

$$\langle O \rangle = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{O\Psi(m_r)}{\Psi(m_r)}. \qquad (5)$$

For every Monte Carlo (MC) step, we need to evaluate the ratio

$$\frac{|\Psi(m_r')|^2}{|\Psi(m_r)|^2},$$

which is of complexity $O(N^3)$. In order to optimize with respect to $\alpha_i$, we need a complexity of $O(N)$. How can we reduce the algorithm complexity?


Faster VMC
Consider spinless fermions with a configuration $k$ described by a wavefunction $\Psi_k$ and a configuration $l$ described by a wavefunction $\Psi_l$. The two configurations differ only in the position of one electron $e$, at $e_l$ and $e_l'$ respectively:

$$\Psi_k = \begin{pmatrix} a_1(e_1) & \cdots & a_N(e_1) \\ \vdots & & \vdots \\ a_1(e_l) & \cdots & a_N(e_l) \\ \vdots & & \vdots \\ a_1(e_N) & \cdots & a_N(e_N) \end{pmatrix} \qquad (6)$$

Since $\Psi_k$ and $\Psi_l$ differ only in the entries belonging to the moved electron (a single row or column of the matrix), the ratio can be determined as [2]

$$\frac{\det[\Psi_l]}{\det[\Psi_k]} = \sum_{k} \Psi_{kl}\, \Psi^{-1}_{kl}, \qquad (7)$$

which is of $O(N)$. The calculation of $\Psi^{-1}$ is reduced to the order of $O(N^2)$ using the Sherman-Morrison-Woodbury (SMW) method.

Our approach
- We used the $t$-$t'$-$t''$ model to study the Hubbard model:

$$H = -\sum_{ij,\sigma} t_{ij}\, c^{\dagger}_{i\sigma} c_{j\sigma} + U \sum_i n_{i\uparrow} n_{i\downarrow} \qquad (8)$$

- The VMC method is tested on a tilted square lattice.
- We study the Resonating Valence Bond (RVB) model to describe high-$T_c$ superconductivity.
- The number of sites is given by $N_S = L^2 + 1$, where $L$ is odd.
- The number of electrons is given by $N_e = N_S (1.0 - x)$, where $x$ is the nominal hole doping.
- The number of electron pairs is given by $N_P = N_e / 2$.


Prior Work
CPU Implementation:
- A Fortran implementation of VMC was written by Dr. Sandeep Pathak, a former postdoc at the LA-Sigma Research group [Phys Rev.B].
- In his implementation, a Markov chain (MC) is simulated starting from an initial random configuration.

MPI Implementation
- An MPI-Fortran implementation was written by Dr. Sandeep Pathak to study the d-correlated systems using the variational approach [Phys Rev.B].
- Every MC is run by an MPI rank (process), and computing the average energy requires inter-processor and inter-node communication.
- Drawbacks: even though the MPI implementation gives better results, an algorithm such as VMC is a suitable candidate for a GPU because of the GPU's high floating-point throughput.


Prior Work
GPU Implementation

1 CUDA port
- Initial steps in porting the VMC code to CUDA were taken by Byron, an REU student at the LA-Sigma Research group.
- In his implementation, a CUDA thread performs an MC and the number of thread blocks equals the number of MCs.
- Drawbacks: lacks sufficient thread-level parallelism and makes no use of shared memory, constant memory, or caching techniques for the L1 cache on a GPU.

2 CUDA and FPGA port
- A GPU and FPGA implementation was done at the University of Tennessee-Knoxville.
- It compares a dual-core, dual-processor AMD Opteron CPU @ 2.2 GHz against an NVIDIA GPU and a Virtex-4 XC4VLX160 FPGA.
- Drawbacks: no effective utilization of memory resources on a GPU.


GPU Architecture
- A GPU allocates more of its transistors to computational units.
- A CPU dedicates more of its transistors to caching and control units.
- Figure 1 shows the difference between a CPU and a GPU architecture.

Figure: GPU Architecture


CUDA Programming model


Thread Hierarchy
- The CUDA threadIdx has one, two or three dimensions.
- A group of threads forms a block, and a block can have one, two or three dimensions.
- A group of thread blocks forms a grid of one, two or three dimensions.
- Thread blocks reside within a Streaming Multiprocessor (SM).
- The Fermi architecture supports 8 resident blocks per SM.

Figure: CUDA thread hierarchy



Algorithm

Algorithm 1: Overview of the VMC Method

    begin vmc
        Initialization()      // Generate lattice object, wavefunction and pairfunction
        Equilibration()       // Perform the Equilibration procedure
        Accumulation()        // Accumulate the energy and study the ground state
    end vmc

Algorithm 2: Equilibration

    procedure Equilibration()
        for i ← 0, nsweeps do
            Electron_Move()               // Call the electron move procedure
            Perform_MonteCarlo_Sweep()    // Determine the acceptance of the move and update the configuration
        end for
    end procedure


Algorithm 3: Accumulation

    procedure Accumulation()
        // Change the loop bounds to navesweeps and npsweep
        for i ← 0, navesweeps do
            // Repeat the procedures from Equilibration
            for j ← 0, npsweep do
                Electron_Move()
                Perform_MonteCarlo_Sweep()
            end for
            energy ← EnergyofConfig()     // Call the energy of configuration
            eneloc ← eneloc + energy      // Accumulate the energy
        end for
    end procedure
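The loop structure of the Equilibration and Accumulation procedures can be sketched as a small driver. The function `run_vmc` and its callback arguments are hypothetical stand-ins for the procedures named in the pseudocode, and dividing the accumulated energy by `navesweeps` at the end is an assumption about how the average is formed, not something the pseudocode states.

```python
def run_vmc(electron_move, monte_carlo_sweep, energy_of_config,
            nsweeps, navesweeps, npsweep):
    """Equilibrate for `nsweeps`, then measure the energy `navesweeps` times."""
    for _ in range(nsweeps):                 # Equilibration: no measurements
        electron_move()
        monte_carlo_sweep()

    eneloc = 0.0
    for _ in range(navesweeps):              # Accumulation
        for _ in range(npsweep):             # decorrelate between measurements
            electron_move()
            monte_carlo_sweep()
        eneloc += energy_of_config()         # accumulate the energy
    return eneloc / navesweeps               # assumed: plain average over measurements

# Usage with counting stubs in place of the real procedures.
calls = {"move": 0, "sweep": 0, "energy": 0}

def stub_move():
    calls["move"] += 1

def stub_sweep():
    calls["sweep"] += 1

def stub_energy():
    calls["energy"] += 1
    return 2.0

average = run_vmc(stub_move, stub_sweep, stub_energy,
                  nsweeps=3, navesweeps=4, npsweep=5)
print(average, calls)
```

With these bounds the stubs record 3 + 4 x 5 = 23 moves and 4 energy evaluations, which shows why the energy routine is called far less often than the move and sweep routines.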


Algorithm 4: Electron Move

    procedure Electron_Move()
        ipair ← rand(Npairs)                  // Pick a pair and spin at random
        ispin ← rand(2)
        oldsite ← plist(ipair, ispin)
        while (newsite == 0) do               // Find a site for the electron move
            if neiprob > 0.0 then
                newsite ← rand(Nsites)
            else
                newsite ← neiblist()
            end if
            if (latocc(newsite) != (BT or HL)) then
                spinflip ← true
                jpair ← whichpair(newsite × 2 + 1 − ispin)
            else
                newsite = 0
            end if
        end while
    end procedure

- neiblist maintains the neighbour information.
- latocc determines the spin at the newsite.
- whichpair(ilat, spin) gives the pair for the given spin at lattice site ilat.


Algorithm 5: Perform MonteCarlo Sweep

    procedure Perform_MonteCarlo_Sweep()
        if spinflip = true then
            dpbd1 ← DetPByDet(ipair, 2 × ispin − 1, newsite)      // Determine the ratio of determinants
            UpdateConfig(ipair, 2 × ispin − 1, newsite, dpbd1)
            dpbd2 ← DetPByDet(jpair, 1 − 2 × ispin, oldsite)
            dpbd ← dpbd1 × dpbd2
        else
            dpbd ← DetPByDet(ipair, 2 × ispin − 1, newsite)
        end if
        norm2 ← norm(dpbd)
        if norm2 ≥ uniformrand(0, 1) then                         // Move is accepted
            if spinflip = true then
                UpdateConfig(jpair, 1 − 2 × ispin, oldsite, dpbd2)
            else
                UpdateConfig(ipair, 2 × ispin − 1, newsite, dpbd)
            end if
        else
            Copyconfig()
        end if
    end procedure


Algorithm 6: Energy of a configuration

    procedure EnergyofConfig()
        saved = config()                      // Make a copy of the configuration
        tempconf = config()                   // Make a temporary configuration
        // Calculate the kinetic energy part of the Hamiltonian
        for ispin ← 0, 2 do
            spin ← 2 × ispin − 1
            for ipair ← 0, Npairs do
                isite ← plist(ipair × 2 + ispin)
                for jn ← 0, nneibs do
                    jsite ← neiblist(jn + nneibs × isite)
                    if latocc(jsite) = HL then
                        dpbd ← DetPByDet(ipair, spin, jsite)
                        enekloc ← enekloc + thop × real(dpbd)
                        spkeloc(ispin) ← spkeloc(ispin) + thop × real(dpbd)
                    end if
                end for
            end for
        end for
        // Compute the exchange term of the Hamiltonian
        otheloc ← 0                           // To hold the energy from the exchange term
        for inn ← 0, nearnsets do
            isite ← nearnp(inn × 2); jsite ← nearnp(inn × 2 + 1)
            ispin ← latocc(isite); jspin ← latocc(jsite)
            ipair ← whichpair(isite × 2 + (1 + ispin)/2)
            jpair ← whichpair(jsite × 2 + (1 + jspin)/2)
            if ispin × jspin < 0 then
                tempconf ← CopyConfig()
                dpbd ← DetPByDet(ipair, ispin, jsite); UpdateConfig()
                dpbd1 ← DetPByDet(jpair, jspin, isite)
                otheloc = otheloc + real(dpbd × dpbd1) + 1.0
            end if
        end for
        otheloc ← (Jij × otheloc)/(2)
        enektot ← enekloc; othetot ← otheloc
        energy ← (enektot + othetot)/Nsites
        kinenergy ← (spkeloc(0) + spkeloc(1))/Nsites
        othenergy ← othetot
        return energy
    end procedure


Data Structures
- Major data structures such as $\Psi^{-1}$, pairfunction and Plist are members of a class.
- The following table shows the data structures for different lattice models.

Table: Classes used in the Implementation of VMC

S.No  Class         Functionality
1     Sqlat         Holds lattice parameters such as Nsites, Nelecs, Npairs and hopping, etc.
2     Config        Contains the energy arrays, and functions for calculating the ratio of determinants and the update of $\Psi^{-1}$
3     Wavefunction  Contains the variational parameters required to generate the pairfunction or wavefunction
4     Montecarlo    Contains the lattice object, pairfunction and random seeds for a corresponding Markov chain


Memory Requirements for L
- The memory consumption increases with the lattice size, L.
- The following table shows the memory requirements.

Table: Memory requirements

Data structure  Size                                      Data type  L = 5      L = 9       L = 15
N_S             L^2 + 1                                   int        26         82          226
N_e             N_S (1.0 - x), where x is hole doping     int        24         74          204
N_pairs         N_e / 2                                   int        12         37          102
PsiInv          N_P^2                                     double     1.152 kB   10.952 kB   83.232 kB
Pairfunction    N_S^2                                     double     5.408 kB   53.792 kB   408.608 kB
Plist           2 N_P                                     int        96 B       296 B       816 B
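The byte counts in the table follow directly from the array shapes and element sizes. The sketch below recomputes them, assuming 8-byte doubles and 4-byte ints (which is what the table's figures imply); the function name `vmc_memory_bytes` is hypothetical, and the pair counts are taken from the table rather than derived from a doping value.

```python
DOUBLE_BYTES = 8   # assumed size of a double
INT_BYTES = 4      # assumed size of an int

def vmc_memory_bytes(n_sites, n_pairs):
    """Storage per configuration for the three main arrays, in bytes."""
    return {
        "PsiInv": n_pairs ** 2 * DOUBLE_BYTES,        # N_P x N_P doubles
        "Pairfunction": n_sites ** 2 * DOUBLE_BYTES,  # N_S x N_S doubles
        "Plist": 2 * n_pairs * INT_BYTES,             # one up and one down site per pair
    }

# Lattice sizes and pair counts as listed in the table.
for lattice, n_pairs in ((5, 12), (9, 37), (15, 102)):
    n_sites = lattice * lattice + 1
    print(lattice, n_sites, vmc_memory_bytes(n_sites, n_pairs))
```

The pairfunction grows with the square of the number of sites, so it dominates the footprint well before the inverse matrix does.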


Calculation of Ratio of Determinants using the Sherman-Morrison-Woodbury (SMW) formula

Function DetPByDet
- If the spin flip is up, a site is picked from the electron list and referred to as the oldsite. This site corresponds to a column of the pairfunction.
- The newsite, given by an input parameter, corresponds to a row of the pairfunction.
- Ipair represents the electron pair and corresponds to a column of $\Psi^{-1}$.
- Thus, a dot product of a row of the pairfunction and a column of $\Psi^{-1}$ is calculated.
- This costs $N_P$ cache misses for $\Psi^{-1}$ and one cache miss for the pairfunction in the L1 cache.
- The access pattern is reversed for a down spin flip: Ipair corresponds to a row of $\Psi^{-1}$ and the newsite corresponds to a column of the pairfunction.
- A total of $N_P$ fused multiply-add floating-point operations (FLOPS) is needed to compute the dot product.


Memory Access Pattern


Figure: (a) $\Psi^{-1}$, (b) Plist, (c) Pairfunction


Update of a Configuration

Function UpdateConfig
- The Ipair-th column of $\Psi^{-1}$ is updated using dpbd (from Function DetPByDet).
- A dot product is computed between a column of $\Psi^{-1}$ and a row of the pairfunction.
- The remaining columns of $\Psi^{-1}$ are updated using the dot product and the Ipair-th column of $\Psi^{-1}$.
- The update of $\Psi^{-1}$ is of order $O(N_P^2)$.


Memory Access Pattern


Figure: (a) $\Psi^{-1}$, (b) Dot product, (c) $\Psi^{-1}$


Algorithm Analysis
Execution Time in FLOPS
The FLOPS are calculated on the assumption that a spin flip occurs 50% of the time in both the Equilibration and Accumulation stages.

Table: Estimate of FLOPS for the VMC method

Function                   FLOPS                                               Operation
DetPByDet                  Npairs × MUL                                        Either case of spin flip
UpdateConfig               Npairs × DIV                                        For the Ipair-th row or col of $\Psi^{-1}$
                           + Npairs^2 × (MUL + ADD)                            For the dot product calculation
                           + Npairs^2 × (MUL + SUB)                            For updating the other rows or cols of $\Psi^{-1}$
CopyConfig                 NIL                                                 NIL
Energy of a Configuration  2 × Npairs × Nneibs × (DetPByDet + MUL + ADD)       Kinetic energy calculation
                           + Nearnsets × (DetPByDet)
                           + Nearnsets × UpdateConfig
                           + Nearnsets × (DetPByDet)
                           + Nearnsets × (ADD + SUB)                           Accumulation


Results

Execution Time:
- The results were obtained on an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.
- The cache hierarchy is: 32 kB data + 32 kB instruction L1 cache and 256 kB L2 cache per core.
- The following tables show the execution time for L = 5, 9, 15.


Results for L = 5, 9

Function                                  Execution Time (s)
                                          L = 5     L = 9
Equilibration
  1. Electron Move                        0.02      0.01
  2. Spinflip = true:   DetPbyDet         0.01      0.11
                        UpdateConfig      0.03      1.32
  3. Spinflip = false:  DetPbyDet         0.00      0.02
                        UpdateConfig      0.00      0.04
  4. Copyconfig                           0.04      0.74
     Move rejected                        0.02      0.68
  5. Total                                0.24      3.04
Accumulation
  1. Electron Move                        0.01      0.07
  2. Spinflip = true:   DetPbyDet         0.01      0.07
                        UpdateConfig      0.07      1.44
  3. Spinflip = false:  DetPbyDet         0.01      0.00
                        UpdateConfig      0.02      0.09
  4. Energy calculation: DetPbyDet        0.02      0.17
                        UpdateConfig      0.08      1.85
                        Total             0.18      2.91
  5. Copyconfig                           0.08      0.62
     Move rejected                        0.03      0.64
  6. Total                                0.46      6.03
Total VMC                                 0.70      9.07


Results for L = 15

Function                                  Execution Time (s)
                                          L = 15
Equilibration
  1. Electron Move                        0.19
  2. Spinflip = true:   DetPbyDet         0.51
                        UpdateConfig      25.15
  3. Spinflip = false:  DetPbyDet         0.07
                        UpdateConfig      0.96
  4. Copyconfig                           12.29
     Move rejected                        11.99
  5. Total                                51.74
Accumulation
  1. Electron Move                        0.18
  2. Spinflip = true:   DetPbyDet         0.44
                        UpdateConfig      25.17
  3. Spinflip = false:  DetPbyDet         0.03
                        UpdateConfig      0.98
  4. Energy calculation: DetPbyDet        0.80
                        UpdateConfig      36.50
                        Total             52.67
  5. Copyconfig                           12.30
     Move rejected                        11.61
  6. Total                                104.05
Total VMC                                 156.02


GPU Acceleration of the VMC Method
- From the CPU implementation we determined the best candidates for parallelization on a GPU.
- Functions such as the ratio of determinants, the update, and the energy of a configuration are computationally intensive and exhibit data-level parallelism.
- In our implementation, a block of CUDA threads handles a Markov chain (MC).
- Every chain has its own copy of a configuration, with members such as $\Psi^{-1}$, pairfunction and plist.
- Every chain starts with an initial random configuration and performs Equilibration and Accumulation.
- The energy values are accumulated from all MCs and averaged on the CPU.
- The ground state $|G\rangle$ is then optimized with respect to the variational parameters.


Workflow


Naïve Implementation

- Initial steps in porting the VMC code to CUDA were taken by Byron, an intern student at the LA-Sigma Research group.
- Each MC is handled by a CUDA thread and the number of thread blocks equals the number of MCs.
- Drawbacks:
    - The code cannot use more than 1/32 of the GPU's computing potential per warp.
    - Higher memory access latency due to global memory reads and writes.
    - No use of high-speed shared and constant memory, and unoptimized L1 and L2 cache access.


Execution Configuration
- The number of threads per MC depends on $N_P$ of the given lattice.
- The complexities of the functions are: DetPByDet - $O(N_P)$, Update - $O(N_P^2)$ and Copyconfig - $O(2 N_S)$.
- The number of threads should be in the range $N_P \le \text{blocksize} \le N_P^2$, rounded to the nearest multiple of a warp.
- The access pattern differs based on a spin flip for different functions.
- A parameter called threads-per-col is set to determine the blocksize: Blocksize = Round(threads-per-col × $N_P$, nearest multiple of warp)

Table: Threads per MC and number of MCs per SM

Lattice  Npairs  Threads-per-col  No. threads per MC  No. resident MCs per SM
L = 5    12      4                64                  24
L = 9    37      4                160                 9
L = 15   102     4                416                 3
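The blocksize rule and the table entries can be reproduced with a few lines of arithmetic. The sketch below assumes a warp size of 32 threads, a limit of 1536 resident threads per SM (the figure used in the later occupancy estimate), and rounding up to the next warp multiple, which is the reading of "nearest multiple" that matches the table. The function names are hypothetical, and only the thread limit is modelled, not the per-SM limit on resident blocks.

```python
WARP_SIZE = 32              # threads per warp
MAX_THREADS_PER_SM = 1536   # assumed resident-thread limit per SM

def block_size(n_pairs, threads_per_col=4):
    """Threads per Markov chain, rounded up to a whole number of warps."""
    threads = threads_per_col * n_pairs
    warps = -(-threads // WARP_SIZE)        # ceiling division
    return warps * WARP_SIZE

def resident_chains(n_pairs, threads_per_col=4):
    """Markov chains that fit on one SM when only the thread limit is considered."""
    return MAX_THREADS_PER_SM // block_size(n_pairs, threads_per_col)

for lattice, n_pairs in ((5, 12), (9, 37), (15, 102)):
    print(lattice, block_size(n_pairs), resident_chains(n_pairs))
```

For L = 9, for example, 4 x 37 = 148 threads round up to 160, so nine chains fit within 1536 threads.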


Cache behavior

Table: Speculation for L = 5

GPU Memory        $\Psi^{-1}$ (double), 1.152 KB   Pairfunction (double), 5.408 KB     Plist (int), 96 B
L1 cache/shared   Yes                              Yes                                 Yes
L2 cache          No                               Yes. Due to sharing between MCs     No
Constant Memory   No                               No                                  No

Table: Speculation for L = 9

GPU Memory        $\Psi^{-1}$ (double), 10.952 KB  Pairfunction (double), 53.792 KB    Plist (int), 296 B
L1 cache/shared   Yes                              No                                  Yes
L2 cache          No                               Yes. Due to sharing between MCs     No
Constant Memory   No                               No                                  No

Table: Speculation for L = 15

GPU Memory        $\Psi^{-1}$ (double), 83.232 KB  Pairfunction (double), 408.608 KB   Plist (int), 816 B
L1 cache/shared   Yes, but partly cached           No                                  Yes
L2 cache          Yes. Exceeds L1 cache size       Yes. Due to sharing between MCs     No
Constant Memory   No                               No                                  No


Summary of Memory Prediction

- Blocksize = 64 and we have three copies of a configuration.
- Total MCs per SM by thread count will be 1536/64 = 24.
- Memory consumption will be 1.152 × 8 = 9.216 kB per copy, and 9.216 × 3 = 27.648 kB, since there are three copies of a configuration.
- The pairfunction will be 5.408 kB and common to all configurations.
- Total MCs per SM by memory will be 48/27.648, i.e. one MC per SM.

Table: L = 5 (threads per MC: 64)

Config   $\Psi^{-1}$   Pairfunc                       Plist
Config1  L1            Constant, common to all MCs    L1
Config2  L1                                           L1
Config3  L1                                           L1
Total    27.648 kB     5.408 kB                       288 B

SM occupancy: 1 MC


Initial Parallelization
Figure panels: DetPByDet - Up spin; DetPByDet - Down spin


Initial Parallelization
Figure panels: UpdateConfig - Up spin; UpdateConfig - Down spin


Elimination of Redundant copies of a configuration

- We removed the redundant copies of a configuration.
- An MC now maintains a single copy of a configuration.
- The following table predicts the MC occupancy per SM: 48/9.216 = 5.20, i.e. 5 MCs.

Table: Occupancy calculation for L = 5 (threads per MC: 64)

Config   $\Psi^{-1}$   Pairfunc           Plist
Config1  L1            Constant Memory    L1
Total    9.216 kB      5.408 kB           96 B

SM occupancy: 5 MCs
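The occupancy estimates for L = 5 are a single division each, and the sketch below repeats that arithmetic. It assumes 48 kB of L1 storage per SM (the numerator used in the slides) and takes the 9.216 kB per-copy figure as given; the function name `chains_per_sm` is hypothetical.

```python
L1_KB = 48.0   # assumed L1 capacity per SM, as used in the slides' estimate

def chains_per_sm(per_copy_kb, copies):
    """Whole Markov chains whose configuration copies fit in L1 at once."""
    return int(L1_KB // (per_copy_kb * copies))

three_copies = chains_per_sm(9.216, copies=3)   # before removing redundant copies
single_copy = chains_per_sm(9.216, copies=1)    # after removing redundant copies
print(three_copies, single_copy)
```

Dropping from three copies to one raises the memory-limited occupancy from 1 to 5 chains per SM, which is the motivation for the change.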


Optimized Memory Access Pattern


Figure panels: DetPByDet - Up spin; DetPByDet - Down spin


Optimized Memory Access Pattern


Figure panels: UpdateConfig - Up spin; UpdateConfig - Down spin


Drawbacks

- A row or a column is accessed in DetPByDet.
- In the update of $\Psi^{-1}$, first one row or column and then the remaining rows or columns are updated.
- The three function calls DetPByDet, UpdateConfig and DetPByDet (in that order) can be reduced to a single function.
- Solution: merge DetPByDet and UpdateConfig to form a streamlined function.


Streamlined Function
Figure panels: Up spin; Down spin


Comparison between a CPU and GPU


- The CUDA code was tested on an NVIDIA Tesla M2070 GPU and benchmarked against an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.
- The MCs are performed sequentially by a CPU core.
- Increasing the blocking factor gave better performance.

Bfactor = 8

Lattice  No. MCs  CPU Exec Time (s)  GPU Exec Time (s)  Speedup
L = 5    32       22.66              3.19               7.26
L = 9    32       287.13             17.37              16.72
L = 15   32       4920.22            309.72             15.88

Bfactor = 16

Lattice  No. MCs  CPU Exec Time (s)  GPU Exec Time (s)  Speedup
L = 5    32       22.66              2.48               9.13
L = 9    32       287.13             14.44              19.88
L = 15   32       4920.22            256.83             19.43


Comparison between MPI-CPU and GPU


- We compared our CUDA code with the MPI code implemented by Sandeep using Fortran 90.
- The MPI code designates one MPI rank per MC.
- The results were compared on the Philip supercomputer at HPC, LSU.

Table: Comparison of MPI vs GPU Performance

Lattice  No. MCs  MPI Exec Time (s)  GPU Exec Time (s)  Speedup
L = 5    32       4.89               3.19               1.53
L = 9    32       63.66              17.37              3.66
L = 15   32       1193.68            309.72             3.85

Table: Philip Node configuration

Nodal Configuration    Two 2.93 GHz Quad Core Nehalem Xeon 64-bit Processors
No. Processors/node    8
Total No. Nodes        32
DRAM                   24GB @ 1333MHz
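The speedup column of the MPI vs GPU comparison is the ratio of the two execution times. The short sketch below recomputes it from the times listed in that comparison; the helper name `speedup` is hypothetical and the values are copied from the table, not re-measured.

```python
def speedup(baseline_seconds, gpu_seconds):
    """How many times faster the GPU run is than the baseline run."""
    return baseline_seconds / gpu_seconds

# Execution times in seconds, keyed by lattice size L, as listed in the table.
mpi_times = {5: 4.89, 9: 63.66, 15: 1193.68}
gpu_times = {5: 3.19, 9: 17.37, 15: 309.72}

speedups = {L: round(speedup(mpi_times[L], gpu_times[L]), 2) for L in mpi_times}
print(speedups)
```

The ratio grows with lattice size, consistent with the larger lattices giving each thread block more work per Markov chain.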


GPU Results in depth

Table: L = 5

Function                         No. Cycles
Electron move                    420361934
DPBD    DPBD-UP                  95922170
        DPBD-DN                  97922630
        Total                    193844800
UPD     UPD-UP                   322257414
        UPD-DN                   259125546
        Total                    581382960
Energy                           969450046
Streamlined (DPBD-UPD-DPBD)      1630512656

Table: L = 9

Function                         No. Cycles
Electron move                    1398769542
DPBD    DPBD-UP                  321308398
        DPBD-DN                  351500708
        Total                    672809106
UPD     UPD-UP                   1275117358
        UPD-DN                   1476716774
        Total                    2751834132
Energy                           6105049222
Streamlined (DPBD-UPD-DPBD)      10717256244


Figure: L = 5


Figure: L = 9


Table: L = 15

Function                         No. Cycles
Electron move                    4958484770
DPBD    DPBD-UP                  890322994
        DPBD-DN                  932234278
        Total                    1822557272
UPD     UPD-UP                   5874557324
        UPD-DN                   7310109920
        Total                    13184667244
Energy                           48836694164
Streamlined (DPBD-UPD-DPBD)      8600610984504


Figure: L = 15


Questions ?


Thank You
