GPU Acceleration of the Variational Monte Carlo Method for Many Body Physics
Rajagopalan, Kaushik Ragavan
Louisiana State University, Department of Electrical and Computer Engineering
04/09/2013
Kaushik Ragavan
Contents Acknowledgements Motivation Variational Monte Carlo (VMC) Prior Work CUDA Programming model CPU Implementation GPU Implementation Results
Outline
Acknowledgements
Motivation
Variational Monte Carlo (VMC): VMC Algorithm, Trial Wavefunction, Difficulty of VMC, Faster VMC
Prior Work
CUDA Programming model
CPU Implementation: Algorithm, Implementation
GPU Implementation: Initial Steps, Optimization
Results
Acknowledgements
Thesis Committee
Dr. David Koppelman - Chair
Dr. Xin Li
Dr. Juana Moreno
Motivation
High-Performance computing
High-performance computing is one of the major areas making inroads into the future of large-scale simulation. Applications such as 3D nuclear test simulations, Molecular Dynamics, and Quantum Monte Carlo simulations are now developed on supercomputers using the latest computing technologies. Most of today's supercomputers are heterogeneous: multi-core CPU(s) equipped with massively parallel Graphics Processing Units (GPUs).
Definition
VMC is a direct application of Monte Carlo integration to strongly correlated systems. The variational approach has been used widely in different areas of condensed matter physics, in particular for the d-wave superconducting state of the high-$T_c$ cuprates at T = 0.
VMC Algorithm
1. Equilibration:
Generate an initial configuration by using a random number generator for electrons.
For each electron in the configuration:
  1. Propose a move from $m_r$ to $m_r'$
  2. Compute the ratio $R = |\Psi(m_r') / \Psi(m_r)|^2$
  3. Perform the Metropolis acceptance comparison $\min(1, R)$
  4. Update the configuration if the move is accepted
  5. Else restore the configuration
Repeat the above steps until the system equilibrates.
2. Accumulation:
Repeat the same procedure from Equilibration.
Accumulate the local energy and other observable parameters at $m_r$ and $m_r'$.
Perform the Metropolis acceptance comparison $\min(1, R)$.
Repeat the above steps until energies are accumulated.
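The two stages above can be illustrated with a minimal sketch on a toy problem. This is not the lattice code from the talk: it assumes a single particle in a 1D harmonic oscillator with a Gaussian trial wavefunction, and all function names and parameters (`run_vmc`, `alpha`, `step`, sweep counts) are hypothetical choices made only for illustration.

```python
import math
import random

def trial_wavefunction(x, alpha):
    """Toy Gaussian trial wavefunction exp(-alpha * x^2) (assumption, not the RVB state)."""
    return math.exp(-alpha * x * x)

def local_energy(x, alpha):
    """Local energy (H psi) / psi for H = -1/2 d^2/dx^2 + 1/2 x^2."""
    return alpha + x * x * (0.5 - 2.0 * alpha * alpha)

def metropolis_step(x, alpha, step, rng):
    """Propose x -> x' and accept with probability min(1, |psi(x')/psi(x)|^2)."""
    x_new = x + rng.uniform(-step, step)
    ratio = (trial_wavefunction(x_new, alpha) / trial_wavefunction(x, alpha)) ** 2
    if rng.random() < min(1.0, ratio):
        return x_new  # move accepted: update the configuration
    return x          # move rejected: keep the old configuration

def run_vmc(alpha, n_equil=2000, n_accum=20000, step=1.0, seed=0):
    """Equilibrate without measuring, then accumulate the local energy."""
    rng = random.Random(seed)
    x = rng.uniform(-1.0, 1.0)      # initial random configuration
    for _ in range(n_equil):        # equilibration stage
        x = metropolis_step(x, alpha, step, rng)
    total = 0.0
    for _ in range(n_accum):        # accumulation stage
        x = metropolis_step(x, alpha, step, rng)
        total += local_energy(x, alpha)
    return total / n_accum

energy_exact_trial = run_vmc(0.5)   # alpha = 0.5 is the exact ground state of the toy model
energy_other_trial = run_vmc(0.4)   # any other alpha gives a higher variational energy on average
```

For `alpha = 0.5` the local energy is constant, so the estimate has zero variance; this is the usual sanity check for a VMC sampler before moving to a harder wavefunction.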
Trial Wavefunction
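In quantum mechanics, the variational principle can be derived by expanding a normalized trial wavefunction, $\Psi_T$, in terms of the exact normalized eigenstates of the Hamiltonian.
$$\Psi_T = \sum_{i=0}^{\infty} c_i \Psi_i, \qquad \sum_{i=0}^{\infty} |c_i|^2 = 1 \qquad (1)$$
$$\langle \Psi_T | H | \Psi_T \rangle = \Big\langle \sum_i c_i \Psi_i \Big| H \Big| \sum_j c_j \Psi_j \Big\rangle = \sum_i |c_i|^2 \varepsilon_i, \qquad (2)$$
where $\varepsilon_i = \langle \Psi_i | H | \Psi_i \rangle$.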
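Difficulty of VMC
The wavefunction in real space is given by $\Psi(r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$, where $r_{i\sigma}$ are the coordinates of the electrons on a lattice and $C \equiv (r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$ is a configuration of electrons. To sum over all configurations, for example for a lattice with 100 sites and 50 $\uparrow$ and 50 $\downarrow$ electrons, we need to visit over $10^{60}$ configurations. To overcome this difficulty, we use the Monte Carlo method to perform the sum [2]. The energy function is given by
$$E(\{\alpha_i\}) = \frac{\sum_C \Psi^*(C)\, H\, \Psi(C)}{\sum_C \Psi^*(C)\, \Psi(C)} \qquad (3)$$
$$P(C) = \frac{|\Psi(C)|^2}{\sum_{C'} |\Psi(C')|^2}$$
so that $E = \sum_C P(C)\, H\Psi(C) / \Psi(C)$. Obviously $\sum_C P(C) = 1$. We visit the high-probability configurations and add up their contributions. This is called importance sampling.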
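Using importance sampling, accurate results for various quantities can be obtained with a smaller number of Monte Carlo sweeps, $N_{MC}$. The energy after $N_{MC}$ sweeps is given by
$$E(\{\alpha_i\}) = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{H\Psi(m_r)}{\Psi(m_r)} \qquad (4)$$
and, for a general observable $O$,
$$\langle O \rangle = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{O\Psi(m_r)}{\Psi(m_r)} \qquad (5)$$
where $m_r$ is the configuration at sweep $k$.
For every Monte Carlo (MC) step, we need to evaluate the ratio $|\Psi(m_r')|^2 / |\Psi(m_r)|^2$, which is of complexity $O(N^3)$. In order to optimize with respect to $\alpha_i$, we need a complexity of $O(N)$. How can we reduce the algorithm complexity?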
Faster VMC
Consider spinless fermions with a configuration $k$ given by a wavefunction $\Psi_k$ and a configuration $l$ given by a wavefunction $\Psi_l$. The two configurations differ only in the position of one electron, $e_l \to e_l'$.
$$\Psi_k = \begin{pmatrix} a_1(e_1) & \cdots & a_N(e_1) \\ \vdots & & \vdots \\ a_1(e_l) & \cdots & a_N(e_l) \\ \vdots & & \vdots \\ a_1(e_N) & \cdots & a_N(e_N) \end{pmatrix} \qquad (6)$$
Since $\Psi_k$ and $\Psi_l$ differ only in the entries of one electron (the row of $e_l$ in the layout above), the ratio can be determined as [2]
$$\frac{\det[\Psi_l]}{\det[\Psi_k]} = \sum_{j=1}^{N} [\Psi_l]_{lj}\, [\Psi_k^{-1}]_{jl}, \qquad (7)$$
which is of $O(N)$. The calculation of $\Psi^{-1}$ is reduced to the order of $O(N^2)$ using the Sherman-Morrison-Woodbury (SMW) method.
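As a rough illustration of the two costs quoted above, the following sketch computes the determinant ratio as a single dot product and then refreshes the inverse with a rank-one (Sherman-Morrison) update. It assumes the changed electron corresponds to one row of the matrix; the function names `det_ratio` and `update_inverse` and the 2x2 example values are hypothetical and are not taken from the thesis code.

```python
def det_ratio(inverse, row, new_row):
    """det(A') / det(A) when row `row` of A is replaced by `new_row`: one O(N) dot product."""
    return sum(new_row[j] * inverse[j][row] for j in range(len(new_row)))

def update_inverse(inverse, row, new_row, ratio):
    """Inverse of A' from the inverse of A in O(N^2) via the Sherman-Morrison formula."""
    n = len(inverse)
    # w[c] = dot product of the new row with column c of the old inverse
    w = [sum(new_row[m] * inverse[m][c] for m in range(n)) for c in range(n)]
    updated = [[0.0] * n for _ in range(n)]
    for j in range(n):
        scaled = inverse[j][row] / ratio
        for c in range(n):
            if c == row:
                updated[j][c] = scaled                      # the column tied to the moved electron
            else:
                updated[j][c] = inverse[j][c] - scaled * w[c]  # all remaining columns
    return updated

# Example: A = [[2, 1], [1, 3]] has det 5 and the inverse below.
a_inv = [[0.6, -0.2], [-0.2, 0.4]]
new_row = [4.0, 1.0]                 # replace row 0, giving A' = [[4, 1], [1, 3]] with det 11
ratio = det_ratio(a_inv, 0, new_row)
new_inv = update_inverse(a_inv, 0, new_row, ratio)
```

The point of the design is that a rejected move costs only the $O(N)$ ratio, while the $O(N^2)$ update is paid only when a move is accepted.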
Our approach
We used the $t$-$t'$-$t''$ model to study the Hubbard model.
$$H = -\sum_{ij\sigma} t_{ij}\, c^{\dagger}_{i\sigma} c_{j\sigma} + U \sum_i n_{i\uparrow} n_{i\downarrow} \qquad (8)$$
The VMC method is tested on a tilted square lattice. We study the Resonating Valence Bond (RVB) model to describe high-$T_c$ superconductivity.
The number of sites is given by $N_S = L^2 + 1$, where $L$ is odd.
The number of electrons is given by $N_e = N_S (1.0 - x)$, where $x$ is the nominal hole doping.
The number of electron pairs is given by $N_P = N_e / 2$.
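A short sketch of the lattice arithmetic above. Two things here are assumptions and not stated on the slide: the doping value `x = 0.1` and the rounding of the pair count to the nearest integer. With those assumptions the pair counts come out as 12, 37 and 102 for L = 5, 9, 15, which are the values that appear later in the execution configuration table. The function name `lattice_sizes` is hypothetical.

```python
def lattice_sizes(L, x):
    """Return (N_S, N_e, N_P) for a tilted square lattice with odd L and hole doping x."""
    if L % 2 == 0:
        raise ValueError("L must be odd")
    n_sites = L * L + 1
    # Assumption: round to the nearest whole pair so that N_e is even.
    n_pairs = round(n_sites * (1.0 - x) / 2.0)
    n_electrons = 2 * n_pairs
    return n_sites, n_electrons, n_pairs

for L in (5, 9, 15):
    n_sites, n_electrons, n_pairs = lattice_sizes(L, 0.1)
    print(f"L={L}: N_S={n_sites}, N_e={n_electrons}, N_P={n_pairs}")
```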
Prior Work
CPU Implementation:
A Fortran implementation of VMC was done by Dr. Sandeep Pathak, a former postdoc at the LA-Sigma Research group [Phys. Rev. B]. In his implementation, a Markov chain (MC) is simulated starting from an initial random configuration.
MPI Implementation
An MPI-Fortran implementation was done by Dr. Sandeep Pathak to study the d-correlated systems using the variational approach [Phys. Rev. B]. Every MC is run by an MPI rank or process, and computing the average energy requires inter-processor and inter-node communication.
Drawbacks: even though the MPI implementation gives better results, an algorithm such as VMC is a suitable candidate for the GPU because of the GPU's high floating-point throughput.
Prior Work
GPU Implementation
CUDA port
The initial steps in porting the VMC code to CUDA were taken by Byron, an REU student at the LA-Sigma Research group. In his implementation, a CUDA thread performs an MC and the number of thread blocks equals the number of MCs.
Drawbacks: it lacks sufficient thread-level parallelism, does not use shared or constant memory, and does not apply caching techniques for the L1 cache on a GPU.
GPU Architecture
More transistors are allocated to computational units on a GPU, whereas a CPU dedicates more transistors to caching and control units. Figure 1 shows the difference between a CPU and a GPU architecture.
Algorithm Implementation
Algorithm
Algorithm 1: Overview of the VMC Method
  1. Generate lattice object, wavefunction and pairfunction
  2. Perform the Equilibration procedure
  3. Accumulate the energy and study the ground state
Algorithm 2: Equilibration
procedure Equilibration()
    for i ← 0, nsweeps do
        Electron Move()                ▷ Call the electron move procedure
        Perform MonteCarlo Sweep()     ▷ Determine the acceptance of the move and update the configuration
    end for
end procedure
Algorithm 3: Accumulation
procedure Accumulation()
    for i ← 0, navesweeps do           ▷ Change the loop bounds to navesweeps and npsweep
        for j ← 0, npsweep do
            Electron Move()            ▷ Repeat the procedures from Equilibration
            Perform MonteCarlo Sweep()
        end for
        energy ← EnergyofConfig()
        eneloc ← eneloc + energy
    end for
end procedure
Algorithm 4: Electron Move
procedure Electron Move()
    ipair ← rand(Npairs)               ▷ Pick a pair and spin at random
    ispin ← rand(2)
    oldsite ← plist(ipair, ispin)
    while (newsite == 0) do
        if neiprob > 0.0 then
            newsite ← rand(Nsites)
        else
            newsite ← neiblist()
        end if
        if (latocc(newsite) != (BT or HL)) then
            spinflip ← true
            jpair ← whichpair(newsite * 2 + 1 - ispin)
        else
            newsite = 0
        end if
    end while
end procedure
neiblist maintains the neighbour information.
latocc determines the spin at the newsite.
whichpair(ilat, spin) gives the pair with the given spin at lattice site ilat.
Algorithm 5: Perform MonteCarlo Sweep
procedure Perform MonteCarlo Sweep()
    if spinflip = true then
        dpbd1 ← DetPByDet(ipair, 2 * ispin - 1, newsite)
        UpdateConfig(ipair, 2 * ispin - 1, newsite, dpbd1)
        dpbd2 ← DetPByDet(jpair, 1 - 2 * ispin, oldsite)
        dpbd ← dpbd1 * dpbd2
    else
        dpbd ← DetPByDet(ipair, 2 * ispin - 1, newsite)
    end if
    norm2 ← norm(dpbd)
    if norm2 ≥ uniformrand(0, 1) then
        if spinflip = true then
            UpdateConfig(jpair, 1 - 2 * ispin, oldsite, dpbd2)
        else
            UpdateConfig(ipair, 2 * ispin - 1, newsite, dpbd)
        end if
    else
        Copyconfig()
    end if
end procedure
Algorithm 6: Energy of a configuration
procedure EnergyofConfig()
    saved = config()                   ▷ Make a copy of the configuration
    tempconf = config()                ▷ Make a temporary configuration
    for ispin ← 0, 2 do                ▷ Calculate the kinetic energy part of the Hamiltonian
        spin ← 2 * ispin - 1
        for ipair ← 0, Npairs do
            isite ← plist(ipair * 2 + ispin)
            for jn ← 0, nneibs do
                jsite ← neiblist(jn + nneibs * isite)
                if latocc(jsite) = HL then
                    dpbd ← DetPByDet(ipair, spin, jsite)
                    enekloc ← enekloc + thop * real(dpbd)
                    spkeloc(ispin) ← spkeloc(ispin) + thop * real(dpbd)
                end if
            end for
        end for
    end for
    ▷ Compute the exchange term of the Hamiltonian
    otheloc ← 0                        ▷ To hold the energy from the exchange term
    for inn ← 0, nearnsets do
        isite ← nearnp(inn * 2); jsite ← nearnp(inn * 2 + 1)
        ispin ← latocc(isite); jspin ← latocc(jsite)
        ipair ← whichpair(isite * 2 + (1 + ispin)/2)
        jpair ← whichpair(jsite * 2 + (1 + jspin)/2)
        if ispin * jspin < 0 then
            tempconf ← CopyConfig()
            dpbd ← DetPByDet(ipair, ispin, jsite); UpdateConfig()
            dpbd1 ← DetPByDet(jpair, jspin, isite)
            otheloc = otheloc - real(dpbd * dpbd1) + 1.0
        end if
    end for
    otheloc ← (Jij * otheloc)/(2)
    enektot ← enekloc; othetot ← otheloc
    energy ← (enektot + othetot)/Nsites
    kinenergy ← (spkeloc(0) + spkeloc(1))/Nsites
    othenergy ← othetot
    return energy
end procedure
Data Structures
Major data structures such as $\Psi^{-1}$, pairfunction and Plist are members of a class. The following table shows the data structures for different lattice models.
Memory Requirements for L
The memory consumption increases with the lattice size, L. The following table shows the memory requirements.
Function DetPByDet
If the spin flip is up, a site is picked from the electron list and referred to as the oldsite. This site corresponds to the column of the pairfunction.
A newsite, given by an input parameter, corresponds to the row of the pairfunction.
Ipair represents the electron pair and corresponds to the column of $\Psi^{-1}$.
Thus, a dot product of a row of the pairfunction and a column of $\Psi^{-1}$ is calculated.
This incurs $N_P$ cache misses for $\Psi^{-1}$ and one cache miss for the pairfunction in the L1 cache.
The access pattern is reversed for a down spin flip: Ipair corresponds to the row of $\Psi^{-1}$ and the newsite corresponds to the column of the pairfunction.
A total of $N_P$ fused multiply-add floating-point operations (FLOPs) are needed to compute the dot product.
Update of a Configuration
Function UpdateConfig
The Ipair-th column of $\Psi^{-1}$ is updated using dpbd (from Function DetPByDet).
A dot product is computed between a column of $\Psi^{-1}$ and a row of the pairfunction.
The remaining columns of $\Psi^{-1}$ are updated using the dot product and the Ipair-th column of $\Psi^{-1}$.
The update of $\Psi^{-1}$ is of order $O(N_P^2)$.
Algorithm Analysis
Execution Time in FLOPS
The FLOPs are calculated based on the assumption that a spin flip will occur 50% of the time in both the Equilibration and Accumulation stages.
Results
Execution Time:
The results were obtained on an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz. The cache hierarchy is: 32 kB data + 32 kB instruction L1 cache and 256 kB L2 cache per core. The following tables show the execution time for L = 5, 9, 15.
Results for L = 5, 9
Execution Time (s)

Function                          L = 5     L = 9
Equilibration
  1. Electron Move                0.02      0.01
  2. Spinflip = true:
       DetPbyDet                  0.01      0.11
       UpdateConfig               0.03      1.32
  3. Spinflip = false:
       DetPbyDet                  0.00      0.02
       UpdateConfig               0.00      0.04
  4. Copyconfig                   0.04      0.74
     Move rejected                0.02      0.68
  5. Total                        0.24      3.04
Accumulation
  1. Electron Move                0.01      0.07
  2. Spinflip = true:
       DetPbyDet                  0.01      0.07
       UpdateConfig               0.07      1.44
  3. Spinflip = false:
       DetPbyDet                  0.01      0.00
       UpdateConfig               0.02      0.09
  4. Energy calculation:
       DetPbyDet                  0.02      0.17
       UpdateConfig               0.08      1.85
       Total                      0.18      2.91
  5. Copyconfig                   0.08      0.62
     Move rejected                0.03      0.64
  6. Total                        0.46      6.03
Total VMC                         0.70      9.07
Results for L = 15
Execution Time (s)

Function                          L = 15
Equilibration
  1. Electron Move                0.19
  2. Spinflip = true:
       DetPbyDet                  0.51
       UpdateConfig               25.15
  3. Spinflip = false:
       DetPbyDet                  0.07
       UpdateConfig               0.96
  4. Copyconfig                   12.29
     Move rejected                11.99
  5. Total                        51.74
Accumulation
  1. Electron Move                0.18
  2. Spinflip = true:
       DetPbyDet                  0.44
       UpdateConfig               25.17
  3. Spinflip = false:
       DetPbyDet                  0.03
       UpdateConfig               0.98
  4. Energy calculation:
       DetPbyDet                  0.80
       UpdateConfig               36.50
       Total                      52.67
  5. Copyconfig                   12.30
     Move rejected                11.61
  6. Total                        104.05
Total VMC                         156.02
Workflow
Naive Implementation
The initial steps in porting the VMC code to CUDA were taken by Byron, an intern at the LA-Sigma Research group.
Each MC is handled by a CUDA thread and the number of thread blocks equals the number of MCs.
Drawbacks:
  1. The code cannot use more than 1/32 of the GPU's computing potential per warp
  2. Higher memory access latency due to global memory reads/writes
  3. No use of high-speed shared and constant memory, and unoptimized L1 and L2 cache access
Execution Configuration
The threads per MC depend on the $N_P$ of the given lattice.
The complexities of the functions are: DetPByDet - $O(N_P)$, Update - $O(N_P^2)$ and Copyconfig - $O(2 N_S)$.
The number of threads should be in the range $N_P \le \text{blocksize} \le N_P^2$ and rounded to the nearest multiple of a warp.
The access pattern differs based on a spin flip for different functions.
Set a parameter called threads-per-col to determine the blocksize:
Blocksize = Round(threads-per-col × $N_P$, nearest multiple of warp)

Lattice            L = 5    L = 9    L = 15
N_P                12       37       102
threads-per-col    4        4        4
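The blocksize rule can be sketched as follows. The slide says "nearest multiple of warp", which is ambiguous when the product lies exactly between two multiples (4 × 12 = 48 for L = 5). This sketch assumes rounding up, which is consistent with the 64 threads per MC listed for L = 5 in a later table; the warp size of 32 and the function name `block_size` are also assumptions for illustration.

```python
WARP_SIZE = 32  # assumption: standard CUDA warp size

def block_size(n_pairs, threads_per_col, warp_size=WARP_SIZE):
    """Round threads_per_col * N_P up to the next multiple of the warp size."""
    raw = threads_per_col * n_pairs
    # Ceiling division, then scale back to a thread count.
    return -(-raw // warp_size) * warp_size

for lattice, n_pairs in (("L = 5", 12), ("L = 9", 37), ("L = 15", 102)):
    print(lattice, block_size(n_pairs, 4))
```

Rounding up keeps every warp fully allocated to the block, at the cost of a few idle threads in the last warp.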
Cache behavior
Table: Speculation for L = 5

Data structure           Size        L1 cache / shared    L2 cache                            Constant memory
$\Psi^{-1}$ (double)     1.152 KB    Yes                  No                                  No
Pairfunction (double)    5.408 KB    Yes                  Yes, due to sharing between MCs     No
Plist (int)              96 B        Yes                  No                                  No
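The sizes in this table can be reproduced with a small calculation. The array shapes below are assumptions inferred from the listed sizes, not stated in the slides: $\Psi^{-1}$ as $N_P \times N_P$ doubles, the pairfunction as $N_S \times N_S$ doubles, and Plist as $N_P \times 2$ 4-byte integers, with $N_S = L^2 + 1$. The function name `footprint_bytes` is hypothetical.

```python
DOUBLE_BYTES = 8  # size of a double
INT_BYTES = 4     # assumption: 4-byte int

def footprint_bytes(L, n_pairs):
    """Per-structure memory footprint in bytes under the assumed array shapes."""
    n_sites = L * L + 1
    return {
        "inverse": n_pairs * n_pairs * DOUBLE_BYTES,       # Psi^{-1}: N_P x N_P
        "pairfunction": n_sites * n_sites * DOUBLE_BYTES,  # N_S x N_S
        "plist": n_pairs * 2 * INT_BYTES,                  # one site per spin per pair
    }

sizes_l5 = footprint_bytes(5, 12)
print(sizes_l5)
```

For L = 5 with $N_P = 12$ this gives 1152 B, 5408 B and 96 B, matching the 1.152 KB, 5.408 KB and 96 B entries listed for that lattice.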
Table: L = 5 (threads per MC: 64)

Structure      Config1    Config2    Config3    Total
$\Psi^{-1}$    L1         L1         L1         27.648 kB
Pairfunc       Constant, common to all MCs      5.408 kB
Plist          L1         L1         L1         288 B

SM occupancy: 1 MC
Initial Parallelization
[Figure panels: DetPByDet - Up spin; DetPByDet - Down spin]
Initial Parallelization
[Figure panels: UpdateConfig - Up spin; UpdateConfig - Down spin]
We removed the redundant copies of a configuration. An MC now maintains a single copy of a configuration.
The following table predicts the MC occupancy per SM: 48/9.216 = 5.20, i.e. 5 MCs.
Drawbacks
A row or a column is accessed in DetPByDet.
First a row or column, and then the remaining rows or columns, are updated on $\Psi^{-1}$.
Reduce the three function calls, DetPByDet, UpdateConfig and DetPByDet (in order), into a single function.
Solution: merge DetPByDet and UpdateConfig to form a streamlined function.
Streamlined Function
[Figure panels: Up spin; Down spin]
Lattice    L = 5    L = 9    L = 15
No. MCs    32       32       32

Bfactor = 16
Lattice    L = 5    L = 9    L = 15
No. MCs    32       32       32
Lattice    L = 5    L = 9    L = 15
No. MCs    32       32       32
Two 2.93 GHz Quad Core Nehalem Xeon 64-bit Processors 8 32 24GB @ 1333MHz
Table: L = 5

Function                        No. Cycles
Electron move                   420361934
DPBD
  DPBD-UP                       95922170
  DPBD-DN                       97922630
  Total                         193844800
UPD
  UPD-UP                        322257414
  UPD-DN                        259125546
  Total                         581382960
Energy                          969450046
Streamlined (DPBD-UPD-DPBD)     1630512656

Table: L = 9

Function                        No. Cycles
Electron move                   1398769542
DPBD
  DPBD-UP                       321308398
  DPBD-DN                       351500708
  Total                         672809106
UPD
  UPD-UP                        1275117358
  UPD-DN                        1476716774
  Total                         2751834132
Energy                          6105049222
Streamlined (DPBD-UPD-DPBD)     10717256244
Figure: L = 5
Figure: L = 9
Table: L = 15

Function                        No. Cycles
Electron move                   4958484770
DPBD
  DPBD-UP                       890322994
  DPBD-DN                       932234278
  Total                         1822557272
UPD
  UPD-UP                        5874557324
  UPD-DN                        7310109920
  Total                         13184667244
Energy                          48836694164
Streamlined (DPBD-UPD-DPBD)     8600610984504
Figure: L = 15
Questions ?
Thank You