Thesis Defense

GPU Acceleration of the Variational Monte Carlo Method for Many Body Physics
Rajagopalan, Kaushik Ragavan
Louisiana State University, Department of Electrical and Computer Engineering
04/09/2013
Kaushik Ragavan
1/54
GPU Acceleration of VMC Method for Many Body Physics
Contents
1. Acknowledgements
2. Motivation
3. Variational Monte Carlo (VMC)
   VMC Algorithm
   Trial Wavefunction
   Difficulty of VMC
   Faster VMC
4. Prior Work
5. CUDA Programming model
6. CPU Implementation
   Algorithm
   Implementation
7. GPU Implementation
   Initial Steps
   Optimization
8. Results
Acknowledgements
Thesis Committee
Dr. David Koppelman (Chair)
Dr. Xin Li
Dr. Juana Moreno
LA-Sigma Research group
Dr. Mark Jarrell
Dr. Ramanujam
Dr. Ka Ming Tam
Dr. Zhifeng Yun
Dr. Sandeep Pathak
Niladri Sengupta
Motivation
High-Performance computing
High-performance computing is one of the major areas driving large-scale simulation. Applications such as 3D nuclear test simulations, molecular dynamics, and Quantum Monte Carlo simulations are now developed on supercomputers using the latest computing technologies. Most of today's supercomputers are heterogeneous: multi-core CPUs equipped with massively parallel Graphics Processing Units (GPUs).
Variational Monte Carlo (VMC)

The VMC method is used in many-body physics to study the ground-state properties of a system.
It depends on variational parameters to obtain a better prediction of the wavefunction.
Drawbacks: computationally expensive; requires many Monte Carlo sweeps to converge; direct parallelization on CPU clusters does not scale with the system size.
Solution: port the VMC method to a GPU using NVIDIA CUDA.
Results: nearly 3.85x speedup compared to the MPI Fortran version and 19x speedup compared to the C++ version.
Variational Monte Carlo (VMC)
Definition
VMC is a direct application of Monte Carlo integration to strongly correlated systems. The variational approach has been used widely in different areas of condensed matter physics, in particular for the d-wave superconducting state of the high-$T_c$ cuprates at $T = 0$.
VMC Algorithm
Equilibration:
Generate an initial configuration of electrons using a random number generator.
For each electron in the configuration:
1. Propose a move from $m_r$ to $m_r'$
2. Compute the ratio $R = |\Psi(m_r')/\Psi(m_r)|^2$
3. Perform the Metropolis acceptance comparison $\min(1, R)$
4. Update the configuration if the move is accepted
5. Else restore the configuration
Repeat the above steps until the system equilibrates.
Accumulation:
Repeat the same procedure from Equilibration.
Accumulate the local energy and other observable parameters at $m_r$ and $m_r'$.
Perform the Metropolis acceptance comparison $\min(1, R)$.
Repeat the above steps until the energies are accumulated.
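The equilibration and accumulation loop can be sketched on a toy problem. The C++ sketch below is not the thesis code: the two-configuration system, its weights, and the names `metropolis_accept` and `toy_chain_fraction` are assumptions made for illustration. It only shows the acceptance test with probability $\min(1, R)$ and the split between equilibration sweeps (discarded) and accumulation sweeps (counted).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <random>

// Metropolis test: accept with probability min(1, R), where
// R = |Psi(new) / Psi(old)|^2 and u is a uniform random number in [0, 1).
bool metropolis_accept(double ratio_sq, double u) {
    assert(ratio_sq >= 0.0);
    return u < std::min(1.0, ratio_sq);
}

// Hypothetical toy system: two configurations with unnormalized weights
// |Psi|^2 = {1, 3}. Returns the fraction of accumulation sweeps spent in
// configuration 1; importance sampling should give about 3 / (1 + 3).
double toy_chain_fraction(std::size_t n_equil, std::size_t n_accum,
                          unsigned seed) {
    assert(n_accum > 0);
    const std::array<double, 2> weight = {1.0, 3.0};
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    int state = 0;
    std::size_t hits = 0;
    for (std::size_t sweep = 0; sweep < n_equil + n_accum; ++sweep) {
        const int proposal = 1 - state;  // propose a move
        const double ratio_sq = weight[proposal] / weight[state];
        if (metropolis_accept(ratio_sq, uniform(gen))) {
            state = proposal;  // move accepted: update the configuration
        }                      // move rejected: keep the old configuration
        if (sweep >= n_equil && state == 1) {
            ++hits;  // accumulate only after equilibration
        }
    }
    return static_cast<double>(hits) / static_cast<double>(n_accum);
}
```

Calling `toy_chain_fraction(1000, 200000, 42u)` should return a value close to 0.75, the normalized weight of configuration 1.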
Trial Wavefunction
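In quantum mechanics, the variational principle can be derived by expanding a normalized trial wavefunction, $\Psi_T$, in terms of the exact normalized eigenstates $\psi_i$ of the Hamiltonian.

$$\Psi_T = \sum_{i=0}^{\infty} c_i \psi_i, \qquad (1)$$

where the coefficients $c_i$ satisfy

$$\sum_{i=0}^{\infty} |c_i|^2 = 1.$$

The expectation value of the Hamiltonian ($\hat{H}$) with the trial wavefunction is given by

$$\langle \Psi_T | \hat{H} | \Psi_T \rangle = \Big\langle \sum_i c_i \psi_i \Big| \hat{H} \Big| \sum_j c_j \psi_j \Big\rangle = \sum_i |c_i|^2 \varepsilon_i, \qquad (2)$$

where $\varepsilon_i = \langle \psi_i | \hat{H} | \psi_i \rangle$.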
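The variational bound follows from the expansion in one more step. The short derivation below assumes that the eigenvalues are labelled so that $\varepsilon_0$ is the ground-state energy; that labelling is an assumption, since the slide does not state it.

```latex
\langle \Psi_T | \hat{H} | \Psi_T \rangle
  = \sum_i |c_i|^2 \varepsilon_i
  \;\ge\; \varepsilon_0 \sum_i |c_i|^2
  = \varepsilon_0 ,
\qquad \text{since } \varepsilon_i \ge \varepsilon_0 \text{ and } \sum_i |c_i|^2 = 1 .
```

Any normalized trial wavefunction therefore gives an upper bound on the ground-state energy, which is why lowering the energy with respect to the variational parameters improves the trial state.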
Difficulty of VMC
The wavefunction in real space is given by $\Psi(r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$, where $r_{i\sigma}$ are the coordinates of the electrons on a lattice and $C \equiv (r_{1\uparrow}, \ldots, r_{N\uparrow}, r_{1\downarrow}, \ldots, r_{N\downarrow})$ is a configuration of electrons. To sum over all configurations, for example on a lattice with 100 sites and 50 $\uparrow$ and 50 $\downarrow$ electrons, we would need to visit over $10^{60}$ configurations. To overcome this difficulty, we use the Monte Carlo method to perform the sum [2]. The energy as a function of the variational parameters $\alpha_i$ is given by

$$E(\alpha_i) = \frac{\sum_C \Psi^*(C)\, H\, \Psi(C)}{\sum_C \Psi^*(C)\, \Psi(C)} = \sum_C P(C)\, \frac{H\Psi(C)}{\Psi(C)}, \qquad (3)$$

where $P(C)$, the probability of the configuration, is given by

$$P(C) = \frac{|\Psi(C)|^2}{\sum_C \Psi^*(C)\, \Psi(C)}.$$

By construction $\sum_C P(C) = 1$. We visit the high-probability configurations and add up their contributions. This is called importance sampling.
Using importance sampling, accurate results for various quantities can be obtained with a smaller number of Monte Carlo sweeps, $N_{MC}$. The energy after $N_{MC}$ sweeps is given by

$$E(\alpha_i) = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{H\Psi(m_r)}{\Psi(m_r)}, \qquad (4)$$

where $m_r$ is the configuration visited at sweep $k$. For other operators, $O$,

$$\langle O \rangle = \frac{1}{N_{MC}} \sum_{k=1}^{N_{MC}} \frac{O\Psi(m_r)}{\Psi(m_r)}. \qquad (5)$$

For every Monte Carlo (MC) step, we need to evaluate the ratio $|\Psi(m_r')|^2 / |\Psi(m_r)|^2$, which is of complexity $O(N^3)$. In order to optimize with respect to $\alpha_i$, we need a complexity of $O(N)$. How can we reduce the algorithm complexity?
Faster VMC
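Consider spinless fermions with a configuration $k$ given by a wavefunction $\Psi_k$ and a configuration $l$ given by a wavefunction $\Psi_l$. The two configurations differ only in the position of one electron, which moves from $e_l$ to $e_l'$.

$$\Psi_k = \begin{pmatrix} a_1(e_1) & \cdots & a_N(e_1) \\ \vdots & & \vdots \\ a_1(e_l) & \cdots & a_N(e_l) \\ \vdots & & \vdots \\ a_1(e_N) & \cdots & a_N(e_N) \end{pmatrix} \qquad (6)$$

Since $\Psi_k$ and $\Psi_l$ differ only in the entries that belong to the moved electron, the ratio of determinants can be determined as [2]

$$\frac{\det[\Psi_l]}{\det[\Psi_k]} = \sum_{j=1}^{N} a_j(e_l')\, (\Psi_k^{-1})_{jl}, \qquad (7)$$

which is of $O(N)$. The calculation of $\Psi^{-1}$ is reduced to the order of $O(N^2)$ using the Sherman-Morrison-Woodbury (SMW) method.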
Our approach
We used the $t$-$t'$-$t''$ model to study the Hubbard model.

$$H = -\sum_{ij,\sigma} t_{ij}\, c^{\dagger}_{i\sigma} c_{j\sigma} + U \sum_i n_{i\uparrow} n_{i\downarrow} \qquad (8)$$

The VMC method is tested on a tilted square lattice.
We study the Resonating Valence Bond (RVB) model to describe high-$T_c$ superconductivity.
The number of sites is given by $N_S = L^2 + 1$, where $L$ is odd.
The number of electrons is given by $N_e = N_S (1.0 - x)$, where $x$ is the nominal hole doping.
The number of electron pairs is given by $N_P = N_e / 2$.
Prior Work
CPU Implementation:
A Fortran implementation of VMC was done by Dr. Sandeep Pathak, a former postdoc in the LA-Sigma Research group [Phys Rev.B].
In his implementation, a Markov chain (MC) is simulated starting from an initial random configuration.
MPI Implementation
An MPI-Fortran implementation was done by Dr. Sandeep Pathak to study the d-correlated systems using the variational approach [Phys Rev.B].
Every MC is run by an MPI rank (process), and inter-processor and inter-node communication is required to compute the average energy.
Drawbacks:
1. Even though the MPI implementation gives better results, an algorithm such as VMC is a suitable candidate for a GPU because of the GPU's high floating-point throughput.
Prior Work
GPU Implementation
1. CUDA port
Initial steps in porting the VMC code to CUDA were taken by Byron, an REU student in the LA-Sigma Research group.
In his implementation, a CUDA thread performs an MC and the number of thread blocks equals the number of MCs.
Drawbacks: lacks sufficient thread-level parallelism and makes no use of shared memory, constant memory, or caching techniques for the L1 cache on a GPU.
2. CUDA and FPGA port
A GPU and FPGA implementation was done at the University of Tennessee-Knoxville.
It compares a dual-core, dual-processor AMD Opteron CPU @ 2.2 GHz against an NVIDIA GPU and a Virtex-4 XC4VLX160 FPGA.
Drawbacks: no effective utilization of the memory resources on a GPU.
GPU Architecture
More transistors are allocated to computational units on a GPU, whereas a CPU dedicates more transistors to caching and control units. Figure 1 shows the difference between a CPU and a GPU architecture.
Figure: GPU Architecture
CUDA Programming model

Thread Hierarchy
The CUDA threadIdx has one, two or three dimensions.
A group of threads forms a block, and a block can have one, two or three dimensions.
A group of thread blocks forms a grid of one, two or three dimensions.
Thread blocks reside within a Streaming Multiprocessor (SM).
The Fermi architecture supports 8 resident blocks per SM.
Figure: CUDA thread hierarchy

Algorithm Implementation
Algorithm
Algorithm 1: Overview of the VMC Method
begin vmc
    Initialization()    // Generate lattice object, wavefunction and pairfunction
    Equilibration()     // Perform the Equilibration procedure
    Accumulation()      // Accumulate the energy and study the ground state
end vmc

Algorithm 2: Equilibration
procedure Equilibration()
    for i ← 0, nsweeps do
        Electron Move()               // Call the electron move procedure
        Perform MonteCarlo Sweep()    // Determine the acceptance of the move and update the configuration
    end for
end procedure
Algorithm
Algorithm 3: Accumulation
procedure Accumulation()
    // Change the loop bounds to navesweeps and npsweep
    // Repeat the procedures from Equilibration
    for i ← 0, navesweeps do
        for j ← 0, npsweep do
            Electron Move()
            Perform MonteCarlo Sweep()
        end for
        energy ← EnergyofConfig()    // Call the energy of configuration
        eneloc ← eneloc + energy     // Accumulate the energy
    end for
end procedure
Algorithm
Algorithm 4: Electron Move
procedure Electron Move()
    // Pick a pair and spin at random
    ipair ← rand(Npairs)
    ispin ← rand(2)
    oldsite ← plist(ipair, ispin)
    // Find a site for the electron move
    while (newsite == 0) do
        if neiprob > 0.0 then
            newsite ← rand(Nsites)
        else
            newsite ← neiblist()    // neiblist maintains the neighbour information
        end if
        if (latocc(newsite) != (BT or HL)) then    // latocc will determine the spin at the newsite
            spinflip ← true
            jpair ← whichpair(newsite * 2 + 1 - ispin)    // whichpair(ilat, spin): spin at lattice site ilat
        else
            newsite = 0
        end if
    end while
end procedure
Algorithm
Algorithm 5: Perform MonteCarlo Sweep
procedure Perform MonteCarlo Sweep()
    if spinflip = true then
        dpbd1 ← DetPByDet(ipair, 2 * ispin - 1, newsite)
        UpdateConfig(ipair, 2 * ispin - 1, newsite, dpbd1)    // Determine $\Psi^{-1}$
        dpbd2 ← DetPByDet(jpair, 1 - 2 * ispin, oldsite)
        dpbd ← dpbd1 * dpbd2
    else
        dpbd ← DetPByDet(ipair, 2 * ispin - 1, newsite)
    end if
    norm2 ← norm(dpbd)
    if norm2 ≥ uniformrand(0, 1) then    // Move is accepted
        if spinflip = true then
            UpdateConfig(jpair, 1 - 2 * ispin, oldsite, dpbd2)
        else
            UpdateConfig(ipair, 2 * ispin - 1, newsite, dpbd)
        end if
    else
        Copyconfig()
    end if
end procedure
Algorithm
Algorithm 6: Energy of a configuration
procedure EnergyofConfig()
    saved = config()       // Make a copy of the configuration
    tempconf = config()    // Make a temporary configuration
    // Calculate the kinetic energy part of the Hamiltonian
    for ispin ← 0, 2 do
        spin ← 2 * ispin - 1
        for ipair ← 0, Npairs do
            isite ← plist(ipair * 2 + ispin)
            for jn ← 0, nneibs do
                jsite ← neiblist(jn + nneibs * isite)
                if latocc(jsite) = HL then
                    dpbd ← DetPByDet(ipair, spin, jsite)
                    enekloc ← enekloc + thop * real(dpbd)
                    spkeloc(ispin) ← spkeloc(ispin) + thop * real(dpbd)
                end if
            end for
        end for
    end for
    // Compute the exchange term of the Hamiltonian
    otheloc ← 0    // To hold the energy from the exchange term
    for inn ← 0, nearnsets do
        isite ← nearnp(inn * 2); jsite ← nearnp(inn * 2 + 1)
        ispin ← latocc(isite); jspin ← latocc(jsite)
        ipair ← whichpair(isite * 2 + (1 + ispin) / 2)
        jpair ← whichpair(jsite * 2 + (1 + jspin) / 2)
        if ispin * jspin < 0 then
            tempconf ← CopyConfig()
            dpbd ← DetPByDet(ipair, ispin, jsite); UpdateConfig()
            dpbd1 ← DetPByDet(jpair, jspin, isite)
            otheloc = otheloc - real(dpbd * dpbd1) + 1.0
        end if
    end for
    otheloc ← (Jij * otheloc) / (-2)
    enektot ← enekloc; othetot ← otheloc
    energy ← (enektot + othetot) / Nsites
    kinenergy ← (spkeloc(0) + spkeloc(1)) / Nsites
    othenergy ← othetot
    return energy
end procedure
Data Structures
Data Structures
Major data structures such as $\Psi^{-1}$, pairfunction and Plist are members of a class. The following table shows the classes used for the different lattice models.
Table: Classes used in the Implementation of VMC

S.No | Class        | Functionality
-----|--------------|--------------
1    | Sqlat        | Holds lattice parameters such as Nsites, Nelecs, Npairs and hopping, etc.
2    | Config       | Contains the energy arrays, and functions for calculating the ratio of determinants and the update of $\Psi^{-1}$
3    | Wavefunction | Contains the variational parameters required to generate the pairfunction or wavefunction
4    | Montecarlo   | Contains the lattice object, pairfunction and random seeds for a corresponding Markov chain
Memory Requirements
Memory Requirements for L
The memory consumption increases with the lattice size $L$; the two matrices scale as $N_P^2$ and $N_S^2$. The following table shows the memory requirements.
Table: Memory requirements

Data structure | Size                                           | Data type | L = 5    | L = 9     | L = 15
---------------|------------------------------------------------|-----------|----------|-----------|-----------
NS             | $L^2 + 1$                                      | int       | 26       | 82        | 226
Ne             | $N_S (1.0 - x)$, where $x$ is the hole doping  | int       | 24       | 74        | 204
Npairs         | $N_e / 2$                                      | int       | 12       | 37        | 102
PsiInv         | $N_P^2$                                        | double    | 1.152 kB | 10.952 kB | 83.232 kB
Pairfunction   | $N_S^2$                                        | double    | 5.408 kB | 53.792 kB | 408.608 kB
Plist          | $2 N_P$                                        | int       | 96 B     | 296 B     | 816 B
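The byte counts in the table follow directly from the array shapes. The C++ sketch below reproduces them under two assumptions that the slide does not state explicitly: `double` is 8 bytes and the integer type is 4 bytes (modelled here with `std::int32_t`). The names `LatticeMemory`, `sites_for_lattice` and `lattice_memory` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Per-configuration memory footprint in bytes.
struct LatticeMemory {
    std::size_t psi_inv_bytes;       // Psi^{-1}: NP x NP doubles
    std::size_t pairfunction_bytes;  // pairfunction: NS x NS doubles
    std::size_t plist_bytes;         // Plist: 2 * NP 32-bit integers
};

// Number of sites on the tilted square lattice: NS = L^2 + 1.
std::size_t sites_for_lattice(std::size_t lattice_size) {
    return lattice_size * lattice_size + 1;
}

LatticeMemory lattice_memory(std::size_t n_sites, std::size_t n_pairs) {
    assert(2 * n_pairs <= n_sites);  // cannot have more electrons than sites
    return LatticeMemory{
        n_pairs * n_pairs * sizeof(double),
        n_sites * n_sites * sizeof(double),
        2 * n_pairs * sizeof(std::int32_t)};
}
```

For L = 5 with 12 pairs, `lattice_memory(sites_for_lattice(5), 12)` gives 1152 B, 5408 B and 96 B, matching the L = 5 column.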
Calculation of Ratio of Determinants using the SMW formula
Function DetPByDet
If the spinflip is up, a site is picked from the electron list and referred to as the oldsite. This site corresponds to the column of the pairfunction.
A newsite given by an input parameter corresponds to the row of the pairfunction.
The Ipair represents the electron pair and corresponds to the column of $\Psi^{-1}$.
Thus, a dot product of a row of the pairfunction and a column of $\Psi^{-1}$ is calculated.
This costs $N_P$ cache misses for $\Psi^{-1}$ and one cache miss for the pairfunction on the L1 cache.
The access pattern is reversed for the case of a down spinflip: the Ipair corresponds to the row of $\Psi^{-1}$ and the newsite corresponds to the column of the pairfunction.
A total of $N_P$ fused multiply-add floating-point operations (FLOPs) is needed to compute the dot product.
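The dot product described above can be illustrated with a generic row replacement. The C++ sketch below is not the thesis function DetPByDet: the name `det_ratio_row_replace`, the row-major storage, and the use of a plain matrix in place of the pairfunction are assumptions. It shows why the ratio costs only n multiply-adds once the inverse is known, and why one operand is read with a stride (a column of a row-major matrix).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Ratio det(A') / det(A) when row `row` of the n x n matrix A is replaced by
// `new_row`. `inv` holds A^{-1} in row-major order, so the ratio is the dot
// product of new_row with column `row` of A^{-1}: n multiply-adds in total.
double det_ratio_row_replace(const std::vector<double>& inv,
                             const std::vector<double>& new_row,
                             std::size_t n, std::size_t row) {
    assert(inv.size() == n * n && new_row.size() == n && row < n);
    double ratio = 0.0;
    for (std::size_t j = 0; j < n; ++j) {
        ratio += new_row[j] * inv[j * n + row];  // strided read down a column
    }
    return ratio;
}
```

For A = [[2, 1], [1, 3]] (determinant 5), replacing row 0 by (4, 1) gives a matrix with determinant 11, and the function returns 11/5 from the stored inverse without forming either determinant.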
Memory Access Pattern

Figure: (a) $\Psi^{-1}$, (b) Plist, (c) Pairfunction
Update of a Configuration
Function UpdateConfig
The Ipair-th column of $\Psi^{-1}$ is updated using dpbd (from function DetPByDet).
A dot product is computed between a column of $\Psi^{-1}$ and a row of the pairfunction.
The remaining columns of $\Psi^{-1}$ are updated using the dot product and the Ipair-th column of $\Psi^{-1}$.
The update of $\Psi^{-1}$ is of order $O(N_P^2)$.
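The update steps above match the structure of a Sherman-Morrison rank-one update. The C++ sketch below applies that formula to a generic row replacement; it is an illustration under assumptions (row-major storage, a plain matrix, the hypothetical name `sherman_morrison_row_update`), not the thesis function UpdateConfig. The column that belongs to the replaced row is divided by the determinant ratio, and every other column is corrected with one dot product and that column, for $O(n^2)$ work overall.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place update of inv = A^{-1} (row-major, n x n) after row `row` of A is
// replaced by `new_row`, using the Sherman-Morrison formula.
void sherman_morrison_row_update(std::vector<double>& inv,
                                 const std::vector<double>& new_row,
                                 std::size_t n, std::size_t row) {
    assert(inv.size() == n * n && new_row.size() == n && row < n);
    // w[k] = sum_j new_row[j] * inv[j][k]; w[row] is the determinant ratio.
    std::vector<double> w(n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k < n; ++k) {
            w[k] += new_row[j] * inv[j * n + k];
        }
    }
    const double ratio = w[row];
    assert(ratio != 0.0);  // a zero ratio means the new matrix is singular
    for (std::size_t i = 0; i < n; ++i) {
        const double scaled = inv[i * n + row] / ratio;  // new column `row`
        for (std::size_t k = 0; k < n; ++k) {
            if (k != row) {
                inv[i * n + k] -= scaled * w[k];  // correct the other columns
            }
        }
        inv[i * n + row] = scaled;
    }
}
```

Starting from the inverse of [[2, 1], [1, 3]] and replacing row 0 by (4, 1), the routine leaves `inv` equal to the inverse of [[4, 1], [1, 3]], which is (1/11) [[3, -1], [-1, 4]].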
Memory Access Pattern

Figure: (a) $\Psi^{-1}$, (b) Dot product, (c) $\Psi^{-1}$
Algorithm Analysis
Execution Time in FLOPS
The FLOPs are estimated on the assumption that a spinflip will occur 50% of the time in both the Equilibration and Accumulation stages.
Table: Estimate of FLOPS for the VMC method

Function                  | FLOPS                                                | Operation
--------------------------|------------------------------------------------------|----------
DetPByDet                 | Npairs × MUL                                         | Either case of spinflip
UpdateConfig              | Npairs × DIV                                         | For Ipair-th row or col of $\Psi^{-1}$
                          | + Npairs² × (MUL + ADD)                              | For dot product calculation
                          | + Npairs² × (MUL + SUB)                              | For updating other rows or cols of $\Psi^{-1}$
CopyConfig                | NIL                                                  | NIL
Energy of a Configuration | 2 × Npairs × Nneibs × (DetPByDet + MUL + ADD)        | Kinetic energy calculation
                          | + Nearnsets × (DetPByDet)                            | Accumulation
                          | + Nearnsets × UpdateConfig                           |
                          | + Nearnsets × (DetPByDet)                            |
                          | + Nearnsets × (ADD + SUB)                            |
Results
Execution Time:
The results were obtained on an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.
The cache hierarchy is: 32 kB data + 32 kB instruction L1 cache and 256 kB L2 cache per core.
The following tables show the execution time for L = 5, 9, 15.
Results for L = 5, 9
Table: Execution time (s) for L = 5 and L = 9

Stage         | Function                          | L = 5 | L = 9
--------------|-----------------------------------|-------|------
Equilibration | 1. Electron Move                  | 0.02  | 0.01
              | 2. Spinflip = true: DetPByDet     | 0.01  | 0.11
              | 2. Spinflip = true: UpdateConfig  | 0.03  | 1.32
              | 3. Spinflip = false: DetPByDet    | 0.00  | 0.02
              | 3. Spinflip = false: UpdateConfig | 0.00  | 0.04
              | 4. CopyConfig                     | 0.04  | 0.74
              | 4. Move rejected                  | 0.02  | 0.68
              | 5. Total                          | 0.24  | 3.04
Accumulation  | 1. Electron Move                  | 0.01  | 0.07
              | 2. Spinflip = true: DetPByDet     | 0.01  | 0.07
              | 2. Spinflip = true: UpdateConfig  | 0.07  | 1.44
              | 3. Spinflip = false: DetPByDet    | 0.01  | 0.00
              | 3. Spinflip = false: UpdateConfig | 0.02  | 0.09
              | 4. Energy calculation: DetPByDet  | 0.02  | 0.17
              | 4. Energy calculation: UpdateConfig | 0.08 | 1.85
              | 4. Energy calculation: Total      | 0.18  | 2.91
              | 5. CopyConfig                     | 0.08  | 0.62
              | 5. Move rejected                  | 0.03  | 0.64
              | 6. Total                          | 0.46  | 6.03
Total VMC     |                                   | 0.70  | 9.07
Results for L = 15
Table: Execution time (s) for L = 15

Stage         | Function                            | L = 15
--------------|-------------------------------------|-------
Equilibration | 1. Electron Move                    | 0.19
              | 2. Spinflip = true: DetPByDet       | 0.51
              | 2. Spinflip = true: UpdateConfig    | 25.15
              | 3. Spinflip = false: DetPByDet      | 0.07
              | 3. Spinflip = false: UpdateConfig   | 0.96
              | 4. CopyConfig                       | 12.29
              | 4. Move rejected                    | 11.99
              | 5. Total                            | 51.74
Accumulation  | 1. Electron Move                    | 0.18
              | 2. Spinflip = true: DetPByDet       | 0.44
              | 2. Spinflip = true: UpdateConfig    | 25.17
              | 3. Spinflip = false: DetPByDet      | 0.03
              | 3. Spinflip = false: UpdateConfig   | 0.98
              | 4. Energy calculation: DetPByDet    | 0.80
              | 4. Energy calculation: UpdateConfig | 36.50
              | 4. Energy calculation: Total        | 52.67
              | 5. CopyConfig                       | 12.30
              | 5. Move rejected                    | 11.61
              | 6. Total                            | 104.05
Total VMC     |                                     | 156.02
GPU Implementation
GPU Acceleration of the VMC Method
From the CPU implementation we determined the best candidates for parallelization using a GPU.
Functions such as the ratio of determinants, the update, and the energy of a configuration are computationally intensive and exhibit data-level parallelism.
In our implementation a block of CUDA threads handles a Markov chain (MC).
Every chain has its own copy of a configuration with members such as $\Psi^{-1}$, pairfunction and plist.
Every chain starts with an initial random configuration and performs Equilibration and Accumulation.
The energy values are accumulated from all MCs and averaged on a CPU.
The ground state $|\Psi_G\rangle$ is then optimized with the variational parameters.
Workflow
Naïve Implementation
Initial steps in porting the VMC code to CUDA were taken by Byron, an intern student in the LA-Sigma Research group.
Each MC is handled by a CUDA thread and the number of thread blocks equals the number of MCs.
Drawbacks:
The code cannot use more than 1/32 of the GPU's computing potential per warp.
Higher memory access latency due to global memory reads and writes.
No use of high-speed shared and constant memory, and unoptimized L1 and L2 cache access.
Execution Configuration
The number of threads per MC depends on the $N_P$ of the given lattice.
The complexity of the functions is: DetPByDet - $O(N_P)$, Update - $O(N_P^2)$ and CopyConfig - $O(2 N_S)$.
The number of threads should be in the range $N_P \le \text{blocksize} \le N_P^2$ and rounded to the nearest multiple of a warp.
The access pattern differs based on a spinflip for different functions.
Set a parameter called threads-per-col to determine the blocksize.
Blocksize = Round(threads-per-col × $N_P$, nearest multiple of warp)
Table: Threads per MC and number of MCs per SM

Lattice | Npairs | Threads-per-col | No. threads per MC | No. resident MCs per SM
--------|--------|-----------------|--------------------|------------------------
L = 5   | 12     | 4               | 64                 | 24
L = 9   | 37     | 4               | 160                | 9
L = 15  | 102    | 4               | 416                | 3
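The block-size rule and the resident-chain counts in the table can be reproduced with a few lines of integer arithmetic. The C++ sketch below makes three assumptions: rounding goes up to the next multiple of the warp size (which agrees with all three table rows), the limit of 1536 resident threads per SM is the one used in the slides, and any separate limit on resident blocks per SM is ignored. The function names are hypothetical.

```cpp
#include <cassert>

constexpr int kWarpSize = 32;
constexpr int kMaxThreadsPerSM = 1536;  // thread limit per SM used in the slides

// Block size: threads-per-col * NP, rounded up to a multiple of the warp size.
int block_size(int n_pairs, int threads_per_col) {
    assert(n_pairs > 0 && threads_per_col > 0);
    const int raw = n_pairs * threads_per_col;
    return ((raw + kWarpSize - 1) / kWarpSize) * kWarpSize;
}

// Number of Markov chains (one thread block each) that fit on one SM when
// only the resident-thread limit is considered.
int resident_chains_per_sm(int threads_per_chain) {
    assert(threads_per_chain > 0);
    return kMaxThreadsPerSM / threads_per_chain;
}
```

With threads-per-col = 4, `block_size` returns 64, 160 and 416 for 12, 37 and 102 pairs, and `resident_chains_per_sm` then gives 24, 9 and 3, as in the table.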
Cache behavior
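Table: Speculation for L = 5

GPU Memory       | $\Psi^{-1}$ (double), 1.152 KB | Pairfunction (double), 5.408 KB  | Plist (int), 96 B
-----------------|--------------------------------|----------------------------------|------------------
L1 cache/shared  | Yes                            | Yes                              | Yes
L2 cache         | No                             | Yes. Due to sharing between MCs  | No
Constant Memory  | No                             | No                               | No

Table: Speculation for L = 9

GPU Memory       | $\Psi^{-1}$ (double), 10.952 KB | Pairfunction (double), 53.792 KB | Plist (int), 296 B
-----------------|---------------------------------|----------------------------------|-------------------
L1 cache/shared  | Yes                             | No                               | Yes
L2 cache         | No                              | Yes. Due to sharing between MCs  | No
Constant Memory  | No                              | No                               | No

Table: Speculation for L = 15

GPU Memory       | $\Psi^{-1}$ (double), 83.232 KB | Pairfunction (double), 408.608 KB | Plist (int), 816 B
-----------------|---------------------------------|-----------------------------------|-------------------
L1 cache/shared  | Yes, but partly cached          | No                                | Yes
L2 cache         | Yes. Exceeds L1 cache size      | Yes. Due to sharing between MCs   | No
Constant Memory  | No                              | No                                | No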
Summary of Memory Prediction

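Blocksize = 64 and we have three copies of a configuration.
By thread count, the total MCs per SM will be 1536/64 = 24.
Memory consumption will be 1.152 × 8 = 9.216 kB per copy, and 9.216 × 3 = 27.648 kB in total, since there are three copies of a configuration.
The pairfunction will be 5.408 kB and common to all configurations.
By memory, the total MCs per SM will be 48/27.648 ≈ 1.7, that is, one MC per SM.
Table: L = 5

Lattice | Threads per MC | Config  | $\Psi^{-1}$ | Pairfunc                     | Plist | SM occupancy
--------|----------------|---------|-------------|------------------------------|-------|-------------
L = 5   | 64             | Config1 | L1          | Constant, common to all MCs  | L1    |
        |                | Config2 | L1          |                              | L1    |
        |                | Config3 | L1          |                              | L1    |
        |                | Total   | 27.648 kB   | 5.408 kB                     | 288 B | 1 MC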
Initial Parallelization
Figure: DetPByDet - Up spin, DetPByDet - Down spin
Figure: UpdateConfig - Up spin, UpdateConfig - Down spin
Elimination of Redundant Copies of a Configuration
We removed the redundant copies of a configuration.
An MC now maintains a single copy of a configuration.
The following table predicts the MC occupancy per SM: 48/9.216 = 5.20, rounded down to 5.
Table: Occupancy calculation for L = 5

Lattice | Threads per MC | Config  | $\Psi^{-1}$ | Pairfunc        | Plist | SM occupancy
--------|----------------|---------|-------------|-----------------|-------|-------------
L = 5   | 64             | Config1 | L1          | Constant Memory | L1    |
        |                | Total   | 9.216 kB    | 5.408 kB        | 96 B  | 5 MCs
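The occupancy prediction is a floor division of the cache budget by the per-chain footprint. The C++ sketch below reproduces both the three-copy and the single-copy estimates for L = 5; the 48 kB budget and the 9.216 kB per-copy footprint are taken from the slides, while the function name and the choice to work in kB as floating-point values are assumptions for illustration.

```cpp
#include <cassert>
#include <cmath>

// Number of Markov chains whose configuration copies fit in the cache budget.
// All sizes are in kB, as quoted in the slides.
int chains_fitting_in_l1(double l1_kb, double config_kb, int copies) {
    assert(l1_kb > 0.0 && config_kb > 0.0 && copies > 0);
    return static_cast<int>(std::floor(l1_kb / (config_kb * copies)));
}
```

`chains_fitting_in_l1(48.0, 9.216, 3)` gives 1 and `chains_fitting_in_l1(48.0, 9.216, 1)` gives 5, so dropping the two redundant copies raises the predicted occupancy from one to five chains per SM.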
Optimized Memory Access Pattern

Figure: DetPByDet - Up spin, DetPByDet - Down spin
Figure: UpdateConfig - Up spin, UpdateConfig - Down spin
Drawbacks
A row or a column is accessed in DetPByDet.
In the update, first a row or column and then the remaining rows or columns of $\Psi^{-1}$ are updated.
Goal: reduce the three function calls DetPByDet, UpdateConfig and DetPByDet (in order) to a single function.
Solution: merge DetPByDet and UpdateConfig to form a streamlined function.
Streamlined Function
Figure: Streamlined function - Up spin, Down spin
Comparison between a CPU and GPU

The CUDA code was tested on an NVIDIA Tesla M2070 GPU and benchmarked against an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz.
The MCs are performed sequentially by a CPU core.
Increasing the blocking factor (Bfactor) gave better performance.

Bfactor = 8

Lattice | No. MCs | CPU Exec Time (s) | GPU Exec Time (s) | Speedup
--------|---------|-------------------|-------------------|--------
L = 5   | 32      | 22.66             | 3.19              | 7.26
L = 9   | 32      | 287.13            | 17.37             | 16.72
L = 15  | 32      | 4920.22           | 309.72            | 15.88

Bfactor = 16

Lattice | No. MCs | CPU Exec Time (s) | GPU Exec Time (s) | Speedup
--------|---------|-------------------|-------------------|--------
L = 5   | 32      | 22.66             | 2.48              | 9.13
L = 9   | 32      | 287.13            | 14.44             | 19.88
L = 15  | 32      | 4920.22           | 256.83            | 19.43
Comparison between MPI-CPU and GPU

We compared our CUDA code with the MPI code implemented by Dr. Sandeep Pathak using Fortran 90.
The MPI code designates an MPI rank per MC.
The results were compared on the Philip supercomputer at HPC, LSU.
Table: Comparison of MPI vs GPU Performance

Lattice | No. MCs | MPI Exec Time (s) | GPU Exec Time (s) | Speedup
--------|---------|-------------------|-------------------|--------
L = 5   | 32      | 4.89              | 3.19              | 1.53
L = 9   | 32      | 63.66             | 17.37             | 3.66
L = 15  | 32      | 1193.68           | 309.72            | 3.85

Table: Philip Node configuration

Nodal Configuration | Two 2.93 GHz Quad Core Nehalem Xeon 64-bit Processors
No. Processors/node | 8
Total No. Nodes     | 32
DRAM                | 24 GB @ 1333 MHz
GPU Results in depth
Table: L = 5

Function                       | No. Cycles
-------------------------------|-----------
Electron move                  | 420361934
DPBD-UP                        | 95922170
DPBD-DN                        | 97922630
DPBD Total                     | 193844800
UPD-UP                         | 322257414
UPD-DN                         | 259125546
UPD Total                      | 581382960
Energy                         | 969450046
Streamlined (DPBD-UPD-DPBD)    | 1630512656

Table: L = 9
GPU results in depth
Figure: L = 5
Figure: L = 9
Table: L = 15
Figure: L = 15
Questions?
Thank You
