
FUSIONSIM:

A Cycle-Accurate CPU + GPU System Simulator


Vitaly Zakharenko, Andreas Moshovos
University of Toronto
Tor Aamodt
University of British Columbia

With support from AMD Canada, the Ontario Centres of Excellence, and
the Natural Sciences and Engineering Research Council of Canada (NSERC).
3 | FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012
WHAT IS FUSIONSIM?

Detailed timing simulator of a complete system with an x86 CPU and a GPU
Fused or Discrete Systems

FusionSim's features:
x86 out-of-order CPU + CUDA-capable GPU
Operate concurrently
Detailed timing models for all components
Models reflect modern hardware

Enables performance modeling:
Fused vs. Discrete
What if scenarios

AGENDA
TWO FLAVOURS OF FUSIONSIM
Structure & Functionality of Discrete FusionSim
Models a discrete system:
Distinct CPU and GPU chips
Separate CPU and GPU DRAM



Structure & Functionality of Fused FusionSim
Models a fused system:
Same CPU and GPU chip
Shared CPU and GPU DRAM
Partly shared memory hierarchy

AGENDA
FUSION: WHICH BENCHMARK BENEFITS?
Analytical speed-up model
Greater speed-up for:
Small benchmark input data size
Many kernel invocations (large cumulative latency overhead)
High benchmark kernel throughput
Long time spent in the GPU code relative to the x86 code

Simulated speed-up results on Rodinia
Range: 1.05x to 9.72x
A closer look at a fusion-friendly benchmark
Large speed-up (up to 9.72x) for small problem sizes
Smaller speed-up (1.8x) for medium problem sizes
Dependence on latency overhead and kernel throughput
G_GPU ≈ 1 + (A_TOTAL · Θ_KERNEL) / data_TOTAL

where A_TOTAL is the total cumulative latency, Θ_KERNEL the kernel throughput, and data_TOTAL the benchmark input data size.
AGENDA
FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?

Kernel spawn latency
From GPU API kernel launch request until actual kernel execution
Simulation: order-of-magnitude reduction is important

CPU/GPU memory coherence
Simulation: performance loss is minor
Less than 2% for most Rodinia benchmarks
DISCRETE FUSIONSIM:
STRUCTURE
CPU from PTLsim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches from MARSSx86: www.marss86.org
DISCRETE FUSIONSIM:
COMPONENT FEATURES
CPU: PTLSIM
Fast x86 simulation: ~200K instructions per second (isolated)
Out-of-Order
Micro-op architecture
Cycle-accurate
Modular & detailed memory
hierarchy model

GPU: GPGPU-SIM
OpenCL/CUDA capable
Currently only CUDA
High correlation vs. Nvidia GT200 and Fermi
NoC
Detailed & configurable
DRAM
Detailed

DISCRETE FUSIONSIM:
START-UP AND MEMORY LAYOUT
Input: standard Linux CUDA benchmark executable
The benchmark's process is created
Simulator is injected into its virtual memory space
Private stack
Private heap & heap management
Invisible to the benchmark process

Simulator executes the benchmark's code:
x86 code on PTLsim
PTX code on GPGPU-Sim
The benchmark's process communicates with FusionSim
via a single page accessible by both


[Figure: process memory layout — benchmark (pink), injected simulator (green), shared page (yellow); the standard dynamic library is replaced]
DISCRETE FUSIONSIM:
MAIN SIMULATION LOOP
Single simulation loop:
Each loop cycle == one tick of a virtual common clock
Common clock × GPU_MULTIPLIER = GPU_FREQ
Common clock × CPU_MULTIPLIER = CPU_FREQ

WHILE (1) {
FOR GPU_MULTIPLIER ITERATIONS DO {
GPU_CYCLE()
}
FOR CPU_MULTIPLIER ITERATIONS DO {
CPU_CYCLE()
}
}
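The loop above can be sketched as a runnable model; a minimal sketch assuming illustrative multiplier values, with gpu_cycle()/cpu_cycle() as stand-in stubs rather than FusionSim's real per-cycle functions:

```c
#include <assert.h>

/* Illustrative clock ratios: one tick of the common virtual clock
 * advances the GPU by GPU_MULTIPLIER cycles and the CPU by
 * CPU_MULTIPLIER cycles, so COMMON_FREQ * MULTIPLIER = DOMAIN_FREQ. */
enum { GPU_MULTIPLIER = 2, CPU_MULTIPLIER = 3 };

static long gpu_cycles, cpu_cycles;

static void gpu_cycle(void) { gpu_cycles++; } /* stub: advance GPU model one cycle */
static void cpu_cycle(void) { cpu_cycles++; } /* stub: advance CPU model one cycle */

/* Run 'ticks' iterations of the common-clock simulation loop. */
static void simulate(long ticks) {
    for (long t = 0; t < ticks; t++) {
        for (int i = 0; i < GPU_MULTIPLIER; i++) gpu_cycle();
        for (int i = 0; i < CPU_MULTIPLIER; i++) cpu_cycle();
    }
}
```

The two domains stay locked in the fixed GPU_MULTIPLIER : CPU_MULTIPLIER ratio regardless of how long the simulation runs.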


DISCRETE FUSIONSIM:
EXAMPLE GPU API CALL
Virtual PTLsim CPU executes x86 code
until a call to a GPU API function, e.g. cudaMemcpyAsync(a, b, c), is reached

On next GPU cycle, FusionSim
Identifies pending API call
Enqueues the task for the GPU
Decides whether to block the CPU (synchronous) or
to let the CPU proceed (asynchronous)
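This dispatch step can be sketched as follows; the ApiCall structure and queue helper are hypothetical illustrations, not FusionSim's actual internals:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical representation of a pending GPU API call. */
typedef struct {
    bool pending;     /* set by the virtual CPU when it reaches an API call */
    bool synchronous; /* e.g. cudaMemcpy vs. cudaMemcpyAsync */
} ApiCall;

static int queued_tasks;
static bool cpu_blocked;

static void enqueue_for_gpu(ApiCall *c) { queued_tasks++; (void)c; }

/* Called once per GPU cycle: pick up a pending call, queue it for the
 * GPU, and decide whether the CPU must stall until it completes. */
static void gpu_cycle_check_api(ApiCall *c) {
    if (!c->pending) return;
    enqueue_for_gpu(c);           /* task is now owned by the GPU model */
    cpu_blocked = c->synchronous; /* block the CPU only for synchronous calls */
    c->pending = false;
}
```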
DISCRETE FUSIONSIM:
SIMULATOR FEATURES
Correctly models ordering and overlap in time of:
Asynchronous & synchronous operations
Memory transfers
CUDA events
Kernel computations
CPU processing

Models duration of all CUDA stream operations

Simple and powerful mechanism for management of configuration and simulation output files

FUSED FUSIONSIM:
STRUCTURE
Processing Cluster is replaced by a CPU

CUDA global memory address space is shared
No more memory transfers from/to device DRAM

Last Level Cache size is adjusted (increased)
The GPU's L2 is also the CPU's L3

CPU: L1 and L2 private caches



FUSED FUSIONSIM:
A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES
CUDA global memory space
Shared between CPU & GPU
Accessible by both using the same virtual address
Cached in LLC and mapped to DRAM

CUDA local memory space
Private to the GPU
Inaccessible by the CPU
Cached in LLC and mapped to DRAM

How do we model these?



FUSED FUSIONSIM:
SIMULATING THE CPU AND GPU MEMORY SPACES
Common Virtual memory
Used by both the CPU and the GPU
Slightly different virtual memory spaces

Generic virtual address
Used by GPU
For the same location X accessible by the CPU
Generic_virt_addr = virt_addr + 0x40000000

32-bit virtual address space (4GBytes)
FusionSim does not simulate OS kernel code => the top-most 1 GByte of addresses is unused
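The adjustment can be written down directly; the 0x40000000 offset comes from the slide, while mmu_translate is a hypothetical identity stand-in for the shared MMU:

```c
#include <assert.h>
#include <stdint.h>

#define GENERIC_OFFSET 0x40000000u /* generic = CPU virtual + 1 GByte */

/* GPU-side "generic" virtual address for a CPU virtual address.
 * Safe because the top-most 1 GByte of the 32-bit space is unused
 * (FusionSim does not simulate OS kernel code). */
static uint32_t to_generic(uint32_t cpu_virt) {
    return cpu_virt + GENERIC_OFFSET;
}

/* Hypothetical MMU translation, identity-mapped for illustration.
 * The CPU (after adjusting to generic) and the GPU both go through
 * the same function, modeling the single shared MMU. */
static uint32_t mmu_translate(uint32_t generic_virt) {
    return generic_virt; /* stand-in for a real page-table walk */
}
```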
FUSED FUSIONSIM:
MEMORY SPACE: WHERE AND WHAT
CPU
Uses CPU virtual address
GPU
Uses generic virtual address
Caches
Physically-addressed
CPU adjusts virtual address to generic and translates it to physical
GPU directly translates generic to physical
MMU
Same MMU for both the CPU and the GPU
FUSED FUSIONSIM
MEMORY COHERENCE
Shared CUDA global address space

Same block from global space
Cached in private CPU L1 $
Cached in private GPU L1 $

Potential coherence problem

First-cut solution: Flushing caches to LLC
Interesting area for exploration



FUSED FUSIONSIM
MEMORY COHERENCE: IMPLEMENTATION
CPU side:
Selective flushing of private caches
cudaSelectivelyFlush(address, size)
prior to every kernel invocation
for every region of memory accessed by the kernel

GPU side:
GPGPU-Sim already flushes the caches






FUSED FUSIONSIM
CHANGES TO GPU API
No need for device memory allocation API
cudaMalloc()
cudaFree()

No memory transfers to/from device DRAM
cudaMemcpy()
cudaMemset()

Additional API function:
cudaSelectivelyFlush(address, size)
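A sketch of how a benchmark's host code changes under the fused API; cudaSelectivelyFlush is the function named above, modeled here as a stub, and the buffer handling is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Stub for the fused-mode API function named on the slide: flushes the
 * CPU's private caches for [addr, addr+size) before a kernel runs. */
static size_t flushed_bytes;
static void cudaSelectivelyFlush(void *addr, size_t size) {
    (void)addr;
    flushed_bytes += size;
}

/* Discrete system (unmodified benchmark):
 *   cudaMalloc(&dev_buf, n);
 *   cudaMemcpy(dev_buf, host_buf, n, HostToDevice);
 *   launch kernel on dev_buf;
 *   cudaMemcpy(host_buf, dev_buf, n, DeviceToHost);
 *   cudaFree(dev_buf);
 *
 * Fused system (modified benchmark): no allocation, no copies --
 * the GPU works on the host buffer directly after a selective flush. */
static void run_kernel_fused(float *host_buf, size_t n_bytes) {
    cudaSelectivelyFlush(host_buf, n_bytes); /* keep GPU reads coherent */
    /* launch kernel directly on host_buf (shared address space) */
}
```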

FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY
Rodinia
benchmark suite for heterogeneous computing

Discrete system modeled by Discrete FusionSim
Unmodified Rodinia

Fused system modeled by Fused FusionSim
Modified Rodinia:
No cudaMalloc()/cudaFree()
No memory transfers
Added cudaSelectivelyFlush()

Data input generation is excluded from the time measurement


FUSED VS DISCRETE: RELATIVE PERFORMANCE
Rodinia benchmarks

Two baseline discrete systems:
10 µsec kernel spawn latency
100 µsec kernel spawn latency

Speed-up varies:
From 1.05x
nn, 10 µsec
Up to 9.72x
gaus_4, 10 µsec


[Chart: fused vs. discrete speed-up per benchmark — higher (FUSED) is better]
FUSED SYSTEM: KERNEL SPAWN LATENCY
One baseline discrete system:
10 µsec kernel spawn latency

Different fused systems:
0.1 µsec kernel spawn latency
1 µsec kernel spawn latency
10 µsec kernel spawn latency

Simulations show:
Reduction of the latency to 1 µsec is important
Further reduction below 1 µsec is NOT important


[Chart: speed-up per kernel spawn latency — higher (FUSED) is better]
FUSED SYSTEM: COHERENCE OVERHEAD
Two fused systems:
Incoherent vs. coherent
Kernel spawn latency is 0.1 µsec in both systems

Simulations show:
Minor performance loss
Less than 2% for most benchmarks
5% for bfs_small

[Chart: coherence overhead per benchmark — smaller is better]
FUSION: WHICH BENCHMARK BENEFITS?
ANALYTICAL MODEL
Semantics:
data_TOTAL — benchmark input data size
A_TOTAL — total cumulative latency
Θ_KERNEL — kernel throughput

Greater speed-up for:
Small benchmark input data size (small data_TOTAL)
Many kernel invocations and memory transfers (large A_TOTAL)
High benchmark kernel throughput (large Θ_KERNEL)
Long time spent in the GPU code relative to the CPU code

G_GPU ≈ 1 + (A_TOTAL · Θ_KERNEL) / data_TOTAL
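The model can be evaluated numerically; a minimal sketch with purely illustrative parameter values (not measured Rodinia numbers):

```c
#include <assert.h>

/* Analytical GPU-code speed-up of fused over discrete:
 *   G_GPU ~= 1 + (A_TOTAL * Theta_KERNEL) / data_TOTAL
 * where A_TOTAL is the total accumulated latency (seconds),
 * Theta_KERNEL the kernel throughput (bytes/second), and
 * data_TOTAL the benchmark input data size (bytes). */
static double gpu_speedup(double a_total, double theta_kernel,
                          double data_total) {
    return 1.0 + a_total * theta_kernel / data_total;
}
```

Shrinking data_TOTAL with A_TOTAL and Θ_KERNEL fixed raises the predicted speed-up, matching the "small input, many invocations" trend above.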
FUSION: WHICH BENCHMARK BENEFITS?
TWO SCENARIOS
Large data_CHUNK per iteration:
Kernel and copy times are significant
Per-iteration latency overhead (o_COPY + o_KS) is insignificant
→ small benefit from fusion

Small data_CHUNK per iteration:
Kernel and copy times are insignificant
Per-iteration latency overhead (o_COPY + o_KS) is significant
→ large benefit from fusion
FUSION: WHICH BENCHMARK BENEFITS?
INPUT DATA SIZE
Greater problem size → smaller benefit from fusion
[Figure: speed-up (FUSED is better) vs. increasing input size — Rodinia BFS and Rodinia Gaussian]
FUSION: WHICH BENCHMARK BENEFITS?
LATENCY OVERHEAD
Comparison between two benchmarks:
Rodinia Gaussian
Speed-up 9.72x
Rodinia NN
Speed-up 1.05x
Why?
100 times more kernel spawns for Gaussian
10 times more memory copies for Gaussian

Normalized latency overhead: A_TOTAL / data_TOTAL

G_GPU ≈ 1 + (A_TOTAL · Θ_KERNEL) / data_TOTAL
FUSION: WHICH BENCHMARK BENEFITS?
KERNEL THROUGHPUT
Comparison between two benchmarks:
Rodinia BFS
Speed-up 4.28x
Rodinia NN
Speed-up 1.05x
Why?
100 times greater throughput for BFS

Kernel throughput: Θ_KERNEL

G_GPU ≈ 1 + (A_TOTAL · Θ_KERNEL) / data_TOTAL
FUSIONSIM WEBSITE:
DOCUMENTATION & SOURCE CODE
www.fusionsim.ca
Discrete FusionSim & Fused FusionSim
Source code
Documentation
Google group for collaborators
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
SYMBOL MEANINGS
data_KER — input data size per kernel invocation
data_TOT — total benchmark input data size (data_TOT = n_KER · data_KER)
n_KER — number of kernel invocations
Θ_KER — kernel data throughput
Θ_COPY — host–device copy throughput
o_KS — kernel spawn latency
o_TOT — total latency per iteration (memory transfers + kernel spawn)
A_TOTAL — total accumulated latency (A_TOTAL = n_KER · o_TOT)
t_GPU, t'_GPU — CUDA-code execution time on the discrete / fused system
G_GPU, G_TOT — CUDA-code / total benchmark speed-up
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 1
The kernel can be modeled as a channel of throughput Θ_KER(data_KER), as the actual throughput will vary depending on data_KER. As data_KER increases, Θ_KER saturates.
Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For n_KER iterations do:
1. Copy the input data from the host to the device
2. Launch the kernel on the data
3. Copy the results from the device to the host

For such applications t_GPU is described by the following:

t_GPU = n_KER · (data_KER / Θ_KER + o_TOT)

where Θ_KER is the kernel data throughput and o_TOT is the total latency per iteration resulting from both the memory transfers and the kernel spawn.

For CUDA applications that do not utilize multiple concurrent CUDA streams, the total latency per single computation iteration o_TOT is comprised of the time spent transferring the data to or from the device and the kernel spawn latency:

o_TOT = 2 · (data_KER / Θ_COPY) + o_KS

The above expression holds true for all the considered Rodinia benchmarks.
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 2
G_GPU = 1 + (n_KER · o_TOT · Θ_KER) / data_TOT ≈ 1 + (A_TOTAL · Θ_KER) / data_TOT
o
Since on fused systems this latency reduces to o'_TOT = o'_KS ≤ o_KS, the time t'_GPU of executing the CUDA code on the fused system is given by

t'_GPU = n_KER · (data_KER / Θ_KER + o'_KS) ≈ n_KER · data_KER / Θ_KER

The speed-up of the CUDA code is given by

G_GPU = t_GPU / t'_GPU ≈ (o_TOT · Θ_KER) / data_KER + 1

Since data_KER = data_TOT / n_KER, we obtain

G_GPU ≈ 1 + (A_TOTAL · Θ_KER) / data_TOT

Here A_TOTAL = n_KER · o_TOT is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies.
Please also note that the throughput Θ_KER of a benchmark kernel increases with data_KER for small data_KER values and saturates to a constant for large data_KER values. The throughput saturates when the input data size is sufficient for maximum possible warp scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above, i.e.:

o_TOT ≤ 2 · (data_KER / Θ_COPY) + o_KS

This results in a smaller speed-up G_GPU for such benchmarks. Applying Amdahl's law we get an expression for the total benchmark speed-up G_TOT:

G_TOT = 1 / (%_CPU + %_GPU / G_GPU)
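The Amdahl expression can be checked numerically; a sketch with illustrative fractions, not benchmark measurements:

```c
#include <assert.h>

/* Total benchmark speed-up by Amdahl's law:
 *   G_TOT = 1 / (%CPU + %GPU / G_GPU)
 * where %CPU + %GPU = 1 are the fractions of discrete-system run
 * time spent in x86 code and in CUDA code respectively. */
static double total_speedup(double frac_gpu, double g_gpu) {
    double frac_cpu = 1.0 - frac_gpu;
    return 1.0 / (frac_cpu + frac_gpu / g_gpu);
}
```

Even with an arbitrarily large G_GPU, the total speed-up is capped by the CPU fraction, which is why benchmarks with long x86 phases benefit least from fusion.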

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN
IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.