
LLNL-CONF-659522

Real-time FPGA-based Capture of Memory Traces with Application to Active Memory Emulation

G. S. Lloyd, K. Y. Cheng, M. B. Gokhale


September 2, 2014

Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems
New Orleans, LA, United States
November 16, 2014 through November 21, 2014

Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.

Real-time FPGA-based capture of memory traces with application to active memory emulation
G. Scott Lloyd (lloyd23@llnl.gov) and Maya Gokhale, LLNL
Kevin Cheng, U. of Florida and LLNL

ABSTRACT

Emulation with programmable logic can greatly accelerate evaluation of architectural features by near real time execution of
applications. In this work, we have developed an FPGA emulator to selectively collect memory traces in real time without
perturbing cache or read/write sequences to the memory being tested. The emulator is structured to include pluggable
Accelerators for special functions. We present the emulator
implementation and preliminary evaluation of data reordering
engines in an active memory. The emulator is to be released to
open source in an effort to reduce development time for others
and enable the routine use of FPGA emulation in performance
evaluation of architectures and applications.

1. INTRODUCTION

Understanding and minimizing DRAM energy consumption


is a major focus in node architecture research in the exascale
program. It is particularly valuable to understand memory
access patterns of long running benchmarks and applications
in order to evaluate benefits and drawbacks of proposed architectures, tools, design patterns, and techniques. However,
simulation-based approaches are too slow for full capture, and
performance counter based techniques must rely on sampling.
Further, measurement on real hardware perturbs the cache and
memory system, and those effects make it harder to infer the
underlying memory interactions of the application.
To address the important need to capture traces at a rate
close to the speed of memory, we have built an FPGA emulation framework that records memory transactions generated
by software applications in real time without affecting cache,
memory, or speed of the running application. The framework
runs on the Xilinx Zynq System-on-Chip using a low cost,
readily available development board, and we are making it
available as open source. We are using the framework to evaluate active memory architectures for data-centric computing
applications. In this paper we describe the emulation framework including its capabilities and limitations, and present an
assessment of in-memory data re-ordering techniques to off-load the CPU using a custom hardware module we have designed.

Copyright 2014 Lawrence Livermore National Security, LLC. All rights reserved. LLNL-CONF-659522.

2. EMULATOR

FPGA emulation has long been suggested and undertaken to speed up architecture evaluation. Emulation is a valuable tool to analyze architectures, tools, design templates, techniques, and applications orders of magnitude faster than simulation. While emulation is commonly used to prototype ASIC designs, system architecture evaluation through emulation is employed mainly by large companies for proprietary designs. However, publicly available, open-source emulators for open research are scarce. Proposed emulator platforms include RAMP [13, 12] and FAx86 [5] for processor emulation and Novo-G [1] for mixed granularity emulation.

We have developed an FPGA emulator based on a low cost FPGA development board, and will release the emulator to open source. To help alleviate the development costs of FPGA hardware designs, we have chosen a System-on-Chip FPGA with on-chip hard IP processor cores and put as much infrastructure as possible into software that runs on the embedded cores. The platform we've chosen has performance monitoring hardware on-chip, and we use that when we can. To simplify hardware development for parts of the infrastructure in the programmable logic, we rely heavily on pre-existing freely available IP from the vendor, so that much of the hardware development is simply wiring together IP blocks, along with registering of signals and busses and resolving clock domains as needed. A list of pre-existing modules we have used in the emulator is shown in Table 1.

The Xilinx Zynq 7000 System-on-Chip with both hard processors and FPGA logic is used for the emulation platform. The Zynq block diagram and emulation framework are shown in Figure 1.

The processing system (PS) has two ARM A9 processor cores with separate L1 and shared L2 caches. There is an on-chip 256KB SRAM memory, a dedicated DRAM memory controller used by the ARM cores, and a hard IP DMA engine, as well as interfaces to various peripherals. On our development system, the ARM cores directly access a 1GB memory (DRAM #1). The programmable logic (PL) connects to multi-Gb transceivers and general purpose I/Os. On the development board, the I/Os connect to a SODIMM with a second 1GB DRAM module (DRAM #2). Thus there are two distinct paths to two different memories, one through the ARM on the PS side and one through the FPGA fabric on the PL side. The main processor bus, the AMBA Interconnect (also called the AXI bus), is the common path for the processor cores to communicate with memory and with the FPGA fabric.

Our emulation framework is also shown in Figure 1.

IP | Description | Usage
AXI-Stream FIFO | AXI slave to AXI-Stream adapter | Host interface to Accelerator on peripheral bus
AXI Interconnect | Connects master and slave devices | Attaches Accelerator to memory, ARM with peripherals
AXI DataMover | Contiguous transfers and byte realignment | Main component of DMA Unit in the Accelerator
AXI Performance Monitor | Monitors activity on AXI Interconnect | Creates a trace and provides event counters
MicroBlaze | 32-bit soft processor core | Control Unit in the Accelerator
LMB BRAM Controller | Local-Memory-Bus, Block-RAM Controller | Attaches BRAM to the MicroBlaze as local memory
Block Memory Generator | Configures Block-RAM in programmable logic | Program store for the Accelerator
FIFO Generator | Configures FIFOs in programmable logic | Local buffers for Trace Capture Device
Memory Interface Generator | Configures AXI interface to DDR3 memory | Main storage for Trace Capture Device

Table 1: Pre-existing soft IP used in the emulator



The emulator can be used to capture memory access traces
as they appear on the AXI bus. As a software program runs
on an ARM core, the APM captures each memory transaction
and forwards it to our custom Trace Capture Device, which
sends the trace entry to the memory interface for DRAM #2.
This process occurs concurrently with the transaction being
looped back onto the AXI bus and going to the on-chip memory controller and subsequently to DRAM #1. Trace capture
is completely independent of on-chip caches or ARM memory
operations.
A library has been built to retrieve event counter values from
the APM in programmable logic and from the Performance
Monitoring Unit (PMU) that is part of the Processing System.
The library also provides the capability to inject commands
telling the APM to start and stop tracing, enabling selective
tracing of regions of code. Further, by aliasing the ARM memory space in FPGA logic, it is possible to selectively capture
a subset of the memory traffic generated by the ARM cores,
so that specific data structures can be traced, while accesses to
the rest of the application's state in memory are not traced.
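As an illustration of this selective tracing, the sketch below brackets a single loop with trace start/stop calls and reads a PMU counter around it. The function names are placeholders of our own; the released library's API may differ.

#include <cstddef>

// Hypothetical wrappers around the APM/PMU library; names are illustrative.
extern void apm_trace_start();              // begin capturing AXI transactions
extern void apm_trace_stop();               // stop capturing
extern unsigned long pmu_read_l2_refills(); // read a PMU event counter

void traced_kernel(double* a, const double* b, std::size_t n)
{
    unsigned long before = pmu_read_l2_refills();
    apm_trace_start();                      // only this loop's memory traffic is logged
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
    apm_trace_stop();
    unsigned long after = pmu_read_l2_refills();
    unsigned long refills = after - before; // L2 refills attributable to the loop
    (void)refills;
}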
The trace can be recovered from DRAM #2 in several ways,
depending on the method of connecting the development board
to a host machine. We presently use the Xilinx XMD console
with the JTAG port connection to read the entire trace for offline analysis. This method is slow, and work is in progress
to provide alternate methods such as filtering the trace on the
Zynq, so that only a subset needs to be read back. An alternative approach also under development is to write from DRAM
#2 to an SD card, and manually transfer the entire trace.
An example of the trace is shown in Figure 2. The trace distinguishes ARM vs. Accelerator requests and reads vs. writes, and gives the address accessed and the transaction size. The AXI ID is next (used to tag transactions), followed by the cycle count.
The AXI ID can be used to distinguish which ARM core issued the request. The example shows memory reads from an
ARM core and from the Accelerator. The trace also illustrates
that different payload sizes can be accessed from the memory. This option is not available with most fixed memory controllers, which can only perform cache line size access. The
flexibility of the Zynq memory interface enables exploration
into finer grained memory transactions. In this example we
would like to evaluate flexible memory organizations capable
of servicing both narrow and wide access requests.
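For offline analysis, a trace in this format can be parsed into one record per transaction. The following sketch assumes the comma-separated layout shown in Figure 2 (master, read/write, address, size, AXI ID, cycle count); the struct and function names are our own and not part of the emulator release.

#include <cstdint>
#include <cstdio>
#include <vector>

struct TraceRecord {
    int master;               // 0 = CPU, 1 = Accelerator
    char rw;                  // 'R' or 'W'
    unsigned long long addr;  // byte address on the AXI bus
    unsigned size;            // transaction size in bytes
    unsigned axi_id;          // tags the issuing master (e.g., which ARM core)
    unsigned long long cycle; // time stamp in cycles
};

std::vector<TraceRecord> load_trace(const char* path)
{
    std::vector<TraceRecord> trace;
    std::FILE* f = std::fopen(path, "r");
    if (!f) return trace;
    TraceRecord r;
    while (std::fscanf(f, " %d,%c,%llx,%u,%u,%llu", &r.master, &r.rw,
                       &r.addr, &r.size, &r.axi_id, &r.cycle) == 6)
        trace.push_back(r);
    std::fclose(f);
    return trace;
}

Records loaded this way can be summarized (for example, bytes transferred per master) or converted into the input format of a detailed memory simulator.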
Memory read and write traces generated by the emulator
can be run through DRAM simulators as shown in Figure 3.
This plot shows the power profile of an image differencing
benchmark (see Section 3.2) accessing two simulated DDR2 Micron 32M 8B x4 sg25E devices with a 32B data bus in the DRAMSim2 [9] simulator.

Figure 1: Zynq SoC with emulation framework

From the bottom, the Processor section includes the two ARM A9


cores with their cache hierarchy and an SRAM scratchpad
memory. We have developed a small library that uses the
on-chip performance monitor units to count micro-architecture
events such as L1 or L2 cache hits and memory requests. The
ARM core can issue requests directly to the 1GB DRAM #1
(in the Memory System level) to do memory read or write, or
into the programmable logic (PL), where logic on the FPGA
part can perform arbitrary operations on the request. One supported operation is to forward memory requests to DRAM #1,
allowing the Trace System to monitor these accesses. The
Trace System can also monitor memory transactions from the
Accelerator, a pluggable block of IP for emulating desired
functionality. Modules of SRAM are also available in the programmable logic (FPGA) for use by an Accelerator (labeled
BRAM in Figure 1). The Trace system consists of the AXI Performance Monitor (APM, a soft IP module provided by Xilinx) and our custom Trace Capture Device that writes a trace of memory transactions appearing on the AXI bus from either Processor or Accelerator into the second 1GB DRAM #2.

0,R,0x40101520,32,29,2131065
0,R,0x7f186b20,32,30,2131091
1,R,0x40185060,8,0,2132263
1,R,0x40185080,8,0,2132270

1,W,0x40000030,8,2,2132548
1,W,0x40000034,8,2,2132554
1,R,0x40185260,8,0,2132571
1,R,0x40185280,8,0,2132575
1,W,0x40000038,8,2,2132577
1,W,0x4000003c,8,2,2132581

Trace format by column:
1) CPU=0 / Accelerator=1
2) Read/Write
3) Address
4) Transaction size
5) AXI ID
6) Time stamp


Figure 2: Fragment of a memory trace, with legend

Figure 4: Full and reduced resolution arrays. A DRE can assemble a reduced resolution buffer on demand.


Figure 3: Trace power profile from DRAMSim2

In addition to the emulator, we have implemented the corresponding functional simulator. We use the simulator (C++
class library) for initial application development and debug,
and to study memory bandwidth profiles. The simulator generates timing-independent memory traces that can drive a detailed memory simulator in an as-fast-as-possible mode that assumes memory requests occur every memory clock cycle.
If more detailed tracing with time stamps is desired, the application is compiled for the emulator and run on the Zynq board.
The simulator is also used to prototype Accelerator functions.
On the emulator, Accelerator functions can run on one of the
ARM cores, on a soft IP processor such as a MicroBlaze or
can be implemented in FPGA logic.
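As a rough sketch of how a C++ class library can produce timing-independent traces, an instrumented array type can log every element access; the class below is our own illustration, not the released simulator code.

#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative traced array: each read or write emits one record with no
// time stamp, matching the "as fast as possible" replay mode.
template <typename T>
class TracedArray {
public:
    TracedArray(std::size_t n, std::FILE* log) : data_(n), log_(log) {}
    T read(std::size_t i) const {
        std::fprintf(log_, "R,%p,%zu\n", (const void*)&data_[i], sizeof(T));
        return data_[i];
    }
    void write(std::size_t i, const T& v) {
        std::fprintf(log_, "W,%p,%zu\n", (const void*)&data_[i], sizeof(T));
        data_[i] = v;
    }
private:
    std::vector<T> data_;
    std::FILE* log_;
};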

3.1 Data reordering engines

Many data-centric applications need to perform relatively


simple operations like search, filter, and data reordering, across
a large cross section of data. Using traditional architectures,
the data must be moved through multiple levels of memory
and cache hierarchy to the CPU. Data-centric operations are
ideal for off-load to a memory system with processing capability. This off-load approach operates on data in-place and,
in theory, can drastically reduce the bandwidth required between memory and the CPU, while simultaneously increasing
the performance of data-centric applications. We are using
the emulation framework to evaluate potential benefit to applications that can off-load high value, low complexity data
reordering operations to an active memory.
As an example, Figure 4 illustrates an often used operation
of deriving a reduced resolution image from full resolution.
The full array is stored in memory. The decimated 2D image can be stored in memory or in an on-CPU scratchpad. In our
example, the reduced array is assembled incrementally as requested by the main CPU, and in-memory state machine logic
performs strided DMA access to assemble the reduced view.
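Functionally, assembling the reduced view is a strided gather from the full-resolution array into a small contiguous buffer. The loop below sketches the access pattern the in-memory logic performs; the names and the 8-bit pixel type are our own assumptions.

#include <cstddef>
#include <cstdint>

// Gather a view_w x view_h window of the decimated image into 'view'.
// 'full' is the full-resolution image of width full_w pixels, 'factor' is
// the decimation factor, and (x0, y0) is the window origin in decimated
// coordinates. Illustrative only.
void assemble_view(const uint8_t* full, std::size_t full_w, std::size_t factor,
                   std::size_t x0, std::size_t y0,
                   std::size_t view_w, std::size_t view_h, uint8_t* view)
{
    for (std::size_t y = 0; y < view_h; ++y) {
        const uint8_t* row = full + (y0 + y) * factor * full_w; // strided row
        for (std::size_t x = 0; x < view_w; ++x)
            view[y * view_w + x] = row[(x0 + x) * factor];      // sub-sampled pixel
    }
}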
A data reordering engine can perform kernel operations such
as matrix transpose, data structure reorganization for cache
layout optimization, sampling and decimation of arrays, and
pointer traversal [3]. Application-level use cases include streaming assembly of multiple resolution image windows of high
resolution imagery, in-memory graph traversal, and efficient
sparse matrix access for computations on irregular meshes that
occur in scientific simulations. These are well understood operations and their suitability for execution in memory has been
widely endorsed. The emulation framework enables experiments to design and evaluate strided DMA and gather/scatter
units that interact with CPU and memory in novel ways.
We have designed and implemented a data reordering engine (DRE) Accelerator, as shown in Figure 5.

3. EMULATION OF DATA REORDERING IN AN ACTIVE MEMORY

The Accelerator block of the emulator represents arbitrary


functionality being emulated in the programmable logic and
evaluated within a whole system context. We are using the
emulator to design Accelerators that perform active memory
functions. While processing in memory has been investigated
for more than four decades [11, 8] without significant commercial impact, recent advances in 3D packaging have led to
renewed interest. With this technology, computation that is
typically handled by a CPU is performed within the memory system in the expectation that performance is improved
and energy reduced because processing is done in proximity to the data without incurring the overhead of moving the data across chip interconnects between memory and processor. There have been numerous active memory processor architectures proposed (e.g., [2, 10, 6, 4, 7]). In this work, we describe a data reordering Accelerator consisting of a flexible, programmable DMA unit.

The DRE assembles a small region of the decimated array into a view buffer by copying only the sub-sampled pixels from the high resolution image. The view buffer contains
only contiguous pixels of interest. When the host CPU gets a
cache line from the view buffer, reads to pixels close by in the
decimated view are satisfied from the CPU cache, with greatly
reduced latency. Without the DRE, accessing the next pixel in
the reduced resolution view image may require another memory request if the pixel resides in another cache line. In the
emulator, the DRE accesses memory in 8-byte chunks rather
than an entire CPU cache line of 32 or 64 bytes.
Cache coherence between the DRE and the CPU is managed explicitly by the application with function calls that flush
and invalidate the whole cache or a cache region. Similar to a
message passing paradigm, data blocks initialized by the host
and used by the DRE are stored in memory with a cache flush
on the host side and refreshed with a cache invalidate on the
DRE side.
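The handoff can be pictured as below. The cache-maintenance function names are placeholders; the emulator's actual library calls may differ.

#include <cstddef>

// Hypothetical cache-maintenance operations on an address range.
extern void cache_flush_range(const void* p, std::size_t len);      // write back dirty lines
extern void cache_invalidate_range(const void* p, std::size_t len); // discard stale lines

// Producer (host) side: make a buffer visible in memory before signaling the DRE.
void publish_to_dre(const void* buf, std::size_t len)
{
    cache_flush_range(buf, len); // host-written data now resides in DRAM
}

// Consumer side: refresh the local cache before reading data written by the other party.
void refresh_view(const void* buf, std::size_t len)
{
    cache_invalidate_range(buf, len); // subsequent reads fetch fresh data from memory
}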
In the emulator, the DRE is implemented as a server in the
CU that executes commands from the application client running standalone on an ARM core. The ARM code runs
without an OS, and virtual-to-physical mapping is one-to-one.
The server responds to commands to initialize the image dimensions and decimation factor, fill a view buffer, and flush
or invalidate its cache. The client on an ARM core iteratively
sends commands to the DRE to fill the view buffer and performs the image differencing operation on the view buffers,
writing the differenced image to memory one view buffer at a
time. The view buffer is stored in on-chip scratchpad SRAM.
The application enables tracing through calls to the APM trace
library (see Section 2), so that the addresses and lengths of all
memory reads and writes by both ARM and DRE are captured
in FPGA logic and logged to DRAM #2. The read and write
operations are concurrently and transparently routed to the onboard memory and do not perturb the ARM or DRE caches or
the application's memory usage.
In the plots shown in Figures 6 and 7, tracing was only enabled for the main iteration loop so that memory activity specific to emulation setup and shutdown was not included in the
trace. Additionally, by aliasing the host memory address space
to the programmable logic, memory traffic to the image arrays
and view buffers can be selectively traced, allowing code, libraries, and other program state to be accessed without affecting the trace. Control over temporal and spatial tracing allows
the application or active memory designer to focus on a specific experiment and not have to filter out other artifacts such as
OS and unrelated runtime software through post-processing.
The plots use a simple energy model to graph power used to
access the image arrays during program execution. The model
assumes 10 pJ/bit for in-memory access, 30 pJ/bit for CPU access, and 1 pJ/bit to read or write the on-chip scratchpad. The simple power model was used because detailed DRAM simulators require a fixed data payload size, and in our experiments the host and DRE access different data sizes (32 bytes by the host and 8 bytes by the DRE). The plots compare power profiles of image differencing using the ARM core only (Host Only) and using the Accelerator in conjunction with the host (Host+DRE). In this experiment, with a decimation factor of 8, the accelerated version runs in 4/5 the time and uses 1/3 the power.
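A minimal sketch of this energy model is given below; classifying each access (DRE vs. CPU, scratchpad vs. DRAM) is left to the caller, and the clock rate used to convert cycles to time is an input rather than a fixed value.

#include <cstdint>
#include <vector>

// Energy for one access under the stated coefficients: 1 pJ/bit for the
// on-chip scratchpad, 10 pJ/bit for in-memory (DRE) access, 30 pJ/bit for
// CPU access to DRAM.
double access_energy_pj(bool from_dre, bool to_scratchpad, unsigned bytes)
{
    double pj_per_bit = to_scratchpad ? 1.0 : (from_dre ? 10.0 : 30.0);
    return pj_per_bit * 8.0 * bytes;
}

// Average power over a window: total energy divided by the window length,
// with cycles converted to seconds at the given clock rate.
double window_power_watts(const std::vector<double>& energies_pj,
                          uint64_t window_cycles, double clock_hz)
{
    double total_pj = 0.0;
    for (double e : energies_pj) total_pj += e;
    return (total_pj * 1e-12) * clock_hz / static_cast<double>(window_cycles);
}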

Figure 5: Structure of the data reordering Accelerator.

The DRE consists of two major modules, a DMA Unit and a Control Unit (CU). The core of the DMA Unit is the Xilinx DataMover IP
block (see Table 1) which transfers contiguous blocks of memory and performs byte realignment. With commands coming
from the CU, the DMA Unit can be programmed to support
strided transfers and gather/scatter operations. Both the CU
and the DMA Unit can access memory through the AXI interface. ARM cores issue commands to the DRE through a
slave peripheral port. The CU has been implemented with the
Xilinx MicroBlaze, a soft IP microprocessor, and with local
Block-RAM to hold Accelerator functions.
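One way to picture the CU-to-DMA interface is a small command descriptor per transfer; the layout below is purely illustrative and is not the actual register map or command format.

#include <cstdint>

// Illustrative command the Control Unit might hand to the DMA Unit for a
// strided transfer (a gather is a sequence of such bursts).
struct DmaCommand {
    uint64_t src;    // source base address
    uint64_t dst;    // destination base address
    uint32_t length; // bytes per contiguous burst
    uint32_t stride; // bytes to advance the source between bursts
    uint32_t count;  // number of bursts
};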

3.2 Image differencing

This mini-application performs pixel-wise image differencing of reduced resolution views into two high resolution images. The benchmark loads two high resolution 2D images
into memory and, given a decimation factor, subtracts corresponding pixels in reduced resolution views in both x and y
dimensions. The resulting reduced resolution difference image can be saved to a file or displayed.
In this implementation, the application allocates two view
buffers in on-chip scratchpad memory to hold a portion of
each decimated image. The image difference is performed on
an ARM core by subtracting pixel values from the two view
buffers. Iteratively, the view buffers are filled by the DRE with
pixels sub-sampled from the original image and then the pixel
difference is taken by the ARM. The resulting difference value
is written to the reduced size output image in memory.
Two DREs are used by the image difference application
with one assigned to each view buffer. A setup function communicates the location, pixel size, and dimensions of the original image to the DRE along with the decimation factor. These
values are stored within the DRE and are used when assembling a decimated view. When the application wants a new
view, a fill function is called with the offset into the decimated
array and also the view buffer location and length. The application then calls a wait function that returns when the DRE has finished the operation.
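Putting the pieces together, the host-side loop looks roughly like the sketch below. The DRE interface (setup, fill, wait) follows the functions named above, but the exact signatures, the scratchpad allocator, and the view buffer size are assumptions of ours.

#include <cstddef>
#include <cstdint>

// Hypothetical host-side handle for one data reordering engine.
struct DRE {
    void setup(const uint8_t* image, std::size_t pixel_size,
               std::size_t width, std::size_t height, std::size_t factor);
    void fill(uint8_t* view, std::size_t length, std::size_t offset);
    void wait(); // returns when the fill operation has completed
};

uint8_t* scratchpad_alloc(std::size_t bytes); // allocate in on-chip SRAM (assumed)

// Difference two decimated views assembled by two DREs, one buffer at a time.
void diff_images(DRE& dre_a, DRE& dre_b,
                 const uint8_t* img_a, const uint8_t* img_b,
                 std::size_t width, std::size_t height, std::size_t factor,
                 uint8_t* out) // reduced-size difference image in DRAM
{
    const std::size_t view_len = 4096;            // e.g., 4 KiB per view buffer
    uint8_t* view_a = scratchpad_alloc(view_len);
    uint8_t* view_b = scratchpad_alloc(view_len);

    dre_a.setup(img_a, 1, width, height, factor); // 1-byte pixels assumed
    dre_b.setup(img_b, 1, width, height, factor);

    std::size_t reduced = (width / factor) * (height / factor);
    for (std::size_t off = 0; off < reduced; off += view_len) {
        std::size_t n = (reduced - off < view_len) ? reduced - off : view_len;
        dre_a.fill(view_a, n, off); dre_a.wait(); // assemble sub-sampled pixels
        dre_b.fill(view_b, n, off); dre_b.wait();
        for (std::size_t i = 0; i < n; ++i)       // pixel-wise difference on the ARM
            out[off + i] = static_cast<uint8_t>(view_a[i] - view_b[i]);
    }
}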

Figure 6: Power profiles comparing host-only with DRE-assisted image differencing: (a) decimation factor 4, (b) decimation factor 8, (c) decimation factor 16.

// Loop over adjacency list for vertex i
for (size_t j = 0; j < edges; ++j) {
    new_pr += cur_pr[edge_i[j]];
}

(a) No DRE

dre.setup(cur_pr, double_sz, edge_i, edges);
dre.fill(view_pr_i, view_sz, view_off);
// Loop over adjacency list for vertex i
for (size_t j = 0; j < edges; ++j) {
    new_pr += view_pr_i[j];
}

(b) With DRE

Figure 8: Indirect access of the page rank vector based on the index array edge_i
Figure 7: Power profiles comparing host-only with DRE-assisted image differencing with decimation factor 8, at 10 microsecond granularity.

As our study focuses on scale-free
graphs such as social networks, we use the synthetic RMAT
generator from the Graph500 benchmark.
We are interested in the realistic scenario in which the page
rank vector length is significantly larger than the processor's
cache. The page rank vector contains a double float for every
vertex in the graph. We use an adjacency list representation
of the graph in which a list element holds the vertex id and
its edge list, i.e. the ids of its edge targets. As the graph is
traversed, standard caching is ineffective due to the low probability of reusing a cache line's data. Making use of our data
reordering engine, we have modified PageRank to assemble a
view of the page rank vector of those elements indexed by the
vertex's adjacency list edge_i. Figure 8a shows the conventional access method using each target vertex id in an edge list
to index into the page rank vector. In contrast, Figure 8b shows
that the indirect index can be replaced by a direct index into a
page rank view. The DRE command setup associates the page rank vector with the edge list. The command fill then fills the
page rank view buffer starting at a specified offset in the edge
list. An indirect access in the original page rank vector is replaced by a direct access into the page rank view associated
with the edge list being processed. This approach significantly
reduces the volume of data transferred between main memory
and the processor, as well as the number of cache line fetches.

For a decimation factor of 4, the host only version is faster


but uses more energy than the DRE version. Emulation of a
range of parameters enables very fast quantitative evaluation
of trade-offs in power and performance. Figure 7 magnifies a
portion of the trace to highlight the interplay between host and
DRE and the periodic cache management activity by the host.

3.3 Page rank

Another important application is the analysis of irregular


graphs such as social networks. Processing these graphs is
challenging due to the unstructured and irregular connections
(edges) between the vertices, which leads to low cache line
reuse and frequent processor stalls.
We have studied the memory access patterns and the data
reorganization opportunities for the PageRank algorithm applied to scale-free graphs. PageRank is a social network analysis tool that ranks vertices by their relative importance. Designed to rank pages on the Web, PageRank models a random
web surfer that randomly follows links with random restart.
It is often iteratively computed as a stochastic random walk
with restart, where the initial page rank vector is a uniformly
distributed random number across all vertices. The memory access pattern of PageRank is dependent on the underlying topology of the graph.

Figure 9: View assembly. The DRE assembles a page rank view based on the adjacency list for vertex i, which is an index array into the original page rank vector.
Figure 10: Power profile fragment comparing host-only with DRE-assisted page rank (scale 17 graph, 10 microsecond granularity). Red peaks clearly show the scale-free nature of the graph.

Theoretically, a separate DRE could be used for each vertex


in the graph; however, in this example only a single DRE is
used, and it must be set up for each vertex. The DRE setup
function specifies the page rank vector and an index array,
which is the current vertex's adjacency list. A small 4 KiB
view buffer is allocated in the SRAM scratchpad and filled repeatedly by the DRE at successive offsets instead of allocating
a potentially large view buffer equal in size to the page rank
vector. Figure 9 illustrates the assembly of a view by the DRE
given the original page rank vector and an adjacency list.
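The per-vertex work can be sketched functionally as below: the gather performed by the DRE is shown as an explicit loop filling the 4 KiB view buffer in chunks, followed by the host-side accumulation. Names and types are our own.

#include <cstddef>
#include <cstdint>

// For one vertex: gather page_rank[edge_i[j]] into a small scratchpad view
// buffer, one chunk at a time, and accumulate the contributions on the host.
void accumulate_vertex(const double* page_rank, const uint32_t* edge_i,
                       std::size_t edges, double* view, // 4 KiB scratchpad buffer
                       double* new_pr)
{
    const std::size_t chunk = 4096 / sizeof(double); // doubles per view buffer
    for (std::size_t off = 0; off < edges; off += chunk) {
        std::size_t n = (edges - off < chunk) ? edges - off : chunk;
        for (std::size_t j = 0; j < n; ++j)          // what the DRE fill assembles
            view[j] = page_rank[edge_i[off + j]];
        for (std::size_t j = 0; j < n; ++j)          // host-side accumulation
            *new_pr += view[j];
    }
}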
Figure 10 shows a representative fragment of the power profile of the page rank benchmark (scale 17 graph), both with and
without the gather/scatter DRE. The host only version (blue)
shows the irregular access pattern caused by indirect access
into the page rank vector. In contrast, the DRE-assisted version shows a more regular pattern as the DRE fills the page
rank vector (green), and then hands off the remaining computation to the host (red). The red spikes correspond to the host
updating long page rank vectors associated with hubs (large
degree vertices) in the scale free graph. As the DRE performs
irregular access at much reduced power in 8 byte chunks, its
power profile is more than an order of magnitude lower than
the host only.

4. CONCLUSIONS AND FUTURE WORK

We have built an FPGA emulator to capture memory traces in real time without perturbing application cache or memory traffic. The emulator runs on a low cost System-on-Chip development board and primarily uses the vendor's IP blocks, thus minimizing development costs. While the emulator is under active development, we plan to make the source openly available, so that others can use, modify, and extend the framework. The emulator is structured to allow for pluggable Accelerator blocks, and we have built Accelerators to perform application-specific data reordering functions. The emulator enables rapid evaluation of a range of parameters and problem sizes of application benchmarks.
We have been using the emulator standalone to collect traces and then analyzing the traces offline with open memory simulators such as DRAMSim2 and proprietary simulators. In the future, we would like to integrate the emulator more closely with other simulators/emulators such as SST, enabling scaling to parallel execution environments. Another limitation of the emulator is that traces must be less than 1GB. Mitigation strategies are selective capture of high value code regions or pausing the application to read out the trace and then resuming. Another challenge is calibration between the specific processor/cache of the Zynq and other processor micro-architectures
and cache configurations. This is not a major concern for our
purpose, which is to investigate Accelerator modules for inclusion in active memory. We hope by making the emulator
available to others, the simulation and performance analysis
community can find this platform useful for other experiments.

Acknowledgments.
This work was performed under the auspices of the U.S.
Department of Energy by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344 and was
funded by LDRD. We gratefully acknowledge Roger Pearce's contribution of the host-only PageRank benchmark.

5. REFERENCES

[1] http://www.chrec.org/facilities/.
[2] Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, and Peter M. Kogge. A low cost, multithreaded processing-in-memory system. In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI '04), in conjunction with the 31st International Symposium on Computer Architecture, pages 16-22, New York, NY, USA, 2004. ACM.
[3] Pedro C. Diniz and Joonseok Park. Data reorganization engines for the next generation of system-on-a-chip FPGAs. In Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays (FPGA '02), pages 237-244, New York, NY, USA, 2002. ACM.
[4] Jeffrey Draper, J. Tim Barrett, Jeff Sondeen, Sumit Mediratta, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. A prototype processing-in-memory (PIM) chip for the data-intensive architecture (DIVA) system. The Journal of VLSI Signal Processing, 40:73-84, 2005.


[5] Elias El Ferezli. FAx86: An open-source FPGA-accelerated x86 full-system emulator. Master's thesis, University of Toronto, 2011.
[6] Basilio B. Fraguela, Jose Renau, Paul Feautrier, David Padua, and Josep Torrellas. Programming the FlexRAM parallel intelligent memory system. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '03), pages 49-60, New York, NY, USA, 2003. ACM.
[7] Joseph Gebis, Sam Williams, Christos Kozyrakis, and David Patterson. VIRAM1: A media-oriented vector processor with embedded DRAM. In Student Design Contest, 41st Design Automation Conference (DAC '04), San Diego, California, June 7-11, 2004.
[8] Maya Gokhale, William Holmes, and Ken Iobst. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer, 28(4):23-31, April 1995.
[9] Paul Rosenfeld, Elliott Cooper-Balis, and Bruce Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, number 1, pages 16-19. IEEE, January 2011.
[10] Yan Solihin, Jaejin Lee, and Josep Torrellas. Using a user-level memory thread for correlation prefetching. In Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA '02), pages 171-182, Washington, DC, USA, 2002. IEEE Computer Society.
[11] Harold S. Stone. A logic-in-memory computer. IEEE Transactions on Computers, C-19(1):73-78, January 1970.
[12] Z. Tan, A. Waterman, et al. A case for FAME: FPGA architecture model execution. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), pages 290-301. ACM, 2010.
[13] J. Wawrzynek, D. Patterson, et al. RAMP: Research Accelerator for Multiple Processors. IEEE Micro, April 2007.
