Disclaimer
This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.
ABSTRACT
Emulation with programmable logic can greatly accelerate evaluation of architectural features by near real time execution of
applications. In this work, we have developed an FPGA emulator to selectively collect memory traces in real time without
perturbing cache or read/write sequences to the memory being tested. The emulator is structured to include pluggable
Accelerators for special functions. We present the emulator
implementation and preliminary evaluation of data reordering
engines in an active memory. The emulator is to be released to
open source in an effort to reduce development time for others
and enable the routine use of FPGA emulation in performance
evaluation of architectures and applications.
1. INTRODUCTION

2. EMULATOR
Table: Xilinx IP used in the emulator.

IP                          | Description                                 | Usage
AXI-Stream FIFO             | AXI slave to AXI-Stream adapter             | Host interface to Accelerator on peripheral bus
AXI Interconnect            | Connects master and slave devices           | Attaches Accelerator to memory, ARM with peripherals
AXI DataMover               | Contiguous transfers and byte realignment   | Main component of DMA Unit in the Accelerator
AXI Performance Monitor     | Monitors activity on AXI Interconnect       | Creates a trace and provides event counters
MicroBlaze                  | 32-bit soft processor core                  | Control Unit in the Accelerator
LMB BRAM Controller         | Local-Memory-Bus Block-RAM controller       | Attaches BRAM to the MicroBlaze as local memory
Block Memory Generator      | Configures Block-RAM in programmable logic  | Program store for the Accelerator
FIFO Generator              | Configures FIFOs in programmable logic      | Local buffers for Trace Capture Device
Memory Interface Generator  | Configures AXI interface to DDR3 memory     | Main storage for Trace Capture Device
[Figure: Emulator organization on the Zynq SoC. The Host Subsystem contains two ARM cores with L1 caches, a shared L2 cache, on-chip SRAM, and an AXI Interconnect. The Memory Subsystem contains 1 GB DRAM #1, BRAM, and Accelerators on a peripheral bus. The Trace Subsystem contains the AXI Performance Monitor (APM), the Trace Capture Device, and 1 GB DRAM #2.]
[Figure: Sample trace records:]
0,R,0x40101520,32,29,2131065
0,R,0x7f186b20,32,30,2131091
1,R,0x40185060,8,0,2132263
1,R,0x40185080,8,0,2132270
1,W,0x40000030,8,2,2132548
1,W,0x40000034,8,2,2132554
1,R,0x40185260,8,0,2132571
1,R,0x40185280,8,0,2132575
1,W,0x40000038,8,2,2132577
1,W,0x4000003c,8,2,2132581
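Each record is one comma-separated line. As a minimal sketch, the C++ below parses a record; the field names are our reading of the samples (a port or master ID, the operation, the address, the transfer size in bytes, an ID tag, and a timestamp) and should be checked against the released trace format.

#include <cstdint>
#include <sstream>
#include <string>

// Assumed field layout: port, R/W, address (hex), size in bytes, tag, timestamp.
struct TraceRecord {
    int port; char op; uint64_t addr;
    unsigned bytes; unsigned tag; uint64_t timestamp;
};

bool parse_record(const std::string& line, TraceRecord& r) {
    std::istringstream in(line);
    std::string f[6];
    for (int i = 0; i < 6; ++i)
        if (!std::getline(in, f[i], ',')) return false;    // malformed record
    r.port      = std::stoi(f[0]);
    r.op        = f[1].empty() ? '?' : f[1][0];             // 'R' or 'W'
    r.addr      = std::stoull(f[2], nullptr, 16);           // accepts 0x prefix
    r.bytes     = static_cast<unsigned>(std::stoul(f[3]));
    r.tag       = static_cast<unsigned>(std::stoul(f[4]));
    r.timestamp = std::stoull(f[5]);
    return true;
}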
[Figure 3: Trace power profile from DRAMSim2.]

[Figure 4: Full and reduced resolution arrays. A DRE can assemble a reduced resolution buffer on demand.]
3. DATA REORDERING ACCELERATOR

Active memory architectures seek to reduce the movement of data across chip interconnects between memory and processor. There have been numerous active memory processor architectures proposed (e.g. [2, 10, 6, 4, 7]). In this work, we describe a data reordering Accelerator consisting of a flexible, programmable DMA unit.
3.1 Accelerator organization

[Figure: Accelerator block diagram. A slave interface on the peripheral bus accepts commands from the host; the Control Unit (a MicroBlaze with BRAM program store on the Local Memory Bus) issues commands (address, length) to the DMA Unit, which performs memory-mapped reads and writes through the AXI Interconnect.]
3.2 Image differencing

This mini-application performs pixel-wise image differencing of reduced resolution views into two high resolution images. The benchmark loads two high resolution 2D images into memory and, given a decimation factor, subtracts corresponding pixels in reduced resolution views in both x and y dimensions. The resulting reduced resolution difference image can be saved to a file or displayed.

In this implementation, the application allocates two view buffers in on-chip scratchpad memory to hold a portion of each decimated image. The image difference is performed on an ARM core by subtracting pixel values from the two view buffers. Iteratively, the view buffers are filled by the DRE with pixels sub-sampled from the original image and then the pixel difference is taken by the ARM. The resulting difference value is written to the reduced size output image in memory.

Two DREs are used by the image difference application, with one assigned to each view buffer. A setup function communicates the location, pixel size, and dimensions of the original image to the DRE along with the decimation factor. These values are stored within the DRE and are used when assembling a decimated view. When the application wants a new view, a fill function is called with the offset into the decimated array and the view buffer location and length. The application then calls a wait function that returns when the DRE has finished the operation.
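As a concrete sketch of this call sequence, the fragment below mirrors the setup/fill call forms shown later in Figure 8; the DRE handle type, the wait() method name, and the buffer bookkeeping are our assumptions, not the released API.

#include <cstddef>
typedef unsigned char pixel_t;

// Hypothetical DRE handle; setup/fill mirror the call forms of Figure 8,
// wait() is our assumed name for the completion call described above.
struct DRE {
    void setup(const pixel_t* image, size_t pixel_sz,
               size_t width, size_t height, size_t decimation);
    void fill(pixel_t* view, size_t view_len, size_t offset);
    void wait();
};

// Host-side loop: two DREs each fill a view buffer in scratchpad SRAM,
// then the ARM core differences the views into the output image.
void image_difference(DRE& dre_a, DRE& dre_b,
                      pixel_t* view_a, pixel_t* view_b,
                      pixel_t* out, size_t out_px, size_t view_px)
{
    for (size_t off = 0; off < out_px; off += view_px) {
        dre_a.fill(view_a, view_px, off);   // gather decimated pixels
        dre_b.fill(view_b, view_px, off);
        dre_a.wait();                       // returns when the DRE has finished
        dre_b.wait();
        for (size_t i = 0; i < view_px; ++i)
            out[off + i] = static_cast<pixel_t>(view_a[i] - view_b[i]);
    }
}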
The DRE assembles a small region of the decimated array into a view buffer by copying only the sub-sampled pixels from the high resolution image. The view buffer contains
only contiguous pixels of interest. When the host CPU gets a
cache line from the view buffer, reads to pixels close by in the
decimated view are satisfied from the CPU cache, with greatly
reduced latency. Without the DRE, accessing the next pixel in
the reduced resolution view image may require another memory request if the pixel resides in another cache line. In the
emulator, the DRE accesses memory in 8-byte chunks rather
than an entire CPU cache line of 32 or 64 bytes.
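For illustration, the copy the DRE performs for a 2D decimated view can be written as the following sketch; this is our rendering of the operation described above, not the emulator's firmware.

#include <cstddef>
typedef unsigned char pixel_t;

// Gather every d-th pixel in x and y from a w-pixel-wide source image,
// starting at source pixel (x0, y0), into a contiguous vw x vh view buffer.
void assemble_view(const pixel_t* src, size_t w,
                   size_t x0, size_t y0, size_t d,
                   pixel_t* view, size_t vw, size_t vh)
{
    for (size_t vy = 0; vy < vh; ++vy)
        for (size_t vx = 0; vx < vw; ++vx)
            view[vy * vw + vx] = src[(y0 + vy * d) * w + (x0 + vx * d)];
}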
Cache coherence between the DRE and the CPU is managed explicitly by the application with function calls that flush
and invalidate the whole cache or a cache region. Similar to a
message passing paradigm, data blocks initialized by the host
and used by the DRE are stored in memory with a cache flush
on the host side and refreshed with a cache invalidate on the
DRE side.
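The handoff can be sketched as follows; cache_flush and cache_invalidate stand in for the library's region flush and invalidate calls, whose real names may differ.

#include <cstddef>

void cache_flush(const void* p, size_t n);      // hypothetical library call
void cache_invalidate(const void* p, size_t n); // hypothetical library call

// Host side: after initializing a block for the DRE, flush it so the DRE
// reads current data from memory; before reading DRE-produced data, drop
// any stale host cache lines covering it.
void publish_block(const void* block, size_t bytes) { cache_flush(block, bytes); }
void receive_block(const void* block, size_t bytes) { cache_invalidate(block, bytes); }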
In the emulator, the DRE is implemented as a server in the
CU that executes commands from the application client running standalone on an ARM core. The ARM code runs
without an OS, and virtual-to-physical mapping is one-to-one.
The server responds to commands to initialize the image dimensions and decimation factor, fill a view buffer, and flush
or invalidate its cache. The client on an ARM core iteratively
sends commands to the DRE to fill the view buffer and performs the image differencing operation on the view buffers,
writing the differenced image to memory one view buffer at a
time. The view buffer is stored in on-chip scratchpad SRAM.
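A sketch of that server loop in the CU follows; the command encoding, mailbox calls, and handler names are illustrative, with only the set of operations taken from the text above.

#include <cstdint>

enum Cmd : uint32_t { CMD_INIT, CMD_FILL, CMD_FLUSH, CMD_INVALIDATE };
struct Msg { Cmd cmd; uint64_t addr; uint32_t len; uint32_t arg; };

Msg  recv_command();  // hypothetical: read a command from the slave port
void send_ack();      // hypothetical: completion reply to the client
void set_geometry(uint64_t image, uint32_t dims, uint32_t decimation);
void fill_view(uint64_t view, uint32_t len, uint32_t offset); // drives the DMA unit
void flush_cache();
void invalidate_cache();

void dre_server() {
    for (;;) {
        Msg m = recv_command();
        switch (m.cmd) {
        case CMD_INIT:       set_geometry(m.addr, m.len, m.arg); break;
        case CMD_FILL:       fill_view(m.addr, m.len, m.arg);    break;
        case CMD_FLUSH:      flush_cache();                      break;
        case CMD_INVALIDATE: invalidate_cache();                 break;
        }
        send_ack();
    }
}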
The application enables tracing through calls to the APM trace
library (see Section 2), so that the addresses and lengths of all
memory reads and writes by both ARM and DRE are captured
in FPGA logic and logged to DRAM #2. The read and write
operations are concurrently and transparently routed to the onboard memory and do not perturb the ARM or DRE caches or
the application's memory usage.
In the plots shown in Figures 6 and 7, tracing was only enabled for the main iteration loop so that memory activity specific to emulation setup and shutdown was not included in the
trace. Additionally, by aliasing the host memory address space
to the programmable logic, memory traffic to the image arrays
and view buffers can be selectively traced, allowing code, libraries, and other program state to be accessed without affecting the trace. Control over temporal and spatial tracing allows
the application or active memory designer to focus on a specific experiment and not have to filter out other artifacts such as
OS and unrelated runtime software through post-processing.
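In code, this selective capture amounts to bracketing only the region of interest with the trace library; trace_start and trace_stop are stand-in names for the APM trace library calls.

void trace_start();  // hypothetical APM trace library binding
void trace_stop();
void main_iteration_loop();

void traced_run() {
    trace_start();         // capture only the main loop's memory activity
    main_iteration_loop();
    trace_stop();          // setup and shutdown traffic stays out of the trace
}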
The plots use a simple energy model to graph power used to access the image arrays during program execution. The model assumes 10 pJ/bit for in-memory access, 30 pJ/bit for CPU access, and 1 pJ/bit to read or write the on-chip scratchpad. The simple power model was used because detailed DRAM simulators require a fixed data payload size, and in our experiments the host and DRE access different data sizes (32 bytes by the host and 8 bytes by the DRE). The plots compare power profiles of image differencing using the ARM core only (Host Only) and using the Accelerator in conjunction with the host (Host+DRE). In this experiment, with a decimation factor of 8, the accelerated version runs in 4/5 the time and uses 1/3 the power.
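Written out, the model is just bits moved times energy per bit; the per-bit constants below are from the text, and the arithmetic is illustrative.

#include <cstddef>

// energy (pJ) = bytes * 8 bits/byte * pJ per bit
double access_energy_pj(size_t bytes, double pj_per_bit) {
    return static_cast<double>(bytes) * 8.0 * pj_per_bit;
}

// Examples with the stated constants:
//   32-byte host access:         32 * 8 * 30 pJ/bit = 7680 pJ
//   8-byte in-memory DRE access:  8 * 8 * 10 pJ/bit =  640 pJ
//   8-byte scratchpad access:     8 * 8 *  1 pJ/bit =   64 pJ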
[Figure 7: Power profiles comparing host-only with DRE-assisted image differencing with decimation factor 8, at 10 microsecond granularity.]
3.3 Page rank

[Figure: Page rank indirect access. For vertex i, an index array derived from the edge list selects entries of the page rank vector (N vertices); the DRE assembles a view based on the index array.]

// Loop over adjacency list for vertex i
for (size_t j = 0; j < edges; ++j) {
    new_pr += cur_pr[edge_i[j]];
}

(a) No DRE

dre.setup(cur_pr, double_sz, edge_i, edges);
dre.fill(view_pr_i, view_sz, view_off);
// Loop over adjacency list for vertex i
for (size_t j = 0; j < edges; ++j) {
    new_pr += view_pr_i[j];
}

(b) With DRE

Figure 8: Indirect access of page rank vector based on the index array edge_i.
4. ACKNOWLEDGMENTS
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344 and was funded by LDRD. We gratefully acknowledge Roger Pearce's contribution of the host-only page rank benchmark.
5. REFERENCES
[1] http://www.chrec.org/facilities/.
[2] Jay B. Brockman, Shyamkumar Thoziyoor, Shannon K. Kuntz, and Peter M. Kogge. A low cost, multithreaded processing-in-memory system. In Proceedings of the 3rd Workshop on Memory Performance Issues (WMPI '04), in conjunction with the 31st International Symposium on Computer Architecture, pages 16–22, New York, NY, USA, 2004. ACM.
[3] Pedro C. Diniz and Joonseok Park. Data reorganization engines for the next generation of system-on-a-chip FPGAs. In Proceedings of the 2002 ACM/SIGDA Tenth International Symposium on Field-Programmable Gate Arrays (FPGA '02), pages 237–244, New York, NY, USA, 2002. ACM.
[4] Jeffrey Draper, J. Tim Barrett, Jeff Sondeen, Sumit Mediratta, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. A prototype processing-in-memory (PIM) chip for the data-intensive architecture (DIVA) system. The Journal of VLSI Signal Processing, 40:73–84, 2005.