
Design and Evaluation of a Four-Port Data Cache
for High Instruction Level Parallelism
Reconfigurable Processors

Kiyeon Lee, Moo-Kyoung Chung, Soojung Ryu, Yeon-Gon Cho and Sangyeun Cho
Computer Science Department, University of Pittsburgh, Pittsburgh, USA
Samsung Advanced Institute of Technology (SAIT), Yongin, Korea

Abstract: This paper explores high-bandwidth data cache designs for a coarse-grained reconfigurable architecture processor family capable of achieving a high degree of instruction level parallelism. To meet stringent power, area and time-to-market constraints, we take an architectural approach rather than circuit-level multi-porting approaches. We closely examine two design choices: single-level banked cache (SLC) and two-level cache (TLC). A detailed simulation study using a set of microbenchmarks and industry-strength benchmarks finds that both SLC and TLC offer reasonably competitive performance at a small implementation cost compared with a hypothetical cache with perfect ports and a multi-bank scratchpad memory.

[Figure 1: main memory words 0x00-0x34 carried over a 64-bit bus, distributed between two cache banks and a 2-entry MBB; 16B cache block size.]

Fig. 1. Data mapping from main memory to miss block buffer (MBB) and to cache banks when word interleaving is used in SLC.

I. INTRODUCTION
The context of this paper is the design of a high-bandwidth data cache for a coarse-grained reconfigurable architecture (CGRA) processor family called RP (reconfigurable processor). Although a conventional scratchpad memory (SPM) is fast and gives predictable performance, it increases programming effort and reduces code portability.

The main challenge of our design problem is to provide as many as four ports to the cache, given four memory access units in an array of 16 FUs [3]. We explore in detail two cache organizations, single-level banked cache (SLC) and two-level cache (TLC), which offer competitive performance compared with the baseline SPM.

II. PROPOSED CACHE DESIGNS

A. Single-level banked cache

Our first design, SLC, is conceptually similar to the current SPM design; each single-port cache bank simply replaces an SRAM bank. As such, SLC requires little overall design change. However, it also inherits the deficiencies of the SPM design, most notably bank conflicts.

To reduce the number of bank conflicts, we use a word interleaving scheme in the SLC design rather than following the traditional block interleaving approach. Word interleaving distributes consecutive data words to cache blocks that fall into different cache banks. Fig. 1 gives an example of how word interleaving works. However, there is a potential drawback to the proposed scheme. On a cache miss, the memory controller has to issue a sequence of reads at different addresses. While this is relatively simple to implement, we end up losing valuable memory bandwidth if the memory bus width is larger than a word. For example, if the word size is 32 bits and the bus width is 64 bits, the sustainable read bandwidth is halved (we use only 32 bits out of each 64 bits fetched). Similarly, the memory bandwidth is underutilized on writes. To address this problem, we introduce a miss block buffer (MBB) to capture non-target cache blocks on the bus using otherwise wasted memory bandwidth.

The MBB is a small fully associative cache that holds all the data received from the memory bus, as shown in Fig. 1. This approach is practical and efficient, since the data in the buffer can be reused multiple times before it is evicted.

B. Two-level cache

The second design, TLC, has two levels of cache memory: a private level-0 (L0) cache per port and a shared level-1 (L1) cache. Figure 2 depicts the TLC organization. Each L0 cache is a small fully associative cache serving requests from a particular memory port. When a memory instruction is issued, the L0 cache is probed to find the requested data in the first memory pipeline stage. If a hit occurs, the requested data is fetched from the L0 cache without accessing the L1 cache. The L1 cache keeps the working set of the program and backs up the L0 caches. It is accessed when a memory instruction misses in the L0 cache. Hence, the success of TLC relies on the L0 cache hit rate and on the number of L1 cache banks (i.e., the bandwidth available to back up the L0 caches).

In our current TLC design, the L0 cache is accessed only by load instructions; when a store is issued, all matching cache blocks in the L0 caches are invalidated and the store instruction goes to the L1 cache.
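As a concrete illustration of the word interleaving of Section II-A, the following Python sketch contrasts block- and word-interleaved bank mapping and models the MBB's capture of non-target words on a miss. The parameters follow Fig. 1 (32-bit words, two banks, 16B blocks, 64-bit bus), but the function names and the exact mapping are ours, not the RP implementation.

```python
WORD_SIZE = 4    # bytes per word (32 bits)
NUM_BANKS = 2    # cache banks, as in Fig. 1
BLOCK_SIZE = 16  # bytes per cache block
BUS_WIDTH = 8    # bytes per memory-bus beat (64 bits)

def block_interleaved_bank(addr):
    # Traditional scheme: consecutive *blocks* alternate across banks, so
    # unit-stride word accesses hit the same bank repeatedly and conflict.
    return (addr // BLOCK_SIZE) % NUM_BANKS

def word_interleaved_bank(addr):
    # SLC scheme: consecutive *words* alternate across banks, so
    # unit-stride accesses spread over the banks.
    return (addr // WORD_SIZE) % NUM_BANKS

# Four consecutive word addresses: block interleaving maps them all to
# one bank; word interleaving alternates banks.
addrs = [0x00, 0x04, 0x08, 0x0C]
print([block_interleaved_bank(a) for a in addrs])  # [0, 0, 0, 0]
print([word_interleaved_bank(a) for a in addrs])   # [0, 1, 0, 1]

# On a miss, each 64-bit bus beat carries two words, but only the target
# word belongs to the missing bank's block; a small MBB keeps the other
# word so a later miss to it needs no extra memory read.
mbb = {}  # addr -> word; a tiny fully associative buffer in spirit
def on_bus_beat(beat_addr, target_addr):
    for w in range(beat_addr, beat_addr + BUS_WIDTH, WORD_SIZE):
        if w != target_addr:
            mbb[w] = "data@%#x" % w

on_bus_beat(0x00, 0x00)
print(0x04 in mbb)  # True: the non-target word was captured
```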

978-1-4673-3052-7/12/$31.00 (c) 2012 IEEE


[Figure 3: bar chart of normalized cycles (1.0 to 1.5, with the conflict-miss contribution broken out in each bar) for SPM, SLC - block, SLC - word, and TLC on matmul, crc32, fft, ifft, avc, cjpeg, djpeg, lame, madplay, tiffdither, and GMEAN.]

Fig. 3. Comparing SPM, SLC and TLC with the ideal memory. Note that the impact of cache misses and bank conflicts on the execution time is shown in each bar. SLC - block and SLC - word represent SLC with block and word interleaving, respectively. When word interleaving is used with SLC, we assume the MBB has four entries. TLC has a 128B L0 cache (with 16B L0 cache blocks) per memory port. SLC is configured with four 16KB L1 cache banks and TLC is configured with two 32KB L1 cache banks. Both SLC and TLC use 16B L1 cache blocks.
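For clarity, the normalization and GMEAN summary used in Fig. 3 amount to the following arithmetic (the cycle counts below are made-up numbers, not measurements from the paper):

```python
import math

# Hypothetical raw cycle counts per benchmark (illustrative only):
ideal  = {"bench_a": 1000, "bench_b": 2000}  # perfect cache, no conflicts
design = {"bench_a": 1100, "bench_b": 2400}  # e.g., one cache configuration

# Each bar in Fig. 3 is the design's cycles divided by the ideal
# memory's cycles for the same benchmark.
normalized = {b: design[b] / ideal[b] for b in ideal}
print(normalized)  # {'bench_a': 1.1, 'bench_b': 1.2}

# GMEAN is the geometric mean of the normalized values.
gmean = math.prod(normalized.values()) ** (1 / len(normalized))
print(round(gmean, 3))  # 1.149
```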

[Figure 2: functional units Fu1, Fu2, Fu9 and Fu10 issue read/write accesses through the Data Memory Controller (DMC); each port has an L0 cache (tag compare and data array) in front of the banked L1 data cache, which connects over the memory bus to main memory.]

Fig. 2. TLC organization and an L0 cache. We assume Fu1, Fu2, Fu9, and Fu10 are the functional units that support memory access.

Design (assumed 45nm technology)               | Area (mm2) | Dynamic read energy/access (nJ)
SPM (4 32KB SRAM banks)                        | 0.566      | 0.031
SPM (4 16KB SRAM banks)                        | 0.240      | 0.023
SLC (4 16KB L1 cache banks)                    | 0.349      | 0.037
SLC (4 16KB L1 cache banks + MBB)              | 0.355      | 0.038
TLC (2 32KB L1 cache banks + 4 128B L0 caches) | 0.341      | 0.019

TABLE I
COMPARING THE AREA AND ENERGY OF SPM, SLC AND TLC DESIGNS.
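The TLC lookup flow pictured in Fig. 2, including the load-only L0 access and store invalidation of Section II-B, can be sketched as a small behavioral model. The class and function names are ours, and caches are modeled as dictionaries rather than the RP hardware structures.

```python
class TLCPort:
    """Behavioral model of one memory port with a private L0 cache."""
    def __init__(self, l1):
        self.l0 = {}   # block address -> data (tiny, fully associative)
        self.l1 = l1   # shared L1, modeled as a dict

    def load(self, addr, block=16):
        blk = addr - addr % block
        if blk in self.l0:                        # L0 hit: L1 not accessed
            return self.l0[blk], "L0"
        data = self.l1.get(blk, "mem@%#x" % blk)  # L0 miss: probe L1
        self.l0[blk] = data                       # fill L0 for reuse
        return data, "L1"

def store(ports, l1, addr, data, block=16):
    blk = addr - addr % block
    for p in ports:            # invalidate all matching L0 blocks
        p.l0.pop(blk, None)
    l1[blk] = data             # the store goes directly to L1

l1 = {}
ports = [TLCPort(l1) for _ in range(4)]  # four memory ports
_, lvl1 = ports[0].load(0x40)
_, lvl2 = ports[0].load(0x44)            # same 16B block: now an L0 hit
print(lvl1, lvl2)                        # L1 L0
store(ports, l1, 0x40, "new")            # invalidates port 0's L0 copy
print(0x40 in ports[0].l0)               # False
```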

III. QUANTITATIVE EVALUATION

Experimental setup. To evaluate the SLC and TLC designs, we develop a cycle-accurate trace-driven simulator. Memory traces are prepared for the following benchmarks: matmul (matrix multiplication), avc (video decoding, an industry-strength real RP application), and benchmarks from the MiBench benchmark suite [1] (crc32, fft, ifft, cjpeg, djpeg, lame, madplay, and tiffdither).

Results. In Figure 3, we evaluate the SLC and TLC designs by normalizing the cycle counts to the cycle count of the ideal memory, a perfect cache having no misses or bank conflicts. Compared with the ideal memory, SPM, SLC with block interleaving, and TLC show 10%, 18%, and 17% performance degradation, on average, respectively. Relative to the SPM, both SLC with block interleaving and TLC show 7% performance degradation on average. SLC with word interleaving has 11% performance degradation compared with the ideal memory. When compared with SPM, SLC with word interleaving increases the cycle count by only 1% on average.

In summary, SLC shows better performance than TLC, unless the benchmarks have high L0 cache hit rates. Word interleaving is very effective in minimizing bank conflicts and reducing the cycle counts in SLC.

Area and energy. The area and energy of the SPM, SLC, and TLC designs are estimated with the CACTI model [2]. Table I presents the result of this estimation. At equal capacity, SPM requires the smallest area. Caches come with hidden overheads, like the tag array and associated logic. However, buffers like the MBB and associative structures like the L0 caches occupy fairly limited area because of their small capacity. TLC uses less chip area than SLC because it uses fewer cache banks. At equal cache capacity, fewer banks imply a smaller area because of fewer wires and less peripheral logic.

In terms of energy, TLC achieves the lowest per-access energy among the three memory designs. The L1 cache is not accessed in TLC when the requested data is found in the L0 cache. Note that an access to an L0 cache consumes significantly less energy than an L1 cache access.

IV. CONCLUSION

We explored two high-bandwidth data cache designs (SLC and TLC) for RP, a coarse-grained reconfigurable architecture (CGRA) processor family. Both designs achieve a relatively low cycle overhead. Based on our evaluation, we conclude that a cache memory is an attractive alternative to the existing SPM for the RP architecture because it reduces programming effort and increases code portability while achieving competitive performance compared with the SPM.

REFERENCES

[1] M. R. Guthaus et al. MiBench: A free, commercially representative embedded benchmark suite. In IEEE 4th Annual Workshop on Workload Characterization, 2001.
[2] N. Muralimanohar et al. CACTI 6.0: A tool to model large caches. Technical Report HPL-2009-85, HP Laboratories, 2009.
[3] B. D. Sutter et al. An efficient memory organization for high-ILP inner modem baseband SDR processors. Journal of Signal Processing Systems, 61(2):157-179, 2010.
