Вы находитесь на странице: 1из 12

CReAMS: An Embedded Multiprocessor

Platform

Abstract. As the number of embedded applications is increasing, the current


strategy of the companies is to launch a new platform within short periods of
time to execute them efficiently with low energy consumption. However, for
each new platform deployment, new tool chains come along, with additional
libraries, debuggers and compilers. This strategy implies in high hardware
redesign costs, breaks binary compatibility and results in a high overhead in the
software development process. Therefore, focusing on area savings, low energy
consumption, binary compatibility maintenance and mainly software
productivity improvement, we propose the exploitation of Custom
Reconfigurable Arrays for Multiprocessor System (CReAMS). CReAMS is
composed of multiple adaptive reconfigurable systems to efficiently exploit
Instruction and Thread Level Parallelism (ILP and TLP) at hardware level, in a
totally transparent fashion. Assuming the same chip area of a multiprocessor
platform, the proposed architecture shows a reduction of 37% in energy-delay
product (EDP) on average, when running applications with different amounts of
ILP and TLP.

1 Introduction
Embedded systems are getting increasingly heterogeneous in terms of software. A
current high end cell phone has a considerable number of applications, being most of
them installed during the product lifetime. As the current embedded systems have hard
design constraints, the applications must be efficiently executed to provide the lowest
possible energy consumption. To support such demand, current embedded platforms
(e.g. OMAP) comprise one or two simple general purpose processors that are
surrounded by several dedicated hardware accelerators (communication, graphics and
audio processors), each one of them with its particular instruction set architecture
(ISA). While nowadays one can find up to 21 accelerators in an embedded platform, it
is expected that 600 different dedicated accelerators will be needed in the year 2024
[1], considering the growth in the complexity of embedded applications.
The foreseen scenario for the current strategy of embedded platform development is
not favorable for future embedded products. The four quadrants plotted in the Figure 1
show the strengths and the weaknesses of the existing hardware strategies used to
design a platform, considering organization and architecture. The main strategy used
in the current embedded platforms, illustrated in the lower left quadrant of the Figure
1, is area inefficient, since it relies on the employment of dedicated accelerators (as
ASIPs or graphic accelerators). Moreover, although it is energy efficient, it covers
only software with a restricted behavior in terms of ILP and TLP. Every release of
such a platform will not be transparent to the software developers, since together with
a new platform, a new version of its tool chain with particular libraries and compilers
must be provided. Besides the obvious deleterious effects on software productivity and
compatibility for any new hardware upgrade, there will also be intrinsic costs of new
hardware and software developments for every new product.
On the other hand, the upper right quadrant of the Figure 1 illustrates the
multiprocessing systems that are conceived as the replication of the same processors,
in terms of architecture and organization. Commonly, such strategy is employed in
general purpose platforms where performance is mandatory. However, energy
consumption is also getting relevant in this domain (e.g. it is necessary to reduce
energy costs in datacenters). In order to cope with this drawback, the homogeneous
architecture and heterogeneous organization, shown in the upper left quadrant of the
Figure 1, has been emerging to provide better energy and area efficiency than the other
two aforementioned platforms, with the cost of higher design validation time, since
many different processors organizations are used. However, it has the advantage of
implementing a unique ISA, so the software development process is not penalized: it is
possible to generate assembly code using the very same tool chain for any platform
version, and still maintain full binary compatibility for the already developed
applications. Unfortunately, since there is a limit of TLP for most applications [26], all
the above strategies do not provide the energy and performance optimization required
with the software productivity necessary in the constrained embedded systems
domain.
Homogeneous
Organization

SW Behavior Coverage SW Behavior Coverage


Strengths
Strengths

SW Productivity SW Productivity
Energy Consumption Design Time
Chip Area
GPP GPP GPP GPP

GPP
Weaknesses

Weaknesses

GPP GPP GPP


GPP

Design Time Energy Consumption


Chip Area
Heterogeneous Homogeneous
Architecture Architecture
Energy Consumption
Strengths

G ASIP2
P
P ASIP1
D
S Graphic
P Accelerator
Weaknesses

Chip Area
SW Productivity
Design Time
SW Behavior Coverage

Heterogeneous
Organization
General Purpose MPSoC Embedded MPSoC

Figure 1. Different Architectures and Organizations

Alternatively, dynamic reconfigurable architectures have already shown to be very


attractive for embedded platforms based on single threaded applications, since they
can adapt the fine grain parallelism exploitation (i.e. at instruction level) to the
application requirements at run-time [2][3]. However, the applications also exhibit
limits of ILP. Thus, gains in performance when such exploitation is employed tend to
stagnate even if a huge amount of resources is available in the reconfigurable
architecture, so a single accelerator does not provide an advantageous tradeoff
between area and performance if one needs to reach high performance levels in single
thread execution.
Therefore, an ideal embedded system platform should be composed of generic
processing elements that could adapt to the particularities of a given application,
which emulates the behavior of the dedicated accelerators that are successfully
employed in current platforms. At the same time, these processing elements should
execute the same ISA, which increases software productivity and sustains the binary
compatibility. Such approach would be able to efficiently attack the whole spectrum of
application behaviors: those that contain massive thread level parallelism and those
single threaded applications. This way, one could reach a satisfactory tradeoff in terms
of energy, performance and area, without extra software and hardware costs.
In this work we show how to achieve the aforementioned objectives by proposing a
platform based on Custom Reconfigurable Arrays for Multiprocessor System
(CReAMS). With CReAMS it is possible to emulate the desired heterogeneous
behavior with a homogenous platform. In a software environment composed of
parallel, embedded and general purpose applications, CReAMS achieves better
performance than a homogeneous architecture, occupies the same chip area, and
consumes less energy. Different from heterogeneous architectures, CReAMS provides
the software productivity of a homogeneous multiprocessor device, in which a single
tool chain can be used for the whole platform. Therefore, one can maintain full binary
compatibility throughout the platform’s lifetime.
This paper is organized as follows. Section 2 shows a review of researches
regarding reconfigurable architectures and multiprocessing environments. Section 3
presents the CReAMS architecture. Results of the proposed architecture considering
different instruction- and thread- level parallelism are shown in Section 4. Section 5
concludes this work.
2 Related Work
2.1 Reconfigurable Systems
Careful classification study in respect to coupling, granularity and instructions type
is presented in [4]. For instance, processors like Chimaera [5] have a tightly coupled
reconfigurable array in the processor core, which works as an additional functional
unit, limited to combinational logic only. This simplifies the control logic and
diminishes the communication overhead between the reconfigurable array and the rest
of the system. The GARP machine [6] is a MIPS compatible processor with a loosely
coupled reconfigurable array. The communication is done using dedicated move
instructions. Moreover, other reconfigurable architectures, very similar to the dataflow
approaches, were proposed: TRIPS is based on a hybrid von-Neumann/dataflow
architecture that combines an instance of coarse-grained, polymorphous grid processor
core with an adaptive on-chip memory system [7]. Wavescalar [8] also uses a very
similar strategy. In agreement with the previous examples, one can also refer to
Piperench [9] and Molen [10].
Stitt et al. [3] presented the Warp Processing, which is based on a system that does
dynamic partitioning using reconfigurable logic. Good results are shown when
applying such technique in a set of popular embedded system benchmarks. Warp is
composed of a microprocessor to execute the application software, another
microprocessor where a simplified CAD algorithm runs, local memory and a dedicated
simplified FPGA. In [2], a coarse-grain array tightly coupled to an ARM processor,
named Configurable Compute Array (CCA), is proposed. Feeding the CCA involves
two steps: the discovery of which sub graphs are suitable for running on the CCA, and
their replacement by micro-ops in the instruction stream. Two alternative approaches
are presented: static, where the sub-graphs for the CCA are found at compile time, and
dynamic. Dynamic discovery assumes the use of a trace cache to perform sub-graph
discovery on the retiring instruction stream at run-time.
Most of the reconfigurable systems rely on compiler driven resource allocation, and
so backward binary compatibility is lost. In order to cope with that, Warp Processing
and CCA apply dynamic techniques. They present some drawbacks, though. First,
significant memory resources are required for the kernels transformation. In the case
of the Warp Processing, the use of an FPGA presents long latency and high area
consumption, being power inefficient. In the case of the CCA, some operations, such
as memory accesses and shifts, are not supported at all. Therefore, usually just the
very critical parts of the software are optimized, which limits their field of application.
Most important, however, is that although these works show the potential of
dynamically transforming parts of the software to reconfigurable logic execution, they
are still limited to the optimization of one thread only.
2.2 Multiprocessor Systems
Performance scalability in parallel applications and easy hardware implementation
process make multiprocessor systems very attractive. As already mentioned, one can
find different multiprocessing environments: homogeneous in both architecture and
organization; with a homogeneous architecture and a heterogeneous organization; and
heterogeneous in both architecture and organization. Regarding the first type, Watkins
[11] presents a procedure for mapping functions in the ReMAPP system, composed of
a pair of coarse grained reconfigurable fabric that is shared among several cores.
Most platforms for embedded devices, as OMAP [12], are heterogeneous both in
architecture and organization. This strategy is used to satisfy the increasing demand in
performance caused by the heterogeneous behavior found in software for embedded
devices (e.g. games, high definition video players, VOIP, text editors). In the academic
domain, the authors in [13] proposes KAHRISMA, an architecture that supports
multiple instruction sets (RISC, 2- and 6-issue VLIW, and EPIC) and fine- and coarse-
grained reconfigurable arrays. Software compilation, ISA partitioning, custom
instructions selection and thread scheduling are made by a design time tool that
decides, for each part of the application code, which assembly code will be generated.
Such decision depends on the dominant type of parallelism available in the selected
part of the application code and the resources availability in the architecture. A run
time system is responsible for code binding and for avoiding execution collisions in
the resources.
Considering a system with homogeneous architecture and heterogeneous
organization, one can find the Thread Warping [14], which extends the
aforementioned Warp Processing system to support multiple thread execution. In this
case, one processor is totally dedicated to run the operating system tasks needed to
synchronize threads and to schedule their kernels in the accelerators.
ReMAPP and KAHRISMA are able to optimize software that contain many
threads, however they also rely on compiler support, static profiling and a tool to
associate the code or custom instructions to the different hardware components at
design time, not maintaining binary compatibility and with a mandatory time overhead
whenever the platform changes. Thread Warping presents the same deficiency of the
original work: only critical code regions are optimized, due to the high overhead in
time and memory imposed by the dynamic detection hardware. Thus, it only optimizes
applications with few and very defined kernels, narrowing its field of application. The
optimization of few kernels will very likely not satisfy the performance requirements
of future embedded systems, where it is foreseen a high concentration of different
behaviors of software [1].
Therefore, while multiprocessor systems based on homogeneous architectures
provide full binary compatibility and the opportunity of using the same ISA across the
whole platform, the use of heterogeneous architectures brings best tradeoff in terms of
area, performance and power, but with the cost of having different ISAs to deal with
during design time.
2.3 The Proposed Approach
CReAMS was built as a homogeneous architecture and organization. However, it
virtually behaves as a homogenous architecture with a heterogeneous organization,
thanks to its dynamic adaptive hardware. This is possible due to the coupling of a
lightweight dynamic reconfigurable architecture to each processor that it is not
restricted to optimize only critical code regions since it provides small reconfiguration
time. Thus, CReAMS is capable of achieving the performance and occupying the same
chip area of a homogeneous architecture, providing less energy consumption, with
software productivity and full binary compatibility by using a single tool chain for any
new version launched.
3 CReAMS Architecture
A general overview of the CReAMS platform is given in Figure 2(a). The thread
level parallelism is explored by replicating the number of Dynamic Adaptive
Processors (DAPs) (in the example of the Figure 2(a), by four DAPs). CReAMS also
includes an on-chip unified L2 shared cache to support communication among the
DAPs. As single thread optimization is mandatory when handling a heterogeneous
software environment, DAP uses a reconfiguration mechanism that adapts to the
behavior of the software, which allows single thread acceleration with low-energy
consumption.

CReAMS Detecting Mode


Probing Mode
for ( i=0 ; i<100 ; i++)
DAP DAP DAP DAP a[i] = b[i] + 1; Accelerating Mode
Reconfiguring Mode

i=0 i=1 i=2 i=99


SHARED L2

Time

(a) (b)
Figure 2. (a) Example of a CReAMS Platform (b) Acceleration Process

3.1 Dynamic Adaptive Processor (DAP)


We divided DAP in four blocks to better explain it, as illustrated in Figure 3. These
blocks are discussed in the following sections.
a) Processor Pipeline (Block 2)
A SparcV8-based architecture is used as the baseline processor to work together
with the reconfigurable system. Its five stage pipeline reflects a traditional RISC
execution flow (instruction fetch, decode, execution, data fetch and write back). The
SparcV8 is very similar to the processors used in well-known embedded platforms,
since they are also based on RISC architectures (e.g. MIPS, ARM).
b) Reconfigurable Data Path Structure (Block 1)
Following the classifications shown in [4], the reconfigurable data path is coarse
grained and tightly coupled to the SparcV8 pipeline, which avoids external accesses to
the memory, saves power and reduces the reconfiguration time. Because it is coarse-
grained, the size of the memory necessary to keep each configuration decreases when
compared to fine-grained data paths (e.g. FPGAs), since the basic processing elements
are functional units that work at the word level (arithmetic and logic, memory access
and multiplier).
As illustrated in the Figure 3, the data path is organized as a matrix of rows and
columns. The number of rows dictates the maximum instruction level parallelism that
can be exploited, since instructions located at the same column are executed in
parallel. For example, the illustrated data path (Block 1 of Figure 3) is able to execute
up to four arithmetic and logic operations, two memory accesses (two memory ports
are available in the L1 data cache) and one multiplication without true (read after
write) dependences. The number of columns determines the maximum number of data
dependent instructions that can be stored in one configuration. Three columns of
arithmetic and logic units (ALU) compose a level. A level does not affect the SparcV8
critical path (which, in this case, is given by the multiplier circuit). Therefore, up to
three ALU instructions can be executed in the reconfigurable data path within one
SparcV8 cycle, without affecting its original frequency (600 MHz). Memory accesses
and multiplications take one equivalent SparcV8 cycle to perform their operations.

Figure 3. DAP blocks

The entire structure of the reconfigurable data path is totally combinational: there is
no temporal barrier between the functional units. The only exception is for the entry
and exit points. The entry point is used to keep the input context, which is connected
to the processor register file. The fetch of the input operands from the SparcV8
register file is the first step to configure the data path before actual execution. After
that, results are stored in the output context registers through the exit point of the data
path. The values stored in the output context are sent to the SparcV8 register file on
demand. It means that if any value is produced at any data path level (a cycle of
SparcV8 processor) and if it will not be changed in the subsequent levels, this value is
written back in the cycle after that it was produced. In the current implementation, the
SparcV8 register file has two write/read ports. The interconnections among the
functional units, input and output contexts are made through multiplexers.
We have coupled sleep transistors [15] to switch power on/off of each functional
unit in the reconfigurable data path. The dynamic reconfiguration process is
responsible for the sleep transistors management. Their states are stored in the
reconfiguration memory, together with the reconfiguration data. Thus, for a given
configuration, idle functional units are set to the off state, which avoids leakage or
dynamic power dissipation, since the incoming bits do not produce switching activity
in the disconnected circuit. Although the sleep transistors are bigger and in series to
the regular transistors used in the implemented circuit, they have been designed so that
their delays do not significantly impact the critical path or the reconfiguration time.
c) Dynamic Detection Hardware (Block 4)
The hardware responsible for code detection, named Dynamic Detection Hardware
(DDH), was implemented as a 4-stage pipelined circuit to avoid the increase of the
critical path of the original SparcV8 processor. These four stages are the following:
Instruction Decode (ID), Dependence Verification (DV), Resource Allocation (RA),
Update Tables (UT). They are responsible for decoding the incoming instruction and
allocating it to right functional unit in the data path, according to the dependences with
the already allocated instructions. At the end of the process, the configuration bits are
updated in the reconfiguration memory.
Figure 2 (b) shows a simple example of how a DAP could dynamically accelerate a
code segment of a single thread. The DAP works in four modes: probing, detecting,
reconfiguring and accelerating. The flow works as follows:
At the beginning of the time bar shown in Figure 2 (b), the DAP searches for an
already translated code segment to accelerate by comparing the address cache entries
to the content of the program counter register. However, when the first loop iteration
appears (i=0), the DDH detects that there is a new code segment to translate, and it
changes to detecting mode.
In the detecting mode, concomitantly with the instruction execution in the SparcV8
pipeline, these instructions are also translated into a configuration by the DDH
pipeline (the process stops when a branch instruction is found). When the second loop
iteration is found (i=1), the DDH is still finishing the detection process that started
when i=0. It takes few cycles to store the configuration bits into the reconfiguration
memory, and to update the address cache with the memory address of the first detected
instruction.
Then, when the first instruction of the third loop iteration comes to the fetch stage
of the SparcV8 pipeline (i=2), the probing mode detects a valid configuration in the
reconfiguration memory: the program counter content was found in the address cache
entry.
After that, the DAP enters in the reconfiguring mode, where it feeds the
reconfigurable data path with the operands. For example, if 8 operands are needed, 4
cycles are necessary, since 2 read ports are available in the register file. In parallel
with the operands fetch, the reconfiguration bits are also loaded from the
reconfiguration memory. The reconfiguration memory is accessed on demand: at each
clock cycle, only the necessary bits to configure a data path level are fetched, instead
of fetching all the reconfiguration bits at once. This approach decreases the
reconfiguration memory port width, which is one of the main sources of power
consumption in memories [16].
Finally, the accelerating mode is activated and the next loop iterations (until the
99th) are efficiently executed by taking advantage of the reconfigurable logic.
d) Storage Components (Block 3)
Two storage components are part of the DAP acceleration process: address cache
and reconfiguration memory. The address cache holds the memory address of the first
instruction of every configuration built by the dynamic detection hardware. It is used
to verify the existence of a configuration in the reconfiguration memory: an address
cache hit indicates that a configuration was found. The address cache is implemented
as a 4-way set associative table containing 64 entries. The reconfiguration memory
stores the routing bits and the necessary information to fire a configuration, such as the
input and output contexts and immediate values.
Besides the two storage components explained before, the current DAP
implementation has a private 32 KB 4-way set associative L1 data cache and a private
8 KB 4-way set associative L1 instruction cache. According to our experiments, the
same hit rate is achieved by the SparcV8 in the DAP compared to the standalone
SparcV8 using a quarter size of its 32KB L1 instruction cache. It happens because, as
translated instructions are stored in the reconfiguration memory, the SparcV8 within
the DAP has fewer memory accesses in the L1 instruction cache than the standalone
SparcV8 processor. Thus, the impact of the additional area of the address cache and
the reconfiguration memory in the DAP design is amortized.
4 CReAMS Evaluation
In all experiments we have compared the CReAMS to a multiprocessing platform
built through the replication of standalone SparcV8 processors, named MPSparcV8.
The configurations of the DAP and the standalone SparcV8 processor used in this
evaluation are shown in Table 1 (b). Both CReAMS and MPSparcV8 have an on-chip
unified 512 KB 8-way set associative L2 shared cache.
Table 1. (a) Area (um2) of DAP and SparcV8; (b) The configuration of both basic processors

Processors Area (um2) DAP SparcV8


SparcV8 processor 247,615 247,615
Reconfigurable Data Path 3,298,602 - L1 I-Cache L1 D-Cache Pipeline Frequency
32KB 4-Way L1 I-Cache - 483,303 DAP 8KB 4-Way 32KB 4-Way 5-Stages 600MHz
8KB 4-Way L1 I-Cache 130,980 - Standalone SparcV8 32KB 4-Way 32KB 4-Way 5-Stages 600Mhz
32KB 4-Way L1 D-Cache 483,303 483,303
48 KB Reconfiguration Memory 688,255 - (b)
128 Entries 4-Way Address Cache 19,473 -
Total 4,868,228 1,214,221

(a)
4.1 Benchmarks
In order to measure the performance and energy efficiency of CReAMS considering
a heterogeneous application set, benchmarks from different suites were selected to
cover a wide range of behaviors in terms of type (i.e. TLP and ILP) and levels of
existing parallelism. The scope is to mimic future complex embedded applications that
will run in portable devices. From the parallel suites [17] [18], we have selected md,
jacobi and lu that are applications where TLP is dominant. Also, three SPEC
OMPM2001 [19] applications (apsi, equake and ammp) were chosen. Finally, we have
selected four applications (susan edges, susan smoothing, susan corners and patricia)
from the MiBench suite [20], which reflects a traditional embedded scenario.
The benchmarks were parallelized using OpenMP and POSIX. These libraries
provide methods that discover, at run time, the number of processors of the underlying
multiprocessor architecture, so they can take full advantage of the available resources
even when the platform changes (e.g. processors are added), with no need for
recompilation.
4.2 Methodology
We have used the scheme presented in [21] for the performance evaluation. The
base platform is Simics [22], an instruction level simulator. While the application runs
on the Simics, instructions of each software thread are sent to the cycle-accurate
timing simulators. There is one timing simulator for each DAP (in the case of the
CReAMS simulation) or for each standalone SparcV8 processor (when simulating the
MPSparcV8). Special attention is given to synchronization mechanisms, such as locks
and barriers, so that the time spent with blocking synchronization and memory
transfers are precisely calculated.
We have described the CReAMS architecture in VHDL, the power management
technique using Sleep Transistors was also described. The MPSparcV8 VHDL
description was obtained from [23]. The Synopsys Design and Power Compiler tools,
using a CMOS 90nm technology, were employed to synthesize the VHDL descriptions
to standard cell and gather data about power, critical path and area. Data on memory
was obtained with CACTI 6.5 [24].
In addition, we have considered CReAMS setups with different numbers of DAPs
(4, 8, 16 and 64 DAPs). The same was done with the MPSparcV8 system (4, 8, 16 and
64 SparcV8s). Each DAP has a reconfigurable data path (Block 1 of the Figure 3)
composed of 6 arithmetic and logic, 4 load/store and 2 multipliers units per column.
The entire reconfigurable data path has 24 columns. A 46 KB reconfiguration
memory, implemented as a DRAM memory, is able to store up to 64 configurations.
They are indexed by a 64-entries 4-way set associative Address Cache. As already
discussed in Section 3.1, the DAP has one quarter less L1 I-Cache accesses than the
SparcV8 processor. Thus, to avoid that the memory hierarchy bias the results in our
favor, we reduced to 8KB the size of the DAP L1 I-Cache to achieve the same cache
hit ratio as the 32KB L1 I-Cache of the standalone SparcV8.
Instead of showing absolute data about energy consumption, we use the energy-
delay product (EDP) metric [25] to present the real contribution of the proposed
approach to embedded systems domain.
4.3 Performance and Energy-Delay Product Evaluation
Figure 4 shows the speedup (y-axis), varying the number of processors (x-axis),
provided by the MPSparcV8 and CReAMS over a standalone SparcV8 processor.
Analyzing only the MPSparcV8 speedups (striped bars), the applications that contain
massive TLP (md, jacobi and susan_s) scale almost linearly as the number of
processors increases. For these applications, the mean performance of the 64-
MPSparcV8 grows 46 times. On the other hand, small performance gains are obtained
in those applications that contain small TLP (equake, apsi, ammp, susan_e, susan_c,
lu and patricia). In this case, with the 64-MPSparcV8 platform, performance
improvements of only 2.3 are obtained, which reinforces the need for an adaptable ILP
exploitation mechanism to achieve balanced performance in future embedded systems,
since they will mix many applications from different domains and amounts of ILP and
TLP.
Speedup MPSparcV8 CReAMS
40 49 93 74 51 70
38
36
34
32
30
28
26
24
22
20
18
16
14
12
10
8
6
4
2
0
64

64

64

64

64

64

64

64

64

64
16

16

16

16

16

16

16

16

16

16
4

4
8

8
equake apsi ammp susan_e patricia susan_c susan_s md jacobi lu
Number of Processors

Figure 4. Speedup of CReAMS and MPSparv8 over a standalone SparcV8

CReAMS (solid bars) presents better performance than the MPSparcV8 with the
same number of processors, even for those applications that present massive TLP.
However, a DAP is almost four times bigger than the SparcV8 processor, as it can be
seen in the Table 1(a). Therefore, let us compare both architectures with equivalent
area occupation: the CReAMS composed of 4 DAPs against the MPSparcV8
composed of 16 SparcV8; and the CReAMS composed of 16 DAPs against the
MPSparcV8 composed of 64 SparcV8.
As it can be observed, the 4-DAPs CReAMS outperforms the 16-SparcV8
MPSparcV8 execution time in six benchmarks (equake, apsi, ammp, susan edges,
patricia and lu), even when applications that present high TLP is considered, such as
lu and susan edges. ammp is the one that benefits most from CReAMS: it presents an
execution time 41% smaller than the MPSparcV8.
Table 2 shows the energy-delay product of both platforms. In this Table, the
designs with equivalent area are highlighted with the same color. Let us consider the
4-DAPs CReAMS setup running ammp: it saves 32% of energy consumption and
improves in 41% the performance compared to the 16-SparcV8 MPSparcV8.
Therefore, CReAMS provides a reduction in the energy-delay product in a factor of
almost four times in this case. The average reduction in energy-delay by using
CReAMS, and considering the whole set of benchmarks, is of 33%; while the energy
is reduced in 82% and performance is improved in 4% on average when the same chip
area is considered.
The main sources of CReAMS energy savings are:
• Although more power is spent because of the DDH hardware and reconfigurable
data path, fewer memory accesses for instructions are performed: once they were
translated to a data path configuration, they will reside in the reconfiguration memory.
• The reconfiguration strategy: let us consider the loop example of the Figure 2 (b),
the data path is only reconfigured once to execute 98 loop iterations, thus the
acceleration process avoids several accesses to the L1 instruction cache and
reconfiguration memory;
• Shorter execution time: as CReAMS accelerates the code, it provides less energy
consumption;
• The use of sleep transistors avoids that idle functional units spend unnecessary
power.
Table 2. Energy-Delay product, in J*1e-3s, of the MPSparcV8 and CReAMS (lower is better)

MPSparcV8 CReAMS
4 SparcV8 8SparcV8 16SparcV8 64SparcV8 4 DAPs 8 DAPs 16 DAPs 64 DAPs
equake 293 305 287 343 169 175 162 149
apsi 10768 10386 10169 10282 6361 6143 6023 5355
ammp 19698 22155 20764 12814 5452 6282 5761 2984
susan_e 7.97 7.15 6.55 6.12 3.46 2.85 2.43 2.12
patricia 2.91 1.96 2.54 2.65 1.47 0.77 1.02 1.06
susan_c 1.0296 0.8134 0.6274 0.4624 0.5160 0.3761 0.2600 0.1600
susan_s 177.2 102.5 59.8 18.7 41.7 24.9 14.7 4.7
md 0.000282 0.000171 0.000098 0.000039 0.000066 0.000042 0.000026 0.000013
jacobi 35.1 20.7 11.3 3.6 17.4 10.3 5.7 1.9
lu 0.000124 0.000091 0.000127 0.000215 0.000041 0.000028 0.000043 0.000076

The gains in energy and performance provided by the second setup considering the
same area (16-DAPs CReAMS against 64-SparcV8 MPSparcV8) is greater than the
first comparison shown above. As the number of cores increases, the level of TLP
tends to stagnate. Now, CReAMS outperforms MPSparcV8 in seven benchmarks
(equake, apsi, ammp, susan_e, susan_c, patricia and lu). In this case, the execution of
lu in the 16-DAPs CReAMS is 57% faster than in the 64-SparcV8 MPSparcV8 and
consumes 50% less energy, which reflects in a energy delay product reduction in a
factor of five times.
Only three benchmarks (susan smoothing, md and jacobi) present better
performance in the MPSparcV8, since they have perfect load balancing among their
threads. However, even showing worse performance than MPSparcV8, CReaMS
provides reductions in the energy-delay product of 26% and 33% for susan smoothing
and md respectively. Due to its massive TLP and perfect load balancing, jacobi is the
only application where CReAMS does not provide gains neither in performance nor in
energy when the same chip area is considered. Nevertheless, using the energy-delay
product metric, we demonstrated the CReAMS efficiency to dynamically adapt to the
applications with different levels of parallelism.
5 Conclusions
In this work, we have proposed CReAMS, aiming to offer the advantages of a
heterogeneous multiprocessor architecture and to improve software productivity, since
neither tool chains nor code recompilations are necessary for its use. Considering
designs with equivalent chip areas, CReAMS offers considerable improvements in
performance and in the energy-delay product for a wide range of different
applications, compared to a traditional homogeneous multiprocessor environment.

References
1. International Technology Roadmap for
Semiconductors:http://www.itrs.net/links/2009ITRS/2009Chapters_2009Tables/2 009_SysDrivers.pdf
2. Clark, N., Kudlur, M., Park, H. Mahlke, S., Flautner, K.,"Application-Specific Processing on a
General-Purpose Core via Transparent Instruction Set Customization". In MICRO-37, pp. 30-40, Dec.
2004
3. Lysecky, R., Stitt, G., Vahid, F., "Warp Processors". In ACM Transactions on Design Automation of
Electronic Systems, pp. 659-681, July 2006
4. Compton K., Hauck, S., “Reconfigurable computing: A survey of systems and software”. In ACM
Computing Surveys, vol. 34, no. 2, pp. 171-210, June 2002.
5. Hauck, S., Fry, T., Hosler, M., Kao, J., “The Chimaera reconfigurable functional unit”. In Proc. IEEE
Symp. FPGAs for Custom Computing Machines, Napa Valley, CA, pp. 87–96, 1997.
6. Hauser, J. R., Wawrzynek, J., “Garp: a MIPS processor with a reconfigurable coprocessor”. In Proc.
1997 IEEE Symp. Field Programmable Custom Computing Machines, pp. 12–21, 1997.
7. Sankaralingam, k., Nagarajan, R., Liu, H., Kim, C. , Huh, J. , Burger, D. , Keckler, S. W., Moore C.
R., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture”. In Proc. of the 30th
Int. Symp. on Computer Architecture, pp. 422-433, June 2003.
8. Swanson, S., Michelson, K., Schwerin, A., Oskin. M. “WaveScalar”. In MICRO-36, Dec. 2003
9. Goldstein, S. C., Schmit, H., Budiu, M., Cadambi, S., Moe, M., Taylor, R.R., "PipeRench: A
Reconfigurable Architecture and Compiler". In IEEE Computer, pp. 70-77, April, 2000
10. Vassiliadis, S., Gaydadjiev, G. N., Bertels, K.L.M., Panainte, E. M., “The Molen Programming
Paradigm”. In Proceedings of the Third International Workshop on Systems, Architectures, Modeling,
and Simulation, pp. 1-10, Greece, July 2003
11. Watkins, M.A.; Albonesi, D.H., "Enabling Parallelization via a Reconfigurable Chip Multiprocessor",
Workshop on Parallel Execution of Sequential Programs on Multi-core Architectures, 2010. 37th
International Symposium on Computer Architecture, June 2010.
12. Cumming, P. The TI OMAP Platform Approach to SoC. Kluwer Academic Publ, Boston, MA
13. Koenig, R.; Bauer, L.; Stripf, T.; Shafique, M.; Ahmed, W.; Becker, J.; Henkel, J.; , "KAHRISMA: A
Novel Hypermorphic Reconfigurable-Instruction-SetMulti-grained-Array Architecture," Design,
Automation & Test in Europe Conference, pp.819-824, 2010
14. Lee, J., Wu, H., Ravichandran, M., and Clark, N. 2010. Thread tailor: dynamically weaving threads
together for efficient, adaptive parallel applications. ISCA '10. pp. 270-279.
15. Shi, K.; Howard, D. “Challenges in Sleep Transistor Design and Implementation in Low-Power
Designs”. In Proceedings of Design Automation Conference, 43. 2006, pp. 113 – 116.
16. Berticelli Lo, T.; Beck, A.C.S.; Rutzig, M.B.; Carro, L.; , "A low-energy approach for context memory
in reconfigurable systems," Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW),
2010, pp.1-8, 19-23 April 2010
17. Dorta, A. J., Rodriguez, C., Sande, F. d., and Gonzalez-Escribano, A. 2005. The OpenMP Source Code
Repository. In Proceedings of the 13th Euromicro Conference on Parallel, Distributed and Network-
Based Processing (February 09 - 12, 2005). PDP. IEEE ComputerSociety, Washington, DC, 244-250.
18. Woo, S. C., Ohara, M., Torrie, E., Singh, J. P., and Gupta, A. 1995.The SPLASH-2 programs:
characterization and methodological considerations. In Proceedings of the 22nd Annual international
Symposium on Computer Architecture (S. Margherita Ligure, Italy, June 22 - 24, 1995). ISCA '95.
ACM, New York, NY, 24-36.
19. Kaivalya M. Dixit. 1993. The SPEC benchmarks. In Computer benchmarks, Jack J. Dongarra and
Wolfgang Gentzsch (Eds.). Elsevier Advances In Parallel Computing Series, Vol. 8. pp. 149-163.
20. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge T.,Brown, R.B., “MiBench: A Free,
Commercially Representative Embedded Benchmark Suite. 4th Workshop on Workload
Characterization”, Austin, TX, Dec. 2001
21. Monchiero, M., Ahn, J., Falcón, A., Ortega, D., and Faraboschi, P. 2009. How to simulate 1000 cores.
SIGARCH Comput. Archit. News37, 2 (Jul. 2009), 10-19.
22. Magnusson, P. S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., Larsson, F.,
Moestedt, A., and Werner, B. 2002. Simics: A Full System Simulation Platform. Computer 35, 2 (Feb.
2002), 50-58.
23. Aeroflex Gaisler: http://www.gaisler.com
24. Wilton, S.J.E.; Jouppi, N.P.; , "CACTI: an enhanced cache access and cycle time model," Solid-State
Circuits, IEEE Journal of , vol.31, no.5, pp.677-688, May 1996
25. Gonzalez, R.; Horowitz, M.; , "Energy dissipation in general purpose microprocessors," Solid-State
Circuits, IEEE Journal of , vol.31, no.9, pp.1277-1284, Sep 1996
26. Geoffrey Blake, Ronald G. Dreslinski, Trevor Mudge, and Kriszti\&\#225;n Flautner. 2010. Evolution
of thread-level parallelism in desktop applications. SIGARCH Comput. Archit. News 38, 3 (June
2010), 302-313.