978-1-4799-7581-5/15/$31.00 2015 IEEE 141 16th Int'l Symposium on Quality Electronic Design
from Simulink model and map them onto a virtual platform (VP) implemented in FPGA. In [4], the authors generate, in addition to SW, the VP via a combined algorithm and architecture model (CMMA). Unlike [6, 7, 8, 4], which only target multiprocessor architectures and SW generation, we target a general heterogeneous architecture, including CPUs and hardware accelerators (e.g. FPGA). SimSH also explicitly addresses the communication across HW and SW.

The work in [15, 14] converts Simulink models to a System Level Design Language (SLDL) for System Level Design. It introduces an interesting profiling approach and focuses on design space exploration (DSE). However, it stays at the abstract simulation level, unlike ours, which aims for heterogeneous target execution.

Simulink R2014a [12] also supports concurrent execution code generation. However, it does not specifically address communication optimization. Furthermore, Simulink only targets specific heterogeneous architectures (such as Zynq with a single CPU and up to 2 FPGAs), while our work targets a general heterogeneous architecture. Different from the industry approach, SimSH reveals both design methods and usage. It allows users in the academic community to easily expand the tool to support other platforms.

3. HW/SW CODESIGN FRAMEWORK
The input of SimSH is a Simulink specification model. Simulink [12] is a Model-Based Design (MBD) tool for system modeling and verification. A Simulink model is described as a set of functional blocks and subsystems, i.e. a grouping of blocks linked by signals.

Fig. 2 illustrates our SimSH in more detail. It takes a Simulink model as input and guides the user in allocating and mapping blocks based on profiling. Synthesis occurs in 3 phases: Front-end Synthesis, Communication Refinement, and Back-end Synthesis, yielding the SW/HW implementation.

SimSH includes a profiler to investigate the application's computation and communication workload. It employs Algo2Spec [14] to generate an SLDL specification model (in SpecC), and then profiles the specification using scprof [3]. The profiler reports computation and traffic demands in terms of number of operations, individually for each operation and data type. The profiling exposes computational and communication hot spots of the application.

Guided by the profiling results, the user manually allocates processing elements (PE) and maps the model onto the PEs accordingly, yielding a mapped specification model. Allocation is recorded by annotating the Simulink model. In particular, a block of a certain type of computation is likely mapped onto the corresponding PE that is designed to optimize this type of computation. Details of profiling Simulink applications and the synthesis of system specifications can be found in [3]. They are out of scope of this article, which instead focuses on the synthesis framework.

The input Simulink application model is shown in Fig. 2. For the discussion of this example, assume that blocks B and C are computationally heavy as revealed by the profiler. The user then maps them onto hardware while the other blocks stay in software.

In the Front-end Synthesis, the mapped specification model is split into hardware models and software models and then synthesized into a software implementation in C/C++ and a hardware implementation in a Hardware Description Language (HDL). In this step, the functionality of all blocks in the model is synthesized for the different PEs while the communication across the PEs is missing. To address that, we insert the Proxy in the model, which encapsulates the cross-PE communication and will be further refined.

In the Communication Refinement, the Proxy is refined and realized following the OSI standard [5]. In our case, the Proxy comprises 4 layers: the application layer for the consistent interface, the transport layer for synchronization, the network layer for addressing and marshaling, and the physical layer for interfacing with the physical bus. Then the refined communication is integrated into the software and hardware implementation, yielding a complete implementation in C/C++ and HDL on all PEs.

In the Back-end Synthesis, SimSH integrates the cross-compilation environment for software compilation and Xilinx ISE [13] for high-level synthesis. It finally generates software binaries for processors and bitstreams for FPGAs.

The work in this paper makes the following assumptions and restrictions: (a) the user selects allocation and mapping manually; (b) it is bounded by Simulink Embedded Coder and HDL Coder restrictions and only supports discrete event models using a fixed step solver.

3.1 Front-end Synthesis
Given a group of blocks mapped on each PE in the mapped model, the functional blocks and intra-PE communication are mapped on a PE, while the cross-PE communication is implicitly mapped on the shared bus. The front-end synthesis explores the inter-PE communication optimization and inserts the Proxy for further refinement.

Figure 3: Communication Optimization over Underutilized Bus

3.1.1 Communication Optimization
SimSH detects an under-utilized bus by comparing the data width of cross-PE signals with the bus width. Fig. 3 shows an opportunity for communication optimization. In the original mapped specification model (upper graph), a single transaction (P-bit width from A to BC) is narrower than the bus width (NPQ-bit width), which under-utilizes the bus. The bus utilization can be improved by concatenating multiple user transactions accordingly. To do this, SimSH inserts a pack block and an unpack block into the mapped model, one at each side of a cross-PE communication. This bundles multiple user transfers to utilize the full bus width (lower graph).
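
The bundling performed by pack and unpack can be sketched in C as follows. This is only an illustrative software model, not the Simulink blocks SimSH generates: the function names (pack_word, unpack_word, bus_transfers), the 32-bit upper bound on the bus word, and the choice to place the first sample in the least significant bits are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build a mask with the lowest p bits set (1 <= p <= 32). */
static uint32_t low_mask(unsigned p)
{
    return (p == 32) ? 0xFFFFFFFFu : ((1u << p) - 1u);
}

/* pack: concatenate n samples of p bits each into one bus word.
 * Assumption: sample 0 lands in the least significant bits and
 * the bus word is at most 32 bits wide (n * p <= 32). */
uint32_t pack_word(const uint32_t *in, size_t n, unsigned p)
{
    assert(p >= 1 && p <= 32 && n >= 1 && n * p <= 32);
    uint32_t word = 0;
    for (size_t i = 0; i < n; i++)
        word |= (in[i] & low_mask(p)) << (i * p);
    return word;
}

/* unpack: slice one bus word back into n samples of p bits each. */
void unpack_word(uint32_t word, uint32_t *out, size_t n, unsigned p)
{
    assert(p >= 1 && p <= 32 && n >= 1 && n * p <= 32);
    for (size_t i = 0; i < n; i++)
        out[i] = (word >> (i * p)) & low_mask(p);
}

/* Number of bus transfers needed for `samples` values when n values
 * share one transfer (rounded up for a partially filled last word). */
size_t bus_transfers(size_t samples, size_t n)
{
    assert(n >= 1);
    return (samples + n - 1) / n;
}
```

With these definitions, packing two 8-bit samples into a 16-bit word halves the number of bus transfers, and packing sixteen 1-bit samples reduces it by a factor of sixteen, while the producer and consumer of the samples stay unchanged.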
Figure 4: pack and unpack for Concatenating N Transfers

Fig. 4 visualizes pack and unpack as parametrizable blocks. On the top, pack buffers NQ P-bit inputs and concatenates them into a single NPQ-bit output. This reduces the data rate by a factor of NQ between input and output. On the bottom, unpack slices an NPQ-bit input into NQ P-bit outputs, increasing the data rate by NQ. As a result, while the processing blocks (A, BC, D) remain untouched, transfers are bundled and bus utilization is increased.

If the target heterogeneous architecture supports bus burst transfer or DMA, SimSH can concatenate transactions more aggressively at the cost of latency. The added blocks (pack, unpack) minimally increase computation. The benefits of better bus utilization outweigh the minimal computation overhead as the concatenation ratio increases (see Section 4).

Overall, communication optimization updates the mapped model with fewer transfers across the blocks mapped on different PEs.

3.1.2 Model Splitting and Proxy Generation
SimSH then splits the mapped model into a set of target models, one for each PE. Types include a SW model for a processor (e.g. CPU, DSP) or a HW model (e.g. for FPGA). Each target model only contains the blocks mapped to the particular PE.

Fig. 5 shows in the top half the mapping annotated model. Blocks A and D are mapped to a DSP (SW) and B, C to FPGA-1 (HW). To illustrate a more general complex example, blocks (X, Y, Z) and the backward communication c3 are included. As a result of the mapping, the communication across PE boundaries (c1, c2 and c3) needs to be established. For this, Proxy Generation replaces blocks mapped on another PE with a local Proxy. A proxy acts as a placeholder and bundles data input and output of the current PE. Proxy Generation traces the interface types in the Simulink model, and inserts proxies maintaining the interfaces. Fig. 5 shows the results. Blocks that are mapped on FPGA-1 (HW), i.e. BC (with Unpack NQ:1 and Pack 1:NP), are replaced with a Proxy-BC in the SW model. Proxy-BC on the DSP sends c1 and reads c2 and c3 from the HW model. As a result, A and D execute as if blocks B, C were still in SW.

Figure 5: Model Split for Each PE

3.1.3 Homogeneous Synthesis
SimSH invokes the Simulink Embedded Coder [10] and Simulink HDL Coder [11] to generate target SW and HW implementations. Each of these can optimize internally to generate efficient code. To maximize the potential, we do not synthesize each block individually. Instead, blocks mapped to the same PE are grouped into a super-block and then synthesis is invoked on that super-block. Furthermore, as each inserted Proxy retains the boundary interfaces of the blocks it replaces, it can be scheduled identically to the original specification model. All inserted blocks (pack, unpack and Proxy) are composed of synthesizable blocks (for both SW [...] in more detail.

Manually scheduling blocks mapped on the same PE is challenging due to Simulink semantics. To circumvent scheduling ambiguities, SimSH Proxy blocks are Simulink blocks and are synthesized by Simulink together with the computation modules. We observe that Simulink Embedded Coder generates sequential implementations and Simulink HDL Coder generates pipelined implementations. In addition, because Simulink synthesizes the Proxy, the Proxy benefits from all optimizations of the Simulink synthesis.

Overall, front-end synthesis generates computation blocks and the communication within one PE. The inter-PE communication via Proxy needs further refinement as discussed in the next section.

3.2 Communication Refinement
SimSH automatically refines communication into a layered implementation following the OSI standard [5] as shown in Fig. 6: application layer, transport layer, network layer and physical layer. A transaction initiated at the application layer is decomposed into packets at the transport layer, converted into bus transactions in the network layer and finally transferred via the physical layer. The layered design hides the underlying hardware (from the physical layer up), as well as application specifications (from the application layer down). The OSI layered communication implementation allows a wide application of the Proxy principle to a host of heterogeneous architectures. It simplifies expanding the database for new architectures.

Figure 6: Proxy OSI Model

At the Application Layer, the Proxy is a placeholder for blocks mapped to other PEs. It retains identical boundary interfaces of those remote blocks, replicating each port (e.g. in direction, width, data type, and update rate). E.g. SW Proxy-BC in Fig. 5 implements input port c1 and output ports c2 and c3, identical to HW block BC.

At the Transport Layer, as shown in Fig. 6, the SW Proxy-BC instantiates a proxy_recv block for each input and a proxy_send block for each output port. In case of a rate change in the replaced blocks,
a Rate Adapter is inserted. In SW, proxy_recv and proxy_send are connected through the Rate Adapter to allow different read and write transaction rates. proxy_recv, proxy_send and Rate Adapter are constructed from a synthesizable Simulink subset to simplify code generation.

Implementing the hardware proxy requires strict timing, as it interfaces with the network layer from the database. To guarantee the timing, one approach is to generate the proxy out of communication primitives. In fact, communication refinement extracts the system composition and connection from the HW target model into an XML file generally following the IP-XACT standard [2]. It captures all relevant port characteristics, which guides the communication refinement to generate a proxy_recv or proxy_send for each port in the HW Proxy. The FIFOs in proxy_recv or proxy_send decouple execution of the synthesized application from communication code.

For synchronization across HW and SW, our transport layer uses a buffered asynchronous communication. We use this as we observed that Simulink HDL Coder can synthesize HW blocks into a pipelined design to relax the pressure for the high-level synthesis. We utilize this concept to realize synchronization across each HW/SW boundary by adding an additional cycle delay.

Listing 1 shows the pseudo API of read and write transactions in the transport layer. All transactions at the transport layer and above are hitherto addressed by block ID and port ID from the Simulink model and are therefore transparent to all underlying heterogeneous architectures.

[...]
proxy_recv1(outport_c2){
  recv(Proxy-BC, c2);
  outport_c2 = c2;}

Simulink Embedded Coder synthesizes the SW model into a step function triggered by a periodic timer. After execution of A, Proxy-BC issues a write transaction, immediately followed by a read transaction (as governed by the Rate Adapter). It reads the BC result of the previous iteration, and SW continues with D. Hence, HW and SW execution are overlapped. HW starts executing upon availability of the data and produces the output.

Figure 7: HW/SW Synchronization

As shown in Fig. 7, BC produces results which are read by D in the next iteration (assuming the same rate for simplicity). This additional iteration delay makes the implementation of HW completely independent of the speed of SW. Therefore, the maximum delay of HW is relaxed to be as large as the complete loop of software execution. As a result, the HW can run multiple orders of magnitude slower than the bus speed, which has a potential for more energy saving. However, we do not explore this here because we target an FPGA.

The Network Layer of the Proxy provides addressing and data marshalling as shown in Fig. 6. We follow a two-layer addressing, similar to Simulink's identification (block ID and port ID). The network layer maps Simulink's addressing onto the physical address.

MSB [ HW Address Prefix | Block ID | Port ID ] LSB
Figure 8: Proxy Addressing at Network Layer

As depicted in Fig. 8, block ID and port ID follow the HW address prefix from the most significant bit (MSB) to the least significant bit (LSB). The address range for block ID and port ID depends on the number of blocks and ports in the model.

Listing 2: Proxy API at Network Layer
/* Proxy Network Layer */
send(BlockX, PortY){
[...]

[...] port ID to the physical address and then marshal the transaction payload and eventually call the bus API. In the HW model, a 2-level decoder (out of the database) is instantiated to select the HW block and the Proxy FIFO.

The Physical Layer in the SW model contains the bus driver from the database. It provides a set of native APIs for bus transactions called from the network layer. In addition, it also wraps the SW application with some top-level, architecture-specific initialization. In the HW model, the physical layer instantiates the bus Interface (IFC) component and the top level FPGA pin mapping [1] as well as the User Constraint File (UCF). The IFC component can directly read data from the bus and interprets a bus write as signals on the bus lines.

Table 1: Timing and Dependency of Proxy
            OSI Layer            Timing Accuracy   Application Specific   Platform Specific
App         Application          loosely           high                   none
Proxy       Transport, Network   approximate       medium                 medium
Database    Physical             cycle accurate    none                   high

Overall, Table 1 summarizes the layering scheme from timing and dependency aspects. The timing accuracy (precision of synthesis) increases along the top-down layering refinement of communication based on the OSI model. Besides, a higher OSI layer is more application specific and less platform specific. Hence, the most timing accurate element (bus interface), instantiated from the database, is completely platform specific and independent of applications. Conversely, the application requires full synthesis (with the least timing requirements), while proxies are partially parameterized.

As the result of the front-end synthesis and communication refinement, SimSH has generated a complete SW and HW implementation in bare-C and VHDL.

3.3 Back-end Synthesis
Back-end synthesis is responsible for synthesizing the C/C++ and HDL code into the appropriate target binaries/bitstream. Based on the selected target architecture, the back-end synthesis integrates cross-compilation environments (e.g. BF527) and Xilinx ISE [13] to automate the SW compilation and HW high level synthesis.
The process of back-end synthesis is automated. SimSH generates Makefiles to automate the SW cross-compilation. For HW back-end synthesis, SimSH generates Xilinx ISE project files and invokes ISE for HW high level synthesis via command line [13].
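
Looking back at the network-layer addressing of Section 3.2 (HW address prefix, then block ID, then port ID, from MSB to LSB), the mapping from Simulink's logical addressing to a physical address can be sketched in C. This is a minimal illustration and not the SimSH-generated code: the function names, the field widths (12-bit prefix, 4-bit block ID, 4-bit port ID, 12 unused low bits in a 32-bit address) and the prefix value used in the usage note are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout of a 32-bit physical address (illustrative only):
 *   [31:20] HW address prefix
 *   [19:16] block ID
 *   [15:12] port ID
 *   [11:0]  unused */
enum {
    PORT_SHIFT   = 12, PORT_BITS   = 4,
    BLOCK_SHIFT  = 16, BLOCK_BITS  = 4,
    PREFIX_SHIFT = 20, PREFIX_BITS = 12
};

/* Compose a physical address from the logical (block ID, port ID) pair. */
uint32_t proxy_address(uint32_t prefix, uint32_t block_id, uint32_t port_id)
{
    assert(prefix   < (1u << PREFIX_BITS));
    assert(block_id < (1u << BLOCK_BITS));
    assert(port_id  < (1u << PORT_BITS));
    return (prefix << PREFIX_SHIFT) |
           (block_id << BLOCK_SHIFT) |
           (port_id << PORT_SHIFT);
}

/* First decoder level: recover the block ID from an address. */
uint32_t address_block_id(uint32_t addr)
{
    return (addr >> BLOCK_SHIFT) & ((1u << BLOCK_BITS) - 1u);
}

/* Second decoder level: recover the port ID from an address. */
uint32_t address_port_id(uint32_t addr)
{
    return (addr >> PORT_SHIFT) & ((1u << PORT_BITS) - 1u);
}
```

For a hypothetical prefix 0x203, block 1 and port 2, proxy_address yields 0x20312000, and the two decode helpers return 1 and 2 again. The decode helpers mirror in software what the 2-level decoder does on the HW side when it selects the HW block and the Proxy FIFO.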
4. EXPERIMENTAL RESULTS
To demonstrate the benefits of the framework, we use the Sobel Edge Detect [9]. Sobel Edge Detect detects the edges in an image by comparing each pixel with its neighbors. It computes the gradients of the current pixel via a matrix multiplication of a Sobel operator and the matrix of current neighboring pixels. If the gradient of a pixel is larger than a certain threshold, this pixel is detected as a part of an edge.

Figure 9: Sobel Edge Detection Algorithm

Fig. 9 depicts the Simulink model, mainly as a pipeline of Image Load, Serialize, Sobel Edge Detect, Deserialize and Image Print. Image Load simply loads a 320x240 gray image (8 bits/pixel) and Serialize sends each pixel to Sobel Edge Core. Then, Sobel Core outputs the binary decision whether the pixel is part of an edge. Finally, Deserialize assembles the image.

[...] BF527 Digital Signal Processor (DSP) 600MHz [1] and a Xilinx FPGA Spartan3E XC3S500E 100MHz linked by 16-bit External Bus Interface Unit (EBIU) 100MHz on chip.

Application-specific Synthesis Results
As a result of the mapping, Sobel Core's input and output cross PE [...] bus, only 1/2 and 1/16 of the bus width is utilized. SimSH detects [...]

Figure 11: Sobel-Edge-Detect Communication Optimization over Underutilized Bus

SimSH splits the resulting model into a HW model and a SW model. In the SW model, it replaces the Sobel Core with proxy_sobel as a placeholder which consumes pixels and outputs decisions, behaving as if Sobel Core were still in SW. Similarly, in the HW model, SimSH replaces Image Load, Serialize and Deserialize, Image Print with two proxies to receive pixels and send results.

Then, SimSH invokes Simulink Embedded Coder [10] and Simulink HDL Coder [11] to generate target SW (C/C++) and HW (HDL) implementations. Here, the generated HDL for the Sobel Core has 13 pipeline stages. Invalid outputs due to pipeline fill are discarded by the HW Proxy.

Figure 12: Proxy Network Layer Addressing (32-bit address: 12-bit FPGA address prefix, 4-bit block ID, 4-bit port ID, 12-bit free address space)

During the communication refinement, the logical addressing based on block ID and port ID is refined to physical addressing of the EBIU bus. Fig. 12 depicts the address allocation following Simulink two-layer addressing: in a 32-bit address, the address range from [...]

[...] component in VHDL [1]. We implement the IFC as the EBIU bus driver in FPGA. It reacts to the EBIU control line: reading from the data line during a read transaction and writing to the EBIU control line and data line during a write transaction. Furthermore, the physical layer also encapsulates the top level FPGA pin mapping as well as the User Constraint File (UCF).

To illustrate the benefits of the HW/SW co-design and the com-