Rapid Heterogeneous Prototyping from Simulink

Shen Feng, Chris Driscoll, Jerediah Fevold, Hao Jiang, Gunar Schirner
sfeng@ece.neu.edu, {driscoll.ch, fevold.j, jiang.hao}@husky.neu.edu, schirner@ece.neu.edu
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA, 02115

ABSTRACT
Designing embedded high-performance systems is challenging due to complex algorithms, real-time operations and conflicting goals (e.g. power vs. performance). Heterogeneous platforms that combine processors and custom hardware accelerators are a promising approach. However, manually designing HW/SW systems is prohibitively expensive due to the immense manual effort.
This paper introduces SimSH: Simulink Sw/Hw CoDesign Framework, which provides an automatic path from an algorithm captured in Simulink to a heterogeneous implementation. Given an allocation and a mapping decision, the SimSH automatically synthesizes the Simulink model onto the heterogeneous target, constructing the synchronization and communication between processing elements. In the process, the SimSH detects an underutilized bus and optimizes communication by packing / unpacking. Synthesizing a heterogeneous implementation from Simulink allows the developer to focus on the algorithm design with rapid validation and test on a heterogeneous platform. We demonstrate synthesis benefits using a Sobel Edge Detection algorithm and target a heterogeneous architecture of Blackfin processor and Spartan3E FPGA. The synthesized solution is 2.68x faster (and more energy efficient) than pure SW execution.

1. INTRODUCTION
Algorithm designers prototype and fine-tune algorithms using high-level environments, such as Simulink [12]. In Simulink, users benefit from large library resources and an interactive user interface for algorithm design and system modeling. Furthermore, Simulink can synthesize algorithms onto homogeneous platforms (either CPU or FPGA), through Simulink Embedded Coder or HDL Coder [12]. However, a homogeneous architecture may not meet the performance or power constraints of demanding applications (e.g. streaming applications). Designers therefore shift attention to heterogeneous Multiprocessor Systems-On-Chip (MPSoCs), which comprise multiple CPUs, memories and hardware accelerators. MPSoCs can improve performance and power efficiency through specialized hardware components (i.e. accelerators). However, current tools do not offer an easy path from Simulink onto a general heterogeneous architecture. While additional heterogeneous components increase performance, they widen the gap between Simulink prototyping and heterogeneous implementation. The tremendous effort of manually creating a heterogeneous solution by stitching together homogeneous synthesis results (created in isolation) becomes a bottleneck and lengthens the time-to-market. To bridge this gap, this paper introduces SimSH which provides an automatic path from a Simulink model to a heterogeneous implementation.

Fig. 1 overviews SimSH. It takes a platform architecture and a Simulink specification model as input. Using the architectural database, our SimSH generates the necessary interfaces and produces a SW and HW implementation (as binary for processor(s) and bitstream for FPGA(s)). The user analyzes the computation and communication workload of the model and determines processing element (PE) allocation and model-to-PE mapping.

Figure 1: SimSH Flow Overview

The contributions are the following:
- Introducing a SimSH that provides an automatic path from Simulink to a heterogeneous platform, given PE allocation and mapping. The SimSH empowers algorithm developers to rapidly synthesize the application, avoiding tedious and error-prone manual implementation efforts.
- The SimSH automatically inserts necessary communication and synchronization across PEs via Communication Refinement. The synthesized layered communication is influenced by the OSI standard [5] to enable reusability and scalability over varying architectures.
- A communication optimization is introduced which detects an underutilized bus, and increases efficiency through pack/unpack to fully utilize the bus.

We demonstrate the benefits using Sobel Edge Detection [9], and map it to a heterogeneous platform of Blackfin DSP and Xilinx FPGA. The results demonstrate significant benefits in terms of (a) rapid realization (within minutes), and (b) increased performance and energy efficiency (both 2.68x over SW implementation).

This paper is structured as follows: Section 2 examines the relevant research. Section 3 introduces the SimSH flow in detail. Section 4 shows the experimental results. Section 5 concludes the paper and touches on future work.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
978-1-4799-7581-5/15/$31.00 2015 IEEE. 16th Int'l Symposium on Quality Electronic Design

2. RELATED WORK
Synthesizing Simulink algorithm models to specifications has emerged in recent research. In [6, 7], authors proposed a framework for software code generation from Simulink and validation on MPSoC architecture. In [8], authors generate software for MPSoC
from a Simulink model and map it onto a virtual platform (VP) implemented in an FPGA. In [4], authors generate, in addition to SW, the VP via a combined algorithm and architecture model (CMMA). Unlike [6, 7, 8, 4], which only target multiprocessor architectures and SW generation, we target a general heterogeneous architecture, including CPUs and hardware accelerators (e.g. FPGA). SimSH also explicitly addresses the communication across HW and SW.

The work in [15, 14] converts Simulink models to a System Level Design Language (SLDL) for System Level Design. The work introduces an interesting profiling approach and focuses on design space exploration (DSE). However, it stays at the abstract simulation level, unlike ours, which aims for heterogeneous target execution.

Simulink R2014a [12] also supports concurrent execution code generation. However, it does not specifically address communication optimization. Furthermore, Simulink only targets specific heterogeneous architectures (such as Zynq with single CPU and up to 2 FPGAs), while our work targets a general heterogeneous architecture. Different from the industry approach, SimSH reveals both design methods and usage. It allows users in the academic community to easily expand the tool to support other platforms.

3. HW/SW CODESIGN FRAMEWORK
The input of the SimSH is a Simulink specification model. Simulink [12] is a Model-Based Design (MBD) tool for system modeling and verification. A Simulink model is described as a set of functional blocks and subsystems, i.e. a grouping of blocks linked by signals.

Fig. 2 illustrates our SimSH in more detail. It takes a Simulink model as input and guides the user in allocating and mapping blocks based on profiling. Synthesis occurs in 3 phases: Front-end Synthesis, Communication Refinement, and Back-end Synthesis, yielding the SW/HW implementation.

SimSH includes a profiler to investigate the application's computation and communication workload. It employs Algo2Spec [14] to generate a SLDL specification model (in SpecC), and then profiles the specification using scprof [3]. The profiler reports computation and traffic demands in terms of number of operations, individually for each operation and data type. The profiling exposes computational and communication hot spots of the application.

Guided by the profiling results, the user manually allocates processing elements (PEs) and maps the model onto the PEs accordingly, yielding a mapped specification model. Allocation is recorded by annotating the Simulink model. In particular, a block performing a certain type of computation is likely mapped onto the PE that is designed to optimize this type of computation. Details of profiling Simulink applications and the synthesis of system specifications can be found in [3]. They are out of scope of this article, which instead focuses on the synthesis framework.

The input Simulink application model is shown in Fig. 2. For the discussion of this example, assume that blocks B and C are computationally heavy as revealed by the profiler. The user then maps them onto hardware while the other blocks stay in software.

In the Front-end Synthesis, the mapped specification model is split into hardware models and software models and then synthesized into a software implementation in C/C++ and a hardware implementation in a Hardware Description Language (HDL). In this step, the functionality of all blocks in the model is synthesized for different PEs while the communication across the PEs is missing. To address that, we insert the Proxy in the model, which encapsulates the cross-PE communication and will be further refined.

In the Communication Refinement, the Proxy is refined and realized following the OSI standard [5]. In our case, the Proxy is comprised of 4 layers: the application layer for the consistent interface, the transport layer for synchronization, the network layer for addressing and marshaling and the physical layer for interfacing with the physical bus. Then the refined communication is integrated into the software and hardware implementation, yielding a complete implementation in C/C++ and HDL on all PEs.

In the Back-end Synthesis, SimSH integrates the cross-compilation environment for software compilation and Xilinx ISE [13] for high-level synthesis. It finally generates the software binary for processors and the bitstream for FPGAs. The work in this paper makes the following assumptions and restrictions: (a) the user selects allocation and mapping manually; (b) it is bounded by Simulink Embedded Coder and HDL Coder restrictions and only supports discrete event models using a fixed step solver.

3.1 Front-end Synthesis
As a result of the profiler-guided allocation and mapping, a mapping annotated Simulink specification model is the input of the front-end synthesis, as shown in the upper graph in Fig. 3. While the functional blocks and intra-PE communication are mapped on a PE, the cross-PE communication is implicitly mapped on the shared bus. The front-end synthesis explores the cross-PE communication optimization and inserts the Proxy for further refinement.
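
As a small illustration of the distinction between communication that stays within a PE and communication that crosses PEs, the following C sketch classifies each signal of a mapped model. It is only a sketch under our own assumptions: the types and names (enum pe, struct signal, is_cross_pe, count_cross_pe and the example tables) are hypothetical and are not part of SimSH. The example tables mirror the running example in which A and D stay in software and B and C are mapped to hardware.

```c
#include <stddef.h>

/* Hypothetical PE identifiers for the running example. */
enum pe { PE_DSP = 0, PE_FPGA = 1 };

/* A signal connects a source block to a destination block
 * (both are indices into the block-to-PE mapping table). */
struct signal {
    int src_block;
    int dst_block;
};

/* Returns 1 if the signal crosses a PE boundary, 0 if it stays on one PE. */
static int is_cross_pe(const struct signal *s, const enum pe *mapping)
{
    return mapping[s->src_block] != mapping[s->dst_block];
}

/* Counts the signals that would implicitly be carried by the shared bus. */
static size_t count_cross_pe(const struct signal *sigs, size_t n,
                             const enum pe *mapping)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)is_cross_pe(&sigs[i], mapping);
    return count;
}

/* Example data (assumed): blocks A, B, C, D with A and D on the DSP,
 * B and C on the FPGA, connected as a chain A -> B -> C -> D. */
static const enum pe example_mapping[4] = { PE_DSP, PE_FPGA, PE_FPGA, PE_DSP };
static const struct signal example_signals[3] = { {0, 1}, {1, 2}, {2, 3} };
```

In this example, the signals A to B and C to D cross the DSP/FPGA boundary, while B to C stays on the FPGA.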
Figure 2: SimSH Flow

Figure 3: Communication Optimization over Underutilized Bus (upper graph: P-bit and Q-bit transfers between A on the DSP, BC on the FPGA and D on the DSP over an NPQ-bit bus; lower graph: Pack 1:NQ / Unpack NQ:1 and Pack 1:NP / Unpack NP:1 inserted around BC)

3.1.1 Communication Optimization
Given a group of blocks mapped on each PE in the mapped model, the traffic between blocks mapped on different PEs is influential to the overall performance. It usually incurs longer latency than intra-PE communication. To achieve efficiency, it is important that user transactions (as generated by the specification) match the underlying interconnect.
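
To make the idea of matching user transactions to the interconnect more concrete, the following C sketch compares a signal width with the bus width and bundles several narrow transfers into one bus word, in the spirit of the pack and unpack blocks discussed in this section. It is a minimal sketch under stated assumptions, not the SimSH implementation: the function names (concat_ratio, bus_underutilized, pack_word, unpack_word) are hypothetical, the bus word is limited to 32 bits, the caller must ensure that n * sig_bits does not exceed 32, and placing the first transfer at the least significant bits is our own choice.

```c
#include <stdint.h>

/* Number of user transfers that fit into one bus word
 * (0 if the signal is wider than the bus or has zero width). */
static unsigned concat_ratio(unsigned bus_bits, unsigned sig_bits)
{
    return sig_bits ? bus_bits / sig_bits : 0;
}

/* The bus is treated as underutilized when at least two transfers
 * fit into a single bus word. */
static int bus_underutilized(unsigned bus_bits, unsigned sig_bits)
{
    return concat_ratio(bus_bits, sig_bits) >= 2;
}

/* Bit mask covering the lowest sig_bits bits. */
static uint32_t width_mask(unsigned sig_bits)
{
    return (sig_bits >= 32) ? 0xFFFFFFFFu : ((1u << sig_bits) - 1u);
}

/* pack: concatenate n values of sig_bits each into one bus word.
 * Assumption: n * sig_bits <= 32, first value at the LSB side. */
static uint32_t pack_word(const uint32_t *in, unsigned n, unsigned sig_bits)
{
    uint32_t word = 0;
    for (unsigned i = 0; i < n; i++)
        word |= (in[i] & width_mask(sig_bits)) << (i * sig_bits);
    return word;
}

/* unpack: slice one bus word back into n values of sig_bits each,
 * in the same order used by pack_word. */
static void unpack_word(uint32_t word, unsigned n, unsigned sig_bits,
                        uint32_t *out)
{
    for (unsigned i = 0; i < n; i++)
        out[i] = (word >> (i * sig_bits)) & width_mask(sig_bits);
}
```

For the 16-bit bus used in Section 4, concat_ratio(16, 8) yields 2 for the 8-bit pixels and concat_ratio(16, 1) yields 16 for the 1-bit edge decisions, which corresponds to the 1/2 and 1/16 bus utilization reported there.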
SimSH detects the underutilized bus by comparing the data width of cross-PE signals with the bus width. Fig. 3 shows an opportunity for communication optimization. In the original mapped specification model (upper graph), a single transaction (P-bit width from A to BC) is narrower than the bus width (NPQ-bit width), which underutilizes the bus. The bus utilization can be optimized by concatenating multiple user transactions accordingly. To do this, SimSH inserts in the mapped model a pack and an unpack block at both sides of a cross-PE communication. This bundles multiple user transfers to utilize the full bus width (lower graph).

Figure 4: pack and unpack for Concatenating N Transfers

Fig. 4 visualizes pack and unpack as parametrizable blocks. On the top, pack buffers NQ P-bit inputs and concatenates them into a single NPQ-bit output. This reduces the data rate by a factor of NQ between input and output. On the bottom, unpack slices an NPQ-bit input into NQ P-bit outputs, increasing the data rate by NQ. As a result, while the processing blocks (A, BC, D) remain untouched, transfers are bundled and bus utilization is increased.

If the target heterogeneous architecture supports bus burst transfer or DMA, SimSH can concatenate transactions more aggressively at the cost of latency. The added blocks (pack, unpack) minimally increase computation. The benefits of better utilizing the bus outweigh the minimal computation overhead as the concatenation ratio increases (see Section 4).

Overall, communication optimization updates the mapped model with fewer transfers across the blocks mapped on different PEs.

3.1.2 Model Splitting and Proxy Generation
SimSH then splits the mapped model into a set of target models, one for each PE. Types include a SW model for a processor (e.g. CPU, DSP) or a HW model (e.g. for FPGA). Each target model only contains the blocks mapped to the particular PE.

Figure 5: Model Split for Each PE

Fig. 5 shows in the top half the mapping annotated model. Blocks A and D are mapped to a DSP (SW) and B, C to FPGA-1 (HW). To illustrate a more general, complex example, blocks (X, Y, Z) and the backward communication c3 are included. As a result of the mapping, the communication across PE boundaries (c1, c2 and c3) needs to be established. For this, Proxy Generation replaces blocks mapped on another PE with a local Proxy. A proxy acts as a placeholder and bundles data input and output of the current PE. Proxy Generation traces the interface types in the Simulink model, and inserts proxies maintaining the interfaces. Fig. 5 shows the results. Blocks that are mapped on FPGA-1 (HW), i.e. BC (Unpack NQ:1 and Pack 1:NP), are replaced with a Proxy-BC in the SW model. Proxy-BC on the DSP sends c1 and reads c2 and c3 from the HW model. As a result, A and D execute as if blocks B, C were still in SW.

3.1.3 Homogeneous Synthesis
SimSH invokes the Simulink Embedded Coder [10] and Simulink HDL Coder [11] to generate target SW and HW implementations. Each of these can optimize internally to generate efficient code. To maximize the potential, we do not synthesize each block individually. Instead, blocks mapped to the same PE are grouped into a super-block and then synthesis is invoked on that super-block. Furthermore, as each inserted Proxy retains the boundary interfaces of the blocks it replaces, it can be scheduled identically to the original specification model. All inserted blocks (pack, unpack and Proxy) are composed of synthesizable blocks (for both SW and HW). Section 3.2 discusses Proxy composition and refinement in more detail.

Manually scheduling blocks mapped on the same PE is challenging due to Simulink semantics. To circumvent scheduling ambiguities, SimSH Proxy blocks are Simulink blocks and are synthesized by Simulink together with the computation modules. We observe that Simulink Embedded Coder generates sequential implementations and Simulink HDL Coder generates pipelined implementations. In addition, because Simulink synthesizes the Proxy, the Proxy benefits from all optimizations of the Simulink synthesis.

Overall, front-end synthesis generates computation blocks and the communication within one PE. The inter-PE communication via Proxy needs further refinement as discussed in the next section.

3.2 Communication Refinement
SimSH automatically refines communication into a layered implementation following the OSI standard [5] as shown in Fig. 6: application layer, transport layer, network layer and physical layer. A transaction initiated at the application layer is decomposed into packets at the transport layer, converted into bus transactions in the network layer and finally transferred via the physical layer. The layered design hides the underlying hardware (from the physical layer up), as well as application specifications (from the application layer down). The OSI layered communication implementation allows a wide application of the Proxy principle to a host of heterogeneous architectures. It simplifies expanding the database for new architectures.

Figure 6: Proxy OSI Model

At the Application Layer, the Proxy is a placeholder for blocks mapped to other PEs. It retains identical boundary interfaces of those remote blocks, replicating each port (e.g. in direction, width, data type, and update rate). For example, SW Proxy-BC in Fig. 5 implements input port c1 and output ports c2 and c3, identical to HW block BC.

At the Transport Layer, as shown in Fig. 6, the SW Proxy-BC instantiates a proxy_recv block for each input and a proxy_send block for each output port. In case of a rate change in the replaced blocks,
a Rate Adapter is inserted. In SW, proxy_recv and proxy_send are connected through the Rate Adapter to allow different read and write transaction rates. proxy_recv, proxy_send and Rate Adapter are constructed from a Simulink synthesizable subset to simplify code generation.

Implementing the hardware proxy requires strict timing, as it interfaces with the network layer from the database. To guarantee the timing, one approach is to generate the proxy out of communication primitives. In fact, communication refinement extracts the system composition and connection from the HW target model into an XML file generally following the IP-XACT standard [2]. It captures all relevant port characteristics, which guides the communication refinement to generate a proxy_recv or proxy_send for each port in the HW Proxy. The FIFOs in proxy_recv or proxy_send decouple execution of the synthesized application from communication code.

Listing 1: Proxy API at Transport Layer

    /* Transport Layer of SW Proxy-BC */
    proxy_send0(inport_c1){          /* write c1 to the HW block */
        c1 = inport_c1;
        send(Proxy-BC, c1);}
    proxy_recv0(outport_c3){         /* read c3 from the HW block */
        recv(Proxy-BC, c3);
        outport_c3 = c3;}
    proxy_recv1(outport_c2){         /* read c2 from the HW block */
        recv(Proxy-BC, c2);
        outport_c2 = c2;}

For synchronization across HW and SW, our transport layer uses a buffered asynchronous communication. We use this as we observed that Simulink HDL Coder can synthesize HW blocks into a pipelined design to relax the pressure for the high-level synthesis. We utilize this concept to realize synchronization across each HW/SW boundary by adding an additional cycle delay.

Listing 1 shows the pseudo API of the read and write transactions in the transport layer. All transactions at the transport layer and above are addressed by block ID and port ID from the Simulink model and are therefore transparent to all underlying heterogeneous architectures.

Simulink Embedded Coder synthesizes the SW model into a step function triggered by a periodic timer. After execution of A, Proxy-BC issues a write transaction, immediately followed by a read transaction (as governed by the Rate Adapter). It reads the BC result of the previous iteration, and SW continues with D. Hence, HW and SW execution are overlapped. HW starts executing upon availability of the data and produces the output.

Figure 7: HW/SW Synchronization

As shown in Fig. 7, BC produces results which are read by D in the next iteration (assuming the same rate for simplicity). This additional iteration delay makes the implementation of HW completely independent of the speed of SW. Therefore, the maximum delay of HW is relaxed to be as large as the complete loop of software execution. As a result, the HW can run multiple orders of magnitude slower than the bus speed, which has a potential for more energy saving. We do not explore this here because we are targeting an FPGA.

The Network Layer of the Proxy provides addressing and data marshalling as shown in Fig. 6. We follow a two-layer addressing, similar to Simulink's identification (block ID and port ID). The network layer maps Simulink's addressing onto the physical address.

    MSB                                      LSB
    | HW Address Prefix | Block ID | Port ID |

Figure 8: Proxy Addressing at Network Layer

As depicted in Fig. 8, block ID and port ID follow the HW address prefix from the most significant bit (MSB) to the least significant bit (LSB). The address range for block ID and port ID is dependent on the number of blocks and ports in the model.

Listing 2: Proxy API at Network Layer

    /* Proxy Network Layer */
    send(BlockX, PortY){
        addr = convert2addr(BlockX, PortY);   /* addressing */
        API_BUS_SEND(addr, Port.data);}       /* marshal payload, call bus API */
    recv(BlockX, PortY){
        addr = convert2addr(BlockX, PortY);
        API_BUS_RECV(addr, Port.data);}

Listing 2 shows two steps: addressing and marshalling. The send and recv functions in the SW Proxy first convert the block ID and port ID to the physical address, then marshal the transaction payload and eventually call the bus API. In the HW model, a 2-level decoder (out of the database) is instantiated to select the HW block and the Proxy FIFO.

The Physical Layer in the SW model contains the bus driver from the database. It provides a set of native APIs for bus transactions called from the network layer. In addition, it also wraps the SW application with some top level architecture specific initialization. In the HW model, the physical layer instantiates the bus Interface (IFC) component and the top level FPGA pin mapping [1] as well as the User Constraint File (UCF). The IFC component can directly read data from the bus and interprets a bus write as signals on the bus lines.

Table 1: Timing and Dependency of Proxy

    Source    OSI Layer           Timing Accuracy  Application Specific  Platform Specific
    App       Application         loosely          high                  none
    Proxy     Transport, Network  approximate      medium                medium
    Database  Physical            cycle accurate   none                  high

Overall, Table 1 summarizes the layering scheme from timing and dependency aspects. The timing accuracy (precision of synthesis) increases along the top-down layering refinement of communication based on the OSI model. In addition, a higher OSI layer is more application specific and less platform specific. Hence, the most timing accurate element (bus interface), instantiated from the database, is completely platform specific and independent of applications. Conversely, the application requires full synthesis (with least timing requirements), while proxies are partially parameterized.

As a result of the front-end synthesis and communication refinement, SimSH has generated a complete SW and HW implementation in bare-C and VHDL.

3.3 Back-end Synthesis
Back-end synthesis is responsible for synthesizing the C/C++ and HDL code into the appropriate target binaries/bitstream. Based on the selected target architecture, the back-end synthesis integrates cross-compilation environments (e.g. BF527) and Xilinx ISE [13] to automate the SW compilation and HW high level synthesis.
The process of back-end synthesis is automated. SimSH gen- BF527 DSP Xilinx FPGA BF527 DSP
Image Load Sobel Deserialize
erates Makefiles to automate the SW cross-compilation. For HW Serialize 8 bits Core 1 bit Image Print
8 bits 1 bit
back-end synthesis, SimSH generates Xilinx ISE project files and 16 bits 16 bits
16-bit Bus
invokes ISE for HW high level synthesis via command line [13].
BF527 DSP Xilinx FPGA BF527 DSP
4. EXPERIMENTAL RESULTS Image Load
Serialize 8 bits
Pack
SW 1:2
Unpack
HW 2:1
Sobel
Core
Pack
HW 1:16
Unpack
SW 16:1 1 bit
Deserialize
Image Print
8 bits 1 bit
To demonstrate the benefits of the framework, we use the Sobel 16 bits 16 bits
16-bit Bus
Edge Detect [9]. Sobel Edge Detect detects the edges in an image
by comparing each pixel with its neighbors. It computes the gra- Figure 11: Sobel-Edge-Detect Communication Optimization over
dients of the current pixel via a matrix multiplication of a Sobel Underutilized Bus DSP FPGA
DSP P bits P bits
FPGA Q bits Q bits
DSP 8 bits 8 bits 1 bit
operator and the matrix of current neighboring A pixels. If the gradi- BC
SimSH
D
splits the resulting model into a
A
HW model and a SW
BC
NPQ-bit Bus
ent of a pixel is larger than a certain threshold, this pixel is detected 16-bit Bus
model. In the SW model, it replaces the Sobel Core with emph-
as a part of an edge. DSP FPGA proxy_sobel DSP as a placeholder which consumes DSP pixels and outputs FPGA
Pack Unpack Pack Unpack Pack Unpack Pack
DSP FPGA A DSP
1:NQ
P bits NQ:1 P bits
BC
Q bits
decisions
1:NP
mimicking
NP:1 Q bit
D as if Sobel Core Awould
8 bits
still
1:2
be in SW. Sim-
2:1 8 bits
BC
1 bit 1:16
NPQ bits
ilarly, in the HW model, SimSH replaces Image 2Load,
NPQ bits bits Serialize 16
NPQ-bit Bus 16-bit Bus
Serialize
Sobel
Deserialize
and Deserialize, Image Print with two proxies to receive pixels and
8 bits Core 1 bit send results.
A D
BC Then, SimSH invokes imulink Embedded Coder [10] and Simulink
Load Image Image Print HDL Coder [11] to generate target SW (C/C++) and HW (HDL)
Figure 9: Sobel Edge
NotDetection
used Algorithm implementations. Here, the generated HDL for the Sobel Core has
Fig. 9 depictsDSP
the Simulink model, 13 pipeline stages. Invalid outputsNot due
usedto pipeline fill are discarded
FPGAmainly as a pipeline
DSP of Image by the HW Proxy.
Load, Serialize, Sobel Edge Detect, Deserialize and Image Print. 16-bit Bus
Image Load simply loads a 320x240 12-bit FPGA Address Prefix 4-bit Block ID 4-bit Port ID 12-bit Free 16
Address Space
Sobel gray image (8 bits/pixel) and 31
DSP 16 bits 16 bits
19
FPGA
15 11
16 bits bits DSP
0
Term
ImagesendsSerialize
Serialize each pixel to Sobel
8 bits Edge
Core. Then,Deserialize
Sobel Core outputs Super
Block1 Figure
Input
12: Proxy Network
Pack
Input
Layer
1 bit Addressing
Unpack Unpack
Super
Block2
Output
Pack
Output Super
Block3
the binary decision whether the pixel is1 bit Print
part of an edge. Finally, 8 bits 8 bits 1 bit
Detect
Deserialize assembles the image. During the communication16 bits refinement, the logical 16 bits addressing
140 based onBusblock
Transfer ID and port ID is refined to physical addressing of
123.25 8 bits
EBIU bus. Fig. 12 depicts the8 address
bits
allocation following Simulink
Bits
120
100 PAL: Modular, Portable, Low Power two-layer addressing: in a 32-bit address, 1 bitthe address range 1 bit from
Million Cycles
19 down to 16 is allocated to block ID and address range from 15

80 Time
Processing Element down to 12 is for port ID. The most significant 12 bits are reserved
Peripherals
60 LCD
by the memory system for FPGA address range while the least sig-
Communications
40 RISC-DSP FPGA nificant 12 bits are still free.
USB
& Networking 14.05 18.73
20 Therefore, the total available 20 bit address range (excluding the
Sensor Interfaces GPIO reserved 12 bits) supports up to 220 combination of block ID and
0 16-bit EBIU BUS
Serializer Sobel_Core Deserializer port ID. In this experiment, the allocated 4-bit block ID and 4-bit
Not used
port ID supports up to 16 blocks and 16 ports, while we only need
Figure 10: Sobel Edge Detect Performance Estimation three blocks and two ports.
To route a bus transaction to the correct Simulink block entity on the FPGA, the corresponding two-level decoders are generated: two 4-to-16 bit decoders select the hardware block entity and the input/output Proxy FIFO, respectively.

The physical layer refinement of the HW Proxy depends on the database: the EBIU bus driver in bare C and the Interface (IFC) component in VHDL [1]. We implement the IFC as the EBIU bus driver on the FPGA. It reacts to the EBIU control line, reading from the data line during a read transaction and writing to the EBIU control line and data line during a write transaction. Furthermore, the physical layer also encapsulates the top-level FPGA pin mapping as well as the User Constraint File (UCF).

To analyze the computation demands and to guide the mapping process, the application is profiled. In Fig. 10, the profiling results of Sobel Edge Detect show that Sobel Core accounts for 79.1% of the total computation demand. The user maps Sobel Core onto the FPGA and the rest onto the DSP, as shown in blue in Fig. 9.

[Figure: General Heterogeneous Platform. Diagram labels: CPU 1, CPU 2, ..., CPU n, Mem, FPGA HW Accelerator, Bus.]

This experiment targets a heterogeneous platform consisting of a Blackfin BF527 Digital Signal Processor (DSP) at 600MHz [1] and a Xilinx Spartan3E XC3S500E FPGA at 100MHz, linked by the on-chip 16-bit External Bus Interface Unit (EBIU) at 100MHz.

4.1 Application-specific Synthesis Results
As a result of the mapping, Sobel Core's input and output cross PE boundaries (Fig. 11).

For each 8-bit input pixel, the Sobel Core outputs a 1-bit result indicating the edge. On a 16-bit EBIU bus, only 1/2 and 1/16 of the bus width are utilized for input and output, respectively. SimSH detects the underutilized bus in the data streaming and concatenates the input and output of the HW block by inserting pack and unpack blocks.

To realize the input concatenation, SimSH inserts pack_SW after the Serializer in the SW model (Fig. 11 bottom). Before sending, pack_SW marshals two 8-bit pixels into one 16-bit bus transaction. Upon receiving, unpack_HW fires the Sobel Edge Detect twice, once with each pixel. This cuts the number of input transactions in half. Similarly, concatenating 16 Sobel Core outputs reduces the number of output transactions 16x.

[Figure: pack block (Input, Buffer, Bit Concat, Output; P-bit to NPQ-bit) and unpack block (Input, Bit Slice 1 to Bit Slice NQ, Concat Vector, Unbuffer, Output; NPQ-bit to P-bit).]

4.2 Evaluation
To illustrate the benefits of the HW/SW co-design and the communication optimization, we compare five different implementations (varying mapping and optimization level), as illustrated in Fig. 13. Pure SW Solution maps the whole specification model onto the BF527 DSP. HW-SW Co-Design (no opt) maps Sobel Core onto the Xilinx Spartan3E FPGA and the remaining blocks onto the BF527 DSP. HW-SW input pack/unpack optimizes the input communication of the HW module by inserting pack/unpack on the path from SW to HW. HW-SW output pack/unpack optimizes the output communication of the Sobel Core. HW-SW input/output pack/unpack
optimizes both input and output of the Sobel Core.

[Figure 13: Sobel Edge Detect Performance. Bar chart of Total Run Time and Communication Delay, in million cycles, for Pure SW Solution, HW-SW Co-design (no opt), HW-SW input pack/unpack, HW-SW output pack/unpack, and HW-SW input/output pack/unpack.]

The baseline is the pure SW solution, which maps the whole Simulink model onto the DSP. It has the longest execution time, about 45.6Mcycles. HW-SW [no-opt] maps the most computationally intensive block, Sobel Core, to the FPGA, with the rest running on the DSP, all in a pipelined fashion. The total execution time drops to 20.3Mcycles, yielding a 2.25x speedup. However, HW-SW [no-opt] includes a communication overhead across DSP and FPGA of 22.2% of the total execution time.

Optimizing the path from SW to HW, the input pack/unpack solution reduces traffic overhead slightly but yields a longer total execution time than HW-SW [no-opt]. The overhead of executing pack (in SW) outweighs the communication performance gain, which is small due to the low input concatenation ratio of 2:1 (pack 2 pixels). Conversely, when only optimizing the path from HW to SW in the [output pack/unpack] solution, the overall performance increases, as the output concatenation ratio of 16:1 is much higher than the input concatenation ratio.

Finally, optimizing both paths, SW to HW and HW to SW, achieves a 2.68x speedup over the pure software solution. The total communication time (0.58Mcycles) of HW-SW input/output pack/unpack decreases 10-fold compared to the unoptimized HW-SW [no-opt] solution (5.8Mcycles). At 3.4% of the total execution time, the communication time (0.58Mcycles) is no longer a significant delay contributor.

To assess power efficiency, we measure the board-level power of our platform. It remains fairly constant at around 680mW regardless of load and FPGA usage. As the DSP runs at a fixed frequency, this indicates that the FPGA (whose load is changed) is a minor contributor to the total dynamic power. Nonetheless, HW/SW co-design and further communication optimization shorten the total execution time in the heterogeneous execution. Hence, the energy efficiency increases linearly with the performance speedup. Our optimized HW/SW solution is 2.68x more energy efficient.

Table 2: FPGA Utilization of HW/SW Optimized Solution

Slice                 Total   Application   Proxy+Glue, Pack+Unpack   Database
Usage (out of 9312)   547     170           177                       200
Utilization           5.8%    1.8%          1.9%                      2.1%

In the HW/SW optimized solution, the FPGA utilization of the Spartan 3E is only 5.874%, as shown in Table 2. The generated Proxy, pack, unpack and other glue logic (bus IF) occupy 32% of the used slices. The application-specific Sobel Core is small in this example. This indicates significant room for other implementations, e.g. duplicating Sobel Core on the FPGA. However, this algorithm optimization is out of the scope of SimSH. On the other hand, the DSP is fully utilized at nearly 100% for all five implementations due to the overlapped HW/SW execution.

5. CONCLUSION
This paper introduces SimSH, a Simulink-based framework that bridges the gap between algorithm design in Simulink and its implementation on a heterogeneous platform. Given an allocation and a mapping decision, SimSH automatically synthesizes the Simulink model onto the heterogeneous target and refines the synchronization and communication across processing elements. Furthermore, SimSH optimizes communication by detecting an underutilized bus and concatenating transactions accordingly. As a result, it allows the developer to focus on algorithm exploration and tuning and to rapidly prototype the algorithm on a heterogeneous target platform.

We have demonstrated the benefits using Sobel Edge Detection [9], targeting a heterogeneous architecture with a Blackfin processor and a Spartan3E FPGA. With communication optimization, SimSH achieves up to a 2.68x improvement in both speed and energy efficiency over a pure software solution. In future work, we will investigate automatic mapping decisions for a given platform.

6. REFERENCES
[1] Analog Devices, Inc. (ADI). ADSP-BF52x Blackfin Processor Hardware Reference, February 2013. Rev. 1.2.
[2] V. Berman. Standards: The P1685 IP-XACT IP Metadata Standard. Design & Test of Computers, IEEE, 23(4):316-317, April 2006.
[3] L. Cai, et al. Retargetable profiling for rapid, early system-level design space exploration. In Proceedings of the Design Automation Conference (DAC), San Diego, CA, June 2004.
[4] S.-I. Han, et al. Simulink-based Heterogeneous Multiprocessor SoC Design Flow For Mixed Hardware/Software Refinement And Simulation. Integration, the VLSI Journal, 42(2):227-245, Feb. 2009.
[5] International Organization for Standardization (ISO). Reference Model of Open System Interconnection (OSI), second edition, 1994. ISO/IEC 7498 Standard.
[6] K. Popovici et al. Simulink Based Hardware-Software Codesign Flow For Heterogeneous MPSoC. In Proceedings of the 2007 Summer Computer Simulation Conference, pages 497-504. Society for Computer Simulation International, 2007.
[7] K. M. Popovici. Multilevel Programming Environment for Heterogeneous MPSoC Architectures. PhD thesis, Grenoble Institute of Technology, 2008.
[8] F. Robino et al. From Simulink to NoC-based MPSoC on FPGA. In DATE, pages 1-4, 2014.
[9] I. Sobel. Neighborhood Coding Of Binary Images For Fast Contour Following And General Binary Array Processing. Computer Graphics and Image Processing, 8(1):127-135, 1978.
[10] The MathWorks, Inc. Embedded Coder ref. http://www.mathworks.com/products/embedded-coder/, 2014a.
[11] The MathWorks, Inc. HDL Coder ref. http://www.mathworks.com/products/hdl-coder/, 2014a.
[12] The MathWorks Inc. MATLAB and Simulink, 1993-2014.
[13] Xilinx. Xilinx Command Line Tools User Guide, October 2013. Version 14.7.
[14] J. Zhang et al. Joint Algorithm Developing and System-Level Design: Case Study on Video Encoding. In Embedded Systems: Design, Analysis and Verification, pages 26-38. Springer, 2013.
[15] L. Zhang, et al. Bridging Algorithm And ESL Design: Matlab/Simulink Model Transformation And Validation. In Specification & Design Languages (FDL), 2013 Forum on, pages 1-8, Sept 2013.
