A Runtime Adaptive H.264 Video-Decoding MPSoC Platform
Giuseppe Tuveri, Simone Secchi, Paolo Meloni, Luigi Raffo
DIEE - Department of Electrical and Electronic Engineering
University of Cagliari
email: {giuseppe.tuveri, simone.secchi, paolo.meloni, luigi}@diee.unica.it

Emanuele Cannella
Leiden Institute of Advanced Computer Science
Leiden University
email: cannella@liacs.nl

Abstract—Due to current and future technology issues, multi-core processing systems are required to provide support for
adaptivity to an ever increasing extent. This requirement may
descend from demands of fault-tolerance as well as from dynamic
Quality-of-Service (QoS) management strategies, depending on
the targeted application and power budget. This paper presents a
Network-on-Chip (NoC)-based Multi-Processor System-on-Chip
(MPSoC) platform for video decoding applications that provides
system adaptivity and reduced power consumption. The platform
specifically targets execution of Polyhedral Process Network
(PPN) streaming applications. System adaptivity is achieved
through support for runtime migration of PPN processes between
different tiles, while the power consumption is reduced at runtime
through clock gating of inactive processing tiles. The details of
how the migration process and clock gating mechanisms are
implemented in the platform, both in hardware and middleware,
will be presented, along with a characterization of the introduced
overhead. In its standard operating mode, the adaptive platform
executes a PPN implementation of an H.264 decoder on a stream
of video packets coming from a network connection. The network
packets are analyzed through a deep packet inspection kernel,
OpenDPI, to distinguish between video and special reconfiguration packets. Upon reception of a reconfiguration packet from
the network, the adaptive platform performs an on-line reconfiguration that employs runtime PPN process migration to modify
the amount of computational resources allocated to execution
of the H.264 decoder application. The results demonstrate the
feasibility of the approach and its possible applicability to the
broader class of PPN streaming applications.

I. INTRODUCTION
Multi-Processor Systems-on-Chip (MPSoCs) are currently
employed as the computational infrastructure of numerous complex digital systems, ranging from workstations to battery-operated hand-held devices, in a large number of application
scenarios. Such variety in workload requirements poses a
number of new challenges, in addition to the classic need for
ever increasing throughput performances.
System adaptivity is a valuable feature whose need is clearly
called for by at least three different technology trends. First,
modern application workloads are intrinsically dynamic and
The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 248424 (MADNESS Project), from the ARTEMIS JU ASAM Project, and from the Region of Sardinia, Young Researchers Grant, PO Sardegna FSE 2007-2013, L.R. 7/2007 "Promotion of the scientific research and technological innovation in Sardinia".

composed of multiple interleaved applications with different priorities, often sharing the same computing resources. Therefore, their predictability is usually low, and runtime adaptivity is a
useful feature in the quest for optimal resource management.
Second, limitations posed by the power budget are starting to
apply to more than just smartphones and hand-held devices,
therefore different operation modes characterized by different
power consumption and performance figures are generally
available and selectable at runtime, either under user or operating system control. This aspect often translates into the need to power down parts of the available computing resources, trading performance for reduced power consumption. Finally,
modern technology scaling and transistor integration capacity
have revitalized the interest in component reliability. For
instance, future multi-core computing platforms will likely
feature very high core counts, posing the need for runtime system adaptivity and task migration in order to maintain the overall functionality in the presence of faulty cores.
This paper describes the design of an adaptive MPSoC platform for H.264 decoding of network streaming video packets.
Adaptive video streaming has gained, in the recent past, significant interest in the academia, industry and standardization
communities because of the many factors that benefit from
such adaptation capability [1]. Two of the many points of
interest of adaptive video streaming are the unpredictability of
common channel conditions, which reduces effective network
bandwidth, and variability of the computational power at the
receiver side across the plethora of embedded platforms that
integrate video reproduction capabilities. Moreover, a large
number of embedded devices have started to integrate OSs and run
multiple applications, requiring the receiving platform itself
to provide a degree of adaptivity when running video-based
applications.
The proposed platform is composed of different computing
tiles and a NoC communication infrastructure. The H.264
video decoder application and other possible target applications are written according to the Polyhedral Process Network
(PPN) model of computation. PPNs are a class of applications
commonly used to map streaming applications onto multi-processor systems, leveraging the parallelism offered by process pipelining. System adaptivity is provided in the form
of runtime PPN process migration among different computational tiles. Support for system adaptivity is implemented

through specific mechanisms that operate both at the hardware and the middleware level. At the hardware level, specific features
have been added to efficiently map the PPN communication
channels onto the NoC as well as to facilitate the process
migration mechanism. The middleware layer includes components designed to realize, in an efficient way,
communication between the different PPN processes on top
of a NoC, as well as components that implement save and
restore of the PPN process state for the migration procedure.
Moreover, in order to reduce the overall power consumption,
the platform includes an infrastructure for clock gating that
operates at the tile level. By using this infrastructure, tiles
whose computational workloads are migrated can be turned
off to reduce dynamic power consumption in idle state.
The platform implements the process of on-line reconfiguration upon receiving a reconfiguration packet from a network
connection. In the presented experiments, the adaptive platform is synthesized as an FPGA-based prototype and executes
the H.264 decoder on a stream of video packets from an
external Ethernet network connection. Upon reception of the
reconfiguration packet, the adaptive platform will re-configure
the mapping of the PPN processes onto the computational
tiles, employing runtime PPN process migration to modify the
amount of computational resources allocated to execution of
the H.264 decoder application.
The remainder of the paper is organized as follows: Section
II presents an overview of the most relevant related work for
what regards mapping of PPN applications onto NoC-based
MPSoCs and PPN process migration as a way to improve system adaptivity for such applications and platforms. Section III
presents the hardware organization of the presented platform,
with particular emphasis on those features that specifically
contribute to execution of adaptive PPN applications on top
of the NoC. Section IV presents the middleware components
that realize PPN process communication and migration among
the different computing tiles, along with the implementation of
tile-level clock gating for power reduction purposes. Section
V describes the experiments that demonstrate the implemented
migration functionality and power reduction performances.
Finally, Section VI concludes the paper and gives an overview
of prospective platform improvements.
II. RELATED WORK
Runtime adaptivity will be a critical feature of future computing systems [2]. The design of adaptive systems requires
the creation of a number of mechanisms and paradigms at all
levels, including hardware architecture and software/middleware organization. In the area of embedded computing systems, where constraints such as cost and power consumption
are the most critical, system adaptivity has been a major
topic of research in the recent past, and a large number of
run-time management methodologies have been studied [3].
Among such methodologies, runtime process migration is one
of the most commonly addressed solutions, investigated for
purposes such as dynamic load balancing and system-level
fault resilience [4],[5].

Kahn Process Networks (KPNs) constitute a well-established model of computation used to describe streaming
applications [6], where computational processes operate in
parallel and communicate using unbounded FIFO channels.
KPNs do not require the underlying computing resources to
share any memory between different processes, therefore they
are often implemented on top of physically distributed memory
architectures. Polyhedral Process Networks (PPNs) [7] belong
to the class of KPNs and have some peculiarities that facilitate
their mapping on top of MPSoC platforms. For instance, it is
possible to derive the buffer sizes that guarantee the absence
of deadlocks in a PPN. A PPN process is forced to stall
when trying to read from an empty FIFO communication
channel or write into a full FIFO communication channel.
Moreover, it is possible to automatically translate a sequential
program that satisfies a set of structural constraints into a PPN
representation [8], at the same time determining the size of the
communication buffers that guarantee deadlock-free execution
of the PPN.
These characteristics make PPNs a very appealing model of computation for implementing runtime system adaptivity through process migration. Process
migration in distributed-memory MPSoC platforms has been
investigated for networks of processes. In [9] a NoC-based
distributed-memory MPSoC platform is described, where
system-adaptivity is achieved through process migration. The
remapping decision is delegated to an OS-level runtime monitor, and different remapping policies are presented and evaluated. However, the task remapping can happen only at specific
points in time, namely when process communication events
are processed, therefore reducing the effective responsiveness
of the adaptivity mechanism. In our platform, we allow task
remapping events to happen at any time during execution.
In [10] and [11], two other approaches to system-level
adaptivity that rely on task migration are presented. In both
these works, the user is in charge of defining checkpoints at
which the task re-mapping can take place. Again, our approach
to task migration does not burden the programmer with such a task. Moreover, the mentioned works assume shared-memory support from the underlying hardware or middleware,
while our approach relies completely on a distributed memory
hierarchy. We believe this is a more general case, as abstract
shared-memory primitives can always be implemented on top
of a physically distributed memory.
The mapping of PPNs, and of KPN applications more generally, on top of MPSoC embedded computing platforms has been extensively studied in the literature. Different alternatives
exist when mapping the logical point-to-point KPN communication channels onto the available underlying communication
infrastructure. The main trade-off in mapping KPN applications is between the generality of the approach and its overall performance. In [12], a survey of mapping
strategies for KPN applications is presented. The majority of
the approaches at the state of the art focus on optimizing
the application for the underlying MPSoC architecture and therefore lack generality. In our platform, we try to remain

as general as possible, by targeting a NoC-based MPSoC platform with distributed memory. We built a middleware
layer which provides an intuitive PPN API for this platform
that the PPN applications can use. By doing so, the PPN
application code will be portable to all the possible mappings
of the processes onto the available cores. The middleware will
also implement, in a programmer-transparent fashion, the task
migration mechanisms.
Task migration and re-mapping have often been used as an
enabler to implement strategies for runtime reduction of energy consumption. Such solutions often use task migration in
conjunction with the application of clock gating capabilities to
idle tiles or processing elements. Examples of such techniques
can be found in both academic and production systems, such as
[13], [14] or [15]. Section IV-A will describe how the proposed
system uses similar clock gating techniques for energy saving
purposes.
III. PLATFORM DESCRIPTION
The presented platform consists of an MPSoC organized as
a cluster of tiles, interconnected through a packet-switched
NoC substrate. The MPSoC platform is controlled by a host
processor, which interfaces to the external network to receive
the stream of video and control packets, and also controls a
set of on-chip peripherals, such as a memory controller and a
video controller for data input/output.
As mentioned in Section I, the presented platform is targeted
at adaptive decoding of network video streams. In this application scenario, video packets are potentially interleaved with
other packets incoming from the network, therefore the host
processor is in charge of filtering out the unneeded packets and
passing on the video packets to the PPN H.264 video decoder.
In order to do so, the host processor runs OpenDPI [16], an
open source implementation of a Deep Packet Inspection (DPI)
engine. OpenDPI inspects network packets at the application level and returns a code for the detected protocol, after performing a protocol-specific function.
Figure 1 shows the hardware organization of the platform,
where the computational tiles are connected to the host processor (left side) through the NoC interconnect.
The computational tile internal structure is shown on the
right side of Figure 1. The processing element is the central
element of the tile, and no constraints have been put on its
architecture. Any kind of processor with standard bus-based
signal interface can be easily integrated. Processors with the
ability to support multiple pending memory operations inside
the memory units can be plugged as well, since the NoC
substrate supports multiple pending transactions per master
interface. No instruction set extensions are needed in the
processor ISA to perform communication operations, since
communication and synchronization mechanisms are entirely
managed accessing memory-mapped registers at the network
interface. The processing element interfaces with two dual-port memory modules for instruction and data storage, maximizing the bandwidth for data and instruction access from the
processing core. Each memory module is directly connected to

the processor through one of its two ports, while the other one
is used to handle message-passing send and receive communication in a DMA fashion. This feature enables the processor
to decouple computation from communication, covering (if
needed) a large part of the communication latency with useful
computation.
The NoC substrate consists of an enhanced instance of the ×pipes-lite library of synthesizable network components [17]. ×pipes is a scalable and configurable lightweight NoC,
whose topology and bandwidth can be customized. The tile
interfaces to the network through a Network Interface (NI).
The NI is in charge of constructing the packets according to
the transactions requested by the cores. For processing tiles,
the NI includes master and slave interfacing capabilities. While
the necessity to include a master interface is obvious, a slave
interface is required to support the message passing between
tiles. In addition to the NI, the Network Adapter (NA) has
been extended with support for message-passing within the
NoC.
A programmable message-passing handler (MPH) with DMA capabilities is integrated with the NI. The MPH includes a DMA engine (i.e., an address generator and memory interface) that allows the memory copies needed to execute message passing to be offloaded from the processor. In order to do so, the MPH exposes
a set of memory-mapped registers that are programmed by the
processor to control send and receive operations, setting the
receive/destination address within the NoC, the message tag
and message size. We will not go into details of the necessary
parameters for message passing, as they are commonly found
in all message-passing communication libraries. The other
exposed registers contain the memory addresses where the
data is to be found (for a send operation) or stored into (for a receive operation), and the data size (necessary to control the DMA engine).
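As a rough illustration of this register-based control, a send operation reduces to a handful of memory-mapped writes followed by a start command. The register names and offsets below are invented for the sketch; the paper does not detail the actual MPH register map.

```python
class MphModel:
    """Toy model of the MPH memory-mapped register file.

    Register names/offsets are hypothetical, for illustration only."""
    REG_DEST_ADDR = 0x00   # destination address within the NoC
    REG_MSG_TAG   = 0x04   # message tag
    REG_MSG_SIZE  = 0x08   # message size (controls the DMA engine)
    REG_SRC_ADDR  = 0x0C   # local memory address of the data to send
    REG_START     = 0x10   # writing 1 kicks off the DMA transfer

    def __init__(self):
        self.regs = {}
        self.sent = []

    def write(self, offset, value):
        self.regs[offset] = value
        if offset == self.REG_START and value == 1:
            # The DMA engine fetches the message from local memory and
            # injects it into the NoC; modeled here as a log entry.
            self.sent.append((self.regs[self.REG_DEST_ADDR],
                              self.regs[self.REG_MSG_TAG],
                              self.regs[self.REG_MSG_SIZE]))

def mph_send(mph, dest, tag, size, src_addr):
    """Program the MPH registers for a send, then trigger the DMA."""
    mph.write(MphModel.REG_DEST_ADDR, dest)
    mph.write(MphModel.REG_MSG_TAG, tag)
    mph.write(MphModel.REG_MSG_SIZE, size)
    mph.write(MphModel.REG_SRC_ADDR, src_addr)
    mph.write(MphModel.REG_START, 1)
```

Once the start register is written, the processor is free to resume computation while the DMA engine performs the copy, which is exactly the computation/communication decoupling described above.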
Similarly to all message-passing communication libraries,
while the send() message-passing operation does not require
any specific synchronization, the receive() message-passing
operation, which must always be explicitly executed by the
receiving process, may trigger some synchronization issues. In
detail, the receive operation might be called by the receiving
process before or after the actual data has arrived at the destination. In case the receive is invoked before the data has arrived, the MPH module will already be correctly programmed to process the incoming transaction and, upon reception of the data, it will directly store it at the final address in memory. Instead, if the data arrives at the destination before the
receive is invoked by the receiving process, the message data
is stored in the memory, into a buffer reserved for such a
purpose. The message identification fields (sender, tag, buffer
address) are stored inside an event file, in order to enable the
receive primitive, when invoked by the receiving process, to
retrieve the message from the memory. The message-passing
receive primitive scans the event file locations, to check if
the message under reception is already stored in the buffer.
If so, the processor copies the message data from the buffer
to the final memory address indicated by the programmer. If

Fig. 1. Hardware architecture overview and internal organization of the single tile

instead the message is not found in the event file, the processor
keeps polling the DMA handler, where dedicated circuitry is in charge of comparing the incoming messages with the expected message descriptors. In order to allow partial buffer
de-fragmentation, the buffer is treated as a list.
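The two arrival orders described above can be condensed into a short behavioral model. Names such as `event_file` mirror the text; everything else (the dictionary-based layout in particular) is an illustrative assumption, not the platform's implementation.

```python
class ReceiveModel:
    """Behavioral sketch of the receive() synchronization cases."""
    def __init__(self):
        self.event_file = {}   # (sender, tag) -> parked message data
        self.expected = {}     # (sender, tag) -> final destination
        self.memory = {}       # final address -> data

    def message_arrives(self, sender, tag, data):
        key = (sender, tag)
        if key in self.expected:
            # receive() already invoked: the MPH stores the data
            # directly at the final address, no extra copy.
            self.memory[self.expected.pop(key)] = data
        else:
            # Early arrival: park the message in the reserved buffer
            # and record its identification fields in the event file.
            self.event_file[key] = data

    def receive(self, sender, tag, dest):
        key = (sender, tag)
        if key in self.event_file:
            # Message already buffered: copy it to the final address.
            self.memory[dest] = self.event_file.pop(key)
        else:
            # Not yet arrived: program the expected-message descriptor
            # so the incoming transaction is delivered directly.
            self.expected[key] = dest
```

Either ordering ends with the data at the programmer-specified final address; only the early-arrival path pays the cost of the intermediate buffer copy.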
As will be described in the next section, the processor might also need, during application execution, to generate an asynchronous event on another processor, such as, for example, the initiation of the migration process. To this aim, the NA has also been enriched with a message tag decoder. Upon
reception of a message, the message tag is also compared by
the tag decoder against a set of pre-determined tag values.
If the tag decoder finds a match, it raises an interrupt signal
for the processor, communicating with the processor interrupt
controller.
Finally, the tile includes a set of performance counters that
can be used to monitor application execution, memory and
network access directly at the processor interface. The performance counters are directly interfaced to the data memory for
storage of the desired access traces.
All the logic included in the NA was explicitly designed to
support message passing inside the NoC while decoupling processor computation from network communication. While a certain amount of performance benefit can be extracted from overlapping computation and communication, there is an associated
area overhead that needs to be taken into account. This
overhead will be crucial, since all the processor tiles will
experience it. Moreover, the designed logic might also affect
the operating frequency of the NA, since it largely consists of combinational logic. In order to estimate this overhead, we synthesized the NA RTL description for FPGA and obtained the figures reported in Table I, comparing against a base NA architecture without the MPH.
Table I shows how the overhead introduced by the insertion
of the MPH module is quite significant, especially in terms of
operating frequency. However, in designing the MPH module
we focused on minimizing the message latency, therefore the
largest part of its logic is fully combinational. The design
could be easily modified to reduce the impact on the operating frequency; however, this would result in an increased latency

              base NA     NA
Slices        361         1010
Slice Reg.s   577         2589
Slice LUTs    879         3434
Freq.         135 MHz     103 MHz

TABLE I
NA AREA AND FREQUENCY FIGURES FOR FPGA

Fig. 2. Platform software and middleware stack

experienced by each message. The overhead in terms of area is negligible when we recall that the largest portion of the tile area occupation is attributable to the processor and memory modules.
IV. PLATFORM ADAPTIVITY
This section will describe in detail how the process migration is implemented in our platform. To do so, we will describe
the entire software and middleware stack of the platform, when
executing PPN applications. Figure 2 gives an overview of the
entire stack.
The application level occupies the top level of the stack.
As already mentioned in Section I, applications are specified
according to the PPN model of computation. The lowest level
of the stack is a basic OS that provides simple functionalities
such as process creation and destruction.
The middleware level is composed of three different components. The first one implements the communication between
PPN processes, and deals with the mapping of logical FIFO
channels on top of the NoC physical substrate. The mapping
must be able to preserve the PPN semantics, such as process blocking on a read from an empty FIFO or a write to a full FIFO. The communication layer is based on a request-driven approach [18].
Fig. 3. Producer-consumer inter-tile communication implementation

Figure 3 shows PPN producer and consumer processes communicating over a NoC physical communication infrastructure.
In our request-driven approach, each FIFO buffer of the
original PPN graph is split into two software buffers, one on
the producer tile and one on the consumer tile. For instance,
B1 in the top part of Fig. 3 is split in B1P on tile1 and
B1C on tile2 . The size of these buffers is defined such that
producer and consumer will have identical buffer sizes for
the channel that links them. Moreover, the transfer of tokens
from the producer tile to the consumer tile is initiated by the
consumer. Every time the consumer is blocked on a read at a
given FIFO channel, it sends a request to the producer to send
new tokens for that channel. The producer, after receiving this
request, sends as many tokens as it has in its software FIFO
implementing that channel.
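The request-driven protocol can be condensed into a small simulation: a blocked consumer issues a request, and the producer answers with every token currently held in its local half of the channel. This is an illustrative model only, not the middleware's actual code.

```python
class RequestDrivenChannel:
    """One logical PPN FIFO split into a producer-side and a
    consumer-side software buffer, as in the request-driven approach."""
    def __init__(self):
        self.producer_buf = []
        self.consumer_buf = []

    def produce(self, token):
        # The producer writes only into its local software buffer.
        self.producer_buf.append(token)

    def consume(self):
        if not self.consumer_buf:
            # Consumer blocks on an empty FIFO: it sends a request,
            # and the producer replies with all tokens it currently
            # holds for this channel.
            self.consumer_buf.extend(self.producer_buf)
            self.producer_buf.clear()
        if self.consumer_buf:
            return self.consumer_buf.pop(0)
        return None  # still blocked: the producer had no tokens either
```

Note that a request is only issued when the consumer-side buffer runs dry, so several consecutive reads can be served locally without any NoC traffic.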
When the request message is received at the producer tile,
it generates an interrupt for the processor. The mechanism
that implements the request interrupt generation has already
been presented, from the hardware point of view, in Section
III. Request messages are sent from the consumer using
dedicated tags, such that the tag decoder hardware module
at the producer will be able to detect the request reception
event and raise the interrupt signal to the processor. The
interrupt-based mechanism allows for a better computation-communication overlap, as the processor will now be relieved
of any polling for request messages.
The second middleware component implements the process migration mechanism, which is at the core of the proposed platform's system adaptivity. Figure 4 shows an example of a
migration process in our approach. The involved tiles are the
source and destination tile, directly affected by the actual task
migration. All the tiles that communicate with the source tile
in the PPN graph (both preceding and succeeding tiles) have
to be made aware of the process migration as well.
The migration mechanism requires actions from all the
involved tiles depicted in Figure 4. The migration decision,
namely which process has to be migrated and where, is
taken by the resource manager, when properly stimulated
through reception of a migration request packet. Then, the
resource manager sends a specific control message to the
source tile. The source tile broadcasts this control message
to the destination, predecessor and successor tiles to complete
the migration procedure.
For each of the tiles involved in the migration procedure,
the following list of actions is taken.
1) Actions on the source tile: On the source tile, the PPN
process is stopped and its state saved and forwarded to the
destination tile. A PPN process state is composed only of
its input and output FIFO buffers and its iterator set, which
identifies the current iteration of the PPN main loop [7]. The
source tile is also in charge of propagating the
migration decision to the other tiles involved in the migration
procedure (dashed arrows in Figure 4).
2) Actions on the destination tile: The destination tile
receives a specific message for process activation. Upon
reception of the migration message, the middleware creates
the software FIFOs and activates the replica of the migrated
process using an OS call. Resuming the process includes
copying the input and output FIFOs of the migrated process.
The execution on the destination tile will start from the
beginning of the iteration that has been interrupted on the
source tile.
3) Actions on predecessor and successor tile(s): These tiles
are required to update the middleware tables where the current
mapping of the processes in the system is stored.
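Since a PPN process state is just its iterator set plus the contents of its input/output FIFOs, the save/restore pair exchanged between source and destination tile is small. A sketch follows; the dictionary-based layout is invented for illustration.

```python
def save_process_state(iterators, in_fifos, out_fifos):
    """Snapshot taken on the source tile before migration: only the
    iterator set and the input/output FIFO contents are needed."""
    return {
        "iterators": dict(iterators),
        "in_fifos": {name: list(buf) for name, buf in in_fifos.items()},
        "out_fifos": {name: list(buf) for name, buf in out_fifos.items()},
    }

def restore_process_state(state):
    """On the destination tile: recreate the software FIFOs so the
    process can resume from the beginning of the interrupted iteration."""
    iterators = dict(state["iterators"])
    in_fifos = {n: list(b) for n, b in state["in_fifos"].items()}
    out_fifos = {n: list(b) for n, b in state["out_fifos"].items()}
    return iterators, in_fifos, out_fifos
```

Because no other process-local data belongs to the state, the size of this snapshot is bounded by the FIFO capacities, which is what makes the migration overhead predictable.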
The last component of the middleware is the runtime
manager, which is in charge of deciding where to migrate the
process. The runtime manager is executed in a centralized way,
on a single tile of the platform. However, all the tiles have the
potential to execute it. The runtime manager is activated upon
reception of a special packet that requests a process migration.
It is important to note that the policies implemented by the
runtime manager do not depend on the form of input that
signals the runtime manager to trigger a process migration. In
our platform, the runtime manager decides where to migrate
the process by checking a set of migration look-up tables
that are generated offline and include all the possible reconfiguration combinations.

Fig. 4. Migration scenario

A. Power Reduction Through Clock Gating

As mentioned in Section I, in order to reduce power consumption at runtime, the proposed platform includes an infrastructure that turns off the tiles that are not running any computational task. This operation employs hardware clock gating at the level of the single tile. The proposed platform includes a centralized unit, named the Clock Gating Manager
(CGM), that is connected to every computing tile of the mesh
through the NoC interconnect. The CGM generates the clock
enable signals for all the different clock gates of the platform.
Figure 1 shows the CGM as a node of the NoC.
In our platform, all the tiles can be turned off through the
clock enable signals, except the tile that runs the runtime
manager component of the middleware, since it is required
to de-activate and re-activate all the other tiles in a centralized
fashion. The CGM controls the clock gating of every tile in
the mesh. In detail, when a tile has finished migrating its
last process, the runtime manager signals the CGM to turn
that tile off. Conversely, before migrating a process to a tile
that is currently in idle state, the runtime manager will signal
the CGM to re-activate its clock.
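The CGM policy stated above amounts to one clock-enable bit per tile, driven by the runtime manager. A behavioral sketch (the class and method names are illustrative, not the hardware's):

```python
class ClockGatingManager:
    """Keeps one clock-enable bit per tile; the tile hosting the
    runtime manager can never be gated off, since it must be able to
    de-activate and re-activate the other tiles."""
    def __init__(self, n_tiles, manager_tile):
        self.manager_tile = manager_tile
        self.enable = [True] * n_tiles

    def tile_emptied(self, tile):
        """Called when the last process has been migrated off `tile`."""
        if tile != self.manager_tile:
            self.enable[tile] = False   # gate the tile's clock

    def tile_needed(self, tile):
        """Called before migrating a process onto an idle tile."""
        self.enable[tile] = True        # restore the tile's clock
```

The NoC routers sit outside this enable logic, mirroring the platform's constraint that the interconnect must keep operating while tiles are gated.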
When a tile is turned off through clock gating, the components of the tile that actually stop receiving an active clock
are the processor, the local memory and the Network Adapter.
The NoC router cannot be de-activated, since it is necessary
for the continued operation of the entire interconnect.
V. EXPERIMENTS
This section will describe the experiments that we ran
to evaluate the proposed adaptive network video streaming
platform. In the targeted application, the H.264 video packets
are detected by OpenDPI and assembled into a video frame,
which is then placed inside a frame buffer, where the H.264
decoder will fetch it for decoding. After the H.264 decoding,
the decoded video frame is directly output by the tile that
executes the last process in the PPN pipeline, using the video
controller. The left part of Figure 5 shows the frame buffer.
The PPN representation of the H.264 decoder is shown in
the right part of Figure 5. The process network consists of 6
processes.
Moreover, an additional protocol is used in our experiments
to send the reconfiguration request packet. This packet is used
to request the platform to reduce or increase the computational
resources allocated to H.264 decoding. The payload of this
packet contains the number of tasks to be migrated, with the
aim of freeing computational cores, or a reset command, to
return to the initial PPN process mapping.
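The payload described above suggests a trivially small packet format. A hypothetical one-byte encoding (0 = reset to the initial mapping, n > 0 = migrate n tasks; this is not the actual wire format used by the platform) could be parsed as:

```python
def parse_reconfig_payload(payload):
    """Decode a reconfiguration packet payload.

    Hypothetical encoding, for illustration: the first byte is 0 for a
    reset to the initial PPN mapping, or a positive count of tasks to
    migrate in order to free computational cores."""
    if not payload:
        raise ValueError("empty reconfiguration payload")
    n = payload[0]
    if n == 0:
        return ("reset", None)
    return ("migrate", n)
```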

Fig. 5. Process graph of the H.264 decoder PPN application

In order to evaluate the performance of the proposed platform, we built an FPGA prototype of the whole system. In the
recent past, FPGA prototyping has gained significant interest
as a convenient alternative to software simulation (in terms of

speed and accuracy) for assessing the performances of MPSoC


platforms [19]. In our case, the main advantage of FPGA
prototyping is that it allows to run the target application with
long execution times. The downside of FPGA prototyping is
that the performance numbers that can be extracted from the
prototype are significantly lower than those expected from an ASIC implementation of the platform. In our case, this will be reflected in reduced frames-per-second figures for the proposed
H.264 decoder. However, the experimental results are still
significant, when interpreted in terms of clock cycles, and hold
generally valid in terms of relative adaptivity overhead.
Our experimental FPGA setup consists of a Xilinx ML605
board, mounting an lx240t Virtex-6 FPGA device. Moreover, the development board includes a 512 MB DDR2 RAM external memory, along with a number of peripherals that we
use to interface to the network (Ethernet PHY), to output the
video frames (DVI video controller) and for serial input/output
(RS232 controller). On the FPGA device, we placed a synthesized version of the platform shown in Figure 1, with the same
organization and number of cores as shown in the figure. In
this prototype, the host processor (a Xilinx MicroBlaze in our
prototyping system) interfaces to the external network through
an Ethernet controller to receive the video stream packets. The
PPN H.264 decoding is performed on the tile processors (still
MicroBlazes in our prototype). After the H.264 decoding, the
decoded video frame is directly output by the tile that executes
the last process in the PPN pipeline, using the TFT video
controller. The host processor also interfaces to the external
memory through a DDR memory controller, for instructions
and data, and to other peripherals (UART, Debug, TFT video)
for data input/output. The used software toolchain was Xilinx
ISE v14.
The experiments that we performed were aimed at assessing the performance of the proposed platform in terms of system adaptivity. In a first set of experiments, while the platform was performing H.264 decoding, we incrementally requested through the Ethernet network connection a number of reconfigurations aimed at progressively reducing the number of active computational tiles, from 5 active tiles down to 2.
Figure 6 shows the performance figures that were extracted from the prototype for a decreasing number of active tiles. The numbers were obtained by counting the number of cycles necessary to decode each frame of the video and averaging over all the decoded frames. Numbers are expressed in frames per second, obtained by converting the cycle counts for an operating processor frequency of 100 MHz.
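The cycles-to-FPS conversion described above is simple arithmetic; the sketch below illustrates it (the example cycle count is illustrative, not a measured value from the prototype):

```c
#include <assert.h>
#include <stdint.h>

/* Convert an average per-frame cycle count into frames per second,
 * as done for the numbers in Figure 6. The 100 MHz clock matches the
 * prototype; a faster ASIC clock would scale the figure proportionally. */
static double cycles_to_fps(uint64_t avg_cycles_per_frame, double clk_hz)
{
    return clk_hz / (double)avg_cycles_per_frame;
}
```

For instance, a hypothetical average of 10 million cycles per frame at 100 MHz yields 10 FPS.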
It can be noted that process migration does not cause a steep performance degradation. In our experiments,
we programmed the remapping tables used by the runtime
resource manager in such a way that remappings do not add
processes to the tile which runs the most computationally
intensive nodes (i.e., parser and cavlc).
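A remapping table of this kind can be sketched as a small static array. The layout, process count, and tile indices below are our assumptions for illustration, not the platform's actual data structures; the one property carried over from the text is that the tile hosting the heavy parser and cavlc processes (tile 0 here) never receives migrated processes:

```c
#include <assert.h>

#define NUM_PROCESSES 6  /* illustrative PPN process count */

/* remap_table[active_tiles - 2][p] = tile assigned to process p.
 * Rows correspond to 2, 3, 4 and 5 active tiles. In every row,
 * tile 0 keeps only the heavy parser/cavlc processes (0 and 1). */
static const int remap_table[4][NUM_PROCESSES] = {
    {0, 0, 1, 1, 1, 1},  /* 2 active tiles */
    {0, 0, 1, 1, 2, 2},  /* 3 active tiles */
    {0, 0, 1, 2, 3, 3},  /* 4 active tiles */
    {0, 0, 1, 2, 3, 4},  /* 5 active tiles */
};

static int dest_tile(int active_tiles, int process)
{
    return remap_table[active_tiles - 2][process];
}
```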
Fig. 6. Prototyped platform performance in FPS for a decreasing number of active tiles

Fig. 7. Process migration example and relevant time overheads

We also assessed the relevant time overheads for the process migration mechanism as implemented in the proposed platform. Figure 7 shows a graphical representation of the active processes over time. At time t0, the H1 process (i.e. the idct task in the PPN in Figure 5) is stopped to be migrated from tile4 to tile5. The time interval between t0 and t1 is the time necessary to save the process state and prepare for migration.
We recall that the state of a process is composed only of its iterators and input/output FIFO buffers. Thus, by analyzing the PPN topology we can derive an upper bound on the state size of a process, which in turn is proportional to the time (t1 − t0). The time interval between t1 and t2 is the time necessary to transfer the process state to the newly-assigned computational tile. This time depends on the overall network topology, the current traffic and also on the re-mapping choices made by the runtime manager. The time interval between t2 and t3 is the time necessary to restore the process state in the new tile. Again, this time depends only on the application characteristics.
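Since the state is only the iterator set plus the FIFO contents, the upper bound mentioned above can be computed statically from the topology. A minimal sketch under that definition, with illustrative field names (the platform's real bookkeeping structures may differ):

```c
#include <assert.h>
#include <stddef.h>

/* One PPN channel attached to the process: bounded capacity in tokens
 * and a fixed token size in bytes, both known from the PPN topology. */
struct ppn_channel {
    size_t capacity_tokens;
    size_t token_bytes;
};

/* Worst-case migratable state: all iterators plus all attached FIFOs
 * completely full. This bounds the (t1 - t0) save time discussed above. */
static size_t state_upper_bound(size_t n_iterators,
                                const struct ppn_channel *ch, size_t n_ch)
{
    size_t bytes = n_iterators * sizeof(int); /* loop iterators */
    for (size_t i = 0; i < n_ch; ++i)
        bytes += ch[i].capacity_tokens * ch[i].token_bytes;
    return bytes;
}
```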
It is important to note that the PPN iteration involved with
the migration request is interrupted and needs to be re-started
on the new tile. The worst case occurs when the request is received toward the end of the iteration.
The second set of experiments that we performed on the proposed platform involves the ability to reduce power consumption at runtime through the clock gating mechanism described in Section IV-A. To this aim, we performed the same pattern of migration requests that we just discussed, and enabled clock gating on the idle tiles. This means that, after every task migration, we switched off the clock signal of a specific tile. We measured the runtime reduction of the power consumption of the FPGA device using the Xilinx System Monitor utility.
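Conceptually, switching off a tile's clock after a migration amounts to setting a bit in a gating mask. The sketch below separates the mask computation from the write to a memory-mapped gating register; the register address and bit layout are assumptions for illustration, not the platform's documented interface:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory-mapped clock-gating register: bit i set to 1
 * gates (stops) the clock of tile i. Address is an assumed value. */
#define CLK_GATE_REG_ADDR 0x80000010u

/* Pure mask update: mark one tile's clock as gated. */
static uint32_t gate_tile(uint32_t mask, int tile)
{
    return mask | (1u << tile);
}

/* On the real hardware, the updated shadow mask would then be
 * written back, e.g.:
 *   *(volatile uint32_t *)CLK_GATE_REG_ADDR = mask;              */
```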
Figure 8 shows the power figures for a decreasing number of active tiles. It can be noticed that the 30% performance degradation obtained by reducing the number of active tiles from 5 to 2 is matched by a 15% power saving resulting from clock gating of the idle tiles. It is important to mention that the reduction in power consumption that we can achieve through clock gating is in reality a reduction of dynamic power consumption, while static power consumption is not impacted by switching off the clock signal. Therefore, in terms of dynamic power only, the percentage of power reduction is significantly higher than 15%.
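This can be quantified with a one-line calculation: if static power is a fraction s of the total, a measured total saving of 15% corresponds to a dynamic-only saving of 0.15/(1 − s). The static fraction used below is an assumed value for illustration, not a measured one:

```c
#include <assert.h>

/* Dynamic-only power saving implied by a measured total saving,
 * given the static share of total power (assumed, device-dependent). */
static double dynamic_saving(double total_saving, double static_frac)
{
    return total_saving / (1.0 - static_frac);
}
```

For example, if half of the device power were static, the 15% total saving would correspond to a 30% saving in dynamic power alone.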

Fig. 8. Prototyped power reduction through clock gating for a decreasing number of active tiles

VI. C ONCLUSION
This paper presented an adaptive Network-on-Chip (NoC)-based Multi-Processor System-on-Chip (MPSoC) platform for video decoding applications that achieves system adaptivity and runtime power consumption reduction capabilities. System adaptivity was achieved by developing a set of runtime process migration mechanisms, while the reduction in power consumption was realized through clock gating at the level of the single computational tile. The paper presented the platform architecture organization, in terms of hardware components as well as middleware strategies. The platform executes a PPN
implementation of an H.264 decoder on a stream of video
packets coming from a network connection, where a front-end
packet inspection is performed by a host processor running
a deep packet inspection kernel, OpenDPI, to discriminate between video and special reconfiguration packets. Upon reception of a
reconfiguration packet from the network, the adaptive platform
performs an on-line reconfiguration that employs runtime PPN
process migration to modify the amount of computational
resources allocated to the execution of the H.264 decoder application. The experimental results obtained through a custom-developed FPGA prototype demonstrate the effectiveness of
the runtime adaptivity support, as well as how the graceful
performance degradation generated by the process remapping
is matched by a significant power saving, obtained through
tile-level clock gating.
REFERENCES
[1] B. Girod, M. Kalman, Y. J. Liang, and R. Zhang, "Advances in channel-adaptive video streaming," Wireless Communications and Mobile Computing, vol. 2, no. 6, pp. 573–584, 2002. [Online]. Available: http://dx.doi.org/10.1002/wcm.87
[2] J. Henkel and L. Bauer, "What is adaptive computing?" SIGDA Newsl., vol. 40, no. 5, pp. 1–1, May 2010. [Online]. Available: http://doi.acm.org/10.1145/1866966.1866967
[3] V. Nollet, D. Verkest, and H. Corporaal, "A Safari Through the MPSoC Run-Time Management Jungle," Signal Processing Systems, vol. 60, no. 2, pp. 251–268, 2010.
[4] J. M. Smith, "A survey of process migration mechanisms," 2001.
[5] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process migration," ACM Comput. Surv., vol. 32, no. 3, pp. 241–299, Sep. 2000. [Online]. Available: http://doi.acm.org/10.1145/367701.367728
[6] G. Kahn, "The semantics of a simple language for parallel programming," in IFIP Congress, 1974, pp. 471–475.
[7] S. Verdoolaege, "Polyhedral process networks," in Handbook of Signal Processing Systems, S. S. Bhattacharyya, E. F. Deprettere, R. Leupers, and J. Takala, Eds. Springer US, 2010, pp. 931–965. [Online]. Available: http://dx.doi.org/10.1007/978-1-4419-6345-1-33
[8] S. Verdoolaege, H. Nikolov, and T. Stefanov, "pn: a tool for improved derivation of process networks," EURASIP J. Emb. Sys., vol. 2007.
[9] G. M. Almeida, G. Sassatelli, P. Benoit, N. Saint-Jean, S. Varyani, L. Torres, and M. Robert, "An Adaptive Message Passing MPSoC Framework," Int. J. of Reconfigurable Computing, vol. 2009, p. 20, 2009.
[10] S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali, "Supporting task migration in multi-processor systems-on-chip: a feasibility study," in Proc. of the conf. on Design, automation and test in Europe, ser. DATE '06, 2006, pp. 15–20.
[11] A. Acquaviva, A. Alimonda, S. Carta, and M. Pittau, "Assessing Task Migration Impact on Embedded Soft Real-Time Streaming Multimedia Applications," EURASIP J. Emb. Sys., vol. 2008, 2008.
[12] W. Haid, K. Huang, I. Bacivarov, and L. Thiele, "Multiprocessor SoC software design flows," Signal Processing Magazine, IEEE, vol. 26, no. 6, pp. 64–71, 2009.
[13] J. Donald and M. Martonosi, "Techniques for multicore thermal management: Classification and new exploration," in Computer Architecture, 2006. ISCA '06. 33rd International Symposium on, 2006, pp. 78–88.
[14] T. Limberg, M. Winter, M. Bimberg, M. B. S. Tavares, H. Ahlendorf, M. G. Fettweis, H. Eisenreich, and G. Ellguth, "A heterogeneous MPSoC with hardware supported dynamic task scheduling for software defined radio."
[15] Tilera Corporation, "Tile processor architecture overview for the TILEPro series," 2013. [Online]. Available: http://www.tilera.com/scm/docs/UG120-Architecture-OverviewTILEPro.pdf
[16] Ipoque, "OpenDPI.org," http://www.opendpi.org.
[17] D. Bertozzi and L. Benini, "Xpipes: a network-on-chip architecture for gigascale systems-on-chip," Circuits and Systems Magazine, IEEE, vol. 4, no. 2, pp. 18–31, 2004.
[18] E. Cannella, O. Derin, P. Meloni, G. Tuveri, and T. Stefanov, "Adaptivity support for MPSoCs based on process migration in polyhedral process networks," VLSI Design, vol. 2012, Article ID 987209, 17 pages, February 2012.
[19] P. Meloni, S. Secchi, and L. Raffo, "An FPGA-based framework for technology-aware prototyping of multicore embedded architectures," Embedded Systems Letters, IEEE, vol. 2, no. 1, pp. 5–9, 2010.
