Emanuele Cannella
Abstract: Due to current and future technology trends, multicore processing systems are required to support
adaptivity to an ever-increasing extent. This requirement may
descend from demands of fault-tolerance as well as from dynamic
Quality-of-Service (QoS) management strategies, depending on
the targeted application and power budget. This paper presents a
Network-on-Chip (NoC)-based Multi-Processor System-on-Chip
(MPSoC) platform for video decoding applications that provides
system adaptivity and reduced power consumption. The platform
specifically targets execution of Polyhedral Process Network
(PPN) streaming applications. System adaptivity is achieved
through support for runtime migration of PPN processes between
different tiles, while the power consumption is reduced at runtime
through clock gating of inactive processing tiles. The details of
how the migration process and clock gating mechanisms are
implemented in the platform, both in hardware and middleware,
will be presented, along with a characterization of the introduced
overhead. In its standard operating mode, the adaptive platform
executes a PPN implementation of an H.264 decoder on a stream
of video packets coming from a network connection. The network
packets are analyzed through a deep packet inspection kernel,
OpenDPI, to distinguish between video and special reconfiguration packets. Upon reception of a reconfiguration packet from
the network, the adaptive platform performs an on-line reconfiguration that employs runtime PPN process migration to modify
the amount of computational resources allocated to execution
of the H.264 decoder application. The results demonstrate the
feasibility of the approach and its possible applicability to the
broader class of PPN streaming applications.
I. INTRODUCTION
Multi-Processor Systems-on-Chip (MPSoCs) are currently
employed as the computational infrastructure of numerous complex digital systems, ranging from workstations to battery-operated hand-held devices, in a large number of application
scenarios. Such variety in workload requirements poses a
number of new challenges, in addition to the classic need for
ever-increasing throughput performance.
[The research leading to these results has received funding from the European Community Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 248424, MADNESS project, from the ARTEMIS JU ASAM project, and from the Region of Sardinia, Young Researchers Grant, PO Sardegna FSE 2007-2013, L.R. 7/2007 "Promotion of the scientific research and technological innovation in Sardinia".]

System adaptivity is a valuable feature whose need is clearly
called for by at least three different technology trends. First,
modern application workloads are intrinsically dynamic and
composed of multiple interleaved different-priority applications, often sharing the same computing resources. Therefore,
their predictability is usually low and runtime adaptivity is a
useful feature in the quest for optimal resource management.
Second, limitations posed by the power budget are starting to
apply to more than just smartphones and hand-held devices,
therefore different operation modes characterized by different
power consumption and performance figures are generally
available and selectable at runtime, either under user or operating
system control. This aspect often translates into the need to power
down parts of the available computing resources, trading
performance for reduced power consumption. Finally,
modern technology scaling and transistor integration capacity
have revitalized the interest in component reliability. For
instance, future multi-core computing platforms will likely
feature very high core counts, posing the need for runtime
system adaptivity and task migration in order to maintain the
overall functionality in the presence of faulty cores.
This paper describes the design of an adaptive MPSoC platform for H.264 decoding of network streaming video packets.
Adaptive video streaming has gained, in the recent past, significant interest in the academic, industrial, and standardization
communities because of the many factors that benefit from
such adaptation capability [1]. Two of the many points of
interest of adaptive video streaming are the unpredictability of
common channel conditions, which reduces effective network
bandwidth, and variability of the computational power at the
receiver side across the plethora of embedded platforms that
integrate video reproduction capabilities. Moreover, a large
number of embedded devices have started to integrate OSs and run
multiple applications, requiring the receiving platform itself
to provide a degree of adaptivity when running video-based
applications.
The proposed platform is composed of different computing
tiles and a NoC communication infrastructure. The H.264
video decoder application and other possible target applications are written according to the Polyhedral Process Network
(PPN) model of computation. PPNs are commonly used to map
streaming applications onto multiprocessor systems, leveraging the parallelism offered by process pipelining. System adaptivity is provided in the form
of runtime PPN process migration among different computational tiles. Support for system adaptivity is implemented
through specific mechanisms that operate both at hardware and middleware level. At the hardware level, specific features
have been added to efficiently map the PPN communication
channels onto the NoC as well as to facilitate the process
migration mechanism. The middleware layer includes components that have been designed to efficiently realize
communication between the different PPN processes on top
of a NoC, as well as components that implement save and
restore of the PPN process state for the migration procedure.
Moreover, in order to reduce the overall power consumption,
the platform includes an infrastructure for clock gating that
operates at the tile level. By using this infrastructure, tiles
whose computational workloads are migrated can be turned
off to reduce dynamic power consumption in idle state.
The platform implements the process of on-line reconfiguration upon receiving a reconfiguration packet from a network
connection. In the presented experiments, the adaptive platform is synthesized as an FPGA-based prototype and executes
the H.264 decoder on a stream of video packets from an
external Ethernet network connection. Upon reception of the
reconfiguration packet, the adaptive platform will re-configure
the mapping of the PPN processes onto the computational
tiles, employing runtime PPN process migration to modify the
amount of computational resources allocated to execution of
the H.264 decoder application.
The remainder of the paper is organized as follows: Section
II presents an overview of the most relevant related work
regarding the mapping of PPN applications onto NoC-based
MPSoCs and PPN process migration as a way to improve system adaptivity for such applications and platforms. Section III
presents the hardware organization of the presented platform,
with particular emphasis on those features that specifically
contribute to execution of adaptive PPN applications on top
of the NoC. Section IV presents the middleware components
that realize PPN process communication and migration among
the different computing tiles, along with the implementation of
tile-level clock gating for power reduction purposes. Section
V describes the experiments that demonstrate the implemented
migration functionality and power reduction performance.
Finally, Section VI concludes the paper and gives an overview
of prospective platform improvements.
II. RELATED WORK
Runtime adaptivity will be a critical feature of future computing systems [2]. The design of adaptive systems requires
the creation of a number of mechanisms and paradigms at all
levels, including hardware architecture and software/middleware organization. In the area of embedded computing systems, where constraints such as cost and power consumption
are the most critical, system adaptivity has been a major
topic of research in the recent past, and a large number of
run-time management methodologies have been studied [3].
Among such methodologies, runtime process migration is one
of the most commonly addressed solutions, investigated for
purposes such as dynamic load balancing and system-level
fault resilience [4],[5].
Kahn Process Networks (KPNs) constitute a well-established model of computation used to describe streaming
applications [6], where computational processes operate in
parallel and communicate using unbounded FIFO channels.
KPNs do not require the underlying computing resources to
share any memory between different processes, therefore they
are often implemented on top of physically distributed memory
architectures. Polyhedral Process Networks (PPNs) [7] belong
to the class of KPNs and have some peculiarities that facilitate
their mapping on top of MPSoC platforms. For instance, it is
possible to derive the buffer sizes that guarantee the absence
of deadlocks in a PPN. A PPN process is forced to stall
when trying to read from an empty FIFO communication
channel or write into a full FIFO communication channel.
Moreover, it is possible to automatically translate a sequential
program that satisfies a set of structural constraints into a PPN
representation [8], at the same time determining the size of the
communication buffers that guarantee deadlock-free execution
of the PPN.
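These blocking semantics can be illustrated with a minimal C sketch of a bounded PPN channel, in which the read and write primitives report when the calling process must stall. The type names, the PPN_BLOCKED status, and the compile-time capacity are illustrative assumptions, not taken from the platform.

```c
#include <stddef.h>

#define FIFO_CAP 4  /* illustrative capacity; in a PPN it is derived to be deadlock-free */

typedef struct {
    int    data[FIFO_CAP];
    size_t head, tail, count;
} ppn_fifo_t;

typedef enum { PPN_OK, PPN_BLOCKED } ppn_status_t;

/* A PPN process stalls when writing into a full channel... */
ppn_status_t ppn_write(ppn_fifo_t *f, int token) {
    if (f->count == FIFO_CAP) return PPN_BLOCKED;  /* caller must stall */
    f->data[f->tail] = token;
    f->tail = (f->tail + 1) % FIFO_CAP;
    f->count++;
    return PPN_OK;
}

/* ...and when reading from an empty channel. */
ppn_status_t ppn_read(ppn_fifo_t *f, int *token) {
    if (f->count == 0) return PPN_BLOCKED;         /* caller must stall */
    *token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return PPN_OK;
}
```

A scheduler would suspend the process whenever a primitive returns PPN_BLOCKED and resume it once the channel state changes.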
These characteristics make PPNs a very appealing model of computation for implementing runtime system adaptivity through process migration. Process
migration in distributed-memory MPSoC platforms has been
investigated for networks of processes. In [9] a NoC-based
distributed-memory MPSoC platform is described, where
system adaptivity is achieved through process migration. The
remapping decision is delegated to an OS-level runtime monitor, and different remapping policies are presented and evaluated. However, the task remapping can happen only at specific
points in time, namely when process communication events
are processed, therefore reducing the effective responsiveness
of the adaptivity mechanism. In our platform, we allow task
remapping events to happen at any time during execution.
In [10] and [11], two other approaches to system-level
adaptivity that rely on task migration are presented. In both
these works, the user is in charge of defining checkpoints at
which the task re-mapping can take place. Again, our approach
to task migration does not burden the programmer with such a
task. Moreover, the mentioned works assume shared-memory
support from the underlying hardware or middleware,
while our approach relies completely on a distributed memory
hierarchy. We believe this is a more general case, as abstract
shared-memory primitives can always be implemented on top
of a physically distributed memory.
The mapping of PPNs, and of KPN applications more generally,
on top of MPSoC embedded computing platforms
has been extensively studied in the literature. Different alternatives
exist when mapping the logical point-to-point KPN communication channels onto the available underlying communication
infrastructure. The main trade-off in mapping KPN applications is between the generality of the approach and its overall
performance. In [12], a survey of mapping
strategies for KPN applications is presented. The majority of
the approaches at the state of the art focus on optimizing
the application on the underlying MPSoC architecture and
therefore lack generality. In our platform, we try to remain
the processor through one of its two ports, while the other one
is used to handle message-passing send and receive communication in a DMA fashion. This feature enables the processor
to decouple computation from communication, covering (if
needed) a large part of the communication latency with useful
computation.
The NoC substrate consists of an enhanced instance of
the pipes-lite library of synthesizable network components
[17]. pipes is a scalable and configurable lightweight NoC,
whose topology and bandwidth can be customized. The tile
interfaces to the network through a Network Interface (NI).
The NI is in charge of constructing the packets according to
the transactions requested by the cores. For processing tiles,
the NI includes master and slave interfacing capabilities. While
the necessity to include a master interface is obvious, a slave
interface is required to support the message passing between
tiles. In addition to the NI, the Network Adapter (NA) has
been extended with support for message-passing within the
NoC.
A programmable message handler (MPH) with DMA capabilities is integrated with the NI. The MPH includes a DMA
engine (i.e., an address generator and a memory interface) that
allows the memory copies needed for message passing to be
offloaded from the processor. To do so, the MPH exposes
a set of memory-mapped registers that are programmed by the
processor to control send and receive operations, setting the
receive/destination address within the NoC, the message tag
and message size. We will not go into the details of the necessary
parameters for message passing, as they are commonly found
in all message-passing communication libraries. The other
exposed registers contain the memory addresses where the
data is to be found (for a send operation) or stored into (for
a receive operation) and the data size (necessary to control the
DMA engine).
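As a sketch of this programming model, the following C fragment shows how a send operation could be set up on such a register file. The register layout, field names, and control encoding are hypothetical illustrations; the paper does not specify them.

```c
#include <stdint.h>

/* Hypothetical memory-mapped register layout of the MPH; offsets,
 * names, and control values are illustrative only. */
typedef struct {
    volatile uint32_t dest_addr;  /* destination address within the NoC */
    volatile uint32_t tag;        /* message tag (later matched by the tag decoder) */
    volatile uint32_t size;       /* message size in bytes, drives the DMA engine */
    volatile uint32_t mem_addr;   /* local memory address of the payload */
    volatile uint32_t ctrl;       /* command register: start send or arm receive */
} mph_regs_t;

enum { MPH_CTRL_SEND = 1, MPH_CTRL_RECV = 2 };

/* Program the MPH registers for a DMA-driven send and kick it off. */
void mph_send(mph_regs_t *mph, uint32_t dest, uint32_t tag,
              const void *buf, uint32_t size) {
    mph->dest_addr = dest;
    mph->tag       = tag;
    mph->size      = size;
    mph->mem_addr  = (uint32_t)(uintptr_t)buf;
    mph->ctrl      = MPH_CTRL_SEND;   /* start the DMA transfer */
}
```

In a real driver, mph_regs_t would be a pointer to a fixed physical address; the processor returns to useful computation as soon as the control register is written.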
As in all message-passing communication libraries, while the
send() operation does not require any specific synchronization,
the receive() operation, which must always be explicitly executed
by the receiving process, raises a synchronization issue: the
receive may be called by the receiving process before or after
the actual data has arrived at the destination.
If the receive is invoked before the data arrives at the
destination, the MPH module will already be correctly
programmed to process the incoming transaction and, upon
reception of the data, will store it directly at the final address
in memory. If, instead, the data arrives at the destination before the
receive is invoked by the receiving process, the message data
is stored in memory, in a buffer reserved for this
purpose. The message identification fields (sender, tag, buffer
address) are stored inside an event file, in order to enable the
receive primitive, when invoked by the receiving process, to
retrieve the message from the memory. The message-passing
receive primitive scans the event file locations, to check if
the message under reception is already stored in the buffer.
If so, the processor copies the message data from the buffer
to the final memory address indicated by the programmer. If
instead the message is not found in the event file, the processor
keeps polling the DMA handler, where dedicated circuitry
compares the incoming messages with the
expected message descriptors. In order to allow partial buffer
de-fragmentation, the buffer is treated as a list.
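The event-file lookup path of the receive primitive can be sketched as follows. The event_entry_t layout and the fixed-size event file are illustrative assumptions about how the described bookkeeping could look, not the platform's actual data structures.

```c
#include <string.h>
#include <stdint.h>
#include <stdbool.h>

#define EVENT_SLOTS 8   /* illustrative number of event-file entries */

/* One event-file entry: identifies a message that arrived early and
 * was parked in the reserved buffer. */
typedef struct {
    bool     valid;
    uint32_t sender, tag;
    void    *buf;        /* where the MPH stored the early-arrived payload */
    uint32_t size;
} event_entry_t;

static event_entry_t event_file[EVENT_SLOTS];

/* Scan the event file for a (sender, tag) match; on a hit, copy the
 * payload to its final destination and release the slot (allowing the
 * buffer list to be de-fragmented). Returns false when the message is
 * not buffered yet; the real primitive would then poll the DMA handler. */
bool recv_from_event_file(uint32_t sender, uint32_t tag, void *dst) {
    for (int i = 0; i < EVENT_SLOTS; i++) {
        event_entry_t *e = &event_file[i];
        if (e->valid && e->sender == sender && e->tag == tag) {
            memcpy(dst, e->buf, e->size);
            e->valid = false;   /* release the buffer slot */
            return true;
        }
    }
    return false;
}
```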
As it will be described in the next section, the processor
might also need, during application execution, to generate an
asynchronous event on another processor, such as, for example, the initiation of the migration process. To this end, the
NA has also been enriched with a message tag decoder. Upon
reception of a message, the message tag is also compared by
the tag decoder against a set of pre-determined tag values.
If the tag decoder finds a match, it raises an interrupt signal
for the processor, communicating with the processor interrupt
controller.
Finally, the tile includes a set of performance counters that
can be used to monitor application execution, memory and
network access directly at the processor interface. The performance counters are directly interfaced to the data memory for
storage of the desired access traces.
All the logic included in the NA was explicitly designed to
support message-passing inside the NoC, decoupling processor
computation from network communication. While a certain
amount of performance benefit can be extracted from overlapping computation and communication, there is an associated
area overhead that needs to be taken into account. This
overhead matters, since all the processor tiles
incur it. Moreover, the designed logic might also affect
the operating frequency of the NA, since it heavily consists
of combinational logic. In order to estimate such overhead,
we synthesized for FPGA the NA hardware RTL description,
and obtained the figures that we report in Table I, comparing
against a base NA architecture without MPH.
Table I shows how the overhead introduced by the insertion
of the MPH module is quite significant, especially in terms of
operating frequency. However, in designing the MPH module
we focused on minimizing the message latency; therefore, the
largest part of its logic is fully combinational. The design
could easily be modified to reduce the impact on the operating
frequency; however, this would result in an increased latency.
TABLE I
NA AREA AND FREQUENCY FIGURES FOR FPGA

           Slices   Slice Reg.s   Slice LUTs   Freq.
  base NA     361           577          879   135 MHz
  NA         1010          2589         3434   103 MHz
Fig. 2. (Figure: PPN application processes P1-P3 running on top of the middleware, which comprises PPN communication, process migration, and run-time manager components, mapped onto tile1-tile4.)

Fig. 3. (Figure: producer P on tile1 and consumer C on tile2, connected through their NIs over the NoC; channels ch1 and ch2 are implemented by the split software buffers B1P/B1C and B2P/B2C, with request messages flowing from consumer to producer.)
Figure 3 shows a pair of PPN producer and consumer processes communicating over a NoC physical communication infrastructure.
In our request-driven approach, each FIFO buffer of the
original PPN graph is split into two software buffers, one on
the producer tile and one on the consumer tile. For instance,
B1 in the top part of Fig. 3 is split into B1P on tile1 and
B1C on tile2 . The size of these buffers is defined such that
producer and consumer will have identical buffer sizes for
the channel that links them. Moreover, the transfer of tokens
from the producer tile to the consumer tile is initiated by the
consumer. Every time the consumer is blocked on a read at a
given FIFO channel, it sends a request to the producer to send
new tokens for that channel. The producer, after receiving this
request, sends as many tokens as it has in its software FIFO
implementing that channel.
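The consumer side of this request-driven protocol can be sketched in C as follows. chan_buf_t and the send_request hook are stand-ins for the platform's software buffers and message-passing primitives, and the busy-wait abstracts the blocking behavior; none of these names come from the actual implementation.

```c
#include <stddef.h>

#define CHAN_CAP 4   /* illustrative capacity of the consumer-side buffer */

/* Consumer-side software buffer (e.g., B1C) of a split PPN channel. */
typedef struct {
    int    data[CHAN_CAP];
    size_t head, tail, count;
    int    chan_id;
} chan_buf_t;

/* Hook that sends a request message to the producer tile for chan_id;
 * a stand-in for the real message-passing primitive. */
typedef void (*request_fn_t)(int chan_id);

/* Blocking read on the consumer tile: if the local buffer is empty,
 * request new tokens from the producer, then wait until they arrive
 * (deposited into the buffer by the MPH/DMA in the real platform). */
int ppn_consumer_read(chan_buf_t *b, request_fn_t send_request) {
    if (b->count == 0) {
        send_request(b->chan_id);
        while (b->count == 0)
            ;   /* stall until tokens are delivered */
    }
    int token = b->data[b->head];
    b->head = (b->head + 1) % CHAN_CAP;
    b->count--;
    return token;
}
```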
When the request message is received at the producer tile,
it generates an interrupt for the processor. The mechanism
that implements the request interrupt generation has already
been presented, from the hardware point of view, in Section
III. Request messages are sent from the consumer using
dedicated tags, such that the tag decoder hardware module
at the producer will be able to detect the request reception
event and raise the interrupt signal to the processor. The
interrupt-based mechanism allows for a better computation-communication overlap, as the processor will now be relieved
of any polling for request messages.
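The producer's reaction to the request interrupt can be sketched in the same style: the handler drains the local software FIFO for the requested channel, forwarding every buffered token to the consumer tile. The send hook stands in for an MPH-programmed message-passing send and is an illustrative assumption.

```c
#include <stddef.h>

#define CHAN_CAP 4   /* illustrative capacity of the producer-side buffer */

/* Producer-side software buffer (e.g., B1P) of a split PPN channel. */
typedef struct {
    int    data[CHAN_CAP];
    size_t head, tail, count;
} prod_buf_t;

/* Hook that forwards one token to the consumer tile; in the real
 * platform this would program a send on the MPH. */
typedef void (*send_token_fn_t)(int consumer_tile, int token);

/* Invoked when the tag decoder raises the request interrupt: send as
 * many tokens as the producer currently holds for this channel.
 * Returns the number of tokens sent. */
size_t on_request_irq(prod_buf_t *b, int consumer_tile, send_token_fn_t send) {
    size_t sent = 0;
    while (b->count > 0) {
        send(consumer_tile, b->data[b->head]);
        b->head = (b->head + 1) % CHAN_CAP;
        b->count--;
        sent++;
    }
    return sent;
}
```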
Fig. 4. Migration scenario. (Figure: process P2 and its channel buffers B1 and B2 are migrated from the source tile, tile1, to the destination tile, tile3; the predecessor tile, tile0, and the successor tile, tile2, host P1 and P3, under control of the run-time manager.)

Fig. 5.
In order to evaluate the performance of the proposed platform, we built an FPGA prototype of the whole system. In the
recent past, FPGA prototyping has gained significant interest
as a convenient alternative to software simulation (in terms of
gating on the idle tiles. This means that, after every task
migration, we switched off the clock signal of a specific tile.
We measured the runtime reduction of power consumption of
the FPGA device using the Xilinx System Monitor utility.
Figure 8 shows the power figures for a decreasing number
of active tiles. It can be noticed how the 30% performance
degradation obtained by reducing the number of active tiles from
5 to 2 is matched by a 15% power saving resulting from
clock gating of the idle tiles. It is important to mention
that the reduction in power consumption that we can achieve
through clock gating is in reality a reduction of dynamic power
consumption, while static power consumption is not impacted
by switching off the clock signal. Therefore, in terms of
dynamic power alone, the percentage of power reduction is
significantly higher than 15%.
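This point can be made concrete with a short calculation; the static/dynamic split $s$ below is an assumed, purely illustrative parameter, since the prototype's actual split is not reported here. Writing $P_{\mathrm{tot}} = P_{\mathrm{dyn}} + P_{\mathrm{sta}}$ and $s = P_{\mathrm{sta}}/P_{\mathrm{tot}}$, a measured total reduction $r_{\mathrm{tot}} = \Delta P / P_{\mathrm{tot}}$ that comes entirely from dynamic power corresponds to a dynamic-only reduction of

$$ r_{\mathrm{dyn}} \;=\; \frac{\Delta P}{P_{\mathrm{dyn}}} \;=\; \frac{r_{\mathrm{tot}}}{1 - s}. $$

For example, under the illustrative assumption $s = 0.5$, the observed $r_{\mathrm{tot}} = 15\%$ would correspond to a $30\%$ reduction of dynamic power.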
VI. CONCLUSION
This paper presented an adaptive Network-on-Chip (NoC)-based Multi-Processor System-on-Chip (MPSoC) platform for
video decoding applications that achieves system adaptivity
and runtime power consumption reduction capabilities. System
adaptivity was achieved by developing a set of runtime process
migration mechanisms, while the reduction in power consumption was realized through clock gating at the level of the
single computational tile. The paper presented the platform