Hard and Soft Embedded FPGA Processor Systems Design: Design Considerations and Performance Comparisons

International Journal of Engineering and Technology Volume 3 No. 11, November, 2013
Vincent A. Akpan
Department of Physics Electronics,
The Federal University of Technology, P.M.B. 704 Akure, Ondo State, Nigeria.
ABSTRACT
This paper presents novel and efficient hardware/software co-design techniques for the development of high-performance embedded
processor systems targeting field programmable gate arrays (FPGAs). Some very important and critical design considerations for
developing FPGA embedded processor systems are first presented. Next, the architectures of the IBM hard-core PowerPC™440 and the
Xilinx soft-core MicroBlaze™ processors are introduced together with comprehensive techniques for FPGA embedded processor
system design. Then, two embedded processor systems are designed, implemented on the Virtex-5 FX70T ML507 FPGA development
board, tested, and evaluated on the industry-standard FPGA benchmark DMIPs (Dhrystone million instructions
per second). The two embedded processor systems are based on: 1) the IBM PowerPC™440 hard processor core and 2) the Xilinx MicroBlaze™
soft processor core. Experimental results show that the IBM hard-core PowerPC™440 embedded processor system outperforms
the Xilinx soft-core MicroBlaze™ embedded processor system in terms of FPGA device consumption and maximum operating
frequency for the DMIPs benchmark implementation. The DMIPs benchmark performance results indicate that the embedded processor
systems are highly optimized and can be deployed for the development of real-time embedded processor systems. Finally, a brief
conclusion and some discussions on future directions are given.
Keywords: Embedded processor system design, Dhrystone benchmark, field programmable gate array (FPGA), embedded PowerPC™440
processor core, embedded MicroBlaze™ processor core, Virtex-5 FX70T ML507 FPGA.
ISSN: 2049-3444 © 2013 – IJET Publications UK. All rights reserved.

I. INTRODUCTION

Embedded systems are now widespread in domestic and industrial systems (appliances and applications). As system complexity increases under real-time constraints, embedded system design becomes more complex, and meeting the real-time constraints then depends on the computational power and efficiency of the embedded platform. An embedded system is basically software implemented on hardware in order to perform and realize specific real-time functionalities. Traditionally, embedded systems were designed and realized using off-the-shelf microprocessors, microcontrollers, digital signal processors (DSPs), and application-specific integrated circuits (ASICs) [1]. However, in recent times, embedded system designs have been directed towards the use of field programmable gate arrays (FPGAs) [1]–[4].

Recently, investigations and surveys on the use of FPGAs in industrial control applications have been reported [3]–[5], where it has been proposed that FPGAs can be configured to solve computationally intensive tasks for real-time applications. For example, an FPGA-based framework for prototyping multi-core embedded architectures has been proposed in [5], although no embedded processor was designed or implemented. A comparison of embedded system design for industrial applications using microprocessors, microcontrollers, DSPs, ASICs, and FPGAs indicated that FPGAs are more suitable for such tasks, and several references are provided for justification [3], [4].

The field programmable gate array (FPGA) is a general-purpose device populated with digital logic building blocks [3], [4], [6]. The most primitive FPGA building block is called a logic element (LE) by Altera [7] or a logic cell (LC) by Xilinx [8]. Altera and Xilinx are two world market leaders in the FPGA industry. Besides Altera and Xilinx, many other FPGA companies exist, but their products are not discussed here. In either case, the FPGA building block consists of a look-up table (LUT) for logical functions and flip-flops for storage. In addition to the LC/LE block, FPGAs also contain memory, clock management, input/output and multiplication blocks.

According to Jeff Bier [9]: “the next time you’re choosing an embedded processor, you should consider an FPGA”. Embedding a processor inside an FPGA has many advantages [Fletcher], along with several challenges in the design and implementation of the embedded processor [9]–[11]. A thorough literature survey shows that FPGAs are not widely used as embedded processors, whether for relatively simple or for complicated real-time embedded applications [2], [12]. The performance of an embedded processor in a product is often a key product differentiator [11].

An embedded system design is a complex task since it consists of both software and hardware portions; thus the difficulty of generating a design from a set of requirements and specifications becomes more complex [2]. Another problem is that experimenting with mixed hardware/software solutions can be a time-consuming process due to the historical
disconnect between software development methods and low-level methodologies required for hardware design and synthesis for implementation on an FPGA. For many FPGA applications, the complete hardware/software application is represented by a collection of software and hardware source files that are not easily compiled, simulated or debugged with a single tool set. In addition, because the hardware design process is relatively inefficient, hardware and software design cycles may be out of sync, requiring system interfaces, fundamental software/hardware partitioning decisions and algorithm designs to be prematurely locked down.

Furthermore, one of the most exciting developments in FPGAs in recent years is the emergence of hard and soft FPGA-embedded processors. These processors include the Xilinx MicroBlaze™, IBM PowerPC™440, Altera® Nios™ II, and others. There are, however, challenges to using FPGAs as software platforms: software programmers may not have the skills or, indeed, the desire to make use of hardware design tools or hardware-oriented languages such as VHDL and Verilog. Software programmers using FPGAs may also be faced with design methods that are new and unfamiliar, including the need to efficiently partition applications between hardware and software. The same arguments may hold for most hardware designers who are not familiar with software programming or engineering.

To address the above issues, several design methodologies have emerged, notably embedded systems design tools from Altera [7] and Xilinx [8]; the choice of FPGA design tool poses additional challenges. In previous work, a critical overview of embedded system design technologies based on Xilinx system design tools has been reported [1]. In another study [2], embedded processor system design methodologies from a model-based design viewpoint have also been reported, where all of the Xilinx design tools were discussed and techniques for integrating these tools to achieve efficient FPGA embedded system design were proposed.

In this paper, critical design considerations and techniques for embedded processor system design are presented. Then, two embedded processor systems are designed, tested and evaluated on an industry-standard benchmark. The first embedded processor is a hard-core IBM PowerPC™440 processor [13]–[15] and the second is a soft-core Xilinx MicroBlaze™ processor [16]. The performance and device utilization of the two embedded processor systems implemented on the Xilinx Virtex-5 FX70T ML507 FPGA development board are compared. Finally, a brief conclusion is given and some remarks are made with directions for future work.

II. OVERVIEW OF EMBEDDED PROCESSOR SYSTEMS AND DESIGN CONSIDERATIONS

A. Why Embedding a Processor Inside an FPGA

Embedding a processor inside an FPGA has many advantages. Specific peripherals can be chosen to improve performance based on the application, with unique user-defined peripherals being easily attached. Likewise, large banks of external memory can be connected to the FPGA and accessed by the embedded processor system using included memory controllers. A variety of memory controllers enhance the FPGA embedded processor system’s interface capabilities. FPGA embedded processors use general-purpose FPGA logic to construct internal memory, processor buses, internal peripherals, and external peripheral controllers including external memory controllers. As more buses, memory controllers, peripherals and peripheral controllers are added to the embedded processor system, the system becomes increasingly more powerful and useful.

However, it is worth noting that the addition of large banks of external memory may increase the latency of accessing this external memory and may have a negative impact on performance. In addition, adding many peripherals and memories, as well as their respective controllers, may reduce performance and increase the embedded system cost, since they consume FPGA resources.

FPGA manufacturers often publish embedded processor performance benchmarks. The manufacturers obviously know what must be done to get the best out of their FPGAs for each specific benchmark, and they take full advantage of every possible enhancement strategy when benchmarking. A clue to these strategies is that the FPGA embedded processor system constructed to run the benchmark has very few peripherals and runs exclusively from internal memory. However, no easy formula or chart exists that shows how to compare the performance and cost of different memory strategies and peripheral sets. The usual performance benchmark is the Dhrystone benchmark implementation, used to evaluate the Dhrystone million instructions per second (DMIPs) performance measured in terms of the maximum FPGA operating frequency (fmax) in MHz [17], [18]. It is then left to the users of such FPGAs to achieve the maximum frequency and DMIPs quoted by the manufacturers.

B. Some Advantages and Disadvantages of FPGA Embedded Processor Systems

Embedded systems are normally defined as software implemented in hardware in order to realize specified real-time functionalities. The processing hardware normally used includes microcontrollers, microprocessors, FPGAs, digital signal processors (DSPs), and application-specific integrated circuits (ASICs), each of which has its own properties. Although FPGA hardware technologies have attracted ever-increasing interest and have significantly disrupted embedded system design technologies, it is worth considering some advantages and disadvantages that may be derived or incurred by the use of FPGA embedded processor technologies.

Here, some advantages of an FPGA embedded processor system when compared to an off-the-shelf processor are summarized in the following:

1) Hardware Acceleration: The most compelling reason for an FPGA embedded processor is the ability to make trade-offs between hardware and software to maximize efficiency and performance. Suppose an algorithm is identified as
a bottleneck; a custom co-processor can then be designed in the FPGA specifically for that algorithm [19]. This co-processor can be attached as a peripheral to the FPGA embedded processor as a co-processing engine through special, low-latency channels, and custom instructions can be defined to implement the co-processor.
2) Peripheral Customization: An FPGA embedded processor-based system offers complete flexibility in the selection of any combination of peripherals or controllers. In fact, new unique peripherals can be designed and connected directly to the processor’s bus, on the assumption that there are no standard requirements for the peripherals.
3) Components and Cost Reduction: With the versatility of the FPGA embedded processor, a previous system that required multiple components can be replaced with a single FPGA, such as in the case when an auxiliary input/output (I/O) chip or a co-processor is required next to an off-the-shelf processor. Reducing the component count in the design reduces board size and simplifies inventory management, both of which can shorten time-to-market and reduce cost.
4) Component Obsolescence Mitigation: Obsolescence mitigation is a difficult issue when a design requirement must ensure that a product lifespan be much longer than the typical lifespan of a standard electronics product. In this case, FPGA embedded soft-processors could be an excellent solution since the HDL source code for the soft-processor can be purchased and owned, thereby guaranteeing the lifespan of the product.

Additionally, some disadvantages and challenges of an FPGA embedded processor system when compared to an off-the-shelf processor are discussed in the following.

First, it is worth noting that the FPGA embedded processor is not without disadvantages. When compared to an off-the-shelf processor, the hardware platform for the FPGA embedded processor must be designed as above, which is a challenging hardware-software co-design task. Because of the integration of the hardware and software platforms, the design tools are more complex, especially when a co-processing custom peripheral is involved. The increased tool complexities and design methodologies require that critical decisions be made and adequate attention be invested.

Next, since FPGA embedded processor software design is still relatively new compared to software design for standard processors, the software design tools are likewise relatively immature, although workable with several challenges.

Finally, in terms of design cost, if the desired task can be achieved with a standard off-the-shelf microprocessor, microcontroller or digital signal processor (DSP) that is less expensive than the FPGA, then using a large FPGA with unused gates or processor makes the FPGA embedded processor system uneconomical.

Based on the acceptability of Xilinx’s FPGAs and design tools as leading-edge in the FPGA design industry [1], [20], the Xilinx systems design tools have been chosen for the embedded processor system development, while the Xilinx Virtex-5 XC5VFX70T ML507 FPGA development board has been chosen due to its embedded features and capabilities [1], [8], [12], [20].

C. Xilinx’s Virtex-5 FX70T Embedded Hard-Core PowerPC™440 and Soft-Core MicroBlaze Processors

A processor built from dedicated silicon is referred to as a “hard” processor, such as the IBM PowerPC™440 embedded processor core inside the Xilinx Virtex-5 FXT family of FPGAs. On the other hand, a “soft” processor is built using the FPGA’s general-purpose logic, such as the Xilinx MicroBlaze™ embedded processor core available as an intellectual property (IP) core for implementation in several Xilinx series of FPGAs. The soft processor is typically described in a hardware description language (HDL) or netlist. Unlike the hard processor, the soft processor must be synthesized and fit into the FPGA fabric. In both hard and soft processor systems, the local memory, internal peripherals, peripheral and memory controllers, and processor buses must be built from the FPGA’s general-purpose logic.

1) Embedded Hard-Core PowerPC™440 Processor

The Xilinx Virtex®-5 FXT FPGAs introduce an embedded processor block for PowerPC™ 440 (PPC440) processor designs [15]. This block contains the PowerPC™ 440x5 32-bit embedded processor developed by IBM [13], [14]. The PowerPC 440x5 processor implements the IBM Book E: Enhanced PowerPC™ Architecture. The PowerPC™ 440’s high-speed, superscalar design and Book E Enhanced PowerPC™ architecture put it at the leading edge for high performance system-on-a-chip (SOC) designs. The PowerPC™ 440 core combines the performance and features of standalone microprocessors with the flexibility, low power, and modularity of embedded CPU cores.

A typical system on a chip design with the PPC440 Core uses the IBM CoreConnect™ bus structure for system level communication [13]. High bandwidth peripherals and the PPC440 core communicate with one another over the processor local bus (PLB). Less demanding peripherals share the on-chip peripheral bus (OPB) and communicate to the PLB through the OPB Bridge. The PLB and OPB provide common interfaces for peripherals and enable quick turnaround, custom solutions for high volume applications. A typical architectural example of the PowerPC™440 Core-based system on a chip, illustrating the two-level bus structure and modular core-based design, is shown in Fig. 1.

The PowerPC™440 embedded processor contains a dual-issue, superscalar 32-bit reduced instruction set computer (RISC) central processing unit (CPU), pipelined processing unit, along with other functional elements required to implement embedded system-on-a-chip solutions. These other functions include memory management, cache control, timers, and debug facilities. In addition to three separate 128-bit Processor Local Bus (PLB) interfaces, the embedded processor provides interfaces for custom coprocessors and floating-point functions, along with separate 32 KB instruction and 32 KB data caches [15].
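Returning to the hardware acceleration advantage listed in Section II-B, the benefit of moving a bottleneck algorithm into a custom co-processor can be estimated before any hardware is built. The sketch below uses Amdahl's law, which is a standard estimate and is not taken from this paper. The function name, the 60% run-time share and the gain values are hypothetical and serve only as illustration; bus transfer overhead between the processor and the co-processor is ignored.

```python
def amdahl_speedup(accelerated_fraction: float, accelerator_gain: float) -> float:
    """Estimate overall speedup from Amdahl's law.

    accelerated_fraction: share of the original run time spent in the
        bottleneck algorithm that is moved to the co-processor (0.0 to 1.0).
    accelerator_gain: how many times faster the co-processor runs that
        algorithm than the software version (must be positive).
    Both parameters are illustrative assumptions, not measured values.
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if accelerator_gain <= 0.0:
        raise ValueError("accelerator_gain must be positive")
    # Remaining software time plus the shortened hardware time, both
    # expressed as fractions of the original run time.
    new_time = (1.0 - accelerated_fraction) + accelerated_fraction / accelerator_gain
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical case: the bottleneck takes 60% of the run time.
    for gain in (2.0, 10.0, 100.0):
        print(f"co-processor gain {gain:6.1f}x -> overall {amdahl_speedup(0.6, gain):.2f}x")
```

With a 60% bottleneck, a 10x co-processor yields only about a 2.2x overall gain, and no co-processor can exceed 1 / (1 - 0.6) = 2.5x. This is why the hardware/software partitioning decision matters as much as the quality of the co-processor itself.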
Fig. 1: The PowerPC™440 Core system on a chip with two-level bus structure and additional peripherals

Fig. 2: The block diagram of a hard-core PowerPC™440 processor

The PPC440 Core, as a member of the PowerPC™ 400 Family, is supported by the IBM PowerPC™ Embedded Tools™ program, in which over 80 third party vendors have combined with IBM to provide a complete tools solution including Xilinx [14]. Development tools for the PPC440 include C/C++ compilers, debuggers, bus functional models, hardware/software co-simulation environments, and real-time operating systems. As part of the tools program, IBM maintains a complete set of development tools by offering the High C/C++ Compiler, RISCWatch™ debugger with RISCTrace™ trace interface, VHDL and Verilog simulation models and a PPC440 Core Superstructure development kit [13]. The PPC440 CPU operates on instructions in a dual issue, seven-stage pipeline, capable of dispatching two instructions per clock to multiple execution units and to optional Auxiliary Processor Units (APUs). The PPC440 core block diagram is shown in Fig. 2.

The PowerPC™ 440 embedded processor implements the full, 32-bit fixed-point subset of the IBM Book E: Enhanced PowerPC™ architecture. The PowerPC™440 embedded processor fully complies with this architectural specification. The 64-bit operations of the architecture are not supported, and the embedded processor does not implement the floating-point operations, although a floating-point unit (FPU) can be attached (using the APU interface). Within the embedded processor, the 64-bit operations and the floating-point operations are trapped, and the floating-point operations can be emulated using software.

The PowerPC™ 440 embedded processor implemented in Xilinx Virtex-5 devices and discussed in Xilinx’s documentation differs from the Book E architecture specification in the use of bit numbering for architected registers [13], [15]. Specifically, Book E defines the full, 64-bit instruction set architecture, where all registers have bit numbers from 0 to 63, with bit 63 being the least significant. This document describes the PowerPC 440 embedded processor, which is a 32-bit subset implementation of the architecture. Accordingly, all architected registers are 32 bits in length, with the bits numbered from 0 to 31, where bit 31 is the least significant. Therefore, references to register bit numbers from 0 to 31 in this document correspond to bits 32 to 63 of the same register in the Book E architecture specification ([IBM PEPC440, 2010]; [XEPB Virtex-5, 2010]).

2) Embedded Soft-Core MicroBlaze Processor

This sub-section gives a brief overview of the basic features and architecture of the Xilinx MicroBlaze™ embedded processor version 7.20, which is currently supported for MicroBlaze™ embedded processor development within the Embedded Development Kit (EDK) 11.4 for the Xilinx Virtex-5 FX70T FPGA used in this work.

Like the IBM PowerPC™, the MicroBlaze™ soft core processor is a 32-bit reduced instruction set computer (RISC). The processor includes the Big-Endian bit reversed format, 32-bit general purpose registers, virtual-memory management, cache software support, and Fast Simplex Link (FSL) interfaces. The MicroBlaze core is organized as a Harvard architecture with separate bus interface units for data and instruction accesses. The following three memory interfaces are supported: Local Memory Bus (LMB), the IBM Processor Local Bus (PLB), and Xilinx® CacheLink (XCL). The LMB provides single-cycle access to on-chip dual-port block RAM. The PLB interfaces provide a connection to both on-chip and off-chip peripherals and memory. The CacheLink interface is intended for use with specialized external memory controllers. MicroBlaze also supports up to 16 Fast Simplex Link (FSL) ports, each with one master and one slave FSL interface. The architecture of the Xilinx MicroBlaze™ processor core, the core interfaces, buses, memory and peripherals are shown in Fig. 3 [16].

The acronyms of the core interfaces shown in Fig. 3 are defined as follows [XMBPRG, 2010]:

DPLB : Data interface, Processor Local Bus,
DLMB : Data interface, Local Memory Bus (BRAM only),
IPLB : Instruction interface, Processor Local Bus,
ILMB : Instruction interface, Local Memory Bus (BRAM only),
MFSL 0..15 : FSL master interfaces,
DWFSL 0..15 : FSL master direct connection interfaces,
SFSL 0..15 : FSL slave interfaces,
DRFSL 0..15 : FSL slave direct connection interfaces,
DXCL : Data side Xilinx CacheLink interface (FSL master/slave pair),
IXCL : Instruction side Xilinx CacheLink interface (FSL master/slave pair),
Core : Miscellaneous signals for: clock, reset, debug, and trace.

The Xilinx MicroBlaze™ soft core processor is highly configurable and allows the selection of a specific or fixed set of features required by the design for embedded processor system development. The fixed features of the processor include: 1) thirty-two 32-bit general purpose registers, 2) 32-bit instruction word with three operands and two addressing modes, and 3) a 32-bit address bus, and a single issue pipeline. In addition to these fixed features, the MicroBlaze™ processor is parameterized to allow selective enabling of additional functionality.

The MicroBlaze™ processor can be configured with the following bus interfaces: 1) a 32-bit version of the PLB V4.6 interface, 2) LMB, which provides a simple synchronous protocol for efficient block RAM transfers, 3) FSL, which provides a fast non-arbitrated streaming communication mechanism, 4) XCL, which provides a fast slave-side arbitrated streaming interface between caches and external memory controllers, 5) a Debug interface for use with the Microprocessor Debug Module (MDM) core, and 6) a Trace interface for performance analysis.

The processor local bus (PLB) interfaces are implemented as byte-enable capable 32-bit masters. The MicroBlaze™ on-chip peripheral bus (OPB) interfaces are implemented as byte-enable capable masters. The local memory bus (LMB) is a synchronous bus used primarily to access on-chip block RAM. It uses a minimum number of control signals and a simple protocol to ensure that local block RAM is accessed in a single clock cycle. All the LMB signals are usually active high.

As a note on the embedded MicroBlaze™ processor system clock and reset signals, the following should be taken into
consideration for improved performance. Although the overall embedded system reset, designated here as “Reset”, and the MicroBlaze™ reset, designated here as “MB_Reset”, are functionally equivalent, the Reset is primarily intended for use with the on-chip peripheral bus (OPB) interface, whereas the MB_Reset is intended for the processor local bus (PLB) interfaces. Furthermore, the MicroBlaze™ processor is a synchronous design clocked with the overall system clock, designated here as “Clk”, except for the hardware debug logic, which is clocked with the debug clock signal designated here as “Debug_Clk”. If the hardware debug logic is not used, there is no minimum frequency limit for the Clk. However, if hardware debug logic is used, there are signals transferred between the two clock regions. In this case, Clk must have a higher frequency than the debug clock Debug_Clk [16].

Fig. 3: The architecture of the Xilinx MicroBlaze™ processor core, the core interfaces, buses

D. Industry-Standard Benchmark for FPGA Embedded Processors and Xilinx Embedded FPGA Processors Benchmark Performances

The industry standard benchmark for FPGA embedded processors is Dhrystone MIPs (DMIPs). Xilinx quotes DMIPs for almost all of its available embedded processors, including the MicroBlaze™ and the PowerPC™440 embedded processors. The maximum operating frequency and DMIPs achievable from the Virtex-5 FXT family of FPGAs as quoted by Xilinx for MicroBlaze™ are 210 MHz and 240 DMIPs respectively [18]. Similar results for the PowerPC™440 are 550 MHz and 1,100 DMIPs for a single processor system. According to Xilinx, this performance doubles with dual embedded PowerPC™440 processors, quoted as 1,100 MHz and 2,200 DMIPs [17]. The achieved DMIPs reported by Xilinx are based on several factors chosen to maximize the benchmark results. Such factors include: 1) optimal compiler optimization level and fastest available device family; 2) fastest speed grade in that device family; 3) executing from the fastest and lowest latency memory, which is typically an on-chip memory; 4) optimization of the embedded processor’s parameterizable features; and so on.

In fact, the FPGA manufacturer, which in this case is Xilinx, knows what must be done to get the most out of its FPGAs and takes full advantage of every possible enhancement technique when benchmarking. Thus, it is also necessary to employ the best enhancement techniques in the embedded processor design proposed in this work as much as possible, although the task is complicated.

E. Design Considerations for FPGA Embedded Processor System Design

The Xilinx base system builder (BSB) wizard provides an efficient way to create an FPGA embedded processor system; this involves choosing the memory types, memory controllers, peripherals, peripheral controllers, size and type of instruction and data cache memories, the optimization levels, the processor clock frequency and the size of local memory. The discussions here are specific to the peripherals that may be considered for the design of the proposed FPGA embedded processor systems to achieve the following design objectives: high performance and optimized speed in terms of operating frequency, at reduced cost in terms of FPGA fabric resource consumption.

To be more specific, the proposed FPGA embedded processor system will incorporate a co-processing system that will be attached to the processor local bus (PLB), so a memory and memory controller are required. Because instructions and data will be read in and written out, the size of the instruction and data cache memories and the peripherals, together with their respective controllers, must be configured. The initialization of the processor programs also needs memory and memory controllers. The universal asynchronous receiver and transmitter (UART) and joint test action group (JTAG) ports are required, and the UART must also be configured properly for communication. During synthesis, simulation and compilation of the embedded processor system, an appropriate optimization scheme must be selected to achieve the above design objectives. While the processor timer is internal, the clock and reset are external. Among other memories and peripherals and their respective controllers, the most important question is whether interrupt and debug logic controllers will be required. These issues and other critical considerations for the embedded processor system design are considered in the following.

1) Compiler Optimization and Parameters

Compiler optimizations are available in Xilinx Platform Studio (XPS) based on the GNU Compiler Collection (GCC). These compilers have several levels of optimization including Levels 0, 1, 2, and 3 as well as size reduction optimization. The strategies for these different levels of optimization are given below:

Level 0: This level does not apply any optimization to the design compilation.
Level 1: This is the first and lowest (Low, -O1) level of
optimization that performs jump and pop optimization.
Level 2: This is the second level of optimization and is designated as Medium (-O2). This level activates nearly all optimizations that do not involve a speed-space trade-off and so the executables do not increase in size. The compiler does not perform loop unrolling, function in-lining or strict aliasing optimizations. This is the standard level that can be used for all program deployment.
Level 3: This level offers the highest level of optimization and is designated High (-O3). This level adds more expensive options, including those that increase code size. In some cases, this optimization level actually produces code that is less efficient than Level 2, and as such should be used with caution.
Size Optimized (-Os): This option produces the smallest code size possible.

Note in general, however, that when both an optimization level and the debug option are used, the information obtained from the optimization process may not correlate with the generated source code.

2) Memory Types

The FPGA embedded processor provides access to fast, local memory as well as an interface to slower, external memory. The way the memory is used has a significant effect on performance. However, the memory usage can be manipulated using the Linker Script.

a) Local Memory Only

The local memory provides the fastest option for accessing memory. Xilinx FPGA local memory is made up of large FPGA memory blocks called BlockRAM (BRAM). The embedded processor accesses BRAM in a single bus cycle. Since the processor and the bus run at the same frequency in MicroBlaze, instructions stored in BRAM are executed at the full MicroBlaze processor frequency. In the MicroBlaze processor system, BRAM is essentially equivalent in performance to a Level 1 (L1) cache. On the other hand, the PowerPC™ can run at frequencies greater than the bus and has a true built-in L1 cache. Therefore, BRAM in a PowerPC™ processor system is equivalent in performance to a Level 2 (L2) cache. Thus, if the program for a particular embedded processor system design fits entirely within the local memory, then the design is likely

expensive inside the FPGA, but requires fewer FPGA input-output (I/O) ports and is least expensive per megabyte.

In addition to the memory access time, the peripheral also incurs some latency. In MicroBlaze, for example, the memory controllers are attached to the on-chip peripheral bus (OPB). The OPB SDRAM controller requires about eight to ten cycles of latency for a read operation and four to six cycles of latency for a write operation, depending on the clock frequency. Thus, it is obvious that the worst possible program performance would be achieved by having the entire program reside in external memory. Since optimizing execution speed is a typical goal in embedded processor system design, an entire program should rarely be targeted solely at external memory.

c) Instruction and Data Cache Memory

The PowerPC™ in Xilinx FPGAs has instruction and data caches built directly into the silicon of the hard processor. Enabling this cache is almost always a performance advantage for the PowerPC™ [10]. On the other hand, the MicroBlaze™ cache architecture is not implemented in dedicated silicon; rather, the instruction and data cache controllers are selectable parameters in the MicroBlaze configuration. When these controllers are included, the cache memory is built from BRAM. Therefore, enabling the cache is likely to consume more BRAM than local memory for the same storage size, because the cache architecture requires address line tag storage. Additionally, enabling the cache may also consume general-purpose FPGA logic to build the cache controllers. The consequences are that the achievable system frequency may be reduced when the cache is enabled, as more logic may be added and the complexity of the design may increase during the FPGA place and route operation. Despite these consequences, enabling the MicroBlaze™ cache, especially the instruction cache, may improve performance, even when the system is likely to run at a lower frequency. Finally, enabling the cache memory is always worth an experiment to justify different trade-offs.

d) Combination of Local, External and Cache Memory

As discussed earlier, the memory architecture that provides the best performance is one that only has local memory. However, this architecture may not always be practical since many useful and efficient embedded programs exceed the available local memory capacity. On the other hand, running from externally
to achieve optimal memory performance, although it is mostly memory exclusively may have more than eight times
likely that the embedded programs will exceed the local performance disadvantage due to the peripheral bus latency.
memory capacity.
Caching the external memory is an excellent choice for
embedded PowerPC™440 processor systems. For embedded
b) External Memory Only
MicroBlaze™ processor systems, perhaps the optimal memory
Xilinx FPGAs provides several memory controllers that configuration may be to wisely partition the program code,
interface with a variety of external memory devices. These maximizing the system frequency and local memory size.
memory controllers are connected to the processor’s peripheral Critical data, instructions and stack can also be placed in local
bus. The three types of volatile memory are supported by memory. Data cache may not be used so as to allow for a larger
Xilinx FPGAs are static RAM (SRAM), single-data-rate RAM local memory bank. Suppose that the local memory is not large
(SDRAM), and the double-data-rate RAM (DDR) SRAM. The enough; then the instruction cache can be enabled for the
SRAM controller is the smallest and simplest inside the FPGA address rang in the external memory used for instructions. By
while the SDRAM is the most expensive of the three memory not consuming BRAM in data cache, the local memory can be
types. The DDR SDRAM controller is the largest and most increased to contain more space. An instruction cache for the
instructions assigned to external memory could be very effective. Alternatively, experimentation or profiling could show which code fragments are most heavily accessed, and assigning these fragments to local memory could provide a greater performance improvement than caching.

3) Optimization Specific to an FPGA Embedded Processor

Since one of the objectives of the proposed embedded processor system design using the Xilinx Virtex-5 FX70T FPGA is to improve the performance of the hardware, additional techniques must be exploited to achieve this objective. Given the fact that the FPGA embedded processor resides next to additional FPGA hardware resources, one technique is to consider a custom co-processor designed specifically to target the implementation of a core algorithm in the design.

a) Logic Optimization and Reduction

The key point here is that only the peripherals and buses that are necessary should be connected. Suppose that the intended design will not store and run any instructions using external memory; then connecting the instruction side of the peripheral bus is not necessary. Connecting both the instruction and data sides of the processor to a single bus may create a multi-master system, which requires an arbiter. Optimal bus performance is achieved when a single master resides on the bus.

Furthermore, debug logic requires resources in the FPGA and may be the hardware bottleneck. When the design is completely debugged, the debug logic can be removed from the final system, which will potentially increase the system’s performance. For example, in an embedded MicroBlaze™ processor system with the cache enabled, the debug logic will typically be the critical path that slows down the entire design [10].

b) Area and Timing Constraints

The Xilinx FPGA place and route tools as well as the Xilinx PlanAhead™ tool perform much better when the design objectives are well specified. In these Xilinx tools, the desired clock frequency, pin locations, and logic element locations can be specified. By providing these details, the design tools can make more efficient and smarter trade-offs during hardware design implementation. Therefore, a careful study of the datasheets for each peripheral together with the design guidelines goes a long way in this regard and is a necessity.

c) Hardware Acceleration

Dedicated hardware outperforms software and can give dramatic performance improvements at the expense of FPGA resources. Therefore, the FPGA’s ability to accelerate the processor performance with dedicated hardware should be considered. Provided the hardware divider and the hardware barrel-shifter are enabled, the embedded MicroBlaze™ processor can be customized to use a hardware divider and a hardware barrel-shifter rather than performing these functions in software. Although enabling these processor capabilities may consume FPGA resources, the performance improvements can be extraordinary.

d) Co-Processing Hardware

Custom hardware logic can be designed to offload an FPGA embedded processor. For example, a software bottleneck identified in an algorithm can be converted into custom hardware. Then, custom software instructions can be defined to operate the hardware co-processor.

Any operation that is algorithmic, mathematical, or parallel is a good candidate for a hardware co-processor, which is the subject of the proposed embedded processor system design in this work. FPGA logic is traded for performance, but the advantages can be enormous and performance can be improved significantly.

III. THE EMBEDDED POWERPC™440 PROCESSOR SYSTEM DEVELOPMENT USING XILINX INTEGRATED SOFTWARE ENVIRONMENT (ISE) AND XILINX PLATFORM STUDIO

The embedded PowerPC™440 processor design considered here follows closely from the design considerations outlined and discussed in Section II. The embedded processor system design using the IBM PowerPC™440 hard processor core is instantiated from the Xilinx ISE, which then initializes the XPS where the actual processor system design is done. The Xilinx ISE is started and the project name is assigned in the “New Project Wizard”. The name assigned here for the PowerPC™440 processor system is “emb_ppc440_processor”. The FPGA device family Virtex-5 XC5VFX70T is selected; the speed grade for this device, based on our available Virtex-5 FX70T ML507 FPGA board, is -2 and is specified as such, together with the device package FF1136. The Xilinx synthesis tool (XST) is selected as the synthesis tool to be used in synthesizing the design. ModelSim-SE is selected as the simulation tool. The language for the embedded processor system development is VHDL (very-high-speed integrated circuit hardware description language). In addition to these selections, the Embedded Processor is also added as a “New Source” in this project wizard. The “emb_ppc440_processor” project summary is shown in Fig. 4(a).

When the “New Project Wizard” is completed, the ISE initializes and automatically starts up the Xilinx Platform Studio (XPS) since “Embedded Processor” was added as a “New Source”. The XPS in turn initializes and brings up the Base System Builder (BSB), which is an automated tool that can be used to create an embedded processor system. The embedded processor system design using the BSB is an eight-stage procedure, namely: Welcome, Board, System, Processor, Peripheral, Cache, Application, and Summary.
The “Welcome” stage allows a new processor system to be designed or an existing pre-designed processor system to be loaded, as shown in Fig. 4(b). The “Board” stage allows the FPGA device family and package to be specified, if different from that specified in the “New Project Wizard”. This is sometimes useful if a custom FPGA board different from the pre-configured Xilinx FPGA development boards is used. It is also useful if the processor design was not initialized and started using the Xilinx ISE. The advantages of initializing and starting an embedded processor system design from the ISE are many, as discussed in Appendix A. The “System” stage shown in Fig. 4(c) allows a single- or dual-processor system to be specified and designed. The Virtex-5 XC5VFX70T device family currently supports single-processor system design. Thus, a single-processor system is the target in this work. Then, in the “Processor” stage, the choice of a PowerPC™440 or a MicroBlaze™ processor is available. In this sub-section, the PowerPC™440 is selected as the intended processor as shown in Fig. 4(d), whereas in the next sub-section the MicroBlaze™ processor will be selected.
(a) New project summary (b) Base System Builder: “Welcome”
(c) Base System Builder: “System” (d) Base System Builder: “Processor”
Fig. 4: The Xilinx ISE “New Project Summary” and the BSB Welcome, System, and Processor design stages for the embedded
The “Peripheral” stage allows different memory types and peripherals to be added to or removed from the proposed embedded processor system. Once a memory or peripheral is selected, the associated controller is automatically added. Furthermore, if the “Interrupt” check box is selected, the interrupt controller is also included, and it must be configured in the XPS after the BSB has created the embedded processor system. As discussed in Section II-(E) under memory types and under optimization specific to an FPGA embedded processor, the choice of memory and hardware peripherals, including their respective controllers, has a significant effect on the embedded system’s performance. Here, peripherals that are not needed are removed. The actual size of the embedded program is not yet known, and this makes the choice of memory difficult. In this regard, the embedded processor local memory is selected first. Next, the external DDR2 SDRAM and the on-board SRAM are added. In this work, the serial port is needed to print out all results to the host development computer. Thus, the only peripheral added here is the UART (RS232_Uart_1), and it is configured as follows: Baud Rate = 115200, Data Bits = 8, Parity = None, and the Interrupt is not used (that is, it is left unchecked). The BSB dialog for the “Peripheral” stage and the selected memory types and peripherals are shown in Fig. 5(a).
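As a rough illustration of what these UART settings mean for printing results to the host, the following sketch estimates how long it takes to send a block of characters. It assumes one start bit and one stop bit per character (the stop-bit count is an assumption here, since it is not listed for the BSB dialog), and the function name `uart_transfer_time` is hypothetical.

```python
def uart_transfer_time(n_chars, baud=115200, data_bits=8, parity_bits=0, stop_bits=1):
    """Seconds needed to send n_chars characters over an asynchronous serial link."""
    bits_per_char = 1 + data_bits + parity_bits + stop_bits  # 1 start bit per character
    return n_chars * bits_per_char / baud


chars_per_second = 1 / uart_transfer_time(1)
print(f"{chars_per_second:.0f} characters per second")
print(f"{uart_transfer_time(2000) * 1e3:.1f} ms for 2000 characters")
```

With 8 data bits, no parity and one stop bit, each character costs ten bit times on the wire, so extensive printing inside a timed section of code could noticeably distort a timing measurement.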
The “Cache” stage allows the instruction and data cache memory types and controllers to be enabled. As mentioned earlier, the PowerPC™440 embedded in the Virtex-5 series of FPGAs provides 32-KB caches which are built directly into the silicon of the hard PowerPC™440 core. Normally, these caches are enabled in software and can be configured to cache multiple memory regions. Here, both the instruction and data caches are enabled, although this can also be configured in the software design part of the embedded processor system implementation using the Xilinx SDK. The “Application” stage lists the readily available applications to be implemented by the embedded processor system. The applications are usually written in the C programming language and may include user applications. The default Xilinx applications available under “Application” are the “Memory” and “Peripheral” test programs, shown in Fig. 5(b) under the File Location category as “TestApp_Memory_ppc440_0” and “TestApp_Peripheral_ppc440_0”. Note that new software programs can be created and added to this “Application” list both from the XPS, after the BSB has finished creating the processor system, and from the Xilinx SDK during the software design portion of the embedded processor system.

(a) Base System Builder: “Peripheral”
The “Summary” is the last stage of the BSB-guided steps for creating an embedded processor system. This stage lists all the available peripherals associated with the created embedded processor together with their instance names and base and high addresses, as shown under System Summary in Fig. 5(b). The “Summary” stage also lists the major software associated with the processor system, as shown under Overall in the File Location category in Fig. 5(b). The components of the previous “Application” stage are also listed in the “Summary” stage dialog window.

(b) Base System Builder: “Summary”
Fig. 5: The BSB: the Peripheral and Summary design stages for the embedded PowerPC™440 processor system.
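The base and high addresses listed in the “Summary” stage can be sanity-checked with a few lines of code. The sketch below computes the size of each address range and verifies that no two ranges overlap. The instance names and addresses are made-up placeholders, not the values generated by the BSB for this design, and the function names are hypothetical.

```python
def range_size(base, high):
    """Number of bytes in the inclusive address range [base, high]."""
    return high - base + 1


def find_overlaps(address_map):
    """Return pairs of instance names whose inclusive address ranges overlap."""
    items = sorted(address_map.items(), key=lambda kv: kv[1][0])  # sort by base address
    overlaps = []
    for i, (name_a, (_, high_a)) in enumerate(items):
        for name_b, (base_b, _) in items[i + 1:]:
            if base_b <= high_a:
                overlaps.append((name_a, name_b))
    return overlaps


# Hypothetical address map: instance name -> (base address, high address)
example_map = {
    "local_bram": (0x00000000, 0x00003FFF),
    "ext_sram": (0x10000000, 0x100FFFFF),
    "uart": (0x20000000, 0x2000FFFF),
}

for name, (base, high) in example_map.items():
    print(f"{name}: {range_size(base, high) // 1024} KB")
print("overlaps:", find_overlaps(example_map))
```

A check of this kind is most useful after custom peripherals have been added by hand, when the address map is no longer produced entirely by the automated tool.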
Next, the just created PowerPC™440 embedded processor system must be compiled so that all the memory types, peripherals, memory and peripheral driver software and the entire embedded processor system can be updated. The Xilinx ISE and the XPS are used interchangeably to perform these compilations. The compilation procedures are summarized as follows:

i) Starting with the XPS, the board support packages (BSPs) and libraries are generated by selecting “Software → Generate Libraries and BSPs” on the XPS graphical user interface (GUI) shown in Fig. 6.

ii) Next, the Netlist is generated by selecting “Hardware → Generate Netlist”. This stage of the design also generates all the “wrappers”, device drivers, and all the necessary design and technology files that would be required by the ISE for complete synthesis and implementation of the embedded processor system.

iii) After the Netlist generation, attention is turned to the Xilinx ISE™. A section of the Xilinx ISE™ graphical user interface (GUI) for the PowerPC™440 embedded processor system design is shown in Fig. 7. During the Netlist generation, the “User Constraint File (UCF)” was generated. The UCF has the project name with a ucf extension, that is, “emb_ppc440_processor.ucf”, and is always located in the “data” directory in the processor hierarchy. This file defines the constraints on the created processor system together with the input-output (I/O) map of the complete design to the Virtex-5 FX70T FPGA device family and the selected package in Fig. 4(a). This file is introduced into the processor system by selecting “Project → Add Source” from the ISE GUI of Fig. 7, navigating to the “data” directory, and adding “emb_ppc440_processor.ucf”.

iv) Next, the programming file (bitstream) for the complete embedded PowerPC™440 processor system is generated by double-clicking the blue-highlighted “Generate Programming File” shown in Fig. 7. This is the ISE implementation phase of the design, which is detailed in [2], [12]. However, the various stages of this ISE implementation are briefly described in the following. As can be seen in Fig. 7, the ISE has seven major phases, namely:
Fig. 6: The XPS graphical user interface (GUI) for the creation and initial compilation of the embedded processor system
Step 1) User Constraints,
Step 2) Synthesize - XST (Xilinx Synthesis Tool),
Step 3) Implement Design,
Step 4) Generate Programming File,
Step 5) Configure Target Device,
Step 6) Update Bitstream with Processor Data, and
Step 7) Analyze Design Using Chipscope.

Double-clicking the “Generate Programming File” implements Steps 2), 3) and 4) to generate this file. Note that the XPS generated the UCF, which takes care of Step 1). Otherwise, using the Xilinx PlanAhead, the UCF would have been created here in Step 1). Because the design is not yet ready for the target Virtex-5 FX70T FPGA, Steps 5), 6), and 7) are not implemented here. The generation of the bitstream completed without errors but with some warnings.

v) Note that the embedded processor design is coordinated by both the Xilinx ISE™ and the XPS. It is observed that immediately after the generation of the programming file (bitstream), the Xilinx ISE™ indicates that the project design is out of date while the XPS indicates that the project file has changed on disk, on their respective GUIs. Therefore, Step 1) to Step 4) are repeated to update the system, after which both notifications disappear.

In addition to the programming file, an important file called the “Block Memory Map (BMM)” file, with extension bmm, is also generated. For the current PowerPC™440 project, this file is edkBmmFile_bd.bmm. The BMM file is a text file that has syntactic descriptions of how individual block RAMs constitute a contiguous logical data space. The Xilinx Data2MEM [21] uses BMM files to direct the translation of data into the proper initialization form. Note that since a BMM file is a text file, it is directly editable. This file together with the bitstream and all the generated device drivers will be required to program the Virtex-5 during the software design portion of the embedded processor system. The BMM file is located in the top-level directory of the processor system together with the bitstream (with the extension .BIT).

vi) Since the embedded processor project is now fully updated by both the Xilinx ISE™ and the XPS, attention is again turned to the XPS shown in Fig. 6 to perform the following:

1) Generate the block diagram of the complete system by selecting from the XPS GUI of Fig. 6: Project → Generate Block Diagram Image. The resulting diagram is shown in Fig. 8.

2) Generate the complete design report by selecting from the XPS GUI of Fig. 6: Project → Generate and View Design Report. This report gives detailed information on the embedded processor system but is not shown in this work since it is more than 200 pages long. It is useful as a reference for accessing the different peripherals, memory types, and memory and peripheral drivers, especially when modifying, addressing and integrating custom hardware.

3) Generate and export the designed embedded processor hardware to the Xilinx software development kit (Xilinx SDK) by selecting from the XPS GUI in Fig. 6: Project → Export Hardware Design to SDK.
Fig. 7: A section of the Xilinx ISE™ graphical user interface from where the PowerPC™440 embedded processor system design is instantiated.
Fig. 8: The block diagram of the PowerPC™440 embedded processor system with associated memory types, peripherals, clock generator, buses, hardware
and software specifications and key/symbols
Although the Export dialog box offers two options for exporting the designed hardware, “Export Only” and “Export and Launch SDK”, “Export Only” is selected since the designed hardware will be used in later sections for memory and peripheral testing as well as for the Dhrystone benchmark performance comparison of the designed PowerPC™440 processor system with the Xilinx MicroBlaze™ embedded processor. However, this export process automatically creates an SDK directory in the current design hierarchy and places the hardware structure of the designed PowerPC™440 processor system (emb_ppc440_processor.xml) as an XML document in the created SDK directory.

IV. THE EMBEDDED MICROBLAZE™ PROCESSOR SYSTEM DEVELOPMENT USING XILINX INTEGRATED SOFTWARE ENVIRONMENT (ISE) AND XILINX PLATFORM STUDIO

The procedure for creating the embedded MicroBlaze™ processor system is essentially the same as that for the embedded PowerPC™440 system using the Base System Builder (BSB). However, some differences exist in the architectural design of the embedded MicroBlaze™ processor system when compared to the embedded PowerPC™440 processor system. Here, the name assigned to the embedded MicroBlaze™ processor system project is “emb_mb_processor”. At the “Processor” stage of the Base System Builder (BSB), “MicroBlaze” is selected as the option for “Processor Type”, as in the case of Fig. 4(d).

As discussed in Section II-(E), the choices and configurations of different memory types and peripherals influence the performance of embedded processors, especially for the MicroBlaze™ processor where the FPGA fabric is used to implement the logic circuits and drivers. Thus, for the “Peripheral” selection stage, data-side and instruction-side local memory types and controllers are selected. These two were built into the PowerPC™440 core. As with the PowerPC™ processor, the DDR2 SDRAM, the SRAM and the UART are included in the MicroBlaze™ processor system. These peripherals together with their address ranges are shown in the design summary of Fig. 9(a). Unlike the PowerPC™440, where the instruction and data caches are built in and fixed at 32-KB, with three memory options (SRAM, DDR2 SDRAM and BRAM) available for enabling the cache memory type, only the first two memory type options are available for enabling the MicroBlaze™ processor memory cache. While the instruction and data cache size in the PowerPC™440 core is fixed at 32-KB, that in the MicroBlaze™ processor core can be specified. Noting that the amount of FPGA fabric required to implement the memory and the memory address decoders varies with the specified memory size, the instruction and data caches for the MicroBlaze™ processor system are enabled with each allocated 32-KB SRAM, up from the default of 8-KB, as shown in Fig. 9(b). In the MicroBlaze™ processor system, small cache sizes are implemented with FPGA look-up tables (LUTs) while large cache sizes are implemented using block RAMs (BRAMs). As mentioned earlier in Section III, these caches are optional and can also be configured during software development for the embedded processor system as shown and discussed later in Section V. The design summary of the MicroBlaze™ embedded processor system created using the Base System Builder (BSB) is shown in Fig. 9(a) and lists the major software associated with the processor system under Overall in the File Location category. As in the “Application” stage for the PowerPC™440 processor system, the components associated with the “Application” stage are also listed under the “System Summary” for the created MicroBlaze™ embedded processor system.

The software associated with the just created MicroBlaze™ embedded processor system is then compiled so that all the memory types, peripherals, memory and peripheral driver software as well as the entire embedded processor system are updated. The compilation procedures are similar to those described for the PowerPC™440 embedded processor system, where the Xilinx ISE and the XPS are used interchangeably to perform these compilations.

(a) Base System Builder: “Summary” (b) Base System Builder: “Cache”
Fig. 9: The BSB: (a) the Summary and (b) the Cache design stages for the embedded MicroBlaze™ processor system design.
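To make the cache trade-off for the MicroBlaze™ system more concrete, the sketch below estimates the average number of cycles per memory access for a given cache hit rate. It uses the textbook average-memory-access-time formula, a one-cycle hit cost (matching the single-cycle BRAM access described in Section II) and a ten-cycle miss penalty taken from the upper end of the OPB SDRAM read latency quoted in Section II. The hit rates are illustrative assumptions, not measurements from this design, and the function name is hypothetical.

```python
def average_access_cycles(hit_rate, hit_cycles=1, miss_penalty_cycles=10):
    """Average memory access time in cycles: hit cost plus miss rate times miss penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_cycles + (1.0 - hit_rate) * miss_penalty_cycles


for hit_rate in (0.0, 0.90, 0.99, 1.0):
    cycles = average_access_cycles(hit_rate)
    print(f"hit rate {hit_rate:.2f}: {cycles:.2f} cycles per access")
```

Because these figures are in cycles, any reduction in the achievable system clock caused by the extra cache logic scales them in real time, which is one reason the cache configuration is worth evaluating experimentally.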
Fig. 10: The block diagram of the MicroBlaze™ embedded processor system with associated memory types, peripherals, clock generator,
buses, hardware and software specifications and key/symbols.
As in Section III, the wrappers and hardware drivers, libraries and board support packages (BSPs) as well as the Netlist are generated using the XPS via its GUI, while the synthesis results, programming file (bitstream), block memory map (.BMM) file, all other implementation files and the device utilization summary are generated using the Xilinx ISE™ software via its GUI. Next, the XPS is used via its GUI to create the SDK directory in the top-level hierarchy of the MicroBlaze™ processor project directory, and the hardware description text file that describes the MicroBlaze™ embedded processor system is exported to this SDK directory. Finally, the block diagram image and the XPS synthesis summary are generated using the XPS via its GUI. The MicroBlaze™ embedded processor system created is shown in Fig. 10.
V. SOFTWARE DEVELOPMENT AND PERFORMANCE VERIFICATION OF THE EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS USING THE XILINX SOFTWARE DEVELOPMENT KIT (XILINX SDK)

It is important to test and ensure that the memories and peripherals included in an embedded system are available, accessible and fully functional. Thus, in this section, the selected memories and peripherals of the embedded PowerPC™440 and MicroBlaze™ processor systems are tested for functional verification. As discussed in the last two sections, the generated hardware description files (emb_ppc440_processor.xml and emb_mb_processor.xml) have been placed in their respective SDK directories. These tests are performed using the Xilinx software development kit (SDK). The procedures for creating the software platforms and programming the FPGA are summarized as follows.

Beginning with the embedded MicroBlaze™ processor system, the Xilinx SDK software is launched and the hardware description file is imported into the Xilinx SDK workspace via the SDK GUI. This process automatically builds and initializes all the embedded processor system device (memory and peripheral) drivers and controllers.

First, a new “Software Platform” is created for the embedded MicroBlaze™ processor system using the Xilinx SDK GUI. A new “Manage Make C Application Project” is then created under the “Software Platform”, and the “Memory Tests” application is selected, which uses the “TestApp_Memory.c” shown in Fig. 9(a). The Xilinx SDK automatically builds and compiles the software application project and reports any error(s).

Next, the Virtex-5 FX70T ML507 FPGA board is connected, turned ON and programmed by selecting Tools → Program FPGA from the SDK GUI as shown in Fig. 11. This process requires the MicroBlaze processor programming file (bitstream) generated in Section IV, named “emb_mb_processor.bit”, and the block memory map (edkBmmFile.bmm).

The results from the FPGA can be observed on the HyperTerminal window of the host computer using the RS232 serial ports of both the FPGA board and the host personal computer (PC), connected via a null-modem RS232 serial cable. Here, the host PC is an Intel® Core™2 Quad CPU computer running at 2.66 GHz. The universal asynchronous receiver transmitter (UART) serial port (commonly called the serial port) uses a protocol that provides a useful and convenient way of testing processor-based, high-level code. The print command of C is used to display intermediate values from the FPGA. For consistency in the data transmission rate, the RS232 port of the host PC is configured in the same way as that of the FPGA in Section III (see Fig. 5(a)), as follows: Baud rate = 115200, Data = 8 bits, Parity = none, Stop = 1 bit, Flow control = none.

The Memory Test application is executed on the FPGA as “Debug on Hardware” from the SDK GUI as shown in Fig. 11. Running the Memory Test application on the Virtex-5 ML507 FPGA produces the result shown on the HyperTerminal of Fig. 12(a). Note that for the PowerPC™440 processor system, the hardware description file “emb_ppc440_processor.xml”, the programming file (bitstream) generated in Section III named “emb_ppc440_processor.bit”, and the block memory map (edkBmmFile.bmm) are required to program the FPGA.
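The idea behind a memory test of this kind can be sketched as a write and read-back check. The example below is not the Xilinx “TestApp_Memory.c” program; it is a simplified model, written in Python for brevity, that exercises a simulated memory region with a few fixed patterns and an address-dependent pattern. The class `SimulatedMemory` and the function `memory_test` are hypothetical names introduced for illustration.

```python
class SimulatedMemory:
    """Word-addressable memory model; an optional stuck-at-zero mask emulates a fault."""

    def __init__(self, n_words, stuck_at_zero_mask=0):
        self.words = [0] * n_words
        self.stuck_at_zero_mask = stuck_at_zero_mask

    def write(self, index, value):
        # Keep 32 bits and force any "stuck" bits to zero.
        self.words[index] = value & 0xFFFFFFFF & ~self.stuck_at_zero_mask

    def read(self, index):
        return self.words[index]


def memory_test(mem, n_words):
    """Return a list of (index, expected, actual) mismatches; an empty list means pass."""
    failures = []
    # Fixed patterns are aimed at stuck data bits.
    for pattern in (0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555):
        for i in range(n_words):
            mem.write(i, pattern)
        for i in range(n_words):
            actual = mem.read(i)
            if actual != pattern:
                failures.append((i, pattern, actual))
    # Address-dependent data is aimed at address lines that alias each other.
    for i in range(n_words):
        mem.write(i, i)
    for i in range(n_words):
        actual = mem.read(i)
        if actual != i:
            failures.append((i, i, actual))
    return failures


good = SimulatedMemory(256)
bad = SimulatedMemory(256, stuck_at_zero_mask=0x00000100)
print("good memory passes:", memory_test(good, 256) == [])
print("faulty memory passes:", memory_test(bad, 256) == [])
```

On real hardware the reads and writes would target the physical address range of the memory under test, and the pass or fail message would be sent to the host over the serial port.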
Fig. 11: Xilinx software development kit graphical user interface for software development and programming the Virtex-5 ML507 FPGA
using the “Debug on Hardware” option.
In order to test the peripherals, another new “Manage Make C Application Project” is created using the same procedure as for the Memory Test. The “Peripheral Tests” application, which uses the “TestApp_Peripheral.c” shown in Fig. 9(a), is selected. The same procedures as in the memory test case are followed to build, compile and test the embedded MicroBlaze™ processor peripherals. Running the Peripheral Test application on the Virtex-5 ML507 FPGA produces the result shown on the HyperTerminal of Fig. 12(b). The memory and peripheral tests performed for the MicroBlaze™ embedded processor system are repeated for the PowerPC™440 embedded processor system; results similar to Fig. 12(a) and (b) were obtained but are not shown here to save space.

These test results indicate that the memories and peripherals of the embedded processor systems are fully functional and well configured, which implies that the embedded processor systems could be deployed for the development of embedded system applications based on the selected devices.

(a) Memory test (b) Peripheral test
Fig. 12: The MicroBlaze™ processor: (a) memory and (b) peripheral test results on the HyperTerminal window.

VI. INDUSTRY-STANDARD DHRYSTONE BENCHMARK PERFORMANCE EVALUATION ON THE DESIGNED EMBEDDED PROCESSOR SYSTEMS

Dhrystone is a benchmark test program used to evaluate the performance of an embedded processor system. The measured performance is compared to that specified by the manufacturer to assess how well the memory types, peripherals and optimization techniques have been employed to create the embedded processor system for enhanced performance. As mentioned in Section II-(A), the performance in the Dhrystone benchmark evaluation is usually measured in terms of the maximum FPGA operating frequency (fmax) and the Dhrystone million instructions per second (DMIPs) [10], [17], [18]. Unfortunately, only the Dhrystone benchmark program for evaluating the embedded MicroBlaze™ processor system is available here for evaluation. However, since essentially the same memory types, peripherals and their respective controllers are used, the results for the benchmarking of the MicroBlaze™ processor system could be used to judge the PowerPC™440 processor system, noting that the PowerPC™ is known for higher speed performance, running at a maximum frequency of 550 MHz and 1,100 DMIPs compared to the MicroBlaze™ at 210 MHz and 240 DMIPs, as discussed in Section II-(D) [17], [18].

To enhance the performance of the Dhrystone benchmarking of the designed MicroBlaze™ embedded processor system, the Dhrystone program is configured to load directly into the on-board BlockRAMs (BRAMs) from the Xilinx Platform Studio (XPS) for execution at maximum operating frequency and DMIPs. In this work, the Dhrystone program is first implemented in the SDK, in the same way as the Memory and Peripheral test programs, to ensure that it is free of errors. A copy of the just tested MicroBlaze™ processor system is made. Next, the XPS is opened via the Xilinx ISE™ GUI in the same way in which it was created. A new directory called Dhrystone_TestApp_microblaze_0 is created within the XPS emb_mb_processor hierarchy. A new software application, also called “Dhrystone_TestApp_microblaze_0”, is then created in the XPS as shown in Fig. 13. The Dhrystone benchmark program is then imported into the new “Dhrystone_TestApp_microblaze_0” software application. As discussed in Section II-(E), the medium optimization level, Level 2 (Medium (-O2)), is selected as the compiler optimization option as shown in the lower right-hand corner of Fig. 13. The new project is then compiled by right-clicking the new Dhrystone_TestApp_microblaze_0 application and selecting “Build Project”. This action creates the executable and linkable format (ELF) file for the project.

Since the copied project has changed, the Xilinx ISE™ project has also changed and shows as out of date. Thus, the complete MicroBlaze™ embedded processor project is again fully recompiled using both the XPS and the Xilinx ISE™ software according to the six-step procedure summarized in Section III. New board support packages (BSPs), Netlist, programming file (emb_mb_processor.bit), block memory map (edkBmmFile.bmm) and hardware description file (emb_mb_processor.xml) are generated and exported to the software development kit (SDK).
Fig. 13: The XPS for creating, compiling and initializing the Dhrystone benchmark program to load from on-board BRAM for benchmark
performance evaluation of MicroBlaze™ embedded processor on Virtex-5 ML507 FPGA.
The Xilinx SDK is again opened. A new software platform called “Dhrystone_Test” is created. A new “Manage Make C Application Project” is also created. This time around, the just created and compiled “Dhrystone_Test” software application is selected. Next, the Virtex-5 ML507 is programmed and the Dhrystone application is executed. The maximum operating frequency obtained is 188.2 MHz against the 210 MHz specified by Xilinx, and the score obtained is 204.7 DMIPs against the 240 DMIPs specified by Xilinx for the MicroBlaze™ processor [16], [18]. Dividing the DMIPs obtained by the maximum operating frequency specified by Xilinx for the Virtex-5 ML507 FPGA (204.7/210) gives 0.9748, which implies that the designed MicroBlaze™ embedded processor system is highly optimized for embedded applications. Note that the embedded programs are initialized and implemented via the BRAM due to their small size, but the result may be different when the embedded programs are larger than the on-board BRAMs.

Following the same procedures as for the embedded MicroBlaze™ processor system, the Dhrystone benchmark program was implemented and evaluated on the embedded PowerPC™440 processor system. The maximum operating frequency obtained is 495.8 MHz against the 550 MHz specified by Xilinx, and the score obtained is 1001.6 DMIPs against the 1100 DMIPs specified by Xilinx [16]. Dividing the DMIPs obtained by the maximum operating frequency specified by Xilinx for the Virtex-5 ML507 FPGA (1001.6/550) gives 1.8211, which implies that the designed PowerPC™440 embedded processor system is highly optimized for embedded applications.

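
For orientation, a DMIPs score is conventionally the Dhrystone loop rate divided by 1757, the Dhrystones-per-second score of the VAX 11/780 reference machine. The raw loop counts are not reported here, so the following Python sketch only back-computes the loop rates implied by the measured scores; the function names are illustrative and are not part of the Xilinx tools.

```python
# Conventional Dhrystone normalisation: 1 DMIPS = 1757 Dhrystones per second
# (the score of the VAX 11/780 reference machine).
VAX_DHRYSTONES_PER_SECOND = 1757.0


def dmips_from_loop_rate(dhrystones_per_second: float) -> float:
    """Convert a raw Dhrystone loop rate into a DMIPs score."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def loop_rate_from_dmips(dmips: float) -> float:
    """Back-compute the loop rate implied by a reported DMIPs score."""
    return dmips * VAX_DHRYSTONES_PER_SECOND


if __name__ == "__main__":
    # Measured scores quoted in the text for the two designed systems.
    for name, dmips in (("MicroBlaze", 204.7), ("PowerPC440", 1001.6)):
        print(f"{name}: about {loop_rate_from_dmips(dmips):,.0f} Dhrystones/s")
```

The conversion is a single scaling factor, so comparing DMIPs scores between the two systems is equivalent to comparing their raw Dhrystone loop rates.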
A summary of the DMIPs benchmark performance results for both embedded processor systems is given in TABLE I. From TABLE I, the DMIPs/freqmax from Xilinx’s implementation is 1.1429 [18] compared to the 0.9748 obtained by the designed embedded MicroBlaze™ processor system, which shows that the latter is 14.71% lower than the former. Also, comparing the 2.0000 from Xilinx [17] to the 1.8211 obtained by the designed embedded PowerPC™440 indicates that the latter is 8.95% lower than the former. Despite the lower DMIPs/freqmax values obtained using the designed embedded processor systems, it is evident that the embedded processor systems show good computational efficiencies and that the devices (memories and peripherals) that constitute the embedded processor systems design are well selected in line with the design considerations proposed earlier.

TABLE I: SUMMARY OF THE DHRYSTONE BENCHMARK PERFORMANCE EVALUATION FOR THE EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS.

Processor system                   Maximum Frequency    Dhrystone Million Instructions    DMIPs/freqmax
                                   (freqmax), MHz       Per Second (DMIPs)
Embedded MicroBlaze™ (Xilinx)      210                  240                               1.1429
Embedded MicroBlaze™ (Designed)    188.2                204.7                             0.9748
Embedded PowerPC™ (Xilinx)         550                  1100                              2.0000
Embedded PowerPC™ (Designed)       495.8                1001.6                            1.8211

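
The ratios and percentage shortfalls quoted for TABLE I can be reproduced with a few lines of arithmetic. The Python sketch below is only an illustration: the dictionary and function names are hypothetical, and it assumes, as the text describes, that the measured DMIPs are divided by the Xilinx-specified maximum frequency rather than by the measured frequency.

```python
# (freqmax in MHz, DMIPs) pairs transcribed from TABLE I.
XILINX = {"MicroBlaze": (210.0, 240.0), "PowerPC440": (550.0, 1100.0)}
DESIGNED = {"MicroBlaze": (188.2, 204.7), "PowerPC440": (495.8, 1001.6)}


def dmips_per_mhz(dmips: float, freq_mhz: float) -> float:
    """DMIPs normalised by clock frequency."""
    return dmips / freq_mhz


def shortfall_percent(reference: float, measured: float) -> float:
    """How far 'measured' falls below 'reference', in percent."""
    return 100.0 * (reference - measured) / reference


def summarize() -> dict:
    """Return {system: (Xilinx ratio, designed ratio, shortfall %)}."""
    rows = {}
    for name, (ref_freq, ref_dmips) in XILINX.items():
        _, measured_dmips = DESIGNED[name]
        ref_ratio = dmips_per_mhz(ref_dmips, ref_freq)
        # Assumption taken from the text: the measured DMIPs are divided
        # by the Xilinx-specified frequency, not the measured one.
        designed_ratio = dmips_per_mhz(measured_dmips, ref_freq)
        rows[name] = (ref_ratio, designed_ratio,
                      shortfall_percent(ref_ratio, designed_ratio))
    return rows


if __name__ == "__main__":
    for name, (ref, got, gap) in summarize().items():
        print(f"{name}: Xilinx {ref:.4f}, designed {got:.4f}, gap {gap:.2f}%")
```

Because both ratios share the same denominator, the shortfall depends only on the two DMIPs scores; normalising by the measured frequencies instead would give different ratios.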
TABLE II: THE XILINX PLATFORM STUDIO (XPS) EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS SYNTHESIS
SUMMARY.
Wrapper                            PowerPC™440 Embedded Processor System: Flip Flops Used, Look-Up Tables (LUTs) Used, BlockRAMs (BRAMs) Used
                                   MicroBlaze™ Embedded Processor System: Flip Flops Used, Look-Up Tables (LUTs) Used, BlockRAMs (BRAMs) Used
proc_sys_reset_0_wrapper 67 51 67 51
jtagppc_cntrl_inst_wrapper 2
mdm_0_wrapper 119 117
clock_generator_0_wrapper 4 3 4 3
ddr2_sdram_wrapper 2,355 1,768 2 3,458 2,077 5
sram_wrapper 544 316 540 295
rs232_uart_1_wrapper 141 127 144 130
lmb_bram_wrapper 8
ilmb_cntlr_wrapper 2 6
dmb_cntlr_wrapper 2 6
dlmb_wrapper 1 1
ilmb_wrapper 1 1
xps_bram_if_cntlr_1_bram_wrapper 16
xps_bram_if_cntlr_1_wrapper 255 201
plb_v46_0_wrapper 138 220
mb_plb_wrapper 150 410
ppc440_0_wrapper 2 3
microblaze_0_wrapper 1,375 1,220
TABLE III: THE XILINX ISE™ DEVICE UTILIZATION SUMMARY FOR THE EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS.

                                        PowerPC™440 Embedded Processor System    MicroBlaze™ Embedded Processor System
Slice Logic Utilization                 Used        Available   Utilization      Used        Available   Utilization
Number of Slice Registers               3,040       44,800      5%               5,051       44,800      11%
Number of Slice LUTs                    2,538       44,800      5%               3,871       44,800      8%
Number of Route-Thrus                   22          -           -                -           -           -
Number of Occupied Slices               1,737       11,200      15%              2,748       11,200      24%
Number of LUT Flip-Flop Pairs Used      4,134       57,202      7%               6,740       57,202      11%
Number of Bonded IOBs                   184         640         28%              184         640         28%
Number of LOCed IOBs                    184         184         100%             184         184         100%
IOB Flip Flops                          330         -           -                330         -           -
Number of Block RAM/FIFO                20          148         13%              17          148         18%
Total Memory Used (KB)                  720         5,328       13%              612         5,328       11%
Number of BUFG/BUFCTRLs                 7           32          21%              7           32          21%
Number of IDELAYCTRLs                   3           22          13%              3           22          13%
Number of BUFIOs                        8           80          10%              8           80          10%
Number of DCM_ADVs                      1           12          8%               1           12          8%
Number of PLL_ADVs                      1           6           16%              1           6           16%
Number of PPC440s                       1           1           100%             -           -           -
Number of BSCANs                        -           -           -                1           4           25%
Number of DSP48Es                       -           -           -                3           128         2%
Average Fanout of Non-Clock Nets        3.07        -           -                3.33        -           -

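
The percentage comparisons drawn from TABLE III in Section VII are differences between utilization figures, that is, percentage points. The Python sketch below recomputes the occupied-slice utilization from the used and available counts and then takes the differences of the reported percentages. The names are illustrative, and truncation to a whole percent is an assumption that happens to match the occupied-slice rows; it is not a documented property of the ISE report.

```python
def utilization_percent(used: int, available: int) -> int:
    """Utilization as a whole percent, truncated (assumed report behaviour)."""
    return (100 * used) // available


# Occupied slices (used, available) transcribed from TABLE III.
OCCUPIED_SLICES = {"PowerPC440": (1737, 11200), "MicroBlaze": (2748, 11200)}

# Utilization percentages as reported in TABLE III: (PowerPC440, MicroBlaze).
REPORTED = {
    "slice_registers": (5, 11),
    "slice_luts": (5, 8),
    "occupied_slices": (15, 24),
    "memory_kb": (13, 11),
}


def point_difference(resource: str) -> int:
    """MicroBlaze utilization minus PowerPC440 utilization, in points."""
    ppc, mb = REPORTED[resource]
    return mb - ppc


if __name__ == "__main__":
    for name, (used, available) in OCCUPIED_SLICES.items():
        print(name, "occupied slices:", utilization_percent(used, available), "%")
    for resource in REPORTED:
        print(resource, "difference:", point_difference(resource), "points")
```

A positive difference means the MicroBlaze™ system uses a larger share of that resource; the negative value for memory reflects the larger block memory of the PowerPC™440 system.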
VII. COMPARISON OF DEVICE UTILIZATION CONSUMED BY THE DESIGNED EMBEDDED POWERPC™440 AND MICROBLAZE™ PROCESSOR SYSTEMS

In this section, the Xilinx platform studio (XPS) synthesis and Xilinx ISE™ device utilization reports generated by the XPS and Xilinx ISE™ are summarized and used to deduce and compare the FPGA hardware resource consumption for creating the PowerPC™440 and the MicroBlaze™ embedded processor systems. The XPS synthesis report summary is shown in TABLE II whereas the Xilinx ISE™ device utilization summary is shown in TABLE III.

From the XPS synthesis results of TABLE II, it is obvious that the MicroBlaze™ consumes more FPGA hardware resources when compared to the embedded PowerPC™440 processor system. For example, the PowerPC™440 used only 2 flip flops to implement the ppc440_0_wrapper, whereas the MicroBlaze™ used 1,375 to implement the microblaze_0_wrapper, which increases hardware cost. Also, the DDR2 SDRAM (ddr2_sdram_wrapper) implementation for the PowerPC™440 processor system consumes 2,355 flip flops against the 3,458 flip flops required by the MicroBlaze™ processor system, which invariably increases hardware cost. Although the debug module is implemented in the silicon of the PowerPC™440 hard processor core, the MicroBlaze™ processor system requires a comparatively small 119 flip flops (mdm_0_wrapper) to realize the same logic operation. On the other hand, the PowerPC™440 utilized 255 and 138 flip flops to implement the xps_bram_if_cntlr_1_wrapper and the plb_v46_0_wrapper respectively, as against the 150 flip flops required by the MicroBlaze™ processor system to implement the mb_plb_wrapper. On average, all other hardware consumption by both embedded processor systems is comparable, as can be observed in TABLE II.

The Xilinx ISE™ device utilization report summary of TABLE III shows that the main processing engine of the MicroBlaze™ processor system may have been built from three high-performance DSP48E multipliers with a significant 6,740 look-up table (LUT) flip-flop pairs. Also, the share of slices occupied by the MicroBlaze™ processor system exceeds that occupied by the PowerPC™ processor system by 9 percentage points (24% against 15%). Furthermore, the utilization of slice registers and LUTs in the embedded MicroBlaze™ processor system design exceeds that of the PowerPC™440 processor system design by 6 and 3 percentage points respectively. It can be observed that the embedded PowerPC™440 processor design required an additional 22 route-thrus for routing and about 2% more of the available memory (13% against 11%) to build the memory.

VIII. CONCLUSION AND DISCUSSIONS

The importance of embedded processors in FPGA embedded systems has been examined and discussed. For completeness, the advantages and disadvantages of FPGA embedded processor systems when compared to off-the-shelf microprocessors, microcontrollers, digital signal processors (DSPs) and application specific integrated circuits (ASICs) have also been critically examined and compared. The challenges and drawbacks of designing FPGA embedded processor systems from hardware and software viewpoints have been highlighted and discussed. Substantive discussions on the IBM PowerPC™440 hard processor and the Xilinx MicroBlaze™ soft processor core have been presented and references to detailed discussions on these two processor types have also been given.

Important and critical design considerations as well as comprehensive hardware/software co-design techniques for FPGA embedded processor systems design have been presented and used to design two efficient embedded processor systems, namely an embedded hard-core PowerPC™440 and a soft-core MicroBlaze™ processor system. Both embedded processor systems have been implemented and tested on a Xilinx Virtex-5 FX70T ML507 FPGA development board. The evaluation of the DMIPs (Dhrystone million instructions per second) on the two designed embedded processor systems showed that the designed embedded PowerPC™440 processor system is 8.95% lower than the result reported by Xilinx, whereas the designed embedded MicroBlaze™ processor system is 14.71% lower than that reported by Xilinx for the DMIPs benchmark test. Furthermore, the embedded PowerPC™440 processor system consumed fewer FPGA resources when compared to the embedded MicroBlaze™ processor system.

Based on the DMIPs performance results, the embedded PowerPC™440 processor system appears to be a suitable choice for implementing real-time FPGA embedded processor systems for time-critical applications, owing to its higher operating frequency. However, work has already started on FPGA development and implementation of adaptive neural network identification algorithms. Because only the basic intellectual property (IP) cores were used for both embedded processor systems design (see Fig. 5 and Fig. 9), 1) FPGA implementation of adaptive neural network identification algorithms and 2) a complete FPGA-in-the-loop implementation of a computationally intensive adaptive model predictive control algorithm on both embedded processor systems as co-processors are currently being explored for further performance verification.

REFERENCES

[1] V. A. Akpan (Nov., 2009), “FPGA Embedded Systems Design Technologies: with an Overview of Xilinx Systems Design Tools”, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece, pp. 1 – 31. [Online] Available: http://users.auth.gr/~iosamar/technicalreports.htm.

[2] V. A. Akpan, “Model-based FPGA embedded-processor systems design methodologies: Modeling, syntheses, implementation and validation”, African Journal of Computing & ICT, To Appear in vol. 5, no. 1, 2012, pp. 1 – 26. http://www.ajocict.net/uploads/Final_Akpan__Model-Based_FPGA_-_2011_.pdf.

[3] E. Monmasson, L. Idkhajine, M. N. Cirstea, I. Bahri, A. Tisan and M. W. Naouar, “FPGAs in industrial control applications”, IEEE Transactions on Industrial Informatics, vol. 7, no. 2, pp. 224 – 242, May 2011.

[4] A. Malinowski and H. Yu, “Comparison of embedded system design for industrial applications”, IEEE Transactions on Industrial Informatics, vol. 7, no. 2, pp. 244 – 254, May 2011.

[5] P. Meloni, S. Secchi and L. Raffo, “An FPGA-based framework for technology-aware prototyping of multi-core embedded architectures”, IEEE Embedded Systems Letters, vol. 2, no. 1, pp. 5 – 9, Mar. 2010.

[6] S. Kilts, “Advanced FPGA Design: Architecture, Implementation, and Optimization”, New Jersey, USA: John Wiley & Sons, 2007.

[7] Altera Inc, U.S.A. www.altera.com.

[8] Xilinx Inc, U.S.A. www.xilinx.com.

[9] J. Bier, “Give FPGAs embedded nod”, Embedded Systems Conference, Silicon Valley, USA, April, 2006, pp. 1 – 2.

[10] B. H. Fletcher, “FPGA embedded processors: Revealing true system performance”, Embedded Systems Conference, San Francisco, 2005, pp. 1 – 18.

[11] J. Young and B. Machesney, “Reality check: A guide to understanding optimized processor core”, Predictable Success, Synopsys Inc., White Paper, September 2011, pp. 1 – 10.

[12] V. A. Akpan (Jul., 2011), “Development of new model-based adaptive predictive control algorithms and their implementation on real-time embedded systems”, Ph.D. Dissertation, Department of Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece, pp. 1 – 517. [Online] Available: http://invenio.lib.auth.gr/record/127274/files/GRI-2011-7292.pdf

[13] PowerPC 440x6 Embedded Processor Core. User’s Manual, v7.1, September 29, 2010, pp. 1 – 601. https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/2D417029AE3F3089872570F8006D4E99

[14] IBM PowerPC 440 Core: A high-performance, superscalar processor core for embedded application. IBM Microelectronics Division, Research Triangle Park, NC. September 19, 1999, pp. 1 – 18. https://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/F72367F770327F8A87256E63006CB7EC

[15] Embedded Processor Block in Virtex-5 FPGAs. Reference Guide, v1.7, Oct. 6, 2010, pp. 1 – 347. http://www.xilinx.com/support/documentation/user_guides/ug200.pdf

[16] MicroBlaze Processor Reference Guide: Embedded Development Kit (EDK), v11.0, Apr. 19, 2010, pp. 1 – 210. http://www.xilinx.com/support/documentation/sw_manuals/xilinx12_1/mb_ref_guide.pdf

[17] Virtex–5 FXT FPGAs Documentations (2010). http://www.xilinx.com/products/virtex5/fxt.htm.

[18] M. Alden, “Xilinx extends platform FPGA performance with award winning MicroBlaze soft processor”, Fall Microprocessor Forum, San Jose, October 9, 2006, Xilinx Press Release #0695. http://www.xilinx.com/prs_rls/2006/embedded/0695microblaze5.htm.

[19] D. Pellerin and R. Bodenner, “Using FPGAs as coprocessors for DSP and image processing”, Embedded Training Program, Embedded Systems Conference, ESC-263, San Jose – USA, 2007, pp. 1 – 13.

[20] S. Guccione, “List of FPGA-based computing machines”, [Online] Available: http://www.io.com/~guccione/HW-list.html.

[21] Xilinx Inc., Data2MEM: User Guide, UG658, Version 1.0, April 27, 2009, pp. 1 – 44.

ABOUT AUTHORS

Vincent A. Akpan obtained the B.Sc. degree in Physics from the Delta State University (DELSU), Abraka, Nigeria in 1997; a Master of Technology (M.Tech.) degree in Instrumentation from The Federal University of Technology, Akure (FUTA), Nigeria in 2003; and a Ph.D. degree in Electrical & Computer Engineering from the Aristotle University of Thessaloniki (AUTH), Thessaloniki, Greece in 2011.

Between 1998 and 2004, he was a Graduate Assistant with the Department of Physics, DELSU. Since 2004 he has been with the Department of Physics Electronics, FUTA where he is currently Lecturer II. His integrated research interests include: computational intelligence, system identification, adaptive predictive control, real-time embedded systems, signal processing & machine vision, instrumentation, and Robotics & Automation. He is the co-author of a book and has authored or co-authored more than 25 articles in refereed journals and conference proceedings. He is a regular reviewer in 6 international scientific/academic journals and several IEEE sponsored conferences. He is also an editorial board member for the American Journal of Intelligent Systems.

Dr. Akpan is a member of the IEEE, USA and The IET, UK. He was one among two Nigerian recipients of the 2005/2006 Greek State Scholarship (IKY) for a Ph.D. programme tenable in Greece.