


An FPGA Implementation of a Flexible, Parallel Image Processing Architecture

Suitable for Embedded Vision Systems.

Conference Paper · January 2003

DOI: 10.1109/IPDPS.2003.1213415 · Source: DBLP



An FPGA Implementation of a Flexible, Parallel Image Processing Architecture
Suitable for Embedded Vision Systems

Stephanie McBader1, Peter Lee2

NeuriCam S.p.A, Via S M Maddalena 12, 38100 Trento, Italy
University of Kent at Canterbury, Canterbury, Kent, CT2 7NT, UK

Abstract. This paper describes the design of a programmable parallel architecture that is to be used for signal pre-processing in intelligent embedded vision systems. The architecture has been implemented and tested using a Celoxica RC1000 Prototyping Platform with a Xilinx XCV2000E FPGA. The system operates at a clock rate of 50 MHz and can perform pre-processing functions such as filtering, correlation and transformation on an image of 256x256 pixels at up to 667 frames/s.

1. Introduction

Recent advances in semiconductor technology have now made it possible to design complete embedded systems on a chip (SoC) by combining sensor, signal processing and memory onto a single substrate. This level of integration is opening up new applications that, in the past, have not been practically realisable. One such application area is in the design of embedded vision systems [1]. Compact vision systems are suitable for real-time applications such as vehicle detection [2] and security systems. These applications require a lot of processing power, typically in the range of billions of operations per second [3].

Data is acquired using a 256x256 CMOS digital camera [4]. After acquisition, a significant portion of processing is required in the pre-processing phase prior to feature extraction, classification and reaction. Most pre-processing algorithms, such as filtering, edge extraction and transformation, usually require a series of repetitive computationally intensive operations that are often characterised by fine grain parallelism. As such, they are often inefficiently performed on sequential machines [5] and are frequently implemented using a parallel array of processors [6,7].

This paper presents a novel parallel processing architecture which combines the flexibility of general-purpose machines, speed of DSPs, small-size and low-power performance of application-specific cores in a single, balanced platform specifically tailored to serve image processing operations. It describes the architecture and performance of these processors when implemented as a prototype on a Xilinx XCV2000E FPGA, prior to realisation in a complete system-on-a-programmable-chip. The paper will begin with a brief overview of the parallel architecture, followed by a description of the implementation and its use in an example application.

2. Architecture Overview

With the advent of embedded vision systems, a novel flexible system-on-chip architecture has been proposed to handle image acquisition and pre-processing. This architecture is intended to relieve the host processor, whether on-chip or remote, from performing repetitive tasks of high computational requirement that are better achieved using parallel architectures.

A typical vision system contains the processing layers illustrated in figure 1. An acquisition layer controls the sensor interface and pixel addressing, and passes source pixels to the pixel pre-processing layer, which, in turn, performs corrections such as noise reduction and compensation. A DMA channel will then address regions of interest in the image for pre-processing. The image pre-processing layer prepares the Area-of-Interest (AOI) for feature extraction and classification by applying programmed algorithms on its data. An object classifier can then pick up the pre-processed image from memory,

0-7695-1926-1/03/$17.00 (C) 2003 IEEE

Fig. 1. Layered Architecture (block labels: Optical Sensor, Acquisition Layer, Pixel Pre-Processing, DMA Channel, Image Preprocessing, Image Memory, Classifier/Actor, To Host)

Fig. 2. Image Pre-Processor (block labels: From Sensor, Original Buffers, Temp. Buffers, DMA Channel, Image Processing Array 0 to 15, Program Memory & Main Controller, Shifter, O/P Sequencer, Image Memory)

Fig. 3. Image Processing Element (block labels: Pixel/Address FIFO (256 words) fed from the DMA, Instruction Memory 256x16 with Instr_Addr and Wr_Instr, Coefficient Memory with Coeff_Addr and Wr_Coeff, IPE Controller with ACK and control signals, two 16x32 Register Files, ALU & Muxes, 32-bit Result Reg., Pixel Address)
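As a rough illustration of the datapath in Fig. 3, the sketch below models one processing element applying a 3x3 mask to a pixel window: 16-bit two's complement operands are multiplied and accumulated into a 32-bit result register. The function names, the wrap-around overflow behaviour and the example mask are assumptions made for illustration only; the paper does not list the instruction set or say how overflow is handled.

```python
def to_signed(value, bits):
    """Wrap an integer into the two's complement range of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def ipe_mac(pixels, coeffs):
    """Multiply-accumulate 16-bit operands into a 32-bit accumulator.

    Assumption: the accumulator wraps on overflow (no saturation).
    """
    acc = 0
    for p, c in zip(pixels, coeffs):
        product = to_signed(p, 16) * to_signed(c, 16)
        acc = to_signed(acc + product, 32)
    return acc


def window_mac_3x3(image, mask, row, col):
    """Apply a 3x3 mask to the window centred on (row, col).

    The mask is applied without flipping, so this is a correlation-style
    sum; for symmetric masks it equals the convolution result.
    """
    window = [image[row + dr][col + dc]
              for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    flat_mask = [m for mask_row in mask for m in mask_row]
    return ipe_mac(window, flat_mask)


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    laplacian = [[0, -1, 0],
                 [-1, 4, -1],
                 [0, -1, 0]]
    print(window_mac_3x3(image, laplacian, 1, 1))
```

In the architecture described here, the mask values would sit in the coefficient memory and the window pixels would arrive through the pixel FIFO; the sketch collapses both into plain Python lists.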

and act upon the information received. This paper will concentrate on the architecture of the image pre-processing aspect of the system, which comprises the DMA channel and a parallel array of 16 processing elements, detailed in figure 2.

The DMA Channel addresses the source AOI according to a set of 24 addressing modes which were chosen to cover the most commonly used image processing algorithms (e.g., windowing, correlation). The source frame may either be the output of the sensor, or a temporary buffer to which a previous output of the image pre-processing layer was stored. The DMA channel then distributes the source pixels to the Image Processing Array, which comprises 16 identical Processing Elements, each operating on a set of source pixels in accordance with a programmable algorithm. The DMA channel is designed in such a way as to detect overlapping regions, e.g., in adjacent 3x3 windows, in order to minimise the need for redundant pixel reads. The pre-processed image resulting from the parallel array is pooled into an Image Memory accessible to the host processor, which can then extract the information needed


for classification and taking a subsequent course of action in response.

The image pre-processing architecture communicates with its host through a program memory. The host processor sends a block of control and data manipulation instructions to the pre-processor’s program memory, and awaits feedback through status reporting from the main controller before picking up data from the image memory.

Figure 2 also indicates the presence of a shifter, which can be configured to compensate for scaling factors or simply to normalise the output of the array. An output sequencer multiplexes between the outputs of the processing elements, and provides the handshake signals necessary to confirm data delivery.

3. The Image Processing Element

The Image Processing Array is composed of 16 identical processing elements. Each element can be thought of as a small DSP specifically intended for image processing algorithms. The processing element is built upon a 16-bit input, 32-bit output datapath, and a RISC-like instruction set composed of 15 instructions.

Figure 3 illustrates the structure of the Image Processing Element (IPE), which operates on two’s complement 16-bit data and produces a 32-bit output. The IPE receives its data manipulation instructions from the main controller, and operates on pixels stored in its local memory according to the decoded instructions. It also comprises a small coefficient memory which can hold multiplication coefficients, convolution and correlation masks as well as matrix constants. It has two register files which may be used for temporary storage during computation. Once the algorithm execution has been completed, the IPE makes the data and target address available on its output, and informs the output sequencer of its readiness.

4. Implementation

The architecture described in the previous section was designed as a soft IP core using VHDL, prior to being embedded with a host processor on the same programmable device. To evaluate the performance of the system, the architecture was first implemented and tested using Celoxica’s RC1000-PP board [8]. This is made up of a single Virtex FPGA with extended memory capability (XCV2000E), and four external memory banks used for frame buffering. The architecture connects directly to a 256x256-pixel sensor via external I/O, and communicates with the host processor via a PCI bus available on-board.

The verification model of the architecture was synthesised using Synopsys FPGA Express synthesis tools and Xilinx Foundation ISE for Place and Route. Technology mapping and resource estimation as well as processing performance measurements are listed in tables 1-3.

6. Example Application

Figure 5 demonstrates an application which utilises the parallel processor on board the RC1000-PP to pre-process vehicle images for numberplate recognition. The parallel processor removes the image background and unwanted details. It therefore prepares the image for upper layers to locate the plate position, before ‘cutting out’ characters and passing them on to a neural network for classification.

At 50 MHz clock frequency, the parallel processor implementation on the FPGA can achieve a throughput of up to 125 Frames/s, whereas the original software application, which normally runs on a standard PC, achieves 50 Frames/s with a processor clock frequency of 266 MHz. The factor of 2.5 improvement over a CPU which is clocked at more than 5 times the FPGA clock frequency is mainly due to the parallelism of the architecture, and its optimised datapath.

7. Conclusion & Further Work

Rapid implementation of parallel structures based on FPGAs using VHDL proves to be a very efficient, cost-effective and attractive methodology for design verification. New multi-million gate FPGAs [9] with extended memory and fast I/O interfaces have made it possible to develop and test a large parallel architecture such as the one described in this paper. Future work will explore the possibility of integrating a host RISC processor into the system so as to complete the processing blocks needed for a complete embedded vision system. This will be an ideal use of System-on-a-Programmable-Chip (SOPC) technology [10], where the host processor is implemented either as a soft or hard core on a high-density Field Programmable Device with sufficiently large amounts of on-chip memories and advanced interfaces.

Acknowledgements

This work is funded by the European Commission’s Marie Curie Host Fellowship contract number HPMI-CT-199-00055 with NeuriCam S.p.A., Italy.


Fig. 4. The complete vision system (block labels: Host Processor (PC), Memory Arbiter & Switches, Memory Banks 0, 1, 2, 3, Image Pre-Processing Architecture)

Performance Characteristics
Max Clock Frequency                      50 MHz
Power Consumption (XPower Estimations)   2.8 W
Processing Array                         16 IPEs
Image Size                               256x256 pixels
External Memories                        5 Mbits
Peak Performance                         3.23 GOPS
Convolution 3x3                          88.42 Frames/s
Median Filter                            62.42 Frames/s
Thresholding                             666.67 Frames/s
3-Tap FIR Filter                         217.39 Frames/s
15-Tap FIR Filter                        58.07 Frames/s
Correlation 8x8 (64x64 AOI)              136.80 Frames/s
Forward DCT                              70.82 Frames/s
Table 1. Performance Characteristics

Unit                FPGA Resources
DMA Channel         2,392 Slices
IPE                 790 Slices
IPA (16 IPEs)       12,967 Slices
Barrel Shifter      101 Slices
Output Sequencer    36 Slices
Complete System     16,083 Slices (83%)
                    128 RAM Blocks (80%)
Table 2. FPGA Resources

Phase             Time
Design Effort     8 Months
Implementation    2 Months
Synthesis         2.5 Hours
Place & Route     3 Hours
Table 3. Design Efforts
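The frame rates in Table 1 can be converted into an approximate clock-cycle budget per pixel, which makes the relative cost of each algorithm easier to compare. The sketch below assumes full 256x256 frames at the 50 MHz clock for every entry except the correlation, which Table 1 reports for a 64x64 AOI. The helper name `cycles_per_pixel` is hypothetical, and the figures are aggregate values for the whole array, including any control or memory overhead, not per-IPE instruction counts.

```python
CLOCK_HZ = 50_000_000          # maximum clock frequency from Table 1
FRAME_PIXELS = 256 * 256       # full image size from Table 1
AOI_PIXELS = 64 * 64           # area of interest used for the correlation entry


def cycles_per_pixel(frames_per_second, pixels=FRAME_PIXELS, clock_hz=CLOCK_HZ):
    """Average number of clock cycles spent per pixel at a given frame rate."""
    return clock_hz / (frames_per_second * pixels)


# (frames/s, pixels per frame) taken from Table 1
measurements = {
    "Convolution 3x3": (88.42, FRAME_PIXELS),
    "Median Filter": (62.42, FRAME_PIXELS),
    "Thresholding": (666.67, FRAME_PIXELS),
    "3-Tap FIR Filter": (217.39, FRAME_PIXELS),
    "15-Tap FIR Filter": (58.07, FRAME_PIXELS),
    "Correlation 8x8": (136.80, AOI_PIXELS),
    "Forward DCT": (70.82, FRAME_PIXELS),
}

if __name__ == "__main__":
    for name, (fps, pixels) in measurements.items():
        print(f"{name:<18} {cycles_per_pixel(fps, pixels):8.2f} cycles/pixel")
```

Under these assumptions, thresholding works out at roughly 75,000 clock cycles per frame, or a little over one cycle per pixel across the whole array.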


Fig. 5. Example Application
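The throughput comparison quoted for this application (125 Frames/s on the FPGA at 50 MHz against 50 Frames/s in software at 266 MHz) can be checked with a few lines of arithmetic. The function names below are hypothetical, and the "frames per clock cycle" ratio is an extra derived figure that the paper itself does not state.

```python
def speedup(fpga_fps, cpu_fps):
    """Raw throughput ratio between the two implementations."""
    return fpga_fps / cpu_fps


def clock_ratio(cpu_hz, fpga_hz):
    """How much faster the CPU is clocked than the FPGA design."""
    return cpu_hz / fpga_hz


def per_cycle_advantage(fpga_fps, fpga_hz, cpu_fps, cpu_hz):
    """Ratio of frames produced per clock cycle (FPGA relative to CPU)."""
    return (fpga_fps / fpga_hz) / (cpu_fps / cpu_hz)


if __name__ == "__main__":
    print(f"throughput speedup : {speedup(125, 50):.2f}")
    print(f"clock ratio        : {clock_ratio(266e6, 50e6):.2f}")
    print(f"per-cycle advantage: {per_cycle_advantage(125, 50e6, 50, 266e6):.2f}")
```

The per-cycle figure is simply the product of the throughput speedup and the clock ratio, which is one way to express how much of the gain comes from parallelism and datapath design and not from clock speed.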


1. F. Paillet, Design Solutions and Techniques for Vision System on a Chip and Fine-grain Parallelism Circuit Integration, Workshop
at the IEEE ASIC / System On Chip Conf., Washington DC, USA, 2000.
2. M. Betke et al, Real-time multiple vehicle detection and tracking from a moving vehicle, Machine Vision and Applications, 12(2),
2000, 69-83.
3. N Yamashita et al, A 3.84 GIPS Integrated Memory Array Processor with 64 Processing Elements and a 2-Mb SRAM, IEEE J. of
Solid-State Circ., 29(11), 1994, 1336-1343.
4. NeuriCam, NC1802 Pupilla 640x480-pixel Digital Camera, Datasheet Preliminary Rel. 11/2001. www.neuricam.com
5. P. Athanas & A. Abbott, Addressing the Computational Requirements of Image Processing with a Custom Computing Machine: An
Overview, Workshop on Reconfigurable Architectures, IPPS '95, 1-15, 1995.
6. U Ramacher et al, A 53-GOPS Programmable Vision Processor For Processing, Coding-Decoding And Synthesizing of Images,
Proc. 27th Eur. Solid-State Circ. Conf., Villach, Austria, 2001, 160-163.
7. T Minami et al, A 300-MOPS Video Signal Processor with a Parallel Architecture, IEEE J. of Solid-State Circ., 26(12), 1991, 1868-
8. Celoxica, RC1000 Product Information Sheet, www.celoxica.com
9. Xilinx, Virtex 2000-E Datasheet, www.xilinx.com
10. Pat Mead, Investigating the Reality of System-On-a-Programmable-Chip, FPL 2001.
11. Anil Jain, Fundamentals of Digital Image Processing (New Jersey: Prentice Hall, 1989).
