
Towards a Programmable Image Processing Machine on a Spartan-3 FPGA

Thomas Anthony Gartlan, B.A., B.A.I., M.Sc.

Athlone Institute of Technology, Athlone, Ireland
July 2006

Submitted in part-fulfilment of the degree Master of Science (Advanced Engineering Techniques)

Declaration
I declare that this thesis, unless otherwise stated, is entirely my own work and that it has not been submitted at this or any other University as an exercise for a degree.

Signed _________________________ Date _________________

Acknowledgements
I would like to thank Dr Fearghal Morgan, NUI Galway, my supervisor, for his advice, pointers and discussion at crucial times during this project. Thanks also to him for the project idea and the equipment made available to me. A special thanks to my beautiful wife Eleanor for her patience and understanding, whilst I abandoned ship for a little while every day to write up this thesis.

Abstract
This project investigates the design and implementation of image processing algorithms on an FPGA. The FPGA is part of an overall system, developed by a team guided by Dr Fearghal Morgan, that consists of a development board produced by Digilent and user interface software produced by students at NUI Galway. The image processing algorithms designed and implemented in this project are a set of point operations, such as modifying brightness and contrast; Prewitt edge detection, as an example of neighbourhood operations; and the morphological operations dilation and erosion. In addition, the theory of warping was investigated and a foundation for future development in this area is presented.

Acronyms
CAD    Computer Aided Design
CSR    Control Status Register
DSP    Digital Signal Processing
FPGA   Field Programmable Gate Array
JPEG   Joint Photographic Experts Group
JTAG   Joint Test Action Group
LAN    Local Area Network
LED    Light Emitting Diode
MPEG   Moving Picture Experts Group
NUIG   National University of Ireland, Galway
PC     Personal Computer
PCB    Printed Circuit Board
USB    Universal Serial Bus
VHDL   Very High Speed Integrated Circuit Hardware Description Language

List of Figures
Figure 1  User Interface for the AppliedVHDL system
Figure 2  Binary Image File
Figure 3  8-bit Intensity image (4*4)
Figure 4  A 256 Colour Bitmap (Indexed) image after the Black and White attribute is set
Figure 5  Colourmap file if Black and White attribute is not set
Figure 6  Correct colourmap for a greyscale image
Figure 7  Different size greyscale images
Figure 8  180*180 Black and white
Figure 9  Various Point Operations represented graphically (Burdick, Digital Imaging)
Figure 10 2-D Kernel is passed over the input image during convolution
Figure 11 Examples of 3*3 Kernels
Figure 12 2-D Kernel is passed over input image during Erosion
Figure 13 Kernel for Erosion using 4-connectivity
Figure 14 Overview of DSP block architecture
Figure 15 Main cycles of DSP Block Version 1
Figure 16 Main Cycles of DSP Block Version 2
Figure 17 DSP Block Architecture
Figure 18 DSP block main State Machine
Figure 19 Inputs of the Point Operations block
Figure 20 Outputs of the Point Operations block
Figure 21 Architecture of the Point Operations block
Figure 22 Original Greyscale image used to illustrate point operations (eye_32*32_grey.bmp)
Figure 23 Processed images from various point operations
Figure 24 Part 1 of the Architecture of the Kernel3_img_proc block
Figure 25 Timing of DataEn and Pixel_En signals
Figure 26 Part 2(a) of the Architecture of the Kernel3_img_proc block
Figure 27 Part 2(b) of the Architecture of the Kernel3_img_proc block
Figure 28 Part 3 of the Architecture of the Kernel3_img_proc block
Figure 29 Original eye 32*32 pixels image
Figure 30 Full Edge Detection with different threshold values
Figure 31 Horizontal Edge Detection with a sensitive value of CSR_7
Figure 32 Vertical Edge Detection
Figure 33 Black and white image received from NUI (NUIGImage1_180*180.bmp)
Figure 34 Full, horizontal and vertical Edge Detection on black and white image (180*180)
Figure 35 Original image developed to demonstrate Erosion
Figure 36 Performing Erosion
Figure 37 Image developed to demonstrate usefulness of dilation
Figure 38 Result of one dilation
Figure 39 Result of Erosion after Dilation
Figure 40 Illustration showing how line buffers are implemented
Figure 41 Core Generator used to generate efficient shift registers based on RAM
Figure 42 Source and Destination Image for Table 10
Figure 43 Backward Transform for 90-degree rotation around centre pixel of 15*15 image
Figure 44 Flowchart for Warping algorithm
Figure 45 Use to replace shaded section in earlier flowchart, if Bilinear Interpolation is used

List of Tables
Table 1  Neighbourhood and Morphological Image Processing functions added to the system
Table 2  Point Image Processing Operations added to the system
Table 3  DSP Block Inputs
Table 4  DSP Block Outputs
Table 5  Internal Signals of the DSP block
Table 6  Pixel values of the original image (subset (8:12,3:7))
Table 7  Pixel values of the brightened image (subset (8:12,3:7))
Table 8  Types of Transformations
Table 9  Various Affine Transformations
Table 10 Forward Transform example for 90 degree rotation around the origin
Table 11 Forward Transformation example for 60 degree rotation around the origin
Table 12 Backward Transform for 90-degree rotation around centre pixel of 15*15 image

Contents
Chapter 1  Introduction
  Overview
  Summary of objectives
  Report Organisation
Chapter 2  Review of Current System
  Introduction
  Overview of the Current System
  Limitations of the Current System
  Conclusion
Chapter 3  Image processing algorithms
  Introduction
  Point Operations
  Neighbourhood Operations
  Morphological Operations
  Summary of New Image Processing functions added to the system
  Conclusion
Chapter 4  The New DSP controller block and image processing sub-blocks
  Introduction
  The DSP Controller, DSPblk.vhd
  The Point Operations sub-block (pixel_img_proc.vhd)
  The Neighbourhood & Morphological Operations sub-block (kernel3_img_proc.vhd)
  Conclusion
Chapter 5  Implementation of Line Buffers
  Introduction
  Method 1: Synthesising the Line Buffers using VHDL code
  Method 2: Creating the Line Buffers using the CORE Generator
  Conclusion
Chapter 6  Warping and Morphing
  Introduction
  Basic Theory and Applications
  Detailed look at the Affine transformation Rotation
  Implementation of Warping Algorithms
  Conclusion
Chapter 7  Conclusion

CHAPTER 1

INTRODUCTION
Overview
Image processing is an area of growing significance. Until recently, electronics technology in the area of imaging mainly focused on the capture and delivery of images in the form of analogue television. However, in the last five years, as predicted by Moore's law, the advent of relatively cheap and compact memory has seen digital images become viable for consumer products such as cameras and mobile phones. Digital cameras have now almost completely replaced their analogue ancestors. This is purely a consequence of the availability of cheap, high-density memory, since this allows the technology to overcome the biggest problem that digital images have always presented: their large memory requirements. If we take a single 4*6 inch print and scan it at 400 ppi (pixels per inch), the resultant digital image will have 1600*2400 pixels. If we want colour, then the memory required to store this image is 11 MB; a greyscale image requires slightly less, at 3.67 MB [1]. Continuing this line of thought, we need almost 250 MB to store just over 20 images. Ten years ago this amount of memory was only available on hard disks. Video involves rapidly changing images and has even higher memory requirements: for a colour image of modest size, 640*480 at 25 frames per second, we need 23 MB for every second of video. However, advances in technology have allowed even this formidable hurdle to be overcome, and we now have small digital video cameras with Flash memory replacing their large, outdated analogue equivalents.

Advances in memory technology have allowed us to capture digital images and video cheaply, but this in itself would be of no use if we did not have the bandwidth to transport the large amounts of digital data created from one place to another, for example from a digital camera to a PC or across the internet. Solving this communication problem is a combination of technology and intelligence: technology gives us fibre optic cables and higher bit rates across USB and FireWire cables, while intelligence gives us coding standards such as JPEG and MPEG. We have become accustomed to moving images across the greatest network of all, the internet. Over the next few years, as broadband technology and coding standards evolve, we will see more and more high-quality video on the internet and on LANs.

So what now? We can capture digital images and video with ease, and we can shuttle them around quickly from one location to another. However, can we interpret them in a meaningful and intelligent way? Interpreting images requires sophisticated algorithms that allow us to gather useful information from them. The most obvious applications to date have been in the area of security, for example face recognition at airports to hinder known terrorists, or eye recognition to validate authorised personnel. Google have strived recently to allow one to use images to search the internet. This is only the beginning: one does not have to dwell too long to think of dozens of image applications, if only a system could take an image and garner information from it in much the same way as the brain does automatically. Various other image and video applications in the areas of medicine, the military and the studio are enumerated in [2] and [3].

The algorithms that will allow us to do this will require large amounts of computing power working on very large amounts of data. If algorithms can be broken down so that some of the tasks can be completed in parallel, then great savings in computing time can be made. This is where FPGAs come in. An FPGA is essentially programmable hardware that allows us to implement image processing algorithms in hardware as opposed to software. This lets us take advantage of the inherent parallelism of hardware in implementing these computing-intensive algorithms [3], [4] and [5].

A system has been developed by NUI Galway that allows transfer of images between a host PC and on-board SRAM via a Xilinx Spartan-3 FPGA. The system provides a versatile user-defined dspblk element within the FPGA. This element includes memory read/write access and is programmable from the host using control registers. The aim of this masters project is to design, simulate and implement (in the dspblk) some common image processing algorithms to process these images within the FPGA. The system then allows the processed images to be uploaded to the host PC for viewing.

The FPGA used is a Spartan-3, developed by Xilinx [6][7]. The board on which the FPGA is placed is called the Digilent Spartan-3 Starter Board Rev E [8]. This board also has a power supply and regulation, interface sockets to allow other boards to be connected, switches, LEDs, flash memory and a serial port connection. The serial port connection is used to transfer data and images to and from the PC; a separate parallel-to-JTAG connection is used to transfer configuration data to the FPGA. Students at NUIG have developed hardware, via a VHDL [9] design flow and the Xilinx ISE CAD tools, that configures the FPGA to allow images to be transferred between the PC and the on-board memory. In addition, Visual Basic was used to develop the application software that allows the user to interact with the FPGA board. The entire system is well explained and documented in [10] and [11].

Summary of objectives
This project, in general terms, looks at the implementation of some established image processing algorithms on a Spartan-3 FPGA. The image processing algorithms considered fit, broadly speaking, into three main categories: point, neighbourhood and morphological operations. Point operations involve modifying each individual pixel value based on its current value, independently of all other pixel values. Neighbourhood operations involve modifying each individual pixel value based on some computation performed on a number of the surrounding pixel values as well as its own value. Both point and neighbourhood operations alter the image. Morphological operations also depend on algorithms that involve neighbouring pixels, but these operations are used to understand the form or structure of an image, as opposed to altering the image. More specifically, the project's main objectives are:

1. Review the current system setup, in terms of, first, understanding its operation and, second, highlighting its limitations from the point of view of implementing image processing algorithms.
2. Research image processing algorithms with a view to selecting the most appropriate to implement on the system available.
3. Design the selected image processing algorithms and implement them on the current system.
4. Make recommendations for changes to the current system architecture, in order to better cope with the varied demands of image processing algorithms.

Report Organisation
Chapter 2 reviews the current system hardware and software. The system is scrutinised to determine its operation and limitations. Since the operation is well discussed elsewhere, the bulk of the focus in this chapter is on the limitations of the system. While acknowledging the excellent work done so far in developing the system via undergraduate and postgraduate projects, it is only with an understanding of the system's limitations that the design work here can begin. In addition, a critical analysis here will benefit any future work regarding improvements.

Chapter 3 introduces the theory of the image processing algorithms considered here: point, neighbourhood and morphological operations. In each case it is stated, without yet showing how, which particular operations are implemented. Two very useful tables are presented in this chapter that show the user what register values are required to select specific functions.

Having discussed the theory in chapter 3 and decided on the specific algorithms to implement, chapter 4 shows how the image functions were designed and implemented. First to be discussed is the new design of the DSP controller block, which allows for heavy pipelining of some image processing tasks. Then it is shown how two new sub-blocks have been created to cater for the separate categories of image processing tasks.

Chapter 5 focuses on one specific design feature, the line buffers, which involved a large amount of research and experiment. Line buffers are used to store one complete line of an image. Since, depending on the image width, they have the potential to use up a large amount of chip area on the FPGA, it was felt worthwhile to research the most efficient way to implement these area-hungry structures. The results are discussed.

Finally, chapter 6 discusses the theory of warping and morphing. Due to time constraints no actual design work occurred in this area. However, as a guideline for future projects, the realisation of warping algorithms is discussed and a flowchart for one specific design is presented.

CHAPTER 2

REVIEW OF CURRENT SYSTEM

Introduction
This chapter describes the salient features of the system as received from NUIG. First the architecture and operation are described; then the limitations are focused on, specifically with regard to implementing image processing algorithms. Since a detailed overview of the system architecture has been provided elsewhere, [10] and [11], little effort will be dedicated here to reiterating that work. However, a brief description of the design will be given, in order to give the reader a feel for the work to date. Of perhaps more significance, and not dealt with elsewhere, are the limitations of the current system. These limitations still exist, since it was not the purpose of this project to address them. However, it was important from the point of view of this project to identify them, since they have a direct impact on the results of the work carried out here, i.e. the implementation of image processing algorithms. Addressing some of these limitations is the subject of current projects, whereas solving other problems highlighted here may provide topics for future work.

Overview of the Current System


The system used for this project can, broadly speaking, be categorised into two main parts: hardware and software. The hardware part consists of a PCB (Printed Circuit Board) based on the Xilinx Spartan-3 FPGA and developed by Digilent; the board itself is referred to as the Spartan-3 Starter Board [8]. The FPGA design is developed using the VHDL design language and the Xilinx Integrated Software Environment (ISE) 6.3i software. In addition, simulations were performed using ModelSim XE II/Starter 5.8c. Work had already been done, by students of NUI Galway, towards developing an image processing system, and this is used as the starting point for this project. An overview of the FPGA design as it existed at the start of this project is given below; the design is well documented in [10] and [11].

The design on the FPGA is hierarchical. The top-level block is called AppliedVHDL. This block consists of three sub-blocks:

The main sub-block is called NUIProject. The bulk of the design is contained within this sub-block, which is described in more detail later.

The second top-level sub-block is the UART (Universal Asynchronous Receiver Transmitter). Its role is to communicate data between the PC and the NUIProject block. Its main job is to convert serial data to parallel data and vice versa, since data within the FPGA is byte wide, while data to and from the PC travels via the serial port on the PC and the JTAG interface on the board.

Last is the display controller, the dispCtrlr sub-block. It receives data from the NUIProject block and performs tasks such as binary to 7-segment code conversion, to allow data to be displayed on the board's 7-segment displays.

The NUIProject top-level sub-block is where the majority of the design resides. This block is in turn divided into the following sub-blocks:

The Memory controller, MemCtrlr, controls the flow of data to and from the on-board RAM [12]. It does this by generating all the relevant RAM control signals upon a request to read from or write to the RAM. This request can come from the IOCSR block or the DSP block. Data to/from the RAM is 32 bits wide.

The IOCSR block liaises between the UART on the one hand and the CSR and RAM on the other. The CSR (Control and Status Register) consists of 8 byte-wide registers. These registers allow the user to impart information to, or program, the system. The bottom 3 registers specify the RAM address if the user wishes to perform a read or write from/to a single RAM location. The next 2 registers allow the user to specify the number of RAM locations to use in a DSP operation; these two registers can therefore also be used to specify the image size. The top three registers were unused. In this project it was decided to use the top two unused registers, CSR_6 and CSR_7, to program the image processing function required. This is discussed in a later chapter.

The Data controller, DatCtrlr, bundles 8-bit data from the UART into 32-bit-wide data for writing to RAM, and in the opposite direction it unbundles the data. In addition, this block selects between RAM data and CSR data when sending data to the UART for transmission.

Finally, and most importantly from this project's point of view, is the DSP block. This block has been redesigned during the course of this project to perform image processing tasks on an image stored in RAM; the processed image is written to a different location in RAM. This block is discussed in great detail in chapter 4.

The user interface software used to download images and access the memory and the CSR registers was designed at NUI Galway. It was developed using the Visual Basic programming language. However, some 3rd-party translation programs (written in C) are also required, to translate bitmap images created in Microsoft Paint into greyscale images that can be downloaded to memory; more on this translation process later. The user interface program calls these translation programs automatically, so the user is isolated as much as possible from the translation process.

The user interface is shown in the figure below. Its main elements are:

- The CSR registers can be written and read here; data is entered in Hex format.
- Individual RAM locations can be written and read; addresses and data are both in Hex format.
- The size of the image in pixels is specified.
- A bitmap image is selected to be loaded into the lowest quadrant in memory (00000H).
- A second image is loaded into the second quadrant in memory (10000H); this is used if the processing task requires two images.
- The processed image is read from the last quadrant in memory (30000H).

Figure 1 User Interface for the AppliedVHDL system

The description above of the AppliedVHDL system is admittedly brief; however, a more detailed one can be found in [10] and [11]. Attention will now turn, in the next section, to the system's limitations. The purpose of the ensuing discussion is in no way to detract from the excellent work so far, but to cast a critical eye, to determine where problems might be encountered in the future and to point out areas where the system might be improved.

Limitations of the Current System


The starting system, as it was received for this project, has, in my opinion, a number of limitations. Some of these have been addressed by this project; some have been addressed elsewhere; others still exist. This section discusses these limitations from the point of view of the ultimate desired goal: a versatile, fast and efficient real-time image processing system. The limitations concern a) image size, b) image transfer, c) image translation software, d) user interface software, e) hardware architecture and f) image processing functions. A short discussion follows under each of these headings.

Image Size

All image sizes are specified in pixels. For example, an image of size 16*16 is square, with 16 pixels in the x-direction and 16 in the y-direction, giving 256 pixels in total. Each pixel value is at present represented by an 8-bit value and can hence be stored in a single byte location. There are currently a few obvious ways in which the image size the system can handle may be limited. The factors affecting image size are the memory, the FPGA and the application software.

The first is memory. Currently the amount of memory available is 4 quadrants of 256 Kbytes. If the storing of images is limited to one quadrant then the maximum image size is 512*512 pixels. Where the image processing function requires only one image, it is possible to use half the available memory for the input image and the other half for the processed image; the image size could then be approximately 725*725 pixels. This would, however, involve a redesign of the memory controller. If, on the other hand, the processing of colour images is required, then three times as much memory is needed for the same size of image, since each pixel has three 8-bit values associated with it, one for each of the three primary colours: Red (R), Green (G) and Blue (B).

The second factor that affects the size of image processed is the actual size of the FPGA, currently a Xilinx Spartan-3. During the course of the project it was discovered that, for some image processing functions, line buffers are required; in the minimum case, at least 3 line buffers are needed. A great deal of effort was spent investigating the implementation of these line buffers on the FPGA. Initially it was thought the line buffers could be synthesised by modelling them using VHDL code. This would have the advantage of a semi-flexible image width, since changing the image size would require the change of only one parameter, the width, before synthesis. The solution is only semi-flexible because, if the image width changes, the design has to be re-synthesised and downloaded. However, it was discovered that this synthesis option led to a very inefficient use of FPGA resources: the line buffers were implemented using flip-flops, and mapping fails when all of these are used up, so the maximum image width that could be accommodated this way was 128 pixels.

The eventual solution fixed on an image width of 180 and used the Xilinx CORE Generator to generate the line buffers. In this case the line buffers are implemented more efficiently, using RAM-based shift registers. Using this method the image width can be increased to approximately 320 pixels, after which mapping fails.

Finally, there is an image size limitation due to the application software. With the first version of the software received, the image size was limited to 8*8 pixels; when this size was exceeded the software crashed with a runtime error. A new version of the software was received and the image size limit is now 180*180 pixels. Images above this size once again cause the software to crash.

In conclusion, the current image size limit is 180*180 pixels, due to the user interface software. If this is fixed then the next expected limit would be 320*320, set by the FPGA size, due to the area used up by the line buffers. If this in turn were overcome, by choosing perhaps a Virtex FPGA, then it is anticipated that the next hurdle would be presented by the memory size.
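To illustrate the behavioural approach mentioned above, the following is a minimal sketch of a line buffer described in VHDL. The entity and signal names are illustrative only (they are not taken from the project code); depending on the tool and coding style, a buffer coded this way may be implemented in flip-flops, as was found here, rather than in the RAM-based shift registers produced by the CORE Generator.

  library ieee;
  use ieee.std_logic_1164.all;

  -- Hypothetical line buffer: one image line deep, one 8-bit pixel wide.
  entity line_buffer is
    generic ( IMG_WIDTH : integer := 180 );            -- image width in pixels
    port ( clk      : in  std_logic;
           pixel_en : in  std_logic;                   -- advance by one pixel
           din      : in  std_logic_vector(7 downto 0);
           dout     : out std_logic_vector(7 downto 0) );
  end entity line_buffer;

  architecture rtl of line_buffer is
    type buf_t is array (0 to IMG_WIDTH-1) of std_logic_vector(7 downto 0);
    signal buf : buf_t;
  begin
    process (clk)
    begin
      if rising_edge(clk) then
        if pixel_en = '1' then
          buf(0) <= din;                               -- newest pixel in
          for i in 1 to IMG_WIDTH-1 loop
            buf(i) <= buf(i-1);                        -- shift along the line
          end loop;
        end if;
      end if;
    end process;
    dout <= buf(IMG_WIDTH-1);                          -- pixel from one line earlier
  end architecture rtl;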

Upload and Download of Images

At the moment the maximum image size is 180*180 pixels. This limit is due to the application software, which terminates abruptly with a runtime error if this image size is exceeded. The current method of uploading and downloading images is via the serial port. The time taken to download a 180*180 image is approximately 95 seconds; the time taken to upload the processed image is approximately 125 seconds. The execution of the DSP function, i.e. internally processing the image, is effectively instantaneous, since it can operate at effectively the system clock rate of 50 MHz. However, the DSP block must wait for data to be read from/written to RAM, and this reduces the effective clock rate by approximately a factor of 10. Even at this 5 MHz rate, a frame of 180*180 pixels would be processed in approximately 0.0065 seconds (180*180/5 MHz), or about 154 frames/sec. If the frame size were 640*480, the frame rate would be about 16 frames/sec, not far off the 25 frames/sec required for real time. The bottleneck, however, occurs in transferring data between memory and the host PC: roughly a 3000-fold increase on the current communication throughput is required to approach real-time data rates. Projects at NUIG are under way to improve the situation; USB and Ethernet technologies are being investigated. These will definitely yield an improvement, but by exactly how much is unknown as yet.
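For reference, the frame-rate figures quoted above follow directly from the effective 5 MHz pixel rate:

\[
t_{180\times180} = \frac{180 \times 180}{5\times10^{6}\,\text{pixels/s}} \approx 6.5\,\text{ms} \quad\Rightarrow\quad \approx 154\ \text{frames/s}
\]
\[
t_{640\times480} = \frac{640 \times 480}{5\times10^{6}\,\text{pixels/s}} \approx 61\,\text{ms} \quad\Rightarrow\quad \approx 16\ \text{frames/s}
\]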

Image Type and Translation Programs

A great deal of confusion surrounds this area. This is primarily due to the fact that 3rd-party software is used in the creation of the bitmap images (Microsoft Paint) and in the translation of these images, eventually, to a txt file consisting of a series of commands used to write pixel values to the correct memory locations. The exact operation of this software, and the file types used, needed to be investigated, since the results obtained were at times hard to understand. In particular, not changing the image attributes to Black and White in Microsoft Paint before saving seemed to cause data corruption, even though the attributes revert once the image is saved as a 256-colour bitmap.

Clearly the problem lies in the fact that, on one hand, our system requires greyscale images with 256 levels, where 00 represents black and FF represents white with various shades of grey in between, while on the other hand we are creating bitmap files, which are colour, Indexed-type images. To help clarify the process, a brief discussion of the various image types follows. There are three main image types: Binary, Intensity and Indexed. Bitmap files are Indexed images, while Raw (binary) files are Intensity-type images.

In a binary image, each pixel assumes one of two discrete values: zero (off) and one (on). The figure below shows the pixel values and corresponding image for an 8*8 pixel binary image.

[Figure 2 Binary Image File: (a) pixel values, (b) image]

Intensity images are the most commonly used images within the context of image processing. These are also known as RAW data files. They are greyscale images, where the pixel value represents the grey intensity. If 8 bits are used for the pixel value then there are 256 possible grey levels, with 00 (Hex) representing black and FF (Hex) representing white; all values in between represent some shade of grey.

(a) Pixel values (decimal): 255 160 80 0    (b) Pixel values (Hex): FF A0 50 00    (c) Image

Figure 3 8-bit Intensity image (4*4)

An Indexed image consists of a data matrix and a colourmap matrix. The colourmap matrix has 3 columns of values, representing Red, Green and Blue values. Any colour can be made by combining varying amounts of these primary colours. The number of colours required determines the length of the columns; for example, if 256 colours are required then the colourmap matrix will be 256 by 3.

The data matrix has a value for each pixel in the image. However, in this case these values do not directly represent the pixel shade; instead they are used to index the colourmap matrix. This is illustrated in the figure below.

(a) Data matrix (sample pixel values): FF 00 FA F9 FC 07

(b) Colour matrix:

Index   Red     Green   Blue
00      0.000   0.000   0.000
 :
07      0.752   0.752   0.752
 :
F9      1.000   0.000   0.000
FA      0.000   1.000   0.000
FB      1.000   1.000   0.000
FC      0.000   0.000   1.000
FD      1.000   0.000   1.000
FE      0.000   1.000   1.000
FF      1.000   1.000   1.000

(c) Image

Figure 4 A 256 Colour Bitmap (Indexed) image after the Black and White attribute is set

The map file (colour matrix) shown above is part of the colour matrix created by MS Paint after the attributes have been changed to Black and White but the file saved as 256 colour, hence the 256 entries. The point to note is that the colour white is at entry FF and black at entry 00. Since the black and white attribute was set, only two colours will exist, black and white; in other words, the data matrix will contain only the values 00 and FF. It seems these are the values that are used in the RAW file after translation is finished. The full map file, examined using Matlab, can be seen in the file map_bw; part of it is also shown in Appendix [1]. If the Black and White attribute is not set then we get the colour matrix file shown in the figure below.
Index   Red     Green   Blue
00      0       0       0
01      0.502   0.502   0.502
02      0.502   0       0
03      0.502   0.502   0
04      0       0.502   0
05      0       0.502   0.502
06      0       0       0.502
07      0.502   0       0.502
08      0.502   0.502   0.251
09      0       0.251   0.251
0A      0       0.502   1
0B      0       0.251   0.502
0C      0.251   0       1
0D      0.502   0.251   0
0E      1       1       1
0F      0.7529  0.7529  0.7529
10      1       0       0
11      1       1       0

Figure 5 Colourmap file if Black and White attribute is not set

The full file can be found at map_orig, and a large part of it is given in Appendix [2]. Note here that the colour white (1,1,1) is at entry 0E. Therefore, for an image with black and white colours only, the entries 00 and 0E appear in the data matrix table. Once again, it is these values that appear in the RAW file and are ultimately downloaded to memory. This is wrong since, in our greyscale interpretation of the world, FF should represent white, not 0E. MS Paint, as far as can be discerned, does not support greyscale images. In order to overcome this, PhotoStudio and Matlab were used. A greyscale bitmap image was created with 256 shades of grey. In this case, for any entry in the colour matrix, all values in the row are the same; in other words, if you mix the same amounts of Red, Green and Blue you get a shade of grey. A portion of a greyscale colourmap is shown below.
Index   Red     Green   Blue
0       0       0       0
1       0.0039  0.0039  0.0039
2       0.0078  0.0078  0.0078
3       0.0118  0.0118  0.0118
4       0.0157  0.0157  0.0157
5       0.0196  0.0196  0.0196
6       0.0235  0.0235  0.0235
7       0.0275  0.0275  0.0275
8       0.0314  0.0314  0.0314
9       0.0353  0.0353  0.0353
10      0.0392  0.0392  0.0392
 :
249     0.9765  0.9765  0.9765
250     0.9804  0.9804  0.9804
251     0.9843  0.9843  0.9843
252     0.9882  0.9882  0.9882
253     0.9922  0.9922  0.9922
254     0.9961  0.9961  0.9961
255     1       1       1

Figure 6 Correct colourmap for a greyscale image

See the full greyscale colour matrix created by PhotoStudio in the file 256_grey. Unfortunately, for some unknown reason, the translation programs in this project can only handle a grey bitmap image created in this way if the file is very small. For example, the bitmap image eye.bmp, shown on the left in the figure below, can be translated, but not the bitmap image meg_8bitgray.bmp. The first has a size of 32*32 pixels, the second 180*180 pixels.

a) 32*32 greyscale    b) 180*180 greyscale

Figure 7 Different size greyscale images

However, the image NUIGImage1_180x180_bw.bmp, shown below, can be translated, even though it is also 180*180 pixels in size. The difference here is that, even though a 256-colour matrix is used, only the values FF and 00 appear in the data matrix, i.e. black and white, whereas for the image meg_8bitgray.bmp the 256-grey colour matrix is used but the data matrix contains values other than 00 and FF.

Figure 8 180*180 Black and white

It is interesting to note that if the Matlab command imfinfo is used to show all the essential information on an image file, then both of the above 180*180 files look exactly the same. The file info_meg_8bitgray can be compared with the file info_NUIGimage in Appendix [3].

It is not within the remit of this project to solve the problems associated with the front-end software. However, it is hoped that the above analysis and review will benefit whoever takes on this task. The job, in my opinion, consists of first taking the data matrix of a proper greyscale image created with 256 levels of grey, and then creating a C program (or similar) to embed this data into the command file that is sent down to the board.

Application Software

In addition to the translation software, there also seem to be limitations in the user interface software itself. If the image size exceeds 180*180 pixels the interface simply crashes. This originally happened at a smaller image size with the first version of the software received; with the new version, the limit has been increased. This project has developed a number of image processing functions. Once an image has been downloaded into memory, it should be possible to perform a number of these functions on the same image, even though the processed image would be overwritten each time; this would save the time spent downloading the input image each time. This is not the case: if successive image processing functions are attempted, the user interface program once again terminates abnormally.

Hardware Architecture

In addition to the current limit on memory size, discussed earlier, there are also constraints imposed by the present system in terms of how the memory is organised. At present each location in memory is 32 bits wide and therefore stores 4 pixels. There is therefore extra hardware, and time spent, ungrouping pixels when reading and regrouping pixels when writing. For most image processing functions, including those implemented here, this is not a serious problem. However, when it comes to warping, where translation functions depend on pixel addresses, the way the memory is organised at present makes implementation of warping algorithms very complicated, since each pixel does not have a unique address.

Image Processing Functions

On receiving the system, the DSP function implemented in the dspBlk was reviewed. Though of no obvious practical use, it sufficed to illustrate the operation of the system in general, in terms of reading data, performing some function and writing back to memory. The DSP function performed was to subtract image 1 from image 2. This was not done on a pixel-by-pixel basis but on a 32-bit word basis; in the VHDL the 32-bit values are treated as unsigned. The following examples explain.

Example 1: If image 2 has 4 pixels only, all white (FF FF FF FF), and image 1 is all black (00 00 00 00), then the resulting calculation is

(FF FF FF FF) − (00 00 00 00) = (FF FF FF FF)

The resulting image is all white. However, if image 2 is all black and image 1 all white, then the resulting calculation is

(00 00 00 00) − (FF FF FF FF) = (00 00 00 01)

The result looks all black but has in fact one pixel value of 01, which is not quite black. In effect an overflow has occurred.

Example 2: If image 1 has four shades, black, dark grey, light grey and white, in pixels P0,0 to P0,3:

00    80    C0    FF
P0,0  P0,1  P0,2  P0,3

and image 2 is all black (00 00 00 00), then the resulting calculation is

(00 00 00 00) − (00 80 C0 FF) = (FF 7F 3F 01)

so the processed pixels P0,0 to P0,3 take the values FF, 7F, 3F and 01 respectively.

In summary then, the current DSP function, while useful in proving the general operation of the system, does not have any discernible advantage as an image processing algorithm. It is the objective of this project to implement some commonly known image processing algorithms.
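One way the subtraction could be made meaningful as an image operation would be to perform it per pixel with saturation, so that differences below zero clamp at black instead of wrapping around. The following is a sketch only, assuming 8-bit pixels packed four to a 32-bit word; it is not part of the system as received, and the names are hypothetical.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Hypothetical per-pixel saturating subtraction: each byte of the 32-bit
  -- word is treated as one unsigned pixel, and negative differences clamp
  -- to 00 rather than wrapping (avoiding the overflow shown above).
  entity pixel_sub_sat is
    port ( a, b : in  std_logic_vector(31 downto 0);   -- 4 packed pixels each
           y    : out std_logic_vector(31 downto 0) ); -- saturated a - b
  end entity pixel_sub_sat;

  architecture rtl of pixel_sub_sat is
  begin
    process (a, b)
      variable pa, pb : unsigned(7 downto 0);
    begin
      for i in 0 to 3 loop
        pa := unsigned(a(8*i+7 downto 8*i));
        pb := unsigned(b(8*i+7 downto 8*i));
        if pa >= pb then
          y(8*i+7 downto 8*i) <= std_logic_vector(pa - pb);
        else
          y(8*i+7 downto 8*i) <= (others => '0');      -- clamp to black
        end if;
      end loop;
    end process;
  end architecture rtl;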

Conclusion
This chapter has given a brief overview of the current system as it was received from NUI Galway. Since the system has been described elsewhere, only a summary is given here. Time was dedicated to putting the system through the mill, so to speak, in order to highlight any limitations that may have a direct effect on the size and nature of the images that can be processed, as well as the type of algorithms that can be implemented. It was discovered that while the system works well for black and white images up to an image size of 180*180 pixels, it is not adept at handling greyscale images: the largest greyscale image that can be handled is 32*32 pixels. These limitations are due to the application software. If they are overcome, it was shown that there are other factors that will affect image size, namely the memory size and the FPGA size.

Speed of image transfer is also an issue. At the moment, since a serial port is being used, image transfer speeds are barely tolerable when the image size exceeds 32*32 pixels. If real-time processing is an ultimate goal then faster connections need to be explored.

Finally, when considering the FPGA architecture required for image processing, it is desirable that the FPGA have a good supply of flexible multipliers for computation, as well as memory for internal line buffers. Multipliers are not in abundance on the Spartan-3; however, their use was avoided here by carefully choosing the algorithms for implementation. The use of line buffers cannot be avoided. It is hugely advantageous to have these on chip, and it is shown later how they have been implemented efficiently, so as not to become a limiting factor. In other words, the XC3S200 is sufficient here, but only just. The next chapter discusses various image processing algorithms in general terms and indicates those chosen for implementation.


CHAPTER 3

IMAGE PROCESSING ALGORITHMS


Introduction
Image processing functions may be separated broadly into three categories: point, neighbourhood and morphological operations [13]. Point operations are algorithms where each pixel in an image is processed individually, i.e. independently of all other pixels in the image. Operations can be unary, where only one image is involved, or binary, where two images are combined in some way. This project has tended to concentrate on unary point operations. Neighbourhood operations use not only the value of the current pixel but also those of its neighbours on the same line and on the surrounding lines before and after. This requires large amounts of memory to store complete image lines, the number of lines stored depending on the size of the neighbourhood required. Low and high pass filtering are typical neighbourhood operations, as is edge detection. Morphological operations are neighbourhood operations of a kind, but are more interested in the shapes and forms in images. They generally work on binary images; if the image is greyscale, a threshold operation is performed before the morphological operation. The most common morphological operations are erosion and dilation.

Point Operations
For 8-bit greyscale images, each pixel has a value between 0 and 255, where 0 represents black and 255 represents white. Values in between represent varying shades of grey. Point operations involve taking each pixel value in turn and modifying its value according to some mathematical formula. For example, the following are point operations, where x represents the input pixel value and f(x) represents the output or processed pixel value:

Reduction in Intensity:
f(x) = 0         if x < 10
f(x) = x − 10    if x ≥ 10

Thresholding:
f(x) = 0         if x < 128
f(x) = 255       if x ≥ 128

Gamma Modification:
f(x) = e^x

The first two operations are straightforward and can easily be performed on a Spartan FPGA. The last is a more complicated mathematical operation; the most obvious way to implement it is by means of a LUT (Look-Up Table), since there are only 256 possible values of x. These operations can be represented graphically, as shown in the figure below, as produced originally by Burdick [13]. All of the point operations illustrated in this figure will be implemented.

[Figure 9: each graph plots the output pixel value (0–255) against the input pixel value (0–255) for one point operation.]

a) Unity: input pixel values are unchanged.
b) Invert: all pixel values are inverted.
c) Threshold: convert greyscale image to black and white.
d) Decrease contrast.
e) Increase contrast.
f) Decrease brightness.
g) Increase brightness.

Figure 9 Various Point Operations represented graphically (Burdick, Digital Imaging)
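As an illustration of how simple these operations are in hardware, the following is a minimal VHDL sketch of two of them, thresholding (c) and increasing brightness (g) with saturation, for an 8-bit pixel stream. The entity and signal names are illustrative and are not taken from the project's pixel_img_proc.vhd.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Illustrative point operations on one 8-bit pixel per clock.
  entity point_op is
    port ( clk        : in  std_logic;
           pixel_in   : in  unsigned(7 downto 0);
           threshold  : in  unsigned(7 downto 0);  -- grey threshold (e.g. CSR_7)
           offset     : in  unsigned(7 downto 0);  -- amount to brighten by
           thresh_out : out unsigned(7 downto 0);
           bright_out : out unsigned(7 downto 0) );
  end entity point_op;

  architecture rtl of point_op is
  begin
    process (clk)
      variable sum : unsigned(8 downto 0);          -- 9 bits to catch the carry
    begin
      if rising_edge(clk) then
        -- c) Threshold: convert greyscale to black and white
        if pixel_in >= threshold then
          thresh_out <= (others => '1');            -- FF, white
        else
          thresh_out <= (others => '0');            -- 00, black
        end if;
        -- g) Increase brightness: add offset, clamping at FF
        sum := ('0' & pixel_in) + ('0' & offset);
        if sum(8) = '1' then
          bright_out <= (others => '1');            -- saturate at white
        else
          bright_out <= sum(7 downto 0);
        end if;
      end if;
    end process;
  end architecture rtl;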

Neighbourhood Operations
Neighbourhood operations combine a number of pixels in a pixel's neighbourhood in some way to determine the output or processed pixel. How the pixels are combined (e.g. averaged) and the size of the neighbourhood depend on the algorithm in question. The most common and useful neighbourhood operation is convolution. It is effectively 2-D convolution, as opposed to the 1-D convolution associated with general DSP, since a 2-D kernel or convolution mask is passed over the input image in order to calculate the pixel values of the output image. The size of the kernel depends on the application, but in general a 3*3 kernel is sufficient. The kernel will therefore have 9 values or coefficients. It is these values that determine the exact function of the processing, for example low pass filtering, high pass filtering or, in this project's case, Prewitt edge detection. The convolution process is illustrated in the following figure.

(a) Kernel:

C01 C02 C03
C11 C12 C13
C21 C22 C23

(b) Input Image:

P01 P02 P03 ...
P11 P12 P13 ...
P21 P22 P23 P24 P25 P26 P27
P31 P32 P33 P34 P35 P36 P37
 :   :  P43 P44 P45 P46 P47
 :   :  P53 P54 P55 P56 P57

Figure 10 2-D Kernel is passed over the input image during convolution

As depicted above, convolution involves passing the centre of the 2-D kernel (C12) over each image pixel and multiplying the kernel coefficients by the overlapped image pixel values. The products are summed and divided by the weight of the kernel (the sum of its coefficients). The result is the pixel value of the output image, positioned at the same relative position as the centre coefficient of the kernel on the input image. With the kernel positioned as in the figure above, the equation for the current output pixel Q35 is:

Q35 = [(C01*P24) + (C02*P25) + (C03*P26) + (C11*P34) + (C12*P35) + (C13*P36) + (C21*P44) + (C22*P45) + (C23*P46)] / (C01 + C02 + C03 + C11 + C12 + C13 + C21 + C22 + C23)

Note, as always in convolution, the kernel is first flipped before being passed over the input image. In practice this is not an issue since, as we will see, the kernel is generally symmetrical in both directions. The following figure demonstrates the effect of changing the coefficient values of the kernel.

a) Coefficients for a Low Pass Filter:

1 1 1
1 1 1
1 1 1

b) Coefficients for a High Pass Filter:

-1 -1 -1
-1  9 -1
-1 -1 -1

c) Prewitt Horizontal and Vertical Edge Detection:

-1 0 1      -1 -1 -1
-1 0 1       0  0  0
-1 0 1       1  1  1

d) Sobel Horizontal and Vertical Edge Detection:

-1 0 1      -1 -2 -1
-2 0 2       0  0  0
-1 0 1       1  2  1

Figure 11 Examples of 3*3 Kernels

Convolution, as can easily be seen, can require a huge amount of computational effort, depending on the actual coefficient values. Calculation of each output pixel value using a 3*3 kernel can potentially require 9 multiplications, 8 additions and 1 division. For edge detection, the actual output pixel value is not of interest as it is in the case of filtering, since it simply needs to be established whether an edge exists or not. The division step can therefore be eliminated: the numerator, which represents the gradient, is simply examined, and a decision is made as to whether an edge exists, depending on its value in relation to a preset threshold. The output pixel is then set to FF (white) if an edge exists and 00 (black) otherwise.

The resource savings possible, in terms of the number of multiplications and additions, are more than an order of magnitude, depending on the kernel size, if the 2-D kernel is separable and symmetric, as is the case for Prewitt; this is discussed in an application note by Altera [14]. In addition, if the coefficients are ±1 then no multiplication is required. For these reasons the Prewitt edge detection algorithm was chosen as the most natural place to start in implementing edge detection algorithms on an FPGA, since it leads to the simplest architecture. The complexities involved in implementing other algorithms can then be easily deduced.

In summary then, as an example of neighbourhood operations, Prewitt edge detection has been selected. This can be horizontal, vertical or full (vertical and horizontal combined), as selected by the user. In addition, the user is able to program a threshold value against which the gradient values are compared. Changing this threshold value changes the sensitivity of the process: the higher the threshold, the fewer edges will be detected. A sketch of how this computation maps to hardware is given below.
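The following is a minimal sketch of the Prewitt gradient computation for one output pixel, assuming the 3*3 neighbourhood (p11..p33, row by row) is already available from the line buffers. The entity and signal names are illustrative rather than taken from kernel3_img_proc.vhd; note that the centre pixel p22 is not needed, since its coefficient is zero in both Prewitt kernels.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  -- Illustrative Prewitt edge magnitude for one pixel position. Signed
  -- 12-bit arithmetic holds the worst-case sums (|Gx| + |Gy| <= 1530).
  entity prewitt_pixel is
    port ( p11, p12, p13,
           p21,      p23,
           p31, p32, p33 : in  unsigned(7 downto 0);
           threshold     : in  unsigned(7 downto 0);   -- e.g. from CSR_7
           edge_pixel    : out std_logic_vector(7 downto 0) );
  end entity prewitt_pixel;

  architecture rtl of prewitt_pixel is
  begin
    process (p11, p12, p13, p21, p23, p31, p32, p33, threshold)
      variable gx, gy, mag : signed(11 downto 0);
    begin
      -- Vertical-edge gradient: right column minus left column
      gx := signed(resize(p13, 12)) + signed(resize(p23, 12))
          + signed(resize(p33, 12)) - signed(resize(p11, 12))
          - signed(resize(p21, 12)) - signed(resize(p31, 12));
      -- Horizontal-edge gradient: bottom row minus top row
      gy := signed(resize(p31, 12)) + signed(resize(p32, 12))
          + signed(resize(p33, 12)) - signed(resize(p11, 12))
          - signed(resize(p12, 12)) - signed(resize(p13, 12));
      mag := abs(gx) + abs(gy);                        -- full edge magnitude
      if mag > signed(resize(threshold, 12)) then
        edge_pixel <= (others => '1');                 -- FF: edge present
      else
        edge_pixel <= (others => '0');                 -- 00: no edge
      end if;
    end process;
  end architecture rtl;

Horizontal-only or vertical-only detection, as selected via CSR_6, would compare |gy| or |gx| alone against the threshold.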

Morphological Operations
Morphological operations are generally concerned with the structure or form of an image, as opposed to the appearance of an image. They are used extensively in automatic object detection, when trying to establish the shape and size of objects. It could be argued that edge detection is a morphological operation; however, morphological operations are generally performed on binary images, i.e. black and white, where '1' represents white and '0' represents black. Morphological operations can still be performed on greyscale images, but first the image has to be converted to black and white. The two most common morphological operations are erosion and dilation. Here pixel values are modified depending on the binary values of their neighbours. Once again a 3*3 kernel is generally used, and in this respect morphological operations are very similar to neighbourhood operations. A common binary erosion mask has all kernel values equal to '1', as shown below.

(a) Kernel:

1 1 1
1 1 1
1 1 1

(b) Input Image:

P01 P02 P03 ...
P11 P12 P13 ...
P21 P22 P23 P24 P25 P26 P27
P31 P32 P33 P34 P35 P36 P37
 :   :  P43 P44 P45 P46 P47
 :   :  P53 P54 P55 P56 P57

Figure 12 2-D Kernel is passed over input image during Erosion

Passing this mask over the image as shown above means that, for a pixel to remain white (i.e. '1'), all its neighbours must be '1'. Consider the kernel placed over the image as shown above; the following logical operation is performed to determine the output pixel P35:

For Erosion: P35(out) = 1 (white) if P35 AND P24 AND P25 AND P26 AND P34 AND P36 AND P44 AND P45 AND P46 = 1, else 0.

This effectively means that the output pixel, if white in the first place, will remain white only if all its neighbours are white. The erosion above is performed using 8-connectivity. If 4-connectivity is used, then the corner pixels P24, P26, P44 and P46 in the specific operation above are not used. The kernel for the 4-connectivity operation is shown in the figure below:

0 1 0
1 1 1
0 1 0

Figure 13 Kernel for Erosion using 4-connectivity

Therefore, with successive erosions, large white areas shrink in size and small isolated white areas disappear altogether. Erosion can be used to clean up object boundaries and remove spurious objects, to allow for easy identification of objects by size and calculation of the number of objects present.

Dilation is the opposite of erosion: here white areas dilate as opposed to erode. The pixel in question becomes white if it was white already or if any of its neighbours is white. This time the following logical operation is performed to determine P35(out):

For Dilation: P35(out) = 1 (white) if P35 OR P24 OR P25 OR P26 OR P34 OR P36 OR P44 OR P45 OR P46 = 1, else 0.

Just as for erosion, dilation can be performed using 8-connectivity or 4-connectivity; for 4-connectivity dilation the corner values are not used. All the morphological operations described above will be implemented. The following section specifies exactly the image processing algorithms that have been implemented. To allow the user to select and control the image processing tasks required, the unused bytes 6 and 7 within the CSR (Control Status Register) have been used.
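To show how little logic these operations need once the image is binarised, here is an illustrative sketch of 8-connectivity erosion and dilation for one pixel position; the names are hypothetical, not taken from the project code.

  library ieee;
  use ieee.std_logic_1164.all;

  -- Illustrative 8-connectivity erosion and dilation of one binarised
  -- pixel. Each input is a single bit ('1' = white, '0' = black).
  entity morph_pixel is
    port ( p11, p12, p13,
           p21, p22, p23,
           p31, p32, p33 : in  std_logic;
           eroded        : out std_logic;   -- white only if all 9 are white
           dilated       : out std_logic ); -- white if any of the 9 is white
  end entity morph_pixel;

  architecture rtl of morph_pixel is
  begin
    eroded  <= p11 and p12 and p13 and p21 and p22 and p23
                   and p31 and p32 and p33;
    dilated <= p11 or  p12 or  p13 or  p21 or  p22 or  p23
                   or  p31 or  p32 or  p33;
  end architecture rtl;

The 4-connectivity variants simply omit the corner inputs p11, p13, p31 and p33 from the AND/OR reductions.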

Summary of New Image Processing functions added to the system


Previously unused bytes 6 and 7 of the CSR have been assigned to managing the image processing functions. Via the user interface, the user selects the image processing task required using byte CSR_6, while CSR_7 is used for any control values, e.g. threshold values, required by the operation. Bit 3 of the CSR_6 byte is used to select between point operations and neighbourhood operations, while the three lower bits, CSR_6(2:0), are then used to select the specific operation within the category. The tables below show the exact values required for each operation.


Description                                   CSR_6(7:0) Hex   CSR_7(7:0)
Full Edge Detection                           00               Edge threshold value. A good starting point is 20 (Hex)
Horizontal Edge Detection Only                01               Edge threshold value. A good starting point is 20 (Hex)
Vertical Edge Detection Only                  02               Edge threshold value. A good starting point is 20 (Hex)
Erosion with 8-connectivity                   03               Not Applicable
Erosion with 4-connectivity                   04               Not Applicable
Dilation with 8-connectivity                  05               Not Applicable
Dilation with 4-connectivity                  06               Not Applicable
Not used (Full Edge Detection by default)     07               Not Applicable

Table 1 Neighbourhood and Morphological Image Processing functions added to the system


Description                                          CSR_6(7:0) Hex   CSR_7(7:0)
No change                                            08 (0000 1000)   Not Applicable
Invert                                               09 (0000 1001)   Not Applicable
Brighten                                             0A (0000 1010)   Value determines the amount to brighten by
Darken                                               0B (0000 1011)   Value determines the amount to darken by
Increase contrast                                    0C (0000 1100)   Value determines the amount to increase contrast by
Decrease contrast                                    0D (0000 1101)   Value determines the amount to decrease contrast by
Convert Greyscale to Black and White                 0E (0000 1110)   Value determines the grey value threshold to use
Not used (Greyscale to Black and White by default)   0F (0000 1111)   Value determines the grey value threshold to use

Table 2 Point Image Processing Operations added to the system

Conclusion
This chapter discussed the theory behind some point and neighbourhood image processing tasks. Though admittedly not an extensive discussion, it hopefully conveys enough information to gain an appreciation of the task of implementing these systems. Implementation is the focus of the next chapter. All the tasks listed in the two tables above will be implemented. For a more detailed discussion of image processing theory see Burdick's excellent book [13], where many algorithms are implemented in C. Awcock's book [15] provides another perspective, and for a very mathematical treatment of the subject see [16].


CHAPTER 4

THE NEW DSP CONTROLLER BLOCK AND IMAGE PROCESSING SUB-BLOCKS


Introduction
This chapter focuses on the implementation of the image processing algorithms described in the last chapter. A new version of the DSP controller has been designed. This will henceforth be referred to as version 2, to distinguish it from the original one received from NUI; that original DSP controller will be referred to as version 1. Version 2 is almost pin compatible with version 1, except for two new 8-bit inputs, CSR_6 and CSR_7, that are used to select and control the image processing function required. Though the footprint is very similar to the old version, which allows for easy replacement, the internal architecture is very different. There are two major differences between version 1 and version 2 of the DSP controller. First, the controller operation and image processing operations have been separated structurally by creating two new sub-blocks within the DSP controller to handle the various image-processing operations. An overview of the DSP controller version 2 architecture is shown in figure 14 below.

[Figure: DSPblk.vhd containing the main state machine, the point operations sub-block Pixel_Img_Proc.vhd and the neighbourhood operations sub-block Kernel3_Img_Proc.vhd]

Figure 14 Overview of DSP block architecture


The central part of the architecture is the state machine controller that takes care of data transfer to and from the RAM and to and from the sub-blocks. The image processing tasks have been divided into two groups, point operations (pixel_img_proc.vhd) and neighbourhood operations (kernel3_img_proc.vhd). The second major change is a redesign of the state machine within the DSP controller. Version 1 has a major limitation that needed to be overcome before time-intensive image processing algorithms could be implemented: the state machine in DSP controller version 1 is designed in such a way as to assume that all DSP functions can happen in one clock cycle. This is not the case for even the simplest of functions here, since even stripping the individual pixel values from the 32-bit values read from RAM takes at least one clock cycle. (Note that operations in version 1 of the controller were performed directly on the 32-bit values.) A brief overview of the cycles performed in the state machine of version 1 is illustrated below:

Read data -> Perform task -> Write data

Figure 15 Main cycles of DSP Block Version 1

This is sufficient when the Perform Task stage only takes one clock cycle. However, long delays would be introduced if the task were heavily pipelined, as is the case when neighbourhood operations are performed. Version 2 of the DSP controller addresses this problem by allowing for the fact that various DSP tasks may be pipelined and that there may therefore be a large time lag of many clock cycles between when the function block receives valid data and when valid data is output. In version 2 of the DSP block the state machine has been modified to perform the cycles shown in the figure below.


1. While no valid data is ready to write, continue reading and passing data to the function block until valid data appears.
2. When valid data appears, continue reading and writing data until all data is read.
3. When no data is left to read, continue writing data until all data is written.

Figure 16 Main Cycles of DSP Block Version 2

From the above it can be seen that the controller might be continuously reading data at the beginning, alternating between reading and writing data once the time lag of the sub-block (function block) has been overcome, or continuously writing data towards the end. The design of the three VHDL blocks that make up the new DSP controller will now be examined. First the top-level block DSPblk.vhd is described in more detail, and following that the two sub-blocks, pixel_img_proc.vhd and kernel3_img_proc.vhd, are discussed.

The DSP Controller, DSPblk.vhd


The pin-out for DSP block version 2 is shown below. As can be seen, the pinout is identical to version 1 except for the addition of two new 8-bit input buses called CSR_6 and CSR_7. These two bytes are the upper two bytes of the CSR (Control Status Register) and up until now were unused. The user has access to these bytes via the user interface and can therefore read from and write to them. Making these two bytes available in the DSP controller block allows their values to be passed to the image processing sub-blocks, where they are used to select and control the image-processing task required. Chapter 3


documents the CSR values required to execute each task. The following is a complete list of the DSP block's pins.

Pin                  Description
clk                  System clock
rst                  System reset
dspActive            This signal is used to activate the DSP block. While it is 0 the DSP block's main state machine remains in the idle state
DatFromRam(31:0)     Data read from RAM. Each RAM location is 32 bits wide
DspAddRange(15:0)    Number of RAM locations to be accessed. This is determined by the size of the image
CSR_6(7:0)           This input has been added in order to allow an image processing task to be selected by the user. This value can be accessed via the user interface
CSR_7(7:0)           This input has been added and is used by some of the image processing functions as a threshold value. It is also set by the user
ramDone              Used to indicate to the DSP when a read or write to RAM is completed

Table 3 DSP Block Inputs

Pin                  Description
dspDone              Signal used by dspBlock to indicate it has finished the allotted task
dspRamWr             Enable RAM wr access by dspBlock
dspRamRd             Enable RAM rd access by dspBlock
dspDat2Ram(31:0)     Data from dspBlock to be written to RAM
dspRamAdd(17:0)      RAM address (from dspBlock)

Table 4 DSP Block Outputs

A more detailed view of the architecture of the DSP block is shown below. The complete VHDL code for the DSP block can be found in Appendix [4].


Figure 17 DSP Block Architecture


The external signals in figure 17 above are shown in bold. A description of the internal signals is given in the table below:

DataEn           Generated by state machine to indicate to sub-blocks that data is available
ldcnt0           Generated by the state machine to start the read and write address counters
incReadcnt       Generated by the state machine to increment the read address counter
incWritecnt      Generated by the state machine to increment the write address counter
CS               Current state of the state machine
Writecnt         Keeps track of number of locations written
Readcnt          Keeps track of number of locations read
Cnt              Keeps track of number of times DSP function block has been enabled
Pipeline_lnt     Pipeline length for active sub-block
DataVal          Indicates to state machine that data is valid from sub-block
Finish_Writing   DSP block is finished writing
Finish_Reading   DSP block is finished reading
Total_length     Total number of data enables required to complete task

Table 5 Internal Signals of the DSP block

The DSP block contains the two sub-blocks. The sub-block pixel_img_proc is responsible for point operations, while the sub-block kernel3_img_proc is responsible for neighbourhood and morphological operations. Both of these blocks are activated and fed data; however, only the output of one is chosen, using a multiplexer. Bit 3 of the CSR_6 register, which is set by the user, determines which block output is chosen, as sketched below. The lower bits of the CSR_6 register determine which particular function is selected within the sub-block. The main part of the DSP block is the state machine, which controls all activity including data to and from the RAM. A flowchart that describes the state machine operation in detail is shown below. The VHDL code can be found in Appendix [4].
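As an illustration of this output selection, a hedged VHDL sketch of such a multiplexer is given below; the entity and signal names are assumptions, not the actual DSPblk.vhd code.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative sketch of the sub-block output multiplexer described above.
-- Port names are assumptions, not necessarily those of DSPblk.vhd.
entity dsp_out_mux is
  port (
    csr6_bit3      : in  std_logic;                      -- CSR_6(3)
    pixel_blk_out  : in  std_logic_vector(31 downto 0);  -- point operations sub-block
    kernel_blk_out : in  std_logic_vector(31 downto 0);  -- neighbourhood sub-block
    dspDat2Ram     : out std_logic_vector(31 downto 0)
  );
end entity;

architecture comb of dsp_out_mux is
begin
  -- CSR_6(3) = '1' selects point operations, '0' selects neighbourhood operations
  dspDat2Ram <= pixel_blk_out when csr6_bit3 = '1' else kernel_blk_out;
end architecture;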


Figure 18 DSP block main State Machine

As explained earlier the state machine has been expanded to allow for the fact that the operation of the sub-blocks can be heavily pipelined. Each of the sub-blocks will now be explained in more detail.


The Point Operations sub-block (pixel_img_proc.vhd)


The Pixel Image Processing block is a sub-block within the main DSP controller block. It has the task of performing any point operations required on the image, for example converting a greyscale image to black and white. Point operations were discussed in chapter 3, and Table 2 listed all the point operations that are implemented by this block. The following is a complete list of the block's pins.

Pin                       Description
clk                       System clock
rst                       System reset
RamDataIn(31 downto 0)    Data read from RAM. Each RAM location is 32 bits wide
DataEn                    Used to register the RAM data when it is valid
CSR_6(7 downto 0)         At present only the lower three bits, CSR_6(2:0), are used to select the specific point operation required
CSR_7(7 downto 0)         Some operations require a threshold value. CSR_7 is used to pass this threshold value

Figure 19 Inputs of the Point Operations block

There are only two outputs relating to the processed data and pipeline length.

RamDataOut(31 downto 0)     Processed data sent back to the DSP controller block
Pipeline_lnt(15 downto 0)   Pipeline length of the operation. This is required by the DSP controller in order to determine when the data will become valid. At present all point operations have the same pipeline length

Figure 20 Outputs of the Point Operations block

A detailed view of the block architecture is shown in figure 21 below. The VHDL code for the block is given in Appendix [5].

Figure 21 Architecture of the Point Operations block

The 32-bit data from the RAM is first separated into four individual 8-bit pixels. A multiplexer controlled by the lower three bits of the CSR_6 register determines which point operation is selected. The CSR_7 register value is used by all except two of the operations, as illustrated above. Finally the processed bytes are reassembled into a 32-bit word before being made available to the main DSP block for writing to RAM. Note that there is in effect only a 32-bit register at the input and output, with the rest of the internal logic being combinational. The four bytes are processed in parallel, benefiting from the power of a hardware implementation; only the combinational logic for one of the bytes (pixel 0) is shown in figure 21 above. The pipeline length of this block in terms of the DataEn signal is 2, i.e. very small, since no neighbourhood pixel values are required. The greyscale image shown in figure 22 below will be used to illustrate the effect of the various operations. It is 32*32 pixels in size, although it has been stretched here for better visibility at the expense of quality. If reading this report online, clicking any of the images in this report should invoke Microsoft Paint to show the image.


Figure 22 Original Greyscale image used to illustrate point operations (eye_32*32_grey.bmp)

The following figure shows the effect of the various point operations on this image and the specific values used for CSR_6 and CSR_7.

[Figure: processed image thumbnails accompany each of the functions below]

Function                     CSR_6(7:0) hex   CSR_7(7:0) hex
No Change (a useful test to ensure pipeline delay etc. is correct)          08         Not Applicable
Invert (all pixel values are inverted around the centre value of 80 hex)    09         Not Applicable
Brighten                     0A               Value used to brighten each pixel by (20 in this example)
Darken                       0B               Value used to darken each pixel by (20 in this example)
Increase Contrast            0C               Value used to adjust pixel value by (20 in this example)
Decrease Contrast            0D               Value used to adjust pixel value by (20 in this example)
Convert to Black and White   0E or 0F         Threshold value used in determining whether a pixel should be set to black or white (56 hex in this example)

Figure 23 Processed images from various point operations
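To make the datapath of figure 21 concrete, the following hedged VHDL sketch shows one possible implementation of a single point operation (brighten) on one 8-bit pixel. The names, and the choice to saturate at FF rather than wrap around, are assumptions for illustration, not the actual project code.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Minimal combinational sketch of the brighten point operation on one
-- 8-bit pixel. The real pixel_img_proc block applies logic like this to
-- all four pixels of the 32-bit word in parallel.
entity brighten_pixel is
  port (
    pixel_in  : in  std_logic_vector(7 downto 0);
    csr_7     : in  std_logic_vector(7 downto 0);  -- amount to brighten by
    pixel_out : out std_logic_vector(7 downto 0)
  );
end entity;

architecture comb of brighten_pixel is
begin
  process (pixel_in, csr_7)
    variable sum : unsigned(8 downto 0);  -- one extra bit to detect overflow
  begin
    sum := unsigned('0' & pixel_in) + unsigned('0' & csr_7);
    if sum > 255 then
      pixel_out <= (others => '1');       -- saturate at FF (assumed behaviour)
    else
      pixel_out <= std_logic_vector(sum(7 downto 0));
    end if;
  end process;
end architecture;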

In addition to viewing the images, the actual pixel values were also examined using Matlab. The No Change operation is a useful function to ensure pixel values are not being corrupted or shifted in any way as a result of processing. The following Matlab command can be used to read a bitmap image:

[im1 map1] = imread('eye_32x32_grey.bmp');

im1 will be a 32*32 matrix containing the pixel values, whereas map1 is the 256*3 colourmap. Individual pixel values can then be examined; for example, the following command looks at a 5*5 subset of the 32*32 matrix:

im1(8:12,3:7)

The original image has the following pixel values for the subset im1(8:12,3:7), i.e. 25 values.


122 110  96  86  79
104  90  79  79  82
 99  92  88  92  94
105 108 111 111 104
123 125 127 124 121

Table 6 Pixel values of the original image (subset (8:12,3:7))

When the pixels are brightened by 20 (hex), i.e. 32 (decimal), their values become as shown in the table below. This can be obtained by reading the image into Matlab and examining the pixel values. The Matlab commands used are:

[im3 map3] = imread('point_brighten.bmp');
im3(8:12,3:7)

The pixel values of the brightened image are as follows:

154 142 128 118 111
136 122 111 111 114
131 124 120 124 126
137 140 143 143 136
155 157 159 156 153

Table 7 Pixel values of the brightened image (subset (8:12,3:7))

In conclusion, the point operations proved relatively straightforward to implement and work correctly. The next section discusses the implementation of neighbourhood and morphological operations.


The Neighbourhood & Morphological Operations sub-block (kernel3_img_proc.vhd)


In addition to point operations, the DSP block has also been redesigned to perform neighbourhood and morphological operations. The neighbourhood operation chosen is Prewitt edge detection, while the common morphological operations erosion and dilation are also implemented. Since all of these operations require the use of a 3*3 kernel, they have been grouped together into a sub-block called kernel3_img_proc.vhd. The VHDL code for this block is available in Appendix [6]. The user chooses between point and these neighbourhood/morphological operations using bit 3 of the CSR_6 register (CSR_6(3)=0 => neighbourhood operation, CSR_6(3)=1 => point operation). Neighbourhood operations are far more complex to implement, since they require knowledge of a pixel's neighbourhood values. This implies a lot of memory, ideally internal easily accessible memory, since neighbourhood values must be readily available. The size of memory or buffering required depends on the image size and the kernel size. This is discussed further in the next chapter, where implementation issues are explained. For this project, which is an exploration of image processing on FPGAs and the issues involved, it was decided to fix upon the widely used default kernel size of 3*3 pixels. In the case of edge detection a threshold value also needs to be set to determine how sensitive the edge detection process should be; CSR_7 is used to allow the user to pass this threshold value. A value of 20 hex is a good starting value for the threshold, which can then be increased or decreased as required. The pinout for this sub-block is the same as for the point operations sub-block. This is deliberate, since it was felt that all image processing sub-blocks could, for ease of use and simplicity, have the same pinouts. The architecture of this sub-block has three main parts, each shown on a separate sheet below. The first part of this sub-block's architecture is depicted below. It shows RAM data entering the block on a 32-bit bus, being separated into pixels 8 bits wide and then entering the three line buffers. Three line buffers are required to make neighbourhood values from the preceding line and the next line available, as well as the current line. The line buffers are enabled using the Pixel_En signal, which is also generated in this block. Four pixel enable signals are generated for each DataEn signal, as illustrated in the timing diagram below: each valid DataEn activates the Pixel_En signal for 4 clock cycles, one per pixel in the 32-bit word. A great deal of time was spent considering the actual implementation of the line buffers, and the result of this work is presented in the next chapter. At this point it suffices to note that the line buffer length is equal to the image width.


Figure 24 Part 1 of the Architecture of the Kernel3_img_proc block

[Figure: waveforms of clk, DataEn and Pixel_En]

Figure 25 Timing of DataEn and Pixel_En signals
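The generation of Pixel_En from DataEn can be sketched as a small down-counter, as below. This is a hedged illustration under assumed names, not the actual kernel3_img_proc.vhd code.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Illustrative sketch: each DataEn pulse (one 32-bit word, i.e. four
-- pixels) stretches Pixel_En over four clock cycles.
entity pixel_en_gen is
  port (
    clk, rst : in  std_logic;
    DataEn   : in  std_logic;
    Pixel_En : out std_logic
  );
end entity;

architecture rtl of pixel_en_gen is
  signal count : unsigned(2 downto 0) := (others => '0');
begin
  process (clk, rst)
  begin
    if rst = '1' then
      count <= (others => '0');
    elsif rising_edge(clk) then
      if DataEn = '1' then
        count <= to_unsigned(4, 3);   -- load four enables per 32-bit word
      elsif count /= 0 then
        count <= count - 1;           -- one enable consumed per clock
      end if;
    end if;
  end process;

  Pixel_En <= '1' when count /= 0 else '0';
end architecture;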

The second part of this sub-block, shown in figures 26 and 27, consists of the actual operations described in chapter 3. All of these operations use the pixel values supplied by the line buffers. Part 2(a), shown in figure 26, implements the following morphological operations:

Erosion using 8-connectivity
Erosion using 4-connectivity
Dilation using 8-connectivity
Dilation using 4-connectivity

Part 2(b), depicted in figure 27, implements the Prewitt edge detection algorithm. This consists of the following:

Horizontal edge detection
Vertical edge detection
Full edge detection (horizontal and vertical combined)

Finally, the third part of this sub-block's architecture, illustrated in figure 28, shows how one of the operation outputs is selected using the value of the CSR_6 register. Since there are fewer than eight operations, only the lower three bits are required. Once the appropriate operation is selected, the processed pixel values are reassembled into 32-bit words and passed back to the DSP controller block. A hedged sketch of the Prewitt arithmetic implemented in part 2(b) is given below.
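The following VHDL sketch computes Prewitt horizontal and vertical gradients for one 3*3 window and thresholds their sum. The window signal names, the |Gx| + |Gy| magnitude approximation and the white-on-black output convention are assumptions for illustration, not the exact kernel3_img_proc.vhd code.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Rough sketch of a Prewitt edge-detection datapath for one 3x3 window
-- (p11..p33, row by row; p22 is the centre pixel and does not contribute
-- to the Prewitt gradients).
entity prewitt_pixel is
  port (
    p11, p12, p13 : in  unsigned(7 downto 0);
    p21, p22, p23 : in  unsigned(7 downto 0);
    p31, p32, p33 : in  unsigned(7 downto 0);
    threshold     : in  unsigned(7 downto 0);   -- CSR_7
    edge_out      : out std_logic_vector(7 downto 0)
  );
end entity;

architecture comb of prewitt_pixel is
begin
  process (p11, p12, p13, p21, p22, p23, p31, p32, p33, threshold)
    variable gx, gy, mag : integer;
  begin
    -- Horizontal gradient: right column minus left column
    gx := to_integer(p13) + to_integer(p23) + to_integer(p33)
        - to_integer(p11) - to_integer(p21) - to_integer(p31);
    -- Vertical gradient: bottom row minus top row
    gy := to_integer(p31) + to_integer(p32) + to_integer(p33)
        - to_integer(p11) - to_integer(p12) - to_integer(p13);
    mag := abs(gx) + abs(gy);          -- cheap magnitude approximation
    if mag > to_integer(threshold) then
      edge_out <= (others => '1');     -- edge pixel: white (FF)
    else
      edge_out <= (others => '0');     -- non-edge: black (00)
    end if;
  end process;
end architecture;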


Figure 26 Part 2 (a) of the Architecture of the Kernel3_img_proc block


Figure 27 Part 2 (b) of the Architecture of the Kernel3_img_proc block


Figure 28 Part 3 of the Architecture of the Kernel3_img_proc block


In chapter 2 the limitations of the current system were discussed. One of these limitations concerns the size of greyscale images: it was found that the user software was only reliable for greyscale images up to 32*32 pixels. For black and white images the maximum image size is 180*180. Edge detection can generally be performed on both greyscale and black and white images, and therefore the implementation was tested using the 32*32 greyscale eye image shown earlier, as well as the following black and white image received from NUI Galway. First, using the eye image, the following results were obtained for full edge detection using different threshold values, which have the effect of making the process more or less sensitive to edges.

Figure 29 Original eye 32*32 pixels image

CSR_7 = 20        CSR_7 = 30        CSR_7 = 17

Figure 30 Full Edge Detection with different threshold values

Horizontal edge detection means detecting edges as the kernel moves horizontally across the image; in effect, therefore, it will detect vertical lines best. The image above consists mostly of horizontal lines, and therefore horizontal edge detection in this case detects few edges, even at the sensitive value of CSR_7 = 17, as shown in the figure below.


CSR_7 = 17

Figure 31 Horizontal Edge Detection with a sensitive value of CSR_7

On the other hand, for this particular image, vertical edge detection will detect quite a lot since these are the edges detected as the kernel moves vertically down the image. Vertical edge detection will be good at detecting horizontal lines. The result of vertical edge detection on this image is shown below.

CSR_7 = 17

Figure 32 Vertical Edge Detection

Note that if the results of horizontal edge detection and vertical edge detection are added, the result is full edge detection for this particular threshold value. If a black and white image is used, as opposed to a greyscale image, then the user software, for some as yet unknown reason, can cope with larger image sizes of up to 180*180 pixels. The image below, received from NUI, will be used to illustrate edge detection on larger (180*180) black and white images. Since the images are black and white, edge detection will obviously not be as sensitive to the threshold value: pixel values are FF or 00 with nothing in between, hence gradient values will either be very large or very small with nothing in between.


Figure 33 Black and White image received from NUI (NUIGImage1_180*180.bmp)

The results of full, horizontal and vertical edge detection are shown below:-

Full Edge Detection with CSR_7 = 20

Horizontal Edge Detection with CSR_7 = 20

Vertical Edge Detection with CSR_7 = 20

Figure 34 Full, horizontal and Vertical Edge Detection on Black and White image (180*180)

Note that edges are depicted here as white lines on a black background; it is very simple to change the code to have edges shown as black lines on a white background. Erosion and dilation are generally performed on binary images, and though the implementation here does allow greyscale images to be used, only the MSB of the pixel value is examined, thus performing a crude black and white conversion before erosion or dilation. To illustrate the results of the erosion and dilation implementations, specific images have been designed to better demonstrate the usefulness of these functions. The first image, developed to illustrate the use of erosion, is shown below. Though it was drawn manually, it is designed to depict an image of some coins that, as can be seen, are


touching. Erosion can be used to separate the objects so that they can be counted since counting algorithms generally require that the objects be separate.

Figure 35 Original Image developed to demonstrate Erosion

Image after 1 erosion        Image after 3 erosions        Image after 5 erosions

Figure 36 Performing Erosion

As can be seen, in this case it takes five erosions before the coins are separated. The number of erosions could be reduced by making the kernel larger; however this, as already discussed, would have other implications, in particular for memory. Also note that since the kernel or structuring element is square, the coins are beginning to lose their shape after 5 erosions; different structuring element shapes could be used to counteract this. If erosion shrinks white areas, then dilation can be viewed as the opposite, in that it grows white areas. This can be useful, for example, in the manufacturing of PCBs (Printed Circuit Boards), where it is required to verify that a minimum track spacing has been maintained. Suppose the minimum track spacing equates to three pixels; then on the image below it can be seen, though not too clearly, about half way down the track, that a violation may have occurred.


Figure 37 Image developed to demonstrate usefulness of dilation

If this image is first dilated, then any areas less than the required minimum spacing of three pixels will grow together as shown below.

Figure 38 Result of one dilation

If this image is then eroded, the original image is restored except in cases where there has been a violation and the tracks have grown together due to dilation. A tell-tale sign is left that indicates there is a violation. This is shown in the image below, where after erosion the tracks have not separated correctly. This final image could be subtracted from the original to show exactly where the problems lie.

Figure 39 Result of Erosion after Dilation


Conclusion
This chapter showed how the image processing algorithms discussed in chapter 3 were designed and implemented. Two new sub-blocks have been created within the DSP block to cater for the point operations and neighbourhood operations. In addition, the main state machine controller within the DSP block has been redesigned to allow for the fact that, due to the complex nature of the neighbourhood operations in particular, pipelining is required in order to implement the algorithms efficiently. One very important part of the implementation, and one where a considerable amount of time was spent, is the design of the line buffers. The next chapter focuses on this issue.


CHAPTER 5

IMPLEMENTATION OF LINE BUFFERS


Introduction
The most area-intensive part of the edge detection algorithm, and of other algorithms based on a 3*3 kernel, is the line buffers. A line buffer is simply a shift register that stores the pixels of an entire line; the shift register length is therefore equal to the image width. Since three lines of the image have to be available for most simple image processing algorithms, this implies 3 shift registers, the length of each being equal to the image width. In addition, each pixel is at least 8 bits wide for greyscale images and 24 bits wide for colour images. Therefore, assuming even a small image with a width of 200 pixels, the number of clocked storage elements required is 3 (number of line buffers) * 200 (width of image) * 8 (pixel size) = 4800. Efficient implementation of these line buffers is therefore a major issue, since this is the most area-intensive part of the entire design. Two methods of implementing the line buffers were thoroughly examined. The first approach used synthesis, by modelling the line buffers in VHDL. The second and most efficient approach uses the CORE Generator tool available within the Xilinx CAD package. Both methods try to achieve the same result, i.e. to implement the line buffers in the most area-efficient way on the FPGA. This is achieved if the 8-bit wide shift register is implemented using clocked RAM as opposed to flip-flops, as discussed in Xilinx application note XAPP465 [17]. The disadvantage of using RAM as opposed to flip-flops is that only the last pixel in the shift register is accessible, as opposed to all pixel values in the line if flip-flops are used. Since we only require access to the last three pixels, the solution is to use a combination of RAM for the majority of the line delay, with flip-flops implementing the end of the line delay. This is illustrated in figure 40 below. The RAM part of the line buffer was created using the Core Generator; the flip-flops were synthesised from VHDL. This can be seen by examining the code for the kernel3_img_proc block included in Appendix [6]. The next section looks briefly at the attempt made to synthesise the entire line buffer from VHDL. This was proven to be functionally correct but unfortunately had the serious drawback of synthesising to flip-flops as opposed to RAM.


[Figure: 8-bit wide SRL16 Core Generator RAM-based shift register module (length = line width - 3) feeding 3 flip-flops, whose outputs are line0_out, line0_out1 and line0_out2]

Figure 40 Illustration showing how line buffers are implemented

Method 1:- Synthesising the Line Buffers using VHDL code


This was the first and preferred approach, since the code is more portable to an ASIC solution and the length of the line buffers is easily changed at synthesis time using a Width parameter: if the image width changes, the line buffer length can be changed simply by re-synthesising. An application report [17] generated by Xilinx deals in part with the issue of mapping a shift register to RAM using synthesis. However, the report only considers the synthesis of a 1-bit wide shift register, whereas this design requires 8-bit wide shift registers since each pixel is 8 bits wide. It was thought that the approach in this Xilinx report could be expanded to synthesise 8-bit wide shift registers by creating the type line_buffer, an array, size 8, of std_logic_vectors, where the std_logic_vector size equals the width of the line. The VHDL code is shown below.

type line_buffer is array (7 downto 0) of std_logic_vector((width-1) downto 0);

-- Three line buffers required
signal line0, line1, line2 : line_buffer;

line_buffers : process (clk, rst)
begin
  if clk'event and clk = '1' then
    if pixel_en = '1' then
      for i in 0 to 7 loop
        -- line 0
        line0(i) <= line0(i)((width-2) downto 0) & pxl0_core(i);
        -- line 1
        line1(i) <= line1(i)((width-2) downto 0) & line1_core(i);
        -- line 2
        line2(i) <= line2(i)((width-2) downto 0) & line2_core(i);
      end loop;
    end if;
  end if;
end process;

Unfortunately, despite considerable investigation, the code does not synthesise to the desired structure. As can be seen from an extract of the synthesis report, Appendix [7], only one bit of the array implements the shift register using RAM; the rest of the shift registers are implemented inefficiently using flip-flops. This has serious consequences, since it was subsequently proven that the largest image size that could be catered for using this approach had a width of 128 pixels. If this width is exceeded then mapping fails, since resources on the chip (Spartan-3) are used up inefficiently. An extract of the mapping report is given in Appendix [8]. A different approach was therefore needed.


Method 2: Creating the Line Buffers using the CORE Generator


In this attempt the line buffers were created using the CORE Generator tool available within the Xilinx Foundation CAD package. The category Basic Elements -> Registers, Shifters and Pipelining is used. The options selected are shown in the figure below:-

Figure 41 Core generator used to generate efficient shift registers based on RAM

Note that the Depth value is chosen as 177, since a line width of 180 is assumed; the last three pixels in each line are stored using flip-flops to allow for access. Using trial and error it was discovered that the largest image size that the application software can handle at present is 180*180 pixels (all images were kept square). Since the shift registers implemented are RAM-based, only the last element in the shift register is available: hence the three flip-flop extension.

54

This is the most efficient implementation of the line buffers and the one that is used. As can now be seen from the synthesis report, Appendix [9], the cell SRL16E, a shift register based on RAM, is being used. The final mapping report, shown in Appendix [10], shows ample resources are now available on the FPGA since the shift registers are being implemented efficiently. In other words the image size is no longer limited by the hardware, as it was in the previous case.
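To show how the pieces of figure 40 fit together, the following hedged VHDL sketch assembles one complete line buffer from a core-generated RAM-based shift register plus a three flip-flop tail. The component name shift_ram_177 and its ports are hypothetical placeholders for whatever the CORE Generator actually produced.

library ieee;
use ieee.std_logic_1164.all;

-- Hedged sketch of one line buffer for a 180-pixel line: 177 stages of
-- RAM-based delay followed by three registered taps, so the last three
-- pixels of the line are accessible.
entity line_buffer_180 is
  port (
    clk      : in  std_logic;
    Pixel_En : in  std_logic;
    pxl_in   : in  std_logic_vector(7 downto 0);
    tap0, tap1, tap2 : out std_logic_vector(7 downto 0)
  );
end entity;

architecture struct of line_buffer_180 is
  -- Hypothetical core-generated RAM-based shift register (depth 177)
  component shift_ram_177
    port (
      clk : in  std_logic;
      ce  : in  std_logic;
      d   : in  std_logic_vector(7 downto 0);
      q   : out std_logic_vector(7 downto 0)
    );
  end component;
  signal core_q, r0, r1, r2 : std_logic_vector(7 downto 0);
begin
  u_core : shift_ram_177
    port map (clk => clk, ce => Pixel_En, d => pxl_in, q => core_q);

  -- Flip-flop tail implementing the last three stages of the line delay
  process (clk)
  begin
    if rising_edge(clk) then
      if Pixel_En = '1' then
        r0 <= core_q;
        r1 <= r0;
        r2 <= r1;
      end if;
    end if;
  end process;

  tap0 <= r0;
  tap1 <= r1;
  tap2 <= r2;
end architecture;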

Conclusion
This chapter concentrated on the one area of the design that has the potential to have a huge impact on the resources used on the FPGA, that is, the line buffers. Line buffers are essential to neighbourhood image processing algorithms. At a minimum, three lines need to be stored, with each pixel requiring 8 bits; therefore the number of storage elements required is 3 * image width * 8. Ideally these storage elements should be on chip if real-time image processing is to be a possibility. Since the number of storage elements required is so large, it was felt worthwhile to spend time considering the most efficient implementation of these elements. Initially the preferred solution of synthesising the elements was considered, since this would leave the code more portable to another FPGA or ASIC solution. However, even at relatively small image widths of 128, synthesising the elements had a devastating effect on FPGA resources; in particular, all flip-flops were entirely used up. Despite various attempts at manipulating the code, no way was found of inferring efficient storage elements based on RAM. The second and only feasible option uses the Core Generator to generate fixed-size shift registers based on RAM. This proved to be the most efficient use of FPGA resources and is the option eventually settled on.


CHAPTER 6

WARPING AND MORPHING


Introduction
Originally it was felt, from an entertaining demonstration point of view, that the warping of images would be a useful image processing algorithm to implement. This is fine, except that from an implementation point of view the warping of images is a more complex algorithm than those we have looked at so far, both in terms of the mathematical computations that have to be performed and the control and manipulation of memory addresses, as this chapter will show. From the author's point of view it seemed logical to start with the simplest algorithms and gradually work up towards more complex ones; therefore point operations were implemented first and then neighbourhood operations. This brings the project to a stage where the analysis of warping and morphing algorithms can benefit from the insight gained thus far. Unfortunately, due to time limitations, this project will not go as far as implementation, but it is hoped that the analysis and theory provided here will serve well those who take on this task in future projects.

Basic Theory and Applications


A more technical term for warping is geometric spatial transformation. Geometric transformations modify the spatial relationship between pixels in an image: the pixels of a source image are transformed to a destination image in some way. Control points are used to define the transformation. A set number of pixel co-ordinates in the source image are aligned with pixel co-ordinates in the destination image, and all other pixels are mapped accordingly. The number of control co-ordinates used defines the type of geometric transformation. Gonzalez [16] notes that geometric transformations are often referred to as rubber-sheet transformations, since the process is akin to printing the source image on a rubber sheet which can then be manipulated into any shape desired. The table below lists some of the types; note that the type of transformation, or warping, depends on the number of control points used.


Type of Transformation   Number of Control pairs                               Description
Affine                   3 pairs                                               Scaling, translation, rotation, shearing. Straight lines remain straight; parallel lines remain parallel
Projective               4 pairs                                               Tilting. Parallel lines converge
Polynomial               6 pairs (2nd order poly), 10 pairs (3rd order poly)   The resulting image can contain bends and curves. The higher the order, the more curves

Table 8 Types of Transformations

Note that a pair of control points effectively means two coefficients, as will be seen below. According to Gonzalez [18], the most commonly used form of spatial transformation is the affine transform. For all types of transformation, pixels are assumed to occur at a single point, with their location specified by Cartesian co-ordinates. If (x,y) represents the co-ordinates of pixels in the source image and (x',y') represents the pixel co-ordinates in the destination image, then the general affine transform can be written in matrix form as shown below:-

$$[\,x' \quad y' \quad 1\,] = [\,x \quad y \quad 1\,]\begin{bmatrix} t_{11} & t_{12} & 0 \\ t_{21} & t_{22} & 0 \\ t_{31} & t_{32} & 1 \end{bmatrix}$$

Note that an extra fixed dimension has been added purely to allow matrices to be used to represent the transform.

The general transform is $T = \begin{bmatrix} t_{11} & t_{12} & 0 \\ t_{21} & t_{22} & 0 \\ t_{31} & t_{32} & 1 \end{bmatrix}$

The transform can scale, rotate, translate or shear a set of points depending on the values chosen for the elements of T. The table below shows the transform matrices for these various operations.


Transformation Type              Transformation matrix T

Identity:                        $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Scaling:                         $\begin{bmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Rotation:                        $\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Shear Horizontal (factor $\alpha$): $\begin{bmatrix} 1 & 0 & 0 \\ \alpha & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Shear Vertical (factor $\alpha$):   $\begin{bmatrix} 1 & \alpha & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

Table 9 Various Affine Transformations

Note that all these transforms use the origin, the top left corner of the image, as the reference point, as opposed to the centre. This has implications, as will be discussed in a later practical example. Given a particular transformation there are two ways to map an image: forward mapping and backward mapping.

For Forward Mapping all the pixel co-ordinates in the source image are looped through and are mapped using a transform, such as the ones shown above, to co-ordinates in the destination image. The problem is that the co-ordinates calculated in the destination image will not necessarily be whole numbers. At first this may seem a minor irritant since the coordinates can simply be rounded off, however it is more than likely that some co-ordinates in the destination image never get assigned any values. In other words we end up with holes in the destination image. For this reason it is more common to employ Backward


Mapping by getting the inverse of the transforms given above. Therefore, mathematically, if

$$[\,x' \quad y' \quad 1\,] = [\,x \quad y \quad 1\,]\,T$$

then

$$[\,x \quad y \quad 1\,] = [\,x' \quad y' \quad 1\,]\,T^{-1}$$

where $T^{-1}$ is the inverse transform.

For backward mapping, all the pixel co-ordinates in the destination image are looped through (therefore there can be no holes) and are mapped using the inverse transform to some pixel co-ordinate in the source image. Of course, once again, the pixel co-ordinates mapped to in the source image may not be whole. There are two generally accepted ways of dealing with this: a) nearest neighbour and b) bilinear interpolation. Using the nearest neighbour approach, the calculated pixel co-ordinates are simply rounded off to the nearest whole pixel co-ordinate and this pixel's value is then used. From a computational point of view this is by far the simpler of the two approaches, but as shown by images in Burdick [1] it can lead to jagged edges in the destination image. The better but more computationally intensive approach is bilinear interpolation. Here the pixel value at the calculated sub-pixel co-ordinate is determined using weighted values of the four pixels it lies between, with the weights calculated according to closeness to each of the four pixels. The figures below depict forward and backward mapping.

Figure 41 Forward Mapping


Figure 41 Backward Mapping
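For reference, the standard bilinear interpolation weighting, stated here for completeness since it is not given explicitly above, is as follows. If the calculated source co-ordinate is $(x + \Delta x,\ y + \Delta y)$ with $0 \le \Delta x, \Delta y < 1$, then

$$P_{out} = (1-\Delta x)(1-\Delta y)\,P(x,y) + \Delta x(1-\Delta y)\,P(x{+}1,y) + (1-\Delta x)\Delta y\,P(x,y{+}1) + \Delta x\,\Delta y\,P(x{+}1,y{+}1)$$

so each of the four neighbouring pixels contributes in proportion to its closeness.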

Morphing and warping are terms that are sometimes used interchangeably, as if they were one and the same. They are not. As discussed, warping involves taking an image and modifying it according to some transform to give the same image but distorted in some way. Morphing is where one given image appears to gradually change into a second given image; it is a special effect widely used in films. Morphing involves two steps: first, pre-warp the two images so that the main features of each image overlap spatially, and second, cross-dissolve the colours. When this is done in the correct sequence over time, the effect is that the first image appears to gradually change into the second image. Therefore morphing is not the same as warping but involves warping. Pre-warping is used to align features that appear in both images, such as facial features (eyes, mouth and hair); if no pre-warping is performed, the result is a double-image effect. Another term for morphing is metamorphosis. Neither of these terms should be confused with the morphological operations discussed in earlier chapters. There are many applications of warping. In the area of satellite photography, for example, an image may be taken of an area on earth that does not lie directly beneath the satellite's path; the image can then be warped to give an image that appears as if it were taken from directly underneath. In face recognition it may be common to pre-warp the image before applying recognition software, in order that the orientation of the face is always the same regardless of the angle to the camera. As discussed, warping is also needed for morphing. For affine transforms, as discussed above, straight lines remain straight, parallel lines remain parallel and no curves are introduced. If the final warped image is to be made to bend, or curves are to be introduced where there were none, then a polynomial warp needs to be employed. As mentioned in the first table in this chapter, more control points are needed for a polynomial warp. Mathematically, a polynomial warp can be expressed as follows:-


$$\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{13} & C_{14} & C_{15} & C_{16} \\ C_{21} & C_{22} & C_{23} & C_{24} & C_{25} & C_{26} \end{bmatrix} \begin{bmatrix} 1 \\ x' \\ y' \\ (x')^2 \\ x'y' \\ (y')^2 \end{bmatrix}$$
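Written out, this matrix product expands to a pair of second-order polynomials:

$$x = C_{11} + C_{12}x' + C_{13}y' + C_{14}(x')^2 + C_{15}x'y' + C_{16}(y')^2$$
$$y = C_{21} + C_{22}x' + C_{23}y' + C_{24}(x')^2 + C_{25}x'y' + C_{26}(y')^2$$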

As can be seen from this equation, for a second-order polynomial warp at least 6 control point pairs, or 12 coefficients, are needed. Also note that backward mapping is employed here. In terms of selecting a warping algorithm for implementation, the best approach, in my opinion, is to start simple, tease out any issues that arise, and then draw on this experience to increase the complexity and usefulness of the system. For this reason the affine transformation rotation was selected as a possible candidate for implementation. The next section looks at this particular transformation a little more closely.

Detailed look at the Affine transformation Rotation


The rotation transformation was described mathematically, in the table above, as:-

$$[\,x' \quad y' \quad 1\,] = [\,x \quad y \quad 1\,]\begin{bmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

Remember that (x,y) are the pixel co-ordinates of the source image, whereas (x',y') are the calculated pixel co-ordinates of the destination image. The above equation defines a forward map, since each co-ordinate in the source image is looped through and the transformation applied to determine what co-ordinates it maps to in the destination image. For this particular case, since some of the transform coefficients are zero, the equation can be simplified to

$$[\,x' \quad y'\,] = [\,x \quad y\,]\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$$

In the simplest case if we have a fixed rotation of 90 degrees, we get

$$[\,x' \quad y'\,] = [\,x \quad y\,]\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$$

Therefore, in equation form:

x' = y
y' = -x

This is a Forward Transform, since the algorithm will loop through all the pixel coordinates of the source(x,y) to determine pixel values in the destination. The first few values are shown in the table below:-

(x,y) source    (x',y') destination
(0,0)           (0,0)
(0,1)           (1,0)
(0,2)           (2,0)
(0,3)           (3,0)
:               :
(1,0)           (0,-1)
(2,0)           (0,-2)
(3,0)           (0,-3)
(4,0)           (0,-4)

Table 10 Forward Transform example for 90-degree rotation around the origin

In the table above only a selected few co-ordinates were chosen to illustrate the 90-degree transformation. The rotation is illustrated more pictorially in the figure below.

[Figure: source image pixels (0,0), (1,0), (2,0), (3,0) and (0,1), (0,2), (0,3), and the corresponding destination image pixels (0,0), (1,0), (2,0), (3,0) and (0,-1), (0,-2), (0,-3)]

Figure 42 Source and Destination Image for Table 10

There are two problems with the above. First, the rotation took place around the origin of the source, (0,0), and therefore mapped into a space possibly not defined for the destination. This is solved, as will be shown, by modifying the transform so that rotation takes place about the centre of the source image. But what are the centre pixels of the 4*4 or 16*16 images that have been used in this project? To have a truly single centre pixel, images should have an

odd number of rows and columns, and should be square. For example, the centre pixel of a 15*15 image has co-ordinates (7,7). Second, since 90 degrees was chosen as the angle of rotation, no holes result in the destination. If, on the other hand, the angle of rotation were 60 degrees, the transformation equations would have been:

$$[\,x' \quad y'\,] = [\,x \quad y\,]\begin{bmatrix} 0.5 & -0.866 \\ 0.866 & 0.5 \end{bmatrix}$$

or

x' = 0.5x + 0.866y
y' = -0.866x + 0.5y

The table below shows the forward transform of the first few source image pixels.

x   y   x'      y'       x' (rounded)   y' (rounded)
0   0   0       0        0              0
0   1   0.866   0.5      1              1
0   2   1.732   1        2              1
0   3   2.598   1.5      3              2
0   4   3.464   2        3              2
1   0   0.5     -0.866   1              -1
1   1   1.366   -0.366   1              0
1   2   2.232   0.134    2              0
1   3   3.098   0.634    3              1
1   4   3.964   1.134    4              1

Table 11 Forward Transformation example for 60-degree rotation around the origin

In this example, note that since the destination co-ordinates are rounded off there is a danger that not all pixels in the destination image get values, i.e. there may be holes. To improve the situation we need first to employ a backward transform, to ensure there are no holes, and second to modify the transform by adding an adjustment so that rotation occurs around the centre of the image. When these two steps are completed the backward transform can be described mathematically as follows:

$$[\,x \quad y\,] = [\,x' \quad y'\,]\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} + [\,x_c \quad y_c\,]\begin{bmatrix} 1-\cos\theta & -\sin\theta \\ \sin\theta & 1-\cos\theta \end{bmatrix}$$

If $\theta = 90°$ then


$$[\,x \quad y\,] = [\,x' \quad y'\,]\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} + [\,x_c \quad y_c\,]\begin{bmatrix} 1 & -1 \\ 1 & 1 \end{bmatrix}$$

If the image size used is 15*15 then the centre pixel co-ordinates (xc, yc) are (7,7) and the equations become:

x = -y' + 14
y = x'

This time the algorithm loops through all the destination image pixel co-ordinates and determines which source co-ordinates to use, as shown in the table below:

(x',y') destination   (x,y) source
(0,0)                 (14,0)
(0,1)                 (13,0)
(0,2)                 (12,0)
(0,3)                 (11,0)
:                     :
(1,0)                 (14,1)
(2,0)                 (14,2)
(3,0)                 (14,3)
(4,0)                 (14,4)
:                     :
(7,7)                 (7,7)

Table 12 Backward Transform for 90-degree rotation around centre pixel of 15*15 image

In this case, since a backward transform is being used, every co-ordinate in the destination image will get a value and there will be no holes. Also note that since the rotation is now around the centre, no calculated co-ordinates for the source image fall outside the spatial area (0,0) to (14,14). Finally, for this particular example, since 90 degrees was used for the angle of rotation, all calculated co-ordinates were whole numbers, so no rounding or bilinear interpolation was required. If 60 degrees had been used, the simplest approach would be to round off the calculated source pixel co-ordinates, since bilinear interpolation would require a few more steps involving floating point calculations.


[Figure: destination image pixels (0,0)-(0,7) and (1,0)-(7,0) with centre (7,7), mapped back to source image pixels (14,0)-(11,0) and (14,1)-(14,4) with centre (7,7)]

Figure 43 Backward Transform for 90-degree rotation around centre pixel of 15*15 image

This image should be viewed from the point of view that if the bottom image (source) is rotated by 90 degrees clockwise, the result is the top image (destination). The arrows imply that backward mapping was used. Once again, since a 90-degree rotation was chosen for the transformation, all source pixel co-ordinates calculated were whole numbers; in other words the transformation did not yield values between pixels, so in this case there was no need for rounding to the nearest neighbour or for bilinear interpolation, as would normally be the case. The next section will look at the implementation issues involved if warping algorithms are to be designed on an FPGA and the hardware system used thus far.


Implementation of Warping Algorithms


For point operations and neighbourhood operations, pixel values were calculated. In the case of neighbourhood operations the calculations were pipelined to yield a high throughput rate. By choosing the Prewitt algorithm for edge detection, floating point calculations were avoided and hence multipliers were not required. The pixel values of the source image were processed sequentially, by stepping through its addresses, to calculate the processed pixel values of the destination image. Once floating point arithmetic was avoided, the greatest challenge was the incorporation of line buffers to store a number of lines of the source image at any one time. A slightly different scenario emerges when implementing warping algorithms. If, as is recommended, backward mapping is employed, then the addresses of the destination (processed) image pixels are stepped through sequentially. The destination pixel co-ordinates are used, together with some transformation coefficients, in calculations to determine the co-ordinates of the source image pixel values to use. Only in the very simplest case would these coefficients be whole integers, and therefore during these calculations multiplication of floating point numbers will be required. Once a source image co-ordinate has been calculated it will inevitably be composed of real numbers, for example (2.3, 2.4). There are then two choices:-

a) Round off the source pixel co-ordinates to determine the co-ordinates of the nearest neighbour, determine the RAM address of this pixel, read its value and use this value as the value for the current destination pixel. In this case the nearest neighbour would have co-ordinates (2,2), so this pixel's value would be used.

b) Determine the co-ordinates of the 4 surrounding pixels. In this case the 4 pixels to use have co-ordinates (2,2), (2,3), (3,2), (3,3). A weighted portion of each of these pixels' values is then used to determine the eventual value. In this case it is possible that 4 memory reads would be required.

In summary, the main points that should be considered when implementing warping algorithms are:

- Inverse warping transformations (backward mapping) should be used, in order to avoid holes in the destination image.
- For a destination image without jagged edges, bilinear interpolation should be used instead of nearest neighbour.
- Floating point arithmetic, additions and multiplications, will be required for all but the simplest of algorithms.
- Increasing the complexity and ability of the warping algorithms increases the number of coefficients needed for the transformation, and hence the number of multiplications and additions required in the calculation of the source image pixel co-ordinates.
- If bilinear interpolation is used then 4 RAM reads may be required, since the values of 4 separate source image pixels are needed in the calculation of a single destination pixel value. It is therefore very desirable to have fast memory access.
- For flexibility, the coefficients of the warping algorithms should be programmable by the user.

66

- Finally, it may be desired to warp only a subsection of the image. If this is the case the system needs to be programmable, to allow the user to identify this area.

The following figure is a preliminary flowchart that attempts to identify the major steps that are required for the implementation of a warping algorithm.


1. Initialise the destination co-ordinates (x',y').
2. Using the transformation coefficients and the destination co-ordinates (x',y'), calculate the source co-ordinates (x,y).
3. Round off the calculated source co-ordinates to whole numbers (nearest neighbour).
4. Determine the RAM address of the rounded source co-ordinates and read its value.
5. Assign this value to the pixel at the current destination co-ordinates (x',y').
6. Determine the RAM address of the current destination co-ordinates (x',y') and write the value.
7. Increment the y' co-ordinate by 1; at the end of each row, increment the x' co-ordinate by 1 and continue; at the end of the image, finish.

Figure 44 Flowchart for Warping algorithm


For the above flowchart it is assumed that each pixel resides at a separate memory address. In this project, memory was structured in such a way that 32 bits are accessed at a time: since each pixel requires 8 bits, 4 pixels are stored at each RAM address, and there is no possibility of accessing individual bytes. For the above flowchart there does exist the possibility of grouping 4 pixels together before performing a write to the destination image, thus providing a more efficient implementation. However, it is unlikely that this same efficiency can be achieved when reading the source image pixels, since successive pixels required might not occur at the same memory word address, and therefore a separate read will be required for each pixel value needed. The situation becomes worse if bilinear interpolation is used, since now 4 pixels are required to calculate each destination pixel value. In this case the flowchart in the figure above remains the same, except that the four shaded boxes are replaced with those shown in the figure below.

1. Use the source co-ordinates calculated to determine the 4 nearest neighbours.
2. Determine the RAM addresses of the source pixels required and read each of their values.
3. Calculate the weightings to be used for each of the 4 pixels read; the weightings depend on proximity to the calculated pixel co-ordinates.
4. Using the values of the 4 pixels read and the corresponding weightings, calculate a value for the inter-pixel co-ordinates.
5. Assign this value to the pixel at the current destination co-ordinates (x',y').

Figure 45 Use to replace the shaded section in the earlier flowchart, if Bilinear Interpolation is used
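As an aside on the memory organisation discussed above (four 8-bit pixels packed per 32-bit word), the following hedged VHDL sketch shows the pixel-address arithmetic a warping implementation would need; the entity name, port widths and the assumption of row-major storage are illustrative, not taken from the project.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- For a pixel at (x, y) in an image of width W, assuming row-major storage:
--   linear index n        = y * W + x
--   RAM word address      = n / 4   (shift right by 2)
--   byte lane within word = n mod 4
entity pixel_addr is
  generic (W : positive := 180);   -- image width (assumption)
  port (
    x, y      : in  unsigned(7 downto 0);
    word_addr : out unsigned(15 downto 0);
    byte_sel  : out unsigned(1 downto 0)
  );
end entity;

architecture comb of pixel_addr is
  signal n : unsigned(15 downto 0);
begin
  n         <= resize(y * to_unsigned(W, 8), 16) + resize(x, 16);
  word_addr <= shift_right(n, 2);   -- divide by 4
  byte_sel  <= n(1 downto 0);       -- remainder mod 4
end architecture;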


Conclusion
This chapter looked at the theory surrounding image warping. Warping an image involves transforming a given image from one spatial area to another. For forward transformations, each pixel co-ordinate in the source image is taken and, using the transformation, the co-ordinates in the destination image that the pixel maps to are calculated. The problem with forward mapping is that we may end up with holes in the destination image. An alternative and recommended approach is to perform backward mapping using the inverse transformation: each pixel in the destination image is taken and, using the inverse transform, the co-ordinates of the pixel value to use in the source image are calculated. If the co-ordinates calculated are not whole, there are two accepted approaches to dealing with this: a) nearest neighbour and b) bilinear interpolation. All warping algorithms will involve multiplication and addition of real numbers; the number of multiplications and additions required depends on the complexity of the algorithm chosen. Towards the end of this chapter a flowchart was presented as an initial guide for the design and implementation of a warping algorithm. An issue that was highlighted was the need for access to individual pixels: if bilinear interpolation is used, 4 pixels may need to be read from the source image for each destination pixel value calculated, and these will more than likely require 4 separate reads since they will not reside at the same word address. It is hoped that the analysis presented in this chapter will prove useful for anyone undertaking project work in this area.


CHAPTER 7

CONCLUSION
This project investigated the theory, design and implementation of various image processing algorithms on a Spartan-3 FPGA. The FPGA is part of a board designed by Digilent, and this in turn is part of an overall system that allows images to be downloaded and uploaded between the FPGA and a PC. Before embarking on the actual realisation of image processing functions, a thorough review of the overall system, both hardware and software, took place. On the hardware side it was discovered that, at a general level, the Spartan-3 is of sufficient size to accommodate image widths of about 300 pixels and is therefore not a critical constraint. If, however, it is decided to implement algorithms with many real-number multiplications, as in the case of warping algorithms, then the choice of FPGA may need to be re-examined. Memory is also not critical, since the current size of memory is sufficient to hold four greyscale images, each 512*512 pixels. However, if colour images are used, the image size capability reduces by about a factor of 3, since each colour image pixel requires 24 bits as opposed to 8 bits for greyscale. Even though the memory does not reside on chip, access speeds were sufficient to be in no way limiting. In terms of hardware, the most pressing concern is the slow upload and download times of images greater in size than 32*32 pixels. This is due to the fact that a serial port is being used; an upgrade to a USB connection should substantially improve matters here. The user application software was found to have more pressing issues than the hardware. Third-party software has been used that does not effectively handle greyscale images greater than 32*32 pixels, so a limitation has been imposed on us here. If black and white images are used the situation improves slightly, but the system software was still found to crash if an image size of 180*180 pixels is exceeded. Also, at the moment, if successive processing tasks are attempted on an image in RAM then the system crashes: it is only possible to perform one task for each image downloaded. This is not really satisfactory in the long run, and future projects need to address all of these software issues. In summary, the limitations relate to image size and type, upload and download speeds, and executing more than one image processing function on an image without having to restart the software each time. Even though these limitations proved time-consuming to identify and are still bothersome, they did not interfere too greatly with the main task of implementing various image processing tasks. It is a question of scale, since relatively small images, 32*32 pixels, were used to demonstrate the image functions, but the designs developed are capable of handling images of any size. First the DSP controller was redesigned, since the state machine that controls data to and from memory has to allow for the fact that some image tasks may be heavily pipelined. The new controller can no longer perform a simple read-execute-write cycle, since multiple reads may be required before any valid data is available for writing. The only drawback of the new DSP controller design is that only one image can be handled. This was not a problem here, since all the image tasks considered operate on one image.


The first and simplest image processing tasks to be considered were point operations. These tasks involve taking the pixel values of the source image in turn and adjusting them in some way to give a new image. The point operations implemented were invert, brighten, darken, change contrast, and convert greyscale to black and white. By changing a register value via the GUI, any of these functions can be selected by the user.

In addition to point operations, a number of neighbourhood tasks have also been designed and implemented. These operations are significantly more complicated, since the new value of a pixel depends not only on its current value but also on the values of its immediate spatial neighbours: left and right, above and below. This is hugely significant, since it means that, for the neighbourhood chosen here, at least three lines of the image must be stored in order to have immediate access to the neighbourhood pixel values. In addition to the increased complexity of calculating the new pixel value, there was also the issue of how to store three lines of an image internally on the FPGA in the most efficient manner possible in terms of chip resources. After much ado, the core generator was used to implement fixed-width line buffers in the most efficient way possible, using on-chip RAM as opposed to flip-flops. In the end the following neighbourhood operations were implemented: Prewitt vertical, horizontal and full edge detection, as well as the morphological operations erosion and dilation. Once again the user can select a neighbourhood task using register values accessible via the user interface.

Finally, the theory of warping and morphing was researched, and even though no actual design work occurred in this area, the groundwork was prepared by outlining the issues involved and developing a flowchart that shows how a generic warping algorithm may be implemented.

Overall the project has been enjoyable, rewarding and successful. But what of future developments? Based on the experience gained here, I will attempt to outline where I feel future projects could focus in order to achieve the ultimate goal of a truly versatile, fast image processing machine. The user interface software needs to improve in order to handle large images, to handle greyscale images correctly, and to be able to perform more than one image task without crashing. The upload and download speeds need to be improved by changing over to a faster USB connection between the PC and the FPGA board; remember that for real-time video we ultimately need to be able to handle 25 to 30 images per second. The DSP controller could be improved yet again to allow for image processing tasks that require two images as opposed to one. This might be achieved by combining the DSP controller presented here, which caters for pipelining, with the previous version of the controller, which could handle two images but not pipelining. Other neighbourhood tasks could be investigated that involve non-integer coefficients, in order to study the impact of using real numbers. The size of the kernel could be increased for neighbourhood operations and perhaps in time become programmable, and the coefficients of the kernel could also be made programmable; a speculative sketch of how programmable coefficients might look is given below. The theory presented here regarding warping could be built upon by designing and implementing an affine warping algorithm; suggestions on how to do this were presented in this report. These are the main pointers for immediate future projects.
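As an illustration of that last suggestion, the fragment below sketches how a 3*3 multiply-accumulate with programmable signed coefficients might look. It is speculative and not part of the current design: the entity name, the packing of the nine window pixels and nine coefficients into 72-bit vectors, and the idea of loading the coefficients over the CSR interface are all assumptions made for the example.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Speculative sketch: 3*3 convolution with run-time programmable
-- signed 8-bit coefficients. window and coeff each hold nine bytes,
-- element i occupying bits (8*i+7 downto 8*i).
entity kernel3_mac is
    Port ( clk      : in  std_logic;
           pixel_en : in  std_logic;                      -- pixel-rate enable
           window   : in  std_logic_vector(71 downto 0);  -- nine kernel window pixels (unsigned)
           coeff    : in  std_logic_vector(71 downto 0);  -- nine signed coefficients
           acc_out  : out signed(19 downto 0));           -- full-precision result
end kernel3_mac;

architecture Behavioral of kernel3_mac is
begin
    process(clk)
        variable acc : signed(19 downto 0);
        variable pix : signed(8 downto 0);
        variable cof : signed(7 downto 0);
    begin
        if clk'event and clk = '1' then
            if pixel_en = '1' then
                acc := (others => '0');
                for i in 0 to 8 loop
                    -- zero-extend the unsigned pixel so the signed multiply is correct
                    pix := signed('0' & window(8*i+7 downto 8*i));
                    cof := signed(coeff(8*i+7 downto 8*i));
                    acc := acc + resize(pix * cof, 20);
                end loop;
                acc_out <= acc;
            end if;
        end if;
    end process;
end Behavioral;

A real implementation would pipeline the nine products and deal with rounding and clamping of the 20-bit result back to 8 bits, but the sketch suggests the change is mostly a matter of replacing the fixed Prewitt additions with a small MAC array.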
The advantages of hardware over software in terms of execution times for image processing algorithms are well documented, and are further evident from the designs completed here. However, to take advantage of this there needs to be a solid framework of application software and communication buses to allow images on the FPGA to be quickly and easily accessed. In addition, to make it all worthwhile, more and more image processing features will have to be added, with greater programmability for good control of the tasks.



Appendix 1 Colourmap File after setting the Black and White Attribute
Index   Red     Green   Blue
0       0       0       0
1       0.502   0       0
2       0       0.502   0
3       0.502   0.502   0
4       0       0       0.502
5       0.502   0       0.502
6       0       0.502   0.502
7       0.7529  0.7529  0.7529
8       0.7529  0.8627  0.7529
9       0.651   0.7922  0.9412
10      0.251   0.1255  0
11      0.3765  0.1255  0
12      0.502   0.1255  0
13      0.6275  0.1255  0
14      0.7529  0.1255  0
15      0.8784  0.1255  0
16      0       0.251   0
17      0.1255  0.251   0
18      0.251   0.251   0
19      0.3765  0.251   0
20      0.502   0.251   0
21      0.6275  0.251   0
22      0.7529  0.251   0
23      0.8784  0.251   0
...     ...     ...     ...
242     0.251   0.7529  0.7529
243     0.3765  0.7529  0.7529
244     0.502   0.7529  0.7529
245     0.6275  0.7529  0.7529
246     1       0.9843  0.9412
247     0.6275  0.6275  0.6431
248     0.502   0.502   0.502
249     1       0       0
250     0       1       0
251     1       1       0
252     0       0       1
253     1       0       1
254     0       1       1
255     1       1       1


Appendix 2 Colourmap file if Black and White attribute is not set.


Index   Red     Green   Blue
0       0       0       0
1       0.502   0.502   0.502
2       0.502   0       0
3       0.502   0.502   0
4       0       0.502   0
5       0       0.502   0.502
6       0       0       0.502
7       0.502   0       0.502
8       0.502   0.502   0.251
9       0       0.251   0.251
10      0       0.502   1
11      0       0.251   0.502
12      0.251   0       1
13      0.502   0.251   0
14      1       1       1
15      0.7529  0.7529  0.7529
16      1       0       0
17      1       1       0
18      0       1       0
19      0       1       1
20      0       0       1
21      1       0       1
22      1       1       0.502
23      0       1       0.502
24      0.502   1       1
25      0.502   0.502   1
26      1       0       0.502
27      1       0.502   0.251
28      0       0       0
29      0       0       0
30      0       0       0
...     ...     ...     ...


Appendix 3 Information re: bmp images using Matlab

Filename: 'meg_8bitgray.bmp'
FileModDate: '14-Mar-2006 14:17:28'
FileSize: 33478
Format: 'bmp'
FormatVersion: 'Version 3 (Microsoft Windows 3.x)'
Width: 180
Height: 180
BitDepth: 8
ColorType: 'indexed'
FormatSignature: 'BM'
NumColormapEntries: 256
Colormap: [256x3 double]
RedMask: []
GreenMask: []
BlueMask: []
ImageDataOffset: 1078
BitmapHeaderSize: 40
NumPlanes: 1
CompressionType: 'none'
BitmapSize: 32400
HorzResolution: 0
VertResolution: 0
NumColorsUsed: 256
NumImportantColors: 0

Filename: 'NUIGImage1_180x180_bw.bmp'
FileModDate: '13-Mar-2006 14:28:52'
FileSize: 33478
Format: 'bmp'
FormatVersion: 'Version 3 (Microsoft Windows 3.x)'
Width: 180
Height: 180
BitDepth: 8
ColorType: 'indexed'
FormatSignature: 'BM'
NumColormapEntries: 256
Colormap: [256x3 double]
RedMask: []
GreenMask: []
BlueMask: []
ImageDataOffset: 1078
BitmapHeaderSize: 40
NumPlanes: 1
CompressionType: 'none'
BitmapSize: 32400
HorzResolution: 0
VertResolution: 0
NumColorsUsed: 0
NumImportantColors: 0


Appendix 4 VHDL code for the DSP top-level block (DSPblk.vhd)


-- =====================================================================
-- Original Code Copyright (c) 2005 by Tommy Gartlan
-- Synthesisable VHDL model for dspBlock
-- Created    : May 2005. Tommy Gartlan, DKIT
-- Signed off : June 2006
--
-- Description:
-- Version 2 of the DSP controller block, which allows for pipelining of
-- image processing tasks. This block now contains two sub-blocks:
-- a pixel image processing sub-block (pixel_img_proc) that caters for
-- point operations, and a neighbourhood image processing sub-block
-- (kernel3_img_proc) that caters for Prewitt edge detection and
-- morphological operations.
--
-- Bit 3 of the CSR(6) register is used to select which sub-block output
-- is active:
--   csr_6(3) = 0 => Neighbourhood operations
--   csr_6(3) = 1 => Point operations
--
-- This block is a generic DSP controller for the implementation of DSP
-- functions in real time. Upon the assertion of dspActive it reads data
-- from RAM, passes it to the DSP function block, receives data from the
-- DSP function block and writes data to RAM.
--
-- The DSP function block will be pipelined. Various function blocks will
-- have different pipeline lengths. This DSP controller block continues
-- to read data from the RAM and pass it to the DSP function block until
-- there is valid data to write. Then the DSP controller may alternate
-- between reading and writing. If all the data has been read then the
-- DSP block will stay active while there is data left to write (i.e.
-- until the DSP function block pipeline has been flushed out).
--
-- This DSP block keeps track of the read address (Readcnt), the write
-- address (Writecnt) and whether there is data still valid or not (cnt).
--
-- 1. Activated on assertion of dspActive
-- 2. Reads RAM 32-bit data value from both RAM quadrant 0 and 1
--    (RAM address bus width is 18 bits => 256k addressable 32-bit longwords)
--      Quadrant 0 base address (binary) = 00_0000_0000_0000_0000
--      Quadrant 0 top address           = 00_1111_1111_1111_1111
--      Quadrant 1 base address          = 01_0000_0000_0000_0000
--      Quadrant 1 top address           = 01_1111_1111_1111_1111
--      Quadrant 2 base address          = 10_0000_0000_0000_0000
--      Quadrant 2 top address           = 10_1111_1111_1111_1111
--      Quadrant 3 base address          = 11_0000_0000_0000_0000
--      Quadrant 3 top address           = 11_1111_1111_1111_1111 (256k)
-- 3. Passes data to DSP function block
-- 4. Receives data from DSP function block
-- 5. Writes data to RAM
-- 6. Requires prior setup of DSP RAM address range (dspAddRange(15:0)),
--    allowing programming of 0 to 64k delta frame cycles.
--    dspAddRange(15:0) is stored in CSR(4:3) bytes
-- 7. Asserts dspDone on completion of task

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
--use IEEE.numeric_std.all;
use work.NUIGPackage.all;

entity dspBlk is
  Port ( clk         : in  std_logic;                      -- system clock strobe
         rst         : in  std_logic;                      -- system reset
         dspActive   : in  std_logic;                      -- dspBlock active flag
         dspDone     : out std_logic;                      -- dspBlock done
         datFromRam  : in  std_logic_vector(31 downto 0);  -- data read from RAM
         dspAddRange : in  std_logic_vector(15 downto 0);  -- number of RAM locations to be accessed
         CSR_6       : in  std_logic_vector(7 downto 0);
         CSR_7       : in  std_logic_vector(7 downto 0);
         dspRamWr    : out std_logic;                      -- enable RAM wr access by dspBlock
         dspRamRd    : out std_logic;                      -- enable RAM rd access by dspBlock
         ramDone     : in  std_logic;                      -- RAM active flag
         dspDat2Ram  : out std_logic_vector(31 downto 0);  -- data from dspBlock to be written to RAM
         dspRamAdd   : out std_logic_vector(17 downto 0)   -- RAM address (from dspBlock)
       );
end dspBlk;

architecture RTL of dspBlk is

type stateType is (idle, rdRamASt1, rdRamASt2, Data_Enable, wrRamSt1, wrRamSt2, RegDly1); -- Version 2 mod
signal CS, NS : stateType := idle;

signal ldCnt0   : std_logic;                 -- dspRamAdd(15:0) control signal
signal cnt      : integer range 0 to 65535;  -- 2^16 - 1; tracks number of times the DSP function block has been enabled
signal Readcnt  : integer range 0 to 65535;  -- keeps track of number of locations read
signal Writecnt : integer range 0 to 65535;  -- keeps track of number of locations written

signal Total_len      : unsigned(15 downto 0);
signal Finish_Reading : std_logic;  -- indicates when all data has been read
signal Finish_Writing : std_logic;  -- indicates when all data has been written
signal DataEn         : std_logic;  -- enables the DSP function block

signal DataFromPixelBlk  : std_logic_vector(31 downto 0);
signal DataFromKernelBlk : std_logic_vector(31 downto 0);
signal DataVal           : std_logic;  -- indicates data from function block is valid

signal Pipeline_lnt_pixel  : std_logic_vector(15 downto 0);  -- holds length of pipeline in function block
signal Pipeline_lnt_kernel : std_logic_vector(15 downto 0);  -- holds length of pipeline in function block
signal Pipeline_lnt        : std_logic_vector(15 downto 0);  -- holds length of selected pipeline

signal incReadcnt  : std_logic;  -- enable Read counter
signal incWritecnt : std_logic;  -- enable Write counter
signal DSPfunction : std_logic := '1';

-- Use small index values until the final step, since simulation of the
-- RamBFM is extremely slow when a large RAM block is used.
-- 1 ******** change this for implementation *********
constant msbIndex   : integer := 17;  -- msb bit index          -- implementation
constant msbIndexM1 : integer := 16;  -- next lowest bit index  -- implementation
--constant msbIndex   : integer := 7; -- msb bit index          -- simulation
--constant msbIndexM1 : integer := 6; -- next lowest bit index  -- simulation
-- **************************************************

COMPONENT pixel_img_proc
PORT(
  RamDataIn    : IN  std_logic_vector(31 downto 0);
  DataEn       : IN  std_logic;
  rst          : IN  std_logic;
  clk          : IN  std_logic;
  CSR_6        : IN  std_logic_vector(7 downto 0);
  CSR_7        : IN  std_logic_vector(7 downto 0);
  RamDataOut   : OUT std_logic_vector(31 downto 0);
  Pipeline_lnt : OUT std_logic_vector(15 downto 0)
  );
END COMPONENT;

COMPONENT kernel3_img_proc
PORT(
  RamDataIn    : IN  std_logic_vector(31 downto 0);
  DataEn       : IN  std_logic;
  rst          : IN  std_logic;
  clk          : IN  std_logic;
  CSR_6        : IN  std_logic_vector(7 downto 0);
  CSR_7        : IN  std_logic_vector(7 downto 0);
  RamDataOut   : OUT std_logic_vector(31 downto 0);
  Pipeline_lnt : OUT std_logic_vector(15 downto 0)
  );
END COMPONENT;

begin

-- instantiate sub-block for point operations
Inst_pixel_img_proc: pixel_img_proc PORT MAP(
  RamDataIn    => datFromRam,
  DataEn       => DataEn,
  rst          => rst,
  clk          => clk,
  CSR_6        => CSR_6,
  CSR_7        => CSR_7,
  RamDataOut   => DataFromPixelBlk,
  Pipeline_lnt => Pipeline_lnt_pixel
  );

-- instantiate sub-block for neighbourhood operations
Inst_kernel3_img_proc: kernel3_img_proc PORT MAP(
  RamDataIn    => datFromRam,
  DataEn       => DataEn,
  rst          => rst,
  clk          => clk,
  CSR_6        => CSR_6,
  CSR_7        => CSR_7,
  RamDataOut   => DataFromKernelBlk,
  Pipeline_lnt => Pipeline_lnt_kernel
  );

-- 2 ******** change this for implementation *********
-- comment this line out for implementation
--dspRamAdd(17 downto 8) <= (others => '0');
-- **************************************************

select_block:    dspDat2Ram   <= DataFromPixelBlk when CSR_6(3) = '1' else DataFromKernelBlk;
select_pipe_lnt: Pipeline_lnt <= Pipeline_lnt_pixel when CSR_6(3) = '1' else Pipeline_lnt_kernel;

-- Counters
ramAddCnt: process(clk, rst)
begin
  if rst = '1' then
    cnt <= 0;
    Readcnt <= 0;
  elsif clk'event and clk = '1' then
    if ldCnt0 = '1' then
      cnt <= 0;
      Readcnt <= 0;
      Writecnt <= 0;
    else
      if DataEn = '1' then
        cnt <= cnt + 1;
      end if;
      if incReadcnt = '1' then
        Readcnt <= Readcnt + 1;
      end if;
      if incWritecnt = '1' then
        Writecnt <= Writecnt + 1;
      end if;
    end if;
  end if;
end process;

-- Generate Finish_Reading
chkReadDone: Finish_Reading <= '1' when (Readcnt >= unsigned(dspAddRange)) else '0';

-- Generate Finish_Writing
Total_len <= unsigned(dspAddRange) + unsigned(Pipeline_lnt);
chkWriteDone: Finish_Writing <= '1' when ((cnt >= Total_len - 1) and CS = WrRamSt2 and ramDone = '1') else '0';

-- Generate Data Valid
chkDataVal: DataVal <= '1' when (cnt >= (unsigned(Pipeline_lnt))) else '0';

-- Delay dspDone by one period after intDspDone to enable the RAM write
-- transaction to complete before switching the RAM address path from the
-- DSP address to the IO address
delayDspDone: process(clk, rst)
begin
  if rst = '1' then
    dspDone <= '0';
  elsif clk'event and clk = '1' then
    dspDone <= Finish_Reading and Finish_Writing;
  end if;
end process;

-- dspBlk state machine, synchronous section
FSMSynch: process(clk, rst)
begin
  if rst = '1' then
    CS <= idle;
  elsif clk'event and clk = '1' then
    CS <= NS;
  end if;
end process;

-- dspBlk state machine, combinational section
FSMComb: process(CS, dspActive, Finish_Reading, Finish_Writing, DataVal, ramDone, Readcnt, Writecnt)
begin
  DataEn <= '0';
  dspRamWr <= '0';
  dspRamRd <= '0';
  ldCnt0 <= '0';
  incReadcnt <= '0';
  incWritecnt <= '0';
  dspRamAdd(msbIndex) <= '0';    -- RAM quadrant 0 (RAM read)
  dspRamAdd(msbIndexM1) <= '0';
  -- 3 ***** change this for implementation *****
  dspRamAdd(15 downto 0) <= CONV_STD_LOGIC_VECTOR(Readcnt,16);  -- implementation
  --dspRamAdd(5 downto 0) <= CONV_STD_LOGIC_VECTOR(Readcnt,6);  -- simulation
  -- ********************************************
  NS <= CS;  -- NB: default assignment leaves state unchanged

  case CS is
    when idle =>
      if dspActive = '1' then
        NS <= rdRamASt1;
      else
        ldCnt0 <= '1';
      end if;

    when rdRamASt1 =>
      NS <= rdRamASt2;

    when rdRamASt2 =>
      dspRamRd <= '1';
      if ramDone = '1' then
        NS <= Data_Enable;
        incReadcnt <= '1';
      end if;

    when Data_Enable =>
      DataEn <= '1';
      if (DataVal = '1') then
        NS <= wrRamSt1;
      elsif (Finish_Reading = '0') then
        NS <= rdRamASt1;
      end if;

    when wrRamSt1 =>
      -- 4 ***** change this for implementation *****
      dspRamAdd(15 downto 0) <= CONV_STD_LOGIC_VECTOR(Writecnt,16);  -- implementation
      --dspRamAdd(5 downto 0) <= CONV_STD_LOGIC_VECTOR(Writecnt,6);  -- simulation
      -- ********************************************
      dspRamAdd(msbIndex) <= '1';    -- quadrant 2 (RAM write)
      dspRamAdd(msbIndexM1) <= '0';
      NS <= wrRamSt2;

    when wrRamSt2 =>
      -- 5 ***** change this for implementation *****
      dspRamAdd(15 downto 0) <= CONV_STD_LOGIC_VECTOR(Writecnt,16);  -- implementation
      --dspRamAdd(5 downto 0) <= CONV_STD_LOGIC_VECTOR(Writecnt,6);  -- simulation
      -- ********************************************
      dspRamAdd(msbIndex) <= '1';    -- quadrant 2 (RAM write)
      dspRamAdd(msbIndexM1) <= '0';
      dspRamWr <= '1';
      if ramDone = '1' then
        incWritecnt <= '1';
        if Finish_Reading = '0' then
          NS <= rdRamASt1;
        elsif Finish_Writing = '0' then
          if DSPfunction = '0' then
            NS <= Data_Enable;
          else
            NS <= RegDly1;
          end if;
        else
          NS <= idle;
        end if;
      end if;

    when RegDly1 =>
      NS <= Data_Enable;

    when others =>
      null;
  end case;
end process;

end RTL;


Appendix 5 VHDL code for the Point Operations sub-block (pixel_img_proc.vhd)


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- Uncomment the following lines to use the declarations that are
-- provided for instantiating Xilinx primitive components.
--library UNISIM;
--use UNISIM.VComponents.all;

-- =====================================================================
-- Created by Tommy Gartlan 31-May-2005
-- Signed off June 2006
--
-- This sub-block takes care of pixel image processing tasks and is
-- instantiated within the DSP controller block.
-- The CSR(6)(2 downto 0) register is used to select the image processing
-- task. If CSR_6(2 downto 0) is
--   000  No change to pixel values - useful for testing
--   001  Invert pixel values
--   010  Brighten          - pixel brightened by value in CSR_7
--   011  Darken            - pixel darkened by value in CSR_7
--   100  Increase contrast - contrast increase depends on CSR_7 value
--   101  Decrease contrast - contrast decrease depends on CSR_7 value
--   110  Greyscale to black and white
--   111  Greyscale to black and white
--
-- Full description found in associated thesis.
--
-- This block receives 1 RAM location (32 bits wide), divides the data
-- into individual pixel values and performs the selected operation on
-- each pixel. For the black and white conversion a simple threshold
-- operation is performed on each pixel:
--   pixel values <= threshold will be set to 00
--   pixel values >  threshold will be set to FF
-- This is useful for converting a greyscale image to a black and white one.
--
-- The Pipeline_lnt output refers to the number of registers in the
-- datapath. This block has a pipeline length of 2 clock cycles; in other
-- words valid output data will appear 2 clock cycles after valid input
-- data arrives. This information is used by the DSP block to determine
-- when to begin writing data to RAM.
-- =====================================================================

entity pixel_img_proc is
    Port ( RamDataIn    : in  std_logic_vector(31 downto 0);
           DataEn       : in  std_logic;
           rst          : in  std_logic;
           clk          : in  std_logic;
           CSR_6        : in  std_logic_vector(7 downto 0);
           CSR_7        : in  std_logic_vector(7 downto 0);
           RamDataOut   : out std_logic_vector(31 downto 0);
           Pipeline_lnt : out std_logic_vector(15 downto 0));
end pixel_img_proc;

architecture Behavioral of pixel_img_proc is

signal pxl3, pxl2, pxl1, pxl0             : std_logic_vector(7 downto 0);
signal pxlout3, pxlout2, pxlout1, pxlout0 : std_logic_vector(7 downto 0);

constant white : std_logic_vector(7 downto 0) := B"1111_1111";
constant black : std_logic_vector(7 downto 0) := B"0000_0000";

-- The threshold (originally fixed at 80) and the brighten/darken/contrast
-- amount are both taken from the CSR_7 register
signal threshold    : std_logic_vector(7 downto 0);
signal color_change : std_logic_vector(7 downto 0);

signal pxlout3_temp     : std_logic_vector(8 downto 0);
signal pxlout2_temp     : std_logic_vector(8 downto 0);
signal pxlout1_temp     : std_logic_vector(8 downto 0);
signal pxlout0_temp     : std_logic_vector(8 downto 0);
signal pxlout3_temp_sub : std_logic_vector(8 downto 0);
signal pxlout2_temp_sub : std_logic_vector(8 downto 0);
signal pxlout1_temp_sub : std_logic_vector(8 downto 0);
signal pxlout0_temp_sub : std_logic_vector(8 downto 0);

begin

-- The Pipeline_lnt refers to the number of clock cycles between valid
-- input data and valid output data
Pipeline_lnt <= X"0001";

-- Assigning the various threshold values
threshold    <= CSR_7;
color_change <= CSR_7;

-- Pre-compute pixel +/- color_change with a 9-bit result so that
-- overflow/underflow can be detected in the 9th bit
process(pxl3, pxl2, pxl1, pxl0, color_change)
begin
    pxlout3_temp <= ('0' & pxl3) + ('0' & color_change);
    pxlout2_temp <= ('0' & pxl2) + ('0' & color_change);
    pxlout1_temp <= ('0' & pxl1) + ('0' & color_change);
    pxlout0_temp <= ('0' & pxl0) + ('0' & color_change);
    pxlout3_temp_sub <= ('0' & pxl3) - ('0' & color_change);
    pxlout2_temp_sub <= ('0' & pxl2) - ('0' & color_change);
    pxlout1_temp_sub <= ('0' & pxl1) - ('0' & color_change);
    pxlout0_temp_sub <= ('0' & pxl0) - ('0' & color_change);
end process;

-- Separate into pixels and perform the selected point operation
process (clk, rst)
begin
    if rst = '1' then
        pxl3 <= (others => '0');
        pxl2 <= (others => '0');
        pxl1 <= (others => '0');
        pxl0 <= (others => '0');
        pxlout3 <= (others => '0');
        pxlout2 <= (others => '0');
        pxlout1 <= (others => '0');
        pxlout0 <= (others => '0');
    elsif (clk'event and clk = '1') then
        if DataEn = '1' then
            pxl3 <= RamDataIn(31 downto 24);
            pxl2 <= RamDataIn(23 downto 16);
            pxl1 <= RamDataIn(15 downto 8);
            pxl0 <= RamDataIn(7 downto 0);

            case CSR_6(2 downto 0) is

                -- No change
                when "000" =>
                    pxlout3 <= pxl3;
                    pxlout2 <= pxl2;
                    pxlout1 <= pxl1;
                    pxlout0 <= pxl0;

                -- Invert
                when "001" =>
                    pxlout3 <= not(pxl3);
                    pxlout2 <= not(pxl2);
                    pxlout1 <= not(pxl1);
                    pxlout0 <= not(pxl0);

                -- Brighten (saturate at FF if the 9-bit sum overflows)
                when "010" =>
                    if (pxlout3_temp(8) = '1') then
                        pxlout3 <= X"FF";
                    else
                        pxlout3 <= pxlout3_temp(7 downto 0);
                    end if;
                    if (pxlout2_temp(8) = '1') then
                        pxlout2 <= X"FF";
                    else
                        pxlout2 <= pxlout2_temp(7 downto 0);
                    end if;
                    if (pxlout1_temp(8) = '1') then
                        pxlout1 <= X"FF";
                    else
                        pxlout1 <= pxlout1_temp(7 downto 0);
                    end if;
                    if (pxlout0_temp(8) = '1') then
                        pxlout0 <= X"FF";
                    else
                        pxlout0 <= pxlout0_temp(7 downto 0);
                    end if;

                -- Darken (saturate at 00)
                when "011" =>
                    if (color_change > pxl3) then
                        pxlout3 <= X"00";
                    else
                        pxlout3 <= pxl3 - color_change;
                    end if;
                    if (color_change > pxl2) then
                        pxlout2 <= X"00";
                    else
                        pxlout2 <= pxl2 - color_change;
                    end if;
                    if (color_change > pxl1) then
                        pxlout1 <= X"00";
                    else
                        pxlout1 <= pxl1 - color_change;
                    end if;
                    if (color_change > pxl0) then
                        pxlout0 <= X"00";
                    else
                        pxlout0 <= pxl0 - color_change;
                    end if;

                -- Increase contrast
                when "100" =>
                    if (pxl3 > X"7F") then
                        if (pxlout3_temp(8) = '1') then
                            pxlout3 <= X"FF";
                        else
                            pxlout3 <= pxlout3_temp(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout3_temp_sub(8) = '1') then
                            pxlout3 <= X"00";
                        else
                            pxlout3 <= pxlout3_temp_sub(7 downto 0);
                        end if;
                    end if;
                    if (pxl2 > X"7F") then
                        if (pxlout2_temp(8) = '1') then
                            pxlout2 <= X"FF";
                        else
                            pxlout2 <= pxlout2_temp(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout2_temp_sub(8) = '1') then
                            pxlout2 <= X"00";
                        else
                            pxlout2 <= pxlout2_temp_sub(7 downto 0);
                        end if;
                    end if;
                    if (pxl1 > X"7F") then
                        if (pxlout1_temp(8) = '1') then
                            pxlout1 <= X"FF";
                        else
                            pxlout1 <= pxlout1_temp(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout1_temp_sub(8) = '1') then
                            pxlout1 <= X"00";
                        else
                            pxlout1 <= pxlout1_temp_sub(7 downto 0);
                        end if;
                    end if;
                    if (pxl0 > X"7F") then
                        if (pxlout0_temp(8) = '1') then
                            pxlout0 <= X"FF";
                        else
                            pxlout0 <= pxlout0_temp(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout0_temp_sub(8) = '1') then
                            pxlout0 <= X"00";
                        else
                            pxlout0 <= pxlout0_temp_sub(7 downto 0);
                        end if;
                    end if;

                -- Decrease contrast
                when "101" =>
                    if (pxl3 > X"7F") then
                        if (pxlout3_temp_sub(7) = '0') then  -- orig value - contrast < mid point value
                            pxlout3 <= X"7F";
                        else
                            pxlout3 <= pxlout3_temp_sub(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout3_temp(7) = '1') then      -- orig value + contrast > mid point value
                            pxlout3 <= X"7F";
                        else
                            pxlout3 <= pxlout3_temp(7 downto 0);
                        end if;
                    end if;
                    if (pxl2 > X"7F") then
                        if (pxlout2_temp_sub(7) = '0') then
                            pxlout2 <= X"7F";
                        else
                            pxlout2 <= pxlout2_temp_sub(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout2_temp(7) = '1') then
                            pxlout2 <= X"7F";
                        else
                            pxlout2 <= pxlout2_temp(7 downto 0);
                        end if;
                    end if;
                    if (pxl1 > X"7F") then
                        if (pxlout1_temp_sub(7) = '0') then
                            pxlout1 <= X"7F";
                        else
                            pxlout1 <= pxlout1_temp_sub(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout1_temp(7) = '1') then
                            pxlout1 <= X"7F";
                        else
                            pxlout1 <= pxlout1_temp(7 downto 0);
                        end if;
                    end if;
                    if (pxl0 > X"7F") then
                        if (pxlout0_temp_sub(7) = '0') then
                            pxlout0 <= X"7F";
                        else
                            pxlout0 <= pxlout0_temp_sub(7 downto 0);
                        end if;
                    else  -- pixel value is less than mid point
                        if (pxlout0_temp(7) = '1') then
                            pxlout0 <= X"7F";
                        else
                            pxlout0 <= pxlout0_temp(7 downto 0);
                        end if;
                    end if;

                -- Change to black and white using threshold (110 or 111)
                when others =>
                    if (pxl3 > threshold) then
                        pxlout3 <= white;
                    else
                        pxlout3 <= black;
                    end if;
                    if (pxl2 > threshold) then
                        pxlout2 <= white;
                    else
                        pxlout2 <= black;
                    end if;
                    if (pxl1 > threshold) then
                        pxlout1 <= white;
                    else
                        pxlout1 <= black;
                    end if;
                    if (pxl0 > threshold) then
                        pxlout0 <= white;
                    else
                        pxlout0 <= black;
                    end if;

            end case;
        end if;  -- end DataEn
    end if;  -- end clk
end process;

RamDataOut(31 downto 24) <= pxlout3;
RamDataOut(23 downto 16) <= pxlout2;
RamDataOut(15 downto 8)  <= pxlout1;
RamDataOut(7 downto 0)   <= pxlout0;

end Behavioral;


Appendix 6 VHDL code for the Neighbourhood and Morphological Operations sub-block (kernel3_img_proc.vhd)
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- TG
-- =====================================================================
-- Created by Tommy Gartlan 6-March-2005
-- Signed off June 2006
--
-- See thesis report for block diagrams.
--
-- This block handles all 3*3 kernel image processing operations.
-- The CSR(6)(2 downto 0) register is used to select the image processing
-- task. If CSR_6(2 downto 0) is
--   000  Full Prewitt edge detection    - Threshold_X is set by CSR_6
--                                       - Threshold_Y is set by CSR_7
--   001  Horizontal edge detection only - Threshold_X is set by CSR_6
--   010  Vertical edge detection only   - Threshold_Y is set by CSR_7
--   011  Erosion, 8-connectivity
--   100  Erosion, 4-connectivity
--   101  Dilation, 8-connectivity
--   110  Dilation, 4-connectivity
--   111  Not used... will default to full edge detection
-- =====================================================================
-- Written by Tommy Gartlan 31-May-2005
--
-- This block is designed to perform Prewitt edge detection. It does this
-- by performing horizontal and vertical edge detection separately and
-- then combining the two to give an overall result.
-- This block receives 1 RAM location (32 bits wide), divides the data
-- into individual pixel values and feeds them into the first line buffer.
-- There are three line buffers needed in total. The line buffer length
-- should be the same as the image width, and may be varied during
-- synthesis by changing the width of the generated line buffer core.
--
-- Effectively the gradient is determined in both the horizontal and
-- vertical directions. If these gradients are more than the corresponding
-- thresholds then an edge is declared by setting the pixel value
-- white (FF); else the pixel is set black (00).
--
-- The threshold in the X direction 'Threshold_X' and the threshold in
-- the Y direction 'Threshold_Y' can be varied.
--
-- ======== Parameters ========
--   Width       -- image width
--   Threshold_X -- gradient in X direction must surpass this value to be declared an edge
--   Threshold_Y -- gradient in Y direction must surpass this value to be declared an edge
-- =====================================================================

entity kernel3_img_proc is
    Port ( RamDataIn    : in  std_logic_vector(31 downto 0);
           DataEn       : in  std_logic;
           rst          : in  std_logic;
           clk          : in  std_logic;
           CSR_6        : in  std_logic_vector(7 downto 0);
           CSR_7        : in  std_logic_vector(7 downto 0);
           RamDataOut   : out std_logic_vector(31 downto 0);
           Pipeline_lnt : out std_logic_vector(15 downto 0));
end kernel3_img_proc;

architecture Behavioral of kernel3_img_proc is

-- Make visible one of the line buffers created by the core generator
component line_buf_ext177
--component line_buf_ext29
    port ( CLK : in  std_logic;
           D   : in  std_logic_vector(7 downto 0);
           Q   : out std_logic_vector(7 downto 0);
           CE  : in  std_logic);
end component;

type stateType is (zero, one, two, three, four);
signal CS, NS : stateType := zero;

-- width of the image is (core_width + 3) = 180
constant core_width : integer := 177;
--constant core_width : integer := 29;

-------------------- edge detect declarations --------------------
constant white : std_logic_vector(7 downto 0) := B"1111_1111";
constant black : std_logic_vector(7 downto 0) := B"0000_0000";

-- Pipeline length is the number of 'DataEn' delays from valid input to
-- valid output. The pipeline varies according to the width of the image,
-- in multiples of 4, since this is the way memory is organised, plus 2
-- since there are 8 pixel clock delays in this design:
--   image width is core_width + 3
--   have to wait 2 line delays for the first pixel to emerge => 2 * (core_width + 3)
--   divide by 4 since we want the number of DataEn delays, which is
--   4 times slower than the pixel clock => ((2 * (core_width + 3)) / 4)
--   the final 2 is due to 8 more pixel delays in the block, i.e. 2 DataEn delays
constant pipeline_length : integer := 2 + ((2 * (core_width + 3)) / 4);

-- The thresholds for edge detection are set by CSR registers 6 and 7
signal Threshold_X : std_logic_vector(9 downto 0);
signal Threshold_Y : std_logic_vector(9 downto 0);

-- registers for the 3 * 3 kernel
-- Made 10 bits wide since they are added to give a 10-bit sum;
-- the upper 2 bits are padded with 0 before addition.
signal line0_out, line0_out2, line0_out3 : std_logic_vector(9 downto 0);
signal line1_out, line1_out2, line1_out3 : std_logic_vector(9 downto 0);  -- 2nd and 3rd outputs of the line buffers
signal line2_out, line2_out2, line2_out3 : std_logic_vector(9 downto 0);

-- signals used to hold individual pixel values
signal pxl3, pxl2, pxl1, pxl0 : std_logic_vector(7 downto 0);

-- signals used in determining the gradient in the X direction
signal kx_pxl_add, kx_pxl_add_dly1, kx_pxl_add_dly2, kx_gradient : std_logic_vector(9 downto 0);

-- signals used in determining the gradient in the Y direction
signal ky_add_top, ky_add_btm, ky_gradient, ky_grad_dly1, ky_grad_dly2 : std_logic_vector(9 downto 0);

-- signals used in erosion and dilation
signal pxls_and_top, pxls_and_mid, pxls_and_btm : std_logic;
signal pxls_and_all, pxls_and_all_dly1, pxls_and_all_dly2 : std_logic;
signal pxls_and_4, pxls_and_4_dly1, pxls_and_4_dly2 : std_logic;
signal pxls_or_top, pxls_or_mid, pxls_or_btm : std_logic;
signal pxls_or_all, pxls_or_all_dly1, pxls_or_all_dly2 : std_logic;
signal pxls_or_4, pxls_or_4_dly1, pxls_or_4_dly2 : std_logic;

-- These signals contain the output pixel values that make up the
-- 32-bit wide RamDataOut signal
signal k_out : std_logic_vector(7 downto 0);
signal k_out_dly1, k_out_dly2, k_out_dly3 : std_logic_vector(7 downto 0);

-- The Pixel_En signal is enabled 4 times for every DataEn:
-- DataEn enables the movement of 1 RAM location (32 bits wide);
-- Pixel_En enables the movement of 1 pixel (8 bits wide)
signal pixel_en : std_logic;

-- extra signals added as a result of the core
signal line0_core : std_logic_vector(7 downto 0);
signal line1_core : std_logic_vector(7 downto 0);
signal line2_core : std_logic_vector(7 downto 0);

begin

-- instantiate the line buffers
--line0_buf : line_buf_ext29
line0_buf : line_buf_ext177
    port map ( CLK => clk, D => pxl0, Q => line0_core, CE => pixel_en );

--line1_buf : line_buf_ext29
line1_buf : line_buf_ext177
    port map ( CLK => clk, D => line0_out(7 downto 0), Q => line1_core, CE => pixel_en );

--line2_buf : line_buf_ext29
line2_buf : line_buf_ext177
    port map ( CLK => clk, D => line1_out(7 downto 0), Q => line2_core, CE => pixel_en );

-- convert pipeline length to std_logic_vector
Pipeline_lnt <= CONV_STD_LOGIC_VECTOR(pipeline_length, 16);

-- Assigning the various threshold values
Threshold_X <= CSR_7 & "00";
Threshold_Y <= CSR_7 & "00";

----------------- Separate the RamDataIn into pixels -----------------
process (clk, rst)
begin
    if rst = '1' then
        pxl3 <= (others => '0');
        pxl2 <= (others => '0');
        pxl1 <= (others => '0');
        pxl0 <= (others => '0');
    elsif (clk'event and clk = '1') then
        if DataEn = '1' then
            pxl3 <= RamDataIn(31 downto 24);
            pxl2 <= RamDataIn(23 downto 16);
            pxl1 <= RamDataIn(15 downto 8);
            pxl0 <= RamDataIn(7 downto 0);
        elsif pixel_en = '1' then
            pxl0 <= pxl1;
            pxl1 <= pxl2;
            pxl2 <= pxl3;
        end if;  -- end DataEn
    end if;  -- end clk
end process;

----------- State machine used to generate the Pixel_En signal -----------
cnt_sync: process(clk, rst)
begin
    if rst = '1' then
        CS <= zero;
    elsif clk'event and clk = '1' then
        CS <= NS;
    end if;
end process;

cnt_comb: process(CS, DataEn)
begin
    pixel_en <= '0';
    NS <= CS;
    case CS is
        when zero =>
            if DataEn = '1' then
                NS <= four;
                pixel_en <= '1';
            end if;
        when one =>
            if DataEn = '1' then
                NS <= four;
                pixel_en <= '1';
            else
                NS <= zero;
                pixel_en <= '0';
            end if;
        when two =>
            NS <= one;
            pixel_en <= '1';
        when three =>
            NS <= two;
            pixel_en <= '1';
        when four =>
            NS <= three;
            pixel_en <= '1';
        when others =>
            null;
    end case;
end process;

-------------------- The kernel registers, 3 * 3 --------------------
--   line0_out3 -> line0_out2 -> line0_out
--   line1_out3 -> line1_out2 -> line1_out
--   line2_out3 -> line2_out2 -> line2_out
kernel_reg: process(clk, rst)
begin
    if rst = '1' then
        line0_out  <= (others => '0');
        line0_out2 <= (others => '0');
        line0_out3 <= (others => '0');
        line1_out  <= (others => '0');
        line1_out2 <= (others => '0');
        line1_out3 <= (others => '0');
        line2_out  <= (others => '0');
        line2_out2 <= (others => '0');
        line2_out3 <= (others => '0');
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            line0_out3 <= "00" & line0_core;
            line0_out2 <= line0_out3;
            line0_out  <= line0_out2;
            line1_out3 <= "00" & line1_core;
            line1_out2 <= line1_out3;
            line1_out  <= line1_out2;
            line2_out3 <= "00" & line2_core;
            line2_out2 <= line2_out3;
            line2_out  <= line2_out2;
        end if;
    end if;
end process;

---------- Generate gradient in the X direction for edge detection ----------
kx_proc: process(clk, rst)
begin
    if rst = '1' then
        kx_pxl_add <= (others => '0');
        kx_pxl_add_dly1 <= (others => '0');
        kx_pxl_add_dly2 <= (others => '0');
        kx_gradient <= (others => '0');
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            kx_pxl_add <= line0_out + line1_out + line2_out;
            kx_pxl_add_dly1 <= kx_pxl_add;
            kx_pxl_add_dly2 <= kx_pxl_add_dly1;
            if (kx_pxl_add > kx_pxl_add_dly2) then
                kx_gradient <= kx_pxl_add - kx_pxl_add_dly2;
            else
                kx_gradient <= kx_pxl_add_dly2 - kx_pxl_add;
            end if;
        end if;
    end if;
end process;

---------- Generate gradient in the Y direction for edge detection ----------
ky_proc: process(clk, rst)
begin
    if rst = '1' then
        ky_add_top <= (others => '0');
        ky_add_btm <= (others => '0');
        ky_gradient <= (others => '0');
        ky_grad_dly1 <= (others => '0');
        ky_grad_dly2 <= (others => '0');
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            ky_add_top <= line0_out + line0_out2 + line0_out3;
            ky_add_btm <= line2_out + line2_out2 + line2_out3;
            if (ky_add_top > ky_add_btm) then
                ky_gradient <= ky_add_top - ky_add_btm;
            else
                ky_gradient <= ky_add_btm - ky_add_top;
            end if;
            ky_grad_dly1 <= ky_gradient;
            ky_grad_dly2 <= ky_grad_dly1;
        end if;
    end if;
end process;

---------------------------- Erosion and Dilation ----------------------------
and_proc: process(clk, rst)
begin
    if rst = '1' then
        pxls_and_top <= '0';
        pxls_and_mid <= '0';
        pxls_and_btm <= '0';
        pxls_and_all <= '0';
        pxls_and_all_dly1 <= '0';
        pxls_and_all_dly2 <= '0';
        pxls_or_top <= '0';
        pxls_or_mid <= '0';
        pxls_or_btm <= '0';
        pxls_or_all <= '0';
        pxls_or_all_dly1 <= '0';
        pxls_or_all_dly2 <= '0';
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            -- Erosion, 8-connectivity
            pxls_and_top <= line0_out(7) and line0_out2(7) and line0_out3(7);
            pxls_and_mid <= line1_out(7) and line1_out2(7) and line1_out3(7);
            pxls_and_btm <= line2_out(7) and line2_out2(7) and line2_out3(7);
            -- Will return a '1' only if all 9 pixels are the same
            pxls_and_all <= pxls_and_top and pxls_and_mid and pxls_and_btm;
            pxls_and_all_dly1 <= pxls_and_all;
            pxls_and_all_dly2 <= pxls_and_all_dly1;

            -- Erosion, 4-connectivity
            -- (uses line0_out2(7), the middle row via pxls_and_mid, and line2_out2(7))
            -- Will return a '1' only if all 5 pixels are the same
            pxls_and_4 <= line0_out2(7) and pxls_and_mid and line2_out2(7);
            pxls_and_4_dly1 <= pxls_and_4;
            pxls_and_4_dly2 <= pxls_and_4_dly1;

            -- Dilation, 8-connectivity
            pxls_or_top <= line0_out(7) or line0_out2(7) or line0_out3(7);
            pxls_or_mid <= line1_out(7) or line1_out2(7) or line1_out3(7);
            pxls_or_btm <= line2_out(7) or line2_out2(7) or line2_out3(7);
            -- Will return a '1' if any of the 9 pixels is '1'
            pxls_or_all <= pxls_or_top or pxls_or_mid or pxls_or_btm;
            pxls_or_all_dly1 <= pxls_or_all;
            pxls_or_all_dly2 <= pxls_or_all_dly1;

            -- Dilation, 4-connectivity
            -- Will return a '1' if any of the 5 pixels is '1'
            pxls_or_4 <= line0_out2(7) or pxls_or_mid or line2_out(7);
            pxls_or_4_dly1 <= pxls_or_4;
            pxls_or_4_dly2 <= pxls_or_4_dly1;
        end if;
    end if;
end process;

----- Generate the output pixel k_out by selecting the output from one of the various tasks -----
select_k_out: process(clk, rst)
begin
    if rst = '1' then
        k_out <= (others => '0');
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            case CSR_6(2 downto 0) is
                -- Horizontal edge detection
                when "001" =>
                    if (kx_gradient > Threshold_X) then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Vertical edge detection
                when "010" =>
                    if (ky_grad_dly2 > Threshold_Y) then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Erosion, 8-connectivity
                when "011" =>
                    if (pxls_and_all_dly2 = '1') then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Erosion, 4-connectivity
                when "100" =>
                    if (pxls_and_4_dly2 = '1') then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Dilation, 8-connectivity
                when "101" =>
                    if (pxls_or_all_dly2 = '1') then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Dilation, 4-connectivity
                when "110" =>
                    if (pxls_or_4_dly2 = '1') then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
                -- Full edge detection
                when others =>
                    if ((kx_gradient > Threshold_X) or (ky_grad_dly2 > Threshold_Y)) then
                        k_out <= white;
                    else
                        k_out <= black;
                    end if;
            end case;
        end if;  -- pixel_en
    end if;
end process;

---------- Generate, using delays, the 4-pixel output group ----------
output_pixels: process(clk, rst)
begin
    if rst = '1' then
        k_out_dly1 <= (others => '0');
        k_out_dly2 <= (others => '0');
        k_out_dly3 <= (others => '0');
    elsif clk'event and clk = '1' then
        if Pixel_en = '1' then
            k_out_dly1 <= k_out;
            k_out_dly2 <= k_out_dly1;
            k_out_dly3 <= k_out_dly2;
        end if;
    end if;
end process;

-- Register the RamDataOut signal every DataEn. This keeps the data stable
-- while it is being written to memory.
pixel_reassemble: process(clk, rst)
begin
    if rst = '1' then
        RamDataOut <= (others => '0');
    elsif clk'event and clk = '1' then
        if DataEn = '1' then
            RamDataOut(31 downto 24) <= k_out;       -- pixel 3
            RamDataOut(23 downto 16) <= k_out_dly1;  -- pixel 2
            RamDataOut(15 downto 8)  <= k_out_dly2;  -- pixel 1
            RamDataOut(7 downto 0)   <= k_out_dly3;  -- pixel 0
        end if;
    end if;
end process;

end Behavioral;


Appendix 7 Abstract of Synthesis report for Prewitt Edge detect with line buffers (width = 32) implemented using VHDL

INFO:Xst:741 - HDL ADVISOR - A 30-bit shift register was found for signal <line2<7><29>> and currently occupies 30 logic cells (15 slices). Removing the set/reset logic would take advantage of SRL16 (and derived) primitives and reduce this to 2 logic cells (1 slices). Evaluate if the set/reset can be removed for this simple shift register. The majority of simple pipeline structures do not need to be set/reset operationally.

INFO:Xst:738 - HDL ADVISOR - 256 flip-flops were inferred for signal <line0>. You may be trying to describe a RAM in a way that is incompatible with block and distributed RAM resources available on Xilinx devices, or with a specific template that is not supported. Please review the Xilinx resources documentation and the XST user manual for coding guidelines. Taking advantage of RAM resources will lead to improved device usage and reduced synthesis time.

Summary:
  inferred 1 Finite State Machine(s).
  inferred 954 D-type flip-flop(s).
  inferred 10 Adder/Subtracter(s).
  inferred 4 Comparator(s).
  inferred 64 Multiplexer(s).

Unit <prewitt_edge_detect> synthesized.
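The advisor messages above are what prompted the move to the core generator. For reference, the fragment below is a minimal sketch, not taken from the project, of the coding style the advisor is recommending: a shift register with a clock enable and no set/reset, which XST can map onto SRL16E primitives instead of one flip-flop per stage. The entity name and the DEPTH value are illustrative.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity srl_line_buf is
    generic ( DEPTH : integer := 32 );  -- line length in pixels
    Port ( clk  : in  std_logic;
           ce   : in  std_logic;        -- pixel enable
           din  : in  std_logic_vector(7 downto 0);
           dout : out std_logic_vector(7 downto 0));
end srl_line_buf;

architecture Behavioral of srl_line_buf is
    type pix_array is array (0 to DEPTH-1) of std_logic_vector(7 downto 0);
    signal taps : pix_array;
begin
    -- No reset branch on purpose: it is the absence of set/reset that
    -- allows XST to infer SRL16E cells for this shift register.
    process(clk)
    begin
        if clk'event and clk = '1' then
            if ce = '1' then
                taps <= din & taps(0 to DEPTH-2);
            end if;
        end if;
    end process;
    dout <= taps(DEPTH-1);  -- oldest pixel, DEPTH enabled clocks after din
end Behavioral;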


Appendix 8 Mapping fails when line buffers become too big

Design Summary:
Number of errors:   1
Number of warnings: 0
Logic Utilization:
  Number of Slice Flip Flops:  3,976 out of 3,840  103% (OVERMAPPED)
  Number of 4 input LUTs:        551 out of 3,840   14%
Logic Distribution:
  Number of occupied Slices:   2,134 out of 1,920  111% (OVERMAPPED)


Appendix 9 Line buffers implemented using the core generator imply use of cell SRL16E

Final Results
RTL Top Level Output File Name : kernel3_img_proc.ngr
Top Level Output File Name     : kernel3_img_proc

# Registers                   : 70
#   1-bit register            : 44
#   10-bit register           : 18
#   8-bit register            : 8
# Shift Registers             : 2
#   3-bit shift register      : 2
# Multiplexers                : 7
#   2-to-1 multiplexer        : 7
# Adders/Subtractors          : 8
#   10-bit adder              : 6
#   10-bit subtractor         : 2
# Comparators                 : 4
#   10-bit comparator greater : 4

Cell Usage:
# BELS                        : 367
#   GND                       : 4
#   LUT1                      : 3
#   LUT2                      : 23
#   LUT2_L                    : 83
#   LUT3                      : 59
#   LUT3_D                    : 1
#   LUT3_L                    : 1
#   LUT4                      : 15
#   LUT4_D                    : 1
#   LUT4_L                    : 1
#   MUXCY                     : 107
#   VCC                       : 4
#   XORCY                     : 65
# FlipFlops/Latches           : 461
#   FDC                       : 4
#   FDCE                      : 214
#   FDE                       : 242
#   FDP                       : 1
# Shifters                    : 266
#   SRL16E                    : 266
# Clock Buffers               : 1
#   BUFGP                     : 1
# IO Buffers                  : 93
#   IBUF                      : 45
#   OBUF                      : 48


Appendix 10 Final mapping report shows resources to spare

Design Summary
--------------
Number of errors:   0
Number of warnings: 0
Logic Utilization:
  Number of Slice Flip Flops:    767 out of 3,840   19%
  Number of 4 input LUTs:        962 out of 3,840   25%
Logic Distribution:
  Number of occupied Slices:     821 out of 1,920   42%
  Number of Slices containing only related logic:  821 out of 821  100%
  Number of Slices containing unrelated logic:       0 out of 821    0%
  *See NOTES below for an explanation of the effects of unrelated logic
Total Number 4 input LUTs:     1,301 out of 3,840   33%
  Number used as logic:           962
  Number used as a route-thru:     37
  Number used as Shift registers: 302
Number of bonded IOBs:            82 out of 173     47%
  IOB Flip Flops:                 34
Number of GCLKs:                   1 out of 8       12%
Number of RPM macros:              3
Total equivalent gate count for design: 33,878
Additional JTAG gate count for IOBs:     3,936
Peak Memory Usage: 82 MB

