
This thesis proposes a design of the Fast Fourier Transform (FFT) and implements it on the Nios II embedded processor.

This chapter covers the motivation, problem statement, project objectives, scope of work, project contributions, and finally the thesis organization. In the last few decades, embedded systems have experienced accelerating growth both in computing power and in the scope of their possible applications. Moreover, the design procedure for embedded systems has also changed immensely. As application demands keep increasing over time, the complexity of embedded systems grows as well. Combining software and hardware in a design improves system performance; this approach is known as co-design.

1.1 CO-DESIGN

Hardware/software co-design is the main technique used in this thesis. It can be defined as the cooperative design of hardware and software. Co-design methodology deals with the problem of designing heterogeneous systems. One of the goals of co-design is to shorten the time-to-market while reducing the design effort and cost of the designed products. Co-design can be applied to embedded systems, and the processor is the main part of any embedded system. The advantages of using processors are manifold, because software is more flexible and cheaper than hardware. This flexibility allows late design changes and simplifies debugging. Furthermore, the possibility of reusing software by porting it to other processors reduces the time-to-market and the design effort. Finally, in most cases the use of processors is very cheap compared to the development costs of ASICs, because processors are often produced in high volume, leading to a significant price reduction. However, designers turn to hardware when processors cannot meet the required performance. This trade-off between hardware and software illustrates the optimization aspect of the co-design problem. Co-design is an interdisciplinary activity that brings together concepts and ideas from different disciplines, e.g. system-level modeling, hardware design and software design.

The design flow of the general co-design approach is depicted in Figure 1.

Step 1: The co-design process starts with specifying the system behavior at the system level.
Step 2: After this, a pure software system is developed to verify all algorithms.
Step 3: Performance analysis is performed to find the system bottlenecks.
Step 4: In the hardware/software partitioning phase, a plan is made to determine which parts will be realized in hardware and which parts in software. Typically, some system bottlenecks are moved to hardware to improve performance.
Step 5: Based on the results of step 4, the hardware and software parts are designed.
Step 6: Co-simulation. At this step, the completed hardware and software parts are integrated and performance analysis is performed.
Step 7: If the performance meets the requirements, the design can stop; if it cannot meet the requirements, a new HW/SW partitioning and a new design are carried out.
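Steps 3 to 7 describe an iterative loop: profile, move a bottleneck to hardware, and re-check performance. The following is only a minimal sketch of that loop, assuming a greedy rule (always move the slowest remaining software task) that the thesis does not prescribe. The `Task` structure, the function names and the speed-up figures are hypothetical and serve as illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical task model: software run time and an assumed speed-up
// factor if the task is moved into hardware.
struct Task {
    std::string name;
    double sw_time_ms;
    double hw_speedup;
    bool in_hw;
};

// Step 3 / step 6: estimate the total run time of the current partitioning.
double estimated_time_ms(const std::vector<Task>& tasks) {
    double total = 0.0;
    for (const Task& t : tasks) {
        total += t.in_hw ? t.sw_time_ms / t.hw_speedup : t.sw_time_ms;
    }
    return total;
}

// Steps 4 to 7: while the requirement is not met, move the largest
// remaining software bottleneck to hardware. Returns the number of
// tasks that were moved.
int partition(std::vector<Task>& tasks, double target_ms) {
    assert(target_ms > 0.0);
    int moved = 0;
    while (estimated_time_ms(tasks) > target_ms) {
        Task* worst = nullptr;
        for (Task& t : tasks) {
            if (!t.in_hw && (worst == nullptr || t.sw_time_ms > worst->sw_time_ms)) {
                worst = &t;
            }
        }
        if (worst == nullptr) break;  // everything is already in hardware
        worst->in_hw = true;
        ++moved;
    }
    return moved;
}
```

With three illustrative tasks of 80 ms, 15 ms and 5 ms and a 30 ms target, the sketch moves only the 80 ms task to hardware and then stops, mirroring the stop condition in step 7.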

[Figure 1 diagram, spanning "primarily software" to "primarily hardware". Blocks: System def.; Function design; Hardware and software partitioning & codesign; Hw and Sw parts; Hardware design; Software design; Software code; Hardware fab.; Integration and test; Virtual Prototype]

Figure 1- the design flow of the general co-design approach


1.2 FFT Algorithm

Fast Fourier Transform (FFT) algorithms are widely used in many areas of science and engineering. Some of the most widely known FFT algorithms are the Radix-2 algorithm, the Radix-4 algorithm, the Split Radix algorithm, the Fast Hartley transform based algorithm and the Quick Fourier transform. The Discrete Fourier Transform (DFT) is used to produce a frequency analysis of discrete non-periodic signals. The FFT is another method of achieving the same result, but with less computational overhead. One of the most widely used techniques in science and engineering is the Fourier transform and the algorithms based on it. In signal processing, it is primarily used to convert an input signal from the time domain into the frequency domain and vice versa. In the digital world, signals are sampled in the time domain, so the Discrete Fourier Transform is used. The DFT is applied to a discrete input signal and yields the frequency characteristics of the signal as the output. Performing the inverse DFT, which has a mathematical form very similar to the DFT, on the frequency-domain result gives back the signal in the time domain. A signal converted into the frequency domain therefore exposes its individual frequency components, which can then be used to remove certain unwanted components. This concept is used in image and audio compression and in filters on communication signals, to name a few. The Discrete Fourier Transform is a very computationally intensive process based on summing a finite series of products of input signal values and trigonometric functions. The time complexity of the algorithm is O(n^2). To increase performance, several algorithms were proposed which can be implemented in hardware or software. This set of algorithms is known as Fast Fourier Transforms (FFT). The first major FFT algorithm was proposed by Cooley and Tukey.
Many FFT algorithms were proposed with a time complexity of O(n log n). Some of them are the Radix-2 algorithm, the Radix-4 algorithm and the Split Radix algorithm. In this paper, we discuss ways of parallelizing these algorithms to reduce the communication overhead.
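To make the gap between O(n^2) and O(n log n) concrete, the DFT and the radix-2 decimation-in-time split can be written out as below. This is a standard textbook formulation rather than a derivation taken from this thesis; the symbols x[n], X[k], W_N, E[k] and O[k] are notation assumed here for illustration.

```latex
% DFT of an N-point sequence x[n], and its inverse
\[ X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad W_N = e^{-j 2\pi / N}, \qquad k = 0, \dots, N-1 \]
\[ x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-nk} \]

% Radix-2 decimation in time (N even): N/2-point DFTs of the even and odd samples
\[ E[k] = \sum_{m=0}^{N/2-1} x[2m]\, W_{N/2}^{mk}, \qquad O[k] = \sum_{m=0}^{N/2-1} x[2m+1]\, W_{N/2}^{mk} \]

% Butterfly that recombines the two half-size transforms
\[ X[k] = E[k] + W_N^{k}\, O[k], \qquad X[k + N/2] = E[k] - W_N^{k}\, O[k], \qquad k = 0, \dots, N/2 - 1 \]

% Complex multiplications: direct DFT versus radix-2 FFT
\[ N^2 \quad \text{versus} \quad \frac{N}{2} \log_2 N \qquad (N = 16:\ 256 \ \text{versus}\ 32) \]
```

Applying the split recursively until only 2-point transforms remain gives log2(N) stages of N/2 butterflies each, which is where the O(n log n) cost comes from.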

1.3 Motivation

The Fast Fourier Transform is a critical tool in digital signal processing, where its value in analyzing the spectral content of signals has led to a wide variety of applications. The most prevalent of these are in the field of communications, where the ever-increasing demand on signal processing capabilities has raised the importance of the Fourier transform. However, the Fourier transform is also a part of many systems in a wide variety of industrial and research fields. Its uses range from signal processing for the analysis of physical phenomena to the analysis of data in mathematical and financial systems. The majority of systems requiring Fourier transforms are real-time systems, which necessitate high-speed processing of data. Given the complexity of performing the Discrete Fourier Transform, implementing a high-speed Fast Fourier Transform has required the use of dedicated hardware processors. The majority of high-performance Fourier transforms have required full-custom integrated circuits, typically in the form of an application-specific integrated circuit (ASIC). Although much work has been put into raising performance while reducing hardware requirements and cost, the cost of full-custom hardware still limits the availability of Fourier transform hardware for low-volume production. Nevertheless, the development of programmable logic hardware has produced devices that are increasingly capable of handling large-scale hardware. High-density field-programmable gate arrays (FPGAs) already available on the market can boast upwards of 180,000 logic elements, nine megabits of memory, and on-board processors. Using an FPGA to implement hardware eliminates the long and costly process of creating a full-custom integrated circuit, along with the time and cost of testing and verification. This saves design and testing cost and shortens the time from design to a functional device. These features make the FPGA especially attractive for creating embedded processors for research and development purposes.
However, the design of any embedded processor must consider two important factors, efficiency and flexibility, to reach an ideal design.

1.4 Problem Statement

Efficiency and flexibility are two of the most important driving factors in embedded system design. Efficient implementations are required to meet the tight cost, timing, and power constraints present in embedded systems. Flexibility, albeit tough to quantify, is equally important; it allows system designs to be easily modified or enhanced in response to bugs, evolution of standards, market shifts, or user requirements, during the design cycle and even after production. Various implementation alternatives for a given function, ranging from custom-designed hardware to software running on embedded processors, provide a system designer with differing degrees of efficiency and flexibility. Unfortunately, it is often the case that these are conflicting design goals. While efficiency is obtained through custom hardwired implementations, flexibility is best provided through programmable implementations. Hardware/software partitioning, which separates a system's functionality into embedded software (running on programmable processors) and custom hardware (implemented as coprocessors or peripheral units), is one approach to achieving a good balance between flexibility and efficiency.

1.5 Project Objectives

The aim of this project is the design and implementation of the Fast Fourier Transform (FFT) algorithm in an embedded system:
a) Utilizing the Nios II embedded processor.
b) Implementation of the FFT algorithm using the Nios II processor without custom instruction.
c) Implementation of the FFT algorithm using the Nios II processor with custom instruction.
d) Comparison of the two designs of the algorithm, with and without the FFT custom instruction, in terms of speed and area.

1.6 Scope of Work

Taking into account the resources and time available, this project is narrowed down to the following scope of work.
1. This project only considers a 16-point floating-point FFT. The Decimation-In-Time (DIT) algorithm is chosen.
2. The algorithm is implemented in the C++ language.
3. The Floating Point Custom Instruction targets the Nios II platform and is implemented on the Altera Cyclone II DE2 board.
4. The Nios II IDE graphical user interface (GUI) is used to interface with the FPGA hardware, to provide inputs and to display outputs.
5. Universal Serial Bus (USB) is used for transmitting data to and receiving data from the FPGA board.
6. The embedded system is applied to spectral analysis as an application.
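Items 1 and 2 of the scope describe a 16-point floating-point decimation-in-time FFT written in C++. The sketch below shows one common way to write such a transform (iterative, in place, with a bit-reversal permutation). It is not the thesis's own source code; the name `fft_dit` and the use of `std::complex<float>` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using cfloat = std::complex<float>;

// In-place radix-2 decimation-in-time FFT (illustrative sketch).
// The input length must be a power of two, e.g. 16.
void fft_dit(std::vector<cfloat>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    // Reorder the input into bit-reversed index order.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            std::swap(x[i], x[j]);
        }
    }

    // log2(n) stages; each stage combines pairs of half-size transforms.
    const float pi = std::acos(-1.0f);
    for (std::size_t len = 2; len <= n; len <<= 1) {
        const std::size_t half = len / 2;
        for (std::size_t start = 0; start < n; start += len) {
            for (std::size_t k = 0; k < half; ++k) {
                // Twiddle factor W_len^k = exp(-j * 2 * pi * k / len).
                const float angle =
                    -2.0f * pi * static_cast<float>(k) / static_cast<float>(len);
                const cfloat w(std::cos(angle), std::sin(angle));
                // Butterfly: one complex multiply, one add, one subtract.
                const cfloat u = x[start + k];
                const cfloat t = w * x[start + k + half];
                x[start + k] = u + t;
                x[start + k + half] = u - t;
            }
        }
    }
}
```

As a sanity check, a constant input of sixteen ones should transform to a single non-zero bin X[0] = 16, and a unit impulse at index 1 should give X[4] = -j, up to single-precision rounding.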

1.7 Project Contributions

The most important contributions of this project are:
1. An integration framework for the Altera development kit platform.
2. Use of the Nios II Floating Point Custom Instruction in the design to increase performance and speed.
3. A simple protocol for interaction and communication between hardware and software via the computer serial port.

1.8 Thesis Outline

The thesis is organized into six chapters, along with references and an appendix.

Chapter 1 (this chapter) presents the background of the work: a basic overview of the need for embedded systems, co-design and the FFT algorithm, followed by the motivation, problem statement, research objectives, work scope and contributions of this project. The last section presents the outline of the thesis.

Chapter 2 presents a literature survey on the concept of the soft IP core and its use in embedded systems. It also surveys different IP cores such as Nios II, and the FFT algorithm.

Chapter 3 presents an overview of the methodology, flow chart, algorithm, and system design using the Nios II processor.

Chapter 4 introduces the Fast Fourier Transform. A derivation of the FFT is given, concentrating on the radix-2 algorithm. The chapter also gives the complete system description: the embedded system is explained first, followed by the methodology, the Nios II Floating Point Custom Instruction, and finally the implementation of the FFT algorithm in the C++ language.

Chapter 5 shows the system results and the Nios II results. All results are appraised and compared, followed by a conclusion that summarizes the work in this thesis.

Chapter 6 proposes future work, which essentially suggests ways to improve and extend the current design.

Embedded systems are hardware and software components working together to perform a specific function. To design an embedded system, the designer should choose a processor core based on the requirements and performance constraints of the particular application. Each core has different performance characteristics and features that are suitable for specific applications. A survey of different cores is given in the following chapter.


2.1 IP core

In information technology, design reuse is the inclusion of previously designed components in software and hardware. The term is more frequently used in hardware development. Developers can reuse a component in both similar and completely different applications; for example, a component used as part of a central processing unit for a PC could be reused in a handheld device or a set-top box. An IP core is thus defined as a pre-defined, pre-verified complex functional block that is integrated into the logic of a particular design. In electronic design, an IP (intellectual property) core is a reusable unit of logic, cell, or chip layout design that is the intellectual property of one party and is used in making a field-programmable gate array (FPGA) or application-specific integrated circuit (ASIC) for a product. As essential elements of design reuse, IP cores are part of the growing electronic design automation (EDA) industry trend towards repeated use of previously designed components. Ideally, an IP core should be entirely portable, that is, easy to insert into any vendor technology or design methodology. Universal asynchronous receiver/transmitters (UARTs), central processing units (CPUs), Ethernet controllers, and PCI interfaces are all examples of IP cores. One of the most important product development decisions facing SOC designers today is choosing an IP core. It can impact product performance and quality, as well as time-to-market and profitability. But SOC designers face many challenges when choosing a core. Determining which core is most appropriate for a given SOC requires careful consideration. Decisions must be made about the type of core (soft vs. hard), the quality of the deliverables, and the reliability and commitment of the IP provider.
Continuing improvements in silicon manufacturing technology have made vast amounts of silicon real estate available to today's design engineers. Unfortunately, the ability of engineering teams to design circuits has not kept pace. This imbalance has spawned the IP core industry. IP cores allow design teams to rapidly create large system-on-a-chip designs (SOCs) by integrating pre-made blocks that do not require any design work or verification. A number of difficult challenges accompany this new design style. Depending on the core, they can be minimized or exacerbated. First of all, IP cores may be delivered to customers in one of two forms: soft or hard. In both cases, the customer receives a functionally verified design. A soft core, also known as a synthesizable core, is synthesized by the customer and implemented in its SOC. A hard core, on the other hand, is fully implemented and ready for manufacturing. (Technically, a design is not implemented until it is manufactured. In this context, however, implemented means laid out and ready for manufacturing.) The SOC team need only drop the hard core into the chip as a single monolithic piece. Soft and hard cores have different problems and benefits, which are addressed below. An IP core jump-starts a key part of the SOC design task. The design team gets a verified design, which enables them to complete their chip in less time with fewer engineering and EDA resources. However, integrating a core into a chip requires many steps. How easily this is accomplished, if at all, depends on the deliverables provided. This paper details some of the collateral deliverables that enable easy core integration into all stages of the SOC design process. Finally, there is the IP vendor to consider. The IP industry is still young and there have been a number of poor products and even some failures, and they have not been confined to start-ups. Consequently, a customer must evaluate not only the IP core, but also the IP provider.

2.1.1 Types of IP core

Cores can be classified in three categories: hard, firm and soft.

Soft cores

A synthesizable behavioral description of a complete microprocessor in a hardware description language (HDL) such as VHDL or Verilog is called a soft core. An HDL is analogous to a high-level language such as C in the field of computer programming. IP cores delivered to chip makers as RTL permit chip makers to modify designs (at the functional level).
Hard cores


A hard core is generally defined as a lower-level, physical description provided in any of a variety of physical layout file formats. These layouts must obey the target foundry's process design rules, and hence hard cores delivered for one foundry's process cannot be easily ported to a different process or foundry. Such cores, whether analog or digital, are called hard cores because the core's application function cannot be meaningfully modified by the customer.
Soft vs. Hard Cores

Let's examine the pros and cons. Because soft cores are not implemented, they are inherently more flexible in function and implementation than hard cores. On the other hand, hard core developers can afford to spend more time optimizing their implementations because they will be used in many designs. Thus, there is a perception that hard cores offer higher performance. In fact, high-end, full-custom hard cores designed for the most advanced processes do offer more performance than soft cores. By using latches, dynamic logic, 3-state signals, custom memories, and so on, the full-custom design team can achieve much better results than a fully static synthesized design. For an SOC that requires performance that pushes the limits of current process and design technology, a full-custom hard core is better able to meet these needs. However, if the performance target is within the range of a soft core, then the potential performance advantage of a hard core is immaterial. The SOC design team can meet its performance goals with a soft core while taking advantage of its inherent flexibility. (As process technology improves, the maximum frequency limits of soft cores will also improve, making them an option for even more SOC designs.) Even at slower clock frequencies, a hard core may offer an advantage in terms of silicon area. But this is not always true. Often, a hard core is simply hardened using an ASIC-style methodology, which offers no advantage in area or speed. In other cases, a full-custom core is not re-optimized for each process generation, thus diminishing its frequency and area advantages.
Technology Independence & Portability


One of the advantages of a soft core is that it is technology independent. That is to say, the high-level Verilog or VHDL does not require the use of a specific process technology or standard cell library. This means that the same IP core can be used for multiple designs, or for future generations of the current design. (Some soft core IP providers use design styles that make their cores technology-dependent, but the advantages of this approach are unclear.) A hard core, on the other hand, is very technology-specific. In fact, if a foundry changes its process parameters or library factors, a hard core may not work correctly with the process tweaks. This introduces risk, since the IP provider will need to re-verify the hard core when process parameters are changed. Hard cores can be ported to new process technologies, but the effort to re-optimize full-custom cores is significant and costly. It may take two years or more for some advanced microprocessor cores. Because of this, hard cores are often optically scaled for new processes. While simple and fast, this technique diminishes many of the advantages of the full-custom optimizations done by the design team for the original process. Furthermore, optical scaling introduces additional risk, since it only guarantees that the new design meets design rules. It does not guarantee correct timing or function. Since optical scaling is a short-cut design style, it can be very difficult to fully re-verify an optically scaled IP core. In reality, soft cores are likely designed with one technology and library in mind. The design itself is independent of this choice of technology, but it is optimized for this one technology and library. Similar technologies will be near-optimal, but some significantly different technologies (for instance, ones with very slow RAMs) may not see equivalent results. However, this effect is secondary. Soft cores will generally be better optimized than optically scaled hard cores.

Speed/Area/Power Optimization

Hard cores are optimized once, when they are implemented by the IP provider. Because the core is optimized only once, the IP provider can afford to spend significant resources. Thus, a hard core will typically run faster than a comparable soft core for the one technology in which it is implemented. But even in that single technology, it is only optimized for one set of goals. If the goal is low area at reasonable performance, the highly tuned performance-optimized hard core may be too large for the application. Soft cores, on the other hand, can be application optimized: timing, area and power targets can be adjusted to fit the specific embedded SOC design. For instance, if an SOC uses a 200 MHz clock, then a soft IP core that was designed to run at 250 MHz can be targeted to run at exactly 200 MHz instead. This allows for smaller area and lower power while still meeting the design constraints. This application optimization also extends down to low-level IO timing. The IO constraints of a soft core can be adjusted to exactly fit the environment the core will be used in. If a hard core has a late output signal, there is little the SOC designer can do to improve that timing. If an SOC's speed, area and power targets are exactly what the hard core was targeted for, then that hard core will be competitive. For the great majority of designs, however, a soft core will be better optimized for that particular SOC. Soft cores offer another advantage over hard cores: compile-time customizations. These are design options chosen prior to implementation. Cache memory size is a common compile-time customization. A soft-core processor can be configured for exactly the cache size needed by the specific embedded application. A hard core, on the other hand, cannot be customized in this way. Another customization employed in many soft cores is instruction specialization, or optional support for certain instructions. For example, support for external coprocessors may be ...

IP cores are released by the IP core provider as soft cores, firm cores or hard cores, depending on the level of changes that the SoC designer (also called the IP core user) can make to them and the level of transparency they come with when delivered to the final SoC integrator. A soft core consists of a synthesizable HDL description that can be synthesized into different semiconductor processes and design libraries.
A firm core contains more structural information, usually a gate-level netlist that is ready for placement and routing. A hard core includes layout and technology-dependent timing information and is ready to be dropped into a system, but no changes are allowed to it. Hard cores usually have a lower cost, as they are final plug-and-play designs implemented in a specific technology library and no changes are allowed in them. A tremendous variety of IP cores of all types and functionalities is available to SoC designers. Designers therefore have the great advantage of selecting from a rich pool of well-designed and carefully verified cores and integrating them, in a plug-and-play fashion, into the system they are designing.

Core category    Changes    Cost      Description
Soft core        Many       High      HDL
Firm core        Some       Medium    Netlist
Hard core        No         Low       Layout

Table 1- Evaluating the IP Provider

Evaluating the IP Provider

There are many companies that offer IP cores. Many are small, start-up design houses, and some are large, well-established companies using IP cores as a new method of delivering their designs to customers. Unfortunately, the size of a company is not an indicator of IP core quality. The SOC designer should verify the commitment a company has made to IP core products. For example, an IP provider that is not completely committed to IP cores may offer only previous designs repackaged as IP cores. A company that is serious about building high-quality cores will design them from scratch with reuse in mind. This section details some hallmarks of designs made to be reusable. First of all, watch out for soft cores that are the source code for a full-custom hard core. Since these designs were not originally made to be synthesizable, they will be poor products when compared to those designed to be synthesizable. When making a hard core, optimizations can be made based on the known implementation style. However, in a soft core, the implementation is not yet done, so these shortcuts should not be taken since they may result in non-functional or sub-optimal implementations. Another thing to look for in a soft core is registered interface signals. By registering IOs, internal logic can be timing-independent from anything the SOC team hooks up. Furthermore, it enables easy timing predictability and gives very good timing constraints to the SOC designer. All of these things make SOC design easier. A soft core that was designed from the beginning to be reusable will typically have more configuration choices and more flexibility in implementation. It will also likely be delivered with multiple design environments in mind. A design made without reuse in mind will be less flexible in function and implementation. An IP core with poor deliverables can also be difficult to integrate into an SOC flow. Therefore, it is very important to evaluate the IP core deliverables to make sure the correct EDA tools are supported and all steps of the SOC flow can be addressed properly. The choice of the IP provider is perhaps as important as the choice of the IP core itself. An IP provider that is making a significant commitment to IP cores is an absolute necessity. Furthermore, the SOC team needs to know that the IP provider will be there in the future to support the product as well as to introduce new products. There are many challenges facing today's SOC designer. Using a high-quality IP core from a reputable company should make those challenges easier, not more difficult.
Firm Core

A firm core is a structural description of a component provided in HDL. Firm cores are sometimes called semi-hard cores. Like hard cores, they also carry placement data, but they are configurable to various applications. They provide some retargetability and some limited optimization.
2.1.2 Sources of IP cores

There are two sources: commercial and open source. IP cores available under a commercial license are commercial cores. Some of the well-known commercial cores are Altera's Nios II, Xilinx's MicroBlaze and PicoBlaze, and Tensilica's Xtensa. The source code of an open-source soft-core processor is freely available under the GNU (L)GPL license and can be downloaded over the internet. Examples are LEON by Gaisler Research and OpenRISC 1000 from opencores.org.
2.1.3 Comparison of soft IP cores

As part of an extensive library of cores, Altera developed Nios, a processor specifically designed for programmable logic and system-on-a-programmable-chip integration. The Nios processor is a pipelined, single-issue RISC processor in which most instructions run in a single clock cycle. There are two versions available, one with a native 16-bit word size and one with a native 32-bit word size. A development kit is available which includes a C/C++ compiler, debugger and assembler, as well as other development utilities. It also supports operating systems such as µC/OS, Linux and Nucleus [6].

Xtensa is Tensilica's best-known processor IP architecture. Its key differentiating feature is a user-customizable instruction set. Using the supplied customization tools, customers can extend the Xtensa base instruction set by adding new user-defined instructions. After the final processor configuration is made and submitted, Tensilica's processor generator service builds the configured Xtensa IP core, processor design kit, and software development kit. The software kit is built on the Eclipse-based integrated development environment and uses a GNU-derived tool chain. An instruction set simulator enables customers to begin application development before actual hardware is available.

In response to the development of soft-core processors, Xilinx introduced the MicroBlaze processor. The MicroBlaze processor is a 32-bit RISC processor that supports both 16-bit and 32-bit buses and supports Block RAM and/or external memory. All peripherals, including the memory controller, the UART, and the interrupt controller, run off a standard OPB bus. Additional processor performance can be achieved by exploiting Virtex-II architecture features such as the embedded multiplier and ALU. Xilinx also provides GNU-based tools, including a C compiler, a debugger, and an assembler, as well as all of the standard libraries.

The fully synthesizable code of the OpenRISC 1000 processor is freely available and was designed with an emphasis on scalability and platform independence. The architecture consists of a 32-bit RISC integer unit with a configurable number of general-purpose registers, configurable cache and TLB sizes, dynamic power management support, and space for user-provided instructions. A complete GNU-based development environment is available and includes a C compiler, assembler, linker, debugger, and simulator.

The LEON3 processor core is a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. The core is highly configurable and particularly suitable for system-on-a-chip designs. The core is interfaced using the AMBA 2.0 AHB bus and supports the IP core plug & play method provided in the Gaisler Research IP library. The processor can be efficiently implemented on an FPGA [4].

3.1 CO-DESIGN METHODOLOGY

The activity of co-design is interchangeable with that of personalization when a service user and a provider set about creating the desired service for an individual. If an overarching objective is to make services more personalized, then a huge amount can be learnt about the design of the activity of personalization by working with a small number of customers. Through such work, service designers gain insights into how people see and communicate their needs, how they perceive the role of the provider, and the requirements of the support they need. The practical tools for personalization can be designed with those who will need to use them. Co-design is a very public and visible process. As uncomfortable as it can feel, transparency through greater collaboration is key both to managing expectations early and to getting honest and accurate (and therefore useful) outputs. The scale of this openness needs to be managed carefully, as scrutiny by too many can mean that political assuagement or appeasement overrides the careful crafting of a solution that emerges through the process of designing. Co-design has challenged many professional designers because the idea of allowing anybody to have a go is seen as a threat to quality as well as a denial of skill and talent. One view is that there is some truth in this, but it is often the case that those expressing such a concern are basing their view on a conventional understanding of what design does, and on an unclear picture of an emerging role for design and designers. The lone designer can solve simple problems and give form to solutions, but complex challenges demand collaborative platforms and projects. It is also worth remembering that tangible and elegant solutions still need to be designed, and this is the unique contribution of trained designers. One belief is that professional designers are valuable in new ways, and not to the detriment of what designers have always done well. However, the activity of designing responses to complex challenges is too important to leave only to designers.




Figure 2 - Work Flow


3.3 Algorithm

1. FFT algorithms.
2. Create the embedded system in the Altera Quartus II 10.1 System-on-a-Programmable-Chip (SOPC) Builder.
3. Implement the FFT algorithms in C++ in the NIOS II IDE.
4. Create the serial connection between the hardware and the software.
5. Upload the system without the custom instruction to the FPGA (EP2C35F672C6).
6. Upload the system with the custom instruction to the FPGA (EP2C35F672C6).
7. Compare the results of both systems.

3.4 System using NIOS II processor

In this project the embedded system is generated using the NIOS II soft-core processor. The generation of the system using the NIOS II processor is described in this chapter.

3.4.1 Introduction

The NIOS II processor is a general-purpose RISC processor core with the following features:

- Full 32-bit instruction set, data path, and address space
- 32 general-purpose registers
- Optional shadow register sets
- 32 interrupt sources
- External interrupt controller interface for more interrupt sources
- Single-instruction 32 × 32 multiply and divide producing a 32-bit result
- Dedicated instructions for computing 64-bit and 128-bit products of multiplication
- Floating-point instructions for single-precision floating-point operations
- Single-instruction barrel shifter
- Access to a variety of on-chip peripherals, and interfaces to off-chip memories and peripherals
- Hardware-assisted debug module enabling processor start, stop, step, and trace under control of the NIOS II software development tools
- Optional memory management unit (MMU) to support operating systems that require MMUs
- Optional memory protection unit (MPU)
- Software development environment based on the GNU C/C++ tool chain and the NIOS II Software Build Tools (SBT) for Eclipse
- Integration with Altera's SignalTap II Embedded Logic Analyzer, enabling real-time analysis of instructions and data along with other signals in the FPGA design
- Instruction set architecture (ISA) compatible across all NIOS II processor systems
- Performance up to 250 DMIPS

A Nios II processor system is equivalent to a microcontroller or computer on a chip: it includes a processor and a combination of peripherals and memory on a single chip. A Nios II processor system consists of a Nios II processor core, a set of on-chip peripherals, on-chip memory, and interfaces to off-chip memory, all implemented on a single Altera device. Like a microcontroller family, all Nios II processor systems use a consistent instruction set and programming model. The Nios II processor is a configurable soft IP core, as opposed to a fixed, off-the-shelf microcontroller. Features can be added or removed on a system-by-system basis to meet performance or price goals. "Soft" means that the processor core is not fixed in silicon and can be targeted to any Altera FPGA family. Altera provides ready-made Nios II system designs that can be used as is. If these designs meet the system requirements, there is no need to configure the design further. In addition, the Nios II instruction set simulator can be used to begin writing and debugging Nios II applications before the final hardware configuration is determined. Nios II is offered in three configurations: Nios II/f (fast), Nios II/s (standard), and Nios II/e (economy).
Nios II/f


The Nios II/f core is designed for maximum performance at the expense of core size. Features of Nios II/f include:

- Separate instruction and data caches (512 B to 64 KB)
- Optional MMU or MPU
- Access to up to 2 GB of external address space
- Optional tightly coupled memory for instructions and data
- Six-stage pipeline to achieve maximum DMIPS/MHz
- Single-cycle hardware multiply and barrel shifter
- Optional hardware divide option
- Dynamic branch prediction
- Up to 256 custom instructions and unlimited hardware accelerators
- JTAG debug module
- Optional JTAG debug module enhancements, including hardware breakpoints, data triggers, and real-time trace

Nios II/s

The Nios II/s core is designed to maintain a balance between performance and cost. Features of Nios II/s include:

- Instruction cache
- Up to 2 GB of external address space
- Optional tightly coupled memory for instructions
- Five-stage pipeline
- Static branch prediction
- Hardware multiply, divide, and shift options
- Up to 256 custom instructions
- JTAG debug module
- Optional JTAG debug module enhancements, including hardware breakpoints, data triggers, and real-time trace

Nios II/e

The Nios II/e core is designed for the smallest possible logic utilization of FPGAs. This is especially efficient for low-cost Cyclone II FPGA applications. Features of Nios II/e include:

- Up to 2 GB of external address space
- JTAG debug module
- Complete systems in fewer than 700 LEs
- Optional debug enhancements
- Up to 256 custom instructions

3.4.2 NIOS II Architecture

The Nios II architecture is a RISC soft core architecture which is implemented entirely in the programmable logic and memory blocks of Altera FPGAs. The soft-core nature of the Nios II processor lets the system designer specify and generate a custom Nios II core, tailored for his or her specific application requirements. System designers can extend the Nios II's basic functionality by adding a predefined memory management unit, or defining custom instructions and custom peripherals. The NIOS II architecture describes an instruction set architecture (ISA). The ISA in turn necessitates a set of functional units that implement the instructions. A NIOS II processor core is a hardware design that implements the Nios II instruction set and supports the functional units described in this document. The processor core does not include peripherals or the connection logic to the outside world. It includes only the circuits required to implement the NIOS II architecture.


Figure 3 - NIOS II Core Block Diagram

The NIOS II architecture defines the following functional units:

- Register file
- Arithmetic logic unit (ALU)
- Interface to custom instruction logic
- Exception controller
- Internal or external interrupt controller
- Instruction bus
- Data bus
- Memory management unit (MMU)
- Memory protection unit (MPU)
- Instruction and data cache memories
- Tightly-coupled memory interfaces for instructions and data
- JTAG debug module

3.4.3 Processor Implementation

The functional units of the Nios II architecture form the foundation for the Nios II instruction set. However, this does not mean that every unit is implemented in hardware. The Nios II architecture describes an instruction set, not a particular hardware implementation. A functional unit can be implemented in hardware, emulated in software, or omitted entirely. A Nios II implementation is a set of design choices embodied by a particular Nios II processor core. Each implementation achieves specific objectives, such as smaller core size or higher performance. This flexibility allows the Nios II architecture to adapt to different target applications.

Implementation variables generally fit one of three trade-off patterns: more or less of a feature; inclusion or exclusion of a feature; hardware implementation or software emulation of a feature. An example of each trade-off follows:

- More or less of a feature: for example, to fine-tune performance, you can increase or decrease the amount of instruction cache memory. A larger cache increases the execution speed of large programs, while a smaller cache conserves on-chip memory resources.
- Inclusion or exclusion of a feature: for example, to reduce cost, you can choose to omit the JTAG debug module. This decision conserves on-chip logic and memory resources, but it eliminates the ability to use a software debugger to debug applications.
- Hardware implementation or software emulation: for example, in control applications that rarely perform complex arithmetic, you can choose for the division instruction to be emulated in software. Removing the divide hardware conserves on-chip resources but increases the execution time of division operations.
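To give a feel for why software emulation of division costs execution time, the following is a minimal sketch in C of a shift-and-subtract (restoring) unsigned divide. It is an illustration only and is not the routine the Nios II toolchain actually uses; the names `udiv32_soft` and `udiv_result_t` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Quotient and remainder of an unsigned 32-bit division. */
typedef struct {
    uint32_t quotient;
    uint32_t remainder;
} udiv_result_t;

/* Restoring (shift-and-subtract) division.
 * One loop iteration is spent per quotient bit, so a 32-bit divide
 * costs 32 iterations of shift, compare and subtract in software. */
udiv_result_t udiv32_soft(uint32_t dividend, uint32_t divisor)
{
    udiv_result_t result = {0u, 0u};
    uint64_t rem = 0u;  /* 64 bits so the left shift below cannot overflow */

    assert(divisor != 0u);  /* division by zero is not defined */

    for (int bit = 31; bit >= 0; --bit) {
        /* Bring the next dividend bit into the partial remainder. */
        rem = (rem << 1) | ((dividend >> bit) & 1u);
        if (rem >= divisor) {
            rem -= divisor;
            result.quotient |= (uint32_t)1u << bit;
        }
    }
    result.remainder = (uint32_t)rem;
    return result;
}
```

A hardware divider performs the same work inside one instruction, which is the resource-versus-time trade-off described above.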

3.4.4 Register File

The NIOS II architecture supports a flat register file, consisting of thirty-two 32-bit general-purpose integer registers and up to thirty-two 32-bit control registers. The architecture supports supervisor and user modes that allow system code to protect the control registers from errant applications. The NIOS II processor can optionally have one or more shadow register sets. A shadow register set is a complete set of NIOS II general-purpose registers. When shadow register sets are implemented, the CRS field of the status register indicates which register set is currently in use. An instruction access to a general-purpose register uses whichever register set is active. A typical use of shadow register sets is to accelerate context switching. When shadow register sets are implemented, the NIOS II processor has two special instructions, rdprs and wrprs, for moving data between register sets. Shadow register sets are typically manipulated by an operating system kernel and are transparent to application code. A Nios II processor can have up to 63 shadow register sets.

3.4.5 Arithmetic Logic Unit

The Nios II ALU operates on data stored in general-purpose registers. ALU operations take one or two inputs from registers and store a result back in a register. The supported operations are listed in Table 2.

In the hardware implementation, a custom instruction maps software operations such as addition, subtraction, multiplication and division directly onto logic attached to the ALU of the NIOS II processor. Including such hardware reduces the number of clock cycles, and therefore the time, required to execute the algorithm.

Category | Details
Arithmetic | The ALU supports addition, subtraction, multiplication and division on signed and unsigned operands.
Relational | The ALU supports the equal, not-equal, greater-than-or-equal, and less-than relational operations (==, !=, >=, <) on signed and unsigned operands.
Logical | The ALU supports AND, OR, NOR and XOR logical operations.
Shift and Rotate | The ALU supports shift and rotate operations, and can shift or rotate data by 0 to 31 bit positions per instruction. The ALU supports arithmetic shift right and logical shift right/left. The ALU supports rotate left/right.

Table 2 - Operations Supported by the Nios II ALU
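C has no rotate operator, so the rotate entries in the table are usually written as a pair of shifts. The sketch below shows one common way to express a 32-bit rotate by 0 to 31 positions; the function names `rol32` and `ror32` are hypothetical, and whether a given compiler turns them into a single rotate instruction is an assumption to be checked in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate x left by n bit positions, 0 <= n <= 31. */
uint32_t rol32(uint32_t x, unsigned n)
{
    assert(n < 32u);
    /* "& 31u" keeps the right-shift count in range when n == 0. */
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Rotate x right by n bit positions, 0 <= n <= 31. */
uint32_t ror32(uint32_t x, unsigned n)
{
    assert(n < 32u);
    return (x >> n) | (x << ((32u - n) & 31u));
}
```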

3.4.6 Exception and Interrupt Controllers


The NIOS II processor includes hardware for handling exceptions, including hardware interrupts. It also includes an optional external interrupt controller (EIC) interface. The EIC interface enables you to speed up interrupt handling in a complex system by adding a custom interrupt controller.

3.5 Memory and I/O Organization

This section explains hardware implementation details of the NIOS II memory and I/O organization. The discussion covers both general concepts true of all NIOS II processor systems and features that might change from system to system. The flexible nature of the NIOS II memory and I/O organization is the most notable difference between NIOS II processor systems and traditional microcontrollers. Because Nios II processor systems are configurable, the memories and peripherals, and hence the memory and I/O organization, vary from system to system. A Nios II core uses one or more of the following to provide memory and I/O access:

- Instruction master port: an Avalon Memory-Mapped (Avalon-MM) master port that connects to instruction memory via the system interconnect fabric
- Instruction cache: fast cache memory internal to the Nios II core
- Data master port: an Avalon-MM master port that connects to data memory and peripherals via the system interconnect fabric
- Data cache: fast cache memory internal to the Nios II core
- Tightly-coupled instruction or data memory port: an interface to fast on-chip memory outside the Nios II core

3.5.1 Instruction and Data Buses

The NIOS II architecture supports separate instruction and data buses, classifying it as a Harvard architecture. Both the instruction and data buses are implemented as Avalon-MM master ports that adhere to the Avalon-MM interface specification. The data master port connects to both memory and peripheral components, while the instruction master port connects only to memory components.

3.5.2 Memory and Peripheral Access

The NIOS II architecture provides memory-mapped I/O access. Both data memory and peripherals are mapped into the address space of the data master port. The Nios II architecture uses little-endian byte ordering. Words and half words are stored in memory with the more-significant bytes at higher addresses. The Nios II architecture does not specify anything about the existence of memory and peripherals; the quantity, type, and connection of memory and peripherals are system-dependent. Typically, Nios II processor systems contain a mix of fast on-chip memory and slower off-chip memory. Peripherals typically reside on-chip, although interfaces to off-chip peripherals also exist.
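As a small illustration of the little-endian ordering described above, the following C sketch stores a 32-bit word into a byte buffer with the least-significant byte at the lowest address and reads it back. It is written with explicit shifts so that it behaves the same on any host; the helper names `store_le32` and `load_le32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit word in little-endian order:
 * the more-significant bytes go to the higher addresses. */
void store_le32(uint8_t *dst, uint32_t value)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(value & 0xFFu);          /* least-significant byte */
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);  /* most-significant byte */
}

/* Reassemble a 32-bit word from four little-endian bytes. */
uint32_t load_le32(const uint8_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

For the word 0x12345678, the byte at the lowest address is 0x78 and the byte at the highest address is 0x12.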

3.5.3 Instruction Master Port

The NIOS II instruction bus is implemented as a 32-bit Avalon-MM master port. The instruction master port performs a single function: it fetches instructions to be executed by the processor. The instruction master port does not perform any write operations. The instruction master port is a pipelined Avalon-MM master port. Support for pipelined Avalon-MM transfers minimizes the impact of synchronous memory with pipeline latency and increases the overall fMAX of the system. The Nios II processor can prefetch sequential instructions and perform branch prediction to keep the instruction pipe as active as possible. The instruction master port always retrieves 32 bits of data. The instruction master port relies on dynamic bus-sizing logic contained in the system interconnect fabric. By virtue of dynamic bus sizing, every instruction fetch returns a full instruction word, regardless of the width of the target memory.
3.5.4 Data Master Port

The NIOS II data bus is implemented as a 32-bit Avalon-MM master port. The data master port performs two functions:

- Read data from memory or a peripheral when the processor executes a load instruction.
- Write data to memory or a peripheral when the processor executes a store instruction.

3.5.5 Cache Memory


The NIOS II architecture supports cache memories on both the instruction master port (instruction cache) and the data master port (data cache). Cache memory resides on-chip as an integral part of the Nios II processor core. The cache memories can improve the average memory access time for Nios II processor systems that use slow off-chip memory such as SDRAM for program and data storage. The instruction and data caches are enabled perpetually at run-time, but methods are provided for software to bypass the data cache so that peripheral accesses do not return cached data. Cache management and cache coherency are handled by software. The Nios II instruction set provides instructions for cache management.

3.5.6 Tightly-Coupled Memory


Tightly-coupled memory provides low-latency memory access for performance-critical applications. Physically, a tightly-coupled memory port is a separate master port on the NIOS II processor core, similar to the instruction or data master port. Compared to cache memory, tightly-coupled memory provides the following benefits:

- Performance similar to cache memory.
- Software can guarantee that performance-critical code or data is located in tightly-coupled memory.
- No real-time caching overhead, such as loading, invalidating, or flushing memory.

3.5.7 Address Map

The address map for memories and peripherals in a Nios II processor system is design dependent. You specify the address map in Qsys and SOPC Builder. There are three addresses that are part of the processor and deserve special mention:

- Reset address
- Exception address
- Break handler address

Programmers access memories and peripherals by using macros and drivers. Therefore, the flexible address map does not affect application developers.

3.5.8 Memory Management Unit

The optional NIOS II MMU provides the following features and functionality:

- Virtual to physical address mapping
- Memory protection
- 32-bit virtual and physical addresses, mapping a 4-GB virtual address space into as much as 4 GB of physical memory
- 4-KB page and frame size
- Low 512 MB of physical address space available for direct access
- Hardware translation lookaside buffers (TLBs), accelerating address translation
- Separate TLBs for instruction and data accesses
- Read, write, and execute permissions controlled per page
- Default caching behavior controlled per page
- TLBs acting as n-way set-associative caches for software page tables
- TLB sizes and associativities configurable in the Nios II Processor parameter editor
- Format of page tables (or equivalent data structures) determined by system software
- Replacement policy for TLB entries determined by system software
- Write policy for TLB entries determined by system software

3.5.9 Memory Protection Unit

The optional NIOS II MPU provides the following features and functionality:

- Memory protection
- Up to 32 instruction regions and 32 data regions
- Variable instruction and data region sizes
- Read and write access permissions for data regions
- Execute access permissions for instruction regions
- Overlapping regions

3.6 JTAG Debug Module

The Nios II architecture supports a JTAG debug module that provides on-chip emulation features to control the processor remotely from a host PC. PC-based software debugging tools communicate with the JTAG debug module and provide facilities such as the following:

- Downloading programs to memory
- Starting and stopping execution
- Analyzing registers and memory
- Collecting real-time execution trace data

3.7 Embedded system generation using NIOS II processor

The Nios II development flow consists of three types of development: hardware design steps, software design steps, and system design steps. System design steps involve both hardware and software, and might require input from both sides.

Figure 4 - Flow chart of System generation using Nios II processor

3.8 Defining and Generating the System in SOPC Builder


After the system hardware requirements have been analyzed, SOPC Builder is used to specify the Nios II processor core(s), memory, and other components the system requires. SOPC Builder automatically generates the interconnect logic that integrates the components in the hardware system. Components can be selected from a list of standard processor cores and components provided with the Nios II EDS. Custom hardware can also be added to accelerate system performance: custom instruction logic can be added to the Nios II core to accelerate CPU performance, or a custom component can be added to offload tasks from the CPU. The following step covers adding standard processor and component cores. The primary outputs of SOPC Builder are the following file types:

- SOPC Builder Design File (.sopc): contains the hardware contents of the SOPC Builder system.
- SOPC Information File (.sopcinfo): contains a human-readable description of the contents of the .sopc file. The Nios II EDS uses the .sopcinfo file to compile software for the target hardware.
- Hardware description language (HDL) files: the hardware design files that describe the SOPC Builder system. The Quartus II software uses the HDL files to compile the overall FPGA design into an SRAM Object File (.sof).

3.8.1 Integrating the SOPC Builder System into the Quartus II Project

After the Nios II system has been generated using SOPC Builder, it is integrated into the Quartus II project. The Quartus II software is used to perform all tasks required to create the final CYCLONE II FPGA hardware design: assign pin locations for I/O signals, specify timing requirements, and apply other design constraints. Finally, the Quartus II project is compiled to produce a .sof file to configure the CYCLONE II FPGA. The .sof is downloaded to the CYCLONE II FPGA on the target board (EP2C35F672C6) using an Altera download cable, such as the USB-Blaster. After configuration, the FPGA behaves as specified by the hardware design, which in this case is a Nios II processor system.

The Nios II processor and the interfaces needed to connect to other chips on the DE2 board are implemented in the Cyclone II FPGA chip. These components are interconnected by means of the interconnection network called the Avalon Switch Fabric. Memory blocks in the Cyclone II device can be used to provide an on-chip memory for the Nios II processor. They can be connected to the processor either directly or through the Avalon network. The SRAM and SDRAM memory chips on the DE2 board are accessed through the appropriate interfaces. Input/output interfaces are instantiated to provide connection to the I/O devices used in the system. A special JTAG UART interface is used to connect to the circuitry that provides a Universal Serial Bus (USB) link to the host computer to which the DE2 board is connected. This circuitry and the associated software is called the USB-Blaster. Another module, called the JTAG Debug module, is provided to allow the host computer to control the Nios II processor. It makes it possible to perform operations such as downloading programs into memory, starting and stopping execution, setting program breakpoints, and collecting real-time execution trace data. Since all parts of the Nios II system implemented on the FPGA chip are defined in a hardware description language, a knowledgeable user could write such code to implement any part of the system.

4.1 Fast Fourier Transform

In this chapter, several methods for computing the Discrete Fourier Transform (DFT) efficiently are presented. In view of the importance of the DFT in various digital signal processing applications, such as linear filtering, correlation analysis, and spectrum analysis, its efficient computation is a topic that has received considerable attention from many mathematicians, engineers, and applied scientists. Basically, the computational problem for the DFT is to compute the sequence {X(k)} of N complex-valued numbers given another sequence of data {x(n)} of length N, according to the following formula:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad 0 \le k \le N-1, \qquad W_N = e^{-j 2\pi / N}$$

In general, the data sequence x(n) is also assumed to be complex valued. Similarly, the Inverse Discrete Fourier Transform (IDFT) becomes:

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, W_N^{-kn}, \qquad 0 \le n \le N-1$$

We can observe that for each value of k, direct computation of X(k) involves N complex multiplications (4N real multiplications) and N-1 complex additions (4N-2 real additions). Consequently, computing all N values of the DFT requires $N^2$ complex multiplications and $N^2 - N$ complex additions. Direct computation of the DFT is inefficient primarily because it does not exploit the symmetry and periodicity properties of the phase factor $W_N$. In particular, these two properties are:

Symmetry property: $W_N^{k+N/2} = -W_N^{k}$
Periodicity property: $W_N^{k+N} = W_N^{k}$

The computationally efficient algorithms described in this section, known collectively as fast Fourier transform (FFT) algorithms, exploit these two basic properties of the phase factor.

4.2 FFT Algorithms

The FFT exists in two functionally equivalent forms, known as decimation in time (DIT) and decimation in frequency (DIF). Both decompose the DFT into simple computational units and reduce the computational complexity of the DFT from O(N^2) to O(N log N). The various algorithms that result are collectively known as radix-r fast Fourier transforms. The most popular choices are r = 2 and r = 4, and a commonly used refinement of the FFT is the use of a mixed radix.
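To make the reduction concrete, the sketch below counts complex multiplications for a direct DFT and for a radix-2 FFT, using the usual textbook figures of N^2 and (N/2) log2 N. The function names are hypothetical, and the counts are operation counts only, not measured execution times.

```c
#include <assert.h>
#include <stdint.h>

/* Number of stages v = log2(N) for a power-of-two transform length N. */
unsigned log2_pow2(uint32_t n)
{
    unsigned v = 0u;
    assert(n != 0u && (n & (n - 1u)) == 0u);  /* N must be a power of two */
    while (n > 1u) {
        n >>= 1;
        ++v;
    }
    return v;
}

/* Complex multiplications for direct evaluation of the DFT: N^2. */
uint64_t dft_complex_mults(uint32_t n)
{
    return (uint64_t)n * n;
}

/* Complex multiplications for a radix-2 FFT: (N/2) * log2(N). */
uint64_t fft_radix2_complex_mults(uint32_t n)
{
    return (uint64_t)(n / 2u) * log2_pow2(n);
}
```

For N = 1024 this gives 1,048,576 multiplications for the direct DFT against 5,120 for the radix-2 FFT.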

4.3 Radix-2 FFT Algorithms

The radix-2 algorithm takes the DFT and applies a common-factor reduction, equating the sum of two N/2-point sequences to the N-point sequence of the original DFT. This results in the radix-2 FFT formula below:

$$X(k) = \sum_{m=0}^{N/2-1} x(2m)\, W_{N/2}^{mk} + W_N^{k} \sum_{m=0}^{N/2-1} x(2m+1)\, W_{N/2}^{mk}$$

The resulting processing follows the signal flow graph shown in Figure 5.


Figure 5-Radix-2 for an N point FFT

There are two forms of the radix-2 algorithm:

- Decimation-in-time FFT algorithm (DIT)
- Decimation-in-frequency FFT algorithm (DIF)

4.3.1 Decimation-In-Time FFT

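Now let us consider the computation of the $N = 2^v$ point DFT by the divide-and-conquer approach. We split the N-point data sequence into two N/2-point data sequences f1(n) and f2(n), corresponding to the even-numbered and odd-numbered samples of x(n), respectively:

$$f_1(n) = x(2n), \qquad f_2(n) = x(2n+1), \qquad n = 0, 1, \ldots, \frac{N}{2}-1$$

Thus f1(n) and f2(n) are obtained by decimating x(n) by a factor of 2, and hence the resulting FFT algorithm is called the decimation-in-time FFT algorithm. With F1(k) and F2(k) denoting the N/2-point DFTs of f1(n) and f2(n), the N-point DFT can be expressed as:

$$X(k) = F_1(k) + W_N^{k} F_2(k), \qquad k = 0, 1, \ldots, \frac{N}{2}-1$$
$$X\!\left(k + \frac{N}{2}\right) = F_1(k) - W_N^{k} F_2(k), \qquad k = 0, 1, \ldots, \frac{N}{2}-1$$

We observe that the direct computation of F1(k) requires $(N/2)^2$ complex multiplications. The same applies to the computation of F2(k). Furthermore, N/2 additional complex multiplications are required to compute $W_N^{k} F_2(k)$. Hence the computation of X(k) requires $2(N/2)^2 + N/2 = N^2/2 + N/2$ complex multiplications. This first step reduces the number of multiplications from $N^2$ to $N^2/2 + N/2$, which is about a factor of 2 for large N.

By computing N/4-point DFTs, we obtain the N/2-point DFTs F1(k) and F2(k) from the relations

$$F_i(k) = V_{i1}(k) + W_{N/2}^{k} V_{i2}(k), \qquad F_i\!\left(k + \frac{N}{4}\right) = V_{i1}(k) - W_{N/2}^{k} V_{i2}(k), \qquad k = 0, 1, \ldots, \frac{N}{4}-1, \quad i = 1, 2$$

where $V_{i1}(k)$ and $V_{i2}(k)$ are the N/4-point DFTs of the even-numbered and odd-numbered samples of $f_i(n)$.

The decimation of the data sequence can be repeated again and again until the resulting sequences are reduced to one-point sequences. For $N = 2^v$, this decimation can be performed $v = \log_2 N$ times. Thus the total number of complex multiplications is reduced to $(N/2)\log_2 N$. The number of complex additions is $N \log_2 N$. For illustration, Figure 6 depicts the computation of an N = 16 point DFT. We observe that the computation is performed in four stages, beginning with the computation of eight 2-point DFTs, then four 4-point DFTs, then two 8-point DFTs, and finally one 16-point DFT.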


Figure 6-Butterfly structure
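The decimation-in-time recursion can be sketched in C as follows. This is a minimal illustration, not the code used in the thesis: the type `cplx_t`, the function `fft_dit` and its parameters are hypothetical names, single precision is assumed, and the twiddle factors W_N^k are supplied by the caller in a table so that the sketch needs no math library.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx_t;

cplx_t c_add(cplx_t a, cplx_t b) { cplx_t r = { a.re + b.re, a.im + b.im }; return r; }
cplx_t c_sub(cplx_t a, cplx_t b) { cplx_t r = { a.re - b.re, a.im - b.im }; return r; }
cplx_t c_mul(cplx_t a, cplx_t b)
{
    cplx_t r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* Recursive radix-2 decimation-in-time FFT.
 *   in, stride : input samples, read as in[0], in[stride], in[2*stride], ...
 *   out        : receives the n output values X(0)..X(n-1)
 *   n          : transform length at this level, a power of two
 *   w, wstride : twiddle table of the top-level length N, w[k] = W_N^k for
 *                k = 0..N/2-1; w[k * wstride] is W_n^k at this level
 * Top-level call: fft_dit(x, 1, X, N, w, 1). */
void fft_dit(const cplx_t *in, size_t stride, cplx_t *out, size_t n,
             const cplx_t *w, size_t wstride)
{
    assert(n >= 1 && (n & (n - 1)) == 0);
    if (n == 1) {               /* a one-point DFT is the sample itself */
        out[0] = in[0];
        return;
    }
    size_t half = n / 2;
    /* F1: DFT of the even-numbered samples, F2: DFT of the odd-numbered ones. */
    fft_dit(in,          2 * stride, out,        half, w, 2 * wstride);
    fft_dit(in + stride, 2 * stride, out + half, half, w, 2 * wstride);
    /* Butterflies: X(k) = F1(k) + W^k F2(k), X(k + n/2) = F1(k) - W^k F2(k). */
    for (size_t k = 0; k < half; ++k) {
        cplx_t f1 = out[k];
        cplx_t t  = c_mul(w[k * wstride], out[k + half]);
        out[k]        = c_add(f1, t);
        out[k + half] = c_sub(f1, t);
    }
}
```

For N = 4 the table holds W_4^0 = 1 and W_4^1 = -j, and the unit sample delayed by one position, x = (0, 1, 0, 0), transforms to X = (1, -j, -1, j).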

4.3.2 Decimation in frequency

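Another important radix-2 FFT algorithm, called the decimation-in-frequency algorithm, is also obtained by using the divide-and-conquer approach. To derive the algorithm, we begin by splitting the DFT formula into two summations, one of which involves the sum over the first N/2 data points and the other the sum over the last N/2 data points. Thus we obtain

$$X(k) = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn} + \sum_{n=N/2}^{N-1} x(n)\, W_N^{kn} = \sum_{n=0}^{N/2-1} x(n)\, W_N^{kn} + W_N^{kN/2} \sum_{n=0}^{N/2-1} x\!\left(n + \frac{N}{2}\right) W_N^{kn}$$

Since $W_N^{kN/2} = (-1)^k$, this can be written as

$$X(k) = \sum_{n=0}^{N/2-1} \left[ x(n) + (-1)^k\, x\!\left(n + \frac{N}{2}\right) \right] W_N^{kn}$$

Now, let us split (decimate) X(k) into the even-numbered and odd-numbered samples. Thus we obtain

$$X(2k) = \sum_{n=0}^{N/2-1} \left[ x(n) + x\!\left(n + \frac{N}{2}\right) \right] W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, \frac{N}{2}-1$$
$$X(2k+1) = \sum_{n=0}^{N/2-1} \left\{ \left[ x(n) - x\!\left(n + \frac{N}{2}\right) \right] W_N^{n} \right\} W_{N/2}^{kn}, \qquad k = 0, 1, \ldots, \frac{N}{2}-1$$

The computation procedure above can be repeated through decimation of the N/2-point DFTs X(2k) and X(2k+1). The entire process involves $v = \log_2 N$ stages of decimation, where each stage involves N/2 butterflies. Thus the total number of complex multiplications is reduced to $(N/2)\log_2 N$. The number of complex additions is $N \log_2 N$. For illustration, Figure 7 depicts the computation of an N = 16 point DFT. We observe that the computation is performed in four stages: the first stage reduces the 16-point DFT to two 8-point DFTs, the second to four 4-point DFTs, and the third to eight 2-point DFTs, which are computed in the final stage. The combination for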


Figure 7 - Decimation in frequency
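One stage of the decimation-in-frequency decomposition can be sketched in C as shown below. The names `cplx_t`, `dif_butterfly` and `dif_first_stage` are hypothetical, single precision is assumed, and the twiddle factors W_N^n are passed in by the caller. After the stage, the first half of the buffer is the input to the N/2-point DFT that yields X(2k), and the second half is the input to the one that yields X(2k+1).

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx_t;

/* One DIF butterfly: a <- a + b, b <- (a - b) * w. */
void dif_butterfly(cplx_t *a, cplx_t *b, cplx_t w)
{
    cplx_t sum  = { a->re + b->re, a->im + b->im };
    cplx_t diff = { a->re - b->re, a->im - b->im };
    *a = sum;
    b->re = diff.re * w.re - diff.im * w.im;
    b->im = diff.re * w.im + diff.im * w.re;
}

/* First DIF stage of an n-point transform, done in place.
 * Pairs x(i) with x(i + n/2) and uses w[i] = W_n^i, i = 0..n/2-1. */
void dif_first_stage(cplx_t *x, size_t n, const cplx_t *w)
{
    assert(n >= 2 && (n & (n - 1)) == 0);  /* n must be a power of two */
    for (size_t i = 0; i < n / 2; ++i) {
        dif_butterfly(&x[i], &x[i + n / 2], w[i]);
    }
}
```

For x = (1, 2, 3, 4) and N = 4, the stage produces (4, 6) in the first half and (-2, 2j) in the second half; 2-point DFTs of those halves then give X(0) = 10, X(2) = -2, X(1) = -2 + 2j and X(3) = -2 - 2j.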

4.4 Algorithms Implementation

In this project, we implemented both methods, decimation in time (DIT) and decimation in frequency (DIF), to determine which one is better in efficiency, speed, performance and delay. The following researchers are known to have applied the same methods:

Weidong Li, Jonas Carlsson, Jonas Claeson, and Lars Wanhammar (Electronics Systems, Department of Electrical Engineering, Linköping University) implemented a Fast Fourier Transform in a Globally Asynchronous Locally Synchronous (GALS) design based on the decimation-in-frequency radix-2 algorithm. Their simulations showed that DIF offers high performance and efficiency.

Mohd Nazrin (UTM 2004) applied the Fast Fourier Transform algorithm in FPGA technology. The design was based on the decimation-in-time radix-2 algorithm. From his simulation and results he concluded that DIT has many advantages, such as high efficiency, speed and performance, and low delay.

Both methods give the same results, so the comparison of interest is in performance, speed, hardware cost and efficiency. This thesis examines the advantages and disadvantages of both DIT and DIF.

4.4.1 FPGA Implementation of FFT Algorithm

In the hardware implementation, a custom instruction is included in the Nios II processor. When designing a system that includes a Nios II embedded processor, time-critical software algorithms can be accelerated by adding custom instructions to the Nios II processor instruction set. Custom instructions allow a complex sequence of standard instructions to be reduced to a single instruction implemented in hardware. This feature can be used for a variety of applications, for example to optimize software inner loops for digital signal processing (DSP), packet header processing, and computation-intensive applications. In SOPC Builder, the Nios II parameter editor provides a GUI for adding custom instructions to the Nios II processor.


Figure 8 - Custom Instruction Logic connects to ALU in SOPC builder system

In SOPC Builder, the custom instruction logic connects directly to the Nios II ALU, as shown in Figure 8.

4.4.3 Design steps for the hardware implementation

After the software implementation, the same procedure is followed with some changes to the system:

1. For the hardware implementation, the system is generated in SOPC Builder with the custom instruction included as hardware.
2. The algorithm is implemented on the FPGA using the NIOS II IDE.
3. The program calculates the processing time and throughput for each version, to demonstrate the improved efficiency of a custom instruction compared with the software implementation.
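The timing comparison in step 3 could be organized along the following lines. This is a sketch using only the standard C `clock()` function, with hypothetical names (`time_kernel`, `throughput`, `speedup`, `placeholder_kernel`); the thesis does not say which timer its program uses, and on the target a hardware timer driver may well be the better choice.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*kernel_fn)(void);

/* Stand-in for one FFT run; replace with the software or the
 * custom-instruction version of the algorithm. */
void placeholder_kernel(void)
{
}

/* CPU time in seconds spent on `runs` calls of `fn`. */
double time_kernel(kernel_fn fn, unsigned runs)
{
    assert(fn != NULL && runs > 0u);
    clock_t start = clock();
    for (unsigned i = 0; i < runs; ++i) {
        fn();
    }
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC;
}

/* Throughput in transforms per second. */
double throughput(unsigned runs, double seconds)
{
    assert(seconds > 0.0);
    return (double)runs / seconds;
}

/* Speed-up of the custom-instruction version over the software version. */
double speedup(double seconds_software, double seconds_custom)
{
    assert(seconds_custom > 0.0);
    return seconds_software / seconds_custom;
}
```

Running both versions for the same number of transforms and passing the two times to `speedup` gives a single figure for the comparison.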

4.5 HARDWARE IMPLEMENTATION

In the hardware implementation, a custom instruction is added to the NIOS II processor. The SOPC Builder GUI supports the inclusion of custom instructions.


Figure 9- Custom instruction block in SOPC GUI.

Figure 10- Inclusion of Custom Instruction in NIOS II Processor


Figure 10 and Figure 11 show the addition of the custom instruction to the NIOS II processor. The custom instruction is added as floating-point hardware in the processor while all other peripherals are kept the same, which increases the amount of hardware used. The custom instruction maps the memory location from the SRAM interface in SOPC Builder.

Figure 11-RTL view of Custom Instruction

4.5.1 Implementation of Custom Instruction Hardware and Software

Multicycle custom instructions complete in either a fixed or a variable number of clock cycles. For a custom instruction that completes in a fixed number of clock cycles, you specify the required number of clock cycles at system generation. For a custom instruction that requires a variable number of clock cycles, you instantiate the start and done ports. These ports participate in a handshaking scheme to determine when the custom instruction execution is complete. The execution of this instruction is shown in the following custom instruction hardware port timing diagram.


Figure 12-Multicycle Custom Instruction Timing Diagram

The processor asserts the active-high start port on the first clock cycle of the custom instruction execution. At this time, the dataa and datab ports have valid values, and they remain valid for the duration of the custom instruction execution. The start signal is asserted for a single clock cycle. For a fixed-length multicycle custom instruction, after the instruction starts, the processor waits the specified number of clock cycles and then reads the value on the result signal. For an n-cycle operation, the custom logic block must present valid data on the nth rising edge after the custom instruction begins execution. For a variable-length multicycle custom instruction, the processor waits until the active-high done signal is asserted. The processor reads the result port on the same clock edge on which done is asserted, so the custom logic block must present data on the result port in the same clock cycle in which it asserts the done signal. The Nios II system clock feeds the custom logic block's clk port, and the Nios II system's master reset feeds the active-high reset port. The reset port is asserted only when the whole Nios II system is reset. The custom logic block must treat the active-high clk_en port as a conventional clock qualifier signal, ignoring clk while clk_en is deasserted.

The Nios II custom instruction software interface is simple and abstracts the details of the custom instruction from the software developer. For each custom instruction, the Nios II Embedded Design Suite (EDS) generates a macro in the system header file, system.h. You can use the macro directly in your C or C++ application code, and you do not need to program assembly code to access custom instructions. Software can also invoke custom instructions in Nios II processor assembly language. In this work, the custom instruction is added as floating point hardware in the Nios II processor.
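To illustrate how application code might use such a macro, the following sketch wraps a floating point multiply behind a macro and uses it in a complex multiplication, the core operation of an FFT butterfly. The macro name `FP_MUL`, the opcode index `FP_MUL_N` and the function `cmul` are hypothetical; the real macro name and index come from the generated system.h. On a Nios II target the sketch assumes the GCC built-in `__builtin_custom_fnff`; on any other compiler it falls back to plain software arithmetic so that it can be tried on a host PC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode index; the real value is assigned at system
   generation and appears in system.h. */
#define FP_MUL_N 0

#if defined(__nios2__) || defined(__NIOS2__)
/* Nios II GCC built-in: returns float, takes two float operands. */
#define FP_MUL(a, b) __builtin_custom_fnff(FP_MUL_N, (a), (b))
#else
/* Host fallback (assumption): ordinary software multiplication. */
#define FP_MUL(a, b) ((float)(a) * (float)(b))
#endif

/* Complex multiply (ar + j*ai) * (br + j*bi), with every real
   multiplication routed through the FP_MUL macro. */
void cmul(float ar, float ai, float br, float bi, float *re, float *im)
{
    assert(re != NULL && im != NULL);
    *re = FP_MUL(ar, br) - FP_MUL(ai, bi);
    *im = FP_MUL(ar, bi) + FP_MUL(ai, br);
}
```

Keeping the hardware call behind a single macro means the same source can be built with or without the custom instruction, which is convenient when comparing the two configurations.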

4.6 Results of FFT algorithm using Custom Instruction

The system is generated in SOPC Builder. In this system, the custom instruction is added as floating point hardware.

Figure 13- System with Custom Instruction in SOPC Builder


Figure14- System Generation for Custom Instruction

Figure 13 and Figure 14 show the system contents and the generation of the system in SOPC Builder. After the system is generated in SOPC Builder, pin assignment and compilation are done in Quartus II.


Figure15 NIOS System with custom instruction Block Diagram File view

In Quartus II, the pin assignment is done by importing the file for the Cyclone II (EP2C35F672C6) FPGA. In figure 6.6, the address lines from the SRAM memory are assigned to the custom instruction.

Figure 16-RTL view of NIOS II System

After successful compilation of the system, hardware generation for the Cyclone II FPGA is done and a time-limited file is generated, as shown in Figure 17.

Figure 17- Hardware generation of NIOS II System in CYCLONE II

After hardware generation of the system on the Cyclone II FPGA, the algorithm is implemented in the Nios II IDE. The steps are the same as those stated in chapter 5, with the addition of the custom instruction files to the Nios II IDE project. After the hello_world.c file is generated, the custom instruction files floating_point.c, floating_point.h, floating_point_CI.c and floating_point_SW.c are added to the project. The project is then built with the Build Project command. During the build, the system header file system.h is generated; its macros connect the C code of the algorithm to the custom instruction hardware. After a successful build, the algorithm is run on the Cyclone II (EP2C35F672C6) FPGA with the Run As Nios II Hardware command. After this command executes, the result is shown in the Nios II console window.


Figure 18- FFT result without Custom Instruction

In Figure 18, the output of the FFT algorithm running on the FPGA is shown, together with the clock cycles and time required for its execution.


Figure 19 FFT result with Custom Instruction

In Figure 19, the output of the algorithm running on the FPGA is shown, together with the clock cycles and time required for its execution. Table no. 3 shows the clock cycles required for execution with and without the custom instruction.

Type                          Clock Cycles    Time Required
Without custom instruction    7212992         0.14426
With custom instruction       5937794         0.1186

Table no. 3 Clock Cycles for execution of Algorithm
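The relation between a cycle count and an execution time can be sketched in a few lines of C. The clock frequency is not stated in the text; 50 MHz is an assumption, chosen because it closely reproduces the tabulated times from the tabulated cycle counts. The function names `elapsed_cycles` and `cycles_to_seconds` and the constant `ASSUMED_CLOCK_HZ` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed system clock in Hz; not given in the text. */
#define ASSUMED_CLOCK_HZ 50000000.0

/* Difference between two readings of a free-running 32-bit counter.
   Unsigned subtraction stays correct across a single counter wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to seconds for a given clock frequency. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

For example, `cycles_to_seconds(7212992, ASSUMED_CLOCK_HZ)` gives the time corresponding to the larger cycle count in the table under the 50 MHz assumption.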



5.1 RESULT

The systems generated in SOPC Builder with and without the custom instruction are compiled in the Quartus II environment. The FFT algorithm is compared in terms of speed, i.e. the number of clock cycles required, and area, i.e. the number of logic elements (LEs) used on the FPGA.

Type                          Clock Cycles    Time Required
Without custom instruction
With custom instruction

Table 4 CPU Clock cycles and time required





The performance analysis shows that the clock cycles and time required to execute the FFT algorithm in software, i.e. on the system without the custom instruction, are greater than those required in hardware, i.e. on the system with the custom instruction. In the hardware implementation, the custom instruction maps software operations such as addition, subtraction, multiplication and division directly to the ALU of the Nios II processor, where they are executed in hardware. Including this hardware reduces the clock cycles and time required to execute the algorithm. The system generated using SOPC Builder is compiled in Quartus II. The hardware required by the system is measured by the LEs used in the Cyclone II (EP2C35F672C6) FPGA. The comparison of hardware usage is shown in Table 5.



                                       Total Count    Without Custom Instruction    With Custom Instruction
Total Logic Elements                   32216          (10%)                         (14%)
Total Combinational Functions          32216          (9%)                          (13%)
Dedicated Logic Registers              32216          (6%)                          (9%)
Total Pins                             475            (9%)                          (15%)
Total Memory Bits                      473840         (10%)                         (10%)
Embedded multipliers 9-bit Elements    70             (6%)                          (16%)

Table 5 comparison of compilation report

The above table compares the software and hardware systems, i.e. the systems without and with the custom instruction. It shows that including the custom instruction increases the hardware used, which in turn gives a better result in terms of the clock cycles and time required to execute the algorithm.

5.2 CONCLUSION

The FFT algorithm, considered as a case study, is implemented using the hardware/software co-design methodology. This methodology gives an optimized design of the algorithm. The algorithm is implemented on a Cyclone II FPGA built around the Nios II processor. The speed of the algorithm is increased by 18.46%. The conclusions of the entire experiment and the project are presented in this chapter. Recommendations for enhancing the precision and performance of the FFT embedded system are also included in this chapter. The recommendations cover speed and logic cell requirements.

5.3 Concluding Remarks

This thesis demonstrates the design of an embedded system and the FPGA implementation of the Fast Fourier Transform algorithm. The algorithm used is radix-2 decimation-in-time with 32-bit floating point arithmetic. The FFT embedded system includes the floating point custom instruction as an alternative for floating point arithmetic operations. The floating point custom instruction gives the system better performance and speed in floating point operations, as the results show. Moreover, this thesis introduces a technique for providing any kind of data to the FPGA development board from a host PC through a dedicated GUI built with DEV-C. This method makes the connection easier, less complicated and more practical. Finally, our experiments thus far have demonstrated promising results, indicating that floating point custom instructions can bring large improvements in performance, energy and timing while significantly reducing design turnaround time.
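For readers unfamiliar with the algorithm, a minimal sketch of an in-place radix-2 decimation-in-time FFT in single-precision floating point follows. It is not the thesis implementation: the type `cplx`, the function `fft_radix2_dit` and the choice to pass the twiddle factors in as a precomputed table `tw` (with tw[k] = cos(2*pi*k/n) - j*sin(2*pi*k/n) for k = 0 .. n/2 - 1) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cplx;

/* In-place iterative radix-2 decimation-in-time FFT.
   x  : n complex samples, overwritten with the transform
   tw : n/2 precomputed twiddle factors W_n^k (assumed layout)
   n  : transform length, must be a power of two */
void fft_radix2_dit(cplx *x, const cplx *tw, size_t n)
{
    assert(n >= 1 && (n & (n - 1)) == 0);

    /* Reorder the input into bit-reversed index order. */
    for (size_t i = 1, j = 0; i < n; i++) {
        size_t bit = n >> 1;
        for (; j & bit; bit >>= 1)
            j ^= bit;
        j ^= bit;
        if (i < j) {
            cplx t = x[i];
            x[i] = x[j];
            x[j] = t;
        }
    }

    /* log2(n) butterfly stages; len is the sub-transform size. */
    for (size_t len = 2; len <= n; len <<= 1) {
        size_t half = len / 2;
        size_t stride = n / len;   /* W_len^k equals W_n^(k*stride) */
        for (size_t i = 0; i < n; i += len) {
            for (size_t k = 0; k < half; k++) {
                cplx w = tw[k * stride];
                cplx u = x[i + k];
                cplx v = x[i + k + half];
                /* t = w * v: four multiplications and two additions */
                float tr = v.re * w.re - v.im * w.im;
                float ti = v.re * w.im + v.im * w.re;
                x[i + k].re = u.re + tr;
                x[i + k].im = u.im + ti;
                x[i + k + half].re = u.re - tr;
                x[i + k + half].im = u.im - ti;
            }
        }
    }
}
```

The floating point multiplications, additions and subtractions inside the innermost loop are the kind of operations that a floating point custom instruction would execute in hardware.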


FUTURE SCOPE

There are many ways in which the designed FFT embedded system can be improved. For example, higher N-point FFT computation can be introduced, and a decimation-in-frequency or higher-radix architecture can be used to make the design more robust.

Because the selected processor is a soft-core processor, the hardware can be changed to suit the application. The area required for the FFT algorithm can be reduced by designing optimization approaches for the individual blocks of the algorithm. Here the FFT algorithm is accelerated using a custom instruction with the Nios II processor; in future, acceleration could also be done using the C2H compiler tool. The project could be extended into the field of communication, where the ever-increasing demand for signal processing capability has raised the importance of the Fourier transform. The Discrete Fourier Transform (DFT) is important in many digital signal processing applications, such as linear filtering, correlation analysis and spectrum analysis. The system offers two further advantages: the Cyclone II board operates at up to 250 MHz when an external device is connected, and the FPGA (EP2C35F672C6) has about 35536 logic elements, which makes much more highly integrated designs possible.

