High Level Model of IEEE 802.15.3c Standard and Implementation of a Suitable FFT on ASIC
Degree project (examensarbete) carried out in Electronics Systems at the Institute of Technology, Linköping University, by Tanvir Ahmed. LiTH-ISY-EX--11/4462--SE
Linköping 2011
Supervisors:
Carl Ingemarsson
ISY, Linköping University
Mario Garrido
ISY, Linköping University
Examiner:
Oscar Gustafsson
ISY, Linköping University
Division, Department: Electronics Systems, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
Date: 2011-05-15
Abstract
A high level model of the HSIPHY mode of the IEEE 802.15.3c standard has been constructed in Matlab to optimize the wordlength needed to achieve a specific bit error rate (BER) depending on the application, and an FFT has then been implemented for different wordlengths depending on the application. The hardware cost and power are proportional to the wordlength. The main objective of this thesis has been to implement a low power, low area cost FFT for this standard. For that purpose, the whole system has been modeled in Matlab, and the signal to noise ratio (SNR) and wordlength of the system have been studied to achieve an acceptable BER. An FFT has then been implemented on a 65 nm ASIC for wordlengths of 8, 12 and 16 bits. For the implementation, a radix-8 algorithm with eight parallel samples has been adopted. This reduces the area and the power consumption significantly compared to other algorithms and architectures. Moreover, a simple control has been used for this implementation. Voltage scaling has been applied to reduce the power. The EDA synthesis results show that, for a 16-bit wordlength, the FFT has a throughput of 2.64 GS/s, occupies 1.439 mm² of chip area and consumes 61.51 mW of power.
Acknowledgments
I would like to thank Oscar Gustafsson for giving me the opportunity to do my thesis in Electronics Systems. This gave me access to the resources and all kinds of facilities needed for my thesis. It gave me a new way of thinking, and I believe that it will help me in my PhD in Japan. I am heartily thankful to my supervisors Carl Ingemarsson and Mario Garrido for guiding me throughout the thesis and for correcting my various documents with attention and care. Apart from that, they helped me a lot in solving the technical issues related to the thesis. Their guidance helped me to get a grip on VHDL and on different design tools, such as Matlab, Modelsim and Design Compiler. I offer my regards and blessings to all my friends who shared the lab with me, for their inspiration and for exchanging their culture and ideas. It was a great experience for me to work with different people from different countries and to experience the multicultural environment. It also helped me to learn about different areas of electronics, as they were working on different topics. Last but not least, I am grateful to my parents for giving me every kind of support from my birth until now. I believe that without their support it would not have been possible for me to pursue my Masters in Sweden.
Contents
1 Introduction
2 Standard review of mm-Wave
  2.1 Single carrier mode in mm wave PHY (SCPHY)
    2.1.1 Bandwidth and carrier frequency
    2.1.2 Forward error correction (FEC)
    2.1.3 Modulation
  2.2 High speed interface mode in mm wave PHY (HSIPHY)
    2.2.1 Bandwidth and carrier frequency
    2.2.2 Forward error correction
    2.2.3 Modulation
    2.2.4 OFDM
  2.3 Audio visual mode in mm wave PHY (AVPHY)
    2.3.1 Bandwidth and carrier frequency
    2.3.2 Forward error correction
    2.3.3 Modulation
    2.3.4 OFDM
3 High Level Model of IEEE 802.15.3c (HSIPHY)
  3.1 System overview
  3.2 High level model
    3.2.1 Transmitter and receiver
    3.2.2 Channel
  3.3 Performance evaluation
    3.3.1 SNR vs BER
    3.3.2 WordLength vs BER
4 Background of FFT
  4.1 Theoretical background
  4.2 Architecture of the FFT
    4.2.1 Feedforward architectures
    4.2.2 Single path delay feedback
  4.3 Building blocks of the FFT
    4.3.1 Complex multiplier
    4.3.2 Butterfly
    4.3.3
    4.3.4
5 Implementation of FFT on ASIC
  5.1 Design issue related to the FFT processor
  5.2 Radix-8
  5.3 Proposed architecture
    5.3.1 Radix-8 butterfly
    5.3.2 Shuffler
  5.4 ROMs for the coefficients
  5.5 Controller
  5.6 Methodology
    5.6.1 Hardware implementation in VHDL
    5.6.2 Functionality testing
    5.6.3 Synthesizing and area calculation
    5.6.4 Power calculation
  5.7 Design for Low Power
  5.8 Comparison to previous approaches
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future work
Bibliography
List of Figures
2.1 Constellation diagram of π/2 BPSK.
2.2 Constellation diagram of π/2 QPSK.
2.3 Constellation diagram of π/2 8-PSK.
2.4 Constellation diagram of π/2 16-QAM.
2.5 Constellation diagram of DAMI.
2.6 Constellation diagram of OOK.
2.7 FEC data multiplexer.
2.8 Constellation diagram of QPSK modulation.
2.9 Constellation diagram of 16 QAM modulation.
2.10 Constellation diagram of 64 QAM modulation.
2.11 Convolutional encoder.
3.1 IEEE 802.15.3c system.
3.2 BER as a function of SNR.
3.3 BER as a function of wordlength at SNR 35 dB.
4.1 SFG of radix-2.
4.2 SFG of radix-4.
4.3 SFG of radix-16 decimation in frequency.
4.4 SFG of radix-16 decimation in time.
4.5 Radix-2 feedforward architecture.
4.6 Radix-4 feedforward architecture.
4.7 Radix-2 feedback architecture.
4.8 Radix-4 feedback architecture.
4.9 Complex multiplier.
4.10 Radix-2 butterfly.
4.11 ROM for coefficients.
4.12 Memory with pointer.
4.13 Shift registers.
5.1 SFG of radix-8 decimation in time.
5.2 SFG of radix-8 decimation in frequency.
5.3 Data path of the FFT.
5.4 Data path of the FFT.
5.5 Implementation of radix-8 butterfly.
5.6 Shuffling circuit.
5.7 Block diagram of shuffler 1.
5.8 Block diagram of shuffler 2.
5.9 Block diagram of shuffler 3.
5.10 Block diagram of shuffler 4.
5.11 Datapath controller.
5.12 ROM controller.
5.13 Entity of complex multiplier.
5.14 Entity of a radix-2 butterfly.
5.15 Entity of shuffling circuit.
5.16 Area and power consumption of the FFT before and after frequency scaling.
5.17 Power consumption before and after voltage scaling.
5.18 Power and area for different length buffer.
5.19 Power and area of complex multiplier.
5.20 Power and area of radix-8 butterfly.
5.21 Power and area of FFT.
List of Tables
2.1 Bandwidth and center frequency for different channels
2.2 Modulation dependent normalization factor
2.3 Subcarrier frequency allocation
2.4 Timing-related parameters for HSIPHY
2.5 Low data rate channelization
2.6 High data rate OFDM parameter
2.7 Low data rate OFDM parameter
3.1 MCS 6 specifications
3.2 Argument for modem.qammod
3.3 Argument for modem.qamdemod
4.1 Comparison of pipelined architecture for the N point FFT
5.1 Constraint of the ASIC
5.2 Design constraint of the FFT
5.3 Selection signal information
5.4 Memory and shift register performance for different wordlength
5.5 Area and power for different components
5.6 FFT performance for different wordlength
5.7 Comparison of architectures for the computation of a 512-point 8-parallel FFT
5.8 Comparison of various FFT for WPAN application
Chapter 1
Introduction
Applications in communication systems, and the data rates they demand, are advancing rapidly with time. Different task groups have developed different standards, and some of them have been adopted by the IEEE. IEEE 802.15.3c is one of them. Other standards in the IEEE 802.15 family include Bluetooth and Zigbee. These standards can support a data rate up to 100 Mb/s for short range (1 m - 10 m) communication. However, those standards are not suitable for applications such as live HD video streaming at a bit rate of 3 Gbps, replacing an HDMI (2.2 Gbps) connection with wireless connectivity, or transferring large files at very high speed. In 2005, IEEE 802.15 Alternative Task Group 3c developed a standard aimed at providing wireless communication in a personal area with a data rate high enough to support those applications [1]. This standard uses the 60 GHz band as a carrier frequency. However, research shows that the band near 60 GHz has high attenuation in air compared to the 5 GHz band. As a result, this band is more suitable for indoor than for outdoor applications. Moreover, the attenuation can limit the problem of channel interference. Later, in 2009, the standard was adopted by the IEEE.

The title of the thesis work is "High Level Model of IEEE 802.15.3c and Implementation of a Suitable FFT on ASIC". There are two components to this title. The first one is the high level model of the IEEE 802.15.3c standard. It includes the exploration of different aspects of the standard, such as a review of the standard and a high level model of one specific mode of the standard. The high level model has been used to optimize different parameters (such as SNR and finite wordlength) for the physical layer. The second component is the implementation of a suitable FFT on ASIC. The HSIPHY mode of this standard adopts orthogonal frequency division multiplexing (OFDM) to overcome the multipath fading effect of the wireless channel, and the FFT is the key component of OFDM.
To implement an FFT on ASIC, a 65 nm technology standard cell library has been used. The main focus of the implementation was to reduce the power as well as the area.

This document is organized in the following chapters:
Chapter 1: Introduction
Chapter 2: Standard Review of mm-Wave - A review of the IEEE 802.15.3c standard and its different modes of operation.
Chapter 3: High Level Model of IEEE 802.15.3c (HSIPHY) - Modeling of the physical layer for the High Speed Interface (HSIPHY) and the effect of finite wordlength and SNR on the bit error rate.
Chapter 4: Background of the FFT - Discussion of the algorithm of the discrete Fourier transform (DFT), different architectures of the FFT and the basic building blocks.
Chapter 5: Implementation of the FFT on ASIC - Details of radix-8 and design issues, hardware implementation and results of the FFT.
Chapter 6: Conclusion and Future Work - Conclusions drawn on the basis of the results and some directions for further research.
The whole design is based on Matlab and VHDL. The communication toolbox of Matlab has been used for the high level model of the standard, and VHDL has been used as the hardware description language for the implementation of the FFT. Modelsim and Design Compiler have been used for the functionality testing and for the compilation of the design for a specific technology library, respectively. Finally, performance measurement (calculation of the power and area for a specific clock frequency) has been done by means of Design Compiler and Nanosim.
Chapter 2
Standard review of mm-Wave
2.1
Single carrier mode in mm wave PHY (SCPHY)
This mode provides three different classes of modulation and coding schemes targeting different wireless connectivity applications. Class 1 is specified for low rate and low cost mobile operation and can support a data rate of up to 1.5 Gb/s. Class 2 is specified to achieve a data rate of up to 3 Gb/s, and class 3 is specified for high speed and high performance applications with a data rate over 5 Gb/s [1].
2.1.1
Bandwidth and carrier frequency
This mode operates on four different channels with carrier frequencies ranging from 57.24 GHz to 65.88 GHz [1]. The bandwidth is the same for all four channels. These channels are defined in Table 2.1.
Table 2.1: Bandwidth and center frequency for different channels
Channel ID   Start frequency (GHz)   Center frequency (GHz)   Stop frequency (GHz)
1            57.24                   58.32                    59.40
2            59.40                   60.48                    61.56
3            61.56                   62.64                    63.72
4            63.72                   64.80                    65.88
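The regular spacing in Table 2.1 can be checked with a few lines of Python. This is only an illustrative sketch: the constant and function names are made up for this example, and the 2160 MHz channel width is simply the difference between the stop and start frequencies in the table.

```python
# Channel plan in integer MHz to avoid floating-point rounding.
FIRST_START_MHZ = 57240   # start frequency of channel 1 (Table 2.1)
BANDWIDTH_MHZ = 2160      # 59.40 GHz - 57.24 GHz, identical for all channels


def channel_plan(n_channels=4):
    """Return (channel_id, start, center, stop) tuples in MHz."""
    plan = []
    for ch in range(1, n_channels + 1):
        start = FIRST_START_MHZ + (ch - 1) * BANDWIDTH_MHZ
        center = start + BANDWIDTH_MHZ // 2
        stop = start + BANDWIDTH_MHZ
        plan.append((ch, start, center, stop))
    return plan


for ch, start, center, stop in channel_plan():
    print(f"channel {ch}: {start / 1000:.2f} / {center / 1000:.2f} / {stop / 1000:.2f} GHz")
```

Each channel starts where the previous one stops, which is why a single start frequency and a single bandwidth are enough to regenerate the whole table.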
2.1.2
Forward error correction (FEC)
This mode of operation supports Reed-Solomon (RS) block codes and low density parity check (LDPC) block codes as forward error correction schemes. The RS block code is mandatory and the LDPC block codes are optional. The different coding schemes are described as follows.

RS(255,239)
The RS(255,239) code shall use the generator polynomial in Equation 2.1 [1]. The number of input words is 239; the encoder generates 16 parity words and sends them along with the 239 input words, so the total number of output words is 255.
$$g(x) = \prod_{k=1}^{16} \left(x + \alpha^{k}\right) \tag{2.1}$$
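As a rough illustration of Equation 2.1, the sketch below builds a degree-16 generator polynomial over GF(2^8). It assumes that α is the field element 0x02 and that the field is defined by the primitive polynomial p(x) = 1 + x^2 + x^3 + x^4 + x^8 (0x11D in bit-mask form); the helper names `gf_mul`, `rs_generator` and `poly_eval` are hypothetical and are not taken from the thesis or from any toolbox.

```python
PRIM_POLY = 0x11D  # bit mask of p(x) = 1 + x^2 + x^3 + x^4 + x^8


def gf_mul(a, b):
    """Multiply two GF(2^8) elements (shift-and-add with reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY
        b >>= 1
    return result


def rs_generator(n_parity=16):
    """Coefficients of prod_{k=1..n_parity} (x + alpha^k), highest degree first."""
    g = [1]
    root = 1
    for _ in range(n_parity):
        root = gf_mul(root, 2)          # next power of alpha (alpha = 0x02, assumed)
        new = g + [0]                   # g(x) * x
        for i, coeff in enumerate(g):   # ... plus root * g(x)
            new[i + 1] ^= gf_mul(coeff, root)
        g = new
    return g


def poly_eval(poly, x):
    """Evaluate a polynomial (highest degree first) at x using Horner's rule."""
    y = 0
    for coeff in poly:
        y = gf_mul(y, x) ^ coeff
    return y


g = rs_generator()
print(len(g) - 1)  # degree of g(x)
```

The degree of g(x) equals the number of parity words, which matches the 255 - 239 = 16 redundant words of the code.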
In Equation 2.1, α is a root of the primitive polynomial p(x) = 1 + x^2 + x^3 + x^4 + x^8 and x is the input data.

LDPC(672,588)
The LDPC code is systematic, i.e., it encodes an information block i of size k into a codeword c of size n by adding n - k parity bits. Each parity-check matrix is partitioned into square sub-blocks of size z × z. The cyclic permutation matrix p^i is obtained by cyclically shifting the z × z identity matrix i times.
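$$p^{0} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{pmatrix}, \qquad p^{2} = \begin{pmatrix} 0 & 0 & 1 & \cdots & 0 \\ 0 & 0 & 0 & \ddots & 0 \\ \vdots & & & & 1 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \end{pmatrix}$$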
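The construction of the cyclic permutation sub-blocks can be sketched as follows. The shift direction (columns of the identity matrix moved to the right) and the small block size z = 5 are assumptions chosen for illustration only; the function names are hypothetical.

```python
def cyclic_permutation(z, shift):
    """z x z identity matrix with its columns cyclically shifted right by `shift`."""
    return [[1 if col == (row + shift) % z else 0 for col in range(z)]
            for row in range(z)]


def matmul_gf2(a, b):
    """Product of two square binary matrices over GF(2)."""
    n = len(a)
    return [[sum(a[r][k] & b[k][c] for k in range(n)) % 2 for c in range(n)]
            for r in range(n)]


p0 = cyclic_permutation(5, 0)   # plain identity matrix
p2 = cyclic_permutation(5, 2)   # identity shifted twice
for row in p2:
    print(row)
```

Because shifting once and then once more is the same as shifting twice, multiplying two such blocks simply adds their shifts modulo z, which is one reason this structure is convenient in hardware.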
LDPC(672,588) has 588 input bits and 672 output bits, giving a code rate of 7/8. The number of parity bits is 84. The table is given in [1].
LDPC(672,504)
LDPC(672,504) has 504 input bits and 672 output bits, giving a code rate of 3/4. The number of parity bits is 168. It uses the same identity and permutation matrices as discussed in Section 2.1.2. The table is given in [1].

LDPC(672,336)
LDPC(672,336) is used for highly reliable applications, with a code rate of 1/2. It takes 336 bits as input and generates 672 bits. It uses the identity matrix of Section 2.1.2, and the table is given in [1].
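The code rates and parity counts quoted for the three LDPC codes follow directly from the (n, k) pairs. A minimal check, with a hypothetical helper name:

```python
from fractions import Fraction


def ldpc_parameters(n, k):
    """Return (code rate, number of parity bits) for an (n, k) block code."""
    return Fraction(k, n), n - k


for n, k in [(672, 588), (672, 504), (672, 336)]:
    rate, parity = ldpc_parameters(n, k)
    print(f"LDPC({n},{k}): rate {rate}, {parity} parity bits")
```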
2.1.3
Modulation
This mode supports six different modulation schemes depending on the data rate and the performance requirements of the applications. Four of them are mandatory and the other two are optional. The optional schemes are used for low data rate applications.

π/2 BPSK
π/2 BPSK is a binary phase modulation with a π/2 phase shift counterclockwise. Figure 2.1 shows the constellation mapping of the π/2 BPSK signal. Here, z_l is the input bit. The input bit is mapped to 1 on the constellation diagram when the input is 1; otherwise, the bit is mapped to j. With this modulation, one symbol is generated for every bit.
Figure 2.1: Constellation diagram of π/2 BPSK.
π/2 QPSK
π/2 QPSK encodes 2 bits per symbol, with a rotation of π/2 counterclockwise. This modulation technique uses four equally spaced phases at the same radius. Figure 2.2 is the constellation mapping diagram for π/2 QPSK. This modulation scheme uses Gray encoding [1].
Figure 2.2: Constellation diagram of π/2 QPSK.
π/2 8-PSK
The constellation diagram of π/2 8-PSK is depicted in Figure 2.3. In this technique, three bits are mapped to one symbol of the constellation. Here, the three bits are denoted d1 d2 d3. Again, the π/2 rotation is applied as in the previous cases. Eight different symbols are used to represent the incoming bits. The bits are Gray encoded here as well.
π/2 16-QAM
The π/2 16-QAM constellation diagram is depicted in Figure 2.4. Here, four bits b1 b2 b3 b4 are mapped to one symbol. 16 different symbols with different radii are used to represent the incoming bits.
Figure 2.4: Constellation diagram of π/2 16-QAM.
Dual Alternate Mark Inversion
Dual Alternate Mark Inversion (DAMI) coding is optional and is used for low data rate and low cost applications. The constellation diagram is shown in Figure 2.5. It takes two bits as input and generates one symbol.

On-Off Keying
On-Off Keying (OOK) is also optional and, like DAMI, is used for low data rate and low cost applications. Figure 2.6 shows the constellation diagram. It generates one symbol for every input bit.
Figure 2.5: Constellation diagram of DAMI.
Figure 2.6: Constellation diagram of OOK.
2.2
High speed interface mode in mm wave PHY (HSIPHY)
The HSI PHY is designed for low latency, high speed data, and it uses orthogonal frequency division multiplexing (OFDM). This mode supports different modulation and coding schemes using different frequency domain spreading factors, modulations and LDPC block codes.
2.2.1
Bandwidth and carrier frequency
This mode uses Channel IDs 2 and 3 of Table 2.1 as carrier frequencies [1]. The band starts at 59.40 GHz and ends at 63.72 GHz. The center frequencies are 60.48 GHz and 62.64 GHz for Channel IDs 2 and 3, respectively.
2.2.2
Forward error correction
This mode uses both equal error protection (EEP) and unequal error protection (UEP), depending on the data rate and performance. The data multiplexer is shown in Figure 2.7. In the EEP case both LDPC blocks are the same, and in the UEP case the two LDPC blocks are different. In this mode, four different LDPC codes with different code rates are used. Three of them are the same as for SCPHY and the fourth one is LDPC(672,420), which is discussed in the following.
Figure 2.7: FEC data multiplexer.
LDPC(672,420)
LDPC(672,420) is used for high reliability applications, with a code rate of 5/8. It takes 420 bits as input and generates 672 bits, of which 252 are parity bits.
2.2.3
Modulation
This mode uses three different modulation techniques depending on the data rate and the performance. The modulation dependent normalization factor is given in Table 2.2. It is also stated in [1] that the value of d is 1 for the normal constellation and 1.25 for the skewed constellation.

Table 2.2: Modulation dependent normalization factor
Modulation   K_mod
QPSK         $1/\sqrt{1 + d^2}$
16-QAM       $1/\sqrt{5\,(1 + d^2)}$
64-QAM       $1/\sqrt{21\,(1 + d^2)}$
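The factors in Table 2.2 can be understood as scaling each constellation to unit average power. The sketch below checks this numerically under one assumption that is not spelled out in the text: the in-phase levels are taken as odd integers (±1, ±3, ...) and the quadrature levels as the same integers multiplied by d. The function names and the bit-agnostic point ordering are illustrative only.

```python
import math
from itertools import product

# Per-axis amplitude levels and the constant that appears in Table 2.2.
MODULATIONS = {
    "QPSK": ((-1, 1), 1),
    "16-QAM": ((-3, -1, 1, 3), 5),
    "64-QAM": ((-7, -5, -3, -1, 1, 3, 5, 7), 21),
}


def k_mod(name, d):
    """Normalization factor from Table 2.2."""
    _, factor = MODULATIONS[name]
    return 1.0 / math.sqrt(factor * (1.0 + d * d))


def average_power(name, d):
    """Average power of the scaled constellation (assumed level layout)."""
    levels, _ = MODULATIONS[name]
    k = k_mod(name, d)
    points = [k * complex(i, d * q) for i, q in product(levels, repeat=2)]
    return sum(abs(p) ** 2 for p in points) / len(points)


for name in MODULATIONS:
    for d in (1.0, 1.25):
        print(f"{name}, d = {d}: average power = {average_power(name, d):.6f}")
```

The constants 1, 5 and 21 are the mean squares of the per-axis levels, so the same normalization works for both the normal (d = 1) and the skewed (d = 1.25) constellation.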
QPSK
The constellation diagram of QPSK is depicted in Figure 2.8. SCPHY also uses QPSK, but in this mode there is no π/2 rotation. The modulator takes two bits b1 b2 as input and maps them to a symbol. There are four symbols at the same radius in the constellation diagram.
Figure 2.8: Constellation diagram of QPSK modulation.
16 QAM
16 QAM takes four bits d1 d2 d3 d4 as input and generates one symbol. The constellation diagram is shown in Figure 2.9. There are 16 different symbols with different values and radii in the constellation diagram. It can provide a higher data rate than QPSK.

64 QAM
The constellation diagram of 64 QAM is shown in Figure 2.10. Six bits are mapped to one symbol. Here, b1 b2 b3 b4 b5 b6 are the six input bits. In the constellation diagram there are 64 different symbols with different radii and angles.
2.2.4
OFDM
This mode supports OFDM. There are 3 DC sub-carriers, 16 pilot sub-carriers, 16 guard sub-carriers and 336 data sub-carriers [1]. The sub-carriers and their logical indexes are described in Table 2.3. The total number of sub-carriers is 512, with a throughput of 2.64 GS/s for this mode. The timing related parameters for the FFT are given in Table 2.4.
Figure 2.9: Constellation diagram of 16 QAM modulation.
Table 2.3: Subcarrier frequency allocation
Subcarrier type     Number of subcarriers   Logical subcarrier indexes
Null subcarriers    141                     [-256 : -186] and [186 : 255]
DC subcarriers      3                       -1, 0, 1
Pilot subcarriers   16                      [-166 : 22 : -12] and [12 : 22 : 166]
Guard subcarriers   16                      [-185 : -178] and [178 : 185]
Data subcarriers    336                     All others
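The index ranges in Table 2.3 can be cross-checked against the subcarrier counts with a short script. This is a sketch only: the ranges are read as inclusive Matlab-style start:step:stop expressions, and the function name is hypothetical.

```python
def hsi_subcarrier_map():
    """Group the 512 logical subcarrier indexes (-256..255) by type."""
    null = set(range(-256, -185)) | set(range(186, 256))
    dc = {-1, 0, 1}
    pilot = set(range(-166, -11, 22)) | set(range(12, 167, 22))
    guard = set(range(-185, -177)) | set(range(178, 186))
    # Everything that is not null, DC, pilot or guard carries data.
    data = set(range(-256, 256)) - null - dc - pilot - guard
    return {"null": null, "dc": dc, "pilot": pilot, "guard": guard, "data": data}


for kind, indexes in hsi_subcarrier_map().items():
    print(f"{kind}: {len(indexes)} subcarriers")
```

The groups are disjoint and together cover all 512 FFT bins, so the data subcarriers are fully determined by the other four rows of the table.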
2.3
Audio visual mode in mm wave PHY (AVPHY)
This mode of the standard is mainly for multimedia applications, such as live HD video streaming and the replacement of wired HDMI connectivity with wireless connectivity. This mode operates at two data rates: a low data rate and a high data rate. The modulation and the coding schemes vary with the data rate.
2.3.1
Bandwidth and carrier frequency
This mode supports two different data rates, a high data rate and a low data rate, and different channels are used for them. The high data rate uses Channel ID 2 of Table 2.1, whereas the low data rate supports three different channels. These are described in Table 2.5, where f_c(HRP) is the center frequency of the current high data rate channel.
Figure 2.10: Constellation diagram of 64 QAM modulation.
2.3.2
Forward error correction
This mode of the standard uses convolutional encoding. The convolutional encoder for this standard is depicted in Figure 2.11. The encoder has a code rate of 1/3 and uses six delay elements. The generator polynomials are g0 = 133 (octal), g1 = 171 (octal) and g2 = 165 (octal). The initial values of the memories are set to 0.
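A rate-1/3 convolutional encoder with six delay elements and these three generator polynomials can be sketched in a few lines. The bit ordering is an assumption made for this example (the most significant bit of each octal generator taps the current input bit); the function name is hypothetical and the sketch does not cover puncturing or tail bits.

```python
GENERATORS = (0o133, 0o171, 0o165)   # g0, g1, g2 in octal
MEMORY = 6                           # number of delay elements


def conv_encode(bits, generators=GENERATORS, memory=MEMORY):
    """Encode a list of 0/1 bits; returns len(generators) output bits per input bit."""
    state = 0                        # delay line, initialised to all zeros
    out = []
    for bit in bits:
        # Register holds the current input (MSB) followed by the six stored bits.
        register = (bit << memory) | state
        for g in generators:
            # Output bit = parity of the tapped register positions.
            out.append(bin(register & g).count("1") % 2)
        state = register >> 1        # shift: the oldest stored bit drops out
    return out


impulse = [1] + [0] * MEMORY
coded = conv_encode(impulse)
print(coded[0::3])   # response of the g0 branch to a single 1
```

Feeding a single 1 followed by zeros reads out each generator polynomial bit by bit, which is a convenient way to check the tap wiring of such an encoder.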
2.3.3
Modulation
This mode uses the same QPSK and 16 QAM modulation schemes as shown in Figures 2.8 and 2.9, respectively. This mode also uses Gray coded input bits.
2.3.4
OFDM
This mode uses two different OFDM configurations, one for the high data rate and one for the low data rate. These are described in Tables 2.6 and 2.7, respectively.
Table 2.4: Timing-related parameters for HSIPHY
Parameter   Description                                  Value
f_s         Reference sampling rate                      2640 MHz
T_C         Sample duration                              0.38 ns
N_sc        Number of subcarriers                        512
N_dsc       Number of data subcarriers                   336
N_P         Number of pilot subcarriers                  16
N_G         Number of guard subcarriers                  141
N_DC        Number of DC subcarriers                     3
N_R         Number of reserved subcarriers               16
N_U         Number of used subcarriers                   352
N_GI        Guard interval length in samples             64
f_sc        Subcarrier frequency spacing                 5.15625 MHz
BW          Nominal used bandwidth                       1815 MHz
T_FFT       IFFT and FFT period                          193.94 ns
T_GI        Guard interval duration                      24.24 ns
T_S         OFDM symbol duration (T_FFT + T_GI)          218.18 ns
F_S         OFDM symbol rate                             4.583 MHz
N_CPS       Number of samples per OFDM symbol            576
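Most entries of Table 2.4 follow from three basic numbers: the sampling rate, the FFT size and the guard interval length. The sketch below recomputes them; the function and key names are chosen for this example only.

```python
FS_MHZ = 2640.0    # reference sampling rate
N_SC = 512         # number of subcarriers (FFT size)
N_GI = 64          # guard interval length in samples
N_USED = 352       # number of used subcarriers


def hsi_timing():
    """Derive the timing-related parameters of the HSI PHY from FS, N_SC and N_GI."""
    t_fft_ns = N_SC / FS_MHZ * 1e3
    t_gi_ns = N_GI / FS_MHZ * 1e3
    t_sym_ns = t_fft_ns + t_gi_ns
    spacing_mhz = FS_MHZ / N_SC
    return {
        "sample_duration_ns": 1e3 / FS_MHZ,
        "subcarrier_spacing_mhz": spacing_mhz,
        "used_bandwidth_mhz": N_USED * spacing_mhz,
        "t_fft_ns": t_fft_ns,
        "t_gi_ns": t_gi_ns,
        "symbol_duration_ns": t_sym_ns,
        "symbol_rate_mhz": 1e3 / t_sym_ns,
        "samples_per_symbol": N_SC + N_GI,
    }


for name, value in hsi_timing().items():
    print(f"{name}: {value:.5f}")
```

The FFT period of about 194 ns is the time budget within which one 512-point transform must complete, which is what leads to the 2.64 GS/s throughput requirement on the FFT.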
Table 2.5: Low data rate channelization
Channel index   Start frequency            Center frequency           Stop frequency
1               f_c(HRP) - 207.625 MHz     f_c(HRP) - 158.625 MHz     f_c(HRP) - 109.625 MHz
2               f_c(HRP) - 49 MHz          f_c(HRP)                   f_c(HRP) + 49 MHz
3               f_c(HRP) + 109.625 MHz     f_c(HRP) + 158.625 MHz     f_c(HRP) + 207.625 MHz
Table 2.6: High data rate OFDM parameter
Parameter                    Expression                  Value
Occupied bandwidth                                       1.76 GHz
Reference sampling rate                                  2.538 GHz
Number of subcarriers                                    512
FFT period                   N_sc(HR) / f_s(HR)          202 ns
Subcarrier spacing           1 / T_FFT(HR)               4.96 MHz
Guard interval               64 / f_s(HR)                25.2 ns
Symbol duration              T_FFT(HR) + T_GI(HR)        227 ns
Number of data subcarriers                               336
Figure 2.11: Convolutional encoder.
Table 2.7: Low data rate OFDM parameter
Parameter                    Expression                  Value
Occupied bandwidth                                       92 MHz
Reference sampling rate                                  317.25 MHz
Number of subcarriers                                    128
FFT period                   N_sc(LR) / f_s(LR)          403 ns
Subcarrier spacing           1 / T_FFT(LR)               2.48 MHz
Guard interval               28 / f_s(LR)                88.3 ns
Symbol duration              T_FFT(LR) + T_GI(LR)        492 ns
Number of data subcarriers                               30
Chapter 3
High Level Model of IEEE 802.15.3c (HSIPHY)
3.1
System overview
The system is depicted in Figure 3.1. The system can be divided into two main sections: the transmitter and the receiver. The transmitter gets the data from the MAC or protocol layer, and the receiver sends the data to the protocol layer. The data received from the protocol layer are encoded by the LDPC encoder, where extra bits are added to protect the signal from the noise on the channel. The coded bits are modulated by the modulator and converted to discrete samples. The OFDM block converts those samples from a discrete frequency signal to a discrete time signal. Then the digital to analog converter (DAC) converts the discrete signal to a continuous time signal. The continuous time signal is processed in the RF section. Before transmission by the antenna, the RF section up-converts the baseband signal and amplifies it. At the other end, the RF section of the receiver receives the signal, applies proper filtering and down-converts the received signal.
Figure 3.1: IEEE 802.15.3c system.

The transmitted signals propagate to the receiver through the wireless channel, which introduces noise. The receiver picks up the noisy signal with its antenna. The received signals are continuous time signals. They are processed in the RF blocks and sent to the analog to digital converter (ADC) block, which makes the signals ready for the baseband processing section. The ADC converts the continuous time signal to a discrete time signal. The discrete time signal is converted to a frequency domain signal by the OFDM block, which is nothing but an implementation of the FFT. Samples in frequency are converted into bits in the demodulator block. The retrieved bits are sent to the MAC or protocol layer after the LDPC block. In the LDPC block, the encoded bits are decoded with the help of the parity bits.
3.2
High level model
The high level model has been constructed for the specification in Table 3.1. The modelling setup includes Matlab and the communication toolbox. The communication toolbox includes most of the blocks for the system. The unavailable blocks have been modelled in Matlab. The model consists of three main blocks: transmitter, receiver and channel. The transmitter and receiver consist of forward error correction (FEC) with LDPC(672,588), a 16-QAM modulator and OFDM as subcomponents.
3.2.1
Transmitter and receiver

Forward error correction (FEC)
Forward error correction has been used on both the transmitter and the receiver. The LDPC object of the communication toolbox has been used for this purpose. LDPC(672,588) follows the standard [1]. The table and the permuted identity matrices have been generated in Matlab. The table consists of zero matrices and permuted identity matrices.

Modulation and demodulation
Modulation converts bits into samples, and demodulation converts samples into bits. Modulation is done in the transmitter and demodulation in the receiver. 16-QAM modulation and demodulation have been performed in this model. The communication toolbox provides the modem.qammod, modem.qamdemod, modulate and demodulate functions to perform the modulation and demodulation. The arguments for modem.qammod and modem.qamdemod are described in Table 3.2 and Table 3.3. The created objects are then used in the modulate and demodulate functions to perform the modulation and demodulation.

Table 3.2: Argument for modem.qammod
Argument      Description                     Value
M             Modulation index                16
PhaseOffset   Offset phase of the mapping     π/2
SymbolOrder   Symbol order of the input       gray
InputType     Type of input                   bit

Table 3.3: Argument for modem.qamdemod
Argument        Description                   Value
M               Modulation index              16
PhaseOffset     Offset phase of the mapping   π/2
SymbolOrder     Symbol order of the input     gray
InputType       Type of input                 bit
DecisionType    Type of decision              LLR
NoiseVariance   Noise variance of system      1.2
Orthogonal frequency division multiplexing (OFDM)
The OFDM block has been modelled using an IFFT in the transmitter and an FFT in the receiver. In the transmitter, 141 null subcarriers, 3 DC subcarriers, 16 pilot subcarriers and 16 guard subcarriers are added to the 336 data subcarriers before the IFFT, which gives 512 subcarriers in total. In the receiver, the data subcarriers are extracted from the 512 subcarriers after the FFT.
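The subcarrier budget and the IFFT/FFT pair can be sketched in Python as follows. The subcarrier counts come from the text; the positions of the data and pilot subcarriers, the pilot value and all function names are hypothetical and chosen only to keep the example short, and the guard, DC and null subcarriers are simply left at zero. A direct DFT is used instead of a fast algorithm.

```python
import cmath

N_FFT = 512
N_DATA, N_PILOT, N_GUARD, N_DC, N_NULL = 336, 16, 16, 3, 141
assert N_DATA + N_PILOT + N_GUARD + N_DC + N_NULL == N_FFT

# Hypothetical subcarrier positions, not the ones defined by the standard.
DATA_IDX = list(range(N_DATA))
PILOT_IDX = list(range(N_DATA, N_DATA + N_PILOT))


def build_frame(data, pilot=1 + 0j):
    """Place data and pilots; guard, DC and null subcarriers stay at zero."""
    if len(data) != N_DATA:
        raise ValueError("expected 336 data symbols")
    frame = [0j] * N_FFT
    for idx, symbol in zip(DATA_IDX, data):
        frame[idx] = symbol
    for idx in PILOT_IDX:
        frame[idx] = pilot
    return frame


def naive_transform(x, sign):
    """Direct O(N^2) DFT (sign=-1) or unscaled inverse DFT (sign=+1)."""
    n = len(x)
    w = [cmath.exp(sign * 2j * cmath.pi * k / n) for k in range(n)]
    return [sum(x[i] * w[(i * k) % n] for i in range(n)) for k in range(n)]


def transmit(data):
    """Transmitter side: subcarrier mapping followed by the inverse DFT."""
    return [v / N_FFT for v in naive_transform(build_frame(data), +1)]


def receive(samples):
    """Receiver side: DFT followed by extraction of the data subcarriers."""
    spectrum = naive_transform(samples, -1)
    return [spectrum[idx] for idx in DATA_IDX]
```

Over an ideal channel, receive(transmit(data)) returns the 336 data symbols up to rounding errors.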
3.2.2
Channel
The processed signal is transmitted through the channel. The channel is wireless and exhibits multipath fading. The channel can be characterized in two ways: large scale characterization and small scale characterization [2]. Large scale characterization has been applied here, as in Equation 3.1. The path loss $PL(d)$ is defined by the average path loss $\overline{PL}(d)$ and the shadowing fading $X_\sigma$.
\[ PL(d)\,[\mathrm{dB}] = \overline{PL}(d)\,[\mathrm{dB}] + X_\sigma\,[\mathrm{dB}] \tag{3.1} \]
The average path loss $\overline{PL}(d)$ can be expressed as in Equation 3.2, where $d_0$ and $n$ denote the reference distance and the path loss exponent. The path loss exponent $n$ varies for different environments. This model has been built for the room environment. $X_q$ is the additional attenuation due to a specific obstruction by objects.
\[ \overline{PL}(d)\,[\mathrm{dB}] = PL(d_0)\,[\mathrm{dB}] + 10\,n \log_{10}\!\left(\frac{d}{d_0}\right) + \sum_{q=1}^{Q} X_q, \quad d \ge d_0 \tag{3.2} \]
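The large scale model can be evaluated with a few lines of Python. This is a sketch under stated assumptions: the function and parameter names are introduced here, and the numbers in the usage note are placeholders, not the parameters of the room environment used in the thesis.

```python
import math


def average_path_loss_db(d, d0, pl_d0_db, n, obstructions_db=()):
    """Average path loss: PL(d0) + 10 n log10(d/d0) + sum of X_q, for d >= d0."""
    if d < d0:
        raise ValueError("the model is only defined for d >= d0")
    return pl_d0_db + 10.0 * n * math.log10(d / d0) + sum(obstructions_db)


def path_loss_db(d, d0, pl_d0_db, n, shadowing_db=0.0, obstructions_db=()):
    """Path loss: the average path loss plus one shadowing term in dB."""
    return average_path_loss_db(d, d0, pl_d0_db, n, obstructions_db) + shadowing_db
```

For example, with a placeholder reference loss of 68 dB at 1 m and an exponent of 2, the average path loss at 10 m is 68 + 20 = 88 dB, and every obstruction term adds directly to that figure.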
3.3
Performance evaluation
Two different performance measures have been observed in this model. The first is BER as a function of SNR and the second is BER as a function of the wordlength in the FFT. These are described in the following subsections.
3.3.1
SNR vs BER
The BER improves as the SNR of the system increases. The graph in Figure 3.2 shows the results for different wordlengths: the blue line is for wordlength 8, the red line for wordlength 12 and the black line for wordlength 16. For each wordlength, the BER of the model decreases as the SNR increases. Thus, for a specific wordlength, the SNR needed to achieve a target BER can be read from the graph.
3.3.2
WordLength vs BER
BER as a function of wordlength is shown in Figure 3.3. Here, the SNR of the system is 35 dB. The wordlength needed to achieve a specific BER can be selected from the graph. As the quantization noise is reduced for higher wordlengths, the BER improves as the input wordlength increases.
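The link between wordlength and quantization noise can be illustrated with a small Python experiment. It is not the thesis model: it only rounds a full-scale sine wave to a given number of bits and measures the signal-to-quantization-noise ratio (SQNR). The test signal, the rounding quantizer and the function name are assumptions of this sketch.

```python
import math


def sqnr_db(wordlength, n_samples=4096, cycles=127):
    """SQNR in dB of a full-scale sine rounded to the given wordlength."""
    step = 2.0 / (1 << wordlength)  # quantizer step for a range of 2
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n_samples):
        x = math.sin(2.0 * math.pi * cycles * i / n_samples)
        q = round(x / step) * step  # uniform rounding quantizer
        signal_power += x * x
        noise_power += (x - q) ** 2
    return 10.0 * math.log10(signal_power / noise_power)
```

Each additional bit is expected to improve the SQNR by roughly 6 dB, which is the usual textbook approximation, so going from 8 to 12 bits should gain about 24 dB.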
[Figure 3.2: BER as a function of the signal-to-noise ratio (dB) for wordlengths 8, 12 and 16.]
[Figure 3.3: BER as a function of wordlength at an SNR of 35 dB.]
Chapter 4
Background of FFT
A short description of the FFT algorithm, the different architectures and the basic building blocks of these architectures is given in this chapter. Further information about the algorithm and the architectures can be found in [3–7].
4.1
Theoretical background
Some claim that the modern world started in 1965, when J. Cooley and J. Tukey published their efficient method for the numerical computation of the Fourier transform. Others claim that the method was introduced by Gauss in the early 1800s, since the idea that lies at the heart of the algorithm is clearly present in an unpublished paper that appeared posthumously in 1866. In any case, continuous signals are nowadays processed by discrete methods, because computers and digital processing systems cannot work with continuous sums. The Fourier transform represents a general function as a sum of trigonometric functions. The FFT is an efficient algorithm to compute this operation, which transforms a time domain signal into a frequency domain signal according to the DFT:
\[ X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \quad k = 0, \ldots, N-1 \tag{4.1} \]
In Equation 4.1, $X[k]$ and $x[n]$ are the complex output and input of the N-point FFT, respectively, where $n$ is the time index and $k$ is the frequency index. $W_N^{kn}$ is the twiddle factor, defined as in Equation 4.2.
\[ W_N^{kn} = e^{-j(2\pi kn/N)} = \cos\!\left(\frac{2\pi kn}{N}\right) - j \sin\!\left(\frac{2\pi kn}{N}\right) \tag{4.2} \]
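Equations 4.1 and 4.2 translate directly into the following Python sketch of the DFT. It evaluates the sum term by term, so it needs on the order of N² operations and serves only as a reference against which a fast algorithm can be checked; the function name is introduced here.

```python
import cmath


def dft(x):
    """Direct evaluation of X[k] = sum_n x[n] * W_N^(k*n), W_N = exp(-j*2*pi/N)."""
    n_points = len(x)
    result = []
    for k in range(n_points):
        acc = 0j
        for n in range(n_points):
            acc += x[n] * cmath.exp(-2j * cmath.pi * k * n / n_points)
        result.append(acc)
    return result
```

For example, a unit impulse transforms into a constant spectrum, and a constant sequence of four ones transforms into a single non-zero value X[0] = 4.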
For a better understanding of the operations performed by the FFT, the FFT is represented by its signal flow graph (SFG). Examples of signal flow graphs are shown in Figures 4.1, 4.2, 4.3 and 4.4. The SFGs in the figures consist of butterflies and complex rotations. For example, Figure 4.1 represents a radix-2 butterfly, which computes:
X[0] = x[0] + x[1]
X[1] = x[0] − x[1]
Figure 4.2 shows a radix-4 butterfly. A radix-4 butterfly includes a complex multiplication by $e^{-j\pi/2} = -j$. This is a trivial operation: from a hardware point of view, it can be done without any hardware cost.
Figure 4.2: SFG of radix-4.
The signal flow graph in Figure 4.3 shows a 16-point radix-2 DIF FFT. The number after every stage, $\phi$, indicates a rotation by $e^{-j\frac{2\pi}{N}\phi}$. The input sequence is in natural order, whereas the outputs are in bit-reversed order. On the other hand, Figure 4.4 shows the signal flow graph of a 16-point radix-2 DIT FFT. In this case, the inputs are in bit-reversed order and the outputs are in natural order. Besides, the placement of the multiplications is not the same.
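The bit-reversed order mentioned above is obtained by reversing the binary representation of each index. A minimal Python sketch, with function names introduced here for illustration:

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n_points):
    """Index order at the output of a DIF FFT; n_points is a power of two."""
    bits = n_points.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n_points)]
```

For 16 points this gives 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15, which is the output order of the DIF graph and the input order of the DIT graph.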
[Figure 4.3: Signal flow graph of the 16-point radix-2 DIF FFT. Inputs 0 to 15 are in natural order and the outputs appear in the bit-reversed order 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15.]
4.2
The architecture of an FFT can be divided into different parts: butterflies, complex rotators, memories for the twiddle factors, and circuits for data management and control. Butterflies and rotators carry out the mathematical operations of the signal flow graph. The basic pipelined architectures for the FFT are discussed below. The basic components of these architectures are discussed in the next section of this chapter.
4.2.1
Feedforward architectures
Radix-2
A radix-2 feedforward architecture is depicted in Figure 4.5. The input sequence is broken down into two parallel data streams flowing forward, and the reordering circuits schedule the correct distance between the data elements that enter each butterfly. In this architecture both the butterflies and the multipliers have a utilization ratio of 100%. In Figure 4.5, C2 are the switches and BF2 are the radix-2 butterflies. The numbers next to the switches are the lengths of the buffers. A detailed description of the architecture can be found in [3].
[Figure 4.4: Signal flow graph of the 16-point radix-2 DIT FFT. Inputs appear in the bit-reversed order 0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15 and the outputs 0 to 15 are in natural order.]
Radix-4
A radix-4 feedforward architecture is depicted in Figure 4.6, where C4 and BF4 are the switches and the radix-4 butterflies. The lengths of the buffers are shown by the numbers in the boxes. Here, the input sequence is broken into four parallel data streams and the shuffler keeps the proper distance between data elements. In this architecture the multipliers and the butterflies have a utilization ratio of 100%. This architecture is suitable for high throughput applications and is described in detail in [8].
[Figure 4.6: Radix-4 feedforward architecture. Each stage consists of a C4 switch with its buffers followed by a BF4 butterfly; the buffer lengths are 64/128/192, 16/32/48, 4/8/12 and 1/2/3 in the four stages.]
4.2.2
Feedback architectures
Radix-2
A radix-2 feedback architecture is depicted in Figure 4.7. This architecture uses the registers efficiently by storing one butterfly output in the feedback shift registers, while a single data stream goes through the multiplier at every stage. However, the complex multipliers and butterflies reach a utilization of only 50%. This architecture is suitable for area efficient implementations and is described in [9].
Radix-4
A radix-4 single path feedback architecture is depicted in Figure 4.8. In this architecture the utilization of the multipliers is increased to 75%. However, the radix-4 butterfly contains at least 8 complex adders and its utilization drops to only 25%. More details about the architecture can be found in [10]. The comparison of the different pipelined architectures is given in Table 4.1.
4.3
These architectures use some basic building blocks, such as complex multipliers, butterflies, ROM tables, RAMs and shift registers. These blocks are discussed in the following subsections.
Figure 4.8: Radix-4 feedback architecture.

Table 4.1: Comparison of pipelined architectures for the N-point FFT
Architecture                   Multipliers      Adders      Control
Radix-2 feedforward [11]       2(log4 N − 1)    4 log4 N    Simple
Radix-4 feedforward [8]        3(log4 N − 1)    8 log4 N    Simple
Radix-2 feedback [11]          2(log4 N − 1)    4 log4 N    Simple
Radix-4 feedback [11, 12]      log4 N − 1       8 log4 N    Medium
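To make the expressions of Table 4.1 concrete, the Python sketch below evaluates them for a given FFT size. The formulas are copied from the table as printed; the function names and the restriction to sizes that are powers of 4 are choices made for this sketch.

```python
def log4(n_points):
    """Integer base-4 logarithm; n_points must be a power of 4."""
    stages = 0
    while n_points > 1:
        if n_points % 4:
            raise ValueError("N must be a power of 4")
        n_points //= 4
        stages += 1
    return stages


def pipelined_costs(n_points):
    """(multipliers, adders) per architecture, as listed in Table 4.1."""
    s = log4(n_points)
    return {
        "radix-2 feedforward": (2 * (s - 1), 4 * s),
        "radix-4 feedforward": (3 * (s - 1), 8 * s),
        "radix-2 feedback": (2 * (s - 1), 4 * s),
        "radix-4 feedback": (s - 1, 8 * s),
    }
```

For N = 256, log4 N = 4, so the radix-4 feedback architecture needs 3 multipliers and 32 adders, while the radix-4 feedforward architecture needs 9 multipliers and 32 adders.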
4.3.1
Complex multiplier
The complex multiplier is shown in Figure 4.9. A complex multiplier computes (a + jb)(c + jd) = (ac − bd) + j(ad + bc). Here a + jb is the multiplicand and c + jd is the multiplier; both have real and imaginary parts. This operation can be done by four real multipliers, one adder and one subtractor. The subtractor can be implemented by an adder with the second operand inverted and the carry input set to 1.
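The two ideas in this paragraph can be checked with a short Python sketch: the complex product built from four real multiplications, and a subtraction carried out by an adder with the second operand inverted and the carry input set to 1 (two's complement arithmetic). The function names and the fixed width argument are introduced here for illustration.

```python
def complex_multiply(a, b, c, d):
    """(a + jb)(c + jd) using four real multiplications."""
    real = a * c - b * d  # one subtractor
    imag = a * d + b * c  # one adder
    return real, imag


def subtract_with_adder(x, y, width):
    """x - y modulo 2**width, computed as x + not(y) + 1 (carry input = 1)."""
    mask = (1 << width) - 1
    return (x + (~y & mask) + 1) & mask
```

For instance, (1 + 2j)(3 + 4j) gives −5 + 10j, and with 8-bit words 3 − 5 yields 254, which is the two's complement representation of −2.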
4.3.2
Butterfly
The butterfly is depicted in Figure 4.10. For the two complex inputs a and b of the butterfly, the outputs are a + b and a − b. This operation requires one complex addition and one complex subtraction. Again, the subtraction can be done by an adder with the carry input set to 1.
4.3.3
ROM
A ROM is used to store the coefficients of the complex multipliers. Each coefficient is stored at a specific address of the ROM and is accessed through that address. ROMs of different sizes are used depending on the size of the FFT and the input wordlength. A ROM is depicted in Figure 4.11. Here, the address is 5 bits and the wordlength is 8 bits.
4.3.4
Buffers
Buffers are used to store the samples as well as to arrange them in the proper sequence for the butterflies. The buffers can be implemented by memories or by shift registers. Memories are typically used for long buffers and shift registers for short ones. A memory is depicted in Figure 4.12, where two pointers point to the read and the write addresses of the memory. In the shift register, on the other hand, samples are shifted to the next register every clock cycle. A shift register is depicted in Figure 4.13.
Figure 4.9: Complex multiplier.
[Figures 4.10 to 4.13: radix-2 butterfly with inputs x[0], x[1] and outputs X[0], X[1]; ROM; memory with read and write pointers; shift register.]
Chapter 5
Table 5.1: Constraints of the ASIC
ASIC Constraint           Value
Library                   CORE65LPSVT
Process                   65 nm
Global Power Supply       0.8 V
Global Clock Frequency    330 MHz

Table 5.2: Design constraints of the FFT
Design Parameter          Value
Length of the FFT         512
Sample rate               2.64 GS/s
Samples in parallel       8
5.1
FFT architectures can be divided into two different categories: pipelined architectures (such as feedforward and feedback) and memory-based architectures. These architectures are described in [7, 13–15] and [16–18], respectively. On the one hand, pipelined architectures have the advantage of high throughput, but they have a high area cost for large FFTs. On the other hand, memory-based architectures have the advantage of low area cost, but their throughput is often limited by the memory access bandwidth and the available number of processing elements.
In order to meet the requirements of the IEEE 802.15.3c standard, a high throughput FFT processor needs to be designed. For high throughput applications, a pipelined FFT architecture is adopted in most cases. Among the different pipelined architectures, single path delay feedback architectures require fewer memories and less hardware than multipath feedforward architectures. However, single path delay feedback architectures only reach 50% utilization of the processing units, compared to multipath feedforward architectures. Moreover, multipath feedforward architectures can process two or more samples in parallel, whereas single path feedback ones only process one sample per clock cycle. Therefore, feedforward architectures can operate at a slower clock than feedback architectures for the same throughput, and the slower clock allows lower power to be achieved. However, these architectures increase the hardware cost significantly, as more complex rotators, butterflies and memories are needed. The advantages and common requirements of the architectures listed above are described in [19–21].
A radix-8 architecture with 8 parallel data has been proposed for this application. As the throughput of the FFT is quite high, processing 8 samples in parallel reduces the clock frequency, and the direct implementation of the radix-8 butterfly needs 8 parallel data. Besides, the proposed architecture reduces the number of multipliers and complex adders. Finally, the processing elements of the data path can operate at a maximum clock frequency of 500 MHz (2 ns delay). Therefore, a 330 MHz clock has been used for the pipelined architecture, and 8 parallel samples are a good choice to reduce the input clock frequency.
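The relation between clock frequency, parallelism and throughput used in this argument can be written down in a few lines of Python. The function names are introduced here; the figures are the ones quoted in the text and in Table 5.2.

```python
MAX_CLOCK_MHZ = 1000 / 2  # a 2 ns delay corresponds to a 500 MHz clock


def throughput_msps(clock_mhz, parallel_samples):
    """Throughput in MS/s when `parallel_samples` are processed per cycle."""
    return clock_mhz * parallel_samples


def required_clock_mhz(target_msps, parallel_samples):
    """Clock frequency needed to sustain a target throughput."""
    return target_msps / parallel_samples
```

With 8 parallel samples a 330 MHz clock gives 2640 MS/s, that is 2.64 GS/s. With only 4 parallel samples the same throughput would need 660 MHz, which is above the 500 MHz that the processing elements can reach.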
5.2
Radix-8
Equation 4.1 shows that the computation of each value of k requires N complex multiplications and N − 1 complex additions. This corresponds to 4N real multiplications and 4N − 2 real additions, where 2N real additions come from the complex multiplications and 2N − 2 from the complex additions. The signal flow graph of the radix-8 decimation in time is depicted in Figure 5.1. The $W_8^0$ coefficients in the SFG can be ignored, because they represent a multiplication by 1. Figure 5.1 shows that the samples arrive at the input of the SFG in bit-reversed order, whereas the outputs are in natural order. The SFG of the radix-8 decimation in frequency is depicted in Figure 5.2. Here, the input samples arrive in natural order and the outputs are in bit-reversed order. The complex multiplications change their position in the SFG. Apart from that, the same number of complex multiplications and additions is used in the decimation in frequency decomposition.
[Figure 5.1: Signal flow graph of the radix-8 decimation in time.]
5.3
Proposed architecture
A 512-point FFT processor has been proposed for this application. The architecture of the FFT and its datapath are depicted in Figure 5.3 and Figure 5.4. The architecture consists of three main parts: fourteen ROM tables that store the coefficients of the multipliers, a data path that computes the FFT, and a controller that controls the ROM coefficients as well as the data path. The controller is simply implemented by a six-bit counter. Figure 5.4 shows that the datapath consists of three stages of radix-8 butterflies. The first two stages of the FFT include a total of 14 complex rotators. The third stage has only a radix-8 butterfly. Shuffler 1 and shuffler 4 are placed before and after the FFT in order to accept input samples and provide output samples in natural order. Shuffler 2 and shuffler 3 are placed inside the FFT to maintain the proper order of the data between stages. The different blocks of the FFT are described as follows.
5.3.1
Radix-8 butterfly
The implementation of the radix-8 butterfly is depicted in Figure 5.5. For this architecture, the radix-8 butterfly is a direct implementation of radix-2 butterflies and constant complex rotations. There are twelve butterflies, two constant complex rotators and three trivial rotators, arranged in three stages of butterflies. The first stage of butterflies is followed by two constant complex rotations and one trivial rotation by −j. The second stage is followed by two trivial rotations by −j.
[Figure 5.2: Signal flow graph of the radix-8 decimation in frequency.]
[Figure 5.3 blocks: input, data path, output, coefficients, controller and 14 ROM tables.]
Figure 5.3: Data Path of the FFT
Figure 5.5 shows the interconnection network of the radix-8 butterfly. The trivial rotations (−1, j and −j) are done by small modifications in the butterfly at no extra hardware cost. The multiplication by −1 is done by interchanging the inputs on the input ports. The multiplication by j is done by interchanging the real and imaginary outputs. The multiplication by −j is done by interchanging both the input and the output signals, as is done for −1 and j.
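The structure of the radix-8 butterfly can be sketched in Python as three stages of four radix-2 butterflies each, that is twelve butterflies, with two constant complex rotations and three trivial rotations. The sketch assumes the forward transform of Equation 4.1 in its decimation in frequency form, so the trivial rotations are by −j and the outputs appear in bit-reversed order. The function names are introduced here, and floating-point arithmetic replaces the fixed-point hardware.

```python
import cmath

W8_1 = cmath.exp(-2j * cmath.pi * 1 / 8)  # constant complex rotation
W8_3 = cmath.exp(-2j * cmath.pi * 3 / 8)  # constant complex rotation


def butterfly(a, b):
    """Radix-2 butterfly: sum and difference of the two inputs."""
    return a + b, a - b


def times_minus_j(z):
    """Trivial rotation: (re + j*im) * (-j) = im - j*re, no multiplier needed."""
    return complex(z.imag, -z.real)


def radix8_butterfly(x):
    """8-point DIF transform; returns X[0], X[4], X[2], X[6], X[1], X[5], X[3], X[7]."""
    s1 = [0j] * 8
    for i in range(4):  # first stage: four butterflies
        s1[i], s1[i + 4] = butterfly(x[i], x[i + 4])
    s1[5] *= W8_1
    s1[6] = times_minus_j(s1[6])
    s1[7] *= W8_3
    s2 = [0j] * 8
    for base in (0, 4):  # second stage: four butterflies
        for i in range(2):
            s2[base + i], s2[base + i + 2] = butterfly(s1[base + i], s1[base + i + 2])
        s2[base + 3] = times_minus_j(s2[base + 3])
    s3 = [0j] * 8
    for base in (0, 2, 4, 6):  # third stage: four butterflies
        s3[base], s3[base + 1] = butterfly(s2[base], s2[base + 1])
    return s3
```

The first stage is followed by the rotations W8^1, −j and W8^3, and the second stage by two rotations by −j, which matches the counts given for the butterfly above.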
5.3.2
Shuffler
Figure 5.6 shows the basic block of the shuffler. The shuffler consists of two multiplexers and input and output buffers. The lengths of the input and output buffers vary at the different stages of the datapath. Both memories and shift registers have been used for the implementation of the buffers. A study of memories and shift registers has shown that memories take less area and consume less power for long buffers, whereas shift registers consume less power and take less area for short buffers. Samples are stored in the buffers for control signal 0. Samples of the output buffers are replaced by those of the input buffers for control signal 1.
Shuffler 1 is shown in Figure 5.7. Twelve shuffling circuits are used in three stages, with different buffer sizes in the different stages: the first, second and third stages have input and output buffers of length 32, 16 and 8, respectively. Three different control signals are used to control the shufflers. For the first stage the control signal changes every 32 clock cycles, as the length of the input and output buffers is 32. The second and third control signals change every 16 and 8 clock cycles, respectively. In addition, the second and third control signals are delayed by 32 and 48 clock cycles, respectively.
Shuffler 2 and shuffler 3 also have three stages, and are shown in Figure 5.8 and Figure 5.9, respectively. The buffer lengths are 1, 2 and 4 for shuffler 2, and 8, 16 and 32 for shuffler 3. The figures show the interconnections of shuffler 2 and shuffler 3. Three control signals are used for the three stages. Control signals 1, 2 and 3 of shuffler 2 change every 1, 2 and 4 clock cycles, respectively, according to the length of the input and output buffers in each stage.
Shuffler 4 is depicted in Figure 5.10. It contains twenty-four shuffling circuits arranged in six stages, controlled by six control signals. The lengths of the input and output buffers of the six stages are 32, 4, 16, 2, 8 and 1. The control signals of the six stages toggle every 32, 4, 16, 2, 8 and 1 clock cycles, respectively.
[Figure 5.5: Radix-8 butterfly built from radix-2 butterflies.]
[Figure 5.6: Basic block of the shuffler.]
[Figure 5.7: Block diagram of shuffler 1, with four rows of 1X32, 1X16 and 1X8 shuffling circuits.]
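A behavioural Python model helps to see what one shuffling circuit does. The sketch assumes a common arrangement for this kind of block: the lower input is delayed by L cycles in the input buffer, two multiplexers either pass or swap the two paths, and the upper output is delayed by L cycles in the output buffer, with the control signal toggling every L cycles. The exact wiring of Figure 5.6 is not reproduced here, so the structure, the function name and the use of None for empty cycles are assumptions of this sketch.

```python
from collections import deque


def shuffler(in0, in1, length):
    """Exchange blocks of `length` samples between two parallel streams."""
    n = len(in0)
    in_buf = deque([None] * length)   # input buffer on the lower path
    out_buf = deque([None] * length)  # output buffer on the upper path
    out0, out1 = [], []
    for t in range(n + length):       # extra cycles flush the buffers
        x0 = in0[t] if t < n else None
        x1 = in1[t] if t < n else None
        in_buf.append(x1)
        delayed = in_buf.popleft()    # lower input delayed by `length` cycles
        sel = (t // length) % 2       # control toggles every `length` cycles
        top, bottom = (x0, delayed) if sel == 0 else (delayed, x0)
        out_buf.append(top)
        out0.append(out_buf.popleft())
        out1.append(bottom)
    return out0[length:], out1[length:]  # drop the initial latency
```

With length 2, the streams (a0, a1, a2, a3) and (b0, b1, b2, b3) leave the circuit as (a0, a1, b0, b1) and (a2, a3, b2, b3): the second block of the upper stream and the first block of the lower stream have changed places.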
5.4
Fourteen ROMs in two stages are used in this architecture: seven memories with 64 addresses for the first stage and seven memories with 8 addresses for the second stage. The 64 addresses of the first-stage ROMs can be represented by 6 bits, and 64 coefficients are stored in each ROM. The content of the ROM at each address is $\cos(\frac{2\pi}{N}\phi) - j\sin(\frac{2\pi}{N}\phi)$. For the 8-bit implementation, $\cos(\frac{2\pi}{N}\phi)$ and $\sin(\frac{2\pi}{N}\phi)$ are represented with 8 bits. The value of $\phi$ depends on the address and on the ROM. The value of $\phi$ for the address $b_5b_4b_3b_2b_1b_0$ of the X-th ROM is $X \cdot (b_2b_1b_0b_5b_4b_3)_2$, where X = 1, 2, ..., 7 is the index of the memory. As an example, the value of $\phi$ for address 001100 of ROM 4 is $4 \cdot (100001)_2$, so $\phi$ is equal to 132.
The second stage has seven ROMs with 8 addresses each, from 0 to 7, which can be represented by 3 bits. The same expression $\cos(\frac{2\pi}{N}\phi) - j\sin(\frac{2\pi}{N}\phi)$ is used to calculate the content of the ROM. The value of $\phi$ for address $b_2b_1b_0$ of ROM X is $X \cdot (b_2b_1b_0)_2$, where X = 1, 2, ..., 7. As an example, the value of $\phi$ for address 101 of ROM 5 is $5 \cdot (101)_2 = 25$.
[Figure 5.8: Block diagram of shuffler 2, with four rows of 1X1, 1X2 and 1X4 shuffling circuits.]
Figure 5.9: Block diagram of shuffler 3.
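The address-to-angle rule for the ROMs can be checked with a short Python sketch. The function names are introduced here, and the ROM content is returned as a floating-point complex number; the quantization to the chosen wordlength is left out because its exact format is not described in the text.

```python
import cmath


def phi_first_stage(rom, address):
    """phi = X * (b2 b1 b0 b5 b4 b3)_2 for ROM X (1..7) and a 6-bit address."""
    if not (1 <= rom <= 7 and 0 <= address < 64):
        raise ValueError("ROM index or address out of range")
    high, low = address >> 3, address & 0b111
    return rom * ((low << 3) | high)  # swap the two 3-bit halves


def phi_second_stage(rom, address):
    """phi = X * (b2 b1 b0)_2 for ROM X (1..7) and a 3-bit address."""
    if not (1 <= rom <= 7 and 0 <= address < 8):
        raise ValueError("ROM index or address out of range")
    return rom * address


def rom_content(phi, n_points):
    """Unquantized content: cos(2*pi*phi/N) - j*sin(2*pi*phi/N)."""
    return cmath.exp(-2j * cmath.pi * phi / n_points)
```

The two worked examples of the text are reproduced: address 001100 of ROM 4 gives 132 in the first stage, and address 101 of ROM 5 gives 25 in the second stage.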
5.5
Controller
The controller of the FFT is implemented by a simple six-bit counter. The bits of the counter drive both the control signals of the datapath and the addresses of the ROMs. The control for the datapath is depicted in Figure 5.11. The control signals of the shufflers are derived from the counter bits. Fifteen control signals are mapped to the different bits of the counter depending on the period of each signal. The MSB of the counter is mapped to the control signals with a period of 64 clock cycles, whereas the LSB is mapped to the control signals with a period of 2 clock cycles. Each of the control signals 2 to 15 is delayed by half of the sum of the periods of the preceding control signals. An equal number of buffers is used here. The delays and periods of the signals are listed in Table 5.3.

Table 5.3: Selection signal information
Control Signal   Counter signal   Period   Delays
1                Count(5)         64       0
2                Count(4)         32       32
3                Count(3)         16       48
4                Count(0)         2        56
5                Count(1)         4        57
6                Count(2)         8        59
7                Count(3)         16       63
8                Count(4)         32       71
9                Count(5)         64       87
10               Count(5)         64       119
11               Count(2)         8        151
12               Count(4)         32       155
13               Count(1)         4        171
14               Count(3)         16       173
15               Count(0)         2        181

The controller for the ROM addresses is depicted in Figure 5.12. The fourteen ROMs are controlled by the same counter. The six bits of the counter are mapped to the address bits of the first 7 ROMs, as their addresses are represented by 6 bits. The three LSBs of the counter drive the address bits of the next 7 ROMs. Equalizing delays are used for the two stages of ROMs: 56 delays for the first stage and 63 delays for the second stage.
[Figure 5.10: Block diagram of shuffler 4, with four rows of 1X32, 1X4, 1X16, 1X2, 1X8 and 1X1 shuffling circuits.]
[Figure 5.11: Control of the datapath; the counter drives shufflers 1 to 4 through delay elements (D).]
[Figure 5.12: Control of the ROM addresses; the counter drives the 64 x 7 ROMs with 6 bits and the 8 x 7 ROMs with 3 bits.]
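The entries of Table 5.3 follow from two simple rules, which the Python sketch below reproduces: bit i of the counter toggles with a period of 2^(i+1) clock cycles, and each control signal is delayed by half of the sum of the periods of the signals before it. The list of periods is copied from the table; the function names are introduced here.

```python
# Periods of control signals 1 to 15, copied from Table 5.3.
PERIODS = [64, 32, 16, 2, 4, 8, 16, 32, 64, 64, 8, 32, 4, 16, 2]


def counter_bit(period):
    """Index i of the counter bit Count(i) whose period is 2**(i + 1)."""
    return period.bit_length() - 2


def control_delays(periods):
    """Delay of each signal: half the sum of the periods of the earlier ones."""
    delays, total = [], 0
    for period in periods:
        delays.append(total // 2)
        total += period
    return delays
```

Applied to the periods above, control_delays returns the Delays column of Table 5.3. The values for signals 4 and 7, namely 56 and 63, are the same as the equalizing delays of the first and second stage ROMs.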
5.6
Methodology
For the implementation, different design tools have been used: ModelSim for functional testing, Design Compiler for synthesis and NanoSim for power calculation. VHDL has been used as the hardware description language. The basic blocks of the architecture have been written in VHDL. As the FFT has been implemented for different wordlengths, generics and generate statements have been used to make the wordlength of the blocks parameterizable. These blocks have then been used to build the FFT. Design Compiler and NanoSim have been used to calculate the area and the power consumption of the FFT, respectively.
5.6.1
The entity of the complex multiplier is depicted in Figure 5.13. The generics WM1 and WM2 set the wordlengths of the data input and of the coefficient input, respectively. The basic block of the complex multiplier is a real multiplier, for which a Wallace tree multiplier has been used. A pipeline of 5 stages has been used in the adder tree to reduce the critical path as well as to reduce the latency. The complex multiplier keeps the same wordlength at the input and the output by discarding the LSBs of the result.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use ieee.std_logic_unsigned.all;

entity complex_multiplier is
  generic(
    WM1 : integer := 3;   -- wordlength of the data input and of the output
    WM2 : integer := 2);  -- wordlength of the coefficient
  port(
    in_real    : in  std_logic_vector(WM1-1 downto 0);
    in_imag    : in  std_logic_vector(WM1-1 downto 0);
    coeff_real : in  std_logic_vector(WM2-1 downto 0);
    coeff_imag : in  std_logic_vector(WM2-1 downto 0);
    clk        : in  std_logic;
    reset      : in  std_logic;
    mult_real  : out std_logic_vector(WM1-1 downto 0);
    mult_imag  : out std_logic_vector(WM1-1 downto 0));
end complex_multiplier;
The entity of the butterfly is shown in Figure 5.14. Generics are used to set the wordlength (WL) and the truncation (TE). The butterfly keeps the input wordlength at the output when TE equals 0 and increases it by one bit when TE equals 1. The basic radix-2 butterfly is used to build the radix-8 one. The entity of the shuffler is depicted in Figure 5.15. The generics WL, Lin, Lout and BT set the wordlength, the lengths of the input and output buffers, and the selection between memory and shift register. The study of memories and shift registers has shown that memories consume less power and take less area for long buffers, and that the opposite holds for short buffers. For this implementation, both options have been considered in order to optimize power and area. These basic components have been used to build the radix-8 butterfly and the shufflers. The twiddle factors for the complex multipliers have been calculated in Matlab, which has also been used to generate the VHDL code for the ROMs. These ROMs, the radix-8 butterflies, the shufflers and the complex multipliers have been used to build the FFT. A simple six-bit counter controls the FFT.
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;
use ieee.numeric_std.all;

entity butterfly is
  generic(
    WL : integer := 3;   -- input wordlength
    TE : integer := 1);  -- 1: output grows by one bit, 0: wordlength is kept
  port(
    in_1_real  : in  std_logic_vector(WL-1 downto 0);
    in_1_imag  : in  std_logic_vector(WL-1 downto 0);
    in_2_real  : in  std_logic_vector(WL-1 downto 0);
    in_2_imag  : in  std_logic_vector(WL-1 downto 0);
    clk        : in  std_logic;
    out_1_real : out std_logic_vector(WL-1+TE downto 0);
    out_1_imag : out std_logic_vector(WL-1+TE downto 0);
    out_2_real : out std_logic_vector(WL-1+TE downto 0);
    out_2_imag : out std_logic_vector(WL-1+TE downto 0));
end butterfly;
5.6.2
Functionality testing
The functionality of the FFT and of the individual components has been tested with ModelSim. Test benches have been built for the individual components and their functionality has been verified. Input and output sequences for the FFT have been generated in Matlab, and the same input sequences have been used in the test bench of the FFT. The output sequences of the FFT have been compared with the output sequences generated by Matlab. In addition, the data management of the datapath has been tested without the radix-8 butterflies and complex multipliers, by applying input sequences in natural order together with the proper control signals.
5.6.3
The FFT and the individual components have been synthesized using Design Compiler with the CORE65LPSVT library, which targets a 65 nm process technology. Design Compiler has been used to synthesize the design and optimize its area for a specific clock, as well as to generate the netlist of the design. The area of the FFT has been calculated by Design Compiler.
5.6.4
Power calculation
The power consumption has been calculated with NanoSim. Random sequences for the FFT and the individual components have been generated using Matlab. The netlist generated by Design Compiler and the random sequences have been used to calculate the power. Voltage scaling has been applied to the design by changing the supply voltage in the SPICE file.
library ieee;
use ieee.std_logic_1164.all;

entity shuffler is
  generic(
    WL   : integer := 10;  -- wordlength
    Lin  : integer := 20;  -- length of the input buffer
    Lout : integer := 10;  -- length of the output buffer
    BT   : integer := 1);  -- buffer type: memory or shift register
  port(
    in0  : in  std_logic_vector(WL-1 downto 0);
    in1  : in  std_logic_vector(WL-1 downto 0);
    clk  : in  std_logic;
    sel  : in  std_logic;
    out0 : out std_logic_vector(WL-1 downto 0);
    out1 : out std_logic_vector(WL-1 downto 0));
end shuffler;
5.7
The dynamic power consumption can be expressed as
\[ P_{dyn} = \alpha\, c\, V_{dd}^{2}\, f \]
In the equation, c is the area capacitance, $V_{dd}$ is the supply voltage, f is the clock frequency and $\alpha$ is the switching activity. The dynamic power can be reduced by reducing the supply voltage $V_{dd}$, the area capacitance c and the clock frequency f. The area capacitance is indirectly related to the clock frequency: it can be reduced by reducing the clock frequency.
For optimizing the power of the FFT, frequency scaling and voltage scaling have been applied. Initially, the FFT was synthesized for 380 MHz so that it could operate at any clock below 380 MHz. Due to the higher clock, the FFT takes more area, which results in a higher capacitance and more power consumption. Voltage scaling can reduce the power, but it does not change the area capacitance. Therefore, frequency scaling has been used to reduce the area and the power consumption: a 330 MHz clock has been used to reduce the area as well as the capacitance of the FFT. The bar charts in Figure 5.16 show the difference in power and area for both clocks. The blue bars show the area and power for 380 MHz and the brown bars for 330 MHz. The results are shown for wordlengths 8, 12 and 16.
Voltage scaling has been applied to reduce the power consumption of the FFT. Initially, the power of the FFT was calculated for 1.2 V, where there was a slack time of 0.5 ns. The voltage has been reduced from 1.2 V to 0.8 V, which also reduces the slack time. The bar chart in Figure 5.17 shows the change in power after voltage scaling. In the figure, the blue bars show the power consumption for 1.2 V and the brown bars for 0.8 V.
Shift registers have been replaced by memories for buffer lengths over 8. On the one hand, the switching activity of shift registers increases with the length of the buffer, which causes more power consumption. On the other hand, the switching activity remains constant for the memories. Therefore, memories have been used for large buffers and shift registers for small buffers. Finally, one large wordlength buffer has been replaced by multiple small wordlength buffers in parallel. By this technique the number of read and write pointers has been reduced for the memories. The area and the power of the memories and shift registers for different lengths are given in Table 5.4, and the bar charts in Figure 5.18 show the relative comparison between memories and shift registers for lengths from 2 to 32. In Figure 5.18, the blue bars represent the memories and the brown bars the shift registers. The area of both the memories and the shift registers increases with the length of the buffer. The bar chart for power consumption in Figure 5.18 shows that the power consumption of the shift registers increases with the buffer length, because their switching activity increases with the length. The power consumption of the memories, however, remains almost constant, as their switching activity is the same for any buffer length.

Figure 5.16: Area and power consumption of the FFT before and after frequency scaling.
[Figure 5.17: Power consumption (mW) of the FFT at 1.2 V and 0.8 V.]

Table 5.4: Memory and shift register performance for different buffer lengths
          Memory                       Shift register
Length    Area (µm²)    Power (µW)     Area (µm²)    Power (µW)
2         981.75        80.3306        281.63        38.1587
4         1095.63       83.3872        525.19        60.8420
8         1358.23       86.8200        998.91        122.4963
16        1865.86       91.3967        1943.75       246.1874
32        2771.59       94.3003        3836.04       502.1067

Figure 5.18: Power and area for different length buffer.
The performance of the complex multipliers and the radix-8 butterfly has been evaluated in terms of power and area. The area and the power of the complex multiplier and the radix-8 butterfly are given in Table 5.5 for wordlengths 8, 12 and 16. The bar charts in Figure 5.19 and Figure 5.20 show the power consumption and area of the complex multiplier and the radix-8 butterfly, respectively. Table 5.5 shows the trade-off between cost and wordlength: power consumption and area increase with the wordlength.
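The choice between memory and shift register can be read off Table 5.4 mechanically, as in the following Python sketch. The numbers are copied from the table, with areas assumed to be in µm² and powers in µW; the dictionary layout and the function name are introduced here.

```python
# length: (memory area, memory power, shift register area, shift register power)
TABLE_5_4 = {
    2: (981.75, 80.3306, 281.63, 38.1587),
    4: (1095.63, 83.3872, 525.19, 60.8420),
    8: (1358.23, 86.8200, 998.91, 122.4963),
    16: (1865.86, 91.3967, 1943.75, 246.1874),
    32: (2771.59, 94.3003, 3836.04, 502.1067),
}


def cheaper_buffer(length, metric):
    """Return the implementation with the lower area or power for a length."""
    mem_area, mem_power, sr_area, sr_power = TABLE_5_4[length]
    if metric == "area":
        memory, shift_register = mem_area, sr_area
    elif metric == "power":
        memory, shift_register = mem_power, sr_power
    else:
        raise ValueError("metric must be 'area' or 'power'")
    return "memory" if memory < shift_register else "shift register"
```

According to these figures the memory wins in power from length 8 upwards and in area from length 16 upwards, so for lengths above 8 the memory is better in both respects.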
5.8
Table 5.6 and bar charts in Figure 5.21 show the performance of the proposed architecture in account of power consumption and area. Figure 5.21 shows the power and area for the FFT and for the input and output reorder. The blue color of the bars show the power and area for the FFT and the brown color shows the area and power consumption of the input and output reorder. The FFT consumes
47
Table 5.5: Area and power for Word length 8 bit Complex Multiplier 12 bit 16 bit 8 bit 12 bit Radix 8 16 bit
dierent components Area Power (mm2 ) (mW ) 0.01434 0.9215 0.03222 1.6938 0.05748 2.6595 0.04826 6.3843 0.10402 9.2194 0.18221 12.2570
Power (mW)
Area (mm2)
12 Wordlength
16
Figure 5.19: Power and area of complex multiplier. more power than the input and output reorder, as the computations have been done in the FFT and causes more switching activity. The number of the complex rotators, complex adders and memories for the proposed architecture are compared with previous approaches in Table 5.7. As the table shows, this architecture requires less number of complex rotators, complex adders and memories than previous approaches, so the area have been reduced. Therefore, the area capacitance have been reduced as well as power consumption for the FFT. Table 5.8 shows the comparison of the proposed architecture with the previous approaches. For the proposed approach the results are shown for wordlengths 8, 12 and 16. As a dierent technology has been used for the proposed design, the power consumption and area need to be normalized. Power consumption and area have been normalized by Equation 5.3 and 5.2 according to [25, 26]: Normalized Area = Normalized Power = Area (Tech./65nm)2 (5.2) (5.3)
Table 5.8 shows that the proposed architecture achieves higher throughput and
[Figure 5.20: power (mW) and area (mm²) of the radix-8 butterfly versus wordlength]
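Table 5.6: FFT performance for

| Component | Word length |
|---|---|
| Complete system | 8 bit |
| Complete system | 12 bit |
| Complete system | 16 bit |
| FFT | 8 bit |
| FFT | 12 bit |
| FFT | 16 bit |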
better efficiency in terms of power consumption and area. For a wordlength of 12 bits, the proposed architecture reduces the power consumption by 10% and the area by 31% with respect to the previous approach with the same wordlength and FFT size [24].
[Figure 5.21: power (mW) and area (mm²) of the FFT and of the input and output reorder versus wordlength]
Table 5.7: Comparison of architectures for the computation of a 512-point 8-parallel FFT.

| Type | Radix | Complex rotators | Complex adders | Complex sample memory |
|---|---|---|---|---|
| FF (MDC) | Radix-8, [22] | 14(6) | 72 | 1170 |
| FF (MDC) | Radix-2, [23] | 28 | 72 | 504 |
| FB (MDF) | Radix-2, [9] | 28 | 144 | 504 |
| Iterative | Radix-16 + 2, [24] | 32 | 256 | 1024 |
| FF (MDC) | Proposed, radix-8 | 14(6) | 72 | 504 |
Table 5.8: Comparison of various FFTs for WPAN applications

| Parameters | [24] | [27] | [28] | Proposed, 8-bit | Proposed, 12-bit | Proposed, 16-bit |
|---|---|---|---|---|---|---|
| Architecture | Iterative | FB (MDF) | FB (MDF) | FF (MDC) | FF (MDC) | FF (MDC) |
| Point (N) | 512 | 2048 | 2048 | 512 | 512 | 512 |
| Radix (r) | 16 + 2 | Mixed | 2 | 8 | 8 | 8 |
| Parallel samples (P) | 8 | 4 | 8 | 8 | 8 | 8 |
| Wordlength (bit) | 12 | 9 | 9 | 8 | 12 | 16 |
| Process (nm) | 90 | 90 | 90 | 65 | 65 | 65 |
| Voltage (V) | 1 | 1 | 1 | 0.8 | 0.8 | 0.8 |
| Clock (MHz) | 324 | 300 | 300 | 330 | 330 | 330 |
| Throughput (GS/s) | 2.59 | 1.2 | 2.4 | 2.64 | 2.64 | 2.64 |
| Area (mm²) | 2.46 | 0.97 | 1.16 | 0.391 | 0.881 | 1.439 |
| Normalized area | 1.28 | 0.5 | 0.6 | 0.391 | 0.881 | 1.439 |
| Power (mW) | 103.5 | 117 | 159 | 38.49 | 42.04 | 61.51 |
| Normalized power | 47.84 | 54.08 | 73.49 | 38.49 | 42.04 | 61.51 |
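The throughput figures in Table 5.8 are consistent with each design processing P samples in every clock cycle. The sketch below checks this relation; it assumes throughput equals P times the clock frequency, and the function name `throughput_gsps` is illustrative.

```python
def throughput_gsps(parallel_samples, clock_mhz):
    """Throughput in GS/s, assuming P samples are processed every clock cycle."""
    return parallel_samples * clock_mhz / 1000.0


# (P, clock in MHz, reported throughput in GS/s) taken from Table 5.8
rows = {
    "[24]": (8, 324, 2.59),
    "[27]": (4, 300, 1.2),
    "[28]": (8, 300, 2.4),
    "proposed": (8, 330, 2.64),
}

for name, (p, clk, reported) in rows.items():
    computed = throughput_gsps(p, clk)
    print(f"{name}: computed {computed:.3f} GS/s, reported {reported} GS/s")
```

Under this assumption, the proposed design at 330 MHz with eight parallel samples gives the 2.64 GS/s reported for all three wordlengths.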
Chapter 6
Conclusion and Future Work

6.1 Conclusion
Based on the results, the following conclusions can be drawn:

- A high level model of the standard has been built, and the BER has been calculated for different levels of SNR and wordlength.
- The FFT is parameterizable, which allows the wordlength to be chosen.
- The FFT has been optimized in order to reduce the area and the power consumption.
- Better results than previous approaches have been obtained.
- Radix-8 and 8 parallel samples reduce the number of hardware elements (20 complex multipliers are used).
- Only simple control is needed for this architecture.
6.2 Future work
Future work on this topic can improve the results, specifically:

- The high level model can be improved by using the ASIC toolbox, which would make the model more realistic from the hardware point of view.
- The channel model can be made more realistic by using small scale fading.
- The ASIC can be fabricated to measure the performance on hardware.
- Constant multiplications in the radix-8 butterfly can be simplified.
- A reconfigurable FFT that supports all the modes of this standard can be implemented.
- Other blocks such as forward error correction and modulation can be implemented on ASIC.
Bibliography
[1] IEEE 802.15.3c, Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for High Rate Wireless Personal Area Networks (WPANs). 2009.
[2] S.-K. S. Yong, P. Xia, and A. Valdes-Garcia, 60 GHz Technology For Gbps WLAN and WPAN. Wiley-IEEE Press, 2010.
[3] L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing. Prentice Hall, 1975.
[4] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Prentice Hall, 1989.
[5] W. W. Smith and J. M. Smith, Handbook of Real-Time Fast Fourier Transforms. Wiley-IEEE Press, 1995.
[6] W. Cochran, J. Cooley, D. Favin, H. Helms, R. Kaenel, W. Lang, G. C. Maling Jr., D. Nelson, C. Rader, and P. Welch, "What is the fast Fourier transform?," Proceedings of the IEEE, vol. 55, pp. 1664-1674, Oct. 1967.
[7] M. Garrido, Efficient hardware architectures for the computation of the FFT and other related signal processing algorithms in real time. PhD thesis, Universidad Politécnica de Madrid, 2009.
[8] E. Swartzlander, W. Young, and S. Joseph, "A radix 4 delay commutator for fast Fourier transform processor implementation," IEEE Journal of Solid-State Circuits, vol. 19, pp. 702-709, Oct. 1984.
[9] E. Wold and A. Despain, "Pipeline and Parallel-Pipeline FFT Processors for VLSI Implementations," IEEE Transactions on Computers, vol. C-33, pp. 414-426, May 1984.
[10] A. Despain, "Fourier Transform Computers Using CORDIC Iterations," IEEE Transactions on Computers, vol. C-23, pp. 993-1001, Oct. 1974.
[11] S. He and M. Torkelson, "Design and implementation of a 1024-point pipeline FFT processor," in Custom Integrated Circuits Conference, 1998. Proceedings of the IEEE 1998, pp. 131-134, May 1998.
[12] M. Sánchez, M. Garrido, M. López Vallejo, J. Grajal, and C. Lopez-Barrio, "Digital channelised receivers on FPGAs platforms," in Radar Conference, 2005 IEEE International, pp. 816-821, May 2005.
[13] S. He and M. Torkelson, "Designing pipeline FFT processor for OFDM (de)modulation," in 1998 URSI International Symposium on Signals, Systems, and Electronics, Sep. 1998.
[14] Y. Chang and K. Parhi, "An efficient pipelined FFT architecture," in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, June 2003.
[15] S. Johansson, S. He, and P. Nilsson, "Wordlength optimization of a pipelined FFT processor," in 42nd Midwest Symposium, Circuits and Systems, vol. 1, Aug.
[16] L. Johnson, "Conflict free memory addressing for dedicated FFT hardware," in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, May 1992.
[17] Y. Ma, "An effective memory addressing scheme for FFT processors," in IEEE Transactions on Signal Processing, March 1999.
[18] C. Wang and C. Chang, "A new memory-based FFT processor for VDSL transceivers," in IEEE International Symposium on Circuits and Systems, May 2001.
[19] A. Batra, J. Balakrishnan, G. R. Aiello, J. R. Foerster, and A. Dabak, "Design of a multiband OFDM system for realistic UWB channel environment," in IEEE Transactions on Microwave Theory and Techniques, Sept. 2004.
[20] J. Lee, H. Lee, S.-I. Cho, and S.-S. Choi, "A high-speed, low-complexity radix-2^4 FFT processor for MB-OFDM UWB systems," in IEEE International Symposium on Circuits and Systems, 2006, May 2006.
[21] S.-M. Kim, J.-G. Chung, and K. Parhi, "Low error fixed-width CSD multiplier with efficient sign extension," in IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Dec. 2003.
[22] M. Sánchez, M. Garrido, M. López, and J. Grajal, "Implementing FFT-based digital channelized receivers on FPGA platforms," IEEE Transactions on Aerospace and Electronic Systems, vol. 44, pp. 1567-1585, Oct. 2008.
[23] J. Johnston, "Parallel pipeline fast Fourier transformer," in IEE Proc. F Comm. Radar Signal Process., vol. 130, pp. 564-572, Oct. 1983.
[24] S.-J. Huang and S.-G. Chen, "A Green FFT Processor with 2.5-GS/s for IEEE 802.15.3c (WPANs)," in International Conference on Green Circuits and Systems (ICGCS), pp. 9-13, June 2010.
[25] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, "A 2.4-Gsample/s DVFS FFT Processor for MIMO OFDM Communication Systems," IEEE Journal of Solid-State Circuits, vol. 43, pp. 1260-1273, May 2008.
[26] B. Baas, "A low-power, high-performance, 1024-point FFT processor," IEEE Journal of Solid-State Circuits, vol. 34, pp. 380-387, Mar. 1999.
[27] Y. Chen, Y.-C. Tsao, Y.-W. Lin, C.-H. Lin, and C.-Y. Lee, "An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, pp. 146-150, Feb. 2008.
[28] S.-N. Tang, J.-W. Tsai, and T.-Y. Chang, "A 2.4-GS/s FFT Processor for OFDM-Based WPAN Applications," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, pp. 451-455, June 2010.