
A Systolic FFT Architecture for Real Time FPGA Systems


Preston Jackson, Cy Chan, Charles Rader,
Jonathan Scalera, and Michael Vai
HPEC 2004
29 September 2004

This work was sponsored by DARPA ATO under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions
and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory


Systolic Architecture-1
PAJ 9/29/2004

Outline

Introduction
  Motivation
  Evaluation metrics
Parallel architecture
Systolic architecture
Performance summary
Conclusions

Radar Processing Application


[Figure: block diagram of the 32K correlation. Block labels: ADC x (1.2 GSPS), ADC y (1.2 GSPS), I/Q, FFT, FIFO, Conjugate]

$\mathrm{Corr}_{x,y}[m] = \sum_{n} x[n]\, y[n - m]$

8K FFT bottleneck
  Real-time
  Complex
  0.6 GSPS input (16-bits)
  1.2 GSPS output (12-bits)
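
The correlation on this slide is typically evaluated through FFTs, which is why the FFT becomes the bottleneck. The following is a minimal Python sketch, not the authors' implementation. It assumes a circular correlation over zero-padded frames and applies the conjugate in the frequency domain, as the Conjugate block in the diagram suggests; for real-valued y this matches the formula above. The names `fft`, `circular_correlation`, and `direct_correlation`, and the short 8-sample frames, are illustrative only.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unscaled); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def circular_correlation(x, y):
    """corr[m] = sum_n x[n] * conj(y[(n - m) mod N]), computed with FFTs."""
    n = len(x)
    spectrum = [a * b.conjugate() for a, b in zip(fft(x), fft(y))]
    return [v / n for v in fft(spectrum, inverse=True)]

def direct_correlation(x, y):
    """Same quantity from the defining sum, used here only as a cross-check."""
    n = len(x)
    return [sum(x[i] * complex(y[(i - m) % n]).conjugate() for i in range(n))
            for m in range(n)]

# Assumed toy data: 4 samples padded with 4 zeros, mirroring the zero padding
# used by the target application at a much smaller scale.
x = [1.0, 2.0, 3.0, 4.0] + [0.0] * 4
y = [0.5, -1.0, 2.0, 1.0] + [0.0] * 4
fast = circular_correlation(x, y)
slow = direct_correlation(x, y)
```

The two results agree to rounding error, and the FFT route replaces the N-squared sum with three transforms and one pointwise product.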

Evaluation Scorecard

The design changes will be scored based on the following metrics:
  Length of FFT (Size)
  IO pins (Pins)
  Butterflies (Fly)
  Multipliers (Mult)
  Adder/subtractors (Add)
  Shift registers (Shift)

Size    16     8192
Pins
Fly
Mult
Add
Shift

Outline

Introduction
Parallel architecture
  Data flow graph
  Effects of serial input
Systolic architecture
Performance summary
Conclusions

Baseline Parallel Architecture

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult
Add
Shift

[Figure: data flow graph of the parallel FFT, with rows labeled up to 16]

Parallel FFT
  Butterfly structure
  Removes redundant calculation

Complex Butterfly

Butterfly contains
  1 complex addition
  1 complex subtraction
  1 complex, constant multiply

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult
Add
Shift

[Figure: butterfly with twiddle factor $W_N^r$]
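
The butterfly contents listed on this slide can be sketched in a few lines of Python. This is an illustration under assumptions: the slide does not say whether the constant multiply comes before or after the add/subtract, so the sketch uses the decimation-in-frequency ordering (multiply after the subtraction). The function names `twiddle` and `butterfly` are made up for the example.

```python
import cmath

def twiddle(r, n):
    """Twiddle factor W_N^r = exp(-j * 2 * pi * r / N)."""
    return cmath.exp(-2j * cmath.pi * r / n)

def butterfly(a, b, w):
    """One complex add, one complex subtract, one multiply by the constant w."""
    return a + b, (a - b) * w

# A 2-point DFT is a single butterfly with W_2^0 = 1.
top, bottom = butterfly(1 + 0j, 2 + 0j, twiddle(0, 2))
```

Here `top` is the sum of the two inputs and `bottom` is their difference, which is the 2-point DFT of the pair.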

Complex Addition

Complex addition adds the real and imaginary parts separately:

(a + jb) + (c + jd) = (a + c) + j(b + d)

2 adds

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult
Add     128    213K
Shift

[Figure: adder diagram; a and c form the real part, b and d the imaginary part]

Complex Multiply

The FOIL method of multiplying complex numbers:

(a + jb)(c + jd) = (ac - bd) + j(ad + bc)

4 multiplies and 2 adds

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult    128    213K
Add     192    320K
Shift

[Figure: multiplier diagram with inputs a, b, c, d and outputs real, imag]

Efficient Complex Multiply

Another approach requires fewer multiplies:

(ad + bc) = c(a + b) - a(c - d)
(ac - bd) = d(a - b) + a(c - d)

3 multiplies and 5 adds

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult    96     159K    75%
Add     288    480K    150%
Shift

[Figure: multiplier diagram with inputs a, b, c, d and outputs real, imag]
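
The two identities above share the term a(c - d), which is what brings the count down to 3 multiplies and 5 adds. A short Python sketch of that bookkeeping follows; the function name `complex_multiply_3` and the test values are illustrative.

```python
def complex_multiply_3(a, b, c, d):
    """(a + jb)(c + jd) using 3 real multiplies and 5 real adds/subtracts."""
    shared = a * (c - d)           # multiply 1, add 1
    real = d * (a - b) + shared    # multiply 2, adds 2 and 3
    imag = c * (a + b) - shared    # multiply 3, adds 4 and 5
    return real, imag

# (3 + 4j)(5 - 2j) = 23 + 14j
real, imag = complex_multiply_3(3, 4, 5, -2)
```

The trade is one multiplier for three extra adders, which suits an FPGA where dedicated multipliers are scarcer than adder logic.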

Parallel-Pipelined Architecture

Size    16     8192
Pins    448    229K
Fly     32     53K
Mult    96     159K
Add     288    480K
Shift

[Figure: data flow graph of the parallel-pipelined FFT, with rows labeled up to 16]

A pipelined version
  IO Bound
  100% Efficient
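
The scorecard entries for the parallel design follow from simple counting: (N/2) log2 N butterflies, each with 3 multipliers and 9 adders (4 for the complex add and subtract, 5 for the efficient multiply). The Python sketch below reproduces them. The 28 pins per sample are inferred from 448 / 16; reading that as 16 input bits plus 12 output bits is an assumption, and `parallel_fft_resources` is a hypothetical helper name.

```python
def parallel_fft_resources(n, bits_per_sample=28, mults_per_fly=3, adds_per_fly=9):
    """Resource counts for a fully parallel radix-2 FFT of length n (power of two)."""
    stages = n.bit_length() - 1          # log2(n)
    flies = (n // 2) * stages            # n/2 butterflies in each stage
    return {
        "pins": n * bits_per_sample,     # assumed: every sample has its own pins
        "fly": flies,
        "mult": flies * mults_per_fly,
        "add": flies * adds_per_fly,
    }

small = parallel_fft_resources(16)
large = parallel_fft_resources(8192)
```

For 8192 points this gives 229,376 pins, 53,248 butterflies, 159,744 multipliers, and 479,232 adders, in line with the approximate 229K, 53K, 159K, and 480K shown in the table.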

Serial Input

Size    16     8192
Pins    28     28      .01%
Fly     32     53K
Mult    96     159K
Add     288    480K
Shift

[Figure: data flow graph of the parallel FFT with serial input, with rows labeled up to 16]

A serial version
  IO-rate matches A/D
  6.25% Efficient

Outline

Introduction
Parallel architecture
Systolic architecture
  Serial implementation
  Application specific optimizations
Performance summary
Conclusions

Serial Architecture

The parallel architecture can be collapsed
  One butterfly per stage
  Consumes 1 sample per cycle
  Same latency and throughput
  More efficient design
50% Efficiency

Size    16     8192
Pins    28     28
Fly     4      13      .03%
Mult    12     39      .03%
Add     36     117     .03%
Shift   22     12K

[Figure: serial pipeline with one butterfly per stage (Stage 1 to Stage 4)]

High Level View

Replace complex structure with an abstract cell which contains:
  FIFOs
  Butterfly
  Switch network

Size    16     8192
Pins    28     28
Fly     4      13
Mult    12     39
Add     36     117
Shift   22     12K

[Figure: four abstract cells in series (Stage 1 to Stage 4)]
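
One common way to build a cell from a FIFO, a butterfly, and a switch is the radix-2 single-path delay-feedback pipeline. The Python sketch below models that arrangement at the stream level; whether the slides use exactly this arrangement is an assumption, and `systolic_cell` and `serial_fft` are illustrative names. Each cell takes one sample per step, and the chain emits results in bit-reversed order, which the last line undoes.

```python
import cmath
from collections import deque

def systolic_cell(stream, delay, n):
    """One stage: a FIFO of `delay` samples, one butterfly, and a two-way switch."""
    fifo = deque([0j] * delay)
    stride = n // (2 * delay)            # twiddle exponent step for this stage
    out = []
    # Append `delay` zeros so the last differences drain out of the FIFO.
    for t, x in enumerate(list(stream) + [0j] * delay):
        held = fifo.popleft()
        phase = t % (2 * delay)
        if phase < delay:
            # Switch position A: buffer the input, emit what the FIFO held.
            fifo.append(x)
            out.append(held)
        else:
            # Switch position B: butterfly the buffered sample with the input.
            w = cmath.exp(-2j * cmath.pi * (phase - delay) * stride / n)
            out.append(held + x)
            fifo.append((held - x) * w)
    return out[delay:]                   # drop the initial FIFO fill

def serial_fft(samples):
    """Chain log2(n) cells with FIFO lengths n/2, n/4, ..., 1."""
    n = len(samples)
    stream = list(samples)
    delay = n // 2
    while delay >= 1:
        stream = systolic_cell(stream, delay, n)
        delay //= 2
    bits = n.bit_length() - 1
    # The pipeline emits results in bit-reversed order; reorder to natural order.
    return [stream[int(format(k, f"0{bits}b")[::-1], 2)] for k in range(n)]

x = [complex(i % 5, -i) for i in range(16)]
spectrum = serial_fft(x)
```

In this model the butterfly does useful work only in switch position B, that is, on half of the steps.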

8192-Point Architecture

Requires 13 stages
Fixed point arithmetic
  Varies the dynamic range to increase accuracy
  Overflow replaced with saturated value
Multipliers limit design to 18-bits and 150 MHz
Achieves 70 dB of accuracy

Size    16     8192
Pins    28     28
Fly     4      13
Mult    12     39
Add     36     117
Shift   22     12K

[Figure: 13-stage pipeline annotated with 18-bit fixed-point formats that widen from 4 int / 14 frac to 10 int / 8 frac (4/14, 5/13, 6/12, 7/11, 8/10, 9/9, 10/8)]

Example with 4 int and 4 frac bits: 0110.0101 = 6 + 5/16
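
The fixed-point behavior described here (a chosen split between integer and fractional bits, with overflow replaced by a saturated value) can be illustrated with a small Python sketch. It assumes signed two's-complement numbers in which the integer bit count includes the sign bit, and round-to-nearest quantization; neither detail is stated on the slide, and `to_fixed` and `from_fixed` are hypothetical helper names.

```python
def to_fixed(value, int_bits, frac_bits):
    """Quantize to signed fixed point and saturate on overflow; returns the raw integer."""
    total_bits = int_bits + frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = round(value * (1 << frac_bits))
    return max(lowest, min(highest, raw))   # saturate instead of wrapping

def from_fixed(raw, frac_bits):
    """Convert a raw fixed-point integer back to a float."""
    return raw / (1 << frac_bits)

example = to_fixed(6.3125, 4, 4)                    # bit pattern 0110.0101
clipped = from_fixed(to_fixed(9.0, 4, 4), 4)        # 9.0 does not fit in 4 int bits
```

Moving one bit from the fractional part to the integer part doubles the representable range at the cost of one bit of resolution, which is the trade the widening formats make as values grow through the pipeline.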

Increase Parallelism

Add more pipelines
  Design limited to 150 MHz by multipliers
  I/Q module generates 600 MSPS
  Meets real-time requirement through parallelism

Size    16     8192
Pins    112    112     400%
Fly     16     52      400%
Mult    48     156     400%
Add     144    468     400%
Shift   16     12K     100%

[Figure: four parallel 13-stage pipelines]

Simplification

Target application allows a specific simplification
  Pads a 4096-point sequence with 4096 zeros
  Removes 1st stage multipliers and adders
  Achieves 100% efficiency in steady state

Size    16     8192
Pins    160    160     143%
Fly     16     52
Mult    36     144     92%
Add     108    432     92%
Shift          8K      67%

[Figure: four parallel 13-stage pipelines]
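
The butterfly, multiplier, and adder counts for the systolic design can be reproduced with a little arithmetic: one butterfly per stage per pipeline, 3 multipliers and 9 adders per butterfly, and no arithmetic in any stage that the zero padding makes trivial. The Python sketch below is a bookkeeping aid under those assumptions; `systolic_fft_resources` and its parameter names are hypothetical.

```python
def systolic_fft_resources(n, pipelines=1, skipped_stages=0,
                           mults_per_fly=3, adds_per_fly=9):
    """Counts for `pipelines` serial FFT pipelines of length n (power of two)."""
    stages = n.bit_length() - 1                      # log2(n) stages per pipeline
    flies = pipelines * stages                       # one butterfly per stage
    active = pipelines * (stages - skipped_stages)   # stages that still need arithmetic
    return {
        "fly": flies,
        "mult": active * mults_per_fly,
        "add": active * adds_per_fly,
    }

# 600 MSPS from the I/Q module at 150 MHz per pipeline needs 4 pipelines.
pipelines = 600 // 150

serial = systolic_fft_resources(8192)
parallel = systolic_fft_resources(8192, pipelines)
simplified = systolic_fft_resources(8192, pipelines, skipped_stages=1)
```

The three results correspond to the single serial pipeline (13, 39, 117), the four-pipeline design (52, 156, 468), and the zero-padded simplification (52, 144, 432).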

Outline

Introduction
Parallel architecture
Systolic architecture
Performance summary
  Power, operations per second
  FPGA resources, frequency
  Latency, throughput
Conclusions

Results

The current implementation has been placed on a Virtex II 8000 and verified at 150 MHz
  Power: 22 Watts @ 65 C
  GOPS: 86 total @ 3.9 GOPS/Watt

FPGA resources (XC2V8000)
  Multipliers: 144 (85%)
  LUTs and SRLs: 39,453 (42%)
  BlockRAM: 56 (33%)
  Flip flops: 35,861 (38%)

Frequency: 150 MHz
Latency: 1127 cycles
Throughput: 1.2 GSPS

Outline

Introduction
Parallel architecture
Systolic architecture
Performance summary
Conclusions
  Applicability to other platforms
  Future work

Conclusions

Created a high performance, real-time FFT core
  Low power (3.9 GOPS/Watt)
  High throughput (1.2 GSPS), low latency (7.6 µsec/sample)
  Fixed-point (18-bits), high accuracy (70 dB)

General architecture
  Extendable to a generic FPGA core
  Retargetable to ASIC technology

Future work
  Develop a parameterizable IP core generator
