
COSC6365 Lennart Johnsson

2014-04-01

Introduction to HPC

Lecture 20

Lennart Johnsson
Dept of Computer Science


Fast Fourier Transform


Signal Processing
Image Processing
Medical Imaging
In Radio Astronomy
In Electron Microscopy
Modeling and Solution of PDEs


FFT: O(P log P)
Cooley-Tukey, radix-2, -4, -8, ...
Mixed-radix Cooley-Tukey (MR)
Prime Factor Algorithm (PFA)
Split-radix (SR)
Rader's Algorithm
Transforms on real data
Sine, Cosine transforms: transforms on real data with symmetry (odd and even symmetry, respectively)
(Cosine transforms are used for JPEG, MPEG, ...)


The Fast Fourier Transform

The Discrete Fourier Transform (DFT)


$$X(m) = \sum_{j=0}^{P-1} \omega_P^{mj}\, x(j), \quad \forall m \in [0, P-1], \quad \omega_P = e^{-2\pi i/P}$$

The Inverse Discrete Fourier Transform (IDFT)


$$x(j) = \frac{1}{P} \sum_{m=0}^{P-1} \omega_P^{-mj}\, X(m), \quad \forall j \in [0, P-1], \quad \omega_P = e^{-2\pi i/P}$$


The DFT

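$$
\begin{pmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \\ \vdots \\ X(P-1) \end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega_P^{1} & \omega_P^{2} & \omega_P^{3} & \cdots & \omega_P^{(P-1)} \\
1 & \omega_P^{2} & \omega_P^{4} & \omega_P^{6} & \cdots & \omega_P^{2(P-1)} \\
1 & \omega_P^{3} & \omega_P^{6} & \omega_P^{9} & \cdots & \omega_P^{3(P-1)} \\
\vdots & & & & & \vdots \\
1 & \omega_P^{(P-1)} & \omega_P^{2(P-1)} & \omega_P^{3(P-1)} & \cdots & \omega_P^{(P-1)(P-1)}
\end{pmatrix}
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ \vdots \\ x(P-1) \end{pmatrix}
$$

X = Wx
The DFT is indeed a matrix-vector multiplication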
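As a minimal sketch of the matrix-vector view, the following builds W entry by entry and multiplies it with a short vector. The helper names `dft_matrix` and `matvec` are illustrative choices, not taken from the lecture, and the O(P^2) cost is exactly what the FFT avoids.

```python
import cmath

def dft_matrix(P):
    """Return W with W[m][j] = omega_P**(m*j), where omega_P = exp(-2*pi*i/P)."""
    omega = cmath.exp(-2j * cmath.pi / P)
    return [[omega ** (m * j) for j in range(P)] for m in range(P)]

def matvec(W, x):
    """Plain matrix-vector product, O(P^2) operations."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

x = [1, 2, 3, 4]
X = matvec(dft_matrix(len(x)), x)   # X = W x
```

For this input the result is X = (10, -2+2i, -2, -2-2i) up to rounding.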


The Inverse DFT


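$$
\begin{pmatrix} x(0) \\ x(1) \\ x(2) \\ x(3) \\ \vdots \\ x(P-1) \end{pmatrix}
= \frac{1}{P}
\begin{pmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & \omega_P^{-1} & \omega_P^{-2} & \omega_P^{-3} & \cdots & \omega_P^{-(P-1)} \\
1 & \omega_P^{-2} & \omega_P^{-4} & \omega_P^{-6} & \cdots & \omega_P^{-2(P-1)} \\
1 & \omega_P^{-3} & \omega_P^{-6} & \omega_P^{-9} & \cdots & \omega_P^{-3(P-1)} \\
\vdots & & & & & \vdots \\
1 & \omega_P^{-(P-1)} & \omega_P^{-2(P-1)} & \omega_P^{-3(P-1)} & \cdots & \omega_P^{-(P-1)(P-1)}
\end{pmatrix}
\begin{pmatrix} X(0) \\ X(1) \\ X(2) \\ X(3) \\ \vdots \\ X(P-1) \end{pmatrix}
= W^{-1} X
$$

Thus, up to the factor 1/P, the elements of $W^{-1}$ are the inverses of the elements of $W$

Proof:
$$
\sum_{k=0}^{P-1} \omega_P^{ik}\, \omega_P^{-kj} = \sum_{k=0}^{P-1} \omega_P^{(i-j)k} =
\begin{cases}
P & \text{if } i = j \\
\dfrac{\omega_P^{(i-j)P} - 1}{\omega_P^{(i-j)} - 1} = 0 & \text{if } i \neq j
\end{cases}
$$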
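The orthogonality argument can be checked numerically. The sketch below multiplies W by the matrix of elementwise inverses and tests that the product is P times the identity. `dft_matrix`, `matmul` and the choice P = 8 are assumptions made for this illustration.

```python
import cmath

def dft_matrix(P, sign=-1):
    """Entries exp(sign*2*pi*i*m*j/P): sign=-1 gives W, sign=+1 its elementwise inverse."""
    return [[cmath.exp(sign * 2j * cmath.pi * m * j / P) for j in range(P)]
            for m in range(P)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P = 8
product = matmul(dft_matrix(P, -1), dft_matrix(P, +1))
# Expect P on the diagonal and 0 elsewhere, up to rounding.
is_scaled_identity = all(
    abs(product[i][j] - (P if i == j else 0)) < 1e-9
    for i in range(P) for j in range(P)
)
```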


FFT

Decimation-in-time (DIT)
Decimation-in-frequency (DIF)
Ordered vs scrambled (bit-reversed)
Self-sorting
In-place


DIF FFT

[Equations not recovered from the slide]

Now, consider even and odd l

Two half sized DFTs!
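For reference, the standard radix-2 DIF splitting can be written in the notation of the DFT definition as follows. This is the textbook form, given as a sketch; the exact layout of the slide's own equations is an assumption.

```latex
\begin{align*}
X(m) &= \sum_{j=0}^{P/2-1} \left[ x(j) + \omega_P^{mP/2}\, x(j + P/2) \right] \omega_P^{mj},
        \qquad \omega_P^{mP/2} = (-1)^m \\
X(2l) &= \sum_{j=0}^{P/2-1} \left[ x(j) + x(j + P/2) \right] \omega_{P/2}^{lj} \\
X(2l+1) &= \sum_{j=0}^{P/2-1} \left[ x(j) - x(j + P/2) \right] \omega_P^{j}\, \omega_{P/2}^{lj},
        \qquad l \in [0, P/2 - 1]
\end{align*}
```

The even-indexed outputs are a DFT of size P/2 of the sums, and the odd-indexed outputs are a DFT of size P/2 of the twiddled differences.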


DIF FFT
First computation step

The butterfly


DIF FFT
First computation step


DIF FFT
Result after recursive application of DIF splitting formula


DIF FFT

Bit-reversed order (scrambled)


Normal order
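A minimal in-place radix-2 DIF sketch, assuming P is a power of two: the input is in normal order and the output ends up in bit-reversed order. The function names are illustrative, and twiddles are recomputed on the fly for brevity rather than read from a table.

```python
import cmath

def fft_dif_inplace(a):
    """Radix-2 DIF FFT, in place: normal-order input, bit-reversed output."""
    P = len(a)
    half = P // 2
    while half >= 1:                      # one pass per stage, msb first
        span = 2 * half
        for start in range(0, P, span):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / span)   # omega_span**j
                u, v = a[start + j], a[start + j + half]
                a[start + j] = u + v                # sum goes to the upper half
                a[start + j + half] = (u - v) * w   # twiddled difference to the lower half
        half //= 2
    return a

def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)
        index >>= 1
    return rev

x = [complex(v) for v in range(8)]
scrambled = fft_dif_inplace(list(x))
ordered = [scrambled[bit_reverse(m, 3)] for m in range(8)]   # X(m) in normal order
```

X(m) is found at position bit_reverse(m), so a consumer that accepts scrambled order can skip the reordering pass.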


DIF FFT Twiddle factors


Only half as many as the number of data points (half of the unit circle)
First stage uses all P/2 rotations of $\omega_P$
2nd stage uses every other twiddle factor
3rd stage uses every fourth
...
First stage, twiddle exponent: if msb = 1, then the remaining bits define the exponent
2nd stage: if bit msb-1 = 1, then the remaining lower order bits define the exponent of $\omega_{P/2}$
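The bit rule for DIF twiddle exponents can be sketched as a small function. The name `dif_twiddle_exponent`, the 0-based stage numbering, and the convention of returning 0 for elements that are not multiplied by a twiddle are assumptions of this illustration; the exponent is expressed as a power of $\omega_P$.

```python
def dif_twiddle_exponent(index, stage, p):
    """Exponent e such that element `index` is scaled by omega_P**e in DIF stage
    `stage` (0-based), for P = 2**p points in normal input order."""
    bit = p - 1 - stage                   # bit examined in this stage (msb first)
    if (index >> bit) & 1 == 0:
        return 0                          # upper butterfly output: no twiddle
    low = index & ((1 << bit) - 1)        # remaining lower order bits
    return low << stage                   # exponent of omega_{P/2**stage} rewritten for omega_P

p = 6                                      # P = 64
stage0 = sorted({dif_twiddle_exponent(i, 0, p) for i in range(32, 64)})
stage1 = sorted({dif_twiddle_exponent(i, 1, p) for i in range(64) if (i >> 4) & 1})
stage2 = sorted({dif_twiddle_exponent(i, 2, p) for i in range(64) if (i >> 3) & 1})
```

With P = 64 the first stage uses exponents 0..31, the second every other one (0, 2, ..., 30), and the third every fourth (0, 4, ..., 28).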


Normal and Bit-Reversed orders


Index   Binary code   Reversed binary code   Bit-reversed index
  0       0000              0000                    0
  1       0001              1000                    8
  2       0010              0100                    4
  3       0011              1100                   12
  4       0100              0010                    2
  5       0101              1010                   10
  6       0110              0110                    6
  7       0111              1110                   14
  8       1000              0001                    1
  9       1001              1001                    9
 10       1010              0101                    5
 11       1011              1101                   13
 12       1100              0011                    3
 13       1101              1011                   11
 14       1110              0111                    7
 15       1111              1111                   15
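The table can be regenerated with a few lines. `bit_reverse` is an illustrative helper name, and the 4-bit width matches the 16-entry example.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    rev = 0
    for _ in range(bits):
        rev = (rev << 1) | (index & 1)    # shift the lsb of index into rev
        index >>= 1
    return rev

# Rows of (index, binary code, reversed binary code, bit-reversed index).
table = [
    (i, format(i, "04b"), format(bit_reverse(i, 4), "04b"), bit_reverse(i, 4))
    for i in range(16)
]
```

Bit reversal is its own inverse: applying `bit_reverse` twice with the same width returns the original index.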


DIT FFT

Now consider l and l+P/2

[Equations not recovered from the slide]
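For reference, the standard radix-2 DIT splitting in the notation of the DFT definition is sketched below. It is the textbook form and not necessarily the exact layout used on the slide. E and O denote the half-size DFTs of the even-indexed and odd-indexed inputs; these two names are introduced here for illustration.

```latex
\begin{align*}
X(l) &= \sum_{j=0}^{P/2-1} x(2j)\, \omega_{P/2}^{lj}
        + \omega_P^{l} \sum_{j=0}^{P/2-1} x(2j+1)\, \omega_{P/2}^{lj}
      = E(l) + \omega_P^{l}\, O(l) \\
X(l + P/2) &= E(l) - \omega_P^{l}\, O(l), \qquad l \in [0, P/2 - 1]
\end{align*}
```

The second line follows because $\omega_P^{l+P/2} = -\omega_P^{l}$ and both E and O are periodic with period P/2.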


DIT FFT
The last computation step
The DIT butterfly


DIT FFT
Result after recursive application of DIT splitting formula
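A recursive sketch of the DIT splitting, assuming P is a power of two. The name `fft_dit` is illustrative; the recursion on even and odd samples is what produces the bit-reversed input order when it is unrolled into an in-place loop.

```python
import cmath

def fft_dit(x):
    """Recursive radix-2 DIT FFT; returns the DFT of x in normal order."""
    P = len(x)
    if P == 1:
        return list(x)
    even = fft_dit(x[0::2])               # half-size DFT of even-indexed samples
    odd = fft_dit(x[1::2])                # half-size DFT of odd-indexed samples
    out = [0j] * P
    for l in range(P // 2):
        t = cmath.exp(-2j * cmath.pi * l / P) * odd[l]   # twiddle applied before the butterfly
        out[l] = even[l] + t
        out[l + P // 2] = even[l] - t
    return out

X = fft_dit([1, 2, 3, 4])
```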


DIT FFT
Bit-reversed order (scrambled)

Normal order

DIT FFT Twiddle factors


Only half as many as the number of data points (half of the unit circle)
Last stage uses all P/2 rotations of $\omega_P$
2nd to last stage uses every other twiddle factor
3rd to last stage uses every fourth
...
Last stage, twiddle exponent: if msb = 1, then the remaining bits define the exponent
2nd to last stage: if bit msb-1 of the result index = 1, then the remaining lower order bits define the exponent of $\omega_{P/2}$


DIF vs DIT FFT


DIF
Normal input order, bit-reversed output
All P/2 twiddles used in first stage, every 2nd twiddle in second stage, every fourth in 3rd stage, etc.
Twiddle exponent computed based on source index
DIT
Bit-reversed input, normal output
All P/2 twiddles used in last stage, every 2nd twiddle in second-to-last stage, every fourth in third-to-last stage, etc.
Twiddle exponent computed based on result index
Both compute butterflies on successively lower order bits of source index!


DIF FFT
[Figure, two panels: "Normal to Bit-reversed order" and "Bit-reversed to Normal order"]

DIF can be used for either normal or bit-reversed input order!


Output order always bit-reverse of input order!


DIT FFT
[Figure, two panels: "Bit-reversed to Normal order" and "Normal to Bit-reversed order"]

DIT can be used for either normal or bit-reversed input order!


Output order always bit-reverse of input order!


FFT followed by Inverse FFT


DIF DIT

Use inverse twiddles for the inverse FFT


No bit-reversal necessary!


FFT followed by Inverse FFT


DIF DIF

Use inverse twiddles for the inverse FFT


No bit-reversal necessary!
But: twiddles! The allocation for forward and inverse FFT is different!


FFT followed by Inverse FFT


DIT DIT

Use inverse twiddles for the inverse FFT


No bit-reversal necessary!
But: twiddles! The allocation for forward and inverse FFT is different!


FFT followed by Inverse FFT


DIT DIF

Use inverse twiddles for the inverse FFT


No bit-reversal necessary!
Twiddle allocation for forward and inverse the same!!


Radix-4 DIF FFT

[Equations not recovered from the slide]


FFT followed by Inverse FFT


DIF followed by DIT, or DIT followed by DIF, have the same twiddle allocation, which is important in parallel computation
DIT followed by DIT, or DIF followed by DIF, have different twiddle allocations for the forward and inverse FFT. This is a problem in parallel computation (more twiddle storage than necessary)
We illustrated this for normal input order. The same is true for bit-reversed input order
Since bit-reversal is its own inverse, no explicit bit-reversal is necessary to restore the input order for a forward followed by an inverse FFT
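The round trip without any reordering can be sketched as a DIF forward transform followed by a DIT inverse that consumes the bit-reversed result directly. The function names, the on-the-fly twiddle computation, and the placement of the 1/P scaling in the inverse are assumptions of this illustration. Each inverse stage touches the same element pairs with the conjugate of the twiddle used by the matching forward stage, which is the "same twiddle allocation" property.

```python
import cmath

def dif_forward(a):
    """Radix-2 DIF: normal-order input, bit-reversed output (in place)."""
    P, half = len(a), len(a) // 2
    while half >= 1:
        for start in range(0, P, 2 * half):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / (2 * half))
                u, v = a[start + j], a[start + j + half]
                a[start + j], a[start + j + half] = u + v, (u - v) * w
        half //= 2
    return a

def dit_inverse(a):
    """Radix-2 DIT with inverse twiddles: bit-reversed input, normal-order output."""
    P, half = len(a), 1
    while half < P:
        for start in range(0, P, 2 * half):
            for j in range(half):
                w = cmath.exp(+2j * cmath.pi * j / (2 * half))   # inverse twiddle
                u, v = a[start + j], a[start + j + half] * w
                a[start + j], a[start + j + half] = u + v, u - v
        half *= 2
    return [v / P for v in a]             # 1/P scaling of the inverse DFT

x = [complex(v, -v) for v in range(1, 9)]
roundtrip = dit_inverse(dif_forward(list(x)))
```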


Radix-4 DIF FFT


Radix-4 DIT FFT

[Equations not recovered from the slide]


Radix-4 DIT FFT


Radix-8 DIF FFT


Radix-8 DIT FFT


FFT: arithmetic and memory ops


Butterflies
           Arithmetic Operations         Storage References
FFT        Add/Sub   Mult    Total       Data    Twiddles   Total
Radix-2       6        4       10          8        2         10
Radix-4      22       12       34         16        6         22
Radix-8      66       32       98         32       14         46

FFT
           Arithmetic Operations                  Storage References
FFT        Add/Sub      Mult         Total        Data         Twiddles     Total
Radix-2    3Pp          2Pp          5Pp          4Pp          Pp           5Pp
Radix-4    (22/8)Pp     (12/8)Pp     (17/4)Pp     (16/8)Pp     (6/8)Pp      (11/4)Pp
Radix-8    (66/24)Pp    (32/24)Pp    (49/12)Pp    (32/24)Pp    (14/24)Pp    (23/12)Pp
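The whole-FFT coefficients follow from the per-butterfly counts, assuming p = log2 P and P a power of the radix r: there are P/r butterflies per stage and p/log2(r) stages, so each per-butterfly count is divided by r·log2(r). The sketch below reproduces the second table from the first; the dictionary and function names are illustrative.

```python
from fractions import Fraction

# Per-butterfly counts from the table: (add/sub, mult, data refs, twiddle refs)
BUTTERFLY = {2: (6, 4, 8, 2), 4: (22, 12, 16, 6), 8: (66, 32, 32, 14)}

def fft_coefficients(radix):
    """Coefficients of P*p (p = log2 P) for a complete FFT of the given radix."""
    add, mult, data, twiddles = BUTTERFLY[radix]
    log2_radix = radix.bit_length() - 1           # 1, 2, 3 for radix 2, 4, 8
    scale = Fraction(1, radix * log2_radix)       # butterflies per unit of P*p
    return {
        "add": add * scale,
        "mult": mult * scale,
        "arith_total": (add + mult) * scale,
        "data": data * scale,
        "twiddles": twiddles * scale,
        "storage_total": (data + twiddles) * scale,
    }

radix4 = fft_coefficients(4)
radix8 = fft_coefficients(8)
```

Higher radices lower both totals per point, with the larger relative gain in storage references (5 to 11/4 to 23/12) compared with arithmetic (5 to 17/4 to 49/12).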


Parallel FFT Data allocation


Example: Power-of-two data set, power-of-two processors

Consecutive data allocation
  P0  P1  P2  P3  P4  P5  P6  P7
   0   4   8  12  16  20  24  28
   1   5   9  13  17  21  25  29
   2   6  10  14  18  22  26  30
   3   7  11  15  19  23  27  31

Cyclic data allocation
  P0  P1  P2  P3  P4  P5  P6  P7
   0   1   2   3   4   5   6   7
   8   9  10  11  12  13  14  15
  16  17  18  19  20  21  22  23
  24  25  26  27  28  29  30  31

Communication
Input order     Consecutive       Cyclic
Normal          First n stages    Last n stages
Bit-reversed    Last n stages     First n stages
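The "Normal" row of the communication table can be reproduced with a small sketch, assuming 2^p data points on 2^n processors and that stage s (0-based) pairs indices that differ in bit p-1-s. The function names and the scheme strings are illustrative.

```python
def owner(index, p, n, scheme):
    """Processor holding data element `index` (2**p elements, 2**n processors)."""
    if scheme == "consecutive":
        return index >> (p - n)           # high-order n bits select the processor
    if scheme == "cyclic":
        return index & ((1 << n) - 1)     # low-order n bits select the processor
    raise ValueError(scheme)

def stages_needing_communication(p, n, scheme):
    """Stages whose butterfly partners live on different processors (normal input order)."""
    stages = []
    for s in range(p):
        bit = p - 1 - s
        if any(owner(i, p, n, scheme) != owner(i ^ (1 << bit), p, n, scheme)
               for i in range(1 << p)):
            stages.append(s)
    return stages

consecutive = stages_needing_communication(5, 3, "consecutive")   # 32 points, 8 processors
cyclic = stages_needing_communication(5, 3, "cyclic")
```

For 32 points on 8 processors this gives the first three stages for consecutive allocation and the last three for cyclic allocation.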


Parallel Radix-2 DIT FFT

[Figure: radix-2 DIT FFT data flow on Processor 0 to Processor 3, with stages marked "Communication" or "Local"]


Parallel Radix-2 FFT + Inverse FFT


DIT DIF

[Figure: data flow on Processor 0 to Processor 3]

Processor twiddle factor subset the same in forward and inverse FFT


Parallel Radix-2 FFT + Inverse FFT


DIF DIT

Use inverse twiddles for the inverse FFT


No bit-reversal necessary!


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15

The first radix-2 stage cannot be performed locally! Data for the first stage is half the processor address range apart


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15
Exchange           0   2   4   6   1   3   5   7
                   8  10  12  14   9  11  13  15

First radix-2 stage can be performed concurrently without communication after the exchange!
No further stages can be performed locally


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15
After 1st Exch.    0   2   4   6   1   3   5   7
                   8  10  12  14   9  11  13  15
After 2nd Exch.    0   2   8  10   1   3   9  11
                   4   6  12  14   5   7  13  15

2nd radix-2 stage can be performed concurrently without communication after the exchange!
No further stages can be performed locally


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15
After 1st Exch.    0   2   4   6   1   3   5   7
                   8  10  12  14   9  11  13  15
After 2nd Exch.    0   2   8  10   1   3   9  11
                   4   6  12  14   5   7  13  15
After 3rd Exch.    0   4   8  12   1   5   9  13
                   2   6  10  14   3   7  11  15

3rd radix-2 stage can be performed concurrently without communication after the exchange!
Last stage cannot be performed locally


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15
After 1st Exch.    0   2   4   6   1   3   5   7
                   8  10  12  14   9  11  13  15
After 2nd Exch.    0   2   8  10   1   3   9  11
                   4   6  12  14   5   7  13  15
After 3rd Exch.    0   4   8  12   1   5   9  13
                   2   6  10  14   3   7  11  15
After 4th Exch.    0   4   8  12   2   6  10  14
                   1   5   9  13   3   7  11  15

Last (4th) radix-2 stage can be performed concurrently without communication after the exchange!
Note, the last exchange stage is the same as the first!


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15
After 1st Exch.    0   2   4   6   1   3   5   7
                   8  10  12  14   9  11  13  15
After 2nd Exch.    0   2   8  10   1   3   9  11
                   4   6  12  14   5   7  13  15
After 3rd Exch.    0   4   8  12   1   5   9  13
                   2   6  10  14   3   7  11  15
After 4th Exch.    0   4   8  12   2   6  10  14
                   1   5   9  13   3   7  11  15

Four exchanges even though there are only three bits used for processor addresses!
All butterflies local!
Data permuted!! (all indices for source, output index bit-reverse of indices above)
Unshuffle on nonlocal index (left cyclic shift)!!


The unshuffle: how did it happen?


Index
Initial allocation (3 2 1 | 0)

Step 1: exchange (0 2 1 | 3)

Step 2: exchange (0 3 1 | 2)

Step 3: exchange (0 3 2 | 1)

Step 4: exchange (1 3 2 | 0)
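The bit-layout notation can be turned into allocation tables with a short sketch. A layout such as (0 2 1 | 3) is read here as: the physical address bits, msb first (processor bits, then the local memory bit), hold data index bits 0, 2, 1 and 3. The function name `allocation` and the tuple encoding are assumptions of this illustration.

```python
def allocation(layout, n_proc_bits):
    """Return table[local][proc] = data index for an address-bit layout.

    layout lists data-index bit numbers for the physical address bits, msb first:
    first the processor address bits, then the local memory address bits.
    """
    total = len(layout)
    m_bits = total - n_proc_bits
    table = [[0] * (1 << n_proc_bits) for _ in range(1 << m_bits)]
    for phys in range(1 << total):
        data = 0
        for pos, bit in enumerate(layout):
            if (phys >> (total - 1 - pos)) & 1:
                data |= 1 << bit          # physical bit at `pos` carries data bit `bit`
        proc, local = phys >> m_bits, phys & ((1 << m_bits) - 1)
        table[local][proc] = data
    return table

# 16 data points on 8 processors: each step swaps the local bit with one processor bit.
steps = [(3, 2, 1, 0), (0, 2, 1, 3), (0, 3, 1, 2), (0, 3, 2, 1), (1, 3, 2, 0)]
tables = [allocation(layout, 3) for layout in steps]
```

After step 1 the first row reads 0 2 4 6 1 3 5 7, and after step 4 the rows read 0 4 8 12 2 6 10 14 and 1 5 9 13 3 7 11 15.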


Permutation based parallel FFT


Block allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   4   8  12  16  20  24  28
                   1   5   9  13  17  21  25  29
                   2   6  10  14  18  22  26  30
                   3   7  11  15  19  23  27  31
After 1st Exch.    0   4   8  12   2   6  10  14
                   1   5   9  13   3   7  11  15
                  16  20  24  28  18  22  26  30
                  17  21  25  29  19  23  27  31
After 2nd Exch.    0   4  16  20   2   6  18  22
                   1   5  17  21   3   7  19  23
                   8  12  24  28  10  14  26  30
                   9  13  25  29  11  15  27  31
After 3rd Exch.    0   8  16  24   2  10  18  26
                   1   9  17  25   3  11  19  27
                   4  12  20  28   6  14  22  30
                   5  13  21  29   7  15  23  31
After 4th Exch.    0   8  16  24   4  12  20  28
                   1   9  17  25   5  13  21  29
                   2  10  18  26   6  14  22  30
                   3  11  19  27   7  15  23  31

All butterflies local!
Four exchanges!
Data permuted!! (all indices for source, output index bit-reverse of indices above)
Unshuffle on nonlocal index (left cyclic shift)!!


Permutation based FFT


Exchanges based on blocks defined by local msb

Proc ID           P0  P1  P2  P3  P4  P5  P6  P7   (p-addr|m-addr)
Initial Alloc.     0   4   8  12  16  20  24  28   (4 3 2 |1 0)
                   1   5   9  13  17  21  25  29
                   2   6  10  14  18  22  26  30
                   3   7  11  15  19  23  27  31
After 1st Exch.    0   4   8  12   2   6  10  14   (1 3 2 |4 0)
                   1   5   9  13   3   7  11  15
                  16  20  24  28  18  22  26  30
                  17  21  25  29  19  23  27  31
After 2nd Exch.    0   4  16  20   2   6  18  22   (1 4 2 |3 0)
                   1   5  17  21   3   7  19  23
                   8  12  24  28  10  14  26  30
                   9  13  25  29  11  15  27  31
After 3rd Exch.    0   8  16  24   2  10  18  26   (1 4 3 |2 0)
                   1   9  17  25   3  11  19  27
                   4  12  20  28   6  14  22  30
                   5  13  21  29   7  15  23  31
After 4th Exch.    0   8  16  24   4  12  20  28   (2 4 3 |1 0)
                   1   9  17  25   5  13  21  29
                   2  10  18  26   6  14  22  30
                   3  11  19  27   7  15  23  31


Permutation based FFT


Exchanges based on blocks defined by local lsb

Proc ID           P0  P1  P2  P3  P4  P5  P6  P7   (p-addr|m-addr)
Initial Alloc.     0   4   8  12  16  20  24  28   (4 3 2 |1 0)
                   1   5   9  13  17  21  25  29
                   2   6  10  14  18  22  26  30
                   3   7  11  15  19  23  27  31
After 1st Exch.    0   4   8  12   1   5   9  13   (0 3 2 |1 4)
                  16  20  24  28  17  21  25  29
                   2   6  10  14   3   7  11  15
                  18  22  26  30  19  23  27  31
After 2nd Exch.    0   4  16  20   1   5  17  21   (0 4 2 |1 3)
                   8  12  24  28   9  13  25  29
                   2   6  18  22   3   7  19  23
                  10  14  26  30  11  15  27  31
After 3rd Exch.    0   8  16  24   1   9  17  25   (0 4 3 |1 2)
                   4  12  20  28   5  13  21  29
                   2  10  18  26   3  11  19  27
                   6  14  22  30   7  15  23  31
After 4th Exch.    0   8  16  24   4  12  20  28   (2 4 3 |1 0)
                   1   9  17  25   5  13  21  29
                   2  10  18  26   6  14  22  30
                   3  11  19  27   7  15  23  31


Minimizing the number of permutations


(p-addr|m-addr)
(4 3 2 |1 0) (4 3 2 |1 0) (4 3 2 |1 0) (4 3 2 |1 0)

(1 0 2 |4 3) (1 0 2 |4 3) (1 0 2 |4 3) (..|.)

(1 0 3 |4 2) (1 0 4 |2 3) (4 0 3 |1 2) (..|.)

(4 2 3 |1 0) (2 3 4 |1 0) (4 2 3 |1 0) (. |.)

Three permutations instead of four: many ways


Permutation based FFT


4-section

Proc ID           P0  P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15   Exchange sequence
Initial Alloc.     0   4   8  12  16  20  24  28  32  36  40  44  48  52  56  60   (5432|10)
                   1   5   9  13  17  21  25  29  33  37  41  45  49  53  57  61
                   2   6  10  14  18  22  26  30  34  38  42  46  50  54  58  62
                   3   7  11  15  19  23  27  31  35  39  43  47  51  55  59  63
After 1st Exch.    0   4   8  12   1   5   9  13   2   6  10  14   3   7  11  15   (1032|54)
                  16  20  24  28  17  21  25  29  18  22  26  30  19  23  27  31
                  32  36  40  44  33  37  41  45  34  38  42  46  35  39  43  47
                  48  52  56  60  49  53  57  61  50  54  58  62  51  55  59  63
After 2nd Exch.    0  16  32  48   1  17  33  49   2  18  34  50   3  19  35  51   (1054|32)
                   4  20  36  52   5  21  37  53   6  22  38  54   7  23  39  55
                   8  24  40  56   9  25  41  57  10  26  42  58  11  27  43  59
                  12  28  44  60  13  29  45  61  14  30  46  62  15  31  47  63
After 3rd Exch.    0  16  32  48   4  20  36  52   8  24  40  56  12  28  44  60   (3254|10)
                   1  17  33  49   5  21  37  53   9  25  41  57  13  29  45  61
                   2  18  34  50   6  22  38  54  10  26  42  58  14  30  46  62
                   3  19  35  51   7  23  39  55  11  27  43  59  15  31  47  63


Permutation based parallel FFT


Cyclic data allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
Initial Alloc.     0   1   2   3   4   5   6   7
                   8   9  10  11  12  13  14  15
After 1st Exch.    0   1   2   3   8   9  10  11
                   4   5   6   7  12  13  14  15
After 2nd Exch.    0   1   4   5   8   9  12  13
                   2   3   6   7  10  11  14  15
After 3rd Exch.    0   2   4   6   8  10  12  14
                   1   3   5   7   9  11  13  15

All butterflies local!
Three exchanges!
Data permuted!!
Consecutive order after exchange sequence


Permutation based parallel FFT


Cyclic data allocation
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7   Exchange sequence (m-addr|p-addr)
Initial Alloc.     0   1   2   3   4   5   6   7   (3 | 2 1 0)
                   8   9  10  11  12  13  14  15
After 1st Exch.    0   1   2   3   8   9  10  11   (2 | 3 1 0)
                   4   5   6   7  12  13  14  15
After 2nd Exch.    0   1   4   5   8   9  12  13   (1 | 3 2 0)
                   2   3   6   7  10  11  14  15
After 3rd Exch.    0   2   4   6   8  10  12  14   (0 | 3 2 1)
                   1   3   5   7   9  11  13  15

Consecutive order!!


Permutation based FFT


FFT computations are carried out from msb to lsb in the data index (always)
For all computations to be local, the required permutation depends on the data allocation (and on whether the FFT is self-sorting)
Exchanges affect memory access strides both in carrying out permutations and in carrying out butterfly computations


Parallel unordered FFT


communication requirements


Permutation based parallel FFT


Cyclic data allocation, DIF without permutations, twiddles

Input index
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   0   1   2   3   4   5   6   7
                   8   9  10  11  12  13  14  15
                  16  17  18  19  20  21  22  23
                  24  25  26  27  28  29  30  31
                  32  33  34  35  36  37  38  39
                  40  41  42  43  44  45  46  47
                  48  49  50  51  52  53  54  55
                  56  57  58  59  60  61  62  63

Twiddle exponent, 1st stage
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   0   1   2   3   4   5   6   7
                   8   9  10  11  12  13  14  15
                  16  17  18  19  20  21  22  23
                  24  25  26  27  28  29  30  31

Twiddle exponent, 2nd stage
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   0   2   4   6   8  10  12  14
                  16  18  20  22  24  26  28  30
                   0   2   4   6   8  10  12  14
                  16  18  20  22  24  26  28  30

Twiddle exponent, 3rd stage
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   0   4   8  12  16  20  24  28
                   0   4   8  12  16  20  24  28
                   0   4   8  12  16  20  24  28
                   0   4   8  12  16  20  24  28


Permutation based parallel FFT


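Cyclic data allocation, DIF without permutations

Twiddle exponent, 4th stage (eight local rows, "-" where no twiddle is applied)
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24
                   -   -   -   -   0   8  16  24

Twiddle exponent, 5th stage (eight local rows, "-" where no twiddle is applied)
Proc ID           P0  P1  P2  P3  P4  P5  P6  P7
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16
                   -   -   0  16   -   -   0  16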


Twiddle factor allocation


Permutation based FFT


DIT permutation based FFT twiddles

Proc ID           P0  P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15
Initial Alloc.     0   4   8  12  16  20  24  28  32  36  40  44  48  52  56  60
                   1   5   9  13  17  21  25  29  33  37  41  45  49  53  57  61
                   2   6  10  14  18  22  26  30  34  38  42  46  50  54  58  62
                   3   7  11  15  19  23  27  31  35  39  43  47  51  55  59  63

Twiddle expon.     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
1st stage          0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

Twiddle expon.     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
2nd stage         16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16

Twiddle expon.     0  16   8  24   0  16   8  24   0  16   8  24   0  16   8  24
3rd stage          0  16   8  24   0  16   8  24   0  16   8  24   0  16   8  24

Twiddle expon.     0   8   4  12   0   8   4  12   0   8   4  12   0   8   4  12
4th stage         16  24  20  28  16  24  20  28  16  24  20  28  16  24  20  28


Permutation based FFT


DIT permutation based FFT twiddles

Proc ID           P0  P1  P2  P3  P4  P5  P6  P7  P8  P9 P10 P11 P12 P13 P14 P15

Twiddle expon.     0   4   2   6  16  20  18  22   8  12  10  14  24  28  26  30
5th stage          0   4   2   6  16  20  18  22   8  12  10  14  24  28  26  30

Twiddle expon.     0   2   1   3   8  10   9  11   4   6   5   7  12  14  13  15
6th stage         16  18  17  19  24  26  25  27  20  22  21  23  28  30  29  31


Four step FFT


Constructing the four step FFT
Factorize N in two equal (palindrome) factors: $N = \sqrt{N} \times \sqrt{N}$

1. Compute first rank, $\sqrt{N}$ FFTs of size $\sqrt{N}$
2. Multiply with twiddle factors
3. Transpose $\sqrt{N} \times \sqrt{N}$ matrix
4. Compute last rank, $\sqrt{N}$ FFTs of size $\sqrt{N}$
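The four steps can be sketched for a perfect-square N as follows. The row-major indexing j = j1·n + j2, the output indexing m = k1 + n·k2, and the use of a direct DFT for the small transforms are choices made for this illustration, not details taken from the slides; a production code would use FFTs for both ranks.

```python
import cmath
from math import isqrt

def dft(x):
    """Direct DFT, used here for the small size-sqrt(N) transforms."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * m / n) for j in range(n))
            for m in range(n)]

def four_step_fft(x):
    N = len(x)
    n = isqrt(N)
    assert n * n == N, "this sketch assumes N is a perfect square"
    # Step 1: first rank, n transforms of size n over j1 (one per column j2),
    # with the data viewed as an n x n matrix x[j1*n + j2].
    cols = [dft([x[j1 * n + j2] for j1 in range(n)]) for j2 in range(n)]  # cols[j2][k1]
    # Step 2: multiply with twiddle factors omega_N**(j2*k1).
    for j2 in range(n):
        for k1 in range(n):
            cols[j2][k1] *= cmath.exp(-2j * cmath.pi * j2 * k1 / N)
    # Step 3: transpose the n x n matrix.
    rows = [[cols[j2][k1] for j2 in range(n)] for k1 in range(n)]         # rows[k1][j2]
    # Step 4: last rank, n transforms of size n over j2.
    last = [dft(r) for r in rows]                                         # last[k1][k2]
    X = [0j] * N
    for k1 in range(n):
        for k2 in range(n):
            X[k1 + n * k2] = last[k1][k2]
    return X

x = [complex(v % 5, v % 3) for v in range(16)]
X = four_step_fft(x)
```

The transpose in step 3 is what lets both ranks operate on contiguous vectors; it is also where all the communication is concentrated in a parallel setting.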

[Figure: two ranks of four FFT4 blocks each]


Square Transpose


Square Transpose

            Xeon Clovertown                     Opteron 285
Memory      10.6 GB/s, 80-280 cycles            6.4 GB/s, 150 cycles
L1 Cache    32K/core, 64B, 8-way, 3 cycles      64K/core, 64B, 2-way, 3 cycles
L2 Cache    8M/dual, 64B, 16-way, 14 cycles     1M/core, 64B, 16-way, 12 cycles


Parallel FFT on Binary n-Cube


Pipelined FFT on n-cube


Time   Memory      Processor
step   location    0  1  2  3  4  5  6  7
0      0           2  2  2  2  2  2  2  2
       1           -  -  -  -  -  -  -  -
       2           -  -  -  -  -  -  -  -
       3           -  -  -  -  -  -  -  -
       4           -  -  -  -  -  -  -  -
1      0           1  1  1  1  1  1  1  1
       1           2  2  2  2  2  2  2  2
       2           -  -  -  -  -  -  -  -
       3           -  -  -  -  -  -  -  -
       4           -  -  -  -  -  -  -  -
2      0           0  0  0  0  0  0  0  0
       1           1  1  1  1  1  1  1  1
       2           2  2  2  2  2  2  2  2
       3           -  -  -  -  -  -  -  -
       4           -  -  -  -  -  -  -  -
3      0           -  -  -  -  -  -  -  -
       1           0  0  0  0  0  0  0  0
       2           1  1  1  1  1  1  1  1
       3           2  2  2  2  2  2  2  2
       4           -  -  -  -  -  -  -  -

(The table entry is the network dimension, starting with the dimension corresponding to the msb of the p-addr)

[Figure: 3-cube with nodes 0 to 7, shown for time steps 0, 1 and 2]

The first four steps of a pipelined, in-place, FFT on a 3-cube


Pipelined Bi-section FFT on n-cube


Time   Memory      Processor
step   location    0  1  2  3  4  5  6  7
0      0           -  -  -  -  2  2  2  2
       1           2  2  2  2  -  -  -  -
       2           -  -  -  -  -  -  -  -
       3           -  -  -  -  -  -  -  -
       4           -  -  -  -  -  -  -  -
       5           -  -  -  -  -  -  -  -
1      0           -  -  1  1  -  -  1  1
       1           1  1  -  -  1  1  -  -
       2           -  -  -  -  2  2  2  2
       3           2  2  2  2  -  -  -  -
       4           -  -  -  -  -  -  -  -
       5           -  -  -  -  -  -  -  -
2      0           -  0  -  0  -  0  -  0
       1           0  -  0  -  0  -  0  -
       2           -  -  1  1  -  -  1  1
       3           1  1  -  -  1  1  -  -
       4           -  -  -  -  2  2  2  2
       5           2  2  2  2  -  -  -  -
3      0           -  -  -  -  -  -  -  -
       1           -  -  -  -  -  -  -  -
       2           -  0  -  0  -  0  -  0
       3           0  -  0  -  0  -  0  -
       4           -  -  1  1  -  -  1  1
       5           1  1  -  -  1  1  -  -

[Figure: 3-cube with nodes 0 to 7, shown for time steps 0, 1 and 2]

The first four steps of a pipelined, in-place, FFT on a 3-cube


Pipelined, 1-D, in-place, FFT on 11-cube

Performance of a pipelined, one-dimensional, in-place, radix-2 FFT on an 11-cube as a function of data allocation (mesh shape of the 11-cube)


Pipelined, 2-D, in-place, FFT on 11-cube

Performance of a pipelined, two-dimensional, in-place, radix-2 FFT on an 11-cube as a function of data allocation (mesh shape of the 11-cube)
