I. INTRODUCTION
There is a large body of literature on fast transform algorithms. With few exceptions these algorithms are derived
by clever and often lengthy manipulation of the transform
coefficients. These derivations are hard to grasp, and provide
little insight into the structure of the resulting algorithm.
Further, it is hard to determine if all relevant classes of
algorithms have been found. This is not just an academic
problem as the variety of different implementation platforms
and application requirements makes a thorough knowledge
of the algorithm space crucial. For example, state-of-the-art implementations of the discrete Fourier transform (DFT)
rely heavily on variants of general radix algorithms to
adapt the implementation to the memory hierarchy [1], [2], [3]
or to optimize it for vector instructions and multiple threads
[4], [5], [6], [7].
In this paper, we derive fast algorithms for linear transforms
algebraically. This means that we do not manipulate the actual
transform to obtain an algorithm, but decompose a polynomial
algebra associated with the transform. In this spirit, we first
present two general decomposition theorems for polynomial
algebras and show that they generalize the well-known Cooley-Tukey fast Fourier transform (FFT). Then, we apply these
theorems to the discrete cosine and sine transforms (DCTs
and DSTs) and derive a large new class of recursive general
radix Cooley-Tukey type algorithms for these transforms, only
special cases of which have been known. In particular, these
new fast algorithms that we present are the first to provide a
general radix decomposition for the DCTs and DSTs.
This work was supported by NSF through awards 9988296, 0310941, and
0634967.
Markus Püschel and José M. F. Moura are with the Department of Electrical
and Computer Engineering, Carnegie Mellon University, Pittsburgh. E-mail:
{pueschel,moura}@ece.cmu.edu; ph:(412)268-6341; fax:(412)268-3890.
TABLE I
SIGNAL MODELS ASSOCIATED TO THE DFTS.

F              p(x)        f(\alpha_k)       (k, \ell) entry of F
DFT-1 = DFT    x^n - 1     1                 \omega_n^{k\ell}
DFT-2          x^n - 1     \alpha_k^{1/2}    \omega_n^{k(\ell+1/2)}
DFT-3          x^n + 1     1                 \omega_n^{(k+1/2)\ell}
DFT-4          x^n + 1     \alpha_k^{1/2}    \omega_n^{(k+1/2)(\ell+1/2)}
DFT(a)         x^n - a

TABLE II
SIGNAL MODELS ASSOCIATED TO THE 16 DTTS (DCTS AND DSTS).

Group      F        p = p(x)              basis   f(\alpha_k), \alpha_k = \cos\theta
T-group    DCT-3    T_n                   T       1
           DST-3    T_n                   U       \sin\theta
           DCT-4    T_n                   V       \cos(\theta/2)
           DST-4    T_n                   W       \sin(\theta/2)
U-group    DCT-1    (x^2 - 1) U_{n-2}     T       1
           DST-1    U_n                   U       \sin\theta
           DCT-2    (x - 1) U_{n-1}       V       \cos(\theta/2)
           DST-2    (x + 1) U_{n-1}       W       \sin(\theta/2)
V-group    DCT-7    (x + 1) V_{n-1}       T       1
           DST-7    V_n                   U       \sin\theta
           DCT-8    V_n                   V       \cos(\theta/2)
           DST-8    (x + 1) V_{n-1}       W       \sin(\theta/2)
W-group    DCT-5    (x - 1) W_{n-1}       T       1
           DST-5    W_n                   U       \sin\theta
           DCT-6    (x - 1) W_{n-1}       V       \cos(\theta/2)
           DST-6    W_n                   W       \sin(\theta/2)
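TABLE III
8 TYPES OF DCTS AND DSTS OF SIZE n. THE ENTRY AT ROW k AND COLUMN \ell IS GIVEN FOR 0 \le k, \ell < n.

type   DCTs                                          DSTs
1      \cos(k \ell \pi / (n-1))                      \sin((k+1)(\ell+1) \pi / (n+1))
2      \cos(k (\ell+1/2) \pi / n)                    \sin((k+1)(\ell+1/2) \pi / n)
3      \cos((k+1/2) \ell \pi / n)                    \sin((k+1/2)(\ell+1) \pi / n)
4      \cos((k+1/2)(\ell+1/2) \pi / n)               \sin((k+1/2)(\ell+1/2) \pi / n)
5      \cos(k \ell \pi / (n-1/2))                    \sin((k+1)(\ell+1) \pi / (n+1/2))
6      \cos(k (\ell+1/2) \pi / (n-1/2))              \sin((k+1)(\ell+1/2) \pi / (n+1/2))
7      \cos((k+1/2) \ell \pi / (n-1/2))              \sin((k+1/2)(\ell+1) \pi / (n+1/2))
8      \cos((k+1/2)(\ell+1/2) \pi / (n+1/2))         \sin((k+1/2)(\ell+1/2) \pi / (n-1/2))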
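The T-group rows of Tables II and III can be cross-checked numerically. The following is a minimal sketch, assuming that each DTT is the scaled polynomial transform $F = \mathrm{diag}(f(\alpha_k)) \cdot [C_\ell(\alpha_k)]$, where $\alpha_k = \cos\theta_k$ with $\theta_k = (k+1/2)\pi/n$ are the zeros of $T_n$ and $C_\ell$ satisfies the three-term recurrence $C_{\ell+1} = 2x C_\ell - C_{\ell-1}$. The dictionary and function names are illustrative and not taken from the paper.

```python
import math

# Assumed encoding of the T-group rows of Table II:
# transform -> (C_1(x), scaling function f(theta)); C_0 = 1 in all cases.
T_GROUP = {
    "DCT-3": (lambda x: x,         lambda t: 1.0),
    "DST-3": (lambda x: 2 * x,     lambda t: math.sin(t)),
    "DCT-4": (lambda x: 2 * x - 1, lambda t: math.cos(t / 2)),
    "DST-4": (lambda x: 2 * x + 1, lambda t: math.sin(t / 2)),
}

# Closed-form (k, l) entries as listed in Table III.
TABLE_III = {
    "DCT-3": lambda k, l, n: math.cos((k + 0.5) * l * math.pi / n),
    "DST-3": lambda k, l, n: math.sin((k + 0.5) * (l + 1) * math.pi / n),
    "DCT-4": lambda k, l, n: math.cos((k + 0.5) * (l + 0.5) * math.pi / n),
    "DST-4": lambda k, l, n: math.sin((k + 0.5) * (l + 0.5) * math.pi / n),
}


def dtt_from_model(name, n):
    """Build the n x n matrix diag(f(alpha_k)) * [C_l(alpha_k)] row by row."""
    c1, f = T_GROUP[name]
    rows = []
    for k in range(n):
        theta = (k + 0.5) * math.pi / n      # alpha_k = cos(theta) is a zero of T_n
        x = math.cos(theta)
        vals = [1.0, c1(x)]
        while len(vals) < n:                 # C_{l+1} = 2x C_l - C_{l-1}
            vals.append(2 * x * vals[-1] - vals[-2])
        rows.append([f(theta) * v for v in vals[:n]])
    return rows


def max_deviation(name, n):
    """Largest absolute difference between the model-based and closed-form entries."""
    F = dtt_from_model(name, n)
    return max(abs(F[k][l] - TABLE_III[name](k, l, n))
               for k in range(n) for l in range(n))
```

Calling `max_deviation(name, 8)` for each of the four names should return a value at rounding-error level, which is what ties the signal model (polynomial, basis, scaling function) to the trigonometric definition.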
TABLE IV
4 TYPES OF SKEW DTTS AND ASSOCIATED SIGNAL MODELS. THE PARAMETER r IS IN 0 \le r \le 1. FOR r = 1/2 THEY REDUCE TO THE T-GROUP DTTS.

F            p = p(x)               basis   f = f(\alpha_k), \alpha_k = \cos\theta
DCT-3(r)     T_n - \cos r\pi        T       1
DST-3(r)     T_n - \cos r\pi        U       \sin\theta
DCT-4(r)     T_n - \cos r\pi        V       \cos(\theta/2)
DST-4(r)     T_n - \cos r\pi        W       \sin(\theta/2)
0i< n1
2
2r+2i
( r+2i
) ( r+n1
)
n ,
n
n
(11)
The transform matrix is T , and transform algorithms correspond to factorizations of T into a product of sparse structured
matrices. This approach was adopted for the DFT in [33], [28]
(and as early as [21]), but also for other transforms in various
research papers on fast transform algorithms.
In ASP, we adopt the second approach for two reasons. First,
in ASP, a transform is a decomposition of a signal model into
its spectral components, e.g., as in (1). This decomposition is
a base change and hence represented by a matrix. Further, we
1 By transforms, we mean here those computing some sort of spectrum
of finite length discrete signals like the DFT or DTTs.
TABLE V
AUXILIARY MATRICES USED IN THIS PAPER.

[Matrix displays lost in extraction. The table defines the identity $I_n$, the matrix $F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, and the matrices $J_n$, $S_n$, $Z_n$, and a second, accented variant of $Z_n$.]
0 i < n,
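$$L^n_m:\ i_2 \tfrac{n}{m} + i_1 \mapsto i_1 m + i_2, \qquad (12)$$
for $0 \le i_1 < n/m$, $0 \le i_2 < m$. This definition shows that $L^n_m$ transposes an $n/m \times m$ matrix stored in row-major order. Alternatively, we can write
$$L^n_m:\ i \mapsto i m \bmod (n-1) \ \text{ for } 0 \le i < n-1, \qquad n-1 \mapsto n-1.$$
Analogous to the stride permutation, $(\hat{L}^n_m)^{-1} = \hat{L}^n_{(n+1)/m}$, and ($\oplus$ is defined right below)
$$L^n_m = \hat{L}^{n-1}_m \oplus I_1.$$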
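The two descriptions of the stride permutation can be checked against each other with a short sketch. It assumes the convention that a permutation $\sigma$ acts on a vector as $(Px)_i = x_{\sigma(i)}$, and that $\hat{L}^n_m$ is the map $i \mapsto im \bmod n$; both are assumptions made here for illustration, and all function names are hypothetical.

```python
def stride_perm(n, m):
    """sigma for L^n_m: i -> i*m mod (n-1) for i < n-1, and n-1 -> n-1."""
    if n % m != 0:
        raise ValueError("m must divide n")
    return [(i * m) % (n - 1) for i in range(n - 1)] + [n - 1]


def hat_stride_perm(n, m):
    """Assumed form of the hatted permutation: i -> i*m mod n."""
    return [(i * m) % n for i in range(n)]


def apply_perm(sigma, x):
    """Assumed convention: (P x)[i] = x[sigma[i]]."""
    return [x[s] for s in sigma]


def transpose_row_major(x, rows, cols):
    """Transpose a rows x cols matrix stored in row-major order."""
    return [x[r * cols + c] for c in range(cols) for r in range(rows)]


def two_index_form_holds(n, m):
    """Check i2*(n/m) + i1 -> i1*m + i2 for 0 <= i1 < n/m, 0 <= i2 < m."""
    sigma = stride_perm(n, m)
    q = n // m
    return all(sigma[i2 * q + i1] == i1 * m + i2
               for i1 in range(q) for i2 in range(m))
```

For example, with `x = list(range(12))`, `apply_perm(stride_perm(12, 3), x)` coincides with `transpose_row_major(x, 4, 3)`, i.e., the transposition of a $4 \times 3$ matrix. Under the opposite convention for applying a permutation to a vector, the same map would transpose a $3 \times 4$ matrix instead.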
for A = [ak, ].
[Fig. 1: decomposition of $\mathbb{C}[x]/p(x)$ into $\bigoplus_{0 \le k < n} \mathbb{C}[x]/(x - \alpha_k)$, with one path passing through a partial decomposition.]
$$\mathbb{C}[x]/p(x) \to \mathbb{C}[x]/q(x) \oplus \mathbb{C}[x]/r(x) \qquad (15)$$
$$\to \bigoplus_{0 \le i < k} \mathbb{C}[x]/(x - \beta_i) \ \oplus \bigoplus_{0 \le j < m} \mathbb{C}[x]/(x - \gamma_j) \qquad (16)$$
$$\to \bigoplus_{0 \le i < n} \mathbb{C}[x]/(x - \alpha_i). \qquad (17)$$
Here the $\beta_i$ are the zeros of $q$ and the $\gamma_j$ are the zeros
of $r$, which implies that both are a subset of the zeros $\alpha_i$
of $p$. Both steps (15) and (16) use the Chinese remainder
theorem, whereas (17) is just a reordering of the spectrum.
The corresponding factorization of the Fourier transform is
provided in the following theorem.
Theorem 1 (Cooley-Tukey Type Algorithm by Factorization)
Let $p(x) = q(x) \cdot r(x)$, and let $c$ and $d$ be bases of $\mathbb{C}[x]/q(x)$
and $\mathbb{C}[x]/r(x)$, respectively. Further, denote with $\beta$ and $\gamma$ the
lists of zeros of $q$ and $r$, respectively. Then
$$P_{b,\alpha} = P\,(P_{c,\beta} \oplus P_{d,\gamma})\,B.$$
The matrix $B$ corresponds to (15), which maps the basis $b$ to
the concatenation $(c, d)$ of the bases $c$ and $d$, and $P$ is the
permutation matrix mapping the concatenation $(\beta, \gamma)$ to the
list of zeros $\alpha$ in (17).
Note that the factorization of $P_{b,\alpha}$ in Theorem 1 is useful
as a fast algorithm, i.e., reduces the arithmetic cost, only if
$B$ is sparse or can be multiplied with efficiently. Referring to
Fig. 1, the partial decomposition is step (15) corresponding
to $B$.
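Example: DFT. The DFT is a (polynomial) Fourier transform for the regular signal model given by $\mathcal{A} = \mathcal{M} = \mathbb{C}[x]/(x^n - 1)$ with basis $b = (1, x, \dots, x^{n-1})$ as shown in
(6). We assume $n = 2m$ and use the factorization $x^{2m} - 1 = (x^m - 1)(x^m + 1)$. Applying Theorem 1 yields the following
decomposition steps:
$$\mathbb{C}[x]/(x^n - 1) \to \mathbb{C}[x]/(x^m - 1) \oplus \mathbb{C}[x]/(x^m + 1) \qquad (18)$$
$$\to \bigoplus_{0 \le i < m} \mathbb{C}[x]/(x - \omega_n^{2i}) \ \oplus \bigoplus_{0 \le i < m} \mathbb{C}[x]/(x - \omega_n^{2i+1}) \qquad (19)$$
$$\to \bigoplus_{0 \le i < n} \mathbb{C}[x]/(x - \omega_n^{i}). \qquad (20)$$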
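where the entries are determined next. For the base elements
$x^{m+\ell}$, $0 \le \ell < m$, we have
$$x^{m+\ell} \equiv x^{\ell} \bmod (x^m - 1), \qquad x^{m+\ell} \equiv -x^{\ell} \bmod (x^m + 1).$$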
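The three steps (18) to (20) can be traced numerically. The sketch below is illustrative only: it assumes $\omega_n = e^{-2\pi j/n}$, implements the base change of step (18) directly from the two congruences above, evaluates the two smaller polynomial transforms naively, and interleaves the results. The function names are not from the paper.

```python
import cmath


def poly_transform(coeffs, zeros):
    """Evaluate s(x) = sum_l coeffs[l] * x^l at every zero (basis 1, x, x^2, ...)."""
    return [sum(c * z ** l for l, c in enumerate(coeffs)) for z in zeros]


def dft_via_factorization(s):
    """One step of Theorem 1 for p = x^(2m) - 1 = (x^m - 1)(x^m + 1)."""
    n = len(s)
    if n % 2 != 0:
        raise ValueError("length must be even")
    m = n // 2
    w = cmath.exp(-2j * cmath.pi / n)          # assumed sign convention for omega_n

    # Step (18): reduce s modulo x^m - 1 and modulo x^m + 1.
    s_minus = [s[l] + s[m + l] for l in range(m)]   # x^(m+l) =  x^l mod (x^m - 1)
    s_plus = [s[l] - s[m + l] for l in range(m)]    # x^(m+l) = -x^l mod (x^m + 1)

    # Step (19): decompose each summand at its zeros.
    even = poly_transform(s_minus, [w ** (2 * i) for i in range(m)])
    odd = poly_transform(s_plus, [w ** (2 * i + 1) for i in range(m)])

    # Step (20): reorder the spectrum to omega_n^0, omega_n^1, ..., omega_n^(n-1).
    out = [0j] * n
    out[0::2] = even
    out[1::2] = odd
    return out
```

The result agrees with evaluating $s$ at all $\omega_n^i$ directly. The only arithmetic in step (18) is $n$ additions, which is the sparsity of $B$ that makes the factorization useful as a fast algorithm.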
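TABLE VI
ALGEBRAS OCCURRING IN THE ALGORITHM DERIVATION BASED ON THE DECOMPOSITION p(x) = q(r(x)).

algebra                                  basis                          zeros
A = \mathbb{C}[x]/p(x)                   b = (p_0, ..., p_{n-1})        \alpha = (\alpha_0, ..., \alpha_{n-1})
B = \mathbb{C}[y]/q(y)                   c = (q_0, ..., q_{k-1})        \beta = (\beta_0, ..., \beta_{k-1})
C_i = \mathbb{C}[x]/(r(x) - \beta_i)     d = (r_0, ..., r_{m-1})        \alpha_i' = (\alpha_{i,0}, ..., \alpha_{i,m-1})

$$\mathbb{C}[x]/p(x) = \mathbb{C}[x]/q(r(x)) \qquad (21)$$
$$\to \bigoplus_{0 \le i < k} \mathbb{C}[x]/(r(x) - \beta_i) \qquad (22)$$
$$\to \bigoplus_{0 \le i < k}\ \bigoplus_{0 \le j < m} \mathbb{C}[x]/(x - \alpha_{i,j}) \qquad (23)$$
$$\to \bigoplus_{0 \le i < n} \mathbb{C}[x]/(x - \alpha_i). \qquad (24)$$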
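$$x^n - 1 = (x^m)^k - 1. \qquad (26)$$
Each $\mathbb{C}[x]/(x^m - \omega_k^i)$ is further decomposed as
$$\mathbb{C}[x]/(x^m - \omega_k^i) \to \bigoplus_{0 \le j < m} \mathbb{C}[x]/(x - \omega_n^{jk+i}),$$
and the resulting algorithm is
$$\mathrm{DFT}_n = L^n_m\,(I_k \otimes \mathrm{DFT}_m)\,T^n_m\,(\mathrm{DFT}_k \otimes I_m), \qquad (27)$$
where the matrix $T^n_m$ is diagonal and usually called the
twiddle matrix. Transposition of (27) using (14) yields the
corresponding decimation-in-time version.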
Again, we note that after recognizing the decomposition
property (26), the derivation is completely mechanical.
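The factorization (27) translates directly into a recursive program. The following is a sketch under stated assumptions rather than the authors' implementation: $\omega_n = e^{-2\pi j/n}$, the twiddle matrix is taken as $T^n_m = \bigoplus_{0 \le c < k} \mathrm{diag}(\omega_n^{cb})_{0 \le b < m}$ (derived here, since its definition does not appear in the surrounding text), $L^n_m$ acts as $(Px)_i = x_{\sigma(i)}$ with $\sigma(i) = im \bmod (n-1)$, and $k$ is chosen as the smallest prime factor of $n$.

```python
import cmath


def dft_direct(x):
    """DFT by definition: y[k] = sum_l x[l] * omega_n^(k*l)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(x[l] * w ** (k * l) for l in range(n)) for k in range(n)]


def smallest_factor(n):
    """Smallest divisor d >= 2 of n, or n itself if n is prime or 1."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


def fft(x):
    """DFT_n = L^n_m (I_k (x) DFT_m) T^n_m (DFT_k (x) I_m), applied right to left."""
    n = len(x)
    k = smallest_factor(n)
    if k == n:                       # prime size (or n = 1): base case
        return dft_direct(x)
    m = n // k
    w = cmath.exp(-2j * cmath.pi / n)

    # (DFT_k (x) I_m): a DFT_k on x[b], x[m+b], ..., x[(k-1)m+b] for each b.
    u = [0j] * n
    for b in range(m):
        col = fft(x[b::m])
        for c in range(k):
            u[c * m + b] = col[c]

    # T^n_m: multiply entry c*m + b by omega_n^(c*b).
    for c in range(k):
        for b in range(m):
            u[c * m + b] *= w ** (c * b)

    # (I_k (x) DFT_m): a DFT_m on each of the k contiguous blocks.
    v = []
    for c in range(k):
        v.extend(fft(u[c * m:(c + 1) * m]))

    # L^n_m: y[i] = v[i*m mod (n-1)], with the last entry fixed.
    return [v[(i * m) % (n - 1)] for i in range(n - 1)] + [v[n - 1]]
```

The two loops over `b` and `c` are the loops corresponding to the tensor products, and the recursive calls are the smaller transforms. Production libraries organize the same structure very differently (for example, by fusing the twiddle multiplication into the adjacent loop), so this sketch is only meant to make the correspondence between (27) and code explicit.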
Remarks. Theorem 2 makes use of the CRT in (22) and
(23), but it is the decomposition property of $x^n - 1$ that
produces the general radix structure. The previous work on
the algebraic derivation of this FFT did not make use of
decompositions. As we briefly discuss in the next section, the
decomposition is a special case of a more general algebraic
principle.
An algorithm based on Theorem 2 is naturally implemented
recursively, where the smaller transforms are called inside
loops corresponding to the tensor product and direct sum,
respectively. The structure of the algorithm also makes it a
candidate for efficient vectorization and parallelization [5], [6],
[33], [28].
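$$\mathbb{C}[x]/T_3 \to \mathbb{C}[x]/x \oplus \mathbb{C}[x]/(x^2 - \tfrac{3}{4}) \qquad (29)$$
$$\to \mathbb{C}[x]/x \oplus \mathbb{C}[x]/(x - \tfrac{\sqrt{3}}{2}) \oplus \mathbb{C}[x]/(x + \tfrac{\sqrt{3}}{2}) \qquad (30)$$
$$\to \mathbb{C}[x]/(x - \tfrac{\sqrt{3}}{2}) \oplus \mathbb{C}[x]/x \oplus \mathbb{C}[x]/(x + \tfrac{\sqrt{3}}{2}). \qquad (31)$$
The polynomial transforms for the two summands in (29) are
$$[1] \quad \text{and} \quad \begin{bmatrix} V_0(\tfrac{\sqrt{3}}{2}) & V_1(\tfrac{\sqrt{3}}{2}) \\ V_0(-\tfrac{\sqrt{3}}{2}) & V_1(-\tfrac{\sqrt{3}}{2}) \end{bmatrix} = \begin{bmatrix} 1 & \sqrt{3} - 1 \\ 1 & -\sqrt{3} - 1 \end{bmatrix},$$
respectively. Finally, we have to exchange the first two spectral
components in (31). The result is
$$\overline{\text{DCT-4}}_3 = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & \sqrt{3} - 1 \\ 0 & 1 & -\sqrt{3} - 1 \end{bmatrix} \begin{bmatrix} 1 & -1 & -1 \\ 1 & 0 & 1 \\ 0 & 1 & -1 \end{bmatrix}.$$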
3 1
0
1 1
0 0 1
0 1 31
1
0
0
1 1 1
0 1 0 2
1
1
)
0
1 .
DCT-43 = 1 0 0 0 cos( 12
2
5
1 1
0 0 1 0 cos( 12 ) 1 0
2
! 1
1 0 0
3
1
1
I1 0
= 0 0 1
1
1
2
2
1
0 1 0
0
1
1
0 . (32)
0 1
1 0
1
1 0 0
1
1
0 ,
I1 0 1
DCT-23 = 0 0 1
1 2
1 0 1
0 1 0
TABLE VII
DTT ALGORITHMS BASED ON FACTORIZATION PROPERTIES OF THE CHEBYSHEV POLYNOMIALS. TRANSPOSITION YIELDS A DIFFERENT SET OF
ALGORITHMS. REPLACING EACH TRANSFORM BY ITS POLYNOMIAL COUNTERPART YIELDS ALGORITHMS FOR THE POLYNOMIAL DTTS. THE BASE
CASES ARE GIVEN IN TABLE IX.
DCT-12m = L2m
m (DCT-5m DCT-7m )B2m
DST-12m = L2m
m (DST-7m DST-5m )B2m
b
DCT-22m+1 = L2m+1 (DCT-6m+1 DCT-8m )B2m+1
DCT-22m =
DST-22m =
L2m
m (DCT-2m DCT-4m )B2m
L2m
m (DST-4m DST-2m )B2m
m+1
(C7)
(C5)
3m+2
DCT-73m+2 = Pm
(DCT-32m+1 ( 13 ) DCT-7m+1 )B3m+2
3m+2
DCT-53m+2 = Qm
(DCT-5m+1 DCT-32m+1 ( 23 ))B3m+2
DST-73m+1 =
DST-53m+1 =
DCT-83m+1 =
DST-83m+2 =
(S7)
3m+1
Pbm
(DST-32m+1 ( 31 ) DST-7m )B3m+1
(C8)
3m+1
Pbm
(DCT-42m+1 ( 13 ) DCT-8m )B3m+1
(S8)
3m+2
Pm
(DST-42m+1 ( 31 ) DST-8m+1 )B3m+2
DCT-63m+2 =
DST-63m+1 =
TABLE VIII
MATRICES USED IN THE ALGORITHMS IN TABLE VII.
B2m
2
6
6
6
4
2
6
6
6
4
I2m+1
1
Im
Jm
Im
Jm
I2m+1
I
= m
Im
Im
Jm
= (DFT2 Im )(Im Jm ),
B2m+1 = 4 0
Jm
Im
I
I
J
m
m
m
Im 7
7
6
0
7
6
7
1/2
7, 6
Im 7 ,
5
4
5
Im
Jm
I2m+1
Jm
0 0
(C7)
(S7)
3
Jm
0 5
Jm
0
1
0
(C8)
(S8)
I2m+1
Im
m
0 0
Jm
Jm
(C5)
Im
(S5)
7
7
7,
5
Im
I2m+1
6
6
6
4 I
Jm
m
(C6)
Jm
Im
3
7
7
7
5
(S6)
1 7
1
6
6 Im 0 Jm
6
6
7
Im
6
7, 6
4
4
5
2
I2m+1
I2m+1
Jm
3
Im 7
7
Im 7
5
0 0
Jm
Im+1
I
m+1
2m+1
3m+2
3m+2
2m+1
b
b
b
b
4
Im+1 5 (L
Im+1 )
(L
= L
Im+1 )
= L
2
m+1
m+1
2
I2m+1
Im
b 3m+1
= I1 Q
3m+1
m
= Pbm
I1
TABLE IX
BASE CASES FOR THE ALGORITHMS IN TABLE VII. THE T-GROUP DTT BASE CASES ARE PROVIDED IN TABLE X.
DCT-12 = F2
DST-11 = I1
DCT-22 = diag(1,
)
2
F2
DST-22 = diag( 1 , 1) F2
2
DCT-12 = DCT-12
DST-11 = I1
DCT-22 = F2
DST-22 = F2
1 1/2
1 1
3
DST-71 =
I
2 1
DCT-81 = h 23 I1 i
1
DST-82 = 1/2
1 1
DCT-72 =
DCT-72 = DCT-72
DCT-52 =
DST-71 = I1
DST-51 =
DCT-81 = Ih2
i
DCT-82 = 11 12
DCT-62 =
DST-61 =
1 1
1 1/2
3
h2 I1 i
1
1
1/2 1
3
I
2 1
DCT-52 = DCT-52
DST-51 = Ih1
i
DCT-62 = 11 21
DCT-61 = I1
question how to further decompose the skew DTTs for nontrivial sizes. This question is answered by the second identity
in Lemma 3, i): $T_n - a$ decomposes exactly as $T_n$ does, which
establishes a one-to-one correspondence between algorithms
for the DTTs in the T-group and their skew counterparts.
We will focus on the derivation of algorithms for T-group
DTTs and, due to space limitations, comment only very briefly
on the others. Additional details are in [19].
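A. T-Group DTT Algorithms
In this section, we derive Cooley-Tukey algorithms for the
four DTTs in the T-group based on the decomposition $T_n = T_k(T_m)$. These algorithms are then summarized in Table XI.
We start with a fixed DTT in the T-group with associated
algebra $\mathbb{C}[x]/T_n$ and $C$-basis$^2$ $b = (C_0, \dots, C_{n-1})$, where
$C \in \{T, U, V, W\}$ depends on the chosen DTT. We assume
$n = km$, and use the decomposition $T_n = T_k(T_m)$. The
decomposition steps (21)–(24) of Theorem 2 take the form
$$\mathbb{C}[x]/T_n = \mathbb{C}[x]/T_k(T_m) \qquad (33)$$
$$\to \bigoplus_{0 \le i < k} \mathbb{C}[x]/\big(T_m - \cos\tfrac{(i+1/2)\pi}{k}\big) \qquad (34)$$
$$\to \bigoplus_{0 \le i < k}\ \bigoplus_{0 \le j < m} \mathbb{C}[x]/(x - \alpha_{i,j}) \qquad (35)$$
$$\to \bigoplus_{0 \le i < n} \mathbb{C}[x]/\big(x - \cos\tfrac{(i+1/2)\pi}{n}\big), \qquad (36)$$
where the $\alpha_{i,j}$, $0 \le j < m$, are the zeros of $T_m - \cos\tfrac{(i+1/2)\pi}{k}$.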
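The decomposition property $T_n = T_k(T_m)$ and the grouping of the zeros that it induces can be verified with a few lines of code. This is a minimal numerical sketch; it relies only on the recurrence $T_{\ell+1} = 2xT_\ell - T_{\ell-1}$ and on the zeros $\cos((j+1/2)\pi/n)$ of $T_n$, and the function names are illustrative.

```python
import math


def cheb_T(n, x):
    """T_n(x) via the recurrence T_0 = 1, T_1 = x, T_{l+1} = 2x T_l - T_{l-1}."""
    a, b = 1.0, x
    if n == 0:
        return a
    for _ in range(n - 1):
        a, b = b, 2 * x * b - a
    return b


def group_zeros(k, m, tol=1e-9):
    """Sort the zeros of T_{km} by which factor T_m - cos((i+1/2)pi/k) they annihilate."""
    n = k * m
    targets = [math.cos((i + 0.5) * math.pi / k) for i in range(k)]
    groups = [[] for _ in range(k)]
    for j in range(n):
        alpha = math.cos((j + 0.5) * math.pi / n)    # zero of T_n
        value = cheb_T(m, alpha)                     # should equal one of the targets
        i = min(range(k), key=lambda t: abs(targets[t] - value))
        if abs(targets[i] - value) > tol:
            raise ValueError("zero does not belong to any factor")
        groups[i].append(alpha)
    return groups
```

For $k = 3$, $m = 4$, `group_zeros(3, 4)` returns three groups of four zeros each: every zero of $T_{12}$ is a zero of exactly one polynomial $T_4 - \cos((i+1/2)\pi/3)$, which is the coarse decomposition step followed by the complete decomposition of each summand.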
$$K^n_m = (I_k \oplus J_k \oplus I_k \oplus J_k \oplus \dots)\,L^n_m.$$
(39)
= Tjm = Tmj ,
p1
= Tj ,
= 2Tm pi pi1 .
pi+1
pi+1 = Tim+j ,
which is the left hand side of (39). On the other hand, using
(55) in Appendix II with Tm playing the role of x in (55)
shows that pi+1 is also the right hand side of (39), as desired.
The equations (38) and (39) define the columns of the base
change matrix, which is thus given by
(C3)
B k,m
0i<k
(40)
The question that remains is how to decompose the smaller
transforms: the skew $\text{DTT}_m$'s and the polynomial $\overline{\text{DST-3}}_k$.
However, this poses no problem. Since for any $a \in \mathbb{C}$,
$T_n - a$ decomposes exactly as $T_n$, we derive in a completely
analogous way the skew version of (40) as
$$\text{DTT}_n(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \text{DTT}_m(r_i)\Big)\big(\overline{\text{DST-3}}_k(r) \otimes I_m\big)\,\overline{B}^{(*)}_{k,m},$$
12
6
7
..
6 Im1 Jm1
7
.
6
7
6
7
.
6
7
..
1
1
2
6
7
2
6
7
..
=6
7
.
6
7
Jm1
6
7
1
6
7
2
6
7
Im1
Jm1 7
6
4
5
1
2
Im1
(41)
which generalizes (40); namely, (40) is (41) for $r = 1/2$. The
numbers $r_i$ are computed from $r$ using Lemma 1. The matrix
$K^n_m$ neither depends on the type of DTT, nor on $r$; the matrix
$\overline{B}^{(*)}_{k,m}$ does depend on the type of DTT, but not on $r$, since
the bases $b$ and $b'$ are independent of $r$.
Further, since DTTs and skew DTTs have the same scaling
function (Tables II and IV), we obtain corresponding algorithms for the polynomial version of the transforms by just
replacing each DTT by its polynomial counterpart:
$$\overline{\text{DTT}}_n(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \overline{\text{DTT}}_m(r_i)\Big)\big(\overline{\text{DST-3}}_k(r) \otimes I_m\big)\,\overline{B}^{(*)}_{k,m}.$$
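$$\text{DTT}_n(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \text{DTT}_m(r_i)\Big)\big(\text{DCT-3}_k(r) \otimes I_m\big)\,B^{(*)}_{k,m}. \qquad (43)$$
Again, $B^{(*)}_{k,m}$ only depends on the type of DTT, and not on $r$.
The polynomial version is again given by simply replacing
all DTTs by their polynomial counterparts:
$$\overline{\text{DTT}}_n(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \overline{\text{DTT}}_m(r_i)\Big)\big(\text{DCT-3}_k(r) \otimes I_m\big)\,B^{(*)}_{k,m}. \qquad (44)$$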
We mentioned above that choosing a U-basis in the subalgebra $\mathbb{C}[x]/T_k$ leads to base change matrices $\overline{B}_{k,m}$ that are
sparse. For the T-basis, this is somewhat different. In fact,
inspecting (42) shows that the inverse base change $b' \to b$,
i.e., $B_{k,m}^{-1}$, is sparse (with at most two entries in each column).
For this reason, we will also consider the inverse of (43) and
(44). The sparsity of $B_{k,m}$ depends on $k$; the best case is
$k = 2$, which is the only one we consider in this paper.
All the details are in Table XI.
T -basis inverted. To express the inverse, we need the
inverse skew DTTs (Appendix III). The inverse of (44) will
take, after minor simplifications, in each case the general form
()
Mkn
n 1
)
(Km
where
=
()
Ck,m is closely related
for the DTTs of type 2 and 4 (the inverses of the DTTs in the
T -group). See Table XI for the exact form.
Variants. The algorithms derived above can be further
manipulated to obtain variants. We saw already an example:
the inversion of (44) to obtain (45). One obvious manipulation
is transposition, which turns each T-group DTT algorithm into
an algorithm for a DCT or DST of type 2 or 4 (the transposes
of the T-group DTTs).
More interestingly, each of the above algorithms has a corresponding twiddle version, which is obtained by translating
skew DTTs into their non-skew counterparts using (59) in
Appendix III. For example, the twiddle version of (43) is given
by
$$\text{DTT}_n = K^n_m\,(I_k \otimes \text{DTT}_m)\,D_{k,m}\,(\text{DCT-3}_k \otimes I_m)\,B_{k,m}, \qquad (46)$$
where $D_{k,m} = \bigoplus_{0 \le i < k} X^{(*)}_m\big(\tfrac{i+1/2}{k}\big)$ is a direct sum of the
x-shaped matrices in (59) (Appendix III).
The twiddle version seems more appealing; however, we
will later see that, at least in the 2-power case $n = 2^k$, it
incurs a higher arithmetic cost. The reason is that skew and
non-skew DTTs can be computed with the same cost in this
case. For other sizes, the twiddle version may not incur any
penalty. Most state-of-the-art software implementations [2],
[3] fuse the twiddle factors with the subsequent loop incurred
by the tensor product anyway to achieve better locality.
Base cases. We provide the base cases for the above
algorithms for size n = 2, 3 in Table X. The size 2 cases
TABLE X
BASE CASES FOR NORMAL AND SKEW T-GROUP DTTS OF SIZE 2 AND 3.
DCT-32 = F2 diag(1, 1/ 2)
DST-32 = F2 diag(1,
i 2)
h
DCT-42 = F2
DST-42 = F2
1
1
0 2i
1
1
0 2
DCT-32 = DCT-32
DST-32 = F2 diag(1/ 2, 1)
DCT-42 = diag(cos
DST-42 = diag(sin
1 1
, sin 8 ) F2 0 2
8
i
h
, cos 8 ) F2 10 12
8
1
r
2 i
r
0 2 cos
2
h 10 2 cos
1
r
, sin r) h
2
i
1
1
r
) F2 0 2 cos r
DCT-42 (r) = diag(cos 4 , sin r
4
2
h1
i
1
, cos r
) F2 0 2 cos r
DST-42 (r) = diag(sin r
4
4
2
1
F2 ,
2 cos r
)
2
iDST-32 (r) = diag( 2 sin1 r , sin1r ) F2
2
1
1
1
1
F2 diag(
iDCT-42 (r) = 0
r
r
2 cos
2 cos
4
2
1
1
1
1
iDCT-42 (r) = 0
F2 diag(
r
r
2 sin
2 cos
4
2
1
r
2 sin
4
1
r
2 cos
4
1 0 1/2
10 1
1 0
1
01 0
1 0 1
0
10 0 3/2
1
01 1
1
0
2
DST-33 = 1 0 0
0 1 1 0 3 0
1 1 1
01
31
1 0 1
DCT-43 = 1 0 0
0 1 31 0 1 1
1 1 1
01
3+1
10 1
DST-43 = 1 0 0
01 1
0 1 3+1
DCT-33 =
DCT-33 = DCT-33
1/2
0
1
10 1
1 0 1
01 0
1 0 1
0
3/2
0
q q q 1 0 1
1 0 0
1 1 0
3
0 1 1 diag(
, 18 , 12 ) 1 0 1
DCT-43 = 0 0 1
2
010
1 1 0 0 2 1
q
q
q 1 0 1
10 0
1 1 0
3
1
1
0 1 1 diag(
1 0 1
DST-43 = 0 0 1
,
,
)
2
8
2
02 1
1 10
01 0
DST-33 =
DCT-33 (r) =
DST-33 (r) =
1 1 1
1 1 0
I1
1 0 1
1 1 1
1 1 0
I1
1 0 1
cos( 1+r
) cos( 12r
)
3
3
cos(
1r
) cos( 1+2r
)
3
3
cos( 1+r
) cos( 12r
)
3
3
) cos( 1+2r
)
cos( 1r
3
3
101
010
001
2r
, sin 2+r
)DST-33 (r)
3
3
follow from the definition; most of the size 3 cases are derived
as explained in Section V.
Special case. We briefly discuss the special case of (48) in
Table XI for DTT = DCT-3 and $k = 2$. $B^{(C3)}_{2,m}$ shown in
Table XII(c) incurs multiplications by 2, which can be fused
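TABLE XI
GENERAL RADIX COOLEY-TUKEY ALGORITHMS BASED ON $T_n = T_k(T_m)$ FOR THE DTTS IN THE T-GROUP (AND THEIR TRANSPOSES). IN EACH CASE
DTT $\in$ {DCT-3, DST-3, DCT-4, DST-4}. THE EXACT FORM OF THE OCCURRING MATRICES IS GIVEN IN TABLE XII.

U-basis:
$$\text{DTT}_{km}(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \text{DTT}_m(r_i)\Big)\big(\overline{\text{DST-3}}_k(r) \otimes I_m\big)\,\overline{B}^{(*)}_{k,m} \qquad (47)$$
T-basis:
$$\text{DTT}_{km}(r) = K^n_m \Big(\bigoplus_{0 \le i < k} \text{DTT}_m(r_i)\Big)\big(\text{DCT-3}_k(r) \otimes I_m\big)\,B^{(*)}_{k,m} \qquad (48)$$
Inverse of (48): [formula (49) lost in extraction; the surviving fragment reads $\text{DTT}^T_n = \text{iDTT}_n(1/2)$, with a direct sum over $0 \le i < k$] (49)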
TABLE XII
MATRICES USED IN TABLE XI.
(a) Permutations.
$$K^n_m = (I_k \oplus J_k \oplus I_k \oplus J_k \oplus \dots)\,L^n_m,$$
6
6
6 Im1
6
6
6
6
6
6
6
6
6
6
6
4
(C3)
B2,m
12
7
7
7
7
7
..
7
.
21
7
7,
..
7
7
.
Jm1
7
1
7
2
7
Im1
Jm1 7
5
1
..
Jm1
1
2
Im Z m
6
Im Z m
6
6
.. ..
6
6
. .
6
4
Im Z m
Im
Im Zm
6
Im Zm
6
6
.. ..
6
6
. .
6
4
Im Zm
Im
7
7
7
7,
7
7
5
(C3)
, B
(S3)
, B
(C4)
, B
(S4)
3 2
Im Jm
7 6
Im Jm
7 6
7 6
.. ..
7, 6
7 6
. .
7 6
5 4
Im Jm
Im
3 2
I m Jm
7 6
I m Jm
7 6
7 6
.. ..
7,6
7 6
.
.
7 6
5 4
I m Jm
Im
Im
Zm
I
(S3)
(C4)
I
Zm
,
B2,m = m
,
B2,m = m
Im
2Im
Jm
,
2Im
(S4)
B2,m =
Im
7
7
7
7.
7
7
5
Jm
.
2Im
(d) Base change matrices for (49); from left to right: C (C3) , C (S3) , C (C4) , C (S4) .
2
3
6
6
6
6
6
6
6
6
6
6
6
6
6
6
6
4
Im1
Im1 Jm1
n )1 = Ln (I J I J . . . )
Mkn = (Km
k
k
k
k k
1
2
Im1
12
7
7
7
7
.
Jm1
7
7
7
.
.
17
1
.
2
27,
7
..
7
7
.
Jm1
7
1
7
7
2
5
I
..
I m Jm
6
I m Jm
6
6
.. ..
6
6
.
.
6
4
I m Jm
Im
3 2
Im Jm
7 6
Im Jm
7 6
7 6
.. ..
7, 6
7 6
. .
7 6
5 4
Im Jm
Im
7
7
7
7.
7
7
5
m1
VII. ANALYSIS
In this section we analyze the algorithms presented in this
paper in Tables VII and XI with respect to arithmetic cost and
where
E2,m =
Im
Zm
.
cos 2r (I1 2Im1 )
(51)
TABLE XIII
ALGORITHM FOR $\text{DCT-3}_5$ WITH COST (12, 6, 1). TRANSPOSITION YIELDS A $\text{DCT-2}_5$ ALGORITHM OF EQUAL COST.
2
0
60
6
61
40
0
1
0
0
0
0
0
0
0
0
1
0
1
0
0
0
3
0
07
`
7
07 I1 F2 diag(1, cos
5
1
0
)
5
F2 diag(1, cos
3
)
5
I2
I2
diag(cos 5 , 2 cos
diag(cos 3
, 2 cos
5
61
)
6
5
60
3
)
40
5
0
0
1
0
0
1
1/2
0
1
0
0
0
0
0
1
3
1
07
7
07
15
0
TABLE XIV
EXAMPLES OF COOLEY-TUKEY TYPE ALGORITHMS FOR DTTS IN THE U/V/W-GROUP, BASED ON THE RESPECTIVE DECOMPOSITIONS
$U_{km-1} = U_{k-1}(T_m) \cdot U_{m-1}$, $V_{(k-1)/2+km} = V_{(k-1)/2}(T_{2m+1}) \cdot V_m$, $W_{(k-1)/2+km} = W_m \cdot W_{(k-1)/2}(T_{2m+1})$. THE POLYNOMIAL VERSIONS
ARE OBTAINED BY REPLACING ALL TRANSFORMS BY THEIR POLYNOMIAL COUNTERPARTS.
DST-1km1
DST-7km+(k1)/2
DST-5km+(k1)/2
=
=
(S1)
Pk,m
(S7)
Pk,m
(S5)
Pk,m
0<i<k
(S1)
DST-3m ( ki ) (DST-1k1 Im ) DST-1m1 Bk,m
0i< k1
2
(S7)
I
)
DST-7
)
(DST-7
Bk,m
DST-32m+1 ( 2i+1
k1
m
2m+1
k
M
0i< k1
2
1
2
1
2
log3 (n) + 74 ,
(53)
DST-5m
(52)
7
2
(S5)
DST-32m+1 ( 2i+2
) (DST-5 k1 I2m+1 ) Bk,m
k
(54)
REFERENCES
[1] M. Frigo and S. G. Johnson, "FFTW: An adaptive software architecture for the FFT," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1998, vol. 3, pp. 1381–1384.
[2] M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, vol. 93, no. 2, 2005, special issue on "Program Generation, Optimization, and Adaptation."
[3] M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gačić, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proceedings of the IEEE, vol. 93, no. 2, pp. 232–275, 2005, special issue on "Program Generation, Optimization, and Adaptation."
[4] "FFTW web site," 1998, www.fftw.org.
[5] F. Franchetti and M. Püschel, "A SIMD vectorizing compiler for digital signal processing algorithms," in Proc. IPDPS, 2002.
[6] F. Franchetti, Y. Voronenko, and M. Püschel, "FFT program generation for shared memory: SMP and multicore," in Proc. Supercomputing (SC), 2006.
[7] "Spiral web site," 1998, www.spiral.net.
[8] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: Foundation and 1-D time," submitted for publication, part of [10].
[9] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: 1-D space," submitted for publication, part of [10].
[10] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory," available at http://arxiv.org/abs/cs.IT/0612077, parts of this manuscript are submitted as [8] and [9].
[11] M. Püschel and J. M. F. Moura, "The algebraic approach to the discrete cosine and sine transforms and their fast algorithms," SIAM Journal on Computing, vol. 32, no. 5, pp. 1280–1316, 2003.
[12] M. Püschel, "Cooley-Tukey FFT like algorithms for the DCT," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2003, vol. 2, pp. 501–504.
[13] Y. Voronenko and M. Püschel, "Algebraic derivation of general radix Cooley-Tukey algorithms for the real discrete Fourier transform," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2006, vol. 3, pp. 876–879.
[14] M. Püschel and M. Rötteler, "Algebraic signal processing theory: Cooley-Tukey type algorithms on the 2-D hexagonal spatial lattice," Applicable Algebra in Engineering, Communication and Computing, to appear.
[15] M. Püschel and M. Rötteler, "Algebraic signal processing theory: 2-D hexagonal spatial lattice," IEEE Transactions on Image Processing, vol. 16, no. 6, pp. 1506–1521, 2007.
[16] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. on Communications, vol. COM-25, no. 9, pp. 1004–1009, 1977.
[17] P. P. N. Yang and M. J. Narasimha, "Prime factor decomposition of the discrete cosine transform," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1985, pp. 772–775.
[18] P. Duhamel and H. H'Mida, "New $2^n$ algorithms suitable for VLSI implementations," in Proc. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1987, pp. 1805–1808.
[19] M. Püschel and J. M. F. Moura, "Algebraic signal processing theory: Cooley-Tukey type algorithms for DCTs and DSTs," extended version of this paper; available at http://arxiv.org/abs/cs.IT/0702025.
[20] L. Auslander, E. Feig, and S. Winograd, "Abelian semi-simple algebras and algorithms for the discrete Fourier transform," Advances in Applied Mathematics, vol. 5, pp. 31–55, 1984.
[21] P. J. Nicholson, "Algebraic theory of finite Fourier transforms," Journal of Computer and System Sciences, vol. 5, pp. 524–547, 1971.
[22] Th. Beth, Verfahren der Schnellen Fouriertransformation [Methods for the Fast Fourier Transform], Teubner, 1984.
[23] S. Winograd, "On the multiplicative complexity of the discrete Fourier transform," Advances in Mathematics, vol. 32, pp. 83–117, 1979.
[24] L. Auslander, E. Feig, and S. Winograd, "The multiplicative complexity of the discrete Fourier transform," Advances in Applied Mathematics, vol. 5, pp. 87–109, 1984.
[25] H. W. Johnson and C. S. Burrus, "On the structure of efficient DFT algorithms," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-33, no. 1, pp. 248–254, 1985.
[26] M. T. Heideman and C. S. Burrus, "On the number of multiplications necessary to compute a length-$2^n$ DFT," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-34, no. 1, pp. 91–95, 1986.
[27] H. J. Nussbaumer, Fast Fourier Transformation and Convolution Algorithms, Springer, 2nd edition, 1982.
TABLE XV
ARITHMETIC COSTS ACHIEVABLE FOR THE 16 DTTS WITH THE ALGORITHMS IN THIS PAPER.
(a) T-group DTTs of 2-power and 3-power sizes n. All the 3-power size costs can be slightly improved upon (see Section VII). For 2-power sizes
n, the polynomial versions of the type 4 DTTs require n multiplications less and the polynomial DST-3 requires n/2 multiplications less.

Transform       Total cost, 2-power n               Total cost, 3-power n
DCT-3_n         2n \log_2(n) - n + 1                4n \log_3(n) - 3n + 3
DST-3_n         2n \log_2(n) - n + 1                4n \log_3(n) - 3n + 3
DCT-4_n         2n \log_2(n) + n                    4n \log_3(n) - n + 2
DST-4_n         2n \log_2(n) + n                    4n \log_3(n) - n + 2
DCT-3_n(r)      2n \log_2(n) - n + 1                4n \log_3(n) - n + 1
DST-3_n(r)      2n \log_2(n) - \frac{1}{2} n + 1    4n \log_3(n) + 1
DCT-4_n(r)      2n \log_2(n) + n                    4n \log_3(n) + n
DST-4_n(r)      2n \log_2(n) + n                    4n \log_3(n) + n
(b) U /V /W -group DTTs. The size of DCT-1 is n = 2k + 1, the size of DST-1 is n = 2k 1, the sizes of DCT-2 and DST-2 is n = 2k , the size
of DCT-5, DCT-6, DCT-7, DST-8 is n = (3k + 1)/2, and the size of DST-5, DST-6, DST-7, DCT-8 is n = (3k 1)/2. The polynomial DTTs of
type 2 are n 1 multiplications cheaper.
Transform
DCT-1n
( 23 n log2 (n 1) 2n 12 log2 (n 1) + 6,
1
n log2 (n 1) n 21 log2 (n 1) + 2,
2
( 23 n log2 (n + 1) 2n + 52 log2 (n + 1) + 2,
1
n log2 (n + 1) n + 21 log2 (n + 1), 0)
2
3
( 2 n log2 (n) n + 1, 21 n log2 (n), 0)
DST-1n
DCT-2n
DST-2n
0)
same as DCT-2
DCT-8n
DST-8n
DCT-5n
DST-5n
DCT-6n
DST-6n
same
same
same
same
DCT-7n
DST-7n
as
as
as
as
DCT-7
DST-7
DCT-7
DST-7
Total cost
Achieved by
2n log2 (n 1) 3n
log2 (n 1) + 8
2n log2 (n + 1) 3n
+3 log2 (n + 1) + 2
2n log2 (n) n + 1
2n log2 (n) n + 1
Table VII(a)
Table VII(a)
Table VII(a), (48)T = (50)T , (49)
Table VII(a), duality (57)T
4n log3 (2n 1) 5n + 5
Table VII(c)
4n log3 (2n + 1) 4n
+ log3 (2n + 1)
Table VII(c)
duality (57)
duality (57)
Table VII(d) and its transpose
Table VII(d) and its transpose
duality (57), Table VII(c)T
duality (57), Table VII(c)T
[36]
[37]
[38]
[39]
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
[58]
APPENDIX I
CHINESE REMAINDER THEOREM
Let $\mathbb{C}[x]/p(x)$ be a polynomial algebra (see Section II)
and assume that $p(x) = q(x) r(x)$ factors into two coprime
polynomials, i.e., $\gcd(q, r) = 1$. Then the Chinese remainder
theorem (for polynomials) states that
$$\phi:\ \mathbb{C}[x]/p(x) \to \mathbb{C}[x]/q(x) \oplus \mathbb{C}[x]/r(x), \qquad s(x) \mapsto (s(x) \bmod q(x),\ s(x) \bmod r(x))$$
is an isomorphism of algebras. In particular,
$$\phi(s + s') = \phi(s) + \phi(s'), \qquad \phi(s \cdot s') = \phi(s) \cdot \phi(s').$$
Informally, this means that computing in $\mathbb{C}[x]/p(x)$ and elementwise computing in $\mathbb{C}[x]/q(x) \oplus \mathbb{C}[x]/r(x)$ is equivalent.
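The homomorphism property can be tried out on a small example. The sketch below represents polynomials as coefficient lists (lowest degree first) and assumes monic moduli with integer coefficients so that all arithmetic stays exact; the helper names and the example $p = x^4 - 1 = (x^2 - 1)(x^2 + 1)$ are chosen here for illustration.

```python
def poly_add(a, b):
    """Coefficient-wise sum of two polynomials (lists, lowest degree first)."""
    n = max(len(a), len(b))
    return [(a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
            for i in range(n)]


def poly_mul(a, b):
    """Product of two polynomials by direct convolution of coefficients."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_mod(a, p):
    """Remainder of a modulo a monic polynomial p, padded to length deg(p)."""
    if p[-1] != 1:
        raise ValueError("modulus must be monic")
    a = list(a)
    d = len(p) - 1
    for i in range(len(a) - 1, d - 1, -1):   # eliminate the leading terms
        c = a[i]
        if c:
            for j in range(d + 1):
                a[i - d + j] -= c * p[j]
    a = a[:d]
    return a + [0] * (d - len(a))


def phi(s, q, r):
    """The CRT map s -> (s mod q, s mod r)."""
    return poly_mod(s, q), poly_mod(s, r)


# Example moduli: q = x^2 - 1, r = x^2 + 1, p = q * r = x^4 - 1.
Q = [-1, 0, 1]
R = [1, 0, 1]
P = poly_mul(Q, R)
```

With `s = [1, 2, 3, 4]` and `t = [5, 0, -1, 2]`, reducing the product `poly_mod(poly_mul(s, t), P)` and then applying `phi` gives the same pair as multiplying the components of `phi(s, Q, R)` and `phi(t, Q, R)` modulo `Q` and `R`, respectively.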
APPENDIX II
CHEBYSHEV POLYNOMIALS
Chebyshev polynomials are a special class of orthogonal
polynomials and play an important role in many mathematical areas. Excellent introductory books are [57], [47], [58].
TABLE XVI
FOUR SERIES OF CHEBYSHEV POLYNOMIALS. THE RANGE FOR THE ZEROS IS 0 \le k < n. IN THE TRIGONOMETRIC CLOSED FORM, \cos\theta = x.

        n = 0, 1      closed form                                      symmetry                zeros
T_n     1, x          \cos(n\theta)                                    T_{-n} = T_n            \cos\frac{(k+1/2)\pi}{n}
U_n     1, 2x         \frac{\sin((n+1)\theta)}{\sin\theta}             U_{-n} = -U_{n-2}       \cos\frac{(k+1)\pi}{n+1}
V_n     1, 2x - 1     \frac{\cos((n+1/2)\theta)}{\cos(\theta/2)}       V_{-n} = V_{n-1}        \cos\frac{(k+1/2)\pi}{n+1/2}
W_n     1, 2x + 1     \frac{\sin((n+1/2)\theta)}{\sin(\theta/2)}       W_{-n} = -W_{n-1}       \cos\frac{(k+1)\pi}{n+1/2}
TABLE XVII
IDENTITIES AMONG THE FOUR SERIES OF CHEBYSHEV POLYNOMIALS; C_n HAS TO BE REPLACED BY T_n, U_n, V_n, W_n TO OBTAIN ROWS 1, 2, 3, 4,
RESPECTIVELY.

C_n     C_n - C_{n-2}            C_n - C_{n-1}          C_n + C_{n-1}
T_n     2(x^2 - 1) U_{n-2}       (x - 1) W_{n-1}        (x + 1) V_{n-1}
U_n     2 T_n                    V_n                    W_n
V_n     2(x - 1) W_{n-1}         2(x - 1) U_{n-1}       2 T_n
W_n     2(x + 1) V_{n-1}         2 T_n                  2(x + 1) U_{n-1}
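The entries of Tables XVI and XVII lend themselves to a numerical sanity check. The sketch below generates all four series from their initial values with the shared recurrence $C_{n+1} = 2xC_n - C_{n-1}$ and compares them with the trigonometric closed forms; it is an illustration with hypothetical function names, not part of the paper.

```python
import math

# Initial values (C_0, C_1) of the four series, as listed in Table XVI.
INIT = {
    "T": lambda x: (1.0, x),
    "U": lambda x: (1.0, 2 * x),
    "V": lambda x: (1.0, 2 * x - 1),
    "W": lambda x: (1.0, 2 * x + 1),
}


def cheb(kind, n, x):
    """C_n(x) for C in {T, U, V, W} via C_{n+1} = 2x C_n - C_{n-1}."""
    c0, c1 = INIT[kind](x)
    if n == 0:
        return c0
    for _ in range(n - 1):
        c0, c1 = c1, 2 * x * c1 - c0
    return c1


def closed_form(kind, n, theta):
    """Trigonometric closed form of C_n(cos(theta)) from Table XVI."""
    if kind == "T":
        return math.cos(n * theta)
    if kind == "U":
        return math.sin((n + 1) * theta) / math.sin(theta)
    if kind == "V":
        return math.cos((n + 0.5) * theta) / math.cos(theta / 2)
    if kind == "W":
        return math.sin((n + 0.5) * theta) / math.sin(theta / 2)
    raise ValueError("kind must be one of T, U, V, W")


def identity_residuals(n, x):
    """Residuals of three identities from Table XVII (all should be near zero)."""
    return [
        cheb("T", n, x) - cheb("T", n - 2, x) - 2 * (x * x - 1) * cheb("U", n - 2, x),
        cheb("U", n, x) - cheb("U", n - 1, x) - cheb("V", n, x),
        cheb("V", n, x) + cheb("V", n - 1, x) - 2 * cheb("T", n, x),
    ]
```

For instance, `cheb("V", 5, math.cos(0.7))` and `closed_form("V", 5, 0.7)` agree up to rounding, and `identity_residuals(6, 0.3)` returns three values at rounding-error level.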
(55)
(56)
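TABLE XVIII
THE MATRICES $X_n^{(C3)}(r)$, $X_n^{(S3)}(r)$, $X_n^{(C4)}(r)$, $X_n^{(S4)}(r)$. WE USE $c_\ell = \cos\big((1/2 - r)\ell\pi/n\big)$, $s_\ell = \sin\big((1/2 - r)\ell\pi/n\big)$,
$c'_\ell = \cos\big((1/2 - r)(2\ell + 1)\pi/(2n)\big)$, AND $s'_\ell = \sin\big((1/2 - r)(2\ell + 1)\pi/(2n)\big)$. WHERE THE DIAGONALS CROSS, THE ELEMENTS ARE ADDED.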
2 1
6 0
6 .
6 .
6 .
6
6 .
4 .
.
0
0
c1
..
..
..
.
..
.
.
s1
3 2
sn1 7
7
7
7,
7
7
5
cn1
6
6
6
6
6
6
4
c1
..
.
s1
0
..
.
..
.
..
sn1
.
cn1
0
0
.
..
.
..
0
cn
3 2
c
7 6 0
7 6
7 6
7, 6
7 6
7 4
5
s0
(57)
sn1
..
.
.
..
..
..
.
.
cn1
3 2
7
7
7
7,
7
5
6
6
6
6
6
4
c0
..
.
..
..
.
..
s0
sn1
.
cn1
3
7
7
7
7
7
5
example:
1
=
Xn(C3) (r)
cos(1/2 r)
cn
0
..
.
..
.
0
0
cn1
..
..
..
.
s1
.
..
.
.
0
sn1
c1
()
= DTTn Xn (r),
()
= DTTn Xn (r).
and
(59)
()
(r). (60)
The matrices $X_n^{(*)}(r)^{-1}$ have the same x-shaped structure
and the same arithmetic complexity as $X_n^{(*)}(r)$ and can
be readily computed because of their block structure. For
Markus Püschel (M'99–SM'05) is an Associate
Research Professor of Electrical and Computer Engineering at Carnegie Mellon University. He received his Diploma (M.Sc.) in Mathematics and his
Doctorate (Ph.D.) in Computer Science, in 1995
and 1998, respectively, both from the University
of Karlsruhe, Germany. From 1998 to 1999 he was a
Postdoctoral Researcher at Mathematics and Computer Science, Drexel University. Since 2000 he
has been with Carnegie Mellon University. He is
an Associate Editor for the IEEE Transactions on
Signal Processing, and was an Associate Editor for the IEEE Signal Processing Letters and a Guest Editor of the Journal of Symbolic Computation
and the Proceedings of the IEEE. His research interests include signal
processing theory/software/hardware, scientific computing, compilers, applied
mathematics and algebra.