DSP Implementation of Cholesky Decomposition

Perttu Salmela, Aki Happonen, Tuomas Järvinen, Adrian Burian, and Jarmo Takala
Tampere University of Technology, Tampere, Finland: {perttu.salmela, jarmo.takala}@tut.fi
Nokia, Finland: {aki.p.happonen, tuomas.jarvinen, adrian.burian}@nokia.com
Abstract
Both matrix inversion and the solution of a set of linear
equations can be computed with the aid of the Cholesky
decomposition. In this paper, the Cholesky decomposition
is mapped to the typical resources of digital signal
processors (DSPs), and our implementation applies a
novel way of computing the fixed-point inverse square
root function. The presented principles result in savings
in the number of clock cycles. As a result, the Cholesky
decomposition can be incorporated in applications such
as a 3G channel estimator, where short execution time is
crucial.
1. Introduction
Data rates in wireless telecommunications are continuously
increasing, although the channel remains the
same and suffers from multipath fading and intersymbol
interference. For these reasons, it is crucial to apply
sophisticated channel estimation and equalization methods.
Linear minimum mean square error (LMMSE)
estimation is such a method, and it has been proposed for
universal mobile telecommunications system (UMTS)
receivers [1]. LMMSE estimation is inherently a demanding
operation, as it requires complex matrix inversion. In this
paper, the Cholesky decomposition is studied, as
it can be used efficiently for matrix inversion and for
solving linear systems.
For high-level software (SW) implementations, reference
C code can be found in [2]. There also exist
intellectual property (IP) cores for the Cholesky
decomposition [3]. In addition, there are SW implementations
in libraries such as PLAPACK, for parallel linear algebra
algorithms on distributed-memory supercomputers [4], and
LAPACK, which is designed for vector computers with
shared memory [5]. However, they are not targeted at
embedded systems relying on DSPs.
In this paper, the Cholesky decomposition is implemented
as a SW routine running on TI's C55x processor
[6]. The algorithm is analyzed and the critical parts are
identified. Their computation is alleviated by substituting
less complex operations, by efficient addressing
and ordering of computations, by avoiding intervening
control operations, and by single-cycle execution of the
innermost kernel computation. In addition, a novel routine for
computing the inverse square root function is applied.
Along with these principles, the word length and numerical
range of the fixed-point number system are considered.
The developed implementation is evaluated in terms of
savings in clock cycles and suitability under strict time
limits, i.e., one matrix decomposition in one UMTS
time slot, $T_s = 0.667$ ms. The obtained results justify
the proposed SW implementation.
2. Cholesky Decomposition
The Cholesky decomposition transforms a positive definite
matrix $A$ into the form $A = LL^T$, where $L$ is a lower
triangular matrix. After the Cholesky decomposition,
inverting the triangular $L$ is easy. The inverse of the
original $A$ is obtained as $A^{-1} = (L^{-1})^T L^{-1}$.
When $L$ is given, $Ax = b$ can also be solved easily.
First, $Ly = b$ is solved. Next, $L^T x = y$ is solved.
Since both systems are triangular, solving begins with a row
having one unknown. It is solved and the value is substituted
into the next row having two unknowns, and the process
continues until the last row is solved.
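The two triangular solves can be sketched in a few lines. This is a minimal floating-point illustration, not the paper's DSP code; the function name `solve_with_cholesky` and the 2 × 2 example are assumptions made here. Because $A = LL^T$, the system $Ax = b$ splits into a forward substitution $Ly = b$ followed by a back substitution $L^T x = y$.

```python
def solve_with_cholesky(L, b):
    """Solve A x = b given the lower-triangular factor L of A = L L^T."""
    n = len(L)
    # Forward substitution: L y = b, starting from the row with one unknown.
    y = [0.0] * n
    for i in range(n):
        s = sum(L[i][k] * y[k] for k in range(i))
        y[i] = (b[i] - s) / L[i][i]
    # Back substitution: L^T x = y, starting from the last row.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(L[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / L[i][i]
    return x


# Hypothetical example: A = [[4, 2], [2, 5]] has the factor L below.
L_example = [[2.0, 0.0], [1.0, 2.0]]
x_example = solve_with_cholesky(L_example, [8.0, 14.0])
```

Each row introduces exactly one new unknown, which is why no matrix inversion is needed once $L$ is available.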
The matrix $L$ can be computed with
$$ l_{ii} = \sqrt{a_{ii} - \sum_{k=1}^{i-1} l_{ik}^2} \qquad (1) $$
for diagonal elements and
$$ l_{ij} = \frac{1}{l_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} l_{ik} l_{jk} \right) \qquad (2) $$
for non-diagonal elements. Both (1) and (2) show that
the matrix $L$ can overwrite the matrix $A$, which saves
memory. Pseudo-code for the Cholesky
decomposition is given in Fig. 1.
3. Software Implementation
To obtain high throughput, the implementation must
maintain smooth program flow and high utilization of
the processor resources.
3.1. Avoidance of Division Operation
The algorithm in Fig. 1 includes a division operation.
Division with a SW subroutine would take
several clock cycles. In contrast, a multiplication
typically takes only one clock cycle in DSPs.
The algorithm in Fig. 1 shows that the divisor is a
diagonal element $l_{jj}$ of the matrix $L$. The value of
the diagonal element is the square root of the original
diagonal element $a_{jj}$ after a cumulative sum of
products has been subtracted from it. Thus, there is a
division by $\sqrt{x}$, which can be replaced with
multiplication by $1/\sqrt{x}$. A novel way of computing
the fixed-point $1/\sqrt{x}$ is presented in Section 4 of
this paper. When the square root $\sqrt{x}$
1-4244-0368-5/06/$20.00 © 2006 IEEE
for j := 1 step 1 until N do
  for i := j step 1 until N do
  begin
    x := a_{ij}
    for k := j - 1 step -1 until 1 do
      x := x - l_{ik} l_{jk}
    if i = j then
      l_{ij} := sqrt(x)
    else
      l_{ij} := x / l_{jj}
  end
Fig. 1: Cholesky decomposition algorithm.
is required for the computation of the diagonal elements, it
can be computed easily since $\sqrt{x} = x / \sqrt{x}$, i.e.,
computing $\sqrt{x}$ with the aid of the $1/\sqrt{x}$ function
requires one multiplication.
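The substitution described above can be illustrated with a short floating-point sketch of the algorithm in Fig. 1. It is not the authors' DSP routine: the name `cholesky_inv_sqrt` is hypothetical, and `1.0 / math.sqrt(x)` merely stands in for the fixed-point $1/\sqrt{x}$ routine of Section 4. The point is that one reciprocal square root per column replaces every division.

```python
import math


def cholesky_inv_sqrt(a):
    """In-place Cholesky; the lower triangle of a (list of rows) becomes L."""
    n = len(a)
    for j in range(n):
        invs = 0.0                          # 1/sqrt(x) shared by column j
        for i in range(j, n):
            x = a[i][j]
            for k in range(j - 1, -1, -1):  # accumulated sum of products
                x -= a[i][k] * a[j][k]
            if i == j:
                invs = 1.0 / math.sqrt(x)   # stand-in for the 1/sqrt(x) routine
                a[i][j] = x * invs          # sqrt(x) = x * (1/sqrt(x))
            else:
                a[i][j] = x * invs          # multiplication replaces x / l_jj
    return a


# Hypothetical example matrix A = [[4, 2], [2, 5]].
a_example = cholesky_inv_sqrt([[4.0, 2.0], [2.0, 5.0]])
```

Only the lower triangle of the result is meaningful; the upper triangle keeps the original elements of $A$.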
3.2. Q4.11 Fixed-Point Arithmetic
The Cholesky decomposition is numerically stable
in the sense that it preserves the subunitary range of the
matrix elements. Therefore, a fixed-point representation with
only fractional bits would be sufficient for the elements.
Unfortunately, the intermediate computations do not fit in the
subunitary range, and the native Q.15 numbers, which have 15
fractional bits and a sign bit, cannot be used.
Based on fixed-point simulations, the Q4.11 fixed-point
number system is chosen, i.e., the maximum value
of the integer part is $2^4 = 16$ and the accuracy is $2^{-11}$.
The use of a non-native number system is the only
drawback of the presented implementation. Due to the
non-native position of the binary point, the results of
the multiplications must be shifted explicitly.
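The explicit shift can be made concrete with a small integer model. This is a sketch under assumptions: Q4.11 is taken to be a 16-bit word with one sign bit, four integer bits, and 11 fractional bits; the helper names are hypothetical; and overflow or saturation is not modeled, since Python integers are unbounded.

```python
FRAC_BITS = 11  # assumed Q4.11 layout: 1 sign, 4 integer, 11 fractional bits


def to_q4_11(value):
    """Quantize a real number to a Q4.11 integer (round to nearest)."""
    return int(round(value * (1 << FRAC_BITS)))


def from_q4_11(q):
    """Convert a Q4.11 integer back to a real number."""
    return q / (1 << FRAC_BITS)


def mul_q4_11(a, b):
    """Multiply two Q4.11 numbers; the Q8.22 product is shifted back explicitly."""
    return (a * b) >> FRAC_BITS


product = mul_q4_11(to_q4_11(1.5), to_q4_11(2.25))
```

The product of two numbers with 11 fractional bits has 22 fractional bits, so a right shift by 11 restores the binary point.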
3.3. Schedule and Addressing
The computation order is illustrated in Fig. 2(a). The
first element, $l_{11}$, and the rest of the first column,
$l_{i1}$, $i = 2, 3, \ldots, N$, are computed as a special case,
since the accumulated sum of products is zero for each element.
After the element $l_{22}$, the nested loops are entered and
the computation traverses the lower triangle, always first
downwards and then to the next column. With this
order, all the elements in the same column can easily
share the same value of the $1/\sqrt{x}$ function.
As a second benefit, the address generation is simplified.
Addressing a memory location naively can require
one multiplication and two additions in the worst case,
i.e., $\mathrm{addr}_{i,j} = A_{\mathrm{base}} + N \cdot i + j$.
A simpler way is to use a pointer which is updated as the
matrix is traversed. Thus, the addressing is decoupled from
the loop iterator variable, which makes hardware looping easier.
Fig. 2(b) shows the elements accessed
when computing the accumulated sum of products for $l_{64}$.
Fig. 2(b) and the matrix element placement in Fig. 2(c) show
that the elements can be addressed with pointers, which
are updated with auto-decrement. Thus, there is no
overhead of complex address generation in the innermost
loop. When the computation continues to the next
[Fig. 2, panels (a) and (b): an $8 \times 8$ matrix in which the lower triangle, including the diagonal, holds the computed elements $l_{ij}$ and the upper triangle holds the original elements $a_{ij}$. Panel (a) carries the step labels 1 to 10; panel (b) marks the elements accessed for $l_{64}$. Panel (c): an $N \times N$ array of memory addresses in which row $r$, $r = 0, \ldots, N-1$, occupies the addresses $rN, rN+1, \ldots, (r+1)N-1$, ending at $N^2 - 1$.]
Fig. 2: a) Computation order during the decomposition.
b) Accessed elements when computing the accumulated
sum for $l_{64}$. c) Matrix element placement in a
one-dimensional array, which stores the matrix row-wise.
column, the first element is on the diagonal. It is efficient
to have a separate pointer for the diagonal elements. When
the diagonal is accessed, its address is incremented by
$N + 1$, as indicated by Fig. 2(c). Next, the address of
the diagonal element is used as the new base address.
The algorithm in Fig. 1 tests whether a diagonal element
is accessed. However, such a test can be avoided:
when the middle loop terminates, the next element
is always a diagonal one. Avoiding the test is beneficial,
since a conditional expression can lead to a jump, which
disturbs the pipeline and causes extra latency.
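The addressing scheme of this subsection can be sketched on a flat, row-major array. The code below is an illustrative floating-point model, not the C55x routine: the name `cholesky_flat` is hypothetical and `1.0 / math.sqrt(x)` stands in for the fixed-point routine. Addresses are produced only by pointer increments and decrements, the diagonal pointer advances by $N + 1$, and the diagonal element is handled before the column loop so that no `i == j` test is needed.

```python
import math


def cholesky_flat(a, n):
    """In-place Cholesky on a row-major flat list a of length n * n."""
    diag = 0                              # address of l_jj
    for j in range(n):
        # Diagonal element: subtract the squares left of the diagonal.
        x = a[diag]
        p = diag - 1
        for _k in range(j):
            x -= a[p] * a[p]
            p -= 1                        # auto-decrement
        invs = 1.0 / math.sqrt(x)         # stand-in for the 1/sqrt(x) routine
        a[diag] = x * invs
        # Rest of column j, walking downwards in steps of n.
        elem = diag + n
        for _row in range(n - 1 - j):
            x = a[elem]
            pi, pj = elem - 1, diag - 1   # l_{i,j-1} and l_{j,j-1}
            for _k in range(j):
                x -= a[pi] * a[pj]
                pi -= 1
                pj -= 1
            a[elem] = x * invs
            elem += n
        diag += n + 1                     # next diagonal element
    return a


# Hypothetical example: A = L L^T with L = [[2,0,0],[1,2,0],[3,1,1]].
flat = cholesky_flat([4.0, 2.0, 6.0, 2.0, 5.0, 5.0, 6.0, 5.0, 11.0], 3)
```

No index multiplication of the form $N \cdot i + j$ appears anywhere in the loops, which is the property the text relies on for hardware looping.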
3.4. Smooth Looping
The algorithm in Fig. 1 uses three loops. A simple
compilation strategy for a loop is to use two conditional
jump instructions. The first one skips the whole loop if
the number of iterations is zero. The second one
jumps back to the beginning of the loop
body if there are iterations left. The updating of the
loop iterator is included in the loop body.
The described loop compilation strategy is general,
but it does not utilize the HW looping capabilities of DSPs
and does not achieve a low enough cycle count. With HW
looping, the DSP updates the loop iterator implicitly and
[Fig. 3: block diagrams of the multiply-accumulate kernels for (a) a non-diagonal element $l_{ij}$, using $a_{ij}$ and the stored value invs, and (b) a diagonal element $l_{ii}$, using $a_{ii}$ and the $1/\sqrt{x}$ function.]
Fig. 3: (a) Computation of non-diagonal elements.
The previously computed value of the $1/\sqrt{x}$ function is
accessed through the variable invs. (b) Computation of diagonal elements.
determines when all the iterations have been executed.
Furthermore, the HW loop does not introduce the additional latency
of jump instructions. The DSP applied in this study is
capable of two nested HW loops [7]. To utilize the obvious
benefits of HW looping, the loops in the Cholesky
decomposition algorithm are rewritten in such a way
that they are iterated at least once and unnecessary
accesses to the iterator variables are avoided.
3.5. Kernel Functions
The innermost loop of the Cholesky decomposition computes
a cumulative sum of products. Thus, it lends itself
to multiply-and-accumulate (MAC) operations. The
computation of a non-diagonal element is illustrated
in Fig. 3(a). This figure also shows how the previously
computed value of the $1/\sqrt{x}$ function is applied. The
diagonal elements are computed as shown in Fig. 3(b).
In this case, the $1/\sqrt{x}$ function must be computed and its
value is saved for later use. The memory access
scheme also differs, as both operands of each product are the
same, i.e., a square is computed.
Even though the result of multiplying two Q4.11
fixed-point numbers needs to be scaled, the intermediate
results of the accumulated sum of products do not need to
be scaled, as they fit into the accumulator. The accumulated
sum can remain in Q8.22 format, and only the final result
needs to be scaled back to Q4.11 format.
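The scaling argument can be illustrated with an integer model of the accumulation. This is a sketch with assumed names (`mac_q4_11`, `FRAC_BITS`); the finite width of the real accumulator is not modeled because Python integers are unbounded. The element $a_{ij}$ is promoted to Q8.22 once, the products are accumulated without any per-product shift, and a single shift returns the result to Q4.11.

```python
FRAC_BITS = 11  # fractional bits of the assumed Q4.11 format


def mac_q4_11(a_ij, row_i, row_j):
    """Return a_ij - sum(row_i[k] * row_j[k]) for Q4.11 integer inputs."""
    acc = a_ij << FRAC_BITS        # promote a_ij to Q8.22
    for li, lj in zip(row_i, row_j):
        acc -= li * lj             # Q4.11 * Q4.11 = Q8.22, no shift here
    return acc >> FRAC_BITS        # single scaling back to Q4.11


# 1.0 - 0.5 * 0.5 = 0.75, with 1.0 = 2048 and 0.5 = 1024 in Q4.11.
result = mac_q4_11(2048, [1024], [1024])
```

Deferring the shift keeps the inner loop a pure MAC and avoids rounding each product separately.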
4. Computation of $1/\sqrt{x}$
The computation of $1/\sqrt{x}$ is demanding because
the inverse and square root functions are highly non-linear.
Methods such as digit recurrence [8], piecewise
linear approximation, Taylor series expansion, and
polynomial approximation have been developed to compute
$1/\sqrt{x}$. In [9], a convenient SW method is
explained, but it relies on floating-point numbers.
In this study, a novel method presented in [10] is
used. In [10], a HW implementation was targeted, but
the method also lends itself to a SW implementation.
The main benefits of the used $1/\sqrt{x}$ function are a small
number of clock cycles and the avoidance of look-up tables.
4.1. Initialization Algorithm
The method is based on substituting the highly non-linear
$1/\sqrt{x}$ with the less non-linear $1/\sqrt{1+y}$. This is
justified, since a positive subunitary $x$ having $\lambda$ leading
zeros can be presented as
$$ x = \underbrace{0.00\ldots0}_{\lambda}1y \qquad (3) $$
and
$$ x \cdot 2^{\lambda} = 1.y = 2^0 + y = 1 + y, \qquad (4) $$
where $y$ is also a positive subunitary number. Thus,
$$ 1/\sqrt{x} = 2^{\lambda/2} / \sqrt{1+y}. \qquad (5) $$
Depending on the remainder of $\lambda/2$, for even values
$\lambda = 2k$
$$ 1/\sqrt{x} = 2^k / \sqrt{1+y} \qquad (6) $$
and for odd values $\lambda = 2k + 1$
$$ 1/\sqrt{x} = 2^k \sqrt{2} / \sqrt{1+y}. \qquad (7) $$
The expression $1/\sqrt{1+y}$, which takes the values 1 and
$1/\sqrt{2}$ at $y = 0$ and $y = 1$, is approximated with a
straight line
$$ 1/\sqrt{1+y} \approx 1 - \frac{\sqrt{2}-1}{\sqrt{2}}\, y. \qquad (8) $$
For even $\lambda$
$$ 2^k \left( 1 - \frac{\sqrt{2}-1}{\sqrt{2}}\, y \right) \approx 2^k \left( 1 - \frac{1}{4} y \right) \qquad (9) $$
and for odd $\lambda$
$$ 2^k \sqrt{2} \left( 1 - \frac{\sqrt{2}-1}{\sqrt{2}}\, y \right) \approx 2^k \left( \sqrt{2} - \frac{1}{2} y \right). \qquad (10) $$
The approximations (9) and (10) can be computed
with minor effort. The only difficulty in a SW implementation
can be obtaining the number of leading zeros,
$\lambda$. With a limited instruction set, $\lambda$ could be obtained
by testing bits iteratively. However, the DSP applied
in this study has a special instruction for obtaining the
number of leading zeros. Thus, $\lambda$ can be obtained
with minor overhead.
The SW routine has to choose between two branches according
to the parity of $\lambda$. In this case, it is advantageous to use
conditional execution instead of a branching instruction.
The instructions of both branches are guarded by flags,
whose values enable or disable the execution of the
operation. Even though the operations that are not executed
still take clock cycles, the total savings outweigh this cost.
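The initialization can be modeled in floating point as follows. This is a sketch under assumptions, not the fixed-point routine of [10]: the function name `inv_sqrt_initial` is hypothetical, the leading-zero count is written here as `lam` and is taken to satisfy $x \cdot 2^{\text{lam}} = 1 + y$ in line with (4), and `math.frexp` stands in for the DSP's leading-zero instruction.

```python
import math


def inv_sqrt_initial(x):
    """Piecewise-linear first guess for 1/sqrt(x), valid for 0 < x < 1."""
    m, e = math.frexp(x)       # x = m * 2**e with 0.5 <= m < 1
    lam = 1 - e                # so that x * 2**lam = 2 * m = 1 + y
    y = 2.0 * m - 1.0          # 0 <= y < 1
    k = lam // 2
    if lam % 2 == 0:
        return 2.0 ** k * (1.0 - 0.25 * y)          # even case, Eq. (9)
    return 2.0 ** k * (math.sqrt(2.0) - 0.5 * y)    # odd case, Eq. (10)


guess = inv_sqrt_initial(0.25)   # y = 0, lam = 2
```

In this model the relative error of the first guess stays below ten percent over the sampled range, which is what lets a small number of Newton iterations finish the job.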
Table 1: Number of clock cycles per decomposition.
Matrix size N            4      8     16     32      64
Proposed, 0 iters      269    961   3671  16426   85070
Proposed, 2 iters      402   1350   5126  22060  113284
High-level, 0 iters    403   1507   6245  27093  128949
High-level, 2 iters    529   1753   6731  28059  130975
AccelCore             1407      -      -      -       -
4.2. Newton's Iterations
After the initial approximation of $1/\sqrt{x}$ has been
obtained, the computation is continued with Newton's
iterations. Due to the accuracy of the initial value, two
iterations are enough with Q4.11 numbers [10].
Newton's iteration gives a root of an equation $f(y) = 0$
by recursively applying
$$ y_{n+1} = y_n - f(y_n) / f'(y_n). \qquad (11) $$
For $y = 1/\sqrt{x}$, $f(y) = 1/y^2 - x$ and $f'(y) = -2/y^3$, which gives
$$ y_{n+1} = \frac{1}{2} y_n \left( 3 - x y_n^2 \right), \qquad (12) $$
where the initial value is $y_0$.
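A floating-point sketch of the recursion (12) is given below; the function name `newton_inv_sqrt` and the example starting value are assumptions made for illustration. The update needs only multiplications and one subtraction, so no division appears.

```python
def newton_inv_sqrt(x, y0, iterations=2):
    """Refine an initial guess y0 of 1/sqrt(x) with the recursion (12)."""
    y = y0
    for _ in range(iterations):
        y = 0.5 * y * (3.0 - x * y * y)
    return y


# Hypothetical example: x = 0.25, exact result 2.0, initial guess 5 % low.
refined = newton_inv_sqrt(0.25, 1.9)
```

With a relative error $e$ in $y_n$, the next iterate has a relative error of about $-\frac{3}{2} e^2$, so an initial error of a few percent shrinks below the Q4.11 resolution of $2^{-11}$ within two iterations.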
5. Performance
The matrix size affects the run time considerably,
since the Cholesky decomposition is an $O(N^3)$
algorithm. The numbers of clock cycles in Table 1
are given for matrix sizes $N = 4$, 8, 16, 32, and 64 with
zero or two Newton's iterations. The results indicate
that the proposed implementation with $N = 64$ and two
Newton's iterations is capable of one decomposition in
one UMTS time slot, $T_s = 0.667$ ms, with a reasonable
clock frequency of 113284 cycles / $T_s$ = 170 MHz.
The same table gives the number of clock cycles of a
straightforward high-level implementation of the
algorithm in Fig. 1. It does not use the division
operation, applies the developed routine for the $1/\sqrt{x}$ function,
and computes the first column as a special case. For
comparison, the performance of a dedicated IP core
is also included. The core also uses a 16-bit word
length. The throughput of the IP core for a $4 \times 4$ matrix
is 633.2 ksps at 55.7 MHz [3]. So, the number of clock
cycles is 55.7 Mcps / (633.2 ksps / ($4 \times 4$ samples /
matrix)) = 1407 cycles / matrix.
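The two figures quoted above follow from simple arithmetic, reproduced here as a check. The variable names are chosen for this illustration; the numbers are those given in the text and in Table 1.

```python
T_SLOT = 0.667e-3                 # UMTS time slot in seconds

# Clock frequency needed for one N = 64 decomposition per time slot.
required_hz = 113284 / T_SLOT     # about 169.8 MHz

# IP core: 633.2 ksps at 55.7 MHz, 4 x 4 = 16 samples per matrix.
matrices_per_second = 633.2e3 / 16
ip_core_cycles = 55.7e6 / matrices_per_second   # about 1407 cycles
```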
Table 2 gives the numbers of arithmetic operations of
the Cholesky decomposition. The number of
operations converges to $N^3/3$, but with small matrix
sizes the difference is more significant. When the number
of clock cycles in Table 1 is compared with the number of
operations in Table 2, one can also note a more significant
difference for smaller matrix sizes. With larger matrices,
the innermost loops have more iterations and, thus, the
efficiency of the zero-overhead loop of MAC operations
dominates. The results indicate that the efficiency of the
Cholesky decomposition implementation is of great importance
especially with modest matrix sizes.
Table 2: Number of arithmetic operations.
Matrix size N                 4     8     16     32     64
$N^3/3$                      21   171   1365  10922  87381
Additions or subtractions    10    84    680   5456  43680
Multiplications              20   120    816   5984  45760
$1/\sqrt{x}$ functions        4     8     16     32     64
Total arithmetic ops.        34   212   1512  11472  89504
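The rows of Table 2 can be reproduced with a simple counting model. The model is an assumption made here, not a statement from the paper: each element $l_{ij}$ is charged $j - 1$ subtractions and $j - 1$ multiplications for its sum of products, plus one multiplication by the $1/\sqrt{x}$ value, and each diagonal element is charged one $1/\sqrt{x}$ evaluation. The function name `operation_counts` is hypothetical.

```python
def operation_counts(n):
    """Count (subtractions, multiplications, 1/sqrt(x) calls) for size n."""
    subs = mults = inv_sqrts = 0
    for j in range(1, n + 1):
        for i in range(j, n + 1):
            subs += j - 1        # one subtraction per product in the sum
            mults += j - 1       # one multiplication per product in the sum
            mults += 1           # multiplication by the 1/sqrt(x) value
            if i == j:
                inv_sqrts += 1   # one 1/sqrt(x) evaluation per column
    return subs, mults, inv_sqrts


counts_64 = operation_counts(64)
total_64 = sum(counts_64)
```

Under this model the totals match the table for all listed sizes, for example 34 operations for $N = 4$ and 89504 for $N = 64$.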
6. Conclusions
The Cholesky decomposition was targeted at telecommunications
systems, and the run time was the most
crucial design criterion. A detailed study showed that the
algorithm can efficiently utilize typical DSP resources.
The implementation was enhanced with a novel method
of computing the fixed-point inverse square root. The
proposed implementation obtained significant savings in
clock cycles. The DSP implementation also outperformed a
dedicated IP core. Finally, the applicability was justified
by the modest clock frequency required to run one
Cholesky decomposition in one UMTS time slot.
References
[1] P. Darwood, P. Alexander, and I. Oppermann, "LMMSE chip equalization for 3GPP WCDMA downlink receivers with channel coding," in Proc. IEEE Int. Conf. on Commun., vol. 5, Helsinki, Finland, Jun. 2001, pp. 1421-1425.
[2] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C. Cambridge, UK: Cambridge University Press, 1992.
[3] "Accelcore™ Cholesky matrix factorization," AccelChip Inc., Aug. 2005, product specification.
[4] G. Baker, J. Gunnels, G. Morrow, B. Riviere, and R. van de Geijn, "PLAPACK: High performance through high-level abstraction," in Proc. Int. Conf. on Parall. Processing, Minneapolis, MN, USA, Aug. 1998, pp. 414-422.
[5] J. Demmel, "LAPACK: a portable linear algebra library for supercomputers," in Proc. IEEE Control Syst. Soc. Worksh. on Computer-Aided Control Sys. Design, Tampa, FL, USA, Dec. 1989, pp. 1-7.
[6] TMS320C55x Technical Overview, Texas Instruments, Feb. 2000, SPRU393.
[7] TMS320C55x DSP Programmer's Guide, Texas Instruments, Aug. 2001, SPRU376A.
[8] E. Antelo, T. Lang, and J. Bruguera, "Computation of $\sqrt{x/d}$ in a very high radix combined division/square-root unit with scaling and selection by rounding," IEEE Trans. on Computers, vol. 47, no. 2, pp. 152-161, Feb. 1998.
[9] C. Lomont, "Fast inverse square root," Department of Mathematics, Purdue University, Tech. Rep., Feb. 2003.
[10] P. Salmela, A. Burian, T. Järvinen, J. Takala, and A. Happonen, "Fast fixed-point approximation of the inverse square root," submitted to IEEE Trans. Circuits Syst. II: Express Briefs.