
Perttu Salmela, Aki Happonen, Tuomas Järvinen, Adrian Burian, and Jarmo Takala

Tampere University of Technology, Tampere, Finland / Nokia, Finland

{perttu.salmela, jarmo.takala}@tut.fi, {aki.p.happonen, tuomas.jarvinen, adrian.burian}@nokia.com

Abstract

Both matrix inversion and the solution of a set of linear equations can be computed with the aid of the Cholesky decomposition. In this paper, the Cholesky decomposition is mapped to the typical resources of digital signal processors (DSPs), and our implementation applies a novel way of computing the fixed-point inverse square root function. The presented principles result in savings in the number of clock cycles. As a result, the Cholesky decomposition can be incorporated in applications such as a 3G channel estimator, where short execution time is crucial.

1. Introduction

Data rates in wireless telecommunications are continuously increasing, yet the channel remains the same and suffers from multipath fading and intersymbol interference. For these reasons, it is crucial to apply sophisticated channel estimation and equalization methods. Linear minimum mean square error (LMMSE) estimation is such a method, and it has been proposed for universal mobile telecommunications system (UMTS) receivers [1]. The LMMSE is an inherently demanding operation, as it requires complex matrix inversion. In this paper, the Cholesky decomposition has been studied, since it can be used efficiently for matrix inversion and for solving linear systems.

For high-level software (SW) implementations, a reference C code can be found in [2]. There also exist intellectual property (IP) cores for the Cholesky decomposition [3]. In addition, there exist SW implementations in libraries such as PLAPACK, for parallel linear algebra algorithms on distributed-memory supercomputers [4], or LAPACK, which is designed for vector computers with shared memory [5]. However, they are not targeted at embedded systems relying on DSPs.

In this paper, the Cholesky decomposition is implemented as a SW routine running on TI's C55x processor [6]. The algorithm is analyzed and the critical parts are identified. Their computation is alleviated by substitution with less complex operations, efficient addressing and ordering of computations, avoidance of intervening control code, and single-cycle execution of the innermost kernel computation. In addition, a novel routine for computing the inverse square root function is applied. Alongside these principles, the word length and numerical range of the fixed-point number system are considered. The developed implementation is evaluated in terms of savings in clock cycles and suitability under strict time limits, i.e., one matrix decomposition in one UMTS time slot, T_s = 0.667 ms. The obtained results justify the proposed SW implementation.

2. Cholesky Decomposition

The Cholesky decomposition transforms a positive definite matrix A into the form A = L L^T, where L is a lower triangular matrix. After the Cholesky decomposition, inverting the triangular L is easy, and the inverse of the original A is obtained as A^-1 = (L^-1)^T L^-1. When L is given, Ax = b can also be solved easily: first, Ly = b is solved, and next, L^T x = y is solved. Since the matrix L is triangular, solving begins with a row having one unknown. It is solved, the value is substituted into the next row having two unknowns, and the process continues until the last row is solved.

The matrix L can be computed with

    l_ii = sqrt( a_ii - sum_{k=1}^{i-1} l_ik^2 )                    (1)

for diagonal elements and

    l_ij = (1/l_jj) ( a_ij - sum_{k=1}^{j-1} l_ik l_jk )            (2)

for non-diagonal elements. Both (1) and (2) show that the matrix L can overwrite the matrix A, which saves memory. An algorithm for the Cholesky decomposition is given as pseudo-code in Fig. 1.
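The recurrences (1) and (2) can be sketched directly; the following is a minimal floating-point Python routine (the fixed-point details discussed later are omitted), with the factor L overwriting the lower triangle of A as noted above.

```python
import math

def cholesky_lower(a):
    """In-place Cholesky decomposition following (1) and (2).

    a is a list of lists holding a symmetric positive definite
    matrix; on return its lower triangle holds L (the upper
    triangle is left untouched, mirroring the in-place scheme).
    """
    n = len(a)
    for j in range(n):                      # column by column
        for i in range(j, n):               # traverse downwards
            # accumulated sum of products, cf. the innermost loop
            s = sum(a[i][k] * a[j][k] for k in range(j))
            x = a[i][j] - s
            if i == j:
                a[i][j] = math.sqrt(x)      # diagonal, eq. (1)
            else:
                a[i][j] = x / a[j][j]       # non-diagonal, eq. (2)
    return a
```

For example, for A = [[4, 2], [2, 3]] the routine yields l_11 = 2, l_21 = 1, and l_22 = sqrt(2), so that L L^T reproduces A.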

3. Software Implementation

To obtain high throughput, the implementation must maintain smooth program flow and high utilization of the processor resources.

3.1. Avoidance of Division Operation

The algorithm in Fig. 1 includes a division operation. A division performed by a SW subroutine would take several clock cycles. In contrast, a multiplication typically takes only one clock cycle on DSPs. The algorithm in Fig. 1 shows that the divider is a diagonal element l_jj of the L matrix. The value of the diagonal element is the square root of the original diagonal element a_jj, from which a cumulative sum of products is subtracted. Thus, there is a division by sqrt(x), which can be replaced with a multiplication by 1/sqrt(x). A novel way of computing the fixed-point 1/sqrt(x) is presented in the fourth section of this paper. When the square root sqrt(x)

1-4244-0368-5/06/$20.00 (c) 2006 IEEE

for j := 1 step 1 until N do
  for i := j step 1 until N do
  begin
    x := a_ij
    for k := j-1 step -1 until 1 do
      x := x - l_ik * l_jk
    if i = j then
      l_ij := sqrt(x)
    else
      l_ij := x / l_jj
  end

Fig. 1: Cholesky decomposition algorithm.

is required for the computation of the diagonal elements, it can be easily computed, since sqrt(x) = x * (1/sqrt(x)); i.e., to compute sqrt(x), only 1/sqrt(x) is required.
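A small numeric sketch of this substitution: once invs = 1/sqrt(x) is available, both the division by a diagonal element and the square root itself reduce to multiplications (plain Python floats here, just to show the identities).

```python
import math

x = 2.25
invs = 1.0 / math.sqrt(x)        # the only "hard" operation

sqrt_x = x * invs                # sqrt(x) = x * (1/sqrt(x))
quotient = 1.7 * invs            # a / sqrt(x) as a multiplication

assert abs(sqrt_x - 1.5) < 1e-12
assert abs(quotient - 1.7 / 1.5) < 1e-12
```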

3.2. Q4.11 Fixed-Point Arithmetic

The Cholesky decomposition is numerically stable in the sense that it preserves the subunitarity of the matrix elements. Therefore, a fixed-point presentation with only fractional bits would be sufficient. Unfortunately, the intermediate computations do not fit in the subunitary range, and the native Q.15 numbers, having 15 fractional bits and a sign bit, cannot be used.

Based on fixed-point simulations, the Q4.11 fixed-point number system is chosen, i.e., the maximum value of the integer part is 2^4 = 16 and the accuracy is 2^-11. The usage of a non-native number system is the only drawback of the presented implementation. Due to the non-native position of the binary point, the results of multiplications must be shifted explicitly.
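The explicit shift can be illustrated with integer arithmetic; the following is a minimal Python sketch of Q4.11 handling (a real C55x routine would do the same with a right shift of the 32-bit product):

```python
FRAC = 11                                  # Q4.11: 11 fractional bits

def to_q411(v):
    """Encode a real value as a Q4.11 integer."""
    return int(round(v * (1 << FRAC)))

def q411_mul(a, b):
    """Multiply two Q4.11 numbers; the raw product is Q8.22,
    so it must be shifted right by 11 to return to Q4.11."""
    return (a * b) >> FRAC

p = q411_mul(to_q411(1.5), to_q411(2.0))
# p encodes 3.0, i.e. p == to_q411(3.0)
```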

3.3. Schedule and Addressing

The computation order is illustrated in Fig. 2(a). The first element, l_11, and the rest of the first column, l_i1, i = 2, 3, ..., N, are computed as a special case, since the accumulated sum of products is zero for each of these elements. After the element l_22, the nested loops are entered, and the computation traverses the lower triangle always first downwards and then to the next column. With this order, all the elements in the same column can easily share the same value of the 1/sqrt(x) function.

As a second benefit, the address generation is alleviated. Addressing a memory location naively can require one multiplication and two additions in the worst case, i.e., addr_{i,j} = A_base + N*i + j. A simpler way is to use a pointer which is updated as the matrix is traversed. Thus, binding of the loop iterator variable and the addressing is avoided, which alleviates hardware looping.

In Fig. 2(b), the accessed elements are shown when computing the accumulated sum of products for l_64. Fig. 2(b) and the matrix element placement in Fig. 2(c) show that the elements can be addressed with pointers which are updated with auto-decrement. Thus, there is no overhead of complex address generation in the innermost loop. When the computation continues to the next

[Fig. 2: a) Computation order during the decomposition. b) Accessed elements when computing the accumulated sum for l_64. c) Matrix element placing: a one-dimensional array stores the matrix row-wise, so the flat indices run 0, 1, ..., N^2 - 1, with row boundaries at multiples of N.]

column, the first element is on the diagonal. It is efficient to have a separate pointer for the diagonal elements. When the diagonal is accessed, its address is incremented by N + 1, as indicated by Fig. 2(c). Next, the address of the diagonal is used as the new base address.

The algorithm in Fig. 1 tests whether a diagonal element is accessed. However, such a test can be avoided: when the middle loop terminates, the next element is always a diagonal one. Avoidance of testing is beneficial, since a conditional expression can lead to a jump, which intervenes in the pipeline and causes extra latency.
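The addressing scheme can be sketched with plain indices: with the row-wise placement of Fig. 2(c), element (i, j) lives at N*i + j (0-based here for brevity, so the paper's l_64 is (i, j) = (5, 3)), the operands of the inner loop sit at consecutive decreasing addresses, and the diagonal pointer advances by N + 1.

```python
N = 8

def addr(i, j):
    """Row-major address of element (i, j), 0-based, cf. Fig. 2(c)."""
    return N * i + j

# operands for the accumulated sum of l_64 (0-based: i = 5, j = 3):
# l_5,2 l_5,1 l_5,0 and l_3,2 l_3,1 l_3,0 are walked with
# auto-decrementing pointers, one address down per MAC.
row_i = [addr(5, k) for k in range(2, -1, -1)]
row_j = [addr(3, k) for k in range(2, -1, -1)]
assert row_i == [42, 41, 40]
assert row_j == [26, 25, 24]

# successive diagonal elements are N + 1 addresses apart
assert addr(4, 4) - addr(3, 3) == N + 1
```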

3.4. Smooth Looping

The algorithm in Fig. 1 uses three loops. An easy compilation strategy for a loop is to use two conditional jump instructions. The first one skips the whole loop if the number of iterations is zero. The second branching instruction jumps back to the beginning of the loop body if there are iterations left. The updating of the loop iterator is included in the loop body.

The described loop compilation strategy is general, but it does not utilize the HW looping capabilities of DSPs and does not achieve a low enough cycle count. With HW looping, the DSP updates the loop iterator implicitly and

[Fig. 3: (a) Computation of non-diagonal elements; the previously computed 1/sqrt(x) value is stored in the variable invs. (b) Computation of diagonal elements.]

determines when all the iterations have been passed. Furthermore, the HW loop does not introduce the additional latency of jump instructions. The DSP applied in this study is capable of two nested HW loops [7]. To utilize the obvious benefits of the HW looping, the loops in the Cholesky decomposition algorithm are re-written in such a way that they are iterated at least once and unnecessary accesses of the iterator variables are avoided.

3.5. Kernel Functions

The innermost loop of the Cholesky decomposition computes a cumulative sum of products. Thus, it lends itself to multiply-and-accumulate (MAC) operations. The computation of a non-diagonal element is illustrated in Fig. 3(a). The figure also shows how the value of the previously computed 1/sqrt(x) is reused. The diagonal elements are computed as shown in Fig. 3(b). In this case, the 1/sqrt(x) value is saved for later use. The memory access scheme also differs, as the products have the same operands, i.e., the square is computed.

Even if the results of the multiplication of two Q4.11 fixed-point numbers need to be scaled, the intermediate results of the accumulated sum of products do not need to be scaled, as they fit into the accumulator. The accumulated sum can remain in Q8.22 format, and only the final result needs to be scaled back to Q4.11 format.
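A sketch of that accumulation in integer Python: each Q4.11 x Q4.11 product is kept as Q8.22 inside the accumulator, and only the final sum is shifted back (on the C55x the wide accumulator plays this role).

```python
FRAC = 11                                   # Q4.11 fractional bits

def to_q411(v):
    return int(round(v * (1 << FRAC)))

# accumulate l_ik * l_jk products without intermediate scaling
lik = [to_q411(v) for v in (0.5, 0.25, 0.125)]
ljk = [to_q411(v) for v in (0.5, 0.5, 0.25)]

acc = 0                                     # Q8.22 accumulator
for a, b in zip(lik, ljk):
    acc += a * b                            # each product is Q8.22

result = acc >> FRAC                        # single final scaling
# 0.5*0.5 + 0.25*0.5 + 0.125*0.25 = 0.40625, exactly representable
```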

4. Computation of 1/sqrt(x)

The computation of 1/sqrt(x) is demanding because of the high non-linearity of the inverse and square root functions. Methods like digit recurrence [8], piecewise linear approximation, Taylor series expansion, or polynomial approximation have been developed to compute 1/sqrt(x). In [9], a fast method is explained, but it relies on floating-point numbers. In this study, a novel method presented in [10] has been used. In [10], a HW implementation was targeted, but the method lends itself also to SW implementation. The main benefits of the used 1/sqrt(x) method are the low number of clock cycles and the avoidance of look-up tables.

4.1. Initialization Algorithm

The method is based on substituting the highly non-linear 1/sqrt(x) with the function 1/sqrt(1 + y). This is justified, since a positive subunitary x having lambda leading zeros can be presented as

    x = 0.00...0 1y    (lambda leading zeros)                       (3)

and

    x * 2^lambda = 1.y = 2^0 + y = 1 + y,                           (4)

where y is also a positive subunitary number. Thus,

    1/sqrt(x) = 2^(lambda/2) / sqrt(1 + y).                         (5)

Depending on the remainder of lambda/2, for even values lambda = 2k

    1/sqrt(x) = 2^k / sqrt(1 + y)                                   (6)

and for odd lambda = 2k + 1,

    1/sqrt(x) = 2^k * sqrt(2) / sqrt(1 + y).                        (7)

The expression 1/sqrt(1 + y) can be approximated with a straight line,

    1/sqrt(1 + y) ~ 1/sqrt(2) - (1/(2*sqrt(2))) (y - 1).            (8)

For even lambda,

    2^k ( 1/sqrt(2) - (1/(2*sqrt(2))) (y - 1) ) ~ 2^k ( 1 - (1/4) y )   (9)

and for odd lambda,

    2^k sqrt(2) ( 1/sqrt(2) - (1/(2*sqrt(2))) (y - 1) ) ~ 2^k ( sqrt(2) - (1/2) y ).   (10)
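A sketch of the initialization in Python, assuming the even/odd split of (6)-(7) and the power-of-two linear seeds of (9)-(10); here the leading-zero count lambda is recovered with frexp instead of a dedicated instruction, and floats stand in for the Q4.11 registers.

```python
import math

def invsqrt_init(x):
    """Initial approximation of 1/sqrt(x) from the leading-zero
    count lambda, using the linear pieces of (9) and (10)."""
    m, e = math.frexp(x)          # x = m * 2**e, m in [0.5, 1)
    lam = 1 - e                   # x * 2**lam lies in [1, 2)
    y = x * 2.0 ** lam - 1.0      # y in [0, 1)
    k, odd = divmod(lam, 2)
    if odd:                       # lambda = 2k + 1, cf. (10)
        return 2.0 ** k * (math.sqrt(2.0) - 0.5 * y)
    return 2.0 ** k * (1.0 - 0.25 * y)   # lambda = 2k, cf. (9)

# the crude linear seed stays within roughly 10% of the true value,
# which is accurate enough to start the Newton's iterations below
for x in (0.03, 0.1, 0.25, 0.7, 1.0, 3.0, 9.0):
    rel = abs(invsqrt_init(x) * math.sqrt(x) - 1.0)
    assert rel < 0.12
```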

The approximations (9) and (10) can be computed with minor effort. The only difficulty in a SW implementation can be obtaining the number of leading zeros, lambda. With a limited instruction set, lambda would have to be obtained by testing bits iteratively. However, the DSP applied in this study has a special instruction for obtaining the number of leading zeros. Thus, lambda can be obtained with minor overhead.

The SW routine has to choose between two branches according to the parity of lambda. In this case, it is advantageous to use conditional execution instead of a branching instruction. The instructions of both branches are guarded by flags whose values enable or disable the execution of the operation. Even if the operations that are not executed take clock cycles, the total savings prevail.

Table 1: Number of clock cycles per decomposition.

    Matrix size N         4     8     16    32     64
    Proposed, 0 iters.    269   961   3671  16426  85070
    Proposed, 2 iters.    402   1350  5126  22060  113284
    High-level, 0 iters.  403   1507  6245  27093  128949
    High-level, 2 iters.  529   1753  6731  28059  130975
    AccelCore             1407

4.2. Newton's Iterations

After the initial approximation of 1/sqrt(x) has been obtained, the computation is continued with Newton's iterations. Due to the accuracy of the initial value, two iterations are enough with Q4.11 numbers [10]. Newton's iteration gives a root of an equation f(y) = 0 by recursively applying

    y_{n+1} = y_n - f(y_n) / f'(y_n).                               (11)

For y = 1/sqrt(x), f(y) = 1/y^2 - x and f'(y) = -2/y^3, which gives

    y_{n+1} = (1/2) y_n (3 - x y_n^2),                              (12)

where the initial value is y_0.
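Iteration (12) is easy to check numerically; a plain Python sketch (floats rather than Q4.11, so the convergence itself is visible):

```python
import math

def invsqrt_newton(x, y0, iters=2):
    """Refine an initial guess y0 of 1/sqrt(x) with iteration (12)."""
    y = y0
    for _ in range(iters):
        y = 0.5 * y * (3.0 - x * y * y)
    return y

# even a rough seed reaches better than 1e-4 accuracy in two steps
y = invsqrt_newton(2.0, 0.7)
assert abs(y - 1.0 / math.sqrt(2.0)) < 1e-4
```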

5. Performance

The matrix size affects the run time considerably, since the Cholesky decomposition is an O(N^3) algorithm. The numbers of clock cycles in Table 1 are given for matrix sizes N = 4, 8, 16, 32, and 64 with zero or two Newton's iterations. The results indicate that the proposed implementation with N = 64 and two Newton's iterations is capable of one decomposition in one UMTS time slot, T_s = 0.667 ms, with a reasonable clock frequency: 113284 cycles / T_s = 170 MHz.

In the same table, the numbers of clock cycles of a straightforward high-level implementation of the algorithm in Fig. 1 are given. It does not use the division operation, applies the developed routine for the 1/sqrt(x) function, and computes the first column as a special case. For comparison, the performance of a dedicated IP core has also been included. The core also uses a 16-bit word length. The throughput of the IP core for a 4 x 4 matrix is 633.2 ksps at 55.7 MHz [3]. Thus, the number of clock cycles is 55.7 Mcps / (633.2 ksps / (4 x 4 samples / matrix)) = 1407 cycles / matrix.
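Both cycle figures quoted above follow from short arithmetic, repeated here as a check (Python):

```python
# IP core: 633.2 ksps at 55.7 MHz, 4 x 4 = 16 samples per matrix
matrices_per_s = 633.2e3 / 16
cycles_per_matrix = 55.7e6 / matrices_per_s
assert round(cycles_per_matrix) == 1407

# proposed SW, N = 64, two Newton's iterations, one UMTS slot
ts = 0.667e-3
clock_hz = 113284 / ts                     # about 170 MHz
assert abs(clock_hz - 170e6) < 1e6
```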

In Table 2, the numbers of arithmetic operations of the Cholesky decomposition are given. The number of operations converges to N^3/3, but with small matrix sizes the difference is more significant. When the numbers of clock cycles in Table 1 are compared with the numbers of operations in Table 2, one can also note a more significant difference for smaller matrix sizes. With larger matrices, the innermost loops have more iterations and, thus, the efficiency of the zero-overhead loop of MAC operations dominates. The results indicate that, especially with modest matrix sizes, the efficiency of the Cholesky decomposition implementation is of great importance.

Table 2: Number of arithmetic operations.

    Matrix size N          4    8    16    32     64
    N^3/3                  21   171  1365  10922  87381
    Additions or subs.     10   84   680   5456   43680
    Multiplications        20   120  816   5984   45760
    1/sqrt(x) functions    4    8    16    32     64
    Total arithmetic ops.  34   212  1512  11472  89504

6. Conclusions

The Cholesky decomposition was targeted to telecommunications systems, and the run time was the most crucial design criterion. A detailed study showed that the algorithm can efficiently utilize typical DSP resources. The implementation was enhanced with a novel method of computing the fixed-point inverse square root. The proposed implementation obtained significant savings in clock cycles. The DSP implementation also outperformed a dedicated IP core. Finally, the applicability was justified by the modest clock frequency required to run one Cholesky decomposition in one UMTS time slot.

References

[1] P. Darwood, P. Alexander, and I. Oppermann, "LMMSE chip equalization for 3GPP WCDMA downlink receivers with channel coding," in Proc. IEEE Int. Conf. on Commun., vol. 5, Helsinki, Finland, Jun. 2001, pp. 1421-1425.

[2] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C. Cambridge, UK: Cambridge University Press, 1992.

[3] "AccelCore Cholesky matrix factorization," AccelChip Inc., Aug. 2005, product specification.

[4] G. Baker, J. Gunnels, G. Morrow, B. Riviere, and R. van de Geijn, "PLAPACK: High performance through high-level abstraction," in Proc. Int. Conf. on Parallel Processing, Minneapolis, MN, USA, Aug. 1998, pp. 414-422.

[5] J. Demmel, "LAPACK: A portable linear algebra library for supercomputers," in Proc. IEEE Control Syst. Soc. Workshop on Computer-Aided Control Syst. Design, Tampa, FL, USA, Dec. 1989, pp. 1-7.

[6] TMS320C55x Technical Overview, Texas Instruments, Feb. 2000, SPRU393.

[7] TMS320C55x DSP Programmer's Guide, Texas Instruments, Aug. 2001, SPRU376A.

[8] E. Antelo, T. Lang, and J. Bruguera, "Computation of sqrt(x/d) in a very high radix combined division/square-root unit with scaling and selection by rounding," IEEE Trans. on Computers, vol. 47, no. 2, pp. 152-161, Feb. 1998.

[9] C. Lomont, "Fast inverse square root," Department of Mathematics, Purdue University, Tech. Rep., Feb. 2003.

[10] P. Salmela, A. Burian, T. Järvinen, J. Takala, and A. Happonen, "Fast fixed-point approximation of the inverse square root," submitted to IEEE Trans. Circuits Syst. II: Express Briefs.

