171 Sudarsanam P

Implementation of Polymorphic Matrix Inversion using Viva
Arvind Sudarsanam, Dasu Aravind
Utah State University
Sudarsanam

MAPLD2005/171 2/29
Overview
Problem definition
Matrix inverse algorithm
Types of Polymorphism
Design Set-up
Hardware design flow (For LU Decomposition)
Results
Conclusions

Problem Definition
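Given a 2-D matrix, A[N][N],

$$
A = \begin{bmatrix}
A[1,1] & A[1,2] & A[1,3] & \cdots & A[1,N] \\
A[2,1] & A[2,2] & A[2,3] & \cdots & A[2,N] \\
A[3,1] & A[3,2] & A[3,3] & \cdots & A[3,N] \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
A[N,1] & A[N,2] & A[N,3] & \cdots & A[N,N]
\end{bmatrix}
$$

determine the inverse matrix $A^{-1}$, defined by $A \times A^{-1} = I$.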

Algorithm flow
Step 1: LU Decomposition
Matrix A is split into two triangular matrices, L and U.

For i = 1:N
  For j = i+1:N
    A(j,i) = A(j,i)/A(i,i);
    A(j,(i+1):N) = A(j,(i+1):N) - A(j,i)*A(i,(i+1):N);
  End For j
End For i
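
The Step 1 loop can be sketched in Python as follows. This is a minimal illustration, not the Viva implementation: it assumes 0-based indices, no pivoting (every pivot A[i][i] must be nonzero), and a unit diagonal for L. The names `lu_in_place` and `split_lu` are hypothetical.

```python
def lu_in_place(a):
    """LU decomposition without pivoting (assumes nonzero pivots).

    Returns a copy of `a` holding the multipliers of L below the diagonal
    (L's unit diagonal is implied) and U on and above the diagonal.
    """
    n = len(a)
    a = [row[:] for row in a]                     # work on a copy
    for i in range(n):
        for j in range(i + 1, n):
            a[j][i] = a[j][i] / a[i][i]           # multiplier, stored in L's slot
            for c in range(i + 1, n):
                a[j][c] -= a[j][i] * a[i][c]      # update columns i+1..n-1 of row j
    return a


def split_lu(a):
    """Separate the packed result into explicit L and U matrices."""
    n = len(a)
    lower = [[a[r][c] if c < r else (1.0 if c == r else 0.0) for c in range(n)]
             for r in range(n)]
    upper = [[a[r][c] if c >= r else 0.0 for c in range(n)] for r in range(n)]
    return lower, upper
```

Splitting the packed result into separate L and U matrices is assumed here to correspond to the "A2LU Convert" step of the design flow.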

Algorithm flow
Step 2: Inverse computation for triangular matrices
$L^{-1}$ and $U^{-1}$ are computed using a variation of Gaussian elimination.

Linv = identity(N);
For i = 1:N
  For j = i+1:N
    Linv(j,1:i) = Linv(j,1:i) - L(j,i)*Linv(i,1:i);
  End For j
End For i
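
A minimal Python sketch of Step 2, under stated assumptions: L has a unit diagonal (as produced by the Step 1 loop), both inverses start from the identity matrix, and indices are 0-based. Row i of the running inverse of L is nonzero only in columns 0..i, so only those columns are updated. The upper-triangular routine mirrors the same elimination from the bottom row upward and additionally scales by the diagonal of U. Function names are hypothetical.

```python
def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def invert_unit_lower(lower):
    """Invert a unit lower-triangular matrix by forward elimination."""
    n = len(lower)
    inv = identity(n)
    for i in range(n):
        for j in range(i + 1, n):
            for c in range(i + 1):                 # columns 0..i only
                inv[j][c] -= lower[j][i] * inv[i][c]
    return inv


def invert_upper(upper):
    """Invert an upper-triangular matrix with nonzero diagonal."""
    n = len(upper)
    inv = identity(n)
    for i in range(n - 1, -1, -1):
        pivot = upper[i][i]
        for c in range(i, n):                      # scale row i by 1/pivot
            inv[i][c] /= pivot
        for j in range(i):                         # clear column i in rows above
            for c in range(i, n):
                inv[j][c] -= upper[j][i] * inv[i][c]
    return inv
```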

Algorithm flow
Step 3: Matrix multiplication
$L^{-1}$ and $U^{-1}$ are multiplied together to generate $A^{-1}$.

For i = 1:N
  For j = 1:N
    For k = 1:N
      Ainv[i,j] = Ainv[i,j] + Uinv[i,k]*Linv[k,j]
    End For k
  End For j
End For i
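
As a sketch of Step 3, the product can be written as a plain triple loop. Since A = L x U, the inverse is formed as $U^{-1} \times L^{-1}$, in that order. The function name `matmul` and the 0-based indexing are assumptions of this example.

```python
def matmul(x, y):
    """Square matrix product; the accumulator starts at zero."""
    n = len(x)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i][j] += x[i][k] * y[k][j]
    return out


# Example: A = [[2, 4], [2, 8]] factors into L = [[1, 0], [1, 1]] and
# U = [[2, 4], [0, 4]], whose inverses are given below.
u_inv = [[0.5, -0.5], [0.0, 0.25]]
l_inv = [[1.0, 0.0], [-1.0, 1.0]]
a_inv = matmul(u_inv, l_inv)
```

Because $U^{-1}$ is upper triangular and $L^{-1}$ is lower triangular, the inner loop could start at k = max(i, j) to skip terms that are zero; the full loop is kept here for clarity.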

Types of Polymorphism
The following parameters can be varied for the input matrix:
Data type - variable precision, signed/unsigned, and float
Information rate - rate at which input arrives into, and leaves, the system (pipelining/parallelism)
Order tensor - matrix size (16x16, 32x32, etc.)

Polymorphism and Viva
Viva supports polymorphic hardware implementation, much as software programming languages support polymorphism.

A large library of polymorphic arithmetic, control, and memory modules is available.

Data Type Polymorphism
[Figure: schematic annotated "Polymorphic"]

Information Rate Polymorphism
Clock speed can be changed based on the input data rate.
This Mul unit is a truly polymorphic object: based on the input list size, the Viva compiler generates the required number of parallel multiplier units. The number of parallel units is denoted K.

Order Tensor Polymorphism
[Figure annotation: value of N set at run time]

Design Flow - Top level block diagram
[Block diagram; recoverable labels:]
Central Control Unit (CCGU)
Memory Units for A, L, U, L^-1, U^-1, A^-1
Loop Units: LU Decompose, Inverse of L, Inverse of U, U^-1 x L^-1
Input: From Files

Design Flow
Main Step  Operation     Sub Step  Sub Module
1          Initialize    0         Generate address
                         1         Write A onto BRAM
2          LU Decompose  0         Generate i, j, k
                         1         Read A[j,i], A[j,()]
                         2         Compute new A[j,()]
                         3         Write A[j,()], A[j,i]
3          A2LU Convert  0         Generate j, k
                         1         Read A[j,()]
                         2         Compute L[j,()], U[j,()]
                         3         Write L[j,()], U[j,()]

Design Flow
Main Step  Operation     Sub Step  Sub Module
4          L inverse     0         Generate i, j, k
                         1         Read L[j,()], L^-1[j,()] ..
                         2         Compute new L^-1[j,()]
                         3         Write L^-1[j,()]
5          U inverse     0         Generate i, j, k
                         1         Read U[j,()], U^-1[j,()] ..
                         2         Compute U^-1[j,()]
                         3         Write U^-1[j,()]
6          A inverse     0         Generate i, j, k
                         1         Read L[i,()], U[j,()]
                         2         Compute Ainv[i,j,()]
                         3         Update Ainv[i,j]

Hardware Design Set-up
Hardware:
PE6 (Xilinx 2V6000 FPGA) of the Starbridge Hypercomputer, connected to an Intel x86 processor. (66 MHz / 33,768 Slices)

Software:
Viva 2.3, developed at Starbridge Systems

Implementation - LU Decomposition
[Block diagram; recoverable labels:]
Blocks: Loop Unit, Address Generation Unit, Memory Unit, Computation Unit
Signals: i,j,k (shown twice); A[j,()], A[i,()], A[j,i], A[i,i]; A[j,()], A[j,i]

Loop Unit - Functionality
Given the order of the matrix N and the parallelism to be supported K, the following loop structure needs to be generated.

For i = 1 to N
  For k = ((i-1)/K)*K to N+1-K in steps of K
    For j = i to N
      Generate(i,k,j);
    End j
  End k
End i
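
The index sequence of this loop structure can be sketched as a Python generator. The assumptions here are that the division in the lower bound of k is integer division (otherwise the expression reduces to i-1), that all upper bounds are inclusive, and that k is a 0-based column offset of a K-wide block while i and j are 1-based. The name `loop_unit` is hypothetical.

```python
def loop_unit(n, k_par):
    """Yield (i, k, j) in the order of the loop structure above.

    n     -- order of the matrix (N)
    k_par -- number of parallel units (K); assumed to divide n
    """
    for i in range(1, n + 1):
        k_start = ((i - 1) // k_par) * k_par       # first block containing column i
        for k in range(k_start, n + 1 - k_par + 1, k_par):
            for j in range(i, n + 1):
                yield (i, k, j)


indices = list(loop_unit(4, 2))
```

For N = 4 and K = 2 this produces 17 index triples, from (1, 0, 1) to (4, 2, 4). Blocks that lie entirely to the left of column i are skipped.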

Loop Unit - Architecture
A simple register-based implementation is shown. The overall latency is 2 clock cycles.

Memory Unit - Distribution
A[1,1:8]   A[1,9:16]   A[1,17:24]   A[1,25:32]   ..
A[2,1:8]   A[2,9:16]   A[2,17:24]
A[3,1:8]   A[3,9:16]
A[4,1:8]
.
.
Each entry (for example A[1,1:8]) is one block.

Memory Unit - Architecture
BRAM memories are used to store data internally. (The matrix is expected to fit into the BRAMs; the maximum value of N is 128.)

There are K individual BRAMs, each of size [(NxN)/K] x (variable data size).

The K values in each block of the matrix are distributed over the K BRAMs. This results in a single-clock access time for internal memory.

A[j,()] and A[j,i] are fetched one after the other on every iteration.

The overall latency was found to be 3 clock cycles.

Address Generation - Functionality
Inputs: i, j, k from the Loop Unit
Outputs: Addresses in the BRAM for the A[j,()] and A[i,()] blocks of data
         Addresses in the BRAM of A[j,i] and A[i,i]

The computations are organized so that A[i,()] needs to be fetched only once while processing a complete column of blocks.

Thus, only one port is required to access both A[i,()] and A[j,()].

Address Generation - Architecture
Shifts are used instead of multipliers; N and K are assumed to be powers of 2. (Latency = 1 clock cycle)
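
One way such shift-based addressing could look is sketched below. The layout is an assumption, not taken from the slides: blocks are stored row-major, the BRAM index is the column offset inside a K-wide block, and rows and columns are 0-based. With N and K powers of 2, the multiply by N/K and the divide by K both reduce to shifts. The names `make_address_gen`, `block_address`, and `element_location` are hypothetical.

```python
def make_address_gen(n, k_par):
    """Build address functions for an n x n matrix spread over k_par BRAMs."""
    assert n & (n - 1) == 0 and k_par & (k_par - 1) == 0, "powers of 2 assumed"
    log2_k = k_par.bit_length() - 1                # shift that divides by K
    log2_blocks = (n // k_par).bit_length() - 1    # shift that multiplies by N/K

    def block_address(row, k):
        """Address (same in every BRAM) of the block at `row`, first column `k`."""
        return (row << log2_blocks) + (k >> log2_k)

    def element_location(row, col):
        """(BRAM index, address) of a single element such as A[j,i]."""
        bank = col & (k_par - 1)                   # column offset inside the block
        return bank, block_address(row, col & ~(k_par - 1))

    return block_address, element_location


block_address, element_location = make_address_gen(16, 8)
```

With N = 16 and K = 8, each BRAM holds (16 x 16)/8 = 32 entries, so addresses run from 0 to 31.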

Computation Units - Functionality
Inputs: - A[j,()] and A[i,()] blocks from BRAM unit
- A[j,i] and A[i,i] from the BRAM unit.
- Indices i,j,k from the loop unit.
Output: The modified A[j,()] block and the A[j,i] value.

Three steps are performed:
1. Modify A[i,()] based on the loop indices
2. Perform computations: Divide, Multiply, Subtract
3. Include A[j,i] on A[j,()] if required
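
The three steps can be illustrated with the following Python sketch, which is one possible reading of the slide rather than the actual Viva design. It assumes that "modify A[i,()]" means zeroing the pivot-row entries in columns at or left of the pivot column i, and that "include A[j,i]" means writing the computed multiplier into column i when that column lies inside the current block. Indices are 0-based, and `compute_block` and `blocked_lu` are hypothetical names.

```python
def compute_block(a_j, a_i, a_ji, a_ii, i, k):
    """Process one K-wide block of row j (columns k..k+K-1) for pivot column i."""
    width = len(a_j)
    # Step 1: mask pivot-row entries in columns <= i
    masked = [a_i[c] if k + c > i else 0.0 for c in range(width)]
    # Step 2: divide, multiply, subtract
    m = a_ji / a_ii
    out = [a_j[c] - m * masked[c] for c in range(width)]
    # Step 3: store the multiplier if column i falls in this block
    if k <= i < k + width:
        out[i - k] = m
    return out


def blocked_lu(a, k_par):
    """Drive compute_block over all blocks; k_par is assumed to divide len(a)."""
    n = len(a)
    a = [row[:] for row in a]
    for i in range(n):
        col_i = [a[j][i] for j in range(n)]        # A[j,i] read before overwrite
        a_ii = a[i][i]
        for k in range((i // k_par) * k_par, n, k_par):
            a_i = a[i][k:k + k_par]                # pivot-row block, fetched once
            for j in range(i + 1, n):
                a[j][k:k + k_par] = compute_block(
                    a[j][k:k + k_par], a_i, col_i[j], a_ii, i, k)
    return a
```

Under these assumptions the result matches an element-by-element LU decomposition for any block width K that divides N.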

Computation Units - Architecture (K=8)

Results for LUD - Slice Counts (N=16)
List size (K)   Fix16        Fix32         Float
4               1862 (8)     7305 (32)     5012 (12)
8               3731 (16)    14472 (64)    9802 (24)
16              7502 (32)    29018 (128)   19024 (48)
The number of ROM multipliers used is shown in brackets.

Results for LUD - Time Taken (in cycles)
List size (K)   Fix16   Fix32   Float
4               1212    1276    1232
8               590     654     610
16              279     343     299

Time taken Vs Size of Matrix (Fix16, K = 8)
Size of the matrix   Time taken (in cycles)
16x16                590
32x32                4348
64x64                33528
128x128              264688 (3970320 ns)
C code (N=128; Fix16) takes O(M*N^3) time, approximately 702545*M ns on an Intel Centrino 1.5 GHz, where M is the number of cycles per iteration (~30). This gives a speed-up of roughly M/6.
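
The arithmetic behind these figures can be checked with a short Python sketch. It uses only the numbers quoted on this slide; the variable names are hypothetical, and M = 30 is the slide's own estimate.

```python
FPGA_CYCLES = 264688        # 128x128, Fix16, K = 8
FPGA_TIME_NS = 3970320      # same run, in nanoseconds
SW_NS_PER_M = 702545        # software time is 702545 * M ns

cycle_ns = FPGA_TIME_NS / FPGA_CYCLES   # clock period implied by the table
clock_mhz = 1e3 / cycle_ns              # corresponding clock frequency


def speedup(m):
    """Software time divided by FPGA time for m cycles per iteration."""
    return SW_NS_PER_M * m / FPGA_TIME_NS


# Growth in cycle count each time N doubles
cycles = {16: 590, 32: 4348, 64: 33528, 128: 264688}
growth = {n: cycles[2 * n] / cycles[n] for n in (16, 32, 64)}
```

The implied clock period is 15 ns (about 66.7 MHz). The speed-up works out to about M/5.65, or about 5.3x for M = 30. Each doubling of N multiplies the cycle count by between 7.3 and 7.9, approaching the factor of 8 expected from N^3 scaling.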

Conclusions
A polymorphic design for matrix inverse was implemented:
Data type - Float/Fix16/Fix32
Information rate (K) - 4/8/16
Order tensor (N) - 16/32/64/128
Viva's effectiveness in polymorphic implementation was evaluated.
The hardware design flow and results were shown for LU Decomposition.

Lessons learned
Pseudo polymorphism
Some of the polymorphic objects in the Viva library are pseudo polymorphic, for example the floating-point and fixed-point implementations of the adder unit.
Need for a timing analysis tool
It was difficult to compute the delays associated with each block in the Viva library.
Fix32 Vs Float
The division unit in the Viva library is optimized for floating point and not for fixed point (as shown in the results).
