
16-Bit Viterbi Decoder Processor

Ashkan Borna, Student, Dept. of EECS, University of Michigan, ashborna@umich.edu
Mojtaba Mehrara, Student, Dept. of EECS, University of Michigan, mehrara@umich.edu
Robert Mullenix, Student, Dept. of EECS, University of Michigan, rbmullen@umich.edu
Brian Pietras, Student, Dept. of EECS, University of Michigan, bpeitras@umich.edu

Figure 1 shows an encoder with a constraint length of 3 and
generator polynomials of 111 and 101 [1], which may be written
compactly as (3, 7, 5). We used this structure as the reference
for designing our Viterbi decoder.

ABSTRACT
This report describes in detail the implementation of a Viterbi
decoder on a 16-bit RISC microprocessor with a two-stage pipeline.
Some extra modules and instructions have been added to the baseline
processor to optimize it as a Viterbi decoder. The Viterbi algorithm
is used for decoding a class of error correcting codes called
convolutional codes, which are widely used as channel codes in
digital communication systems. We discuss our design choices and
supplemental additions, as well as some of the pitfalls encountered.

Keywords
Viterbi algorithm, convolutional codes, RISC processor, VLSI.

1. INTRODUCTION
The Viterbi algorithm is widely used in communication systems
to extract the most probable bit sequence out of a transmitted data
stream that has been encoded using convolutional codes. The
algorithm is based on computing the distance (Hamming distance
for hard input data and Euclidean distance for soft input data)
between the received data sequence and all possible sequences,
and extracting the most probable one.
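
The two distance measures named above can be illustrated with a short sketch. The function names and the list-based representation of the received and expected symbols are assumptions made for this example only; they do not come from the chip's implementation.

```python
def hamming_distance(received, expected):
    """Count the positions where two hard-decision bit sequences differ."""
    return sum(r != e for r, e in zip(received, expected))


def squared_euclidean_distance(received, expected):
    """Sum of squared differences between soft-input values and expected values."""
    return sum((r - e) ** 2 for r, e in zip(received, expected))


# Hard input: bits are 0 or 1.
print(hamming_distance([1, 0, 1, 1], [1, 1, 1, 0]))

# Soft input: 3-bit values in the range 0..7 (assumed here for illustration).
print(squared_euclidean_distance([6, 1], [7, 0]))
```

In both cases a smaller value means the received data is closer to the candidate sequence.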

Figure 1. Encoder Structure

2.2 The Viterbi Algorithm


When a sequence of data is received from the channel, it is
desirable to estimate the original sequence that was sent. This
sequence can be identified using a diagram called a trellis
(Figure 2). Detecting the original stream can be described as
finding the most probable path through the trellis.

In this project we fit a soft-input Viterbi decoder with a
constraint length of three and an input width of three bits into a
16-bit RISC microprocessor. The design goal was to minimize memory
usage while maintaining decoding speed. The report is organized as
follows. Section 2 gives a brief description of convolutional coding
and Viterbi decoding. Section 3 gives an overview of the whole chip
implementation along with our approach for implementing the Viterbi
algorithm in the processor. Sections 4 and 5 describe testability
features and pin placement, while Section 6 presents the timing
analysis of the final design. Finally, Section 7 contains our final
results.


2. BACKGROUND
2.1 Convolutional Encoder


In a convolutional encoder, the input is fed into a shift register
and each output is the modulo-2 sum of selected registers and the
primary input. The number of registers in the encoder plus one is
called the constraint length. A generator polynomial is assigned to
each output as a structural specification of the encoder; it defines
the registers that are used to produce that output. For instance, a
generator polynomial of 101 indicates that the output is the sum of
the primary input and the output of the second register. Each
encoder can therefore be uniquely identified by its constraint
length and generator polynomials.
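
The encoder description above can be sketched in a few lines. This is a minimal illustration for the (3, 7, 5) reference code, assuming the leftmost bit of each generator polynomial selects the primary input and the rightmost bit selects the last register; the function name and argument names are hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Encode a bit list with a rate-1/len(generators) convolutional code."""
    state = 0  # shift-register contents, most recent bit in the top position
    out = []
    for b in bits:
        # Window of K bits: current input followed by the register contents.
        window = (b << (constraint_length - 1)) | state
        for g in generators:
            # Each output is the modulo-2 sum of the taps selected by g.
            out.append(bin(window & g).count("1") % 2)
        # Shift: the input enters the register and the oldest bit drops out.
        state = window >> 1
    return out


# Impulse response: a single 1 followed by zeros reads out the generators.
print(conv_encode([1, 0, 0]))
```

Feeding a single 1 followed by zeros produces the output pairs (1,1), (1,0), (1,1), which are the taps of generators 111 and 101 read column by column.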

Figure 2. Trellis diagram for a (4,13,17) decoder[2]


In the trellis diagram each node corresponds to an individual state
at a given time and indicates a possible pattern of recently
received data bits. Each branch indicates the transition to a new
state at the next timing cycle [1]. The transition from each stage
to the next is defined by a state transition diagram which is
sometimes called trellis legend. This legend is constructed
according to the structure of the encoder. Figure 3 shows the
legend which corresponds to our reference encoder.

lot in implementing the desired application. The arithmetic shift
operations stand as the sole exception. Due to the nature of our
funnel shifter design, we essentially got them for free. This
principle had the effect of putting an emphasis on minimizing
area and reducing our core size dramatically since we excised any
extraneous features.
The RAM block provides another example of this principle at
work. Since the dataflow in our application is real-time and our
algorithm only makes use of the register file, we do not need to
save information into memory. Even the smallest RAM block
available in the parts library would add unwanted delay and
complexity in addition to doubling our chip size. The
specifications required us, however, to implement the load and
store operations. So we adapted a copy of our register file,
connected a 4-to-16 decoder to it, and used it as a 16-word
RAM, fulfilling both the requirements and our design
philosophy.

Figure 3. Trellis legend for (3, 7, 5) decoder[3]


Each branch on the trellis has an assigned metric which represents
the cost of passing through that branch to the next state. For the
case of soft input Viterbi this value equals the Euclidean distance
between the actual received bits and the expected branch data.
The state metrics, or path metrics, are the accumulation of the
branch metrics through the most probable path arriving into a
specific state. Section 3.3 explores these operations in more
detail. After building up the trellis, two approaches called trace
back and register exchange may be used to decode the data. In
the register exchange method, a register is assigned to each state
and it records the decoded output sequence along the path from
the initial state to the final state. At the last stage, the decoded
output sequence is the one stored in the survivor path register
assigned to the state with the minimum path metric [4]. The trace
back approach needs less computation than register exchange, but
since the latter is faster using our approach and requires less
memory, we have chosen register exchange for the decoder. Ideally,
the sequence along the final survivor path is valid only once all
data in a given sequence has been received and the whole trellis has
been constructed. In practice, however, beyond a certain depth in
the trellis diagram all survivor paths originate from the same
state. This depth is called the trace back length, and it equals
five times the constraint length.
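
One register exchange update, as described above, can be sketched as follows. This is an illustration only: the function name, the list-of-integers representation of the survivor registers, and the `winners` argument (one `(previous_state, decoded_bit)` pair per new state, as chosen by the compare-select step) are assumptions for this example.

```python
def register_exchange_step(paths, winners, width=16):
    """Update every survivor register for one trellis stage.

    paths   -- list of survivor registers (ints), indexed by state
    winners -- for each new state, (previous_state, decoded_bit)
    width   -- register width in bits; older bits fall off the top
    """
    mask = (1 << width) - 1
    new_paths = []
    for prev_state, bit in winners:
        # Copy the winning predecessor's history, shift it left by one,
        # and insert the newly decoded bit at bit position zero.
        new_paths.append(((paths[prev_state] << 1) | bit) & mask)
    return new_paths


paths = [0b00, 0b01, 0b10, 0b11]
winners = [(0, 0), (2, 1), (1, 0), (3, 1)]
print(register_exchange_step(paths, winners))
```

All new registers are computed from the old ones before any is overwritten, which is why the function builds a fresh list instead of updating `paths` in place.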

3.2 Chip Architecture

3. CHIP OVERVIEW
3.1 Design Considerations

We used the one-master, 16-slave latch design as the template for
the register file. Although this approach made driving the bus
difficult in the absence of read signals, we addressed this problem
by placing keepers on the data lines. Several of the Viterbi
instructions required three distinct registers: two source registers
and one destination. The normal baseline instructions use one of the
source registers as the destination register. We used a mux to
selectively couple and decouple one of the read ports with the write
port. This allowed us to maintain normal functionality during
baseline instructions, but expand when executing the Viterbi
instructions.

To accomplish Viterbi decoding while still preserving the
flexibility of a general-purpose processor, we designed and laid
out the base architecture that was provided and supplemented it
with a module that added all of the functions Viterbi required.
We also support load and store instructions through a 16-word
RAM block based on our register file. As mentioned before, we
did not require additional memory, since the Viterbi algorithm
makes efficient use of our register space.
We implemented the suggested two-stage pipeline, fetching each
instruction and then executing it. A deeper pipeline could
possibly have increased our performance, but it would have added a
greater degree of complexity. In the next few sections, we outline
individual components.

3.2.1 Datapath
Our fully custom datapath incorporates the register file, ALU,
shifter, and enough multiplexers (muxes) and tri-state buffers to
control the flow of data from one unit to another. Although we
designed each component independently of the others, we adjusted
input and output ports as the interactions revealed more accurate
knowledge about timing and load constraints. Section 6 details
the timing specifics.

Throughout our design, we kept a number of principles in mind to
guide us when making critical decisions. We decided not to
design around maximizing performance or minimizing power
consumption, both traditional and highly desirable VLSI
objectives. In picking one, a team must ultimately sacrifice the
other, in addition to other important parameters. Oftentimes the
marginally incremental gains in speed or power savings come at a
tremendous cost elsewhere. Instead, we based our decisions
around providing an acceptable balance between complexity,
performance, power, and development time.

The adder design afforded us the most flexibility and range in
possible implementations. After an examination of the
descriptions and trade-offs between adder families, we decided on
the Variable-length (square-root) Carry Increment Adder [6]. It
closely resembles the Carry Select Adder, improving it with a
number of logic optimizations (mainly using propagate-generate
logic versus full-adder logic). During the layout, we learned that

Finally, we always kept our application in the forefront of our
mind when making design choices. We decided not to spend time
implementing features that would not benefit Viterbi decoding.
As a result, we did not implement most of the instructions in the
included Instruction Set Architecture (ISA) above the required
minimum, but we added several new instructions that helped us a

removing the multipliers needed for doing the squares. If we use
seven bits for the branch metric, the normalization operation,
which will be discussed in the next section, would occur quite
often and this reduces the accuracy of decoding. But in case of
using four bits for each branch metric, the possibility of this
normalization occurrence is greatly reduced and this results in
improvement in BER. Figure 9 shows the results of the Matlab
simulation for the Matlab Viterbi decoder, our decoder with the
Euclidean distance branch metric, and also our decoder with the
simplified branch metric.

the non-uniform length made custom implementation very
time-consuming with only marginal perceived performance gains. In
hindsight, we probably should have picked the fixed length
design.
For our shifter, we used the funnel shifter design described in [6].
We picked this shifter because of its simple, intuitive layout and
flexible functionality. We achieved logical and arithmetic right
and left shifts just by using a 4-to-1 mux on the input. Although
the Viterbi algorithm did not require arithmetic shifts explicitly,
we gained this functionality with almost zero additional work.
During the datapath construction, we noticed poorer than
expected performance from our components. Our muxes did not
do an adequate job of passing the data with minimal delay. We
spent some time investigating different mux designs, until we
settled on a six-transistor circuit (a PMOS, an NMOS and two
inverters as output buffers) that met our needs. It had
the minimum delay of the different designs we tested, and we
could fit two of them inside our bit-slice width of 73.5 lambda.

The bmu computes X + Y, (~X) + (~Y), (~X) + Y and X + (~Y)
and stores them in the destination register. X and Y are the
three-bit soft inputs to the decoder and come directly from the IO
registers, and ~X and ~Y are their inverted values. Since we have
used three bits for the soft input, we allocated four bits to each
branch metric container and we used one 16-bit word to store all
of them.
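
The branch metric computation described above can be sketched as follows. The four sums and the 4-bit containers come from the text; the order in which the four metrics are packed into the 16-bit word is an assumption made for this example, as is the function name.

```python
def bmu(x, y):
    """Compute four 4-bit branch metrics from two 3-bit soft inputs.

    Returns a 16-bit word holding X+Y, ~X+~Y, ~X+Y and X+~Y,
    packed from the most significant nibble down (assumed order).
    """
    nx = ~x & 0b111  # 3-bit inversion of X
    ny = ~y & 0b111  # 3-bit inversion of Y
    metrics = [x + y, nx + ny, nx + y, x + ny]
    word = 0
    for m in metrics:
        # Each sum is at most 7 + 7 = 14, so it fits in four bits.
        word = (word << 4) | m
    return word


print(hex(bmu(3, 5)))
```

Because each 3-bit input is at most 7, no sum exceeds 14, which is why a 4-bit container per metric is sufficient.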

3.2.2 Controller

At each stage of the trellis we need to add the previous path
metrics to the branch metrics of the current stage according to the
trellis legend (Figure 3), and update the path metric assigned to
each state. In order to update the path metric we perform one
add-compare-select operation for each state. This operation adds two
previous path metrics to the current branch metrics, compares the
two sums entering each state and selects the smaller of them
as the winner.

3.3.3 Path Metric Unit

In addition to the logic that determined the datapath and Viterbi
control signals, the controller contained the Instruction Register
(IR), the Program Counter (PC), and the Program Status Register
(PSR) bits. Since we didn't need an entire word for the PSR, we
broke up the condition codes into individual registers, although
the scan chain connected them in the same order as the ISA
specification detailed. The condition codes were set during
Addition, Subtraction, Comparison, and certain Viterbi
Instructions. For Addition, Subtraction, and Comparison, the
output of the operation, regardless of whether it was written back,
determined the value of the condition codes. For example, a
negative 2's complement result would set the N register, while an
overflow would set the F register. The Viterbi instructions treated
the condition code registers differently than the specification
detailed.

We have a pmu instruction in our ISA and we run it twice in the
assembly program for each decoding cycle. The first run
computes the path metrics of states 00 and 01 from the previous
path and branch metrics related to states 00 and 10. The second
one does the same thing for states 10 and 11 from states 01 and
11. It is clear that we have a butterfly structure between two
successive stages when we perform the path metric computation.
This structure is shown in Figure 4. We perform a swap operation
among path metrics later on to implement this butterfly.
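
The add-compare-select butterfly described above can be sketched as follows. The predecessor pairs follow the text (states 00 and 01 from 00 and 10; states 10 and 11 from 01 and 11). The dictionary of branch metrics keyed by `(previous_state, new_state)`, the tie-breaking rule, and all names are assumptions made for this example.

```python
# Predecessors of each new state, as stated in the text.
PREDECESSORS = {
    0b00: (0b00, 0b10),
    0b01: (0b00, 0b10),
    0b10: (0b01, 0b11),
    0b11: (0b01, 0b11),
}


def acs_stage(prev_pm, branch_metric):
    """One trellis stage of add-compare-select for a four-state code.

    prev_pm       -- list of four previous path metrics
    branch_metric -- dict mapping (previous_state, new_state) to a metric
    Returns (new path metrics, winning predecessor per state).
    """
    new_pm, decisions = [], []
    for state in range(4):
        a, b = PREDECESSORS[state]
        cand_a = prev_pm[a] + branch_metric[(a, state)]  # add
        cand_b = prev_pm[b] + branch_metric[(b, state)]
        if cand_a <= cand_b:                             # compare (ties go to a)
            new_pm.append(cand_a)                        # select
            decisions.append(a)
        else:
            new_pm.append(cand_b)
            decisions.append(b)
    return new_pm, decisions


bm = {(0, 0): 3, (2, 0): 0, (0, 1): 1, (2, 1): 4,
      (1, 2): 2, (3, 2): 2, (1, 3): 6, (3, 3): 0}
print(acs_stage([0, 5, 2, 7], bm))
```

The first two loop iterations correspond to what one pmu instruction would cover, and the last two to the second pmu instruction.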

For the most part, the controller determines the next state logic on
the positive edge of the clock. We could not, however, get the
Next_PC register to set correctly unless we used the negative
edge. This had a detrimental effect on our final clock speed, since
that forced the branch instruction to finish in half a cycle.

As stated above, the bit width of each branch metric container is
four bits. The minimum bit width for the path metrics is five bits
and since we fit two path metrics in one register, we use eight bits
as the bit width for the path metrics. Due to the accumulative
nature of the path metric computation, these values tend to
overflow after several stages of computation. It is important to
normalize the values to prevent overflow. There are several
approaches for path metric normalization [5]. Among them the
Fixed Shift normalization fits well in our design. In this method
we use nine bits to perform the addition for each path metric in
pmu and, depending on the possibility of overflow in the next
stage, we choose between the eight most significant or eight least
significant bits as the final value for the path metric. When there
is a possibility of overflow in one of the metrics, all of them are
shifted right by one bit. To keep track of overflow detection,
we used the C, N, Z and L flags in the PSR. These are set by pmu
instructions and are checked afterwards to perform shift
operations when necessary.
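
The fixed shift idea above can be sketched on registers that each hold two 8-bit path metrics. This is an illustration under stated assumptions: the overflow test used here (any metric with its top bit set) is a stand-in for the flag-based detection on the chip, whose exact condition the text does not give, and all function names are hypothetical.

```python
def half_shift(word):
    """Shift the upper and lower bytes of a 16-bit word right by one each."""
    # The mask clears the bit that would leak from the upper byte into
    # the top of the lower byte.
    return (word >> 1) & 0x7F7F


def overflow_possible(registers, threshold=0x80):
    """Assumed test: true if any packed 8-bit metric reaches the threshold."""
    return any(((r >> 8) & 0xFF) >= threshold or (r & 0xFF) >= threshold
               for r in registers)


def normalize_if_needed(registers):
    """Halve every path metric together when any one of them is at risk."""
    if overflow_possible(registers):
        return [half_shift(r) for r in registers]
    return list(registers)


print([hex(r) for r in normalize_if_needed([0x1020, 0x8004])])
```

All metrics are shifted together so that their relative order, which is all the compare-select step depends on, is approximately preserved.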

3.3 Viterbi Decoding implementation


3.3.1 Viterbi Decoder Components
In order to fit the decoding algorithm in our chip we decided to
implement some extra modules in Verilog. We based our design
choices on the simulation results of a Viterbi decoder which we
implemented in Matlab. Later on, we used Matlab to fully verify
our assembly program, which we had developed to perform the
decoding using our chip's extra features and instructions.

3.3.2 Branch Metric Unit


This module computes the metrics of the branches at each stage of
the trellis diagram. There is a bmu instruction in our processor
which performs the branch metric computation at each decoding
stage based on the inputs. As stated before, Euclidean distance is
traditionally used as the branch metric for soft input data types.
But during our Matlab simulations, we discovered that if we use
the regular distance without the squares, we would have
improvements both in terms of total Bit Error Rate (BER) and
Viterbi module area. The improvement in area is caused by

Figure 4. Butterfly structure in computing path metrics [3]

3.3.4 Register Exchange

During this operation, the path related to each state is updated.
Since the trace back length is fifteen for our decoder, each path
can be stored in a 16-bit register in the register file. At each stage
the path related to the previous winning state is shifted left by
one bit and the bit on the winner branch is inserted at bit position
zero. This path is then copied to the register assigned to the
current state. At the same time the last bit of the register which
corresponds to the path with the minimum path metric is sent to
the output port as the decoded data. We used the processor's shift
instruction to complete this stage, so there were no extra modules.

BMU Rdest
    Computes four 4-bit branch metrics and stores them in Rdest.

PMU Rsrc1,Rsrc2,Rdest
    Adds the proper branch metrics in Rsrc1 to the previous path
    metrics in Rsrc2, compares them and puts the minimum in Rdest.
    Each PMU instruction computes two 8-bit path metrics.

Swap1(2) Rsrc1,Rsrc2,Rdest
    Copies one of the computed path metrics of Rsrc1 and Rsrc2 and
    puts them in Rdest to prepare the path metrics for the next
    stage computation.

Hs Rsrc,Rdest (half shift)
    Shifts the higher and lower 8 bits of Rsrc and copies the result
    into Rdest. Used for normalization of the path metrics.

Hcmp1(2,3) Rsrc,Rdest
    Compares the higher and lower 8-bit values in Rsrc and copies
    the minimum into Rdest. Used for finding the minimum metric at
    each stage to send out data from the corresponding path register.

Table 1. Extra instructions

Our additional instructions save us twenty-two cycles for each
decoding stage. We also save about twenty instructions because
most of our registers hold an operand in both the upper and lower
8 bits. This means that we do not have to load and store to RAM
with our implementation. Overall these savings effectively double
our throughput.

3.5 Our Custom Serial Interface


Our chip implements a serial connection with a complete
handshake to the outside world. The external device asserts the
reset and do_viterbi signals to begin. At the rising edge of
do_viterbi, our chip asserts a send_data signal to the external
device to indicate that it is ready to receive data. Then the device
provides the data on opx_serial and opy_serial and asserts
data_in_valid for three clock cycles to build up the 3-bit registers
inside the Viterbi module. From then on, our chip asserts
send_data after each BMU command, receives opx and opy and
stores them for the next BMU. Sixteen decoding cycles after the
assertion of do_viterbi our chip starts outputting data and
asserting data_out_valid at the end of each decoding stage, which
takes approximately forty-three cycles. In order to prevent
processing without proper data, the controller stalls if it reaches a
BMU instruction and the BMU_ready signal is low. When the
outside world is done with sending data, it drops the do_viterbi
signal low so the chip knows to simply output the remaining
sixteen bits of data.

Figure 5 shows the program flow for decoding the input data.

4. TESTABILITY

Figure 5. Viterbi decoder program flow

We provided a scan chain in order to access important control
registers. When an external device asserts the scan_en signal via
an input pin to our chip, both the processor and the controller stall
normal operations. Figure 6 demonstrates the scan path through
the PSR, IR, and PC registers. On every clock cycle, data from
the scan_in input pin is stored in the N register, the value
previously stored in the N register moves to the L register, and so
on, with the value in the MSB of the PC appearing on the scan_out
output pin.
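
The shifting behavior of the scan chain described above can be modeled with a simple list, one element per scanned bit. This sketch is generic: the chain length and the position of each register bit within the list are assumptions, and the function names are hypothetical.

```python
def scan_shift(chain, scan_in):
    """Advance the scan chain by one clock cycle.

    chain[0] is the bit closest to scan_in; chain[-1] drives scan_out.
    Returns (new chain contents, bit seen on scan_out this cycle).
    """
    scan_out = chain[-1]
    return [scan_in] + chain[:-1], scan_out


def scan_load(chain, bits):
    """Shift a sequence of bits in, collecting what comes out."""
    captured = []
    for b in bits:
        chain, out = scan_shift(chain, b)
        captured.append(out)
    return chain, captured


# A three-bit chain: shifting in three bits replaces the whole state
# and reads the old state out, last element first.
print(scan_load([1, 0, 1], [0, 0, 1]))
```

Shifting in as many bits as the chain is long both observes the old register contents and installs new ones, which is what makes the chain useful for test.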

3.4 Extra instructions


We added five new types of instruction; a few instructions have
extra variations. Some of our instructions require two source
registers and a destination register, so we used the opcodes 0x6xxx
and 0xAxxx to handle these. The controller uses a mux to
determine which bits of the opcode to use as the destination
register based on the instruction. New instructions are in the
following table.

Figure 6. Scan chain (scan_in -> PSR -> IR -> PC -> scan_out; LSB -> MSB for all modules).


Upon lowering the scan_en signal, the processor resumes
executing the instruction stored in the IR and advancing the PC
from its current value. The reset signal restarts the processor by
clearing the PC and other registers and loading the first
instruction stored in the ROM into the IR.

Component              Isolated Delay   Integrated Delay
Register File Writes   0.7835 ns        0.8646 ns
Register File Reads    0.6744 ns        0.7224 ns
ALU                    2.4154 ns        2.3589 ns
Shifter                0.290 ns         0.6709 ns

Table 2. Component delays, isolated and integrated.


The critical path through the datapath occurred during a
subtraction. The XOR gates that inverted the Rsrc inputs did a
poor job of driving the signal through the ALU.

5. PINS
Our processor uses twenty-six pins. I/O requires twelve pins,
leaving the remaining fourteen as power and ground. Figure 7
shows the I/O placement around our chip with respect to the
major internal components. Five pins serve testability and
overall system control/synchronization (scan_en, scan_in,
scan_out, reset, and clk). We use one pin to start Viterbi
decoding (do_viterbi), three pins for validation/handshaking
(send_data, data_out_valid, and din_valid), two pins that provide
input to the Viterbi module (opx_serial and opy_serial), and one
pin (data_out) that carries the recovered signal.

We found that the critical instruction, however, was the Branch.
Because the controller required the value of the next PC be ready
at the negative edge of the clock, we needed to determine the
Branch in a half cycle. The Branch instruction took 4.75 ns to
complete. This brought our minimum clock period to 9.5 ns. We
set our final clock frequency to 105 MHz, which will lead to a 2.45
Mbps throughput for the Viterbi decoder, since it takes around
forty-three instructions to decode each bit. We tried to remove the
artificial and constricting negative edge requirement of the branch
instruction, but could not find an adequate solution. Had we
resolved that issue, the Viterbi PMU instruction, at 7.4 ns, would
have been the critical instruction, allowing us to increase the
clock frequency to 133 MHz, an increase of 26%.
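
The arithmetic behind these figures can be reproduced with a short calculation. The function names are hypothetical, and the sketch assumes one instruction per clock cycle, which is what the forty-three instructions per decoded bit imply; the report's quoted values are rounded.

```python
def max_clock_mhz(critical_ns, half_cycle=False):
    """Maximum clock frequency for a given critical delay.

    If the operation must finish within half a cycle, the clock
    period has to be twice the delay.
    """
    period_ns = critical_ns * 2 if half_cycle else critical_ns
    return 1000.0 / period_ns


def throughput_mbps(clock_mhz, cycles_per_bit=43):
    """Decoded bits per microsecond, assuming one instruction per cycle."""
    return clock_mhz / cycles_per_bit


# Branch must complete in half a cycle: 4.75 ns -> 9.5 ns period.
print(round(max_clock_mhz(4.75, half_cycle=True), 2))
# Throughput at the chosen 105 MHz clock.
print(round(throughput_mbps(105), 2))
# Upper bound if the PMU instruction (7.4 ns) were the critical one.
print(round(max_clock_mhz(7.4), 2))
```

The last value is the raw bound from the 7.4 ns delay; the report's 133 MHz figure sits just below it.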

Figure 7. Pin Placement (I/O, clean and dirty vdd/gnd pins around the RAM, Viterbi module, ROM, controller, and datapath)

6. TIMING ANALYSIS
We ran timing analysis on each component as we built it,
determining the rise and fall times, worst-case delays, and critical
paths. As the circuit became progressively more complex, we
realized that testing each component in isolation would yield
inaccurate results due to a number of factors such as input drive
strength and output capacitance. So we went back and
recalculated many of the delays. Table 2 highlights some of the
differences we observed.

Figure 8. Final Layout of the chip

Figure 9. Output Bit Error Rate vs. SNR for one million samples

7. CONCLUSION
In this project we fitted a Viterbi decoder into a baseline 16-bit
RISC processor by adding some extra features and instructions.
Although we could have implemented the decoder as a separate
module like an ASIC, our design fits well into the processor and
we used the processor's features to implement our application. To
verify the final processor design, we wrote an assembly program
that performed the decoding and compared the chip's outputs to
Matlab's outputs, which showed a complete match between the
hardware implementation and the Matlab code. Figure 9 shows the
Bit Error Rate measurements of our decoder in Matlab for one
million samples and for different channel signal-to-noise ratios.

8. REFERENCES
[1] Forney, G.D., Jr., The Viterbi algorithm, Proceedings of
the IEEE, vol. 61, no. 3, pp. 268-278 (March 1973)

[2] Afzali-Kusha, A., IP Core Library Development for use in
Digital System designs (Viterbi block), technical report,
University of Tehran (May 2005)

[3] Mehrara, M., FPGA implementation of a Turbo decoder
using SOVA algorithm, BS project report, Sharif University
of Technology (June 2005)

[4] El-Dib, D.A., Elmasry, M.I., Modified register-exchange
Viterbi decoder for low-power wireless communications,
IEEE Transactions on Circuits and Systems, vol. 51, pp.
371-378 (Feb 2004)

[5] Shung, C.B., Siegel, P.H., Ungerboeck, G., Thapar, H.K.,
VLSI architectures for metric normalization in the Viterbi
algorithm, Proc. IEEE Int. Conf. Communications (ICC '90),
vol. 4, pp. 1723-1728 (Apr 1990)

[6] Weste, Neil H.E., and Harris, David, CMOS VLSI
Design, 3rd ed., Addison-Wesley, Reading, MA (2004)
