Вы находитесь на странице: 1из 23

Report on

Implementation of the Smith-Waterman Algorithm on a

Reconfigurable Computing Platform

Section 1
1.1 Description and module hierarchy
In this paper, we present implementations of the Smith-Waterman algorithm for both DNA and
protein sequences on the platform.
The main features include: a multi-stage processing element (PE) design which significantly
reduces the FPGA resource usage and allows more parallelism.
A pipelined control mechanism with uneven stage latenciesa key to minimize the overall PE
pipeline cycle time and a compressed substitution matrix storage structure, resulting in substantial
decrease of the on-chip SRAM usage.
The Smith-Waterman algorithm is a well-known dynamic programming algorithm for performing
local sequence alignment for determining similar regions between two DNA or protein sequences.
The algorithm consists of two steps:
1. Calculate the similarity matrix score.
2. According to the dynamic programming method, trace back the similarity matrix to
search for the optimal alignment. In the algorithm, the first step will consume the largest part of
the total calculation time.

1.2 Scoring Matrix Derivation

The definition of the Smith-Waterman algorithm is shown as below:
While calculating the similarity matrix, the score of any matrix element V(i,j) always depends on
the score of three other elements:
The up-left neighbor element V(i-1,j-1)
The left neighbor V(i,j-1)
The up neighbor V(i-1,j)

The algorithm intention is to score a pattern given as a test sequence with respect to a known
sequence in terms of the correlation with respect to position wise matching.
Implementations of the Smith-Waterman algorithm for both DNA and protein sequences on the
platform. The order of symbols present in a test pattern could reach to 3 billons. Say a potential
content in a DNA stream is found to by a characteristic of some disease. There is a definite
change seen when compared to a heathy genome. If these characteristics can be found in some
other persons DNA sample that would help him to get some precautionary measures. The
mutation causing during the evolution is the reason behind.

The patterns in a DNA on either sides of a double helix structure is always complementary. C-G,

This property helps to exploit use of only the content on one side of the string for the comparison.
The main features include: a multi-stage processing element (PE) design which significantly
reduces the FPGA resource usage and allows more parallelism to used.
A pipelined control mechanism with uneven stage latenciesa key to minimize the overall PE
pipeline cycle time and a compressed substitution matrix storage structure, resulting in substantial
decrease of the on-chip SRAM usage.

The Smith-Waterman algorithm is a well-known dynamic programming algorithm for performing

local sequence alignment for determining similar regions between two DNA or protein sequences.

The above matrix structure is to be computed when a given test sequence and reference sequence
are given. In the affine gap model, the gap is used to compensate for the insertion or deletion, to
make the alignment more condensed in satisfying an expecting model. The gap is usually a
consecutive null character string in a sequence and should be as long as possible. In the affine gap
model, the penalty score (or cost) for the first gap is called gap_open, and the cost for the
following gaps is called gap_extension.

1.3 Affine-gap Model

A change in the scoring pattern is given by involving programmable gap model called affine gap
models. Mainly useful when protein sequencing is checked.

In these formulas, stands for the gap_open, and stands for the gap_extension. E(i,j) and F(i,j)
are the maxima of the following two items: open a new gap or keep extending an existing gap.
Further research revealed that dynamic programming algorithm also mapped very well to a
systolic array due to its potential parallelity.

In the Smith-Waterman algorithm design for DNA sequence, there are four softwareprogrammable parameters, which allow the hardware implementation compatible with the
existing software programs, including both linear and affine gap model algorithms.

1.4 The Smith-Waterman Algorithm-PE Design

S-Out and T-out DFFs are used to store S[i] and T[j].
E-out DFF is used to store E(i,j), and it will be used by the same PE in the next clock cycle.
Its inputs come from the same PE which was generated in the previous clock cycle,
representing the values in its upper neighbour element.
F-out DFF is used to store F(i,j), and it will be used by the next PE. Its inputs come
from the previous PE, representing the values in its left neighbour element.
The input of V-diag DFF comes from the previous PE, and it is registered for one cycle
before it is used by the PE, so it represents the value of its upper-left neighbour element.
V-out DFF is used to store V(i,j).
Max_out DFF is used to store the maximum value of the similarity matrix. It has three inputs:
The maximum value coming from the previous PE
V(i,j) coming from the current PE
The maximum value stored in itself

A Straightforward Smith-Waterman Algorithm PE Design

Before the hardware implementation, we need to estimate the FPGA resources used by the
PE design. The PE data width should be decided by the maximum sequence length and the maximum
value in the substitution matrix. For example, if the length is 64 KBp and the maximum value is 11,
then the PE data width should be at least 20 bits, which means 2 20>64K*11. In the straightforward PE
design, there are five add/sub operations and six max operations. Because each max operation
consists of a subtraction and a 2:1 multiplexing operation, there are 11 add/sub and 6 2:1,
multiplexing operations in total for a PE. If the data width is set to 20 bits, a PE will require
about 340 adaptive look-up tables (ALUTs).

1.4.1 Simplify the Max-Out Operation of V(i,j)

In the PE design, the Max-out DFF is used to store the maximum value of the similarity
matrix. To get the maximum value, for the first step, it must compare the maximum output from
the previous PE with V(i,j) of the current PE, which stands for the comparison between the
horizontal neighbor matrix elements. In the second step, it must compare the output from the first
step with the value stored in the Max-out DFF itself. This stands for the comparison between the
vertical neighbor matrix elements. The result will be stored back to the Max-out DFF itself again
after the two steps.
Considering that at the end of the calculation, all the maximum values stored in each Max-out DFF
will shift out of the array, we can move the second comparison step outside of the array, while the first
comparison step is maintained by each PE. This way, we can delete a maximum operation from the

Simplify the Operation of V(i,j) -

When we cascade multiple PEs to form a PE array, we find that the output of V-out in the left
PE (red sub-block) is identical to the V-in of the right PE (blue sub-block). Therefore, we can
delete the V-in from the PE design. What we need to do is add a new output of the V-out
called V-out-Alpha and an input called instead of the V-in . When cascading the PE to form
the PE array, we need to connect the V-out-Alpha signal of the left PE to the V-in-Alpha signal of
the right PE.

A problem arising from these introduced DFFs is that the PE would require more than one
clock cycle to finish the calculation of one matrix element, which would hinder the
performance greatly.

1.4.2 Handling the Negative Number

When calculating the V(i-1,j-1)+(S[j],T[j]), the result is probably negative due to the
(S[j],T[j]) being negative when S[j] is not equal to T[j]. Because all the max operations are
based on unsigned numbers, we need to process the negative number before it comes to the max
According to Formula, all negative numbers will be reset to zero when following the local
sequencing algorithm. This means that whenever a negative number is generated from the
add/sub operation, we can reset it to zero unconditionally. Therefore, in the PE design, we
introduce a DFF to store the result from the add/sub operation, and we use the MSB (the sign bit)
of the add/sub result as a synchronous clear signal to the DFF. Thus, when a positive number is
derived from the add/sub function, it will be stored into the DFF as it is; when a negative number
is derived from the add/sub function, a zero will be stored into the DFF instead.

1.4.3 Multistage in the PE Design

In the previous sections, we introduced some DFFs to reduce the area cost and to process the
negative numbers, but this also generated more clock cycles to finish the calculation of a matrix
element. To solve this problem, our design features a pipelined control mechanism with uneven
stage latenciesa key to minimizing the overall PE pipeline cycle time.

With the FPGA internal phase-locked loop (PLL), we generate four clocks with the same clock
frequency but with a different phase relationship, as shown. These clocks are connected to the


The phase delays are decided by the timing simulation of the PE design. For example, the delay
from clock to clock-d2, is set to based on the type of FPGA architecture used, because there is a
subtraction operation and a max operation that need to finish during this period, and the longest
data path the delay from clock-d2 to clock-d3 is set.
The block diagram of a PE design with multistage (the LUT logic will be discussed in the next
section). In the design, and are software programmable parameters, and the values of
(S[i],T[j]) are also two software programmable parameters. By setting the parameters properly,
the hardware implementation could be compatible with the existing software programs, including
both linear and affine gap model algorithms.


Section 2
A Top Level Functional Description of the Design
The top module Gene_Sequencing containing Smith_waterman instace is responsible for clock
related assignments that needs to be fed to the the Processing element instances (P-1 down to P4)
contains instances of 4 primitive bocks of Processing elements
Smith_waterman has instances Memory_Element and Processing_Element
Processing Elements are the primitive building blocks of this algorithmic implementation that
computes the scoring matrix. In this implementation we computed 4 x 4 scoring matrix using 4
Processing Elements (Matrix size could be 4 x n, n being the symbols present in the test
sequence). Which effectively means the reference string has four symbols.
Processing_Element encapsulates Gene_Comp, Gap_Comp and Max_Comparator
The overall functionality is made for 4 symbols of test pattern being compared with 4 symbol
based reference sequence.


Section 3
Synthesis in Xilinx
The modules mentioned above are individually checked with respect to the behavioral
functionality assumed using individual test benches written in Verilog.
The synthesis process used in Xilinx ISE is straightforward, once all the modules are checked for
their syntax, a bit file is generated to run the code in FPGA. The intermediate process that Xilinx
follows before generating a bit file are Synthesis.

Synthesize Map Place Route

Once the bit-stream generated onto the FPGA the circuit functionality now is replicated by the


Section 4
Implementation Details
4.1 Clock Generation
The Smith Waterman algorithm implementation demands four different
phases of clock to accelerate the scoring matrix computation. In the Xilinx
platform, the four clocks with different phases are generated through the
Clock Generator IP core. Shown in the image, each clock has an exact phase
difference of 90 degrees. However, Xilinx FPGA imposes minimum
frequency constraint for exact phase generation (Hard to find these
requirements in the Xilinx resource documentations). Our project operated
the clock in 50MHz to keep the phase relationship unaltered, lower
frequency (less than 10MHz) created wrong scoring matrix.

As shown in the Figure, an additional clock was configured, operating in

400MHz, for sampling the other four clocks through the Integrated Logic
4.2 Initiating the Scoring Matrix Computation
Once the code was dumped into the FPGA, the module waits until a trigger
is given. The GPIO switch 1 is configured as the trigger to start the
computation. When the Switch is made high, the computation begins and the
generated matrix can be seen in the VIO (described in the next section).


Before dumping the code, make sure that the switch is in off state, once
everything is set, switch it on to generate the matrix.
4.3 Synchronization between Clocks
Though FPGA takes care of generating different clock phases, it need
some time to synchronize the clock phases. During this non-settling time
region, even though the trigger switch is set high, the modules should not be
triggered. To avoid this synchronization problem, a simple code has been
integrated into the main module which triggers the other modules only when
the FPGA Synchronization is achieved (which can be obtained through a
status signal called LOCKED from the Clock Generation IP core) and also
the trigger switch is set high.

4.4 Architecture
4.4.1 Processing Element:
As briefed in the Hierarchy section, the project is sectioned into multiple
hierarchy level to ease the testing process and increase the scalability of the
code. Processing Element (PE) module is the heart of the Smith Waterman
algorithm computation. In this project, the PE module has been instantiated
four times to compute a 4*4 matrix, nevertheless it can be instantiated N
number of time to compute an N*N matrix. The scoring matrix dimension is
only restricted by the total number of LUTs available in the Xilinx FPGA.
4.4.2 Memory Element:
The inputs Gene Sequence is hardcoded in the Memory Element (ME) (but
earlier it was planned to take the sequence from the RAM block). Number of

registers used, which is same as that of the Test Sequence length (T_in) and
the width of the data are all parameterized. It is very straightforward to add
additional gene sequence in the ME and instantiating the PE modules, but
one has to take care in changing the parameterized variables to support the
added data.
4.4.3 Handling Negative Numbers
Gene_comp and gap_comp modules generate negative numbers while
doing the subtraction operation, and the negative numbers are usually
represented in the 2s complement form in Xilinx FPGA (at least in the
Spartan 6 FPGA). Negative numbers are troublesome, since the comparator
module gives wrong comparison results and introduces errors in the score
matrix. Fortunately, in the Smith Waterman algorithm, it is allowed to
represent a negative number as zero, and still the algorithm computes the
right score matrix. Therefore, for representing 2^n elements, n+1 bits are
used. When computing subtractions, the MSB is always zero for the positive
numbers and 1 for negative numbers. Whenever the MSB bit is one, the
number is crushed to zero.
Affine Gap Model:
Affine gap model, which is an advanced version of original Smith Waterman
algorithm, is not supported in this project. But the inclusion of the affine gap
variable namely, beta, can be easily done by just replacing the encircled
Eout in the Gap_Comp module with beta variable as shown in the
below screen shot. Also, negative number generation during the subtraction
operation should be handled properly, as explained in the above section.
However, the affine gap model code, by adding the beta variable, is not
simulated and tested in this project.



Section 5
Verification using Chipscope
ChipScope: ILA and VIO
Chipscope is used for real time debugging that captures the user specified
signals through Integrated Logic Analyzers (ILA) and also send the data in
and out though VIO (Virtual Input and Output). In this project we have
configured 1 ILA and 4 VIOs. One ILA is used to capture all the four clock
signals, ENABLE signals that activates different PE blocks sequentially and
other random signals (we primarily used ILA to verify whether the generated
clock phases are correct and the enable signals are generated properly), and
4 VIOs to get the score matrix stored inside the registers.

ILA, VIO and ICON (please read the basics of Chip-scope from Xilinx user
manuals and other resources) are all IP cores, which can be easily generated
and configured. All the IP cores are integrated in the top-level block
-Gene_Sequencing, and almost all the instances are self-explanatory.
Once the code is dumped into the FPGA, and before triggering the GPIO
Switch 1, open the Chip-Scope tool. After the Chip-scope successfully
establishes the connection, click the run button. The ILA is configured to
wait until the trigger button (GPIO switch 1) is made high, and once it
senses the trigger, all the configured waveforms will be displayed.
Below figure shows the Match configuration, TriggerPort0 is connected to
the GPIO Switch 1 and it is matched to the value 1.


Trigger Configuration, as show in the below diagram, is configured such that

all the other signal waits until the TriggerPort0 becomes. This setting means
that ILA will start the acquisition only when the Trigger button (GPIO
Switch 1) is set high.

VIO modules need no trigger settings, it mirrors any data change happen to
any of the configured registers. Once the GPIO button is set high, which
activates all the modules, and starts the score matrix computation operation,
the computed results are stores inside the registers. VIO fetches the data
from all the registers (which stores the scores) and displays the result. The

data bits have been grouped to view the integer results. Image shown below
displays the score matrix using VIO.

Before we check for the functionality of the circuit, since the clocks with different phases is to be
we use


Section 6
The objective of this project is to generate a scoring matrix of dimension 4 x 4 using Smith
Waterman algorithm. We have used hard coded reference and test sequences using an array of
initialized registers as the memory module.
Reference sequence S AATG;

Test sequence T ATAG

Hand calculated and FPGA processed output scoring matrix are found to be the same.


Section 7
Issues faced during the synthesis and verification process.
Simulation usually does not model the effects due to Clock Synchronization. But in the hardware
platform if the clocks are not synchronized properly, the score matrix would result in wrong
values. The Clock synchronization problem is discussed in the Section 4.2
Clock In may not be directly fed into the ILA Module as a reference clock, a buffer related
problem which is not well addressed in the discussion forums. One workaround is, while
generating the Clock Genrator IP core, set the option as No Buffer (as shown in the image below).

Four clocks, with different phases, run in the frequency of 50MHz. In order to sample the clocks
using the ILA module, a sampling frequency of 400MHz (50MHz * 2 * 4) is needed, so an
additional clock is configured with the above frequency and fed it as a reference clock to ILA.
Without a proper understanding of the ILA module, it is very difficult to sample the desired
The design bottleneck coming to the practical implementation is with respect to the operational


Section 8
Conclusions and Future scope
The affine gap model mentioned in the reference is employed in case of protein
sequencing is not used in this implementation. In case of scoring a bigger sequence the
number of Processing_Elements are to be increased.
The sequence of the reference and test strings are called from a pre-initialized reg matrix
structure, which maps behavior of a RAM sitting on the FPGA.
Given a pattern of known behavior intruding a communication network, A similar scoring
could be employed to figure out to point if the data is being corrupted somewhere. Processing a
scoring matrix using dedicated hardware is much efficient compared to software based