
Area-Efficient Parallel FIR Digital Filter Implementations *

David A. Parker and Keshab K. Parhi


Department of Electrical Engineering
University of Minnesota
Minneapolis, MN 55455

E-Mail: {dparker, parhi}@ee.umn.edu
Abstract

This paper presents a novel approach for implementing area-efficient parallel (block) finite impulse
response (FIR) filters that require less hardware than traditional block FIR filter implementations.
Parallel processing is a powerful technique because it can be used to increase the throughput of a
FIR filter or reduce the power consumption of a FIR filter. However, a traditional block filter
implementation causes a linear increase in the hardware cost (area) by a factor of L, the block
size. In many design situations, this large hardware penalty cannot be tolerated. Therefore, it is
advantageous to produce parallel FIR filter implementations that require less area than traditional
block FIR filtering structures. In this paper, we propose a method to design parallel FIR filter
structures that require a less-than-linear increase in the hardware cost. A novel adjacent coefficient
sharing based sub-structure sharing technique is introduced and used to reduce the hardware cost of
parallel FIR filters. A novel coefficient quantization technique, referred to as a maximum absolute
difference (MAD) quantization process, is introduced and used to produce quantized filters with good
spectrum characteristics. By using a combination of fast FIR filtering algorithms, a novel coefficient
quantization process and area reduction techniques, we show that parallel FIR filtering structures with
up to a 45% reduction in hardware can be achieved for the given examples.

1: Introduction
The finite-impulse response (FIR) filter has been and continues to be one of the fundamental
processing elements in any digital signal processing (DSP) system. FIR filters are used in DSP
applications that range from video and image processing to wireless communications. In some
applications, such as video processing, the FIR filter circuit must be able to operate at high frequencies,
while in other applications, such as cellular telephony, the FIR filter circuit must be a low-power
circuit, capable of operating at moderate frequencies. Parallel, or block, processing can be applied
to digital FIR filters to either increase the effective throughput or reduce the power consumption
of the original filter. Traditionally, the application of parallel processing to a FIR filter involves
the replication of the hardware units that exist in the original filter. If the area required by the
original circuit is A, then the L-parallel circuit requires an area of L × A. With the continuing trend
to reduce chip size and integrate multi-chip solutions into a single-chip solution, it is important to
limit the silicon area required to implement a parallel FIR digital filter in a VLSI implementation.
In many design situations, the hardware overhead that is incurred by parallel processing cannot be
tolerated due to limitations in design area. Therefore, it is advantageous to realize parallel FIR
filtering structures that consume less area than traditional parallel FIR filtering structures.
While FIR filters have been given extensive consideration in the existing literature, most of
this work focuses on only a single aspect such as filter design (coefficient generation) [8] or filter

This research has been supported by the Army Research Office under contract number DA/DAAH04-93-G-0318.

1063-6862/96 $5.00 © 1996 IEEE

Authorized licensed use limited to: IEEE Xplore. Downloaded on January 26, 2009 at 11:23 from IEEE Xplore. Restrictions apply.


coefficient quantization [20] or generation of filter circuits through compilation [5, 4]. Furthermore,
very little work has been done that deals directly with reducing the hardware complexity of parallel
FIR filters [17]. In order to design area-efficient parallel filters, we must consider the entire design
process. Therefore, our goal is to take a comprehensive look at all of the aspects from filter design
to implementation to produce low-area parallel FIR filter structures. In this paper, we propose a
methodology by which low-area parallel FIR filter structures can be produced. This method can
be broken down into four basic steps: 1) designing a FIR filter from spectrum requirements, 2)
generating a parallel architecture from the original FIR filter, 3) quantizing the filter coefficients
and 4) applying sub-structure sharing and other area reduction techniques to reduce the hardware
cost. Step 1 consists of nothing more than choosing a FIR filter design algorithm from the many
in existence. This gives our method added flexibility because it is not optimized to work with a
specific filter design algorithm. Since Step 1 is well understood, it is given no further consideration.
In the remainder of this paper, Steps 2-4 of our design method are discussed in detail. In addition,
we apply this method to some specific FIR filter examples.

2: Parallel Processing for High-speed or Low-Power


It is well-known that the application of parallel processing to a FIR filter can increase the
throughput of the FIR filter. If an L-parallel filter is operated at the same clock rate as the original filter,
L output samples are generated every clock cycle compared to the single output sample that is
produced every clock cycle in the original filter. This implies that the L-parallel filter effectively
operates at L times the rate of the original FIR filter.
While it is clear that parallel processing can increase the throughput of a FIR filter, the technique
of parallel processing can also be used to reduce the power consumption of a FIR filter. This fact is
often overlooked. The application of parallel processing facilitates the lowering of the supply voltage,
which in turn leads to a decrease in the power consumption [16, 15]. Let $P_0 = C_0 V_0^2 f_0$ represent the
power consumed in the original FIR filter, where $C_0$ is the capacitance of the original filter, $V_0$ is
the supply voltage of the original filter and $f_0$ is the clock frequency of the original filter. It should
be noted that $f_0 = 1/T_0$, where $T_0$ is the clock period of the original filter. In order to maintain the
same sample rate, the clock period of an L-parallel filter must be increased to $LT_0$ since L samples
are produced every clock cycle. This means that $C_0$ is charged in time $LT_0$ rather than in time $T_0$.
In other words, there is more time to charge the same capacitance (see Figure 1). This implies that
the supply voltage can be lowered to $\beta V_0$, where $\beta$ is a positive constant less than 1. By examining
the propagation delay considerations of the original and parallel filter, the power supply reduction
factor, $\beta$, can be determined. The propagation delay of the original circuit is given by

$$T_{pd} = \frac{C_0 V_0}{k(V_0 - V_t)^2}, \qquad (1)$$

where $k$ is a process-dependent parameter and $V_t$ is the device threshold voltage. It should be noted
that the clock period, $T_0$, is typically set equal to the maximum propagation delay, $T_{pd}$, in a circuit.
The propagation delay of the L-parallel filter is given by

$$L T_{pd} = \frac{C_0 \beta V_0}{k(\beta V_0 - V_t)^2}. \qquad (2)$$

From (1) and (2), the following quadratic equation is obtained:

$$L(\beta V_0 - V_t)^2 = \beta (V_0 - V_t)^2. \qquad (3)$$

This equation is used to solve for $\beta$. Once $\beta$ is obtained, the reduced power consumption of the FIR
filter can be calculated using

$$P = \beta^2 (L C_0) V_0^2 (f_0 / L) = \beta^2 C_0 V_0^2 f_0. \qquad (4)$$

As can be seen, parallel processing leads to a reduction in power consumption by a factor of $\beta^2$.

It should be noted that the supply voltage cannot be lowered indefinitely by increasing the level of



parallelism in a filter. There is a lower bound on the supply voltage which is dictated by the process
parameters.
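As a numerical check of (3) and (4), the quadratic in β can be expanded and solved directly. The sketch below is a minimal illustration (the helper name and the example values V0 = 3.3 V, Vt = 0.6 V, L = 4 are our own assumptions, not values taken from the paper):

```python
import math

def supply_reduction_factor(L, V0, Vt):
    """Solve L*(beta*V0 - Vt)^2 = beta*(V0 - Vt)^2 for beta (eq. (3)).

    Expanding gives the quadratic
        L*V0^2 * beta^2 - (2*L*V0*Vt + (V0 - Vt)^2) * beta + L*Vt^2 = 0.
    The physically meaningful root satisfies Vt/V0 < beta < 1, so that the
    scaled supply beta*V0 still exceeds the device threshold.
    """
    a = L * V0 * V0
    b = -(2 * L * V0 * Vt + (V0 - Vt) ** 2)
    c = L * Vt * Vt
    disc = math.sqrt(b * b - 4 * a * c)
    roots = [(-b - disc) / (2 * a), (-b + disc) / (2 * a)]
    # keep the root for which the scaled supply still exceeds the threshold
    return max(r for r in roots if Vt / V0 < r < 1)

# Example values (assumed): 3.3 V supply, 0.6 V threshold, 4-parallel filter
beta = supply_reduction_factor(L=4, V0=3.3, Vt=0.6)
power_ratio = beta ** 2   # P = beta^2 * C0 * V0^2 * f0, from eq. (4)
```

For these assumed values the valid root is roughly β ≈ 0.46, so the parallel filter would consume only about β² ≈ 21% of the original power, at the same sample rate.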
Figure 1. (a) Charging/discharging of entire capacitance in clock period T (b) Charging/discharging of entire capacitance in clock period 3T using a 3-parallel filter

3: Fast FIR Algorithms


3.1: An Introduction to Fast FIR Algorithms
As we have seen, parallel processing is a powerful technique that can be applied to FIR filters
to either increase the throughput or decrease the power consumption. Unfortunately, in many
situations, the use of parallel processing is avoided due to the linear increase in the hardware cost
that results from the use of this technique. Consider the polyphase [21] representation of a traditional
parallel FIR filter:

$$X(z) = \sum_{i=0}^{L-1} z^{-i} X_i(z^L), \quad H(z) = \sum_{i=0}^{L-1} z^{-i} H_i(z^L), \quad Y(z) = \sum_{i=0}^{L-1} z^{-i} Y_i(z^L), \qquad (5)$$

where

$$H_i(z) = \sum_{k=0}^{N/L-1} h(Lk+i) z^{-k}, \quad i = 0, 1, \ldots, L-1,$$

with $X_i(z)$ and $Y_i(z)$ defined similarly. This implies that

$$Y = \sum_{i=0}^{L-1} z^{-i} Y_i = \left( \sum_{i=0}^{L-1} z^{-i} X_i \right) \left( \sum_{j=0}^{L-1} z^{-j} H_j \right). \qquad (6)$$

It is clear that the parallel filter requires $L^2$ filtering operations of length-$N/L$. This is in agreement
with the fact that the complexity of a traditional block filter increases linearly with the block size or
the number of samples processed in parallel in a clock cycle. However, looking at (5), it is possible
to reduce the number of product terms on the RHS to something less than $L^2$. Since the work of
Winograd [22], it is known that two polynomials of degree $L-1$ can be multiplied using only $(2L-1)$
product terms. This reduction in the number of multipliers comes at the expense of additional



adders. Replacing multipliers with add operations is advantageous because adders have a smaller
implementational cost than multipliers in terms of silicon area. For large values of L, however, the
number of adders becomes unmanageable. In this case, polynomial product algorithms using slightly
more than (2L − 1) product terms are employed. These suboptimal algorithms achieve a balance
between the number of multiplications and additions required to perform the polynomial product.
Since the product terms in the polynomial formulation of the parallel FIR filter are equivalent to
filtering operations, this implies that the parallel FIR filter can be realized using approximately
(2L − 1) FIR filters of length-N/L.
A relatively new class of algorithms, termed fast FIR algorithms (FFAs) [14, 13, 12, 6, 24, 23],
rely upon this approach to produce reduced-complexity parallel filtering structures. Using this
approach, the L-parallel filter can be implemented using approximately (2L − 1) filtering operations
of length-(N/L). The resulting parallel filtering structure would require (2N − N/L) multipliers.
As an example, let N = 4 and let L = 2. The traditional 2-parallel approach would require 8
multiplications while the 2-parallel fast filtering approach would require only 6 multipliers. For large
values of N, the advantage of the FFAs is clear.
In the general case, an (n-by-n) FFA produces a FIR filtering structure that is the functional
equivalent of a parallel FIR filter of block size n. The application of an (n-by-n) FFA produces a
set of filters, each of which is of length N/n, where N is the length of the original FIR filter. The set
of filters that are produced by an (n-by-n) FFA will consist of the n filters, $H_0, H_1, \ldots, H_{n-1}$, that
are produced by taking the polyphase decomposition of the original filter with decomposition factor
n, plus the filters that result from taking the additive combinations of these n filters. The proper
filter transfer function is realized with the addition of some pre- and post-processing steps that are
performed in conjunction with the filtering operations. This will become clear when specific fast FIR
filtering algorithms are presented in the subsequent sections. In this work, we focus our attention
on the (2-by-2) and (3-by-3) FFAs. Larger FFAs (e.g., (5-by-5)) do exist and are discussed in detail
in [14].
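The multiplier counts above can be tabulated with a few lines of arithmetic; this illustrative sketch (helper names are ours) reproduces the N = 4, L = 2 comparison from the text:

```python
def traditional_mults(N, L):
    # L^2 subfilters of length N/L: L^2 * (N/L) = L*N multipliers
    return L * N

def ffa_mults(N, L):
    # about (2L - 1) subfilters of length N/L: (2L - 1) * (N/L) = 2N - N/L
    return (2 * L - 1) * (N // L)

# The N = 4, L = 2 example from the text: 8 multipliers vs. 6
print(traditional_mults(4, 2), ffa_mults(4, 2))   # -> 8 6
```

For a longer filter, say N = 128 with L = 2, the gap widens to 256 versus 192 multipliers, which is the (2N − N/L) count quoted above.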
3.2: (2-by-2) Fast FIR Algorithm
Now that the necessary background material has been covered, we will focus our attention upon
specific fast FIR filtering algorithms. In this section, the (2-by-2) FFA is covered. The (2-by-2)
FFA results in a 2-parallel filtering structure. Let us begin by considering the standard 2-parallel
filtering structure. From (6), we have

$$Y = Y_0 + z^{-1} Y_1 = (X_0 + z^{-1} X_1)(H_0 + z^{-1} H_1) = X_0 H_0 + z^{-1}(X_0 H_1 + X_1 H_0) + z^{-2} X_1 H_1,$$

which implies that

$$Y_0 = X_0 H_0 + z^{-2} X_1 H_1, \qquad Y_1 = X_0 H_1 + X_1 H_0. \qquad (7)$$

Figure 2 shows the resulting 2-parallel FIR filtering structure. This structure requires 2N multipliers
and 2(N − 1) additions.
If (7) is written in a different, yet equivalent, form, the (2-by-2) FFA is obtained:

$$Y = Y_0 + z^{-1} Y_1 = (X_0 + z^{-1} X_1)(H_0 + z^{-1} H_1) = X_0 H_0 + z^{-1}[(X_0 + X_1)(H_0 + H_1) - X_0 H_0 - X_1 H_1] + z^{-2} X_1 H_1,$$

which implies that

$$Y_0 = X_0 H_0 + z^{-2} X_1 H_1, \qquad Y_1 = (X_0 + X_1)(H_0 + H_1) - X_0 H_0 - X_1 H_1. \qquad (8)$$


Figure 2. Traditional 2-Parallel FIR Filter Implementation


The 2-parallel fast FIR filtering structure which results from this (2-by-2) FFA is shown in Figure
3. This structure computes a block of 2 outputs using 3 length-N/2 FIR filters and 4 pre/post-processing
additions. At first glance, it may appear that 5 filtering operations are required since
$Y_0$ requires 2 multiplies (filtering operations) and $Y_1$ requires 3 multiplies (filtering operations).
However, the terms $X_0 H_0$ and $X_1 H_1$ are found in both $Y_0$ and $Y_1$. These two terms need to be
computed only once, which means that a total of only 3 filtering operations are required. Therefore,
this structure requires 3(N/2) multipliers and 3(N/2 − 1) + 4 adders. For the sake of clarity, let us
take a quick look at a specific example. Consider applying the (2-by-2) FFA to a 4-tap FIR filter,
$H = h_0 + h_1 z^{-1} + h_2 z^{-2} + h_3 z^{-3}$. The (2-by-2) FFA decomposes the original 4-tap FIR filter into 3
length-2 FIR filters, $H_0 = h_0 + h_2 z^{-1}$, $H_1 = h_1 + h_3 z^{-1}$, and $H_0 + H_1 = (h_0 + h_1) + (h_2 + h_3) z^{-1}$. The
filters $H_0$, $H_1$ and $H_0 + H_1$ each compute a single output every clock cycle. These three outputs are
then combined as in Figure 3 to produce the two outputs, y(2k) or $Y_0$ and y(2k + 1) or $Y_1$. It should
be noted that the addition of $H_0$ and $H_1$ does not cost anything in terms of the implementation
because the filter coefficients are fixed and known prior to the implementation. This sum can be
computed off-line.

Figure 3. Reduced-Complexity 2-Parallel Fast FIR Filter Implementation


Since the implementational cost of a multiplier is much greater than that of an adder, the cost to
implement the parallel filtering structure can be approximated as being proportional to the number
of multipliers required for implementation. This is a very reasonable approximation for comparison
purposes. Based upon this approximation, the 2-parallel fast FIR filtering structure requires about
25% less hardware (area) than the traditional 2-parallel implementation.
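Equation (8) can be verified numerically. The sketch below is our own minimal NumPy model of the 2-parallel structure (not a hardware description; the helper name ffa2_filter is an assumption): it forms the three length-N/2 subfilter products and recombines them, and should agree with direct convolution.

```python
import numpy as np

def ffa2_filter(h, x):
    """2-parallel fast FIR via eq. (8): 3 length-N/2 subfilters instead of 4."""
    h0, h1 = h[0::2], h[1::2]            # polyphase components of H
    x0, x1 = x[0::2], x[1::2]            # even/odd input samples
    p0 = np.convolve(x0, h0)             # X0 H0
    p1 = np.convolve(x1, h1)             # X1 H1
    p2 = np.convolve(x0 + x1, h0 + h1)   # (X0+X1)(H0+H1)
    m = len(p0) + 1                      # leave room for one block delay
    y0, y1 = np.zeros(m), np.zeros(m)
    y0[:len(p0)] += p0                   # Y0 = X0 H0 + z^-2 X1 H1
    y0[1:1 + len(p1)] += p1              #   (z^-2 = one delay at the block rate)
    y1[:len(p2)] += p2 - p0 - p1         # Y1 = (X0+X1)(H0+H1) - X0 H0 - X1 H1
    y = np.empty(2 * m)
    y[0::2], y[1::2] = y0, y1            # interleave the two output streams
    return y[:len(x)]

rng = np.random.default_rng(0)
h = rng.standard_normal(8)               # N = 8 taps (even)
x = rng.standard_normal(32)              # even number of input samples
assert np.allclose(ffa2_filter(h, x), np.convolve(x, h)[:len(x)])
```

Only three convolutions appear, matching the 3(N/2)-multiplier count of the structure in Figure 3.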

3.3: (3-by-3) Fast FIR Algorithm


The (3-by-3) FFA is similar to the (2-by-2) FFA from the standpoint that it uses pre/post-processing
additions in the filtering structure to reduce the number of multipliers needed in the
implementation. The (3-by-3) FFA produces a parallel filtering structure of block size 3. In
order to understand how the (3-by-3) FFA works, we will again begin with the traditional parallel
filtering approach. From (6), we have

$$Y = Y_0 + z^{-1} Y_1 + z^{-2} Y_2 = (X_0 + z^{-1} X_1 + z^{-2} X_2)(H_0 + z^{-1} H_1 + z^{-2} H_2)$$
$$= X_0 H_0 + z^{-1}(X_0 H_1 + X_1 H_0) + z^{-2}(X_0 H_2 + X_1 H_1 + X_2 H_0) + z^{-3}(X_1 H_2 + X_2 H_1) + z^{-4} X_2 H_2,$$



which implies that

$$Y_0 = H_0 X_0 + z^{-3}(H_1 X_2 + H_2 X_1)$$
$$Y_1 = (H_0 X_1 + H_1 X_0) + z^{-3} H_2 X_2 \qquad (9)$$
$$Y_2 = H_0 X_2 + H_1 X_1 + H_2 X_0.$$

The traditional 3-parallel filtering structure requires 3N multiplications and 3(N − 1) additions.
By manipulating (9) through a series of steps, the number of filtering operations can be reduced,
which in turn reduces the total number of multipliers required to realize the 3-parallel filtering
structure. The (3-by-3) FFA is obtained by several applications of the same technique that was
used to produce the (2-by-2) FFA. From (6), we have

$$Y = Y_0 + z^{-1} Y_1 + z^{-2} Y_2,$$

where

$$Y_0 = H_0 X_0 - z^{-3} H_2 X_2 + z^{-3}[(H_1 + H_2)(X_1 + X_2) - H_1 X_1]$$
$$Y_1 = [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [H_0 X_0 - z^{-3} H_2 X_2] \qquad (10)$$
$$Y_2 = [(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)] - [(H_0 + H_1)(X_0 + X_1) - H_1 X_1] - [(H_1 + H_2)(X_1 + X_2) - H_1 X_1].$$

The same technique that is used to produce the (2-by-2) FFA is applied 4 times in order to derive
the (3-by-3) FFA. In Step 3, it is used once on the term $(X_0 + z^{-1}V)(H_0 + z^{-1}W)$ from Step 2. In
Step 5, the technique is used 3 times on the terms $(X_0 + V)(H_0 + W)$ and $VW$ from Step 3 (the term
$VW$ appears twice in the equation of Step 3). Figure 4 shows the filtering structure that results from
the (3-by-3) FFA. This structure requires 6 length-N/3 FIR filters and 10 pre/post-processing
additions to realize the proper transfer function. The (3-by-3) FFA structure requires 6(N/3)
multipliers and 6(N/3 − 1) + 10 adders. Comparing the implementational cost of the traditional and
reduced-complexity 3-parallel structures, it should be clear that the reduced-complexity filtering
structure provides a savings of approximately 33% over the traditional structure.
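Equation (10) admits the same kind of numerical check as the 2-parallel case. The following sketch (again our own software model with an assumed helper name, not a hardware description) realizes the six subfilters and the recombination of (10):

```python
import numpy as np

def ffa3_filter(h, x):
    """3-parallel fast FIR via eq. (10): 6 length-N/3 subfilters instead of 9."""
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    p00 = np.convolve(x0, h0)                       # X0 H0
    p11 = np.convolve(x1, h1)                       # X1 H1
    p22 = np.convolve(x2, h2)                       # X2 H2
    q01 = np.convolve(x0 + x1, h0 + h1)             # (X0+X1)(H0+H1)
    q12 = np.convolve(x1 + x2, h1 + h2)             # (X1+X2)(H1+H2)
    q = np.convolve(x0 + x1 + x2, h0 + h1 + h2)     # (X0+X1+X2)(H0+H1+H2)
    m = len(p00) + 1                                # room for one block delay (z^-3)
    y0, y1, y2 = np.zeros(m), np.zeros(m), np.zeros(m)
    y0[:-1] += p00
    y0[1:] += q12 - p11 - p22        # z^-3 [(H1+H2)(X1+X2) - H1X1] - z^-3 H2X2
    y1[:-1] += q01 - p11 - p00
    y1[1:] += p22                    # ... + z^-3 H2X2
    y2[:-1] += q - q01 - q12 + 2 * p11
    y = np.empty(3 * m)
    y[0::3], y[1::3], y[2::3] = y0, y1, y2          # interleave the 3 output streams
    return y[:len(x)]

rng = np.random.default_rng(1)
h = rng.standard_normal(9)                          # N = 9 taps (multiple of 3)
x = rng.standard_normal(30)
assert np.allclose(ffa3_filter(h, x), np.convolve(x, h)[:len(x)])
```

Six convolutions appear instead of nine, which is exactly the 6(N/3)-multiplier count quoted for the structure in Figure 4.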



Figure 4. Reduced-Complexity 3-Parallel Fast FIR Filter Implementation


3.4: Cascading Fast FIR Algorithms

In the previous two sections, we have shown how the FFAs can be used to produce reduced-complexity
2-parallel and 3-parallel filtering structures. In many design situations, however, it
may be desirable to use parallel FIR filters with greater levels of parallelism. The (2-by-2) and
(3-by-3) FFAs have the added flexibility that they can be cascaded together to achieve greater
levels of parallelism. The cascading of FFAs is a straight-forward extension of the original FFA
application. For example, an (m-by-m) FFA can be cascaded with an (n-by-n) FFA to produce
an (m × n)-parallel filtering structure. The set of FIR filters that result from the application of the
(m-by-m) FFA are further decomposed, one at a time, by the application of the (n-by-n) FFA.
The resulting set of filters will be of length N/(m × n). When cascading the FFAs, it is important
to keep track of both the number of multipliers and the number of adders required for the filtering
structure. The number of required multipliers is calculated as follows:

$$M = \left( \prod_{i=1}^{r} M_i \right) \frac{N}{\prod_{i=1}^{r} L_i}, \qquad (11)$$

where $r$ is the number of FFAs used, $L_i$ is the block size of the FFA at step-$i$ and $M_i$ is the number
of filters that result from the application of the $i$-th FFA. The number of required adders is
calculated as follows:

$$A = \sum_{i=1}^{r} \left[ \left( \prod_{j=1}^{i-1} M_j \right) \left( \prod_{j=i+1}^{r} L_j \right) A_i \right] + \left( \prod_{i=1}^{r} M_i \right) \left( \frac{N}{\prod_{i=1}^{r} L_i} - 1 \right), \qquad (12)$$

where $A_i$ is the number of pre/post-processing adders required by the $i$-th FFA. Consider the case
of cascading 2 (2-by-2) FFAs. The resulting 4-parallel filtering structure would require a total of
9N/4 multipliers and 20 + 9(N/4 − 1) adders for implementation. The reduced-complexity 4-parallel
filtering structure represents a hardware (area) savings of nearly 44% when compared to the 4N
multipliers required in the traditional 4-parallel FIR filtering structure.
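The cascaded-cost bookkeeping can be mechanized. In the sketch below (our own helper; the adder accounting is one consistent convention that reproduces the 20 + 9(N/4 − 1) and 42 + 18(N/6 − 1) totals quoted in the text), each stage is described by its block size, subfilter count and pre/post adders:

```python
from math import prod

def cascade_cost(N, stages):
    """Cost of cascaded FFAs.  Each stage is (L_i, M_i, A_i): block size,
    number of subfilters produced, and pre/post-processing adders.
    Returns (multipliers, adders)."""
    Ls = [s[0] for s in stages]
    Ms = [s[1] for s in stages]
    As = [s[2] for s in stages]
    subfilters = prod(Ms)
    taps = N // prod(Ls)                     # length of each final subfilter
    mults = subfilters * taps
    adders = subfilters * (taps - 1)         # adders inside the subfilters
    for i in range(len(stages)):
        # stage-i pre/post adders are replicated M_j times by the stages
        # before it and act on data vectors L_j wide for the stages after it
        adders += prod(Ms[:i]) * prod(Ls[i + 1:]) * As[i]
    return mults, adders

FFA2 = (2, 3, 4)    # (2-by-2): 3 subfilters, 4 pre/post adders
FFA3 = (3, 6, 10)   # (3-by-3): 6 subfilters, 10 pre/post adders

# 4-parallel: two (2-by-2) FFAs -> 9N/4 mults, 20 + 9(N/4 - 1) adders
print(cascade_cost(24, [FFA2, FFA2]))   # -> (54, 65)
```

With N = 24, the 4-parallel cascade gives 9(24)/4 = 54 multipliers and 20 + 9(6 − 1) = 65 adders; the same helper reproduces the 6-parallel totals of 3N multipliers and 42 + 18(N/6 − 1) adders given later in this section.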
In order to understand how the FFAs are cascaded, it is useful to consider (6) once again. Let
us begin with a parallel FIR filter with a block size of 4. From (6), we have

$$Y = Y_0 + z^{-1} Y_1 + z^{-2} Y_2 + z^{-3} Y_3 = (X_0 + z^{-1} X_1 + z^{-2} X_2 + z^{-3} X_3)(H_0 + z^{-1} H_1 + z^{-2} H_2 + z^{-3} H_3). \qquad (13)$$

The reduced-complexity 4-parallel FIR filtering structure is obtained by first applying the (2-by-2)
FFA to (13) and then applying the FFA a second time to each of the filtering operations that result
from the first application of the FFA. From (13), we have
$$Y = (X_0' + z^{-1} X_1')(H_0' + z^{-1} H_1'), \qquad (14)$$

where $X_0' = X_0 + z^{-2} X_2$, $X_1' = X_1 + z^{-2} X_3$, $H_0' = H_0 + z^{-2} H_2$ and $H_1' = H_1 + z^{-2} H_3$.

Application 1

$$Y = X_0' H_0' + z^{-1}[(X_0' + X_1')(H_0' + H_1') - X_0' H_0' - X_1' H_1'] + z^{-2} X_1' H_1'. \qquad (15)$$

The (2-by-2) FFA is then applied a second time to each of the filtering operations $X_0' H_0'$, $X_1' H_1'$
and $(X_0' + X_1')(H_0' + H_1')$ of (15).
Application 2, Filtering Operation $X_0' H_0'$

$$X_0' H_0' = (X_0 + z^{-2} X_2)(H_0 + z^{-2} H_2) = X_0 H_0 + z^{-2}[(X_0 + X_2)(H_0 + H_2) - X_0 H_0 - X_2 H_2] + z^{-4} X_2 H_2 \qquad (16)$$

Application 2, Filtering Operation $X_1' H_1'$

$$X_1' H_1' = (X_1 + z^{-2} X_3)(H_1 + z^{-2} H_3) = X_1 H_1 + z^{-2}[(X_1 + X_3)(H_1 + H_3) - X_1 H_1 - X_3 H_3] + z^{-4} X_3 H_3 \qquad (17)$$

Application 2, Filtering Operation $(X_0' + X_1')(H_0' + H_1')$

$$(X_0' + X_1')(H_0' + H_1') = [(X_0 + X_1) + z^{-2}(X_2 + X_3)][(H_0 + H_1) + z^{-2}(H_2 + H_3)]$$
$$= (X_0 + X_1)(H_0 + H_1) + z^{-2}[(X_0 + X_1 + X_2 + X_3)(H_0 + H_1 + H_2 + H_3) - (X_0 + X_1)(H_0 + H_1) - (X_2 + X_3)(H_2 + H_3)] + z^{-4}(X_2 + X_3)(H_2 + H_3) \qquad (18)$$
The second application of the (2-by-2) FFA leads to the 9 length-N/4 filtering operations that are
required in the 4-parallel filtering structure. The resulting reduced-complexity 4-parallel filtering
structure is shown in Figure 5. The 4-parallel structure shown in Figure 5 can be thought of as
3 separate (2-by-2) FFAs each producing 2 outputs which are combined to produce the 4 filter
outputs. As stated earlier, this structure uses approximately 44% less hardware than the traditional
4-parallel FIR filter.
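The two-application construction above can also be exercised in software by applying the (2-by-2) split recursively, so that each of the three subfilter products is itself computed with the (2-by-2) FFA. This is an illustrative model only (the helper name is ours; filter and block lengths are assumed even at every level of splitting):

```python
import numpy as np

def ffa2_conv(h, x, depth):
    """Full linear convolution h*x, splitting with the (2-by-2) FFA `depth` times."""
    if depth == 0:
        return np.convolve(x, h)
    h0, h1 = h[0::2], h[1::2]
    x0, x1 = x[0::2], x[1::2]
    p0 = ffa2_conv(h0, x0, depth - 1)            # X0 H0
    p1 = ffa2_conv(h1, x1, depth - 1)            # X1 H1
    p2 = ffa2_conv(h0 + h1, x0 + x1, depth - 1)  # (X0+X1)(H0+H1)
    m = len(p0) + 1
    y0, y1 = np.zeros(m), np.zeros(m)
    y0[:-1] += p0
    y0[1:] += p1                                 # z^-2 X1 H1 (one block delay)
    y1[:-1] += p2 - p0 - p1
    y = np.empty(2 * m)
    y[0::2], y[1::2] = y0, y1                    # Y = Y0(z^2) + z^-1 Y1(z^2)
    return y[:len(x) + len(h) - 1]

rng = np.random.default_rng(2)
h = rng.standard_normal(16)   # two levels of splitting leave length-4 subfilters
x = rng.standard_normal(16)
assert np.allclose(ffa2_conv(h, x, depth=2), np.convolve(x, h))
```

With depth = 2, the recursion performs 3 × 3 = 9 base convolutions of length N/4, mirroring the 9 subfilters of Figure 5.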
Figure 5. Reduced-Complexity 4-Parallel FIR Filter


In the remainder of this section, we present the 6-parallel FIR filter. The 6-parallel FIR filter is
generated by cascading a (2-by-2) FFA with a (3-by-3) FFA. The process is essentially identical
to the process that was used to generate the 4-parallel filtering structure. Beginning with (6), the
(2-by-2) FFA is applied, resulting in 3 filtering operations. The (3-by-3) FFA is then applied
to each of these filters, producing the 18 filtering operations that are required in the 6-parallel
filtering structure. It should be noted that when (2-by-2) and (3-by-3) FFAs are cascaded,
the (2-by-2) FFA is always applied first as this will lead to the lowest implementational cost
(see (11) and (12)). The resulting 6-parallel filtering structure requires 18N/6, or 3N, multipliers
and 42 + 18(N/6 − 1) adders. This reduced-complexity 6-parallel filtering structure provides an
area savings of approximately 50% compared to the traditional 6-parallel filtering structure. The
reduced-complexity 6-parallel FIR filter structure is shown in Figure 6.




Figure 6. Reduced-Complexity 6-Parallel FIR Filter

4: Quantization Process
In order to physically implement a filter in hardware, the filter coefficients must first be quantized
using a power-of-two representation of a given word length K (excluding the sign bit). It is assumed
that all filter coefficients are constrained to [−1, 1). Traditionally, the quantization process employs
a two's complement representation in conjunction with a truncation scheme or a rounding scheme.
However, [20, 7, 2] show that if the filter coefficients are first scaled before the quantization process
is performed, the resulting filter will have much better frequency-space characteristics. By using
the appropriate scale factor, the filter coefficients collectively settle into the optimal power-of-two
quantization space. The NUS [7] and INUS [2] algorithms both employ a scalable quantization
process and produce excellent results in terms of the frequency-space characteristics of the quantized
filters. To begin the process, the ideal filter is normalized so that the largest coefficient has an
absolute value of 1 and the quantized filter is initialized with zeros. The normalized ideal filter is
then multiplied by a variable scale factor (VSF). The VSF steps through the range of numbers from
.4375 to 1.13 in small power-of-two steps. Signed power-of-two (SPT) terms are then allocated to the


quantized filter coefficient that represents the largest absolute difference between the scaled ideal
filter and the quantized filter (see Example 1). The allocation of SPT terms stops when the absolute
difference between all of the scaled ideal coefficients and the quantized coefficients is less than $2^{-K}$.
Once the allocation of terms has stopped, the normalized peak ripple (NPR) is calculated. The
process is then repeated for a new scale factor. The quantized filter leading to the minimum NPR is
chosen. The INUS algorithm slightly modifies the term allocation process by using some searching
techniques. The NPR of the filters is slightly improved, but at the expense of computation time.
When the quantized filter is implemented, a post-processing scale factor (PPSF) is used to properly
rescale the magnitude of the resulting data stream. Essentially, the PPSF reverses the normalization
and scaling introduced in the quantization process. While the scaling process changes the magnitude
of the filter response, it should be noted that it does not change the functionality of the filter.

Example 1: SPT Term Allocation Example

The word length, K, equals 2.

Ideal Filter (IF) = [.26 .131], Initial Quantized Filter (QF) = [0 0].

Normalize the Ideal Filter, IF = [1 .5038].

Scale with a variable scale factor of 1/2, IF = [.5 .2519].

Iteration One, QF = [.5 0].

Iteration Two, QF = [.5 .25].

Difference between IF and QF is less than 1/4, stop SPT allocation.
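The allocation loop of Example 1 is easy to mechanize. The sketch below is a simplified model (the helper name and the rule "add the largest power of two not exceeding the error" are our own assumptions about the allocation step, chosen so the iterations of Example 1 are reproduced):

```python
import math

def allocate_spt(ideal, K):
    """Greedy SPT allocation (a simplified model of the process in Example 1).

    Repeatedly adds one signed power-of-two term to the quantized coefficient
    with the largest absolute error, stopping once every error is below 2**-K.
    """
    qf = [0.0] * len(ideal)
    while True:
        errs = [i - q for i, q in zip(ideal, qf)]
        worst = max(range(len(errs)), key=lambda n: abs(errs[n]))
        if abs(errs[worst]) < 2 ** -K:
            return qf
        # assumed rule: largest power of two not exceeding the error magnitude
        step = 2.0 ** math.floor(math.log2(abs(errs[worst])))
        qf[worst] += math.copysign(step, errs[worst])

# Example 1: K = 2, scaled ideal filter [.5, .2519]
print(allocate_spt([0.5, 0.2519], K=2))   # -> [0.5, 0.25]
```

Iteration one allocates 2⁻¹ to the first coefficient, iteration two allocates 2⁻² to the second, and the loop stops because both errors are then below 2⁻² = 1/4, exactly as in Example 1.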

The quantization process that we use is similar to the one described above. However, we modify
the quantized filter selection criteria and the scaling process. In our quantization process, we do not
use the NPR as the selection criteria for the best quantized filter. In order to calculate the NPR, it
is assumed that the shape of the filter in the frequency-space domain is known. In other words, the
location of the passband and stopband is known prior to the quantization step so that the NPR in the
passband and the stopband can be calculated after the SPT terms are allocated. In our case, it is the
set of filters that result from the application of the FFAs, and not the original filter, which requires
quantization. For example, the reduced-complexity 6-parallel FIR filtering structure contains 18
filters which must each have their coefficients quantized. Since the set of filters which result from
the FFAs do not have well-defined passbands and stopbands, it is not possible to know the location
of the passbands and stopbands of these filters prior to the quantization step. Therefore, the NPR
cannot be used as a selection criterion for choosing the best quantized filter. The selection criterion
used is based upon calculating the maximum absolute difference (MAD) between the ideal and
quantized filter in the frequency-space domain. Since there is no well-defined stopband or passband,
the MAD is calculated over the entire range of frequencies. The quantized filter producing the
minimum MAD is chosen. A slight improvement in the frequency-space characteristics (1 or 2 dB in
the stopband) may be gained by employing searching techniques similar to those used in the INUS
algorithm. However, we do not believe that the improvement is enough to justify the increase in
the computation time of our quantization algorithm.
In addition to modifying the quantized filter selection criteria, we modify the scaling process. As
stated above, every quantized filter must have a PPSF to properly adjust the magnitude of the filter
output. The PPSF is calculated as follows:

PPSF = max(abs[Ideal Filter Coefficients])/VSF.   (19)

In cases where large levels of parallelism are used, the PPSFs can contribute to a significant amount
of hardware overhead (the reduced-complexity 6-parallel FIR filter requires 18 PPSFs). In an effort
to reduce this hardware overhead, we restrict the PPSFs to the following set of values:

New PPSFs = [.125, .25, .375, .5, .625, .75, .875, 1].   (20)

The scale factors in this set, when implemented with shifts and additions alone, require very few
operations. For example, the scale factor .5 can be implemented as a single shift-to-the-right by one
bit position. The original PPSF is replaced with the new PPSF that is nearest in value. Since the
scale factor of the quantized filter is shifted in value, the quantized filter coefficients must also be
properly shifted in value. This is accomplished using the following steps:

Step 1: Determine Effective Value of the Filtering Coefficients

Effective Coefficients = [Quantized Filter Coefficients] × PPSF   (21)

Step 2: Determine Quantized Filter Coefficients With the New PPSF

Shifted Quantized Filter Coefficients = [Effective Coefficients]/(New PPSF).   (22)

Once the quantized filter coefficients are properly shifted in value, the quantized coefficients are
quantized to their final power-of-two representation.
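Steps 1 and 2, together with the PPSF snapping, can be sketched as follows (helper names are ours; the example PPSF of 0.8 is an assumed value for illustration only):

```python
def snap_ppsf(ppsf, allowed=(0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0)):
    """Replace a PPSF with the nearest value from the restricted set of eq. (20)."""
    return min(allowed, key=lambda v: abs(v - ppsf))

def shift_coefficients(quantized, ppsf):
    """Re-express quantized coefficients under the snapped PPSF (eqs. (21)-(22))."""
    new_ppsf = snap_ppsf(ppsf)
    effective = [c * ppsf for c in quantized]       # Step 1: effective values
    shifted = [e / new_ppsf for e in effective]     # Step 2: rescale to new PPSF
    return new_ppsf, shifted  # `shifted` is then re-quantized to powers of two

# Illustrative values (assumed): an original PPSF of 0.8 snaps to 0.75
print(snap_ppsf(0.8))   # -> 0.75
```

Note that the product (shifted coefficient) × (new PPSF) equals (quantized coefficient) × (original PPSF), so the effective coefficient values, and hence the filter response, are preserved before the final re-quantization.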

5: Area Reduction Techniques


5.1: Multiplierless Filter Implementation
In an effort to reduce the hardware costs below what is achieved through the application of the
fast FIR algorithms, several area reduction techniques are used. The first area reduction technique that
we use involves implementing the parallel filters in a multiplierless fashion. It is widely known that
multiplication by a constant multiple can be realized using only shifts and additions [11, 9, 20, 10].
For example, Y multiplied by X = 0.1110 can be implemented as Y >> 1 + Y >> 2 + Y >> 3, where
>> denotes a shift to the right. By using a dedicated shift-and-add implementation rather than
a general purpose multiplier for the constant multiple, the hardware cost is significantly reduced.
A general purpose multiplier assumes that all of the bits could be active during a multiplication
operation. In most cases, however, the constant multiplier does not have all of its bits active, which
implies that some of the hardware in the general multiplier is not necessary. Since the binary
representation of the filter coefficients is known prior to implementation, we know exactly which
bits of the coefficient will be active during a multiplication operation. Therefore, we can implement
the filter coefficients using exactly the required amount of hardware (shifts and additions) for that
particular filter coefficient. Since all of the filter coefficients are implemented using shifts and
additions alone, the entire parallel filter structure can be implemented using only shifts, additions
and delays (registers). In addition to making the implementation significantly smaller, replacing
general purpose multipliers with dedicated shift-and-add multipliers allows the implemented parallel
filtering circuit to operate at higher clock rates.
5.2: Canonic Signed Digits

The second area reduction technique that is used attempts to reduce the number of 1s required
in a coefficient's power-of-two representation. Using a canonic signed digit (CSD) representation,
coefficients can be represented using the fewest number of non-zero bits [19]. The CSD representation
is a signed power-of-two representation where each of the bits is in the set {0, 1, 1̄} (1̄ represents the
value −1) and no two consecutive bit positions are non-zero. Using a CSD representation for each
coefficient implies that the coefficient can be implemented in a shift-and-add fashion using the fewest
number of shift and add operations. In the previous example, X = 0.1110. Converting this to its
CSD representation, X = 1.001̄0. Therefore, Y multiplied by X can be implemented as Y − Y >> 3.
The CSD representation requires 1 add and 1 shift compared to the 2 adds and 3 shifts required by
the two's complement representation. On the average, a CSD representation contains approximately
33% fewer non-zero bits than its two's complement counterpart. This, in turn, implies a hardware
savings of about 33% per coefficient.
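The recoding itself is a standard digit-by-digit scan. The sketch below is our own integer-based model: a coefficient with K fractional bits is treated as an integer, so 0.1110 becomes 14, whose CSD digits give 1.001̄0:

```python
def to_csd(n):
    """Canonic signed-digit recoding of a non-negative integer.

    Returns digits in {-1, 0, 1}, least-significant first, such that
    sum(d * 2**i) == n and no two adjacent digits are non-zero."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)        # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# 0.1110 with 4 fractional bits is the integer 14; CSD gives digits 1 0 0 -1 0,
# i.e. 1.001̄0, so Y*X = Y - (Y >> 3): two non-zero digits instead of three.
csd = to_csd(14)
assert sum(d * 2 ** i for i, d in enumerate(csd)) == 14
assert sum(1 for d in csd if d) == 2
```

The non-zero digit count returned here is exactly the number of shift-and-add (or shift-and-subtract) terms a multiplierless implementation of that coefficient needs.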


5.3: Substructure Sharing


The final area reduction technique that we use is sub-structure sharing. Substructure sharing is
a hardware reduction technique that is performed on each of the FIR filter sections that are found
in the parallel filtering structure. Essentially, sub-structure sharing is the process of examining the
hardware implementation of the filter coefficients and sharing the hardware units that are common
among the filter Coefficients. It should be noted that the FIR filter sections must be implemented
using a transposed direct-form structure in order to perform sub-structure sharing (see Figure
7). Substructure sharing can be realized using a wide variety of techniques [l, 3, 181. The two
techniques that we use are sub-expression sharing and adjacent coeficient sharing. Sub-expression
sharing, presented in [3], can be viewed as a two step process. In the first step, the three most
common CSD sub-expressions, l O l , i O l , and 1, are generated. The filter coefficients are then "built"
from these sub-expressions using the proper combination of shifts and additions. Figure 8 shows
a simple example that uses the sub-expression sharing technique. In this example, a 2-tap filter
with coefficients CO= 0.010~0000and Ct = 0.10100100 is implemented. In order to generate the
pieces 101 and TOl, an overhead of 2 adders is needed. In the example, 4 adders are required if subexpression sharing is used. If these two coefficients were implemented using a CSD representation
and no sharing, 4 adders are again required. In this particular example, there is no savings due t o
the 2 extra adders required t o generate the sub-expressions. However, consider another 2-tap filter
with the coefficients CO= 0.iOiOlOiO and Cl = 0.10101010. Using sub-expression sharing, these
two coefficients can be implemented using 5 adders compared to the 7 that are required if no sharing
is used.
Figure 7. General Transposed Direct-Form FIR Filtering Structure

Figure 8. 2-Tap FIR Filter Using Sub-Expression Sharing


In addition to sub-expression sharing, we also consider a second sub-structure sharing technique, referred to as adjacent coefficient sharing, which groups the filter coefficients into adjacent pairs and determines the hardware units that can be shared among the coefficients. Adjacent coefficient sharing improves upon sub-expression sharing in some cases due to the 2 extra adders that are required to perform sub-expression sharing. The pair of filter coefficients is examined to determine the bit positions where the two coefficients have equivalent bit values. The bit positions that match can then be factored out into a third "shared" coefficient. The filter input is multiplied by this shared coefficient and the output is used by each coefficient in the pair to complete their respective multiplication operations. The following 2-tap FIR filter example helps to clarify the adjacent coefficient sharing technique.


Example 2 Adjacent Coefficient Sharing Example

CSD representation of Coefficient Pair: C0 = 1.0001̄01̄, C1 = 0.0001̄01̄.

Filter Output = C0 * Input + z^-1 * C1 * Input, Requiring 4 Adders.

Shared Coefficient (SC) = 0.0001̄01̄, C0' = 1.000000, C1' = 0.000000.

Filter Output = [Input + SC * Input] + z^-1 * SC * Input, Requiring 3 Adders.
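The factoring step of Example 2 can be sketched in code as follows (our own helper names; digits are written most-significant first, with -1 standing for the barred CSD digit):

```python
# Illustrative sketch of adjacent coefficient sharing: positions at which
# the two CSD digit vectors agree are factored into a shared coefficient.

def share_adjacent(c0, c1):
    """Split equal-length CSD digit lists c0, c1 (digits in {-1, 0, 1})
    into (shared, r0, r1) with c0 = shared + r0 and c1 = shared + r1."""
    shared = [a if a == b else 0 for a, b in zip(c0, c1)]
    r0 = [a - s for a, s in zip(c0, shared)]
    r1 = [b - s for b, s in zip(c1, shared)]
    return shared, r0, r1

# Example 2's pair, C0 = 1.000(-1)0(-1) and C1 = 0.000(-1)0(-1):
c0 = [1, 0, 0, 0, -1, 0, -1]
c1 = [0, 0, 0, 0, -1, 0, -1]
shared, r0, r1 = share_adjacent(c0, c1)
print(shared)  # -> [0, 0, 0, 0, -1, 0, -1]  (the SC of the example)
print(r0)      # -> [1, 0, 0, 0, 0, 0, 0]
print(r1)      # -> [0, 0, 0, 0, 0, 0, 0]
```

With this factoring, SC * Input costs one adder (two non-zero digits), adding Input back in for C0 costs a second, and combining the two taps costs a third, reproducing the 3-adder count of the example.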

Figure 9 shows the filter structure which results from adjacent coefficient sharing. It should be noted that if the filter is of odd length, the Nth filter coefficient remains in its CSD representation. One of the advantages of adjacent coefficient sharing is that it does not significantly increase the routing required in the implementation. It is important to minimize the routing in any circuit, as the routing capacitance plays a role in the power consumed by the circuit.
Figure 9. 2-Tap FIR Filter Using Adjacent Coefficient Sharing


Once each of the FIR filter sections has had its coefficients quantized and converted to a binary CSD representation, both sub-expression sharing and adjacent coefficient sharing are applied to the filter to determine which sub-structure sharing method leads to the lowest implementation cost. In addition, we apply a third sub-structure sharing technique to the filters. This third method is a combination of the sub-expression and adjacent coefficient sharing methods. Adjacent coefficient sharing is first applied to the filter to determine the shared coefficients that result from each pair of filter coefficients. Sub-expression sharing is then applied to the shared coefficients to further reduce the number of required operations. Figure 10 shows a 2-tap filter with coefficients C0 = 1.01̄010101̄ and C1 = 1.01̄010100 that is implemented using this third sub-structure sharing technique. Each of these three sub-structure sharing methods is applied to each of the quantized FIR filters, and the method that leads to the least number of adders is chosen. In other words, each FIR filter section will use one of the three sub-structure sharing techniques depending on which method is best suited for that particular filter section.
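The per-section selection amounts to costing each candidate and keeping the cheapest; a minimal sketch (the adder counts below are made-up placeholders, not figures from the tables):

```python
# Illustrative sketch: each FIR filter section independently picks whichever
# sharing strategy needs the fewest adders for its particular coefficients.

def best_method(adder_costs):
    """adder_costs: dict mapping method name -> adder count; return the
    method with the minimum count."""
    return min(adder_costs, key=adder_costs.get)

# Hypothetical costs for one filter section:
section = {"sub-expression": 6, "adjacent": 5, "combined": 4}
print(best_method(section))  # -> combined
```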
Figure 10. 2-Tap FIR Filter Using Combined Sharing


6: Experimental Results
In this section, our procedure for generating parallel filtering structures is compared to the traditional method of generating parallel filtering structures. We consider 2-, 3-, 4-, 6-, 8- and 12-parallel filters. The traditional method consists of standard hardware duplication and a CSD quantization of the filter coefficients using truncation. Our method consists of the procedures outlined in Sections 3, 4 and 5. The number of adders required for implementation and the frequency response of both structures are compared. We present our results using both the ideal PPSFs and the modified PPSFs in order to show the advantage of modifying the PPSFs. In both the traditional method and our method, the filter coefficients are quantized using a word length of K = 9. The first filter used for comparison is a 24-tap low-pass FIR filter designed using the Remez exchange algorithm. The ideal filter has normalized passband and stopband edge frequencies of .15 and .25, respectively (normalized to .5). The ideal filter is an equiripple filter with passband and stopband ripples of .0049 (46.1938 dB). Quantizing this filter using a CSD coding requires 37 adders. The quantized filter has a passband ripple (PBR) of 0.0151 and a stopband ripple (SBR) of 0.0098 (40.1436 dB). The number of adders required in the traditional parallel filtering structure is L * 37, where L is the block size or the level of parallelism. Table 1 shows the results of our procedure for various levels of parallelism using the ideal PPSFs. The first column defines the type of fast algorithm used. For example, 2,3 denotes a (2-by-2) fast algorithm cascaded with a (3-by-3) algorithm. The first row shows the results if the original filter is not decomposed using a fast algorithm. Table 2 shows the results using the modified PPSFs. Figure 11 shows the frequency responses of the original (ideal) filter, the 2-parallel filter after standard quantization, and the 2-parallel filter after our quantization process using the ideal PPSFs (Figure 12 shows the same plot using the modified PPSFs). Table 3 shows a count of the total number of adders required by both the traditional method and our method to implement an L-parallel filter for various values of L. As is clearly shown, our method is capable of producing parallel FIR filters with a significant reduction in the hardware cost compared to traditional implementation styles. Additionally, it should be noted that our method improves upon the frequency response produced by standard quantization and truncation.

In order to verify our approach, a second example filter is used. This filter is a 72-tap low-pass FIR filter designed using the Remez exchange algorithm. The ideal filter has normalized passband and stopband edge frequencies of .20 and .225, respectively. The ideal filter is an equiripple filter with passband and stopband ripples of .0144 (36.8423 dB). Quantizing this filter using CSD coding requires 111 adders. The quantized filter has a PBR of 0.0406 and an SBR of 0.0274 (31.2553 dB). The number of adders required in the traditional parallel filtering structure is L * 111. Tables 4, 5 and 6 show the results of our procedure for various levels of parallelism. Figures 13 and 14 show a comparison of the frequency responses for the 2-parallel and the 4-parallel filtering structures, respectively, with modified PPSFs. The tables again show that our method is able to significantly reduce the hardware costs while maintaining excellent filter performance. It should be noted that in both examples, the use of the modified PPSFs reduces the hardware costs with a minimal amount of degradation to the frequency-space performance of the filter. Although both of the example filters are designed using the Remez exchange algorithm, any type of filter design algorithm can be used in conjunction with this procedure. The only information required by our method is the set of ideal filter coefficients.

7: Conclusion
In this paper, we have developed a method for designing reduced complexity, parallel FIR filtering
structures. By applying FFAs, a novel quantization process and various area reduction techniques,
the parallel filtering structures that are produced require up to 45% less hardware than is required
by traditional parallel filtering methods for the given examples. Not only is this method suited
for custom and standard cell VLSI implementations, but it is also suited for implementations in
software and programmable DSP chips. In a software or programmable DSP implementation, the


Authorized licensed use limited to: IEEE Xplore. Downloaded on January 26, 2009 at 11:23 from IEEE Xplore. Restrictions apply.

Table 1. Summary of Experimental Results: 24-Tap LPF, Ideal PPSFs

FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,

SSS = Adds required after sub-structure sharing, SA = Adds required to implement the PPSFs,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple

Table 2. Summary of Experimental Results: 24-Tap LPF, Modified PPSFs

FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,

SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple

Table 3. Hardware Costs, 24-tap Low-Pass FIR Filter

L = Blocksize, SAC = Adders required using standard method and CSD coding,
OAI = Adders required using our method, ideal PPSFs,
OAM = Adders required using our method, modified PPSFs



Table 4. Summary of Experimental Results: 72-Tap LPF, Ideal PPSFs

FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple

Table 5. Summary of Experimental Results: 72-Tap LPF, Modified PPSFs

FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple

Table 6. Hardware Costs, 72-tap Low-Pass FIR Filter

L = Blocksize, SAC = Adders required using standard method and CSD coding,
OAI = Adders required using our method, ideal PPSFs,
OAM = Adders required using our method, modified PPSFs



filtering algorithm produced by our method would take significantly less computation time due to
the reduction in the number of operations that must be performed.
Based upon our results, high levels of parallelism can be achieved without incurring an overwhelming hardware overhead. In situations where a large throughput is required, but the design
area is constrained, our method would be ideally suited. We believe that our method of generating
parallel filtering structures allows the consideration of parallel processing even in situations where
the design area is extremely limited.

References
[1] Abhijit Chatterjee, Rabindra K. Roy, and Manuel A. d'Abreu. Greedy hardware optimization for linear digital circuits using number splitting and refactorization. IEEE Transactions on VLSI Systems, 1(4):423-431, December 1993.
[2] Chao-Liang Chen, Kei-Yong Khoo, and Alan N. Willson Jr. An improved polynomial-time algorithm for designing digital filters with power-of-two coefficients. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 84-87, Seattle, WA, May 1995.
[3] Richard I. Hartley and Keshab K. Parhi. Digit-Serial Computation. Kluwer Academic Publishers, Norwell, MA, 1995.
[4] R. A. Hawley, B. C. Wong, T. Lin, J. Laskowski, and H. Samueli. Design techniques for silicon compiler implementations of high-speed FIR digital filters. IEEE Journal of Solid-State Circuits, pages 656-667, May 1996.
[5] R. Jain, P. T. Yang, and T. Yoshino. FIRGEN: A computer-aided design system for high performance FIR filter integrated circuits. IEEE Transactions on Signal Processing, 39(7):1655-1667, 1991.
[6] H. K. Kwan and M. T. Tsim. High speed 1-D FIR digital filtering architectures using polynomial convolution. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1863-1866, Dallas, TX, April 1987.
[7] Dongning Li, Jianjian Song, and Yong Ching Lim. A polynomial-time algorithm for designing digital filters with power-of-two coefficients. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 84-87, May 1993.
[8] Y. C. Lim. Design of discrete-coefficient-value linear phase FIR filters with optimum normalized peak ripple magnitude. IEEE Transactions on Circuits and Systems, 37(12):1480-1486, December 1990.
[9] Y. C. Lim and A. G. Constantinides. Linear phase FIR digital filter without multipliers. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 185-188, Tokyo, Japan, July 1983.
[10] Y. C. Lim and B. Liu. Design of cascade form FIR filters with discrete valued coefficients. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 1735-1739, November 1988.
[11] Y. C. Lim and S. R. Parker. FIR filter design over a discrete powers-of-two coefficient space. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 583-591, June 1983.
[12] Zhi-Jian Mou and Pierre Duhamel. Fast FIR filtering: Algorithms and implementations. Signal Processing, 13(4):377-384, December 1987.
[13] Zhi-Jian Mou and Pierre Duhamel. A unified approach to the fast FIR filtering algorithms. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1914-1917, New York, NY, April 1988.
[14] Zhi-Jian Mou and Pierre Duhamel. Short-length FIR filters and their use in fast nonrecursive filtering. IEEE Transactions on Signal Processing, 39(6):1322-1332, June 1991.
[15] Keshab K. Parhi. Algorithms and architectures for high-speed and low-power digital signal processing. In Proceedings of 4th International Conference on Advances in Communications and Control, pages 259-270, Rhodes, Greece, June 1993.
[16] Keshab K. Parhi. Trading off concurrency for low-power in linear and non-linear computations. In Proceedings of the IEEE Workshop on Nonlinear Signal Processing, pages 895-898, Halkidiki, Greece, June 1995.
[17] D. N. Pearson and K. K. Parhi. Low-power FIR digital filter architectures. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 231-234, Seattle, WA, May 1995.
[18] M. Potkonjak, M. B. Srivastava, and A. Chandrakasan. Efficient substitution of multiple constant multiplications by shifts and additions using iterative pairwise matching. In DAC-94, Proceedings of the 31st ACM/IEEE Design Automation Conference, pages 189-194, 1994.
[19] G. W. Reitwiesner. Binary arithmetic. In Advances in Computers, volume 1, pages 231-308. Academic, 1966.
[20] H. Samueli. An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients. IEEE Transactions on Circuits and Systems, 36(7):1044-1047, July 1989.
[21] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, Englewood Cliffs, NJ, 1993.



[22] S. Winograd. Arithmetic complexity of computations. In CBMS-NSF Regional Conference Series in Applied Mathematics, number 33. SIAM Publications, 1980.
[23] A. Y. Wu, K. J. Liu, Z. Zhang, K. Nakajima, A. Raghupathy, and S. C. Liu. Algorithm-based low-power DSP system design: Methodology and verification. In IEEE Signal Processing Society Workshop on VLSI Signal Processing, pages 277-286, Sakai, Japan, October 1995.
[24] A. Zergainoh and P. Duhamel. Implementation and performance of composite fast FIR filtering algorithms. In IEEE Signal Processing Society Workshop on VLSI Signal Processing, pages 267-276, Sakai, Japan, October 1995.
Figure 11. Frequency Responses, 24-Tap LPF, Ideal PPSFs, L = 2 (original, standard quantization (dash) and our quantization (dash-dot) filter responses vs. normalized frequency)

Figure 12. Frequency Responses, 24-Tap LPF, Modified PPSFs, L = 2



Figure 13. Frequency Responses, 72-Tap LPF, Modified PPSFs, L = 2 (original, standard quantization (dash) and our quantization (dash-dot) filter responses vs. normalized frequency)

Figure 14. Frequency Responses, 72-Tap LPF, Modified PPSFs, L = 4


