E-Mail: {dparker,parhi}@ee.umn.edu
Abstract
This paper presents a novel approach for implementing area-efficient parallel (block) finite impulse
response (FIR) filters that require less hardware than traditional block FIR filter implementations.
Parallel processing is a powerful technique because it can be used to increase the throughput of a
FIR filter or reduce the power consumption of a FIR filter. However, a traditional block filter
implementation causes a linear increase in the hardware cost (area) by a factor of L, the block
size. In many design situations, this large hardware penalty cannot be tolerated. Therefore, it is
advantageous to produce parallel FIR filter implementations that require less area than traditional
block FIR filtering structures. In this paper, we propose a method to design parallel FIR filter
structures that require a less-than-linear increase in the hardware cost. A novel adjacent coefficient
sharing based sub-structure sharing technique is introduced and used to reduce the hardware cost of
parallel FIR filters. A novel coefficient quantization technique, referred to as a maximum absolute
difference (MAD) quantization process, is introduced and used to produce quantized filters with good
spectrum characteristics. By using a combination of fast FIR filtering algorithms, a novel coefficient
quantization process and area reduction techniques, we show that parallel FIR filtering structures with
up to a 45% reduction in hardware can be achieved for the given examples.
1: Introduction
The finite-impulse response (FIR) filter has been and continues to be one of the fundamental
processing elements in any digital signal processing (DSP) system. FIR filters are used in DSP
applications that range from video and image processing to wireless communications. In some a p
plications, such as video processing, the FIR filter circuit must be able to operate at high frequencies,
while in other applications, such as cellular telephony, the FIR filter circuit must be a low-power
circuit, capable of operating at moderate frequencies. Parallel, or block, processing can be applied
to digital FIR filters to either increase the effective throughput or reduce the power consumption
of the original filter. Traditionally, the application of parallel processing to an FIR filter involves
the replication of the hardware units that exist in the original filter. If the area required by the
original circuit is A, then the L-parallel circuit requires an area of L x A. With the continuing trend
to reduce chip size and integrate multi-chip solutions into a single chip solution, it is important to
limit the silicon area required to implement a parallel FIR digital filter in a VLSI implementation.
In many design situations, the hardware overhead that is incurred by parallel processing cannot be
tolerated due to limitations in design area. Therefore, it is advantageous to realize parallel FIR
filtering structures that consume less area than traditional parallel FIR filtering structures.
While FIR filters have been given extensive consideration in the existing literature, most of
this work focuses on only a single aspect such as filter design (coefficient generation) [8] or filter
This research has been supported by the Army Research Office under contract number DAIDAAHOI93-G-0318.
Authorized licensed use limited to: IEEE Xplore. Downloaded on January 26, 2009 at 11:23 from IEEE Xplore. Restrictions apply.
coefficient quantization [20] or generation of filter circuits through compilation [5, 4]. Furthermore,
very little work has been done that deals directly with reducing the hardware complexity of parallel
FIR filters [17]. In order to design area-efficient parallel filters, we must consider the entire design
process. Therefore, our goal is to take a comprehensive look at all of the aspects from filter design
to implementation to produce low-area parallel FIR filter structures. In this paper, we propose a
methodology by which low-area parallel FIR filter structures can be produced. This method can
be broken down into four basic steps: 1) designing a FIR filter from spectrum requirements, 2)
generating a parallel architecture from the original FIR filter, 3) quantizing the filter coefficients
and 4) applying substructure sharing and other area reduction techniques to reduce the hardware
cost. Step 1 consists of nothing more than choosing a FIR filter design algorithm from the many
in existence. This gives our method added flexibility because it is not optimized to work with a
specific filter design algorithm. Since Step 1 is well understood, it is given no further consideration.
In the remainder of this paper, Steps 2 - 4 of our design method are discussed in detail. In addition,
we apply this method to some specific FIR filter examples.
where k is a process dependent parameter and V_t is the device threshold voltage. It should be noted
that the clock period, T_0, is typically set equal to the maximum propagation delay, T_pd, in a circuit.
The propagation delay of the L-parallel filter is given by
It should be noted that the supply voltage cannot be lowered indefinitely by increasing the level of
parallelism in a filter. There is a lower bound on the supply voltage which is dictated by the process
parameters.
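The supporting equations in this section did not survive in this copy. As context only, and as an assumption on our part rather than text recovered from the source, the relationship the surrounding prose describes is conventionally written with the standard CMOS delay model used in this literature:

```latex
% Assumed standard CMOS propagation-delay model (not recovered from the
% source): C_charge is the capacitance charged in one clock period,
% V_0 the supply voltage, V_t the threshold voltage, k a process constant.
T_{pd} = \frac{C_{charge}\, V_0}{k\,(V_0 - V_t)^2}
```

Under this model, an L-parallel filter has L clock periods available to charge the same capacitance, so the supply voltage can be scaled down, reducing power roughly quadratically, until the process-dictated lower bound mentioned above is reached.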
Figure 1. (a) Charging/discharging of the entire capacitance in clock period T. (b) Charging/discharging of the entire capacitance in clock period 3T using a 3-parallel filter.
Y = Σ_{k=0}^{L-1} z^{-k} Y_k = ( Σ_{i=0}^{L-1} z^{-i} X_i ) ( Σ_{j=0}^{L-1} z^{-j} H_j ),    (5), (6)

where X_i, H_j and Y_k are the polyphase components of the filter input, the filter coefficients and the filter output, respectively.
It is clear that the parallel filter requires L^2 filtering operations of length-N/L. This is in agreement
with the fact that the complexity of a traditional block filter increases linearly with the block size or
the number of samples processed in parallel in a clock cycle. However, looking at (5), it is possible
to reduce the number of product terms on the RHS to something less than L^2. Since the work of
Winograd [22], it is known that two polynomials of degree L - 1 can be multiplied using only (2L - 1)
product terms. This reduction in the number of multipliers comes at the expense of additional
adders. Replacing multipliers with add operations is advantageous because adders have a smaller
implementational cost than multipliers in terms of silicon area. For large values of L, however, the
number of adders becomes unmanageable. In this case, polynomial product algorithms using slightly
more than (2L - 1) product terms are employed. These suboptimal algorithms achieve a balance
between the number of multiplications and additions required to perform the polynomial product.
Since the product terms in the polynomial formulation of the parallel FIR filter are equivalent to
filtering operations, this implies that the parallel FIR filter can be realized using approximately
(2L - 1) FIR filters of length-N/L.
A relatively new class of algorithms, termed fast FIR algorithms (FFAs) [14, 13, 12, 6, 24, 23],
rely upon this approach to produce reduced complexity parallel filtering structures. Using this
approach, the L-parallel filter can be implemented using approximately (2L - 1) filtering operations
of length-(N/L). The resulting parallel filtering structure would require (2N - N/L) multipliers.
As an example, let N = 4 and let L = 2. The traditional 2-parallel approach would require 8
multiplications while the 2-parallel fast filtering approach would require only 6 multipliers. For large
values of N, the advantage of the FFAs is clear.
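These multiplier counts are easy to tabulate. The following sketch (our own illustration with hypothetical helper names, not code from the paper) reproduces the N = 4, L = 2 example:

```python
def traditional_mults(N, L):
    # Traditional block filtering replicates the N-tap filter L times.
    return L * N

def ffa_mults(N, L):
    # A fast FIR algorithm uses roughly (2L - 1) subfilters of length N/L,
    # giving (2L - 1) * (N / L) = 2N - N/L multipliers.
    assert N % L == 0, "filter length assumed divisible by block size"
    return (2 * L - 1) * (N // L)

print(traditional_mults(4, 2))  # 8 multiplications (traditional 2-parallel)
print(ffa_mults(4, 2))          # 6 multiplications (2-parallel fast filtering)
```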
In the general case, a (n-by-n) FFA produces a FIR filtering structure that is the functional
equivalent of a parallel FIR filter of block size n. The application of a (n-by-n) FFA produces a
set of filters each of which are length N/n, where N is the length of the original FIR filter. The set
of filters that are produced by an (n-by-n) FFA will consist of the n filters, H_0, H_1, ..., H_{n-1}, that
are produced by taking the polyphase decomposition of the original filter with decomposition factor
n, plus the filters that result from taking the additive combinations of these n filters. The proper
filter transfer function is realized with the addition of some pre- and post-processing steps that are
performed in conjunction with the filtering operations. This will become clear when specific fast FIR
filtering algorithms are presented in the subsequent sections. In this work, we focus our attention
on the (2-by-2) and (3-by-3) FFAs. Larger FFAs (e.g., (5-by-5)) do exist and are discussed in detail
in [14].
3.2: (2-by-2) Fast FIR Algorithm
Now that the necessary background material has been covered, we will focus our attention upon
specific fast FIR filtering algorithms. In this section, the (2-by-2) FFA is covered. The (2-by-2)
FFA results in a 2-parallel filtering structure. Let us begin by considering the standard 2-parallel
filtering structure. From (6), we have

Y = Y_0 + z^{-1}Y_1 = (X_0 + z^{-1}X_1)(H_0 + z^{-1}H_1)
  = X_0H_0 + z^{-1}(X_0H_1 + X_1H_0) + z^{-2}X_1H_1.    (7)
Figure 2 shows the resulting 2-parallel FIR filtering structure. This structure requires 2N multipliers
and 2(N - 1) additions.
If (7) is written in a different, yet equivalent, form, the (2-by-2) FFA is obtained:

Y = Y_0 + z^{-1}Y_1 = (X_0 + z^{-1}X_1)(H_0 + z^{-1}H_1)
  = X_0H_0 + z^{-1}(X_0H_1 + X_1H_0) + z^{-2}X_1H_1
  = X_0H_0 + z^{-1}[(X_0 + X_1)(H_0 + H_1) - X_0H_0 - X_1H_1] + z^{-2}X_1H_1.    (8)
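Equation (8) can be checked behaviorally. The sketch below (our own code with our own helper names, a behavioral check rather than the paper's hardware structure) computes the three subfilter products of (8), applies the subrate delay to the X_1H_1 term, and interleaves the two output phases; the result matches ordinary linear convolution:

```python
def conv(a, b):
    # Plain linear convolution (polynomial multiplication).
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def fir2parallel_ffa(x, h):
    """2-parallel fast FIR per (8): three length-(N/2) subfilterings
    instead of four. Assumes even-length x and h (zero-pad otherwise)."""
    x0, x1 = x[0::2], x[1::2]          # even/odd input phases
    h0, h1 = h[0::2], h[1::2]          # even/odd coefficient phases
    p0 = conv(x0, h0)                                  # X0 H0
    p1 = conv([a + b for a, b in zip(x0, x1)],
              [a + b for a, b in zip(h0, h1)])         # (X0+X1)(H0+H1)
    p2 = conv(x1, h1)                                  # X1 H1
    n = len(p0)
    y0 = [0.0] * (n + 1)
    for i in range(n):
        y0[i] += p0[i]                  # Y0 = X0 H0 ...
        y0[i + 1] += p2[i]              #    ... + z^-1 X1 H1 (subrate delay)
    y1 = [p1[i] - p0[i] - p2[i] for i in range(n)]     # middle term of (8)
    y = [0.0] * (2 * n + 1)
    y[0::2], y[1::2] = y0, y1           # interleave the two output phases
    return y
```

For example, `fir2parallel_ffa([1, 2, 3, 4], [1, 2])` returns the same sequence as `conv([1, 2, 3, 4], [1, 2])`, while using 3 subfilter products instead of 4.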
Y_0 = H_0X_0 + z^{-3}(H_1X_2 + H_2X_1)
Y_1 = H_0X_1 + H_1X_0 + z^{-3}H_2X_2
Y_2 = H_0X_2 + H_1X_1 + H_2X_0.    (9)
The traditional 3-parallel filtering structure requires 3N multiplications and 3(N - 1) additions.
By manipulating (9) through a series of steps, the number of filtering operations can be reduced
which in turn reduces the total number of multipliers required to realize the 3-parallel filtering
structure. The (3-by-3) FFA is obtained by several applications of the same technique that was
used to produce the (2-by-2) FFA. From (6), we have

Y = Y_0 + z^{-1}Y_1 + z^{-2}Y_2,

where

Y_0 = H_0X_0 - z^{-3}H_2X_2 + z^{-3}[(H_1 + H_2)(X_1 + X_2) - H_1X_1]
Y_1 = [(H_0 + H_1)(X_0 + X_1) - H_1X_1] - [H_0X_0 - z^{-3}H_2X_2]
Y_2 = [(H_0 + H_1 + H_2)(X_0 + X_1 + X_2)] - [(H_0 + H_1)(X_0 + X_1) - H_1X_1]
      - [(H_1 + H_2)(X_1 + X_2) - H_1X_1].    (10)
In Step 5 of the derivation, the technique is used 3 times, on the terms (X_0 + V)(X_0 + W) and VW from Step 3 (the term
VW appears twice in the equation of Step 3). Figure 4 shows the filtering structure that results from
the (3-by-3) FFA. This structure requires 6 length-N/3 FIR filters and 10 pre/post-processing
additions to realize the proper transfer function. The (3-by-3) FFA structure requires 6(N/3)
multipliers and 6(N/3 - 1) + 10 adders. Comparing the implementational cost of the traditional and
reduced-complexity 3-parallel structures, it should be clear that the reduced-complexity filtering
structure provides a savings of approximately 33% over the traditional structure.
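Equation (10) can likewise be verified numerically. The sketch below (our own code and helper names, again a behavioral check rather than the hardware structure) forms the 6 subfilter products of (10) and recombines them into the 3 output phases; it reproduces direct convolution:

```python
def conv(a, b):
    # Plain linear convolution (polynomial multiplication).
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def fir3parallel_ffa(x, h):
    """3-parallel fast FIR per (10): 6 length-(N/3) subfilterings instead
    of 9. Assumes len(x) and len(h) are multiples of 3."""
    x0, x1, x2 = x[0::3], x[1::3], x[2::3]
    h0, h1, h2 = h[0::3], h[1::3], h[2::3]
    add = lambda a, b: [u + v for u, v in zip(a, b)]
    r1, r2, r3 = conv(x0, h0), conv(x1, h1), conv(x2, h2)
    r4 = conv(add(x0, x1), add(h0, h1))            # (X0+X1)(H0+H1)
    r5 = conv(add(x1, x2), add(h1, h2))            # (X1+X2)(H1+H2)
    r6 = conv(add(add(x0, x1), x2), add(add(h0, h1), h2))
    n = len(r1)
    y0, y1 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i in range(n):
        y0[i] += r1[i]                       # H0 X0
        y0[i + 1] += r5[i] - r2[i] - r3[i]   # z^-3 (H1 X2 + H2 X1)
        y1[i] += r4[i] - r2[i] - r1[i]       # H0 X1 + H1 X0
        y1[i + 1] += r3[i]                   # z^-3 H2 X2
    y2 = [r6[i] - r4[i] - r5[i] + 2 * r2[i] for i in range(n)]  # H0X2+H1X1+H2X0
    y = [0.0] * (3 * n + 2)
    y[0::3], y[1::3], y[2::3] = y0, y1, y2   # interleave the three phases
    return y
```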
3.4: Cascading Fast FIR Algorithms
In the previous two sections, we have shown how the FFAs can be used to produce reduced-complexity
2-parallel and 3-parallel filtering structures. In many design situations, however, it
may be desirable to use parallel FIR filters with greater levels of parallelism. The (2-by-2) and
(3-by-3) FFAs have the added flexibility that they can be cascaded together to achieve greater
levels of parallelism. The cascading of FFAs is a straight-forward extension of the original FFA
application. For example, a (m-by-m) FFA can be cascaded with a (n-by-n) FFA to produce
a (m x n)-parallel filtering structure. The set of FIR filters that result from the application of the
(m-by-m) FFA are further decomposed, one at a time, by the application of the (n-by-n) FFA.
The resulting set of filters will be of length N / ( m x n). When cascading the FFAs, it is important
to keep track of both the number of multipliers and the number of adders required for the filtering
structure. The number of required multipliers is calculated as follows:
where r is the number of FFAs used, L_i is the block size of the FFA at step i and M_i is the number
of filters that result from the application of the i-th FFA. The number of required adders is
calculated as follows:
where A_i is the number of pre/post-processing adders required by the i-th FFA. Consider the case
of cascading 2 (2-by-2) FFAs. The resulting 4-parallel filtering structure would require a total of
9N/4 multipliers and 20 + 9(N/4 - 1) adders for implementation. The reduced-complexity 4-parallel
filtering structure represents a hardware (area) savings of nearly 44% when compared to the 4 N
multipliers required in the traditional 4-parallel FIR filtering structure.
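The multiplier bookkeeping for cascaded FFAs can be sketched as follows (our own helper with hypothetical parameter names; each (2-by-2) FFA contributes a block size of 2 and 3 subfilters):

```python
def cascaded_ffa_mults(N, stages):
    """Multiplier count for cascaded FFAs: (prod M_i) * N / (prod L_i),
    where stages is a list of (L_i, M_i) pairs."""
    mults, block = 1, 1
    for L, M in stages:
        mults *= M
        block *= L
    assert N % block == 0, "filter length assumed divisible by total block size"
    return mults * N // block

# Two cascaded (2-by-2) FFAs -> a 4-parallel structure with 9N/4 multipliers.
N = 24
ours = cascaded_ffa_mults(N, [(2, 3), (2, 3)])
traditional = 4 * N
print(ours, traditional, 1 - ours / traditional)  # 54 96 0.4375 (~44% savings)
```

A (2-by-2) cascaded with a (3-by-3) FFA, (2, 3) followed by (3, 6), yields the 6-parallel structure with 18 subfilters of length N/6 mentioned later in the quantization discussion.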
In order to understand how the FFAs are cascaded, it is useful to consider (6) once again. Let
us begin with a parallel FIR filter with a block size of 4. From (6), we have

Y = Y_0 + z^{-1}Y_1 + z^{-2}Y_2 + z^{-3}Y_3
  = (X_0 + z^{-1}X_1 + z^{-2}X_2 + z^{-3}X_3)(H_0 + z^{-1}H_1 + z^{-2}H_2 + z^{-3}H_3).    (13)
The reduced-complexity 4-parallel FIR filtering structure is obtained by first applying the (2-by-2)
FFA to (13) and then applying the FFA a second time to each of the filtering operations that result
from the first application of the FFA. From (13), we have
Y = (X_0' + z^{-1}X_1')(H_0' + z^{-1}H_1'),    (14)

where X_0' = X_0 + z^{-2}X_2, X_1' = X_1 + z^{-2}X_3, H_0' = H_0 + z^{-2}H_2 and H_1' = H_1 + z^{-2}H_3.
Application 1:

X_0'H_0' = (X_0 + z^{-2}X_2)(H_0 + z^{-2}H_2)
         = X_0H_0 + z^{-2}[(X_0 + X_2)(H_0 + H_2) - X_0H_0 - X_2H_2] + z^{-4}X_2H_2    (16)

Application 2:

X_1'H_1' = (X_1 + z^{-2}X_3)(H_1 + z^{-2}H_3)
         = X_1H_1 + z^{-2}[(X_1 + X_3)(H_1 + H_3) - X_1H_1 - X_3H_3] + z^{-4}X_3H_3    (17)

Application 3 applies the FFA in the same manner to the term (X_0' + X_1')(H_0' + H_1').
With these three applications of the (2-by-2) FFA, only 9 length-N/4 filtering operations (rather
than 16) are required in the 4-parallel filtering structure. The resulting reduced-complexity 4-parallel filtering
structure is shown in Figure 5. The 4-parallel structure shown in Figure 5 can be thought of as
3 separate (2-by-2) FFAs each producing 2 outputs which are combined to produce the 4 filter
outputs. As stated earlier, this structure uses approximately 44% less hardware than the traditional
4-parallel FIR filter.
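The cascade can also be exercised end to end. Applying the (2-by-2) split recursively, a Karatsuba-style decomposition (the code and its cutoff parameter are our own illustration, not the paper's structure), reproduces direct convolution:

```python
def conv_direct(a, b):
    # Plain linear convolution, used at the bottom of the recursion.
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def conv_ffa2(x, h, min_len=2):
    """Cascaded (2-by-2) FFA: every level trades 4 half-length filterings
    for 3, recursing until the operands are short (or odd-length)."""
    if len(x) <= min_len or len(h) <= min_len or len(x) % 2 or len(h) % 2:
        return conv_direct(x, h)
    x0, x1 = x[0::2], x[1::2]
    h0, h1 = h[0::2], h[1::2]
    p0 = conv_ffa2(x0, h0, min_len)
    p1 = conv_ffa2([a + b for a, b in zip(x0, x1)],
                   [a + b for a, b in zip(h0, h1)], min_len)
    p2 = conv_ffa2(x1, h1, min_len)
    n = len(p0)
    y0 = [0.0] * (n + 1)
    for i in range(n):
        y0[i] += p0[i]          # X0 H0
        y0[i + 1] += p2[i]      # subrate-delayed X1 H1
    y1 = [p1[i] - p0[i] - p2[i] for i in range(n)]
    y = [0.0] * (2 * n + 1)
    y[0::2], y[1::2] = y0, y1
    return y
```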
4: Quantization Process
In order to physically implement a filter in hardware, the filter coefficients must first be quantized
using a power-of-two representation of a given word length K (excluding the sign bit). It is assumed
that all filter coefficients are constrained to [-1, 1). Traditionally, the quantization process employs
a two's complement representation in conjunction with a truncation scheme or a rounding scheme.
However, [20, 7, 2] show that if the filter coefficients are first scaled before the quantization process
is performed, the resulting filter will have much better frequency-space characteristics. By using
the appropriate scale factor, the filter coefficients collectively settle into the optimal power-of-two
quantization space. The NUS [7] and INUS [2] algorithms both employ a scalable quantization
process and produce excellent results in terms of the frequency-space characteristics of the quantized
filters. To begin the process, the ideal filter is normalized so that the largest coefficient has an
absolute value of 1 and the quantized filter is initialized with zeros. The normalized ideal filter is
then multiplied by a variable scale factor (VSF). The VSF steps through the range of numbers from
.4375 to 1.13 with a step size of 2^{-K-1}. Signed power-of-two (SPT) terms are then allocated to the
quantized filter coefficient that represents the largest absolute difference between the scaled ideal
filter and the quantized filter (see Example 1). The allocation of SPT terms stops when the absolute
difference between all of the scaled ideal coefficients and the quantized coefficients is less than 2^{-K}.
Once the allocation of terms has stopped the normalized peak ripple (NPR) is calculated. The
process is then repeated for a new scale factor. The quantized filter leading to the minimum NPR is
chosen. The INUS algorithm slightly modifies the term allocation process by using some searching
techniques. The NPR of the filters is slightly improved, but at the expense of computation time.
When the quantized filter is implemented, a post-processing scale factor (PPSF) is used to properly
rescale the magnitude of the resulting data stream. Essentially, the PPSF reverses the normalization
and scaling introduced in the quantization process. While the scaling process changes the magnitude
of the filter response, it should be noted that it does not change the functionality of the filter.
Ideal Filter (IF) = [.26 .131], Initial Quantized Filter (QF) = [0 0].
The quantization process that we use is similar to the one described above. However, we modify
the quantized filter selection criteria and the scaling process. In our quantization process, we do not
use the NPR as the selection criteria for the best quantized filter. In order to calculate the NPR, it
is assumed that the shape of the filter in the frequency-space domain is known. In other words, the
location of the passband and stopband is known prior to the quantization step so that the NPR in the
passband and the stopband can be calculated after the SPT terms are allocated. In our case, it is the
set of filters that result from the application of the FFAs, and not the original filter, which requires
quantization. For example, the reduced-complexity 6-parallel FIR filtering structure contains 18
filters which must each have their coefficients quantized. Since the set of filters which result from
the FFAs do not have well defined passbands and stopbands, it is not possible to know the location
of the passbands and stopbands of these filters prior to the quantization step. Therefore, the NPR
cannot be used as a selection criteria for choosing the best quantized filter. The selection criteria
used is based upon calculating the maximum absolute difference (MAD) between the ideal and
quantized filter in the frequency-space domain. Since there is no well defined stopband or passband,
the MAD is calculated over the entire range of frequencies. The quantized filter producing the
minimum MAD is chosen. A slight improvement in the frequency-space characteristics (1 or 2 dB in
the stopband) may be gained by employing searching techniques similar to those used in the INUS
algorithm. However, we do not believe that the improvement is enough to justify the increase in
the computation time of our quantization algorithm.
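A minimal sketch of this MAD-driven loop (our own simplification: a coarse VSF grid, an unbounded number of SPT terms per coefficient, and a dense frequency grid stand in for the paper's exact procedure; all function names are ours):

```python
import math

def quantize_spt(ideal, K):
    """Greedily allocate signed power-of-two terms to the coefficient with
    the largest absolute error until every error is below 2**-K."""
    q = [0.0] * len(ideal)
    while True:
        errs = [c - qc for c, qc in zip(ideal, q)]
        i = max(range(len(errs)), key=lambda j: abs(errs[j]))
        e = errs[i]
        if abs(e) < 2.0 ** -K:
            return q
        k = min(K, max(0, round(-math.log2(abs(e)))))  # nearest SPT exponent
        q[i] += math.copysign(2.0 ** -k, e)

def freq_mad(a, b, grid=256):
    """Maximum absolute difference of two filters' frequency responses."""
    worst = 0.0
    for g in range(grid):
        w = math.pi * g / grid
        dr = sum((c - d) * math.cos(w * n) for n, (c, d) in enumerate(zip(a, b)))
        di = sum((c - d) * math.sin(w * n) for n, (c, d) in enumerate(zip(a, b)))
        worst = max(worst, math.hypot(dr, di))
    return worst

def mad_quantize(ideal, K, steps=32):
    """Sweep the variable scale factor over [.4375, 1.13] and keep the
    quantized filter with minimum MAD over the whole frequency range."""
    peak = max(abs(c) for c in ideal)
    normalized = [c / peak for c in ideal]
    best = None
    for s in range(steps + 1):
        vsf = 0.4375 + (1.13 - 0.4375) * s / steps
        scaled = [vsf * c for c in normalized]
        q = quantize_spt(scaled, K)
        mad = freq_mad(scaled, q)
        if best is None or mad < best[0]:
            best = (mad, vsf, q)
    return best  # (MAD, chosen VSF, quantized coefficients)
```

Running `mad_quantize` on Example 1's ideal filter [.26 .131] with K = 9 returns a quantized filter whose coefficients each lie within 2^-9 of the scaled ideal values.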
In addition to modifying the quantized filter selection criteria, we modify the scaling process. As
stated above, every quantized filter must have a PPSF to properly adjust the magnitude of the filter
output. The PPSF is calculated as follows:
(19)
In cases where large levels of parallelism are used, the PPSFs can contribute to a significant amount
of hardware overhead (the reduced-complexity 6-parallel FIR filter requires 18 PPSFs). In an effort
to reduce this hardware overhead, we restrict the PPSFs to the following set of values:
New PPSFs = [.125, .25, .375, .5, .625, .75, .875, 1].    (20)
The scale factors in this set, when implemented with shifts and additions alone, require very few
operations. For example, the scale factor .5 can be implemented as a single shift-to-the-right by one
bit position. The original PPSF is replaced with the new PPSF that is nearest in value. Since the
scale factor of the quantized filter is shifted in value, the quantized filter coefficients must also be
properly shifted in value. This is accomplished using the following steps:
Step 1: Determine the effective value of the filtering coefficients.    (21)
Step 2: Rescale the coefficients by the ratio of the old and new PPSFs.    (22)
Once the quantized filter coefficients are properly shifted in value, the quantized coefficients are
quantized to their final power-of-two representation.
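The rescaling in the steps above can be sketched as follows (helper names are ours; the effective filter, coefficient times scale factor, is held fixed while the PPSF is snapped to the cheap set of (20)):

```python
ALLOWED_PPSFS = [0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0]

def adjust_for_ppsf(coeffs, ppsf):
    """Replace an arbitrary PPSF with the nearest shift-and-add-friendly
    value and rescale the coefficients so each effective value
    (coefficient * PPSF) is unchanged; the shifted coefficients would
    then be requantized to their final power-of-two form."""
    new_ppsf = min(ALLOWED_PPSFS, key=lambda p: abs(p - ppsf))
    shifted = [c * ppsf / new_ppsf for c in coeffs]
    return shifted, new_ppsf

shifted, p = adjust_for_ppsf([0.5, -0.25], 0.77)
print(p)  # 0.75
```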
The second area reduction technique that is used attempts to reduce the number of 1s required
in a coefficient's power-of-two representation. Using a canonic signed digit (CSD) representation,
coefficients can be represented using the fewest number of non-zero bits [19]. The CSD representation
is a signed power-of-two representation where each of the bits is in the set {0, 1, -1} and no two
consecutive bit positions are non-zero. Using a CSD representation for each
coefficient implies that the coefficient can be implemented in a shift and add fashion using the fewest
number of shift and add operations. In the previous example, X = 0.1110. Converting this to its
CSD representation, X = 1.00(-1)0. Therefore, Y multiplied by X can be implemented as Y - Y >> 3.
The CSD representation requires 1 add and 1 shift compared to the 2 adds and 3 shifts required by
the two's complement representation. On the average, a CSD representation contains approximately
33% fewer non-zero bits than its two's complement counterpart. This, in turn, implies a hardware
savings of about 33% per coefficient.
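The CSD recoding itself is easy to sketch in the integer domain (our own helper; a K-bit fraction is first scaled by 2^K, so 0.1110 becomes 14):

```python
def csd(n):
    """Canonic signed-digit recoding of a nonnegative integer: digits in
    {-1, 0, 1}, least-significant first, no two adjacent non-zeros."""
    digits = []
    while n:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# The text's example: X = 0.1110 (binary) = 0.875 = 14 / 2**4.
# CSD of 14 reads 1 0 0 -1 0 (MSB first), i.e. X = 1.00(-1)0 = 1 - 2**-3,
# so Y * X becomes Y - (Y >> 3): 1 add and 1 shift.
print(csd(14))  # [0, -1, 0, 0, 1]
```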
Figure 9 shows the filter structure which results from adjacent coefficient sharing. It should be
noted that if the filter is of odd length, the Nth filter coefficient remains in its CSD representation.
One of the advantages of adjacent substructure sharing is that it does not significantly increase the
routing required in the implementation. It is important to minimize the routing in any circuit as
the routing capacitance plays a role in the power consumed in the circuit.
6: Experimental Results
In this section, our procedure for generating parallel filtering structures is compared t o the
traditional method of generating parallel filtering structures. We consider 2-, 3-, 4-, 6-, 8- and
12-parallel filters. The traditional method consists of standard hardware duplication and a CSD
quantization of the filter coefficients using truncation. Our method consists of the procedures
outlined in Sections 3, 4 and 5. The number of adders required for implementation and the frequency
response of both structures are compared. We present our results using both the ideal PPSFs and
the modified PPSFs in order to show the advantage of modifying the PPSFs. In both the traditional
method and our method, the filter coefficients are quantized using a word length of K = 9. The
first filter used for comparison is a 24-tap low-pass FIR filter designed using the Remez Exchange
algorithm. The ideal filter has normalized passband and stopband edge frequencies of .15 and .25,
respectively (normalized t o .5). The ideal filter is an equiripple filter with passband and stopband
ripples of .0049 (46.1938 dB). Quantizing this filter using a CSD coding requires 37 adders. The
quantized filter has a passband ripple (PBR) of 0.0151 and a stopband ripple (SBR) of 0.0098
(40.1436 dB). The number of adders required in the traditional parallel filtering structure is L * 37,
where L is the block size or the level of parallelism. Table 1 shows the results of our procedure
for various levels of parallelism using the ideal PPSFs. The first column defines the type of fast
algorithm used. For example, 2,3 denotes a (2-by-2) fast algorithm cascaded with a (3-by-3)
algorithm. The first row shows the results if the original filter is not decomposed using a fast
algorithm. Table 2 shows the results using the modified PPSFs. Figure 11 shows the frequency
responses of the original (ideal) filter, the 2-parallel filter after standard quantization and the
2-parallel filter after our quantization process using the ideal PPSFs (Figure 12 shows the same plot
using the modified PPSFs). Table 3 shows a count of the total number of adders required by both
the traditional method and our method to implement an L-parallel filter for various values of L. As
is clearly shown, our method is capable of producing parallel FIR filters with a significant reduction
in the hardware cost compared to traditional implementation styles. Additionally, it should be
noted that our method improves upon the frequency response produced by standard quantization
and truncation. In order to verify our approach, a second example filter is used. This filter is
a 72-tap low-pass FIR filter designed using the Remez Exchange algorithm. The ideal filter has
normalized passband and stopband edge frequencies of .20 and .225, respectively. The ideal filter
is an equiripple filter with passband and stopband ripples of .0144 (36.8423 dB). Quantizing this
filter using CSD coding requires 111 adders. The quantized filter has a PBR of 0.0406 and a SBR
of 0.0274 (31.2553 dB). The number of adders required in the traditional parallel filtering structure
is L * 111. Tables 4, 5 and 6 show the results of our procedure for various levels of parallelism.
Figures 13 and 14 show a comparison of the frequency responses for the 2-parallel and the 4-parallel
filtering structures, respectively, with modified PPSFs. The tables again show that our method is
able to significantly reduce the hardware costs while maintaining excellent filter performance. It
should be noted that in both examples, the use of the modified PPSFs reduces the hardware costs
with a minimal amount of degradation to the frequency-space performance of the filter. Although
both of the example filters are designed using a Remez Exchange algorithm, any type of filter design
algorithm can be used in conjunction with this procedure. The only information required by our
method is the set of ideal filter coefficients.
7: Conclusion
In this paper, we have developed a method for designing reduced complexity, parallel FIR filtering
structures. By applying FFAs, a novel quantization process and various area reduction techniques,
the parallel filtering structures that are produced require up to 45% less hardware than is required
by traditional parallel filtering methods for the given examples. Not only is this method suited
for custom and standard cell VLSI implementations, but it is also suited for implementations in
software and programmable DSP chips. In a software or programmable DSP implementation, the
FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement the PPSFs,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple
[Table rows for the FFA cascades 2,2,2 and 2,2,3: surviving fragments give total CSD adder counts near 250 and 231 (169 after sharing) and stopband ripples of .0077 (42.3029 dB) and .0068 (43.3736 dB); the remaining entries are not recoverable from this copy.]
FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple
L = Blocksize, SAC = Adders required using standard method and CSD coding,
OAI = Adders required using our method, ideal PPSFs,
OAM = Adders required using our method, modified PPSFs
FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple
FA = Type of fast algorithm used, TCA = Total CSD adds after quantization,
SSS = Adds required after sub-structure sharing, SA = Adds required to implement scale factors,
PP = Pre/Post-processing adds, PBR = Passband ripple, SBR = Stopband ripple
L = Blocksize, SAC = Adders required using standard method and CSD coding,
OAI = Adders required using our method, ideal PPSFs,
OAM = Adders required using our method, modified PPSFs
filtering algorithm produced by our method would take significantly less computation time due to
the reduction in the number of operations that must be performed.
Based upon our results, high levels of parallelism can be achieved without incurring an overwhelming
hardware overhead. In situations where a large throughput is required, but the design
area is constrained, our method would be ideally suited. We believe that our method of generating
parallel filtering structures allows the consideration of parallel processing even in situations where
the design area is extremely limited.
References
[1] Abhijit Chatterjee, Rabindra K. Roy, and Manuel A. d'Abreu. Greedy hardware optimization for linear digital circuits using number splitting and refactorization. IEEE Transactions on VLSI Systems, 1(4):423-431, December 1993.
[2] Chao-Liang Chen, Kei-Yong Khoo, and Alan N. Willson Jr. An improved polynomial-time algorithm for designing digital filters with power-of-two coefficients. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 84-87, Seattle, WA, May 1995.
[3] Richard I. Hartley and Keshab K. Parhi. Digit-Serial Computation. Kluwer Academic Publishers, Norwell, MA, 1995.
[4] R. A. Hawley, B. C. Wong, T. Lin, J. Laskowski, and H. Samueli. Design techniques for silicon compiler implementations for high-speed FIR digital filters. IEEE Journal of Solid-State Circuits, pages 656-667, May 1996.
[5] R. Jain, P. T. Yang, and T. Yoshino. FIRGEN: A computer-aided design system for high performance FIR filter integrated circuits. IEEE Transactions on Signal Processing, 39(7):1655-1667, 1991.
[6] H. K. Kwan and M. T. Tsim. High speed 1-D FIR digital filtering architectures using polynomial convolution. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1863-1866, Dallas, TX, April 1987.
[7] Dongning Li, Jianjian Song, and Yong Ching Lim. A polynomial-time algorithm for designing digital filters with power-of-two coefficients. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 84-87, May 1993.
[8] Y. C. Lim. Design of discrete-coefficient-value linear phase FIR filters with optimum normalized peak ripple magnitude. IEEE Transactions on Circuits and Systems, 37(12):1480-1486, December 1990.
[9] Y. C. Lim and A. G. Constantinides. Linear phase FIR digital filter without multipliers. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 185-188, Tokyo, Japan, July 1983.
[10] Y. C. Lim and B. Liu. Design of cascade form FIR filters with discrete valued coefficients. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 1735-1739, November 1988.
[11] Y. C. Lim and S. R. Parker. FIR filter design over a discrete powers-of-two coefficient space. IEEE Transactions on Acoustics, Speech, and Signal Processing, pages 583-591, June 1983.
[12] Zhi-Jian Mou and Pierre Duhamel. Fast FIR filtering: Algorithms and implementations. Signal Processing, 13(4):377-384, December 1987.
[13] Zhi-Jian Mou and Pierre Duhamel. A unified approach to the fast FIR filtering algorithms. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1914-1917, New York, NY, April 1988.
[14] Zhi-Jian Mou and Pierre Duhamel. Short-length FIR filters and their use in fast nonrecursive filtering. IEEE Transactions on Signal Processing, 39(6):1322-1332, June 1991.
[15] Keshab K. Parhi. Algorithms and architectures for high-speed and low-power digital signal processing. In Proceedings of 4th International Conference on Advances in Communications and Control, pages 259-270, Rhodes, Greece, June 1993.
[16] Keshab K. Parhi. Trading off concurrency for low-power in linear and non-linear computations. In Proceedings of the IEEE Workshop on Nonlinear Signal Processing, pages 895-898, Halkidiki, Greece, June 1995.
[17] D. N. Pearson and K. K. Parhi. Low-power FIR digital filter architectures. In Proceedings of IEEE International Symposium on Circuits and Systems, pages 231-234, Seattle, WA, May 1995.
[18] M. Potkonjak, M. B. Srivastava, and A. Chandrakasan. Efficient substitution of multiple constant multiplications by shifts and additions using iterative pairwise matching. In DAC-94, Proceedings of the 31st ACM/IEEE Design Automation Conference, pages 189-194, 1994.
[19] G. W. Reitwiesner. Binary arithmetic. In Advances in Computers, volume 1, pages 231-308. Academic, 1966.
[20] H. Samueli. An improved search algorithm for the design of multiplierless FIR filters with power-of-two coefficients. IEEE Transactions on Circuits and Systems, 36:1044-1047, July 1989.
[21] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice Hall, Englewood Cliffs, NJ, 1993.
Original, Standard Quantization (dash) and Our Quantization (dash-dot) Filter Responses (plot: magnitude in dB versus normalized frequency, 0 to 0.5)

Original, Standard Quantization (dash) and Our Quantization (dash-dot) Filter Responses (plot: magnitude in dB versus normalized frequency, 0 to 0.5)