
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 58, NO. 9, SEPTEMBER 2011

Two High-Performance Adaptive Filter Implementation Schemes Using Distributed Arithmetic


Rui Guo and Linda S. DeBrunner
Abstract: Distributed arithmetic (DA) is used to design bit-level architectures for vector-vector multiplication with a direct application for the implementation of convolution, which is necessary for digital filters. In this brief, two novel DA-based implementation schemes are proposed for adaptive finite-impulse response filters. Different from conventional DA techniques, our proposed schemes use coefficients as addresses to access a series of lookup tables (LUTs) storing sums of delayed and scaled input samples. Two smart LUT updating methods are developed, and least-mean-square adaptation is performed to update the weights and minimize the mean square error between the estimated and desired output. Results show that our two high-performance designs achieve high speed, low computation complexities, and low area cost.

Index Terms: Adaptive filter, distributed arithmetic (DA), finite-impulse response (FIR), least mean square (LMS), lookup table (LUT), multiply accumulate (MAC), offset-binary coding (OBC).

I. INTRODUCTION

Most portable electronic devices such as cellular phones, personal digital assistants, and hearing aids require digital signal processing (DSP) for high performance. Due to the increased demand for implementing sophisticated DSP algorithms, low-cost designs, i.e., low area and power cost, are needed to make these handheld devices small with good performance. Various types of DSP operations are employed in practice. Filtering is one of the most widely used signal processing operations [1]. For FIR filters, output $y(n)$ is a linear convolution of weights $w_n$ and inputs. For an $N$th-order FIR filter, the generation of each output sample $y(n)$ takes $N + 1$ multiply-accumulate (MAC) operations. Since general-purpose multipliers require significant chip area, alternate methods of implementing multiplication are often used, particularly when the coefficient values are known prior to implementation. Distributed arithmetic (DA) is one way to implement convolution without multipliers, where the MAC operations are replaced by a series of LUT accesses and summations. Techniques such as ROM decomposition [2] and offset-binary coding (OBC) [7] can reduce the LUT size, which would otherwise increase exponentially with the filter length $N + 1$ for conventional DA.

However, in many applications such as echo cancelation and system identification, coefficient adaptation is needed. This adaptation makes it challenging to implement DA-based adaptive filters with low cost due to the necessity of updating LUTs. Several approaches have been developed for DA-based adaptive filters, i.e., from the point of view of reducing logic complexity [3]–[6], [8]. Recently, a DA-based FIR adaptive filter implementation scheme has been presented in [5], [6], and [8], which uses extra auxiliary LUTs to help in the updating; however, memory usage is doubled.

In this brief, two novel LMS adaptation-based DA implementation schemes are proposed for FIR adaptive filter implementation. The first proposed algorithm updates the LUTs in a similar way as described in [5], [6], and [8] but without the need for auxiliary LUTs. The second proposed algorithm incorporates an OBC-based LUT updating scheme that reduces memory usage. It is shown that our two proposed schemes both outperform that described in [5], [6], and [8], with the second proposed algorithm requiring less memory usage but more computation cost than our first proposed algorithm.

This brief is organized as follows. Section II describes the background of DA and OBC. Then, we present our proposed schemes for the DA-based FIR adaptive filter in Section III. A performance comparison of different DA-based implementations is made in Section IV. Our conclusions are given in Section V.

II. BACKGROUND

A. DA

DA was first studied by Croisier et al. [9] in 1973 and popularized by Peled and Liu [10]. Quantization effects in the DA system were analyzed in [11] and [12]. Useful tutorials on DA were provided in [7] and [13]. DA is used to design bit-level architectures for vector multiplication [2].
Traditionally, for filters implemented using DA, the input samples are used as addresses to access a series of LUTs whose entries are sums of coefficients. Consider a discrete $N$th-order FIR filter with constant coefficients, and input samples coded as $B$-bit two's complement numbers with only the sign bit to the left of the binary point as follows:

$$x(n-k) = -x_{k0} + \sum_{j=1}^{B-1} x_{kj}\,2^{-j}. \tag{1}$$

Manuscript received July 20, 2010; revised December 9, 2010, February 10, 2011, and April 20, 2011; accepted June 6, 2011. Date of publication August 22, 2011; date of current version September 14, 2011. This paper was recommended by Associate Editor P. K. Meher. The authors are with the Department of Electrical and Computer Engineering, Florida State University, Tallahassee, FL 32310 USA (e-mail: rg07g@fsu.edu; linda.debrunner@fsu.edu). Digital Object Identifier 10.1109/TCSII.2011.2161168

1549-7747/$26.00 © 2011 IEEE

GUO AND DEBRUNNER: TWO HIGH-PERFORMANCE ADAPTIVE FILTER IMPLEMENTATION SCHEMES USING DA


TABLE I LUT CONTENTS FOR FOUR-TAP FIR WITH OBC CODING [2]

Using (1) to compute the FIR output gives

$$y(n) = -\sum_{k=0}^{N} w_k x_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} w_k x_{kj}\right] 2^{-j}. \tag{2}$$

With $C_j = \sum_{k=0}^{N} w_k x_{kj}$, $j \in [1, B-1]$, and $C_0 = -\sum_{k=0}^{N} w_k x_{k0}$, (2) can be rewritten as

$$y(n) = \sum_{j=0}^{B-1} C_j\,2^{-j}. \tag{3}$$
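As a numerical check of (1)–(3), the following Python sketch builds a coefficient-sum LUT for a small filter and evaluates $y(n)$ one bit plane at a time. It is a minimal illustration rather than the authors' implementation: the function names (`to_bits`, `build_lut`, `da_fir_output`), the example coefficients, and the word length $B = 8$ are assumptions, and the inputs are assumed to lie in $[-1, 1)$ and to be exactly representable in $B$ bits.

```python
from itertools import product

def to_bits(x, B):
    """Bits [x_0, ..., x_{B-1}] of x in [-1, 1), so that x = -x_0 + sum_j x_j 2^-j as in (1)."""
    q = round(x * 2 ** (B - 1))             # integer in [-2^(B-1), 2^(B-1))
    assert -2 ** (B - 1) <= q < 2 ** (B - 1)
    q &= (1 << B) - 1                        # wrap to B-bit two's complement
    return [(q >> (B - 1 - j)) & 1 for j in range(B)]

def build_lut(w):
    """LUT addressed by the bits (x_0j, ..., x_Nj); each entry is sum_k w_k * x_kj."""
    return {addr: sum(wk * b for wk, b in zip(w, addr))
            for addr in product((0, 1), repeat=len(w))}

def da_fir_output(w, x_window, B):
    """Evaluate (3): the j = 0 (sign-bit) term is subtracted, the others are scaled by 2^-j."""
    lut = build_lut(w)
    bits = [to_bits(x, B) for x in x_window]         # bits[k][j] = x_kj
    y = 0.0
    for j in range(B):
        c = lut[tuple(bits[k][j] for k in range(len(w)))]
        y += -c if j == 0 else c * 2.0 ** (-j)
    return y

w = [0.25, -0.5, 0.125, 0.75]                # example weights (assumed)
x_window = [0.5, -0.25, 0.375, -0.875]       # x(n), x(n-1), x(n-2), x(n-3)
y_da = da_fir_output(w, x_window, B=8)
y_mac = sum(wk * xk for wk, xk in zip(w, x_window))
print(y_da, y_mac)
```

The LUT has $2^{N+1}$ entries (16 for four taps), and the DA result agrees with the direct multiply-accumulate sum; no multiplication by an input sample occurs inside `da_fir_output`.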

The $C_j$ values can be precomputed and stored in a LUT with the input used as the address. This technique allows the FIR filter with known coefficients to be implemented without general-purpose multipliers. This implementation requires a LUT with a size that increases exponentially with the number of taps $N + 1$, which results in a large time cost for accessing the LUT for a high-order filter. Therefore, reducing the LUT size improves system performance as well as area cost. One possible way to reduce LUT size, which is called ROM decomposition, replaces a longer address by shorter addresses, and the data read from smaller LUTs is accumulated to generate the output. For a 64-tap FIR filter, by breaking the LUT with $2^{64}$ entries into smaller LUTs with 4-bit addresses, only $(64/4) \times 2^4 = 2^8$ entries are required.

Fig. 1. DA-based bit-serial architecture for implementing K-tap FIR filter with OBC.

B. OBC

OBC can be used to reduce the LUT size by a factor of 2 to $2^{N-1}$ [7]. By rewriting the input from (1), OBC is derived as follows:

$$x(n-k) = \frac{1}{2}\left\{x(n-k) - \left[-x(n-k)\right]\right\} \tag{4}$$

$$-x(n-k) = -\bar{x}_{k0} + \sum_{j=1}^{B-1} \bar{x}_{kj}\,2^{-j} + 2^{-(B-1)}. \tag{5}$$
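Equation (5) is the familiar two's complement negation rule: invert every bit and add one least significant bit. The short Python check below confirms it for every nonminimal $B$-bit value in the fractional format of (1). The helper names and the choice $B = 8$ are assumptions made for illustration only.

```python
def bits_of(x, B):
    """Two's complement bits [x_0, ..., x_{B-1}] of x in [-1, 1), as in (1)."""
    q = round(x * 2 ** (B - 1)) & ((1 << B) - 1)
    return [(q >> (B - 1 - j)) & 1 for j in range(B)]

def value_of(bits):
    """Evaluate -b_0 + sum_{j>=1} b_j 2^-j."""
    return -bits[0] + sum(b * 2.0 ** (-j) for j, b in enumerate(bits) if j >= 1)

def negate_via_complement(x, B):
    """Right-hand side of (5): complemented bits plus one LSB, 2^-(B-1)."""
    inverted = [1 - b for b in bits_of(x, B)]
    return value_of(inverted) + 2.0 ** (-(B - 1))

B = 8
samples = [q / 2 ** (B - 1) for q in range(-2 ** (B - 1) + 1, 2 ** (B - 1))]
all_match = all(negate_via_complement(x, B) == -x for x in samples)
print(all_match)
```

All quantities are dyadic fractions, so the floating-point comparison is exact here.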

Substituting (1) and (5) into (4) gives

$$x(n-k) = \frac{1}{2}\left[-(x_{k0} - \bar{x}_{k0}) + \sum_{j=1}^{B-1} (x_{kj} - \bar{x}_{kj})\,2^{-j} - 2^{-(B-1)}\right]. \tag{6}$$

By defining $D_{kj}$ as $x_{kj} - \bar{x}_{kj}$, the output from the FIR filter can be written as

$$\begin{aligned} y(n) &= \sum_{k=0}^{N} \frac{w_k}{2}\left[-D_{k0} + \sum_{j=1}^{B-1} D_{kj}\,2^{-j} - 2^{-(B-1)}\right] \\ &= -\sum_{k=0}^{N} \frac{w_k D_{k0}}{2} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} \frac{w_k D_{kj}}{2}\right] 2^{-j} - \sum_{k=0}^{N} \frac{w_k}{2}\,2^{-(B-1)}. \end{aligned} \tag{7}$$

By defining $E_j$ as $\sum_{k=0}^{N} (w_k D_{kj}/2)$ and $E_{\text{extra}}$ as $-\sum_{k=0}^{N} (w_k/2)$, (7) can be rewritten as

$$y(n) = -E_0 + \sum_{j=1}^{B-1} E_j\,2^{-j} + E_{\text{extra}}\,2^{-(B-1)}. \tag{8}$$

The OBC scheme is described in (4)–(8). The LUT contents for a four-tap FIR filter are given in Table I. It can be observed that the first half and the second half of this LUT are mirrored vertically. Therefore, its size can be halved by using $x_{0j}$ to control the sign of each entry at the cost of a slightly increased hardware complexity. The hardware circuit for implementing a K-tap filter is shown in Fig. 1, where $j$ starts from $j = B - 1$ and decreases by 1 each cycle until $j = 0$. $S_1$ is 0 when $j = 0$ and 1 otherwise, and $S_2$ is 1 when $j = B - 1$ and 0 otherwise.

III. PROPOSED SCHEMES

For an FIR filter with LMS adaptation, which involves the automatic update of filter weights in accordance with the estimation error, conventional DA performance suffers from the intensive computation required to rebuild LUTs. Work has been done in [3]–[6], [8], [14], and [15] to reduce the computation workload for LUT updating by only recomputing a few LUT entries. In [5], [6], and [8], the authors proposed an efficient LUT updating method that uses auxiliary LUTs, where only half of the entries need to be recomputed. The techniques proposed in this brief eliminate the need for auxiliary LUTs. Our first scheme uses a similar LUT updating method to [5], [6], and [8] but without the need for auxiliary LUTs since our proposed scheme stores the sums of delayed inputs in LUTs. In our second scheme, OBC is incorporated, and a new updating method is introduced to further reduce the memory usage. Unlike [15], the algorithms proposed in this brief implement the coefficient updating and filter operations concurrently.

A. First Proposed Scheme

Conventional DA stores the sums of weights (coefficients) in LUTs and uses the inputs as addresses. This approach works very well for nonadaptive filters with constant coefficients. However, for adaptive filters, adaptation is necessary for the weights, and the input registers must be updated. By exploiting the commutative property of convolution, the same filtering operation can be obtained by storing sums of delayed input samples in LUTs and by using the binary coefficients as addresses. With the inputs and coefficients having a common bit width, as is often the case, this change does not increase latency; however, computation and memory cost are reduced for a bit-serial DA design. If the inputs and coefficients have different lengths, there is some difference in latency, which we have considered in Section IV. Representing the weights $w_k$ in two's complement form using $B$ bits gives

$$w_k = -w_{k0} + \sum_{j=1}^{B-1} w_{kj}\,2^{-j}. \tag{9}$$

Using (9) to calculate the output yields

$$y(n) = -\sum_{k=0}^{N} x(n-k)\,w_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} x(n-k)\,w_{kj}\right] 2^{-j}. \tag{10}$$

Similar to (2), the term in square brackets has only $2^{N+1}$ possible values; therefore, a LUT can be used. The left table shown in Fig. 2 gives the LUT values for a four-tap FIR filter. When the time index $t = n + 1$, (10) becomes

$$y(n+1) = -\sum_{k=0}^{N} x(n-k+1)\,w_{k0} + \sum_{j=1}^{B-1}\left[\sum_{k=0}^{N} x(n-k+1)\,w_{kj}\right] 2^{-j}. \tag{11}$$

Fig. 2 shows graphically how the LUTs can be updated. Specifically, it can be observed from the term in square brackets that the new input sample $x(n+1)$ is not used for the new entries whose least significant address bit (LSAB) $w_{0j}$ is 0. Since all the combinations of inputs $x(n), x(n-1), \ldots, x(n-N)$ are included in the old LUT at $t = n$, these new entries with LSAB being 0 can be obtained directly by copying the corresponding entries from the old LUT, as indicated by the arrows in Fig. 2. A closer observation discloses that each of the rest of the new entries with LSAB being 1 can be generated by adding $x(n+1)$ to the prior entry in the new LUT. Mathematically, the new entries $T_i(n+1)$ can be obtained from the old entries $T_i(n)$ by (12) and (13) with the entry index $i \in [0, 2^{N+1} - 1]$ as follows:

$$T_i(n+1) = T_{i/2}(n), \quad i \in \{i \mid i \bmod 2 = 0\} \tag{12}$$

$$T_i(n+1) = T_{i-1}(n+1) + x(n+1), \quad i \in \{i \mid i \bmod 2 = 1\}. \tag{13}$$

Fig. 2. LUT update from $t = n$ to $t = n + 1$ (modified from [6]).

In this brief, LMS adaptation is chosen to update the weights $w_k$, as shown in

$$w_k(n+1) = w_k(n) + \mu\,e(n)\,x(n-k) \tag{14}$$

$$e(n) = d(n) - \mathbf{w}(n)\,\mathbf{x}(n) \tag{15}$$

where $d(n)$ is the desired output, $\mathbf{w}(n) = [w_0(n), w_1(n), \ldots, w_N(n)]$, $\mathbf{x}(n) = [x(n), x(n-1), \ldots, x(n-N)]^T$, and $e(n)$ is the error between the desired and estimated output. The top-level circuit diagram of this proposed scheme for an example four-tap FIR filter is shown in Fig. 3. The method for updating the DA_LUT block in our first scheme is similar to that used for updating the auxiliary LUT in [5], [6], and [8], which we refer to as the DA$_0$ scheme. Fig. 4 shows more details of the implementation. The Addr Gen block has to generate the addresses in the order shown in Fig. 4. As shown in the weight update block in Fig. 3, the multiplication from (14) can be simplified to shifting by assuming that the step size $\mu$ is a power of 2 and by quantizing the product of the error and the step size to a power of 2. In contrast to DA$_0$, our scheme uses no auxiliary LUTs but only main LUTs. The two multipliers controlled by $a_0$ cooperate to read the proper entry according to Fig. 2 for updating the new entry each cycle. The updating of LUTs and weights used as addresses can be performed concurrently, which reduces latency. In DA$_0$, two types of LUTs are necessary: the auxiliary LUTs need to be updated first, and then, the updates of the main LUTs are executed.

B. Second Proposed Scheme

In Section II-B, OBC is shown to reduce the number of LUT entries without increasing the number of LUTs required. In this section, we propose a new scheme that combines OBC with our first proposed scheme. Because of the commutative property, as in our first proposed approach, the sums of delayed input samples are stored in LUTs coded using OBC with binary coefficients as the address, as derived in (16)–(19). Equation (17) indicates that, with $w_{kj}$ as the address, LUTs can still be used to store $F_j$, as shown on the left of Fig. 5 for a four-tap FIR filter.

Although the size of LUTs is reduced by applying OBC, the LUT updating still suffers from high computation cost since the oldest sample is included in every entry, e.g., $x(n-3)$ in Fig. 5. The second entry with address 001 at time $t = n$ needs to be updated by adding $(x(n-3) - 2x(n-2) + x(n+1))/2$.
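The cost noted above can be checked on a small four-tap example. The Python sketch below is illustrative only. Fig. 5 is not reproduced in this text, so the entry layout used in `obc_lut` is an assumption, chosen because it reproduces the correction quoted for address 001; the function names and sample values are likewise hypothetical. Under the same assumed layout, the sketch also checks the one-operation-per-entry relations that the brief gives as (20) and (21), with $Q = (x(n+1) + x(n-3))/2$.

```python
def obc_lut(window):
    """Half-size OBC LUT for window = [x(n), x(n-1), ..., x(n-N)].

    Assumed layout (not taken from Fig. 5): the newest sample always enters
    with +1/2; the address bit for tap k (k = 1 is the MSB) selects +1/2
    when it is 0 and -1/2 when it is 1.
    """
    N = len(window) - 1
    lut = []
    for addr in range(2 ** N):
        total = window[0]
        for k in range(1, N + 1):
            bit = (addr >> (N - k)) & 1
            total += -window[k] if bit else window[k]
        lut.append(total / 2)
    return lut

def smart_update(old, newest, oldest):
    """One addition or subtraction per entry, following (20) and (21)."""
    size = len(old)                      # 2^N entries
    q = (newest + oldest) / 2            # precomputed Q
    new = []
    for i in range(size):
        if i < size // 2:                # i < 2^(N-1): rule (20)
            new.append(q + old[2 * i + 1])
        else:                            # i >= 2^(N-1): rule (21)
            new.append(q - old[2 * (size - 1 - i)])
    return new

# Sample values (assumed), all exactly representable binary fractions.
x_nm3, x_nm2, x_nm1, x_n, x_np1 = 0.125, -0.5, 0.25, -0.75, 0.375

old = obc_lut([x_n, x_nm1, x_nm2, x_nm3])     # LUT at t = n
new = obc_lut([x_np1, x_n, x_nm1, x_nm2])     # LUT at t = n + 1, rebuilt directly

delta = [t_new - t_old for t_new, t_old in zip(new, old)]
quoted = (x_nm3 - 2 * x_nm2 + x_np1) / 2      # correction stated for address 001
print(delta[0b001] == quoted, smart_update(old, x_np1, x_nm3) == new)
```

A direct update has to form a multi-term correction for each entry, whereas `smart_update` computes $Q$ once and then spends a single addition or subtraction per entry.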

The rest of the entries need to be updated with approximately the same computation cost as follows:

$$w_k = -w_{k0} + \sum_{j=1}^{B-1} w_{kj}\,2^{-j} \tag{16}$$

$$F_j = \sum_{k=0}^{N} \frac{x(n-k)\,(w_{kj} - \bar{w}_{kj})}{2} \tag{17}$$

$$F_{\text{extra}} = \sum_{k=0}^{N} \frac{x(n-k)}{2} \tag{18}$$

$$y(n) = -F_0 + \sum_{j=1}^{B-1} F_j\,2^{-j} - F_{\text{extra}}\,2^{-(B-1)}. \tag{19}$$

Fig. 3. Top-level circuit diagram for the four-tap adaptive FIR filter.

Fig. 4. Detailed DA_LUT block for the four-tap adaptive FIR filter.

To reduce the computation workload, we propose a smart updating algorithm. Fig. 5 shows how the update works for a four-tap FIR filter, with precomputed $Q = (x(n+1) + x(n-3))/2$. Mathematically, the new entries $T_i(n+1)$ can be updated by (20) and (21) with the entry index $i \in [0, 2^N - 1]$ as follows:

$$T_i(n+1) = Q + T_{2i+1}(n), \quad i \in \{i \mid i < 2^{N-1}\} \tag{20}$$

$$T_i(n+1) = Q - T_{2(2^N - 1 - i)}(n), \quad i \in \{i \mid i \geq 2^{N-1}\}. \tag{21}$$

Fig. 5. LUT update for the four-tap adaptive FIR filter.

To update $F_{\text{extra}}$, we could add together all of the first entries from the sub-LUTs used by ROM decomposition, which requires $num - 1$ additions, where $num$ is the number of sub-LUTs. Another method is to subtract half of the oldest input sample from $F_{\text{extra}}$ and add half of the newest sample to $F_{\text{extra}}$, which requires two operations every cycle, since division by two can be realized by right shifting. The latter method is chosen to update $F_{\text{extra}}$ for this proposed scheme. Performances for the different schemes are compared in Section IV.

IV. PERFORMANCE COMPARISON

To compare our schemes with DA$_0$, synthesis results from a 0.18-μm standard cell library for implementing FIR filters with 8-bit inputs and weights are presented in Fig. 6(a). It is shown that, since extra auxiliary LUTs are necessary for the DA$_0$ scheme, significant area savings can be achieved by our proposed schemes. Similar results are obtained in Fig. 6(b) by implementing a 32-tap FIR filter using FPGA Stratix II EP2S15F672I4. In addition to the area advantage shown in Fig. 6, our proposed algorithm also has an advantage of reduced latency.

Fig. 6. Area comparison. (a) Chip area and (b) 32-tap FIR filter synthesis results.

Formulations for certain critical measurements are derived to compare the performances of our two proposed schemes with DA$_0$. Since these three implementation schemes have a similar clock rate, the memory usage and the computation cost for each scheme are derived and compared. The memory usage is measured by the number of LUT entries. If the step size is assumed, and the estimation error is scaled, to be a power of 2 so that multiplication simplifies to a shift, the only critical computation used for these three schemes is addition. The computation cost is estimated as the number of necessary additions per filter cycle, including LUT updating and data filtering. The memory usages for DA$_0$ and our two proposed schemes, i.e., $M_{\mathrm{DA}_0}$, $M_{\text{proposed1}}$, and $M_{\text{proposed2}}$, respectively, are estimated as follows:

$$M_{\mathrm{DA}_0} = (2^m - 1)\,\frac{K}{m} \times 2 \tag{22}$$

$$M_{\text{proposed1}} = (2^m - 1)\,\frac{K}{m} \tag{23}$$

$$M_{\text{proposed2}} = (2^{m-1})\,\frac{K}{m} \tag{24}$$
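The estimates (22)–(24) are easy to tabulate. In the Python sketch below, `lut_entries` is a hypothetical helper name, $K$ denotes the filter length and $m$ the sub-LUT address width, and $K = 32$ with $m = 4$ is an example choice that matches the 32-tap filter and 4-bit addresses mentioned elsewhere in the brief.

```python
def lut_entries(K, m):
    """LUT entry counts per (22)-(24) for a K-tap filter split into m-bit sub-LUTs."""
    sub_luts = K // m                    # K and m are assumed to be powers of 2
    return {
        "DA0": 2 * (2 ** m - 1) * sub_luts,       # main plus auxiliary LUTs
        "proposed1": (2 ** m - 1) * sub_luts,     # main LUTs only
        "proposed2": 2 ** (m - 1) * sub_luts,     # OBC-coded LUTs
    }

mem = lut_entries(K=32, m=4)
ratio_1 = mem["proposed1"] / mem["DA0"]
ratio_2 = mem["proposed2"] / mem["DA0"]
print(mem, ratio_1, ratio_2)
```

The ratios depend only on $m$, since the factor $K/m$ cancels: the first scheme uses half the entries of DA$_0$, and for $m = 4$ the second scheme uses $8/30$ of them.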
Fig. 7. (a) Savings by the proposed first scheme. (b) Savings by the proposed second scheme.

Here, $K$ is the filter length and $m$ is the number of bits required for the LUT address when ROM decomposition is used. $K$ and $m$ are assumed to be powers of 2. Since our first proposed scheme does not use any auxiliary LUTs, its memory usage is exactly half of that of DA$_0$. In our second proposed scheme, the memory usage is reduced further by using OBC for the LUTs. With OBC-coded LUTs, our second proposed scheme requires the least memory, which is less than 30% of that of DA$_0$, since $M_{\text{proposed2}}/M_{\mathrm{DA}_0} = 2^3/((2^4 - 1) \times 2) = 26.6\%$ for the case of $m = 4$. If $W$ and $B$ are the bit widths of the inputs and coefficients, respectively, then the numbers of necessary additions every filter cycle for these three schemes are estimated as follows:

$$A_{\mathrm{DA}_0} = 2^{m-1}\,\frac{K}{m} + 2^{m}\,\frac{K}{m} + W\,\frac{K}{m} - 1 \tag{25}$$

$$A_{\text{proposed1}} = 2^{m-1}\,\frac{K}{m} + K + B\,\frac{K}{m} - 1 \tag{26}$$

$$A_{\text{proposed2}} = (2^{m-1} + 1)\,\frac{K}{m} + K + B\,\frac{K}{m} + 1. \tag{27}$$

The three addends in (25)–(27) count the additions for updating the LUTs, updating the coefficients, and summing up all the entries read from LUTs, respectively. To examine the effects of changing the ratio of the input width $W$ and the coefficient width $B$, we plot the savings of additions for different values in Fig. 7. It is shown that our two proposed schemes require less addition cost than DA$_0$, whereas the proposed second scheme needs slightly more addition operations than our first proposed approach, which is due to the precomputation of $Q$ and updating of $F_{\text{extra}}$ every cycle.

V. CONCLUSION

In this brief, two different DA-based schemes were presented for FIR adaptive filter implementation. In contrast to conventional DA-based schemes, our schemes store the sums of delayed input samples in LUTs and use the binary coefficients as addresses. It was shown that, since no auxiliary LUTs are required, our first proposed scheme needs exactly half the memory usage required by the previous work, whereas the second proposed scheme needs less than 30% of that required by the previous work. In addition, our two proposed schemes both have low computation cost, with the second scheme requiring slightly more addition operations than the first one. Unlike the previous work, in our schemes, the updating of LUTs and coefficients can be executed concurrently, which enables low latency.

REFERENCES
[1] S. K. Mitra, Digital Signal Processing: A Computer-Based Approach, 2nd ed. New York: McGraw-Hill, 2001.
[2] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ: Wiley, 1999.
[3] C. H. Wei and J. J. Lou, "Multimemory block structure for implementing a digital adaptive filter using distributed arithmetic," Proc. Inst. Elect. Eng., vol. 133, no. 1, pt. G, pp. 19–26, Feb. 1986.
[4] C. F. N. Cowan and J. Mavor, "New digital adaptive-filter implementation using distributed-arithmetic techniques," Proc. Inst. Elect. Eng., vol. 128, no. 4, pt. F, pp. 225–230, Feb. 1981.
[5] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "LMS adaptive filters using distributed arithmetic for high throughput," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 7, pp. 1327–1337, Jul. 2005.
[6] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "A novel high performance distributed arithmetic adaptive filter implementation on an FPGA," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2004, vol. 5, pp. V-161–V-164.
[7] S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Mag., vol. 6, no. 3, pp. 4–19, Jul. 1989.
[8] D. J. Allred, H. Yoo, V. Krishnan, W. Huang, and D. V. Anderson, "An FPGA implementation for a high throughput adaptive filter using distributed arithmetic," in Proc. 12th Annu. IEEE Symp. Field-Programmable Custom Comput. Mach., 2004, pp. 324–325.
[9] A. Croisier, D. Esteban, M. Levilion, and V. Rizo, "Digital filter for PCM encoded signals," U.S. Patent 3 777 130, Dec. 4, 1973.
[10] A. Peled and B. Liu, "A new hardware realization of digital filters," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-22, no. 6, pp. 456–462, Dec. 1974.
[11] K. Kammeyer, "Quantization error analysis of the distributed arithmetic," IEEE Trans. Circuits Syst., vol. CAS-24, no. 12, pp. 681–689, Dec. 1977.
[12] F. Taylor, "An analysis of the distributed-arithmetic digital filters," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-34, no. 5, pp. 1165–1170, Oct. 1986.
[13] K. Kammeyer, "Digital filter realization in distributed arithmetic," in Proc. Eur. Conf. Circuit Theory Des., Genoa, Italy, 1976.
[14] C. F. N. Cowan, S. G. Smith, and J. H. Elliott, "A digital adaptive filter using a memory-accumulator architecture: Theory and realization," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-31, no. 3, pp. 541–549, Jun. 1983.
[15] W. Huang and D. V. Anderson, "Modified sliding-block distributed arithmetic with offset binary coding for adaptive filters," J. Signal Process. Syst., vol. 63, no. 1, pp. 153–163, Apr. 13, 2010.
