the above-said applications require dedicated systems built around a core Application Specific Integrated Circuit (ASIC); hence, the performance of any DSP system depends on the efficient design of the ASIC deployed therein [1]. Certain applications require high speed of operation and low power consumption, whereas others demand less silicon area on the IC. Hence, the ASIC must always be designed to meet the specific demands of the application.
Digital filters are almost always part of a DSP system [2], [3]; therefore, efficient filter design is an important part of DSP ASIC design. In most DSP applications the digital filters are of higher order [4], [5], and these higher-order filters increase the complexity of the DSP system. Textbooks such as [1] present many techniques that improve performance by decreasing the complexity of digital filters: pipelining, retiming, parallel processing, folding and unfolding. Pipelining and retiming shorten the critical path of the filter, thereby increasing its sampling speed. Parallel processing and unfolding increase the throughput of the system at the cost of additional hardware. Folding decreases the area complexity of the filter, but at the cost of a reduced speed of operation. One of these transformation techniques is selected according to the requirements of the application.
Systolic architectures are an attractive design choice for computation-intensive DSP algorithms. Such an architecture is suitable for VLSI-based design because of its modular, regular and simple structure with high throughput [6]. A systolic architecture is an array of Processing Elements (PEs). Many algorithms and architectures that use systolic arrays for FIR filters and discrete sinusoidal transforms are suggested in [7] to [16]. The PEs of a filter consist of multipliers and adders. The multipliers dominate the area of the filter, as they occupy the most area within each PE. Memory-based computation is now becoming popular [17], [18]. The conventional multiplier is replaced by a memory-based multiplier, and using these multipliers in the PEs of a systolic array results in memory-based systolic architectures [19]. In an FIR filter, one input to the multiplier, the filter coefficient, is fixed, and the other, the input sample, is variable. This property of the multiplier inputs gives rise to the concept of memory-based computation in filters [20]. There are two types of memory-based multiplier: the direct-memory-based multiplier and the Distributed Arithmetic (DA) based multiplier [21]. In a direct-memory-based multiplier the product terms are stored in memory, whereas in a DA-based multiplier the inner-product terms are stored in memory. In [22] to [25] the authors suggest both direct-memory-based and DA-based implementations of FIR filters.
SYSTOLIC ARCHITECTURE FOR FIR FILTERS 3
For the transposed FIR filter structure, the direct-memory-based structure involves less hardware complexity than the DA structure [22]. The latency of the transposed FIR filter increases with the order of the filter in the case of direct-memory-based computation, but it increases with the size of the input sample in the case of DA-based computation. In [19] the author proposes a direct-memory-based systolic architecture for FIR filters that uses the conventional memory-based multiplier.
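The difference between the two memory-based schemes described above can be illustrated with a minimal Python sketch. It assumes unsigned input samples and small integer coefficients; the function names and the example values are hypothetical and are not taken from the cited works. The direct-memory table stores every product of one fixed coefficient, while the DA table stores partial inner products addressed by one bit of every sample.

```python
def build_product_lut(coefficient, input_bits):
    """Direct-memory multiplier: store coefficient * x for every unsigned input x."""
    return [coefficient * x for x in range(1 << input_bits)]


def build_da_lut(coefficients):
    """DA multiplier: entry `address` stores the sum of the coefficients whose
    bit is set in `address`, i.e. a partial inner product."""
    taps = len(coefficients)
    return [
        sum(c for k, c in enumerate(coefficients) if (address >> k) & 1)
        for address in range(1 << taps)
    ]


def da_inner_product(da_lut, samples, input_bits):
    """Accumulate one LUT read per input bit, weighted by the bit position."""
    total = 0
    for bit in range(input_bits):
        # Bit `bit` of every sample together forms the LUT address.
        address = sum(((x >> bit) & 1) << k for k, x in enumerate(samples))
        total += da_lut[address] << bit
    return total


coefficients = [3, -5, 7, 2]   # hypothetical fixed filter coefficients
samples = [12, 200, 7, 255]    # hypothetical unsigned 8-bit input samples

# Direct-memory scheme: one table per coefficient, one read per product.
product_luts = [build_product_lut(c, 8) for c in coefficients]
direct_result = sum(lut[x] for lut, x in zip(product_luts, samples))

# DA scheme: one shared table, one read per input bit.
da_result = da_inner_product(build_da_lut(coefficients), samples, 8)
```

In this sketch the direct-memory scheme needs one table read per tap, whereas the DA scheme needs one read per input bit, which mirrors the latency behaviour described in the text.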
In this article we present a new approach to the memory-based systolic architecture for FIR filters that employs a new memory-based multiplication strategy. This approach reduces the hardware complexity of the FIR filter by simplifying the multiplier configuration. This work is thus an enhanced and detailed application of our earlier work [25]. Further, to reduce the latency and the area of the filter, the FIR structure is decomposed and a 2-D memory-based systolic architecture for the FIR filter is realized. It is then implemented on a Xilinx Virtex-7 XC7vx330tffg1157 FPGA using VHDL for verification and performance evaluation. The rest of the paper is organized as follows. In Section 2, the basic memory-based systolic architecture for the FIR filter is derived from the data flow graph, and the memory cell of the structure is explained in detail. In Section 3, the 2-D systolic structure of the filter is derived from the concurrent recursive equation of the filter output response, which reduces the latency of the filter. The area and latency of the 1-D and 2-D structures are also compared. In Section 4, the work is concluded by stating the advantages of the structure.

2. FIR filter using Memory based Systolic Structure

In this section a memory-based systolic structure for an N-tap FIR filter is derived. The filter structure is mapped onto the PEs of the systolic array. The memory cells used in the PEs are then discussed, and the conventional and proposed memory-based PEs are treated in detail.

2.1 Derivation of memory based systolic architecture

The input-output relation [2] of an FIR filter in the time domain is given in Eq. (1) as follows:
4 C.S.VINITHA AND R. K. SHARMA
$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k) \tag{1}$$
where y(n) is the output, x(n) is the input and h(n) is the impulse response of the filter. Eq. (2) gives the same relation in the z-domain:

$$Y(z) = X(z)\, H(z) \tag{2}$$

where Y(z), X(z) and H(z) are the z-transforms of the output, the input and the impulse response of the filter, respectively. The output response Y(z) is expressed in recursive form in Eq. (3) as

$$Y(z) = X(z)\left[h(0) + z^{-1}\left[h(1) + z^{-1}\left[h(2) + \cdots + z^{-1}\left[h(N-2) + z^{-1}\, h(N-1)\right] \cdots \right]\right]\right] \tag{3}$$

because the transfer function H(z) is given as

$$H(z) = \sum_{n=0}^{N-1} h(n)\, z^{-n} \tag{4}$$

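As an illustrative check, the nested form of Eq. (3) can be simulated and compared against the convolution sum of Eq. (1). The Python sketch below is only a behavioural model under simple assumptions (integer samples, zero initial register contents); the function names are hypothetical. In the nested form every tap multiplies the same input sample, and each register passes a partial sum towards the output.

```python
def fir_direct(h, x):
    """Eq. (1): y(n) = sum_k h(k) x(n-k), with x(n) = 0 for n < 0."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_transposed(h, x):
    """Eq. (3): all taps see the same sample; registers carry partial sums."""
    taps = len(h)
    registers = [0] * (taps - 1)   # one z^-1 register between adjacent taps
    y = []
    for sample in x:
        products = [coeff * sample for coeff in h]
        # The output adder combines h(0) x(n) with the delayed partial sum.
        y.append(products[0] + (registers[0] if registers else 0))
        # Register k takes products[k + 1] plus the old value of register k + 1.
        for k in range(taps - 1):
            carry_in = registers[k + 1] if k + 1 < taps - 1 else 0
            registers[k] = products[k + 1] + carry_in
        
    return y


h = [1, 2, 3, 4]                  # hypothetical 4-tap impulse response
x = [1, 0, 0, 0, 5, -2, 7]        # hypothetical input sequence
y_direct = fir_direct(h, x)
y_transposed = fir_transposed(h, x)
```

Both functions return the same output sequence for the same input, which is what the equivalence of Eq. (1) and Eq. (3) requires.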
The transposed-form data flow graph (DFG) of the recursive equation of the filter is given in Fig. 1. The memory-based systolic structure is derived as follows. Each multiplier, delay element and adder of the DFG is mapped onto the PEs of the systolic structure. The conventional multipliers are replaced by memory-based multipliers, in which the product terms are stored in memory for the different combinations of the input samples. As the input is common to all the multipliers in the transposed-form structure, all the multipliers can be replaced by a single memory module. This memory module has multiple memory cells, one for each multiplier in the DFG. The systolic array structure for the DFG is given in Fig. 2. Apart from the memory module, there is a delay cell (D) for the leftmost delay in the DFG of the filter. Cell A replaces each adder followed by a delay element in the DFG of the filter. Thus the DFG is completely mapped onto the memory-based systolic architecture.

Figure 1. Transposed-form data flow graph of an N-tap FIR filter.

Figure 2. (a) Systolic architecture of the FIR filter. (b) Structure of the memory module.

Figure 3. Structure of the memory cell: (a) conventional multiplier; (b) proposed multiplier.
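The mapping described above can be modelled behaviourally in a few lines of Python. This is a sketch under stated assumptions, not the authors' hardware: samples are unsigned, each memory cell is a plain product table, and both the delay cell D and the register inside each cell A are modelled as entries of one register list. All names are hypothetical.

```python
def build_memory_module(coefficients, input_bits):
    """One memory cell per tap: cell k stores h(k) * v for every unsigned value v."""
    return [[c * v for v in range(1 << input_bits)] for c in coefficients]


def memory_based_fir(memory_module, samples):
    """Filter `samples` using table reads in place of multipliers."""
    taps = len(memory_module)
    # The last register plays the role of delay cell D (no adder in front of it);
    # the remaining registers belong to the adder-plus-delay cells A.
    registers = [0] * (taps - 1)
    outputs = []
    for sample in samples:
        # The common input sample addresses every memory cell at once.
        products = [cell[sample] for cell in memory_module]
        outputs.append(products[0] + (registers[0] if registers else 0))
        for k in range(taps - 1):
            incoming = registers[k + 1] if k + 1 < taps - 1 else 0
            registers[k] = products[k + 1] + incoming
    return outputs


h = [2, -1, 4]                       # hypothetical coefficients
x = [5, 0, 255, 17, 3]               # hypothetical unsigned 8-bit samples
module = build_memory_module(h, 8)
y = memory_based_fir(module, x)
```

No multiplication is performed while filtering; all products come from the tables built once for the fixed coefficients.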
| Order | Parameter | 8-bit Conventional | 8-bit Proposed | 16-bit Conventional | 16-bit Proposed | 32-bit Conventional | 32-bit Proposed |
|---|---|---|---|---|---|---|---|
| 16 | S | 48 | 100 | 287 | 290 | 726 | 597 |
| 16 | L (ns) | 15.5 | 15.5 | 15.5 | 15.5 | 15.5 | 15.5 |
| 16 | Fmax (MHz) | 460.61 | 357 | 278.3 | 260.4 | 187.02 | 212.6 |
| 32 | S | 263 | 238 | 680 | 617 | 982 | 1041 |
| 32 | L (ns) | 31.5 | 31.5 | 31.5 | 31.5 | 31.5 | 31.5 |
| 32 | Fmax (MHz) | 255.49 | 301.932 | 210.7 | 253.4 | 188.9 | 158.9 |
| 64 | S | 999 | 957 | 1484 | 1538 | 2345 | 2360 |
| 64 | L (ns) | 63.5 | 63.5 | 63.5 | 63.5 | 63.5 | 63.5 |
| 64 | Fmax (MHz) | 199.16 | 179.66 | 168.23 | 169.8 | 138.14 | 141.1 |
| 128 | S | 3096 | 3545 | 4398 | 3765 | 6064 | 5755 |
| 128 | L (ns) | 127.5 | 127.5 | 127.5 | 127.5 | 127.5 | 127.5 |
| 128 | Fmax (MHz) | 121.44 | 108.47 | 135.5 | 140.5 | 100.3 | 111.11 |

Figure 4. 2-D DFG of an N-order FIR filter.

Figure 5. 2-D systolic structure of a 16-order filter for q = p = 4.

where y_m(z) of Eq. (5) is given as follows:

$$y_m(z) = \sum_{n=0}^{p-1} h(mp+n)\, z^{-n}\, X(z) \tag{6}$$

and the recursive form of each y_m(z) is given in Eq. (7):

$$y_m(z) = X(z)\left[h(mp) + z^{-1}\left(h(mp+1) + z^{-1}\left(h(mp+2) + \cdots + z^{-1}\, h(mp+p-1) \cdots \right)\right)\right] \tag{7}$$

for m = 0, 1, 2, ..., q-1. The 2-D DFG for Eq. (5) is given in Fig. 4. Each row in the DFG represents Eq. (7) and is simply the 1-D structure of the filter. Thus all the multiplier nodes in a row are replaced by a memory module, and each adder and delay cell is replaced by cell A of the systolic structure shown in Fig. 2. The 2-D systolic structure therefore consists of rows of 1-D systolic structures, whose outputs are added in a Pipelined Adder Tree (PAT) to give the output of the filter.
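The row decomposition can be checked numerically with a short Python sketch. Eq. (5) is not reproduced in this section, so the sketch assumes it has the form Y(z) = sum over m of z^(-mp) y_m(z), which follows from splitting the sum in Eq. (4) into q blocks of p taps. The function names are hypothetical, and the adder tree is replaced by a plain sum since only the algebra is being verified.

```python
def fir_direct(h, x):
    """Reference N-tap FIR filter: y(n) = sum_k h(k) x(n-k)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def delay(signal, count):
    """Delay a sequence by `count` samples, keeping its length."""
    return ([0] * count + list(signal))[: len(signal)]


def fir_decomposed(h, x, q):
    """Split the N taps into q rows of p taps (Eq. (6)) and add the row outputs."""
    p = len(h) // q
    if p * q != len(h):
        raise ValueError("filter length must be divisible by q")
    # Row m is a p-tap filter with coefficients h(mp) ... h(mp + p - 1).
    rows = [fir_direct(h[m * p:(m + 1) * p], x) for m in range(q)]
    # Assumed form of Eq. (5): row m is delayed by m*p samples before the sum.
    aligned = [delay(row, m * p) for m, row in enumerate(rows)]
    return [sum(values) for values in zip(*aligned)]


h = list(range(1, 17))   # hypothetical 16-tap filter
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 5, -8, 9, 7, 9, 3, 2, 3, 8, 4]
y_1d = fir_direct(h, x)
y_q4 = fir_decomposed(h, x, 4)   # q = p = 4
y_q8 = fir_decomposed(h, x, 8)   # q = 8, p = 2
```

Both decomposition levels reproduce the output of the undecomposed filter, so the choice of q changes only the structure, not the filter response.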

3.2 Decomposed structure of a 16-order filter for q=4 and q=8

The decomposed structure of a 16-order filter for q=4 is given in Fig. 5. Since it is a 16-order filter, both q and p are equal to 4. As the number of adders in a row is reduced compared to the 1-D structure, the width of the adder used in cell A of the 2-D structure is also reduced. The width of the adder increases as we proceed from the first cell A to the last cell in a row. For the example considered, the width of the last adder cell required in the 1-D structure is 23 bits for an input sample size of 8 bits, whereas in the 2-D structure the adder used in the last cell is 11 bits wide. The PAT block of the structure contains a series of adders. For q=4 there are two levels of adders in the PAT block, so the width of the final adder in the PAT block is 13 bits. The reduction in the bit length of the adder also reduces the width of the register used in cell A of the structure. Thus the overall area of the decomposed structure is reduced.

Figure 6. 2-D systolic structure of a 16-order filter for q=8 and p=2.

Similarly, for the 2-D structure of a 16-order filter with q=8 and p=2, the width of the last adder used in the PAT block is 12 bits. This structure reduces the area of the filter further than the decomposition level q=4. The 2-D structure for q=8 and p=2 is given in Fig. 6. The area complexity for q=4 and q=8, for different filter orders and different input sizes, is given in Table 2. The table shows how much the area of the decomposed filter is reduced compared to the area occupied by the 1-D filter. For example, for a 128-order filter with an input length of 32 bits, the decomposed filter (q=8) achieves an 89.8% area reduction compared to the 1-D (q=1) filter (proposed design: 5755 slices at q=1 versus 587 slices at q=8). Also, comparing the area (in terms of the number of occupied slices) of the proposed systolic-architecture-based filter with that of the basic systolic-architecture-based filter, we obtain a 5 to 7 percent reduction in area.

Table 2. Comparison of area, in terms of the number of slices (S), for different input sample sizes, filter orders and decomposition levels.

| Order | Decomposition level | 8-bit Conventional | 8-bit Proposed | 16-bit Conventional | 16-bit Proposed | 32-bit Conventional | 32-bit Proposed |
|---|---|---|---|---|---|---|---|
| 32 | q=1 | 263 | 238 | 680 | 617 | 982 | 1041 |
| 32 | q=4 | 54 | 53 | 138 | 129 | 195 | 180 |
| 32 | q=8 | 02 | 02 | 47 | 29 | 74 | 65 |
| 64 | q=1 | 999 | 957 | 1484 | 1538 | 2345 | 2360 |
| 64 | q=4 | 119 | 75 | 201 | 250 | 572 | 490 |
| 64 | q=8 | 54 | 61 | 123 | 132 | 273 | 246 |
| 128 | q=1 | 3096 | 3545 | 4398 | 3765 | 6064 | 5755 |
| 128 | q=4 | 293 | 253 | 588 | 603 | 905 | 840 |
| 128 | q=8 | 133 | 114 | 344 | 316 | 623 | 587 |

The maximum frequency of operation of the filter for different orders and input lengths is also compared; the details are given in Table 3. The table shows that Fmax is higher for q=8 than for q=4 in all but one case (the proposed 64-order filter with 8-bit input). We therefore consider q=8 the best level of decomposition, as it achieves the smaller area and, in most cases, the higher maximum frequency of operation.
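The area-reduction figure quoted from Table 2 can be reproduced with a one-line calculation. The Python sketch below uses the slice counts of the proposed 128-order, 32-bit design from Table 2; the function name is hypothetical.

```python
def area_reduction_percent(slices_1d, slices_decomposed):
    """Percentage of slices saved by the decomposed filter relative to the 1-D filter."""
    return 100.0 * (slices_1d - slices_decomposed) / slices_1d


# Table 2, 128-order filter, 32-bit input, proposed design.
proposed_q1 = 5755   # slices at q=1 (1-D structure)
proposed_q4 = 840    # slices at q=4
proposed_q8 = 587    # slices at q=8

reduction_q8 = area_reduction_percent(proposed_q1, proposed_q8)
reduction_q4 = area_reduction_percent(proposed_q1, proposed_q4)
print(f"q=8: {reduction_q8:.1f}%  q=4: {reduction_q4:.1f}%")
```

The q=8 value rounds to the 89.8% stated in the text; the same function can be applied to any other row of the table.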

3.3 Structure of Pipelined adder tree for various decomposition levels

The detailed structures of the PAT for q=4 and q=8 are given in Fig. 7(a) and (b), respectively. For the decomposition level q=4, the PAT has two levels of adders; including the first level of adders in the filter block, there are three levels of adders. These adder levels affect the latency of the filter. Similarly, for q=8 there are four levels of adders: three in the PAT block and one in the filter block. The latency of the filter for different orders and decomposition levels is given in Table 4, where it is also compared with the latency of the conventional memory-based systolic FIR filter given in [19]. The latency of the proposed filter is less than that of the filter structure given in [19] for all decomposition levels. Also, in the proposed filter the PAT block remains the same for all filter orders. Hence the latency of the decomposed filter remains the same for all filter orders at a given decomposition level. This is a further advantage of the proposed structure over the filter structure proposed in [19]. The proposed decomposed structure can be used for higher-order FIR filters with less area complexity, reduced latency and maximum frequency of operation.

Figure 7. Structure of the PAT for different decomposition levels: (a) q=4; (b) q=8.
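The number of adder levels in the PAT block follows from pairwise summation of the q row outputs. The Python sketch below is a behavioural illustration only: it assumes a balanced binary tree and does not model the pipeline registers in time, and the function names are hypothetical.

```python
def pat_levels(q):
    """Number of adder levels needed to sum q row outputs pairwise."""
    levels = 0
    while q > 1:
        q = (q + 1) // 2   # each level halves the number of operands
        levels += 1
    return levels


def adder_tree_sum(values):
    """Sum `values` level by level; each pass corresponds to one tree level."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [sum(pair) for pair in pairs]
        stages += 1
    return level[0], stages


row_outputs = [4, -2, 7, 1, 0, 3, -5, 9]   # hypothetical outputs of q=8 rows
total, stages = adder_tree_sum(row_outputs)
```

For q=4 the tree has two levels and for q=8 it has three, matching the counts given in the text; the depth depends only on q, not on the filter order.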

Table 3. Comparison of Fmax (MHz) of the filter for different input lengths, filter orders and decomposition levels.

| Order | Decomposition level | 8-bit Conventional | 8-bit Proposed | 16-bit Conventional | 16-bit Proposed | 32-bit Conventional | 32-bit Proposed |
|---|---|---|---|---|---|---|---|
| 32 | q=1 | 255.4 | 301.9 | 210.7 | 253 | 188.9 | 158.9 |
| 32 | q=4 | 291.8 | 256.5 | 139.4 | 106.4 | 101.7 | 105.3 |
| 32 | q=8 | 709.7 | 709.7 | 395.2 | 322.1 | 286.8 | 288 |
| 64 | q=1 | 199.1 | 179.6 | 168.2 | 169.8 | 138.1 | 142.4 |
| 64 | q=4 | 121.9 | 134 | 101 | 100.4 | 71.2 | 74.3 |
| 64 | q=8 | 182.9 | 130 | 122 | 114.6 | 100 | 101.1 |
| 128 | q=1 | 121.4 | 108.4 | 138.6 | 140.5 | 100.3 | 111.1 |
| 128 | q=4 | 92.8 | 99.9 | 65.5 | 64.4 | 50.6 | 49 |
| 128 | q=8 | 124.5 | 128.3 | 100.2 | 101.3 | 77.6 | 80.4 |

Table 4. Comparison of latency of the filter for different orders and decomposition levels.

| Decomposition level | Order 16 Conventional [19] | Order 16 Proposed | Order 32 Conventional [19] | Order 32 Proposed | Order 64 Conventional [19] | Order 64 Proposed | Order 128 Conventional [19] | Order 128 Proposed |
|---|---|---|---|---|---|---|---|---|
| q=1 | 16 | 15.5 | 32 | 31.5 | 64 | 63.5 | 128 | 127.5 |
| q=4 | 6 | 2.5 | 10 | 2.5 | 18 | 2.5 | 34 | 2.5 |
| q=8 | 5 | 3.5 | 7 | 3.5 | 11 | 3.5 | 19 | 3.5 |

4. Conclusion

Memory-based 1-D and 2-D systolic architectures for FIR filters are derived from the data flow graph of the filter. The conventional memory-based multiplier used in the filter is replaced by our proposed memory-based multiplier. The proposed filter is implemented on a Xilinx Virtex-7 FPGA device using VHDL. Key performance metrics, namely the number of slices, the latency and the maximum frequency of operation, are compared with those of the conventional memory-based systolic FIR filter architecture given in [19]. Because of the PAT structure used in the 2-D structure of this work, the same latency is achieved for all filter orders. The latency achieved by the proposed 2-D structure is much lower at every decomposition level than that of the structure proposed in [19]. The decomposed structure proposed for q=8 is a hardware-efficient structure with minimum area, low latency and a high maximum frequency of operation.

References

[1] K. K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. New York: John Wiley & Sons, Inc. (1999).
[2] S. K. Mitra. Digital Signal Processing: a Computer Based Approach. Boston: McGraw-Hill (2006).
[3] J. G. Proakis and D. G. Manolakis. Digital Signal Processing: Principles, Algorithms and Applications. Upper Saddle River, NJ: Prentice-Hall (1996).
[4] G. Mirchandani, R. L. Zinser, Jr., and J. B. Evans, “A new adaptive noise cancellation scheme in the presence of crosstalk [speech signals],” IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, 39(10), 681–694 (1995).
[5] D. Xu and J. Chiu, “Design of a high-order FIR digital filtering and variable gain ranging seismic data acquisition system,” in IEEE Proceedings Southeastcon ’93, p. 6 (1993).
[6] H. T. Kung, “Why systolic architectures?,” IEEE Computer, 15(1), 37–45 (1982).
[7] R. Wyrzykowski and S. Ovramenko, “Flexible systolic architecture for VLSI FIR filters,” IEEE Proceedings-Computers and Digital Techniques, 139(2), 170–172 (1992).
[8] B. K. Mohanty and P. K. Meher, “Cost-effective novel flexible cell level systolic architecture for high throughput implementation of 2-D FIR filters,” IEEE Proceedings-Computers and Digital Techniques, 143(5), 436–439 (1996).
[9] B. K. Mohanty and P. K. Meher, “Novel flexible systolic mesh architecture for parallel VLSI implementation of finite digital convolution,” IETE Journal of Research, 44(6), 261–266 (1988).
[10] P. K. Meher, “High throughput and low-latency implementation of bit-level systolic architecture for 1D and 2D digital filters,” IEEE
[21] H.R. Lee, Jen and C.M. Liu, “On the design automation of the memory-based VLSI architectures for FIR filters,” IEEE Trans. Consumer Electronics, 39(3), 619–629 (1993).
[22] P. K. Meher, “New approach to look-up-table design and memory-based realization of FIR digital filter,” IEEE Transactions on Circuits and Systems I: Regular Papers, 57(3), 592–603 (2010).
[23] P. K. Meher, “LUT optimization for memory-based computation,” IEEE Transactions on Circuits and Systems II: Express Briefs, 57(4), 285–289 (2010).
[24] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic,” IEEE Trans. Signal Process., 56(7), 3009–3017 (2008).
[25] C. S. Vinitha and R. K. Sharma, “A Novel Technique to optimize the LUT used in Memory based filter,” in Proceedings of IEEE International Conference in Electrical, Electronics, Computers, Communication, Mechanical and Computing (2018).