
476

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 62, NO. 5, MAY 2015

An Improved Two-Step Binary Logarithmic Converter for FPGAs

Mandeep Chaudhary and Peter Lee, Member, IEEE

Abstract—This brief describes an improved binary linear-to-log (Lin2Log) conversion algorithm that has been optimized for implementation on a field-programmable gate array. The algorithm is based on a piecewise linear (PWL) approximation of the transform curve combined with a PWL approximation of a scaled version of a normalized segment error. The architecture presented achieves 23 bits of fractional precision while using just one 18K-bit block RAM (BRAM), and synthesis results indicate operating frequencies of 93 and 110 MHz when implemented on Xilinx Spartan3 and Spartan6 devices, respectively. Memory requirements are reduced by exploiting the symmetrical properties of the normalized error curve, allowing it to be implemented more efficiently using the combinatorial logic available in the reconfigurable fabric instead of inefficiently occupying a second BRAM. The same principles can also be adapted to applications where higher accuracy is needed.

Index Terms—Computer arithmetic, function evaluation, piecewise linear (PWL) approximation, table-based methods, uniform segmentation.

I. INTRODUCTION

LOGARITHMIC signal processing (LSP) has been proposed as a viable alternative to fixed- and floating-point binary signal processing for many years. The properties of logarithmic arithmetic make it particularly suitable for applications where a high dynamic range and relatively low absolute accuracy can be used, or when logarithms simplify the arithmetic processing needed in computationally intensive algorithms [1], [2]. LSP can also be used in low-power very-large-scale integration implementations [3], which can be useful when the cost of conversion to and from the logarithmic domain can be justified. Recently, there has been increased interest in the use of LSP techniques on field-programmable gate arrays (FPGAs): a number of new conversion algorithms have been developed, and existing conversion algorithms have been reevaluated for efficient implementation using FPGA resources [4], [5]. This brief presents an implementation of an algorithm previously presented in [6] that has been adapted

Manuscript received July 21, 2014; revised October 14, 2014; accepted
December 16, 2014. Date of publication December 25, 2014; date of current version April 23, 2015. This brief was recommended by Associate
Editor S. Hu.
The authors are with the School of Engineering and Digital Arts, University
of Kent, Canterbury CT2 7NT, U.K. (e-mail: MC539@kent.ac.uk; P.Lee@kent.
ac.uk).
Digital Object Identifier 10.1109/TCSII.2014.2386252

to optimize its utilization of the available resources on FPGAs. The results of the optimization are presented for the Xilinx Spartan3 and Spartan6 families, and they are compared with the original solution and with equivalent results published by other researchers on similar hardware platforms [7], [8].

This brief begins with a review of previous work in Section II. Section III reviews the Larson algorithm, and Section IV presents the improved algorithm. The results of the improved architecture are presented in Section V. This brief concludes with some suggestions for further work.
II. PREVIOUS WORK
The binary logarithm of a number x can be defined using a four-tuple of the form

x = 2^{log2 x} = (1 − Z)(−1)^S 2^I 2^{0.F}    (1)

where S is the sign bit, I and F are the integer and fractional (or mantissa) parts, respectively, of the logarithm base 2,
and Z is used to represent the special case of x = 0. The
derivation of Z, S, and I is straightforward and not discussed
further in this brief. However, conversion to and from the log domain requires the approximation of the nonlinear terms log2(1.F) for linear-to-log (Lin2Log) conversion and 2^{0.F} for log-to-linear (Log2Lin) conversion. This brief concentrates on the conversion of a normalized number 1 ≤ 1.F < 2, where F represents the fractional component, although the approach described here can also be applied to an approximation of 2^{0.F}.
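As an illustration of the decomposition in (1), the following Python sketch extracts the four-tuple (Z, S, I, F) and verifies the identity. This is a software reference model only, not the hardware datapath described in this brief; the function name and the float-based arithmetic are illustrative assumptions.

```python
import math

def lin2log_tuple(x):
    """Decompose x into the (Z, S, I, F) four-tuple of (1):
    Z flags x == 0, S is the sign bit, and I and F are the integer
    and fractional parts of log2|x|."""
    if x == 0:
        return (1, 0, 0, 0.0)        # special case: Z = 1
    s = 1 if x < 0 else 0
    log = math.log2(abs(x))
    i = math.floor(log)
    return (0, s, i, log - i)        # 0 <= F < 1

# Check the identity x = (1 - Z)(-1)^S * 2^I * 2^(0.F)
z, s, i, f = lin2log_tuple(-12.5)
assert abs((1 - z) * (-1) ** s * 2 ** i * 2 ** f - (-12.5)) < 1e-9
```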
Early algorithms for approximating log2 (1.F ) were based on
improvements to the simple linear interpolation first proposed
by Mitchell in 1962 [9]. Subsequent papers have proposed
improvements to the basic Mitchell architecture, which have
been achieved through the use of more complex curve fitting
and/or error correction techniques that only require simple
arithmetic components.
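Mitchell's original scheme can be sketched in a few lines of Python (an illustrative reference model; this code does not appear in the brief). Within each octave, the curve log2(1.F) is simply replaced by the straight line F:

```python
import math

def mitchell_log2(x):
    """Mitchell's approximation [9]: log2(1.F) ~= F, so
    log2(x) ~= I + F for x = 2^I * (1 + F)."""
    assert x > 0.0
    i = math.floor(math.log2(x))   # integer part from the leading-one position
    f = x / 2.0 ** i - 1.0         # mantissa fraction, 0 <= f < 1
    return i + f

# Exact at powers of 2; the worst case is near 1.F ~= 1.44, where the
# error approaches 0.086, which motivates the later correction schemes.
```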
For higher accuracy, lookup tables (LUTs) have been used.
However, for resolutions of 16 bits and beyond, the direct
use of LUTs becomes prohibitively large, and many methods
for reducing the size of LUTs while maintaining conversion
accuracy have been proposed and published. Most are based
on piecewise linear (PWL) or piecewise polynomial (PWP)
approximation techniques that reduce the size of the LUTs by
using more complex arithmetic components (e.g., multipliers
and adders) [10], [11]. Chaudhary and Lee [6] revisited a two-step algorithm first proposed in a patent attributed to Larson

1549-7747 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Larson first-stage error curves for 128 segments.

[12] for a binary Lin2Log converter and described its efficient


implementation on an FPGA. The architecture was compared
with similar architectures based on uniform or nonuniform
PWL or PWP methods that have been proposed recently. This
brief presents a further improvement to the algorithm presented
in [6]. Section III outlines the characteristics of the Larson
algorithm, and Section IV describes the proposed improvements to the Larson architecture. Section V provides empirical
data about the FPGA implementation of these improvements
and compares resource utilization with some alternative solutions that have been published recently.

III. REVIEW OF LARSON ALGORITHM

The Larson algorithm [12] uses a combination of two PWL approximations to convert a normalized binary input 1 ≤ 1.F < 2 into a binary logarithm 0 ≤ log2(1.F) < 1 such that

alog2(1.F) ≈ log2(1.F)    (2)

where F represents the fraction bits of the normalized binary input as

F = F_{k−1} 2^{−1} + F_{k−2} 2^{−2} + · · · + F_0 2^{−k} = Σ_{i=1}^{k} F_{k−i} 2^{−i}.    (3)

Larson's algorithm is based on an extension of the work initially presented by Kmetz [13] and Maenner [14]. Kmetz [13] used a LUT to approximate the difference (or error) ε between alog2(1.F) and log2(1.F), i.e.,

ε = log2(1.F) − alog2(1.F)    (4)

where the n least significant bits of 1.F are used as the address of the LUT containing 2^n values of ε. F is partitioned into p = 2^m segments using the m most significant bits of F. Each segment contains 2^{k−m} elements, and for each of the p = 2^m segments, a unique pair of PWL coefficients A_p and B_p is stored in the LUT. Thus, the output approximation becomes

alog2(1.F) = A_p + B_p Σ_{i=m+1}^{k} F_{k−i} 2^{−i} + ε.    (5)

Maenner extended this work further [14] by combining a LUT of the error curve, as proposed by Kmetz, with a PWL approximation of the normalized log function. The error curve is used in each PWL segment to reduce the residual error caused by the linear interpolation between the vertices of the segment. However, Arnold et al. [15] pointed out that, although the error curve in each segment is similar, it is not identical, and this limited the achievable accuracy of the Maenner approach.

Larson improved the Maenner algorithm by using a simple PWL approximation of a normalized residual error curve derived from all the error curves produced in each segment of the PWL approximation of the normalized log2 function. Fig. 1 shows the curves produced in a first-stage PWL approximation. All the curves have a similar shape but different magnitudes. The Larson error curve is an average of all the scaled error curves. Hence, the Larson algorithm stores a PWL approximation of the normalized error curve and a scaling factor. The scaling factor is determined by the segment of the PWL approximation of the log2 function in which the input data are located. The approximation function now becomes

alog2(1.F) = A_p + B_p Σ_{i=m+1}^{k} F_{k−i} 2^{−i} + S_p ε̂    (6)

Fig. 2. Proposed Larson algorithm architecture [12].
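The segment addressing and chord evaluation behind (5) can be modeled in Python as follows. The widths K and M and the endpoint-fitted coefficients are illustrative assumptions (not the 23-bit design of this brief), and the error term ε is omitted:

```python
import math

K, M = 10, 4                     # illustrative fraction / segment-select widths
P = 1 << M                       # p = 2^M uniform segments

# Endpoint-fitted chord coefficients A_p (offset) and B_p (slope)
A = [math.log2(1 + p / P) for p in range(P)]
B = [math.log2(1 + (p + 1) / P) - math.log2(1 + p / P) for p in range(P)]

def pwl_log2(frac_bits):
    """First-stage chord approximation of log2(1.F): the M MSBs of F
    select the segment, the remaining K-M bits interpolate within it."""
    p = frac_bits >> (K - M)                  # segment index from MSBs
    rem = frac_bits & ((1 << (K - M)) - 1)    # within-segment offset bits
    return A[p] + B[p] * (rem / (1 << (K - M)))
```

With 16 segments, the worst-case chord error is on the order of 7 × 10^-4, which is exactly the residual that the second stage then attacks.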

TABLE I
LUT COEFFICIENT BITS [6]

Fig. 3. Normalized error curve for each segment stored in LUT2.

where S_p is a unique factor in each segment, used to scale a PWL approximation ε̂ of the error ε in each of the p segments, i.e.,

ε̂ = A′_j + B′_j Σ_{i=m+1}^{k} F_{k−i} 2^{−i},    j = 1, …, n.    (7)
The algorithm is implemented in two stages. Stage 1 contains a PWL approximation of the log function; its coefficient LUT contains values for A_p and B_p, as well as the scaling factor S_p. Stage 2 is a PWL approximation of the error curve and contains an additional multiplier to scale the curve.

It should be noted that the error curve for LUT2 is generated using a second-stage PWL, as shown in Fig. 2. The scaled version of the normalized error curve is then added to the first-stage PWL approximation of the log function to reduce the overall conversion error. LUT1 is used to store the zeroth- (A) and first-order (B) coefficients of the PWL approximation together with a scaling factor (S), which is used to multiply the normalized error curve approximated using the PWL coefficients (A′ and B′) stored in LUT2.

Although Larson [12] did not describe how this normalized error curve was derived, Chaudhary and Lee [6] have analyzed and verified some simple methods for deriving a curve that produces the minimum root-mean-square error approximation. This was assessed for a normalized input with 23 bits of fractional precision. Each configuration produced a conversion error of less than 1 unit of last place (ULP), where 1 ULP = 2^{−23} = 1.19 × 10^{−7}. The results presented in [6] for different LUT sizes for a 23-bit conversion are reproduced in Table I for reference.
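The sub-ULP assessment described above can be reproduced in outline with a short script; the sampling density and the function under test are illustrative assumptions:

```python
import math

FRAC_BITS = 23
ULP = 2.0 ** -FRAC_BITS          # 1 ULP = 2^-23 ~= 1.19e-7

def max_error_ulps(approx, samples=1 << 16):
    """Sample |approx(1.F) - log2(1.F)| over the normalized range
    1 <= 1.F < 2 and return the worst case in ULPs."""
    worst = 0.0
    for n in range(samples):
        x = 1.0 + n / samples
        worst = max(worst, abs(approx(x) - math.log2(x)))
    return worst / ULP

# A converter meeting the brief's target would return < 1.0 here;
# Mitchell's uncorrected straight line is off by roughly 7e5 ULPs.
```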

Fig. 4. Residual error produced after approximating the normalized curve (as shown in Fig. 3) using a symmetrical approximation.

IV. IMPROVED ARCHITECTURE


Although the results in Table I indicate that the 7:16::7:9
configuration is the most efficient in terms of total memory
used, it does not make the most efficient use of resources
when implemented on an FPGA. In [6], the LUTs have been
implemented using dedicated block RAM (BRAM) elements
that are embedded in the FPGA fabric. These BRAM cells have
a granularity of 18K bits for Xilinx Spartan3 devices and 9K
bits for the more modern Spartan6 devices. The aforementioned
optimal implementation uses just 43% of the available capacity
of a BRAM element to implement LUT1 and less than 19% for
LUT2. Although significant numbers of BRAM cells exist on
modern FPGA devices, they still represent a limited resource.
The solution proposed here is to exploit, where possible, the
distributed RAM elements available within the configurable
logic block (CLB) of the FPGA fabric to reduce the number of BRAM elements. These memory blocks have a much finer granularity (16 × 1 bits for Spartan3 or 64 × 1 bits for Spartan6) and are embedded in the configurable logic fabric.
In Table I, it is observed that using an 8:15 partition for LUT1
would increase the utilization of a BRAM element to 85% while
reducing the size of LUT2 to just 1728 bits. This would result in
just 9.4% utilization of a second BRAM element for LUT2 on a
Spartan3 device. The inefficient utilization of BRAM for LUT2
indicates that alternative implementations could be more effective. The normalized error curve stored in LUT2 (as shown in Fig. 3) has a significant degree of symmetry about its apex. This symmetry can be used to reduce the size of LUT2 by a factor of 2. Similar techniques have been used to reduce the size of LUTs for approximating sine and cosine functions in direct digital synthesis. Analysis of the curve in Fig. 3 shows that it is not completely symmetrical; a symmetrical approximation therefore leaves a small residual error, shown in Fig. 4, between the normalized error curve and its symmetrical approximation. This residual error can be reduced using a method similar to that in [13]. Fig. 5 shows the residual error produced when the symmetrical component is removed.

Fig. 5. Residual error in the normalized error curve approximation (as shown in Fig. 3).

Fig. 6. Improved LUT2 architecture.
However, when a simple symmetrical approximation of the normalized error curve is used, the accuracy of the conversion algorithm degrades below 23 bits of fractional precision. The residual error shown in Fig. 4 also has symmetrical properties; hence, an approximation to the residual error can also be stored in the same LUT (as shown in Fig. 6). When the normalized error curve's symmetrical property is combined with the symmetrical residual error (see Fig. 4), the remaining error in the normalized error curve (as shown in Fig. 7) is again reduced to less than 1.19 × 10^{−7} (which is 1 ULP for a 23-bit fractional input). Fig. 7(b) shows an even distribution of the overall approximation error achieved by the proposed algorithm.
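The address folding that halves LUT2 can be sketched as follows. This is an illustrative model of the XOR-based mirroring only; the table contents below are a stand-in symmetric curve, not the actual LUT2 data:

```python
def make_folded_lut(curve, n_bits):
    """Store only the first half of a curve symmetric about its apex."""
    return [curve(a) for a in range(1 << (n_bits - 1))]

def folded_read(lut, addr, n_bits):
    """Mirror upper-half addresses into the stored lower half; in
    hardware this is the XOR gates at the LUT address inputs."""
    half_bits = n_bits - 1
    low = addr & ((1 << half_bits) - 1)
    if (addr >> half_bits) & 1:           # upper half: reflect the address
        low ^= (1 << half_bits) - 1
    return lut[low]

# For a curve exactly symmetric about its apex, the half-size LUT
# reproduces the full table; any asymmetry becomes the residual of Fig. 4.
curve = lambda a: a * (255 - a)           # symmetric about 127.5
lut = make_folded_lut(curve, 8)
assert all(folded_read(lut, a, 8) == curve(a) for a in range(256))
```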
V. RESULTS

The reduced-memory algorithm has been implemented using the Xilinx ISE Design Suite V13.2 on Xilinx XC3S200 Spartan3 and XC6SLX16 Spartan6 FPGA devices. The algorithm is implemented both with no pipelining and with maximum pipelining stages to show the variation in achievable operating frequency. The results are summarized in Tables II and III. Using these techniques, the size of LUT2 is reduced from 1728 bits to just 1184 bits, a reduction of 32%, at the cost of an additional adder and XOR gates at the LUT inputs and outputs. The additional logic reduces the number of CLB slices needed to implement the LUT by a further 5% compared with using the CLB slices for the LUT alone.

Fig. 7. (a) Overall error obtained in the approximation. (b) Histogram of the approximated error distribution.
Table IV compares the results in Table III with recent work
[7], [8] and the original algorithm for equivalent conversions
with 23 bits of fractional precision. This comparison shows
that the 50% reduction in the number of BRAM elements used
compared with the work in [6] is offset by a modest increase of
4.5% and 0.8% in the total number of slices on Spartan3 and
Spartan6 FPGA devices, respectively.
VI. CONCLUSION AND FUTURE WORK
This brief has demonstrated an improvement to an efficient
two-stage PWL implementation of a binary Lin2Log converter.
The architecture presented is based on an algorithm first described by Larson [12]. The new architecture reduces the size
of the LUT in the second stage, making it suitable for efficient


TABLE II
LUT COEFFICIENT BITS

TABLE III
DEVICE UTILIZATION STATISTICS

TABLE IV
UTILIZATION COMPARISON

implementation using distributed memory elements instead of BRAM on Xilinx FPGAs. This method is even more effective on newer Xilinx devices, which have increased distributed memory capacities compared with earlier generations. Future work will investigate the use of these methods at higher precision and whether they can be applied when higher-order PWP approximations are used. They will also be applied to other functions such as 2^x.
REFERENCES
[1] F. Sheikh et al., "A 2.05 GVertices/s 151 mW lighting accelerator for 3D graphics vertex and pixel shading in 32 nm CMOS," in IEEE ISSCC Dig. Tech. Papers, Feb. 2012, pp. 184–186.
[2] M. Arnold, "Reduced power consumption for MPEG decoding with LNS," in Proc. IEEE Int. Conf. Appl.-Specific Syst., Architect., Process., 2002, pp. 65–75.
[3] G. Yemiscioglu and P. Lee, "16-bit clocked adiabatic logic (CAL) logarithmic signal processor," in Proc. IEEE 55th Int. MWSCAS, Aug. 2012, pp. 113–116.
[4] S.-F. Hsiao, H.-J. Ko, and C.-S. Wen, "Two-level hardware function evaluation based on correction of normalized piecewise difference functions," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 5, pp. 292–296, May 2012.
[5] T. Sasao, S. Nagayama, and J. Butler, "Numerical function generators using LUT cascades," IEEE Trans. Comput., vol. 56, no. 6, pp. 826–838, Jun. 2007.
[6] M. Chaudhary and P. Lee, "Two-stage logarithmic converter with reduced memory requirements," IET Comput. Digit. Tech., vol. 8, no. 1, pp. 23–29, Jan. 2014.
[7] M. Haselman et al., "A comparison of floating point and logarithmic number systems for FPGAs," in Proc. 13th Annu. IEEE Symp. FCCM, Apr. 2005, pp. 181–190.
[8] H. Fu, O. Mencer, and W. Luk, "Optimizing logarithmic arithmetic on FPGAs," in Proc. 15th Annu. IEEE Symp. FCCM, Apr. 2007, pp. 163–172.
[9] J. N. Mitchell, "Computer multiplication and division using binary logarithms," IRE Trans. Electron. Comput., vol. EC-11, no. 4, pp. 512–517, Aug. 1962.
[10] F.-S. Lai, "A 10-ns hybrid number system data execution unit for digital signal processing systems," IEEE J. Solid-State Circuits, vol. 26, no. 4, pp. 590–599, Apr. 1991.
[11] M. Arnold and M. Winkel, "A single-multiplier quadratic interpolator for LNS arithmetic," in Proc. ICCD, 2001, pp. 178–183.
[12] K. Larson, "Floating point to logarithm converter," U.S. Patent 5 365 465, Nov. 15, 1994. [Online]. Available: https://www.google.com/patents/US5365465
[13] G. Kmetz, "In a digital computation system," U.S. Patent 4 583 180, Apr. 15, 1986. [Online]. Available: http://www.google.com/patents/US4583180
[14] R. Maenner, "A fast integer binary logarithm of large arguments," IEEE Micro, vol. 7, no. 6, pp. 41–45, Dec. 1987.
[15] M. Arnold, T. Bailey, and J. Cowles, "Error analysis of the Kmetz/Maenner algorithm," J. VLSI Signal Process. Syst. Signal, Image Video Technol., vol. 33, no. 1/2, pp. 37–53, 2003. [Online]. Available: http://link.springer.com/article/10.1023%2FA%3A1021189701352
