
Multiplying Floating Point Numbers

Introduction
We'll do multiplication using the one byte floating point representation discussed in the
other class notes. IEEE 754 single precision has so many bits to work with that it's
simply easier to explain how floating point multiplication works using a small float
representation.
Multiplication is simple. Suppose you want to multiply two floating point
numbers, X and Y.
Here's how to multiply floating point numbers.
1. First, convert the two representations to scientific notation. Thus, we explicitly
represent the hidden 1.
2. Let x be the exponent of X. Let y be the exponent of Y. The resulting exponent
(call it z) is the sum of the two exponents. z may need to be adjusted after the
next step.
3. Multiply the mantissa of X by the mantissa of Y. Call this result m.
4. If m does not have a single 1 to the left of the radix point, then adjust the radix
point so it does, and adjust the exponent z to compensate.
5. Add the sign bits, mod 2, to get the sign of the resulting multiplication.
6. Convert back to the one byte floating point representation, truncating bits if
needed.
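The six steps can be sketched in Python. This is a minimal sketch under the assumptions of these notes (1 sign bit, a 4-bit exponent with a bias of 7, and a 3-bit fraction); the function name fp8_multiply and the integer scaling of the mantissas are illustrative choices, not part of the notes:

```python
BIAS = 7  # exponent bias of the 4-bit exponent field

def fp8_multiply(sign_x, exp_x, frac_x, sign_y, exp_y, frac_y):
    # Step 1: make the hidden 1 explicit; a mantissa 1.fff is held as
    # the 4-bit integer 1fff (i.e., scaled by 2^3).
    mx = 0b1000 | frac_x
    my = 0b1000 | frac_y

    # Step 2: the resulting exponent is the sum of the unbiased exponents.
    z = (exp_x - BIAS) + (exp_y - BIAS)

    # Step 3: multiply the mantissas; m is now scaled by 2^6.
    m = mx * my

    # Step 4: renormalize so a single 1 sits left of the radix point,
    # adjusting the exponent to compensate.
    while m >= 2 << 6:        # i.e., while the mantissa is 2.0 or more
        m >>= 1
        z += 1

    # Step 5: add the sign bits mod 2 (same as XOR).
    sign = sign_x ^ sign_y

    # Step 6: truncate back to 3 fraction bits and re-bias the exponent.
    frac = (m >> 3) & 0b111
    return sign, z + BIAS, frac

# The worked example below: X = 0 1001 010, Y = 0 0111 110.
print(fp8_multiply(0, 0b1001, 0b010, 0, 0b0111, 0b110))
# -> (0, 10, 0): sign 0, exponent 1010, fraction 000
```

The truncation in step 6 simply drops low bits; real hardware would round instead.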
Example
Let's multiply the following two numbers:
Variable   sign   exponent   fraction
X          0      1001       010
Y          0      0111       110
Here are the steps again:

1. First, convert the two representations to scientific notation. Thus, we
explicitly represent the hidden 1.
In this case, X is 1.01 x 2^2 and Y is 1.11 x 2^0.
2. Let x be the exponent of X. Let y be the exponent of Y. The resulting exponent
(call it z) is the sum of the two exponents. z may need to be adjusted after the
next step.
For now, the resulting exponent is 2 + 0 = 2.
3. Multiply the mantissa of X by the mantissa of Y. Call this result m.
Multiplying 1.01 by 1.11 results in 10.0011.
4. If m does not have a single 1 to the left of the radix point, then adjust the radix
point so it does, and adjust the exponent z to compensate.
Now we have to renormalize 10.0011 to 1.00011 and increase the exponent by
1, to 3.
5. Add the sign bits, mod 2, to get the sign of the resulting multiplication.
The sign bit is 0 + 0 = 0.
6. Convert back to the one byte floating point representation, truncating bits
if needed.
We need to truncate 1.00011 x 2^3 to 1.000 x 2^3 and convert.
Product    sign   exponent   fraction
X*Y        0      1010       000
Negative Values
Unlike floating point addition, negative values are simple to take care of in floating
point multiplication. Treat the sign bit as a 1-bit unsigned value, and add modulo 2.
This is the same as XORing the sign bits.
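A quick sketch confirms that mod-2 addition and XOR of sign bits always agree (the loop is just our illustration):

```python
# Adding sign bits modulo 2 gives the same result as XOR for every
# combination of the two 1-bit inputs.
for sx in (0, 1):
    for sy in (0, 1):
        assert (sx + sy) % 2 == sx ^ sy
print("mod-2 addition of sign bits matches XOR")
```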
Bias

Does the bias representation help us in floating point multiplication? No: it's more of
a pain, because we need to sum the exponents, and adding two exponents in biased
notation counts the bias twice. The hardware either needs to subtract the bias once
from the sum, or convert the exponents to two's complement, add, and convert back.
Thus, the bias makes it more challenging to add exponents.
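The exponent arithmetic can be shown in a short sketch (assuming the bias of 7 used for the 4-bit exponent field in these notes; add_biased_exponents is our own name):

```python
BIAS = 7  # bias of the 4-bit exponent field

def add_biased_exponents(bx, by):
    # (ex + BIAS) + (ey + BIAS) counts the bias twice, so we must
    # subtract it once to get the correctly biased sum (ex + ey) + BIAS.
    return bx + by - BIAS

# Exponents from the multiplication example: 1001 (i.e. 2) and 0111 (i.e. 0).
print(bin(add_biased_exponents(0b1001, 0b0111)))  # -> 0b1001, i.e. 2
```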
Summary
Multiplying two floating point values isn't so difficult, at least, if you're mostly
interested in understanding how it works, rather than developing hardware to do the
multiplication.
Multiplying floating point values requires you to add the exponents of the two values,
multiply the mantissas, then renormalize that result, adjusting the exponent if
necessary. Finally, you compute the resulting sign bit by XORing the two sign bits.
Multiplying in real hardware also involves rounding and dealing with overflow and underflow.
Our goal is to be able to do the multiplication on paper, just to get an idea of what's
going on.

Adding Floating Point Numbers


Introduction
We'll do addition using the one byte floating point representation discussed in the
other class notes. IEEE 754 single precision has so many bits to work with that it's
simply easier to explain how floating point addition works using a small float
representation.
Addition is simple. Suppose you want to add two floating point numbers, X and Y.
For sake of argument, assume the exponent in Y is less than or equal to the exponent
in X. Let the exponent of Y be y and let the exponent of X be x.
Here's how to add floating point numbers.
1. First, convert the two representations to scientific notation. Thus, we explicitly
represent the hidden 1.
2. In order to add, we need the exponents of the two numbers to be the same. We
do this by rewriting Y. This will result in Y no longer being normalized, but its
value is equivalent to the normalized Y.
Add x - y to Y's exponent. Shift the radix point of the mantissa
(significand) of Y left by x - y to compensate for the change in exponent.
3. Add the two mantissas of X and the adjusted Y together.
4. If the sum in the previous step does not have a single 1 to the left of the
radix point, then adjust the radix point and exponent until it does.
5. Convert back to the one byte floating point representation.
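The five steps can be sketched in Python for non-negative operands. This is a minimal sketch under the format assumed in these notes (4-bit exponent with a bias of 7, 3-bit fraction); fp8_add and the extra-precision scaling are illustrative choices:

```python
BIAS = 7  # exponent bias of the 4-bit exponent field

def fp8_add(exp_x, frac_x, exp_y, frac_y):
    # Make X the operand with the larger exponent.
    if exp_y > exp_x:
        exp_x, frac_x, exp_y, frac_y = exp_y, frac_y, exp_x, frac_x

    # Step 1: make the hidden 1 explicit (1.fff held as the integer 1fff).
    mx = 0b1000 | frac_x
    my = 0b1000 | frac_y

    # Step 2: align Y by shifting its mantissa right by the exponent gap.
    # Work with 4 extra bits of precision so shifted bits survive until
    # the final truncation.
    shift = exp_x - exp_y
    mx <<= 4
    my = (my << 4) >> shift

    # Step 3: add the aligned mantissas; the exponent is X's for now.
    m = mx + my
    z = exp_x - BIAS

    # Step 4: renormalize if the sum reached 2.0 or more.
    while m >= 2 << 7:        # mantissas here are scaled by 2^7
        m >>= 1
        z += 1

    # Step 5: truncate to 3 fraction bits and re-bias the exponent.
    frac = (m >> 4) & 0b111
    return z + BIAS, frac

# Example 1 below: X = 0 1001 110, Y = 0 0111 000.
print(fp8_add(0b1001, 0b110, 0b0111, 0b000))
# -> (10, 0): exponent 1010, fraction 000
```

As in the notes, extra bits are truncated rather than rounded.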
Example 1
Let's add the following two numbers:
Variable   sign   exponent   fraction
X          0      1001       110
Y          0      0111       000

Here are the steps again:


1. First, convert the two representations to scientific notation. Thus, we
explicitly represent the hidden 1.
In normalized scientific notation, X is 1.110 x 2^2, and Y is 1.000 x 2^0.
2. In order to add, we need the exponents of the two numbers to be the same.
We do this by rewriting Y. This will result in Y no longer being normalized,
but its value is equivalent to the normalized Y.
Add x - y to Y's exponent. Shift the radix point of the mantissa
(significand) of Y left by x - y to compensate for the change in exponent.
The difference of the exponents is 2. So, add 2 to Y's exponent, and shift the
radix point left by 2. This results in 0.0100 x 2^2. This is still equivalent to the
old value of Y. Call this readjusted value Y'.
3. Add the two mantissas of X and the adjusted Y' together.
We add 1.110 to 0.010 (both in binary). The sum is 10.0. The exponent is still the
exponent of X, which is 2.
4. If the sum in the previous step does not have a single 1 to the left of
the radix point, then adjust the radix point and exponent until it does.
In this case, the sum, 10.0, has two bits to the left of the radix point. We need to
move the radix point left by 1, and increase the exponent by 1 to compensate.
This results in 1.000 x 2^3.
5. Convert back to the one byte floating point representation.
Sum    sign   exponent   fraction
X+Y    0      1010       000

Example 2
Let's add the following two numbers:
Variable   sign   exponent   fraction
X          0      1001       110
Y          0      0110       110
Here are the steps again:


1. First, convert the two representations to scientific notation. Thus, we
explicitly represent the hidden 1.
In normalized scientific notation, X is 1.110 x 2^2, and Y is 1.110 x 2^-1.
2. In order to add, we need the exponents of the two numbers to be the same.
We do this by rewriting Y. This will result in Y no longer being normalized,
but its value is equivalent to the normalized Y.
Add x - y to Y's exponent. Shift the radix point of the mantissa
(significand) of Y left by x - y to compensate for the change in exponent.
The difference of the exponents is 3. So, add 3 to Y's exponent, and shift the
radix point of Y left by 3. This results in 0.00111 x 2^2. This is still equivalent to
the old value of Y. Call this readjusted value Y'.
3. Add the two mantissas of X and the adjusted Y' together.
We add 1.110 to 0.00111 (both in binary). The sum is 1.11111. The exponent is still
the exponent of X, which is 2.
4. If the sum in the previous step does not have a single 1 to the left of
the radix point, then adjust the radix point and exponent until it does.
In this case, the sum, 1.11111, already has a single 1 to the left of the radix
point, so the sum is normalized. We do not need to adjust anything.
The result is 1.11111 x 2^2.
5. Convert back to the one byte floating point representation.
We only have 3 bits to represent the fraction. However, there are 5 fraction bits
in our answer. It looks like we should round, and real floating point
hardware would do rounding.

However, for simplicity, we're going to truncate the additional two bits. After
truncating, we get 1.111 x 2^2. We convert this back to floating point.
Sum    sign   exponent   fraction
X+Y    0      1001       111

This example illustrates what happens when the exponents are separated by too much. In
fact, if the exponents differ by 4 or more, then effectively you are adding 0 to the
larger of the two numbers.
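This cutoff can be checked with a small sketch (holding mantissas as integers with a few extra working bits is our own illustrative choice):

```python
# X = 1.000 x 2^4 and Y = 1.111 x 2^0: the exponent gap is 4.
mx = 0b1000 << 4          # X's mantissa 1.000, with 4 extra working bits
my = (0b1111 << 4) >> 4   # Y's mantissa 1.111, shifted right by the gap of 4
m = mx + my               # aligned sum: 1.0001111, still under 2.0
frac = (m >> 4) & 0b111   # truncate back to 3 fraction bits
print(frac)               # -> 0: X's fraction is unchanged, so X + Y = X
```

Every bit of Y, including its hidden 1, falls below the truncation point, so with truncation it contributes nothing to the result.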
Negative Values
So far, we've only considered adding two non-negative numbers. What happens with
negative values?
If you're doing it on paper, then you proceed with the sum as usual: just do normal
addition or subtraction.
If it's in hardware, you would probably convert the mantissas to two's complement
and perform the addition, while keeping track of the radix point (read about fixed
point representation).
Bias
Does the bias representation help us in floating point addition? The main difficulty
lies in computing the difference of the exponents. Still, that's not so bad, because we
can just do unsigned subtraction. For the most part, the bias doesn't pose too many
problems.
Overflow/Underflow
It's possible for a result to overflow (a result that's too large to be represented) or
underflow (smaller in magnitude than the smallest denormal, but not zero). Real
hardware has rules to handle this. We won't worry about it much, except to
acknowledge that it can happen.
Summary
Adding two floating point values isn't so difficult. It basically consists of adjusting
the exponent of the number with the smaller exponent (call it Y) to match that of the
larger (call it X), shifting the radix point of Y's mantissa left to compensate.

Once the addition is done, we may have to renormalize and to truncate bits if there are
too many bits to be represented.
If the difference in the exponents is too great, then adding X + Y effectively
results in X.
Real floating point hardware uses more sophisticated means to round the summed
result. We make the simplification of truncating bits when there are more bits than
can be represented.
