
FACULTY OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
VŠB – TECHNICAL UNIVERSITY OF OSTRAVA
Introduction to Digital Systems

for Joint Teaching Programme of BUT and VSB-TUO
Supervisor:
Jaroslav Zdralek
Authors
Jaroslav Zdralek
Zdenka Chmelikova
Ostrava * 2014
This textbook was supported by the project No. CZ.1.07/2.2.00/28.0062 funded by
the European Social Fund and Ministry of Education, Czech Republic.
Author doc. Ing. Jaroslav Zdrálek, Ph.D.
Ing. Zdeňka Chmelíková, Ph.D.
Title Introduction to Digital Systems for Joint Teaching Programme of
BUT and VSB-TUO
Editor VŠB - Technical University of Ostrava
Faculty of Electrical Engineering and Computer Science
Department of Telecommunications
17. listopadu 15, 708 33 OSTRAVA
Edition first
Year 2014
Issue on-line
ISBN 978-80-248-3646-1 CD
This publication has not been linguistically or editorially modified
ii VŠB-TU Ostrava
Contents
1 Introduction
1.1 Basic terms
1.2 Endianness
1.3 Binary prefixes – Standard IEC
1.4 Reference
2 Numeral systems
2.1 Polynomial of numeral system
2.2 Numeral systems used in digital systems
2.3 Conversion between numeral systems
2.4 Reference
3 Boolean algebra
3.1 Propositional calculus
3.2 Definition of Boolean algebra
3.3 Boolean function
3.4 Boolean expression
3.5 Reference
4 Design of Boolean function
4.1 Logic gate
4.2 Synthesis
4.3 Minimization by Karnaugh map
4.4 Realization by NAND and NOR logic gates
4.5 Algorithm of synthesis
4.6 Reference
5 Real numbers
5.1 Some famous bugs
5.2 Serious problems
• Chaotic bank
• Rump’s problem
• A simple example
5.3 The representation of the number is not easy
5.4 References
6 Integer numbers
6.1 Unsigned integer
6.2 Signed integer
6.3 Sign and magnitude
6.4 1’s complement
6.5 2’s complement
6.6 Conversion to two’s complement
6.7 Conversion from two’s complement
6.8 Offset binary
6.9 Conversion to and from offset binary
6.10 BCD numbers
6.11 Ten’s complement
6.12 References
7 Arithmetic operations on integer numbers
7.1 Flag bits of operations
7.2 Sign extension
7.3 Addition, unsigned and two’s complement
7.4 Subtraction, unsigned and two’s complement
7.5 Addition and subtraction in the sign and magnitude
7.6 Addition and subtraction in offset binary
7.7 Addition and subtraction in BCD code
7.8 Multiplication
7.9 Division
7.10 References
8 Fixed point arithmetic
8.1 Binary scaling
8.2 Format m.n
8.3 ℚ number format
8.4 Range of representation of fixed point
8.5 Conversion to and from fixed point
8.6 Arithmetic operations
8.7 Addition and subtraction
8.8 Multiplication
8.9 Division
8.10 References
9 Floating point numbers
9.1 Significand
9.2 Precision
9.3 Floating point values
9.4 Sets of floating-point data
9.5 Formats defined by IEEE 754-2008
9.6 Binary interchange format encodings
9.7 Decimal interchange floating point format
9.8 Declet and densely-packed decimal
9.9 Rounding
9.10 Not a Number
9.11 Infinity
9.12 Default exceptions
9.13 Implementation
9.14 References
9.15 Annex 09A
9.16 Annex 09B
9.17 Annex 09C
10 Floating point arithmetic
10.1 Rounding
10.2 Exception
10.3 Operation on result
10.4 Minifloat floating point format
10.5 Addition and subtraction
10.6 Multiplication
10.7 Division
10.8 References
11 Characters and Unicode
11.1 Terminology
11.2 Fonts
11.3 Bitmap font
11.4 Outline fonts
11.5 Stroke fonts
11.6 ASCII
11.7 Code pages
11.8 C0 and C1 control codes
11.9 Unicode
11.10 Using Unicode
11.11 UTF-32
11.12 UTF-16
11.13 UTF-8
11.14 Byte order mark
11.15 Whitespace character
11.16 Newline
11.17 Possible notations of Unicode
11.18 References
12 Finite state machine
12.1 Discrete time
12.2 Definitions of finite state machine
12.3 Synchronous and asynchronous machine
12.4 Block diagram of synchronous FSM
12.5 Description of FSM behavior
12.6 Examples of finite state machine
12.7 Table notation of FSM
12.8 Synchronization of input
12.9 Notation in programming languages
12.10 References
13 Synchronous digital system
13.1 Decimal adder
13.2 Data unit for decimal adder
13.3 Control unit
13.4 Simulation and realization
13.5 Reference
13.6 Annex 13A
Digital systems for joint teaching programme of BUT and VSB-TUO
1 Introduction
Digital systems are systems characterized by a discontinuous (discrete) representation of
information or operation. They are the opposite of analog systems, which are characterized by
a continuous representation of information [wiki_0101]. The surrounding world behaves in a
continuous manner, so for digital processing it is necessary to convert the continuous
representation to a discontinuous one. Digital systems use numbers to represent reality. From
the realization point of view, digital systems use binary numbers, where the digits 0 and 1
are the foundation.
Digital system and information technology are closely related terms. The definition of
information is broad; nowadays, information is typically associated with digital data and
digital systems. A digital system can therefore be understood as a system that processes
and produces information. Typical representatives of digital systems are mobile phones,
digital television and radio, digital photos and movies, and so on. Computers cannot be
omitted because they stood at the beginning of the modern era of digital systems. Different
modifications of computers are used in the systems listed above. The term computer has
also diversified into personal computer, notebook and tablet.
A digital system is also understood as a combination of hardware and software. Hardware is
the physical, tangible part of a digital system. Hardware processes data or information by
algorithms that are either implemented directly in hardware or described by programs. These
programs are called software.
Hardware is designed and described by means of Boolean functions and finite state
machines. Synthesis tools then transform the description into a form suitable for
realization. For this process, it is important to know the format of the data and the
algorithm of processing. It is also necessary to know the terminology.
1.1 Basic terms

The following terms are associated with digital systems, and we have to understand them
correctly to avoid confusion. These terms concern the names of bit groups, the indexing of
bits within a group, and so on.
A bit is the fundamental and smallest unit of information in computing and
telecommunications. A bit has two values that can be understood as logical or binary values,
depending on the usage. Typical pairs of values are 0 and 1; True and False; plus and minus;
Low and High; etc. The term bit was coined in 1947 by J. W. Tukey as an abbreviation of the
words binary digit. The lowercase letter b is used as the symbol of the bit as a unit. More
information is in [wiki_0102].
A byte is a unit of data. The term byte was coined by Dr. Werner Buchholz in July 1956,
during the early design phase of the IBM Stretch computer. The capital letter B is used as
the symbol of the byte as a unit. A byte is a group of 8 bits, where each bit has its
position and order. This means that a byte can contain numbers from 0 to 255 in the decimal,
from 00 to FF in the hexadecimal and from 0000 0000 to 1111 1111 in the binary numeral
system [wiki_0103]. Octet is a term for a group of exactly 8 bits that is used mainly in
telecommunications. Elsewhere the term octet is not frequently used and it is usually
replaced by byte [wiki_0104].
A nibble is a group of 4 bits, which corresponds to half a byte. A nibble is used to store
one hexadecimal digit. One byte has two nibbles, or two hexadecimal digits. One nibble may
have decimal values from 0 to 15, binary values from 0000 to 1111 and hexadecimal values
from 0 to F. More information is in [wiki_0105].
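The relation between a byte and its two nibbles can be sketched in Python. The function name `split_nibbles` is only illustrative and does not come from the text; the sketch assumes the byte is given as an integer from 0 to 255.

```python
def split_nibbles(byte: int) -> tuple[int, int]:
    """Return (high nibble, low nibble) of a value in the range 0..255."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("a byte holds values from 0 to 255")
    high = (byte >> 4) & 0xF   # upper four bits
    low = byte & 0xF           # lower four bits
    return high, low


high, low = split_nibbles(0xA7)
# Each nibble corresponds to exactly one hexadecimal digit of the byte.
print(f"{high:X} {low:X}")
```

Each returned value lies in the range 0 to 15, so it can always be written as a single hexadecimal digit.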
A word is a group of bytes. Historically, the number of bits in a word differed from machine
to machine. Today, the number of bits in a word is given by the computer architecture (size
of registers, memory, etc.). The word size can be 16, 32 or 64 bits depending on the
architecture of the processor. More information is in [wiki_0106].
value = a_{n-1} B^{n-1} + … + a_1 B^1 + a_0 B^0 + a_{-1} B^{-1} + … + a_{-m} B^{-m}
MSB is the coefficient a_{n-1}; LSB is the coefficient a_{-m}.
Where
• B is equal to 2, it is the radix of the binary numeral system
Fig. 01-01 Position of LSB and MSB
LSB (Least Significant Bit) is the coefficient with the lowest weight [wiki_0107]. MSB (Most
Significant Bit) is the coefficient with the highest weight [wiki_0108]. The terms LSB and
MSB are also applied to bytes and even to the coefficients of other numeral systems.
Bit numbering (also the position or indexing of bits) assigns to each bit or coefficient a
number that indicates its position, Fig. 01-02, [wiki_0109]. For a group of n bits, the
preferred bit numbering gives the leftmost position index n-1 and the rightmost position
index 0. If the group is an integer number, the indexes then correspond to the orders,
Fig. 01-02. This principle of indexing is not obligatory, so it is necessary to distinguish
the value of the index from the value of the order. They do not have to be the same, Fig. 01-02.
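Variant                          Leftmost bit     Rightmost bit
Preferred                        index 7, MSB     index 0, LSB
Change of position               index 0, MSB     index 7, LSB
Change of weight                 index 7, LSB     index 0, MSB
Change of position and weight    index 0, LSB     index 7, MSB
Fig. 01-02 Possible positions and weights in byte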
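The preferred bit numbering, where the index equals the order, can be sketched in Python with shifts and masks. The helper names `bit_at` and `bits_msb_first` are hypothetical; the sketch assumes an unsigned integer and an 8-bit group by default.

```python
def bit_at(value: int, index: int) -> int:
    """Return the bit whose weight is 2**index (preferred numbering)."""
    return (value >> index) & 1


def bits_msb_first(value: int, width: int = 8) -> list[int]:
    """List the bits from index width-1 (MSB) down to index 0 (LSB)."""
    return [bit_at(value, i) for i in range(width - 1, -1, -1)]


value = 0b10010110
msb = bit_at(value, 7)   # leftmost position, highest weight
lsb = bit_at(value, 0)   # rightmost position, lowest weight
print(msb, lsb, bits_msb_first(value))
```

With a different numbering convention only the mapping from index to weight changes; the stored value stays the same.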
1.2 Endianness
Endianness is the way of storing numbers and codes in computer memory, of ordering bytes
for serial transmission, etc. Information that is n bits wide is split into smaller groups
called atomic elements. An atomic element is typically a byte, but it can be a word or
another m-tuple. Endianness determines which atomic element is stored in memory at the lower
address, which atomic element is transmitted first, etc. There are two principles of
endianness, called little endian and big endian, Fig. 01-03 and Fig. 01-04. More information
is in [wiki_0110].
Value: 0x 08 09 0A 0B 0C 0D 0E 0F (MSB on the left, LSB on the right)
Atomic element is a byte      Atomic element is a word
Address  Content              Address  Content
a        0F                   a        0E0F
a+1      0E                   a+1      0C0D
a+2      0D                   a+2      0A0B
a+3      0C                   a+3      0809
a+4      0B
a+5      0A
a+6      09
a+7      08
Little endian, LSB atomic element is stored at a lower address.
Fig. 01-03 Little endian
Value: 0x 08 09 0A 0B 0C 0D 0E 0F (MSB on the left, LSB on the right)
Atomic element is a word      Atomic element is a byte
Address  Content              Address  Content
a        0809                 a        08
a+1      0A0B                 a+1      09
a+2      0C0D                 a+2      0A
a+3      0E0F                 a+3      0B
                              a+4      0C
                              a+5      0D
                              a+6      0E
                              a+7      0F
Big endian, MSB atomic element is stored at a lower address.
Fig. 01-04 Big endian
Little endian is the principle where the LSB atomic element is stored at the lower address
and has the lower index during transmission, Fig. 01-03. Big endian is the opposite: the MSB
atomic element is stored at the lower address and has the lower index during transmission,
Fig. 01-04.
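The two principles can be sketched in Python for the value 0x08090A0B0C0D0E0F used in the figures. The function `store` and its parameters are illustrative assumptions, not part of the text; the list index stands for the offset from address a, counted in atomic elements.

```python
def store(value: int, total_bytes: int, element_bytes: int,
          little_endian: bool) -> list[str]:
    """Return the atomic elements as hex strings, ordered by increasing address."""
    raw = value.to_bytes(total_bytes, byteorder="big")  # MSB first
    elements = [raw[i:i + element_bytes].hex().upper()
                for i in range(0, total_bytes, element_bytes)]
    # Little endian puts the least significant element at the lowest address.
    return elements[::-1] if little_endian else elements


value = 0x08090A0B0C0D0E0F
print(store(value, 8, 1, little_endian=True))    # atomic element is a byte
print(store(value, 8, 2, little_endian=True))    # atomic element is a 16-bit word
print(store(value, 8, 1, little_endian=False))   # big endian, bytes
```

Only the order of the atomic elements changes; the bits inside each element keep their order.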
1.3 Binary prefixes – Standard IEC

Normally, physicists understand the prefix kilo as 1000 = 10^3. In contrast, computer
scientists often understand the prefix kilo as 1024 = 2^10. The difference between these
values is small for the prefix kilo, but it increases for higher prefixes. As a result, a
value stated in bytes does not always correspond to the value implied by the prefix. A
typical example is disk capacity: the capacity stated on the label of a disk is 500 GB,
while the computer shows a capacity of 466.37 GB. In contrast, the bit rate of 10 Gbit/s
Ethernet is equal to 10 billion bits per second. This is caused by different definitions of
the value of the prefix giga. A more complicated situation concerns the 3 1/2 inch floppy
disk with the capacity of 1.44 MB. Its capacity is 1 474 560 bytes (1.44 × 10^3 × 2^10),
which mixes both definitions. Standard IEC 60027-2 resolves these irregularities and it has
been valid in the Czech Republic since 1 April 2004. The standard introduces new prefixes,
which are derived from base 2 and are called binary prefixes [wiki_0111]. The binary
prefixes are listed in Fig. 01-05.
Decimal SI prefix                        Binary IEC prefix
Symbol  Name   Base 1000  Base 10        Symbol  Name  Base 1024  Base 2
k       kilo   1000       10^3           Ki      kibi  1024       2^10
M       mega   1000^2     10^6           Mi      mebi  1024^2     2^20
G       giga   1000^3     10^9           Gi      gibi  1024^3     2^30
T       tera   1000^4     10^12          Ti      tebi  1024^4     2^40
P       peta   1000^5     10^15          Pi      pebi  1024^5     2^50
E       exa    1000^6     10^18          Ei      exbi  1024^6     2^60
Z       zetta  1000^7     10^21          Zi      zebi  1024^7     2^70
Y       yotta  1000^8     10^24          Yi      yobi  1024^8     2^80
Fig. 01-05 Decimal SI prefixes and binary IEC prefixes
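How the gap between decimal and binary prefixes grows with the order, and how the floppy disk capacity mixes both definitions, can be checked with a short Python sketch. The function name `excess_percent` and the variable names are illustrative assumptions.

```python
DECIMAL = ["k", "M", "G", "T"]       # 1000**1 .. 1000**4
BINARY = ["Ki", "Mi", "Gi", "Ti"]    # 1024**1 .. 1024**4


def excess_percent(order: int) -> float:
    """How much larger 1024**order is than 1000**order, in percent."""
    return (1024**order / 1000**order - 1) * 100


for order, (dec, bin_) in enumerate(zip(DECIMAL, BINARY), start=1):
    print(f"{bin_} exceeds {dec} by {excess_percent(order):.1f} %")

# The "1.44 MB" floppy disk: 1.44 * 10**3 * 2**10 bytes, written with integers.
floppy_bytes = 1440 * 1024
print(floppy_bytes)
```

The percentage grows with every prefix, which is why the mismatch is more visible for giga and tera than for kilo.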
1.4 Reference
[wiki_0101] Digital data; http://en.wikipedia.org/wiki/Digital_data; on line 2014-10-21
[wiki_0102] Bit; http://en.wikipedia.org/wiki/Bit; on line 2014-10-21
[wiki_0103] Byte; http://en.wikipedia.org/wiki/Byte; on line 2014-10-21
[wiki_0104] Octet (computing); http://en.wikipedia.org/wiki/Octet_(computing); on line 2014-10-21
[wiki_0105] Nibble; http://en.wikipedia.org/wiki/Nibble; on line 2014-10-21
[wiki_0106] Word (computer architecture); http://en.wikipedia.org/wiki/Word_(computer_architecture); on line 2014-10-21
[wiki_0107] Least significant bit; http://en.wikipedia.org/wiki/Least_significant_bit; on line 2014-10-21
[wiki_0108] Most significant bit; http://en.wikipedia.org/wiki/Most_significant_bit; on line 2014-10-21
[wiki_0109] Bit numbering; http://en.wikipedia.org/wiki/Bit_numbering; on line 2014-10-21
[wiki_0110] Endianness; http://en.wikipedia.org/wiki/Endianness; on line 2014-10-21
[wiki_0111] Binary prefix; http://en.wikipedia.org/wiki/Binary_prefix; on line 2014-10-21
[IEC 60027-2] IEC 60027-2, International Standard, Letter symbols to be used in electrical
technology – Part 2: Telecommunications and electronics
2 Numeral systems
Prehistoric people represented numbers using the tools available: fingers, stones, notches,
etc. Some tribes in Africa used the quinary system, counting on the fingers of one hand. The
quinary system is a system with radix 5. Because a human has twenty fingers and toes, the
vigesimal system, a system with radix 20, has also been used often. The Maya used this
system up to the 6th century AD. Sumerians used a positional system with base 60. The
counting of time (24 hours a day, each hour has 60 minutes, each minute has 60 seconds) has
survived for four thousand years. Indians are regarded as the discoverers of the positional
system we use today. The oldest form of this numeral system originated in India in the 3rd
century BC; it was gradually taken over by the Arabs and spread further to Greece and Europe
[Internet_0201], [wiki_0201].
A number is an arranged group of symbols called digits.
The Roman numeral system is a well-known non-positional notation. An example of a Roman
number is MDCCLXIV, which is equal to the decimal number 1764. The letters corresponding to
decimal numbers are: M = 1000; D = 500; C = 100; L = 50; X = 10; V = 5; I = 1. Rules for
writing Roman numbers are given in the literature [Internet_0202], [wiki_0202].
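As an illustration of a non-positional notation, the example MDCCLXIV can be evaluated with a short Python sketch. The function name `roman_to_int` is hypothetical, and the subtractive rule used here (a smaller symbol written before a larger one is subtracted, as in IV = 4) is the commonly used convention rather than something stated in the text. The sketch does not validate the input.

```python
VALUES = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}


def roman_to_int(text: str) -> int:
    """Sum the letter values; a smaller letter before a larger one is subtracted."""
    total = 0
    for i, symbol in enumerate(text):
        value = VALUES[symbol]
        if i + 1 < len(text) and value < VALUES[text[i + 1]]:
            total -= value   # e.g. the I in IV
        else:
            total += value
    return total


print(roman_to_int("MDCCLXIV"))
```

The value of a letter does not depend on a weight given by its position, only on its neighbours, which is what distinguishes this notation from a positional one.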
Decimal          Binary           Octal            Hexadecimal
numeral system   numeral system   numeral system   numeral system
0                0                0                0
1                1                1                1
2                10               2                2
3                11               3                3
4                100              4                4
5                101              5                5
6                110              6                6
7                111              7                7
8                1000             10               8
9                1001             11               9
10               1010             12               A
11               1011             13               B
12               1100             14               C
13               1101             15               D
14               1110             16               E
15               1111             17               F
16               10000            20               10
Fig. 02-01 Digits and orders in numeral systems
A positional notation of a number is a series of symbols where each symbol has its position
in the series, specified by its weight [wiki_0203]. The value of a number is determined by
the sum of the weighted digits. For example, the decimal number 2568 can be expressed like
this: 2568 = 2×10^3 + 5×10^2 + 6×10^1 + 8×10^0. Each weight is a power of the number 10 and
defines the specific position of a digit in the series. The decimal point enables us to use
a negative number as the exponent. For example, the decimal number 0.39 can be expressed
like this: 0.39 = 3×10^-1 + 9×10^-2.
Positional notation means that all digits of the relevant numeral system are used in each
order. After all the digits in a given order have been used, a higher order is added. In the
decimal numeral system, after all digits 0, 1, …, 9 with weight 10^0 have been used, the
higher order 10^1 is added: …, 8, 9, 10, 11, …. This principle of adding a higher order is
valid in all positional numeral systems. Fig. 02-01 also shows the principle of adding a
higher order for the binary, octal and hexadecimal numeral systems. In the figure, the added
orders are marked by a yellow background.
                       Notation in programming languages
Literature             Language C    Other languages        Note
1011_2     1011B                     2#1011     B"1011"     Binary
13_8       13O         013           8#13                   Octal
11_10      11D         11            11         D"11"       Decimal
1A1B_16    1A1BH       0x1A1B        X"1A1F"    16#1A1B     Hexa
(1011)_2   0b1011                    B1100      2#1010      Binary
(12)_10    0d12        12            D12        10#12       Decimal
(B)_16     0hB         0xB           HB         16#B        Hexa
Fig. 02-02 Notation of radix in a number
In the following text, different numeral systems are used. For correct interpretation of a
number, the radix of numeral system is stated with the number. Possible notations of radix
are shown in Fig. 02-02.
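As one concrete case of stating the radix with the number, the sketch below uses Python, a language not listed in Fig. 02-02. Its literal prefixes and the built-in `int` and `format` functions are standard Python; the particular numbers are only examples.

```python
# Literals in source code: the prefix states the radix.
assert 0b1011 == 11      # binary
assert 0o13 == 11        # octal (language C writes 013)
assert 0x1A1B == 6683    # hexadecimal

# Digit strings with an explicitly given radix.
print(int("1011", 2), int("13", 8), int("1A1B", 16))

# The opposite direction: the same value written in different radices.
print(format(11, "b"), format(11, "o"), format(6683, "X"))
```

Without the prefix or the radix argument, a digit string such as 11 is ambiguous: it is eleven in decimal, three in binary and seventeen in hexadecimal.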
2.1 Polynomial of numeral system

In general, each number in a positional system is a string of digits or coefficients in the
form a_{n-1} a_{n-2} … a_1 a_0 . a_{-1} a_{-2} … a_{-m}. Therefore, every number written in
this form can be expressed by the polynomial of a numeral system, formula (0201),
[wiki_0201]. It is sometimes called the Horner polynomial.
N_R = a_{n-1} R^{n-1} + \dots + a_2 R^2 + a_1 R^1 + a_0 R^0 + a_{-1} R^{-1} + a_{-2} R^{-2} + \dots + a_{-m} R^{-m} = \sum_{i=-m}^{n-1} a_i \cdot R^i    (0201)
Where
• N_R is a value which is expressed in radix R.
• a_i is a digit or a coefficient in the range from 0 to R-1.
• R is a base or a radix of the numeral system. The radix is an integer and R > 1.
• i is an exponent which expresses the position of the coefficient (i = -m, …, -1, 0, 1, …, n-1).
This position is also called the order of a digit.
• R^i is a weight.
• n is the number of orders in the integer part, where index n-1 belongs to the leftmost
digit, which is called the highest order or the most significant digit.
• m is the number of orders in the fractional part, where index -m belongs to the rightmost
digit, which is called the lowest order or the least significant digit.
Number 1234.5678: the integer part is 1234 and the fractional part is 5678.
The most significant digit is 1 and the least significant digit is 8.
Fig. 02-03 Terms of real number

The polynomial of a numeral system introduces terms that are connected with a real number,
Fig. 02-03. A real number has two basic parts, the integer part and the fractional part,
which are separated by the radix point. The term radix point is preferred to the term
decimal point because it is general and does not depend on the base of the numeral system
[wiki_0204]. The integer part and the fractional part are given by formulas (0202) and
(0203).
N_I = (a_{n-1} a_{n-2} \dots a_2 a_1 a_0)_R = a_{n-1} R^{n-1} + a_{n-2} R^{n-2} + \dots + a_2 R^2 + a_1 R^1 + a_0 R^0    (0202)
N_F = (0 . a_{-1} a_{-2} a_{-3} \dots a_{-m})_R = a_{-1} R^{-1} + a_{-2} R^{-2} + a_{-3} R^{-3} + \dots + a_{-m} R^{-m}    (0203)
Where
• N_I is an integer part of a real number.
• N_F is a fractional part of a real number.
• a_i is a digit of a real number.
• R is a radix of a numeral system.
• n is a number of digits in the integer part.
• m is a number of digits in the fractional part.
• . (the dot) is the radix point.
The digit in the leftmost position is also called the most significant digit, and the digit
in the rightmost position is also called the least significant digit. When the number does
not have a fractional part, the least significant digit is the digit with weight R^0.
2.2 Numeral systems used in digital systems

The natural numeral system for digital systems is the binary numeral system. Octal and
hexadecimal numeral systems are also associated with digital systems, but these systems
are only used to write binary numbers. Calculations in the decimal numeral system can be
implemented in binary systems by using codes, where the BCD code is the most famous
one.
The decimal numeral system is the most common. The radix of this system is 10 (R = 10) and decimal numbers are often marked with the capital letter D. The decimal numeral system uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and certain weights have special names: 10^1 is ten, 10^2 is hundred, 10^3 is thousand, 10^6 is million, 10^9 is billion, 10^12 is trillion and so on. Each number can be expressed by equation (0201). For example, number 3725 can be expressed as the polynomial:
• (3725)_D = 3 · 10^3 + 7 · 10^2 + 2 · 10^1 + 5 · 10^0
The binary numeral system is fundamental in digital systems. The radix of this system is 2 (R = 2) and binary numbers are often marked with the capital letter B. The binary numeral system only uses the digits 0 and 1. Each number can be expressed by polynomial (0201). For example, number 1101.101 as the polynomial is:
• (1101.101)_B = 1 · 2^3 + 1 · 2^2 + 0 · 2^1 + 1 · 2^0 + 1 · 2^-1 + 0 · 2^-2 + 1 · 2^-3
The hexadecimal numeral system is used for shortening the notation of long binary numbers. The radix of this system is 16 (R = 16) and the capital letter H is used for indicating hexadecimal numbers. The hexadecimal numeral system uses the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and the letters A, B, C, D, E, F for the remaining values, where A = 10, B = 11, C = 12, D = 13, E = 14, F = 15. Each hexadecimal number can be expressed by the polynomial, for example:
• (35A1)_H = 3 · 16^3 + 5 · 16^2 + 10 · 16^1 + 1 · 16^0
The octal numeral system is used very little nowadays; it was also used for shortening the notation of binary numbers. The radix of this system is 8 (R = 8) and the capital letter O is used for indicating octal numbers. The octal numeral system uses the digits 0, 1, 2, 3, 4, 5, 6, 7. Each octal number can be expressed by the polynomial, for example:
• (572)_O = 5 · 8^2 + 7 · 8^1 + 2 · 8^0
2.3 Conversion between numeral systems

Conversion between any two numeral systems is possible and the following algorithms are valid in general. However, a direct conversion between two non-decimal systems requires arithmetic in a numeral system different from the decimal one. Therefore, the examples prefer conversion to and from the decimal numeral system. The conversion to the decimal numeral system from any numeral system is given by the polynomial of the numeral system (0201). Practical uses of the polynomial are:
 20123 = (2 · 33 + 0 · 32 + 1 · 31 + 2 · 30 = 2 · 27 + 0 + 3 + 2 ) = 5910
 110110.01B = (1 · 25 + 1 · 24 + 0 · 23 + 1 · 22 + 1 · 21 + 0 · 20 + 0  2-1 + 1  2-2 = 32 + 16 Conversion to
+ 4 + 2 + 0.25) = 54.25D the decimal nu-
 456.811 = (4  112 + 5  111 + 6  110 + 8  11-1 = 484 + 55 + 6 + 0.728) = 545.72810 meral system.□
 D4C.B16 = (13 · 162 + 4 161 + 12 · 160 + 11  16-1 = 256 + 54 + 12 + 0.6875) =
322.687510
 0x2AF8 = (2 · 163 +10 · 162 + 15 · 161 + 8 · 160 = 8192 + 2560 + 240 + 8) = 0d11000
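The evaluation of polynomial (0201) can be sketched in a few lines of Python. This is only an illustration: the function name to_decimal and the digit alphabet DIGITS are assumptions made for this example and are not part of the textbook. Exact fractions are used so that the fractional weights R^-1, R^-2, … are not rounded.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # hypothetical digit alphabet

def to_decimal(text, radix):
    """Evaluate polynomial (0201) for a digit string such as 'D4C.B'."""
    integer, _, fraction = text.upper().partition(".")
    value = Fraction(0)
    # Integer part, formula (0202): weights R^0, R^1, ... from the right.
    for position, digit in enumerate(reversed(integer)):
        coefficient = DIGITS.index(digit)
        if coefficient >= radix:
            raise ValueError("digit %r is not valid in radix %d" % (digit, radix))
        value += coefficient * Fraction(radix) ** position
    # Fractional part, formula (0203): weights R^-1, R^-2, ... from the left.
    for position, digit in enumerate(fraction, start=1):
        coefficient = DIGITS.index(digit)
        if coefficient >= radix:
            raise ValueError("digit %r is not valid in radix %d" % (digit, radix))
        value += coefficient * Fraction(radix) ** -position
    return value

print(float(to_decimal("2012", 3)))       # 59.0
print(float(to_decimal("110110.01", 2)))  # 54.25
print(float(to_decimal("D4C.B", 16)))     # 3404.6875
print(float(to_decimal("2AF8", 16)))      # 11000.0
```

The value of 456.8 in radix 11 comes out as the exact fraction 545 + 8/11, which is approximately 545.727.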
The conversion from decimal to any system is carried out separately for the integer and fractional parts, which are defined by formulas (0202) and (0203). The conversion of the integer part is performed by division and the conversion of the fractional part is performed by multiplication. An example of the conversion of an integer number is in Fig. 02-04 and the following algorithm was used:
• The integer number N is divided by radix R; the result is quotient Q_0 and remainder a_0: Q_0 = N/R = a_{n-1}R^{n-2} + a_{n-2}R^{n-3} + … + a_2R^1 + a_1R^0, remainder a_0.
• The next step is the division of quotient Q_0 by radix R; the result is quotient Q_1 and remainder a_1: Q_1 = Q_0/R = a_{n-1}R^{n-3} + a_{n-2}R^{n-4} + … + a_2R^0, remainder a_1.
• The division is repeated until the quotient is zero: Q_{n-1} = Q_{n-2}/R = 0, remainder a_{n-1}.
• The remainders are concatenated into the string a_{n-1}a_{n-2}…a_1a_0, which is the integer number in the new numeral system.
Convert (527)_10 to the numeral system with radix R = 7.
The solution is:
• 527/7 = 75 and remainder a_0 is 2, the least significant digit.
• 75/7 = 10 and remainder a_1 is 5.
• 10/7 = 1 and remainder a_2 is 3.
• 1/7 = 0 and remainder a_3 is 1, the most significant digit.
Answer: 527_10 = 1352_7
Fig. 02-04 Conversion of integer decimal number to number with radix 7
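The division algorithm can be sketched as follows. The function name integer_to_radix and the digit alphabet are hypothetical names chosen for this illustration; the sketch assumes a non-negative integer and a radix from 2 to 16.

```python
DIGITS = "0123456789ABCDEF"  # hypothetical digit alphabet, radix 2..16

def integer_to_radix(number, radix):
    """Convert a non-negative integer to a digit string by repeated division."""
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        # One division step: quotient Q_k and remainder a_k.
        number, remainder = divmod(number, radix)
        digits.append(DIGITS[remainder])
    # The first remainder is the least significant digit, so reverse the list.
    return "".join(reversed(digits))

print(integer_to_radix(527, 7))     # 1352
print(integer_to_radix(11000, 16))  # 2AF8
```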

The conversion of the fractional part of a number is performed by multiplication, where the number is multiplied by the new radix. After each multiplication, the integer part of the product is a new digit of the number in the new numeral system. An example of the conversion of a fractional number is in Fig. 02-05 and the following algorithm was used:
• The fractional number N_F is multiplied by the new radix R; the result of the multiplication is P_{-1} = N_F * R = a_{-1}.a_{-2}a_{-3}…a_{-m}. The obtained product P_{-1} is split into the integer part a_{-1} and the new fractional part N_{F-1} = 0.a_{-2}a_{-3}…a_{-m}.
• The split fractional part N_{F-1} is multiplied by the new radix R; the product is P_{-2} = N_{F-1} * R = a_{-2}.a_{-3}…a_{-m}. The obtained product P_{-2} is split into the integer part a_{-2} and the new fractional part N_{F-2} = 0.a_{-3}…a_{-m}.
• The multiplication is repeated until the new fractional part is equal to zero or the given precision is achieved.
Convert (0.367)_10 to the numeral system with radix R = 16. The given precision is 16 bits, i.e. four hexadecimal digits.
The solution is:
• 0.367 * 16 = 5.872; the integer part of the product is the first digit of the sought number, a_{-1} = 5.
• 0.872 * 16 = 13.952; the integer part of the product is digit a_{-2} = 13_10 = 0xD.
• 0.952 * 16 = 15.232; the integer part of the product is digit a_{-3} = 15_10 = 0xF.
• 0.232 * 16 = 3.712; the integer part of the product is digit a_{-4} = 3.
• The given precision is achieved.
Answer: (0.367)_D ≈ (0.5DF3)_H
Fig. 02-05 Conversion of fractional decimal number to number with radix 16
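A minimal sketch of the multiplication algorithm is shown below. The function name fraction_to_radix and the parameter max_digits (the required precision expressed as a number of digits) are assumptions of this example. The fraction is passed as a decimal string and processed with exact arithmetic, so the digits are not affected by floating-point rounding.

```python
from fractions import Fraction

DIGITS = "0123456789ABCDEF"  # hypothetical digit alphabet, radix 2..16

def fraction_to_radix(fraction, radix, max_digits):
    """Convert a fraction 0 <= fraction < 1 by repeated multiplication."""
    value = Fraction(fraction)          # exact value of e.g. "0.367"
    digits = []
    while value != 0 and len(digits) < max_digits:
        value *= radix                  # product P_-k
        digit = int(value)              # its integer part is the next digit
        digits.append(DIGITS[digit])
        value -= digit                  # continue with the new fractional part
    return "0." + ("".join(digits) or "0")

print(fraction_to_radix("0.367", 16, 4))  # 0.5DF3
print(fraction_to_radix("0.25", 2, 8))    # 0.01
```

The second call stops after two digits because the fractional part becomes zero before the precision limit is reached.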
The procedure for converting binary numbers to octal and hexadecimal numbers is similar. A binary number is split into groups starting from the radix point. These groups have 3 bits for an octal number and 4 bits for a hexadecimal number. Then, each group is expressed by an octal or a hexadecimal digit. When the converted binary number cannot be divided into whole groups of three or four bits, the necessary number of zeros is added on the leftmost or rightmost side of the number. For example, the number 1100 1011 1001.111_2 is equal to CB9.E_16. The basic idea of the conversion is in Fig. 02-06.
Octal number:         32475.55206
Groups of 3 bits:     011 010 100 111 101 . 101 101 010 000 110
Binary number:        11010100111101.10110101000011
Groups of 4 bits:     0011 0101 0011 1101 . 1011 0101 0000 1100
Hexadecimal number:   353D.B50C
Fig. 02-06 Conversion between binary, octal and hexadecimal numeral systems
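The grouping of bits can be sketched as below. The function name regroup_binary and its parameter bits_per_digit (3 for octal, 4 for hexadecimal) are hypothetical names for this illustration. The padding with zeros on the left of the integer part and on the right of the fractional part follows the rule described above.

```python
def regroup_binary(binary, bits_per_digit):
    """Convert a binary string to radix 2**bits_per_digit (3: octal, 4: hex)."""
    integer, _, fraction = binary.replace(" ", "").partition(".")
    # Pad with zeros so that both parts split into whole groups
    # counted from the radix point.
    integer = integer.zfill(-(-len(integer) // bits_per_digit) * bits_per_digit)
    fraction = fraction.ljust(-(-len(fraction) // bits_per_digit) * bits_per_digit, "0")

    def convert(bits):
        groups = [bits[i:i + bits_per_digit]
                  for i in range(0, len(bits), bits_per_digit)]
        return "".join("0123456789ABCDEF"[int(group, 2)] for group in groups)

    result = convert(integer)
    return result + "." + convert(fraction) if fraction else result

number = "11010100111101.10110101000011"
print(regroup_binary(number, 3))                # 32475.55206
print(regroup_binary(number, 4))                # 353D.B50C
print(regroup_binary("1100 1011 1001.111", 4))  # CB9.E
```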
All numbers from 0 to 15 can be written as a sum of the numbers 8, 4, 2, 1. In this way, we can quickly convert numbers from the decimal to the binary system and back. Numbers 8, 4, 2 and 1 are the weights of the binary numeral system and the corresponding exponents of the powers of 2 are 3, 2, 1 and 0. These exponents are the orders of a binary number. For example, decimal number 11 is the sum of numbers 8, 2 and 1. It means that the binary number has ones in orders 3, 1 and 0. Then 11_10 is equal to 1011_2. Fig. 02-07 shows other examples.
Decimal number   8 (=2^3)   4 (=2^2)   2 (=2^1)   1 (=2^0)   Corresponding binary number
0                0          0          0          0          0000
3                0          0          1          1          0011
6                0          1          1          0          0110
10               1          0          1          0          1010
13               1          1          0          1          1101
15               1          1          1          1          1111
Fig. 02-07 Examples of principle 8, 4, 2 and 1
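The principle 8, 4, 2, 1 can be tried out with a short sketch. The names WEIGHTS, nibble_to_binary and binary_to_nibble are hypothetical and only serve this illustration.

```python
WEIGHTS = (8, 4, 2, 1)  # weights 2^3, 2^2, 2^1, 2^0

def nibble_to_binary(number):
    """Write a number 0..15 as a sum of the weights 8, 4, 2, 1."""
    if not 0 <= number <= 15:
        raise ValueError("number must be in the range 0..15")
    bits = ""
    for weight in WEIGHTS:
        if number >= weight:
            bits += "1"       # the weight is part of the sum
            number -= weight
        else:
            bits += "0"
    return bits

def binary_to_nibble(bits):
    """Add the weights of all orders that contain the digit 1."""
    return sum(weight for weight, bit in zip(WEIGHTS, bits) if bit == "1")

print(nibble_to_binary(11))      # 1011
print(binary_to_nibble("1101"))  # 13
```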
2.4 Reference
[Internet_0201] ČÍSELNÉ SOUSTAVY; http://www.prevod.cz/popis.php?str=564&parent=y; on line 2014-10-21
[Internet_0202] Římská čísla; http://www.converter.cz/prevody/rimska-cisla.htm; on line 2014-10-21
[wiki_0201] Numeral system; http://en.wikipedia.org/wiki/Numeral_system; on line 2014-10-21
[wiki_0202] Roman numerals; http://en.wikipedia.org/wiki/Roman_numerals; on line 2014-10-21
[wiki_0203] Positional notation; http://en.wikipedia.org/wiki/Positional_notation; on line 2014-10-21
[wiki_0204] Decimal mark; http://en.wikipedia.org/wiki/Decimal_mark; on line 2014-10-21
3 Boolean algebra
Boolean algebra is a fundamental mathematical tool for analyzing and synthesizing logical circuits of all types. Before we get into the theory of Boolean algebra, we will concentrate on the question: “What is logic?”
Logic is a science dealing with reason, truthfulness, demonstrability and refutability. Logic is a form of communication, not a psychological interpretation.
3.1 Propositional calculus

A proposition is a statement or an expression which makes sense and about which it is possible to decide whether it is true or false. Mathematical logic is defined by a set of primitive symbols and by a set of propositional connectives, [Mendelson_1997]. Basic connectives are shown in Table 03-01.
Name          Symbol   Reading               Writing
Negation      ¬        it isn’t true that    ¬p
Conjunction   ˄        AND                   p˄q
Disjunction   ˅        OR                    p˅q
Implication   ⇒        if …, then …          p⇒q
Equivalency   ⇔        … if and only if …    p⇔q
Table 03-01 Overview of possible types of propositional connectives
By using propositional symbols and propositional connectives, we can compose a complex proposition which we call a formula. For every formula of propositional logic, we can build a truth table where a true proposition can be labeled as True (Yes, 1, On, etc.) and a false proposition can be labeled as False (No, 0, Off, etc.). Letters p and q represent propositional variables, which stand for the individual propositions in the formula.
p q ¬p p˄q p˅q p⇒q p⇔q

0 0 1 0 0 1 1
0 1 1 0 1 1 0
1 0 0 0 1 0 0
1 1 0 1 1 1 1
Table 03-02 Truth table of propositional connectives
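The truth table of the connectives can be generated mechanically. In the sketch below, the dictionary CONNECTIVES and the function truth_table are hypothetical names; the implication is written as (not p) or q, which is false only for p = 1 and q = 0.

```python
from itertools import product

CONNECTIVES = {
    "not p":   lambda p, q: not p,
    "p and q": lambda p, q: p and q,
    "p or q":  lambda p, q: p or q,
    "p => q":  lambda p, q: (not p) or q,  # false only for p = 1, q = 0
    "p <=> q": lambda p, q: p == q,
}

def truth_table():
    """Return one row (p, q, results of all connectives) per combination."""
    rows = []
    for p, q in product((False, True), repeat=2):
        results = tuple(int(f(p, q)) for f in CONNECTIVES.values())
        rows.append((int(p), int(q)) + results)
    return rows

for row in truth_table():
    print(*row)
```

The printed rows have the same column order as Table 03-02.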
Example 1. We have a bucket with water and two taps, Tap 1 and Tap 2, and we have to decide when water flows. There are two cases.
a) In the first case, water flows if at least one of the taps, Tap 1 or Tap 2, is open. This situation is in Fig. 03-01, including the corresponding truth table.
b) In the second case, water flows if both taps, Tap 1 and Tap 2, are open. This situation is in Fig. 03-02, including the corresponding truth table.
Tap 1   Tap 2   Flow
No      No      No
No      Yes     Yes
Yes     No      Yes
Yes     Yes     Yes
Fig. 03-01 Example of disjunction Tap 1 ˅ Tap 2
Tap 1   Tap 2   Flow
No      No      No
No      Yes     No
Yes     No      No
Yes     Yes     Yes
Fig. 03-02 Example of conjunction Tap 1 ˄ Tap 2
Example 2. We have an electrical circuit, Fig. 03-03. The light is turned on and off by the switch: if the switch is on, the bulb is on. Values of voltage and/or current are not important. There are two cases again.
Fig. 03-03 Example of electrical circuit
a) Serial connection. In this case, the bulb is on if both switches A and B are on. Otherwise, if at least one switch is off, the bulb is off, see Fig. 03-04. A logical expression of this example is: the bulb is on if switch A AND switch B are on.
A     B     Bulb lights
Off   Off   No
Off   On    No
On    Off   No
On    On    Yes
Fig. 03-04 Serial connection of switches
b) Parallel connection. In this case, the bulb is on if at least one of the switches A and B is on. If both switches A and B are off, the bulb is off, see Fig. 03-05. A logical expression of this example is: the bulb is on if switch A OR switch B is on.
A     B     Bulb lights
Off   Off   No
Off   On    Yes
On    Off   Yes
On    On    Yes
Fig. 03-05 Parallel connection of switches

Example 3: Convert the following sentences into propositional calculus:
a) If it's raining and I'll take a raincoat, I will not be soaked.
Solution: First, mark the propositions:
A = it's raining; B = I'll take a raincoat; C = I will not be soaked;
The propositions correspond to the notation: (A ˄ B) ⇒ C
Reading via propositional word connectives: If it's raining AND I'll take a raincoat,
then I will not be soaked.
b) It is not true that John plays the guitar and the piano. John cannot play the guitar.
Solution: First, mark the propositions:
A = John plays the guitar; B = John plays the piano;
The propositions correspond to the notation: [¬ (A ˄ B) ˄ (¬A)]
Reading via propositional word connectives: It isn’t true, that John plays the guitar
AND John plays the piano AND it isn’t true that John can play the guitar.
George Boole (1815 - 1864) was an English mathematician and a founder of the algebraic tradition in logic. He worked as a schoolmaster in England and from 1849 until his death as a professor of mathematics at Queen's College, Cork, Ireland. His most famous works are The Mathematical Analysis of Logic (1847) and An Investigation of the Laws of Thought, on Which are Founded the Mathematical Theories of Logic and Probabilities (1854). Both works deal with the mathematical theory of logic. Boole wanted to prove that his understanding of mathematics offers possibilities for solving logical problems. Boole was the first to express logical relations in algebraic equations.
3.2 Definition of Boolean algebra

Boolean algebra is the basic mathematics needed for studying the logical design of digital systems. Boolean algebra has many other applications, including the theory of sets and mathematical logic. However, we will restrict ourselves to its application in combinational and sequential circuits. Boolean algebra is only defined for two values, which can be True/False, Yes/No, H_level/L_level, 1/0, Open/Close, On/Off, etc. Boolean algebra defines three basic operations:
• Addition (logical disjunction, logical sum) is a binary operation. Basic notation is “+”.
• Multiplication (logical conjunction, logical product) is a binary operation. Basic notation is “∙”. This symbol is frequently omitted in expressions, and the notation is AB instead of A ∙ B.
• Negation is a unary operation. Basic notation is “ ’ ”, or in literature “¯”.
Boolean algebra is a set of rules (axioms and theorems) for writing and evaluating logical relations. This algebra uses Boolean variables which can only have two values. In digital systems, values 0 and 1 are preferred. Boolean algebra fulfils the principle of duality. It means that axioms and theorems are expressed in pairs, in which the values and the binary operators are swapped: each value 0 changes to 1 (0 ↔ 1) and each multiplication changes to addition (· ↔ +), and vice versa. Examples:
• a · b + a · b’ = a ↔ (a + b) · (a + b’) = a
• a + 1 = 1 ↔ a · 0 = 0
In literature, Boolean algebra is defined in different ways, [wiki_0301]. The following definition is based on axioms and derived theorems, [Wakerly_2006], [Roth_2004]. Boolean theorems have their names. The definition of Boolean algebra uses the term element. An element can be a value, a variable or an expression.
Boolean algebra is a six-tuple consisting of a set A, a binary operation · (and), a binary operation + (or), a unary operation ’ (not, complement) and two elements 0 and 1. In such a six-tuple, the following axioms are valid for all elements a, b, … of set A and elements 0 and 1:
(A1) a = 0 if a ≠ 1             (A1D) a = 1 if a ≠ 0
(A2) if a = 0 then a’ = 1       (A2D) if a = 1 then a’ = 0
(A3) 0 · 0 = 0                  (A3D) 1 + 1 = 1
(A4) 1 · 1 = 1                  (A4D) 0 + 0 = 0
(A5) 0 · 1 = 1 · 0 = 0          (A5D) 1 + 0 = 0 + 1 = 1
Note to axioms, theorems and elements
• An axiom is a premise or starting point of reasoning which is not doubted.
• A theorem is a statement that has been proven on the basis of previously established statements or axioms.
• Elements of set A are any variables or expressions.
Name of theorem          Theorem                                              Dual theorem
Identities               (T1) a + 0 = a                                       (T1D) a ∙ 1 = a
Null element             (T2) a + 1 = 1                                       (T2D) a ∙ 0 = 0
Idempotency              (T3) a + a = a                                       (T3D) a ∙ a = a
Complements              (T4) a + a’ = 1                                      (T4D) a ∙ a’ = 0
Involution               (T5) (a’)’ = a
Commutativity            (T6) a + b = b + a                                   (T6D) a ∙ b = b ∙ a
Associativity            (T7) (a + b) + c = a + (b + c)                       (T7D) (a ∙ b) ∙ c = a ∙ (b ∙ c)
Distributivity           (T8) a ∙ b + a ∙ c = a ∙ (b + c)                     (T8D) (a + b) ∙ (a + c) = a + b ∙ c
Covering                 (T9) a + a ∙ b = a                                   (T9D) a ∙ (a + b) = a
Combining                (T10) a ∙ b + a ∙ b’ = a                             (T10D) (a + b) ∙ (a + b’) = a
Consensus                (T11) a ∙ b + a’ ∙ c + b ∙ c = a ∙ b + a’ ∙ c        (T11D) (a + b) ∙ (a’ + c) ∙ (b + c) = (a + b) ∙ (a’ + c)
Generalized idempotency  (T12) a + a + a + … + a = a                          (T12D) a ∙ a ∙ a ∙ … ∙ a = a
De Morgan                (T13) (a ∙ b)’ = a’ + b’                             (T13D) (a + b)’ = a’ ∙ b’
De Morgan                (T14) (a ∙ b ∙ c ∙ d ∙ …)’ = a’ + b’ + c’ + d’ + …   (T14D) (a + b + c + d + …)’ = a’ ∙ b’ ∙ c’ ∙ d’ ∙ …
The theorems T13 and T14 are called De Morgan rules. These rules are used to swap a logical sum and a logical product. De Morgan rules can also be expressed by sentences:
• The negation of a logical multiplication is the logical addition of the negations.
• The negation of a logical addition is the logical multiplication of the negations.
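Because a Boolean variable has only two values, both De Morgan rules can be checked by trying every combination of inputs. The sketch below does this for n variables; the function name de_morgan_holds is a hypothetical name for this example, and the built-in functions all and any play the role of the logical product and the logical sum.

```python
from itertools import product

def de_morgan_holds(n):
    """Check theorems T14 and T14D for every combination of n values."""
    for values in product((False, True), repeat=n):
        negated = [not v for v in values]
        # T14: the negation of a product is the sum of the negations.
        t14 = (not all(values)) == any(negated)
        # T14D: the negation of a sum is the product of the negations.
        t14d = (not any(values)) == all(negated)
        if not (t14 and t14d):
            return False
    return True

print(all(de_morgan_holds(n) for n in range(2, 7)))  # True
```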
Augustus De Morgan (1806 – 1871) was a British mathematician and logician. He was born in India, studied at Cambridge University and worked mainly in London. He is the author of numerous works on algebra, arithmetic, mathematical analysis and probability theory. Above all, he is one of the founders of formal algebra. In 1847 he published the work Formal Logic, or the Calculus of Inference, Necessary and Probable, in which he in a certain way anticipated G. Boole.
Fig. 03-06, Fig. 03-07 and Fig. 03-08 show the application of Boolean theorems and axioms. Fig. 03-06 and Fig. 03-07 show the application of theorems T8 and T8D in disjunctive and conjunctive forms. Fig. 03-08 shows the application of De Morgan theorems for the simplification of an expression.
ab + abc’d + abde’ + abc’e’ + ab’c’e’ =
= ab(1 + c’d + de’) + ac’e’(b + b’) =      (theorem T8 applied to both groups)
= ab(1) + ac’e’(1) =                       (theorems T2 and T4)
= ab + ac’e’                               (theorem T1D)
Fig. 03-06 Application of Boolean axioms and theorems - I
(a + b + c)(a + b + d)(a + d’)(a + b’)(a + b’ + c’) =
= (a + b + c)(a + b + d)(a + b + d’)(a + b’ + d’)(a + b’ + 0)(a + b’ + c’) =      (theorems T8D and T1)
= (a + b + c)(a + b + dd’)(a + b’ + d’ · 0)(a + b’ + c’) =                         (theorems T8D and T8)
= (a + b + c)(a + b + 0)(a + b’ + 0)(a + b’ + c’) =                                (theorems T4D and T2D)
= (a + b + c)(a + b)(a + b’)(a + b’ + c’) =                                        (theorems T1 and T1)
= (a + b)(a + b’) =                                                                (theorems T8D, T2D, T1 for each pair)
= a                                                                                (theorems T8D, T4D, T1)
Fig. 03-07 Application of Boolean axioms and theorems - II
(a + b + c)(a + b + d)(a + d’)(a + b’)(a + b’ + c’) =
= (((a + b + c)(a + b + d)(a + d’)(a + b’)(a + b’ + c’))’)’ =                (theorem T5)
= ((a + b + c)’ + (a + b + d)’ + (a + d’)’ + (a + b’)’ + (a + b’ + c’)’)’ =   (De Morgan rule T14)
= (a’b’c’ + a’b’d’ + a’d + a’b + a’bc)’ =                                     (De Morgan rules T13D and T14D)
= (a’b’c’ + a’b’d’ + a’b’d + a’bd + a’b · 1 + a’bc)’ =                        (theorems T10 and T1D)
= (a’b’c’ + a’b’(d’ + d) + a’b(d + 1) + a’bc)’ =                              (theorems T8 and T8)
= (a’b’c’ + a’b’ + a’b + a’bc)’ =                                             (theorems T4, T1D and T2, T1D)
= (a’b’ + a’b)’ =                                                             (theorems T8, T2, T1D for each pair)
= (a’)’ =                                                                     (theorems T8, T4, T1D)
= a                                                                           (theorem T5)
Fig. 03-08 Application of Boolean axioms and theorems - III
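A simplification such as the one in Fig. 03-07 and Fig. 03-08 can be cross-checked by comparing the original and the simplified expression for all input combinations. The helper equivalent and the two functions below are hypothetical names introduced only for this sketch; Python's or, and and not stand for +, · and the complement.

```python
from itertools import product

def equivalent(f, g, n):
    """Return True if f and g agree on all 2**n input combinations."""
    return all(bool(f(*v)) == bool(g(*v)) for v in product((0, 1), repeat=n))

def original(a, b, c, d):
    # (a + b + c)(a + b + d)(a + d')(a + b')(a + b' + c')
    return ((a or b or c) and (a or b or d) and (a or not d)
            and (a or not b) and (a or not b or not c))

def simplified(a, b, c, d):
    # Result of the simplification: the expression reduces to a.
    return a

print(equivalent(original, simplified, 4))  # True
```

The same helper can compare any two expressions of the same variables, for example both sides of the consensus theorem T11.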
Note to expression, equation and statement
• An expression is any combination of variables, logical values and operations, for instance a ∙ b + c; the simplest expressions are 0, 1, a, a’ …
• An equation has two sides equal to each other: the value of the left side equals the value of the right side, for instance
  z = a, Decode = (a + b’) · c + a ∙ d
  0 = 3x - 1
• A statement is an element of a programming language, typically assigning the value of the expression on the right side to the variable on the left side. Many programming languages use the equals sign (=) for the assignment operation. In C language, the statement “z = a” is read as “the value of a is assigned to z”. Examples:
C language:            VHDL language:
X = 1;                 X := '1';
z = a;                 Z <= a;
Y = a && b || c;       Y := (a AND b) OR c;
READY = a | !b;        READY := a OR not b;
3.3 Boolean function

A Boolean function is a mapping of the domain of definition into the domain of mapping. A Boolean function describes how to determine a Boolean output value by some logical calculation from Boolean inputs. This function is important for designing circuits and chips for digital computers, [wiki_0302]. A complete Boolean function is the mapping:
f: {0, 1}^n → {0, 1}    (0301)
Where
• {0, 1}^n is the domain of definition, where n is the number of input variables.
• {0, 1} is the domain of mapping.
The domain of definition is a set of n-tuples. An n-tuple has n variables and each variable has one value, 0 or 1. The number of n-tuples is 2^n. It means that the domain of definition contains all combinations of 0 and 1 of length n. The domain of mapping has length 1; therefore the output of a Boolean function is equal to 0 or 1.
a  b  c  f
0  0  0  1
0  0  1  1
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  0
1  1  0  1
1  1  1  1
Fig. 03-09 Boolean function for 3 variables
The basic way of expressing the mapping is a table. For a Boolean function, the table is called a truth table, [wiki_0304]. The domain of definition contains 2^n combinations of logical values for n variables, and therefore the truth table has 2^n rows. The output value of the Boolean function is assigned to each combination. Only the truth table describes a Boolean function clearly and uniquely. Fig. 03-09 shows a Boolean function with 3 variables, where the mapping is f: {0, 1}^3 → {0, 1}.
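A truth table can be stored directly as a data structure. The following sketch builds a callable function from the output column of Fig. 03-09; the names OUTPUTS and make_function are assumptions made for this example.

```python
from itertools import product

# Output column f of Fig. 03-09, in the order of the rows 000 ... 111.
OUTPUTS = (1, 1, 0, 1, 0, 0, 1, 1)

def make_function(outputs):
    """Build f: {0,1}^n -> {0,1} from a truth-table output column."""
    n = (len(outputs) - 1).bit_length()
    if len(outputs) != 2 ** n:
        raise ValueError("a truth table for n variables needs 2**n rows")
    # Map every n-tuple of the domain of definition to its output value.
    table = dict(zip(product((0, 1), repeat=n), outputs))
    return lambda *inputs: table[inputs]

f = make_function(OUTPUTS)
print(f(0, 1, 1))  # 1
print(f(1, 0, 0))  # 0
```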
An incomplete Boolean function can have 3 values on the output: 0, 1 and X. The value X is called “don’t care”. An incomplete Boolean function is the mapping:
f: {0, 1}^n → {0, 1, X}    (0302)
Where
• {0, 1}^n is the domain of definition.
• {0, 1, X} is the domain of mapping.
• X is the value “don’t care”.
An example of an incomplete function is in Fig. 03-10. We can often see the value “don’t care” as an input value as well. One column is added to the truth table; it contains the number that corresponds to the combination of variables. This number makes orientation in the truth table easier.
No.   x1  x0  f
0     0   0   0
1     0   1   X
2     1   0   0
3     1   1   X

No.       x2  x1  x0  f
0 and 1   0   0   X   0
2 and 3   0   1   X   0
4         1   0   0   1
5         1   0   1   1
6 and 7   1   1   X   1
Fig. 03-10 Incomplete Boolean function and its modification
x      0  0  1  1
y      0  1  0  1     Name of function
f0     0  0  0  0     zero
f1     0  0  0  1     x and y
f2     0  0  1  0     not (x implies y)
f3     0  0  1  1     x
f4     0  1  0  0     not (y implies x)
f5     0  1  0  1     y
f6     0  1  1  0     x ≠ y (x xor y)
f7     0  1  1  1     x or y
f8     1  0  0  0     not (x or y)
f9     1  0  0  1     x = y (x xnor y)
f10    1  0  1  0     not y
f11    1  0  1  1     y implies x
f12    1  1  0  0     not x
f13    1  1  0  1     x implies y
f14    1  1  1  0     not (x and y)
f15    1  1  1  1     one
Fig. 03-11 All functions of 2 variables

Total number of functions = 2^{2^n}    (0303)
Where
• n is the number of input variables.
For two input variables, there are four possible combinations of values 0 and 1. It is possible to define 16 different Boolean functions, see Fig. 03-11. Only the marked functions are usually used in practice. For n variables, it is possible to define a total number of Boolean functions according to formula (0303).
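Formula (0303) can be illustrated by enumerating all possible output columns. In this sketch the function name all_functions is hypothetical; each returned tuple is one output column with 2^n rows, listed for the input rows in ascending binary order.

```python
from itertools import product

def all_functions(n):
    """Enumerate the output columns of all Boolean functions of n inputs."""
    rows = 2 ** n                       # number of rows of the truth table
    return list(product((0, 1), repeat=rows))

columns = all_functions(2)
print(len(columns))           # 16
print(columns[6])             # (0, 1, 1, 0)
print(len(all_functions(3)))  # 256
```

For two variables, the column with index 6 is the exclusive-or function, which corresponds to f6 in Fig. 03-11.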
3.4 Boolean expression

The information contained in the truth table can be expressed algebraically as well. Boolean
expression is a base for designing the gate network which is a logical combinational circuit.
The terms that relate to Boolean expression, are:
Literal – a variable or the complement of a variable, for instance a1, a1’, x1, y1, z1, Tap1 …
Product term – a single literal or a logical product of two or more literals, for instance x·y·z’, Y’·Z·X’, z …
Sum term – a single literal or a logical sum of two or more literals, for instance a’, x + y’ + z, X’ + Y …
Normal term – a product or sum term in which each variable appears only once.
Minterm – an n-variable minterm is a normal product term with n literals. The result of a minterm is equal to 1 for only one combination of values of the n variables, and it is equal to 0 for the remaining combinations. There are 2^n such minterms.
The condition for assembling the minterm is that the product must be equal to 1 for only one combination of variables (axiom A4), and the product must be equal to 0 for the remaining combinations, Fig. 03-12. It means that if a variable x has value 0, then the complement of x (not x) is used in the minterm. And if a variable x has value 1, then the variable is used in the direct form, x.
A minterm is denoted by m_i (small letter m) with an index. The index is a decimal number that corresponds to the n-tuple: each n-tuple of variables can be read as a binary number.
4-tuple, where variables are in order a, b, c, and d
For 4-tuple (0 1 0 1), the minterm m5 is a’bc’d. The product is equal to 1 if all variables have value 1, axiom A4.
For 4-tuple (0 1 0 1), the maxterm M5 is a + b’ + c + d’. The sum is equal to 0 if all variables have value 0, axiom A4D.
Fig. 03-12 Minterm and maxterm for 4-tuple
Maxterm – an n-variable maxterm is a normal sum term with n literals. The result of a maxterm is equal to 0 for only one combination of values of the n variables, and it is equal to 1 for the remaining combinations. There are 2^n such maxterms.
The condition for assembling the maxterm is that the sum must be equal to 0 for only one combination of variables (axiom A4D), and the sum must be equal to 1 for the remaining combinations, Fig. 03-12. It means that if a variable x has value 0, then the variable is used in the direct form, x. And if a variable x has value 1, then the complement of x (not x) is used in the maxterm.
A maxterm is denoted by M_i (capital letter M) with an index. The index is a decimal number that corresponds to the n-tuple: each n-tuple of variables can be read as a binary number.
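The assembling rules for minterms and maxterms can be written down as two short functions. The names minterm and maxterm and the textual output format (an apostrophe for the complement) are assumptions of this sketch. The index is converted to its binary n-tuple, and each bit decides whether the variable appears in the direct or in the complemented form.

```python
def minterm(index, variables):
    """Product term that is equal to 1 only for the n-tuple given by index."""
    bits = format(index, "0{}b".format(len(variables)))
    # Value 1 -> direct form, value 0 -> complement.
    return "".join(v if bit == "1" else v + "'" for v, bit in zip(variables, bits))

def maxterm(index, variables):
    """Sum term that is equal to 0 only for the n-tuple given by index."""
    bits = format(index, "0{}b".format(len(variables)))
    # Value 0 -> direct form, value 1 -> complement.
    return " + ".join(v if bit == "0" else v + "'" for v, bit in zip(variables, bits))

print(minterm(5, "abcd"))  # a'bc'd
print(maxterm(5, "abcd"))  # a + b' + c + d'
```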
Each Boolean function can be decomposed into the addition or multiplication of the simplest functions f_i, where each simplest function f_i defines the output for one row of the truth table. The sum of the simplest functions f_i is in Fig. 03-13; axiom A5D and theorem T1 were used. The function f can be expressed by formula (0304). Theorem T1 (a + 0 = a) simplifies formula (0304) into (0305).
a  b  |  f  =  f0  +  f1  +  f2  +  f3
0  0  |  0  =  0   +  0   +  0   +  0
0  1  |  1  =  0   +  1   +  0   +  0
1  0  |  1  =  0   +  0   +  1   +  0
1  1  |  0  =  0   +  0   +  0   +  0
Fig. 03-13 Sum-decomposition of function
f = f0 + f1 + f2 + f3    (0304)
f = f1 + f2    (0305)
The simplest functions f1 and f2 behave as a logical multiplication: only one combination of inputs has value 1 and the remaining combinations have value 0. Therefore, the simplest functions f1 and f2 can be expressed by minterms. The minterm for function f1 is a’b and the minterm for function f2 is ab’. Applying the minterms to formula (0305) gives formula (0306). Formula (0306) also defines the exclusive-or function.
f(a, b) = a’b + ab’    (0306)
Similarly, if axiom A5 and theorem T1D are applied, then Fig. 03-14 shows the product-decomposition of function f into the simplest functions f_i. The function f can be expressed by formula (0307). Theorem T1D (a · 1 = a) simplifies formula (0307) into (0308).
a  b  |  f  =  f0  *  f1  *  f2  *  f3
0  0  |  0  =  0   *  1   *  1   *  1
0  1  |  1  =  1   *  1   *  1   *  1
1  0  |  1  =  1   *  1   *  1   *  1
1  1  |  0  =  1   *  1   *  1   *  0
Fig. 03-14 Product-decomposition of function
f = f0 * f1 * f2 * f3    (0307)
f = f0 * f3    (0308)
The simplest functions f0 and f3 behave as a logical addition: only one combination of inputs has value 0 and the remaining combinations have value 1. Therefore, the simplest functions f0 and f3 can be expressed by maxterms. The maxterm for function f0 is a + b and the maxterm for function f3 is a’ + b’. Applying the maxterms to formula (0308) gives formula (0309).
f(a, b) = (a + b)(a’ + b’)    (0309)
The forms of formulas (0306) and (0309) have their names. The name of formula (0306) is the Canonical Disjunctive Normal Form (CDNF), the minterm canonical form or the sum of products. It means that each product contains all variables and corresponds to one row of the truth table. The name of formula (0309) is the Canonical Conjunctive Normal Form (CCNF), the maxterm canonical form or the product of sums. It means that each sum contains all variables and corresponds to one row of the truth table, [wiki_0303]. A general notation of the canonical normal forms is given by formula (0310) for the sum of products and by formula (0311) for the product of sums.
f(x_{n-1}, \dots, x_1, x_0) = \sum m_i    (0310)
f(x_{n-1}, \dots, x_1, x_0) = \prod M_i    (0311)
Where
• f(x_{n-1}, …, x_1, x_0) is a Boolean function with the definition of the order of the variables.
• m_i are minterms.
• M_i are maxterms.
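Both canonical forms can be derived from a truth table row by row. The sketch below is an illustration for complete Boolean functions only; the function name canonical_forms and the textual notation are assumptions of this example. Rows with output 1 contribute a minterm to the sum of products and rows with output 0 contribute a maxterm to the product of sums.

```python
from itertools import product

def canonical_forms(function, variables):
    """Return the (CDNF, CCNF) strings of a complete Boolean function."""
    products, sums = [], []
    for values in product((0, 1), repeat=len(variables)):
        if function(*values):
            # Output 1 -> minterm: complement the variables with value 0.
            products.append("".join(
                v if x else v + "'" for v, x in zip(variables, values)))
        else:
            # Output 0 -> maxterm: complement the variables with value 1.
            sums.append("(" + " + ".join(
                v + "'" if x else v for v, x in zip(variables, values)) + ")")
    return " + ".join(products), "".join(sums)

cdnf, ccnf = canonical_forms(lambda a, b: a ^ b, "ab")
print(cdnf)  # a'b + ab'
print(ccnf)  # (a + b)(a' + b')
```

For the exclusive-or function the output agrees with formulas (0306) and (0309). A constant function yields an empty string for one of the two forms, which this sketch does not treat specially.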
Practice uses the minterm and the maxterm canonical forms for defining a Boolean function. The notation must take into account the incomplete Boolean function with the value “don’t care”. The notations below, formulas (0310) and (0311), have two parts; the first part describes the value 1 or 0 and the second one describes the value “don’t care”. Formula (0310) uses minterms for the description and formula (0311) uses maxterms.
f(x_{n-1}, \dots, x_1, x_0) = \sum m(i, j, \dots) + \sum d(k, l, \dots)    (0310)
f(x_{n-1}, \dots, x_1, x_0) = \prod M(o, p, \dots) \cdot \prod D(r, s, \dots)    (0311)
Where
• f(x_{n-1}, …, x_1, x_0) is a Boolean function with the definition of the order of the variables.
• m(i, j, …) is a list of indexes that correspond to the minterms.
• d(k, l, …) is a list of indexes that correspond to the minterms for the value “don’t care”.
• M(o, p, …) is a list of indexes that correspond to the maxterms.
• D(r, s, …) is a list of indexes that correspond to the maxterms for the value “don’t care”.
Fig. 03-15 shows the definition of an incomplete Boolean function with the derived minterms and maxterms. The application of the minterm and the maxterm canonical forms for the definition of the Boolean function is given by formulas (0312) and (0313). The substitution of indexes by minterms or maxterms is problematic, because at that moment the “don’t care” value gets a specific value, 1 or 0. After this substitution, the incomplete Boolean function is not uniquely defined.
f(a, b, c) = \sum m(1, 2, 5) + \sum d(3, 6)    (0312)
f(a, b, c) = \prod M(0, 4, 7) \cdot \prod D(3, 6)    (0313)
Index  a  b  c  f   Minterm              Maxterm
0      0  0  0  0                        M0 = a + b + c
1      0  0  1  1   m1 = a’ · b’ · c
2      0  1  0  1   m2 = a’ · b · c’
3      0  1  1  X   d3 = a’ · b · c      D3 = a + b’ + c’
4      1  0  0  0                        M4 = a’ + b + c
5      1  0  1  1   m5 = a · b’ · c
6      1  1  0  X   d6 = a · b · c’      D6 = a’ + b’ + c
7      1  1  1  0                        M7 = a’ + b’ + c’
Fig. 03-15 Incomplete Boolean function with minterms and maxterms
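The two index notations describe the same truth table, so one can be derived from the other. The sketch below computes the maxterm indexes and the output column from the minterm and “don’t care” indexes; the function names maxterm_indexes and output_column are hypothetical names for this illustration.

```python
def maxterm_indexes(n, ones, dont_cares):
    """Indexes of the rows with output 0, given the rows with output 1 and X."""
    return [i for i in range(2 ** n) if i not in ones and i not in dont_cares]

def output_column(n, ones, dont_cares):
    """Output column of the truth table as a string of 0, 1 and X."""
    return "".join("1" if i in ones else "X" if i in dont_cares else "0"
                   for i in range(2 ** n))

# Incomplete function of Fig. 03-15: m(1, 2, 5) and d(3, 6).
print(maxterm_indexes(3, {1, 2, 5}, {3, 6}))  # [0, 4, 7]
print(output_column(3, {1, 2, 5}, {3, 6}))    # 011X01X0
```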
3.5 Reference
[Roth_2004] Charles H. Roth, Jr.: Fundamentals of Logic Design; Thomson 2004, ISBN 0-534-37804-8
[Mendelson_1997] Elliott Mendelson: Introduction to Mathematical Logic, Fifth Edition
(Discrete Mathematics and Its Applications); Chapman and Hall/CRC 2009,
ISBN-13: 978-1584888765
[Warkley_2006] John F. Wakerly: Digital Design, Principles and Practices, Fourth Edition;
Prentice Hall 2006, ISBN 0-13-186389-4
[wiki_0301] Boolean algebra; http://en.wikipedia.org/wiki/Boolean_algebra; on line 2014-10-21
[wiki_0302] Boolean function; http://en.wikipedia.org/wiki/Boolean_function; on line 2014-10-21
[wiki_0303] Canonical normal form; http://en.wikipedia.org/wiki/Canonical_normal_form; on line 2014-10-21
[wiki_0304] Truth table; http://en.wikipedia.org/wiki/Truth_table; on line 2014-10-21
4 Design of Boolean function
The basic steps of digital system design are synthesis and realization. Synthesis is a
process that transforms a description of the digital system into a form suitable for
realization. If small-scale integration is used for the realization, a circuit diagram is a
suitable result of synthesis. The circuit diagram shows the connection of gates, flip-flops,
multiplexers, etc. With increasing integration, the result of synthesis becomes a file that
is independent of the realization. For programmable logic devices, the result of synthesis
is used to generate a file that describes the state of the programmable fuses. For an
application-specific integrated circuit (ASIC), this file is the input for the design of the
masks used to produce the integrated circuit.
Synthesis is performed by special programs, called synthesis tools, which are part of
electronic design automation (EDA) tools, [wiki_0401]. The input of synthesis consists of
files written in a hardware description language (HDL). These files can describe Boolean
functions, finite state machines, a system at a higher level of description such as the
register transfer level (RTL), and/or a system with component instantiation. This chapter is
devoted to the synthesis of simple Boolean functions, and the expected result is
combinational logic, [wiki_0403]. In this case, the combinational logic is a collection of
connected logic gates. The behavior of combinational logic and of a Boolean function is the
same: in both, the output depends only on the current input.
One result of synthesis is a circuit diagram of the combinational logic. A circuit diagram
is a graphical representation of a Boolean expression, [wiki_0402]. The circuit diagram of
combinational logic consists of logic gates and links, Fig. 04-01. Logic gates represent
Boolean functions or operations. The connection of the gates is given by the order of
evaluation of the expression. The multiplication cb’ has a higher priority than
non-equivalence; the operator of non-equivalence is the circled plus, ⊕. Fig. 04-01 shows
a 3-level combinational logic.
[Figure: circuit diagram with inputs a, b’, c and output y; expression y = a((b’c)cb’)]
Fig. 04-01 Circuit diagram and corresponding expression
The preferred result of Boolean function synthesis is a two-level combinational logic.
The importance of two-level logic lies in the following:
• Two-level logic corresponds to the canonical form of a Boolean expression.
• The realization of two-level logic has a minimum propagation delay.
• The realization of two-level logic leads to the minimum number of logic gates.
[Figure: two-level circuit with inputs a, b, c’ and output y = a + b c’]
Fig. 04-02 Circuit diagram with 2-level logic
This chapter is devoted to the synthesis of small two-level combinational logic, Fig. 04-02,
and to synthesis that results in the minimum number of basic logic gates. The logic gates
AND, OR, NOT, NAND and NOR are considered basic logic gates.
Synthesis with other criteria is outside the scope of this textbook. Examples are the
synthesis of multilevel logic, synthesis using special logic gates such as XOR gates, and
the synthesis of large Boolean functions with minimum propagation delay. These principles
of synthesis can be found in the literature [Ergovac_Lang_2004], [Koren_2008],
[Katz_Borriello_2005], [Roth_2004], [Warkley_2006], [Fristacky_1986] and others.
Note to some interesting synthesis tasks
• Synthesis of a 64-bit adder or multiplier with the minimum propagation delay criterion.
• Synthesis of a comparison circuit for two 128-bit numbers.
• Synthesis of circuits that are based on linear algebra, e.g. coding, encryption, etc.□
4.1 Logic gate

Logic gates are the basic elements of combinational logic and are described by the
principles of Boolean algebra. Each logic gate has a corresponding Boolean function and an
assigned operator that is used in Boolean expressions. A logic gate also has a graphical
symbol that is used for drawing the circuit diagram.
The realization of a logic gate can be based on pneumatic, hydraulic, electric and other
principles. However, the most widely used logic gates are based on electronic principles,
and all modern digital systems are realized by these electronic logic gates.
Each logic gate can be described in several ways, and one gate may have several names for
its operation. The basic description of a logic gate is its truth table and the
corresponding Karnaugh map. Next, the gate is described by sentences and by possible
program statements. Some expressions are derived by using DeMorgan rules. The description
also contains the graphical symbol that is used in the circuit diagram.
NOT gate. The NOT gate is usually called an inverter, [wiki_0404]. It produces the output
value that is the opposite of its input value. Alternate names for the operation are
inversion, one’s complement, negation, complement, etc. The NOT gate can be described by:
• NOT gate produces 1 if the input is equal to 0 and vice versa.
Possible expressions and statements in programming languages are:
• not a; !a; ~a; a’; ā; ¬a
• y = !a; y = ~a;
• If (a==1) then z=0; else z=1;
• If (a==0) z=1; else z=0;
Truth table
a y
0 1
1 0
Karnaugh map (columns a = 0, 1)
1 0
Fig. 04-03 NOT gate
AND gate. The AND gate corresponds to Boolean multiplication; alternate names are
conjunction, operation AND, logical multiplication, logical product, [wiki_0405]. The AND
gate can be described by one of the sentences:
• AND gate produces 1 if and only if all of its inputs are equal to 1.
• The product is equal to 1 if inputs “a” and simultaneously “b” are equal to 1; else it is
equal to 0.
• The product is equal to 0 if input “a” or “b” equals 0; else it is equal to 1.
Possible expressions and statements in programming languages are:
• a · b; ab; a AND b; a ∧ b; a & b; a && b
• y := a && b; y := a & b; y = !(!a || !b); y = !(!a | !b);
• If (a==1 & b==1) z=1; else z=0;
• If (a==0 | b==0) z=0; else z=1;
Truth table
a b y
0 0 0
0 1 0
1 0 0
1 1 1
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
0 0
0 1
Fig. 04-04 AND gate
OR gate. The OR gate corresponds to Boolean addition; alternate names are disjunction,
operation OR, logical addition, inclusive OR, logical sum, [wiki_0406]. The OR gate can be
described by one of the sentences:
• OR gate produces 1 if and only if one or more of its inputs are equal to 1.
• The sum is equal to 1 if input “a” or “b” is equal to 1; else it is equal to 0.
• The sum is equal to 0 if inputs “a” and simultaneously “b” are equal to 0; else it is
equal to 1.
Possible expressions and statements in programming languages are:
• a + b; a OR b; a # b; a ∨ b; a | b; a || b
• z = a || b; zz = a | b; z = !(!a && !b); zz = !(!a & !b);
• If (a == 1 | b == 1) then z = 1; else z = 0;
• If (a == 0 & b == 0) z = 0; else z = 1;
Truth table
a b y
0 0 0
0 1 1
1 0 1
1 1 1
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
0 1
1 1
Fig. 04-05 OR gate
NAND gate. The NAND gate is the negation of AND, [wiki_0407]. Alternate names are
non-conjunction, operation NAND, negation AND, negation of multiplication, non-product,
complement logical multiplication, not logical product, Sheffer stroke (↑). The NAND gate
can be described by one of the sentences:
• NAND gate produces 1 if and only if at least one of its inputs is equal to 0.
• The negation of product is equal to 1 if variable “a” or “b” is equal to 0; else it is
equal to 0.
• The negation of product is equal to 0 if variables “a” and simultaneously “b” are equal
to 1; else it is equal to 1.
Possible expressions and statements in programming languages are:
• !(a . b); (a . b)’; a NAND b; ¬(a ∧ b); ~(a & b); !(a && b);
• z := !(a && b); zz := !(a & b); z = !a || !b; zz = !a | !b; z := !(a & b & c);
• If (a == 1 & b == 1) z = 0 else z = 1;
• If (a == 0 | b == 0) z = 1 else z = 0;
The NAND operation does not fulfil the associative law.
Truth table
a b y
0 0 1
0 1 1
1 0 1
1 1 0
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
1 1
1 0
Fig. 04-06 NAND gate
NOR gate. The NOR gate is the negation of OR, [wiki_0408]. Alternate names are
non-disjunction, operation NOR, negation OR, negation of addition, Peirce's arrow (↓).
The NOR gate can be described by one of the sentences:
• NOR gate produces 1 if and only if all of its inputs are equal to 0.
• The negation of addition is equal to 1 if variables “a” and simultaneously “b” are equal
to 0; else it is equal to 0.
• The negation of addition is equal to 0 if variable “a” or “b” is equal to 1; else it is
equal to 1.
Possible expressions and statements in programming languages are:
• !(a + b); (a + b)’; a NOR b; ¬(a ∨ b); ~(a | b); !(a || b);
• z := !(a || b); zz := !(a | b); z = !a && !b; zz = !a & !b;
• If (a == 1 | b == 1) z = 0; else z = 1;
• If (a == 0 & b == 0) z = 1; else z = 0;
The NOR operation does not fulfil the associative law.
Truth table
a b y
0 0 1
0 1 0
1 0 0
1 1 0
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
1 0
0 0
Fig. 04-07 NOR gate
XOR gate. The XOR operation corresponds to the mathematical sum modulo 2. Alternate names
of the XOR gate are non-equivalence and exclusive OR, [wiki_0409]. The XOR gate can be
described by one of the sentences:
• The output of a 2-input XOR gate is equal to 1 if the inputs are not equal.
• The output of a 2-input XOR gate is equal to 0 if the inputs are equal.
• The XOR operation can be expressed as a XOR b = a ⊕ b = ab’ + a’b.
Possible expressions and statements in programming languages are:
• a ⊕ b; a XOR b; a ^ b;
• z := a ^ b; z = (!a & b) || (a & !b);
• If ((a & b) | (!a & !b)) z = 0; else z = 1;
• If ((a | b) & (!a | !b)) z = 1; else z = 0;
Truth table
a b y
0 0 0
0 1 1
1 0 1
1 1 0
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
0 1
1 0
Fig. 04-08 XOR gate
Note to XOR operation
• The XOR operation corresponds to the mathematical sum modulo 2.
• E.g. 1 ⊕ 0 ⊕ 0 ⊕ 1 ⊕ 1 = 1
• For two variables, it is non-equivalence: the result is 1 if the values of the variables
are not equal. □
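The correspondence between a chain of XOR operations and the sum modulo 2 can be tried out in Python. This is a small sketch; the helper name xor_chain is an assumption of this example.

```python
from functools import reduce
from operator import xor

def xor_chain(bits):
    """XOR of all bits, evaluated from left to right."""
    return reduce(xor, bits, 0)

bits = [1, 0, 0, 1, 1]
print(xor_chain(bits))   # 1, as in the example 1 ⊕ 0 ⊕ 0 ⊕ 1 ⊕ 1 = 1
print(sum(bits) % 2)     # 1, the sum modulo 2 gives the same value
```

The result is 1 exactly when the number of ones among the inputs is odd.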
XNOR gate. The XNOR operation is the negation of XOR. Alternate names are equivalence and
exclusive NOR, [wiki_0410]. The XNOR gate can be described by one of the sentences:
• The output of a 2-input XNOR gate is equal to 1 if the inputs are equal.
• The output of a 2-input XNOR gate is equal to 0 if the inputs are not equal.
• The XNOR operation can be expressed as a XNOR b = (a ⊕ b)’ = ab + a’b’.
Possible expressions and statements in programming languages are:
• !(a ⊕ b); NOT(a XOR b); ~(a ^ b); a≡b; a⇔b
• z := !(a ^ b); z = (!a & !b) || (a & b);
• If ((a & b) | (!a & !b)) z = 1 else z = 0;
• If ((a | b) & (!a | !b)) z = 0 else z = 1;
Truth table
a b y
0 0 1
0 1 0
1 0 0
1 1 1
Karnaugh map (columns a = 0, 1; rows b = 0, 1)
1 0
0 1
Fig. 04-09 XNOR gate
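The gate descriptions of this subchapter can be checked by exhaustive evaluation. The following Python sketch models each gate as a function on the values 0 and 1 (the function names are chosen for this example only). It verifies the expressions derived by DeMorgan rules and tests which operations fulfil the associative law.

```python
from itertools import product

def NOT(a): return 1 - a
def AND(a, b): return a & b
def OR(a, b): return a | b
def NAND(a, b): return NOT(AND(a, b))
def NOR(a, b): return NOT(OR(a, b))
def XOR(a, b): return a ^ b
def XNOR(a, b): return NOT(XOR(a, b))

# Alternative expressions listed for the gates, checked for all inputs.
for a, b in product((0, 1), repeat=2):
    assert AND(a, b) == NOT(OR(NOT(a), NOT(b)))            # y = !(!a | !b)
    assert OR(a, b) == NOT(AND(NOT(a), NOT(b)))            # z = !(!a & !b)
    assert NAND(a, b) == OR(NOT(a), NOT(b))                # z = !a | !b
    assert NOR(a, b) == AND(NOT(a), NOT(b))                # z = !a & !b
    assert XOR(a, b) == OR(AND(a, NOT(b)), AND(NOT(a), b))
    assert XNOR(a, b) == OR(AND(a, b), AND(NOT(a), NOT(b)))

def is_associative(gate):
    """True if gate(gate(a, b), c) == gate(a, gate(b, c)) for all inputs."""
    return all(gate(gate(a, b), c) == gate(a, gate(b, c))
               for a, b, c in product((0, 1), repeat=3))

for gate in (AND, OR, XOR, NAND, NOR):
    print(gate.__name__, is_associative(gate))
```

AND, OR and XOR pass the associativity test, while NAND and NOR do not. For example, NAND(NAND(0, 0), 1) = 0 but NAND(0, NAND(0, 1)) = 1.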
4.2 Synthesis
This subchapter is devoted to the synthesis of a Boolean function into 2-level
combinational logic. The goal is to find the two-level form of the Boolean expression with
the minimum number of literals. A literal is a variable in direct or negated form, e.g.
a or a’. The result of synthesis will therefore be either a minimum Sum of Products
(minimum disjunctive form) or a minimum Product of Sums (minimum conjunctive form).
For manual synthesis, two methods can be applied: minimization by using Boolean axioms and
theorems, and minimization by Karnaugh maps. Examples of minimization by Boolean theorems
are in the previous chapter and in Fig. 04-10. This principle is only suitable for a
complete Boolean function. An incomplete Boolean function contains the values 0, 1 and X
(don’t care). The value “don’t care” is then problematic, because it is impossible to
determine in advance which X-value terms are suitable for minimization.
Note to truth table of Boolean function
A Boolean expression cannot work with the value X (“don’t care”). This is a reason why it
is the truth table that uniquely defines a Boolean function.□
y = cba M(0, 4, 6) = M0 + M4 + M6
y = (c + b + a) (c’ + b + a) (c’ + b’+ a) =

T8D
= (c.c’ + b + a) (c’ + a + b.b’) Theorems T4D and T1
y = (b + a) (c’ + a)
Fig. 04-10 Minimization by Boolean theorems
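The result of Fig. 04-10 can be verified by comparing both expressions for all eight input combinations. This is a minimal Python sketch; the function names are illustrative, and the index is assumed to be 4c + 2b + a, following the variable order c, b, a.

```python
from itertools import product

def y_canonical(c, b, a):
    """Product of the maxterms M0, M4 and M6."""
    nc, nb = 1 - c, 1 - b
    return (c | b | a) & (nc | b | a) & (nc | nb | a)

def y_minimized(c, b, a):
    """Minimized form y = (b + a)(c' + a)."""
    return (b | a) & ((1 - c) | a)

zero_indexes = []
for c, b, a in product((0, 1), repeat=3):
    assert y_canonical(c, b, a) == y_minimized(c, b, a)
    if y_minimized(c, b, a) == 0:
        zero_indexes.append(4 * c + 2 * b + a)

print(zero_indexes)   # [0, 4, 6]
```

Both forms agree everywhere, and the function is 0 exactly at the indexes of the three maxterms.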
Another method of minimization is the Karnaugh map. The Karnaugh map is a graphical method
of simplification and minimization of a Boolean function for its realization. This method
is only suitable for 1 to 4 variables; experts can manage 5 or 6 variables. For more
variables, algorithms for simplification and minimization are preferred. A Karnaugh map is
an ordered arrangement of cells. The number of cells is equal to 2^n, where n is the number
of variables of the Boolean function. Karnaugh maps for 1, 2, 3 and 4 variables are in
Fig. 04-11. More information about Karnaugh maps is in the literature [wiki_0413],
[Fristacky_1986], [Warkley_2006], [Roth_2004] and [Katz_Borriello_2005].
Cell indexes of the maps (rows from top to bottom, separated by “/”):
1 variable (a):           0 1
2 variables (b, a):       0 1 / 2 3
3 variables (c, b, a):    0 1 3 2 / 4 5 7 6
4 variables (d, c, b, a): 0 1 3 2 / 4 5 7 6 / 12 13 15 14 / 8 9 11 10
Fig. 04-11 Karnaugh maps for 1, 2, 3 and 4 variables
Columns and rows are labeled by a line with the name of a variable. The cells covered by
the line have value 1 for the given variable, and the cells outside the line have value 0
for the given variable. On this basis, it is possible to derive the minterm and the maxterm
for each cell. The minterm or maxterm can therefore be expressed as an index, and each cell
has its index (green color). The minterm for the cell with index 13, binary 1101, is dcb’a.
The maxterm for the cell with index 3, binary 0011, is d + c + b’ + a’.
Note to labeling Karnaugh maps
[Figure: two 4-variable maps. In the first, the line of variable a marks the columns where
a has value 1; the remaining columns have a = 0. In the second, the line of variable d
marks the rows where d has value 1; the remaining rows have d = 0.]
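The derivation of the minterm and the maxterm from a cell index can be sketched in Python. The function names and the use of the ASCII apostrophe for negation are assumptions of this example; the variable order d, c, b, a follows the text.

```python
def minterm(index, names="dcba"):
    """Product of literals: bit 1 gives the direct variable, bit 0 the negated one."""
    bits = format(index, f"0{len(names)}b")
    return "".join(v if bit == "1" else v + "'" for v, bit in zip(names, bits))

def maxterm(index, names="dcba"):
    """Sum of literals: bit 0 gives the direct variable, bit 1 the negated one."""
    bits = format(index, f"0{len(names)}b")
    return " + ".join(v + "'" if bit == "1" else v for v, bit in zip(names, bits))

print(minterm(13))   # dcb'a           (13 = binary 1101)
print(maxterm(3))    # d + c + b' + a' (3 = binary 0011)
```

The minterm equals 1 only in its own cell, while the maxterm equals 0 only in its own cell, which is why the polarity of the literals is opposite.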
A truth table has rows and each row corresponds to a different minterm or maxterm. A
Karnaugh map has cells and each cell has a different minterm or maxterm. Therefore, it is
possible to transfer a truth table into a Karnaugh map, Fig. 04-12. Each minterm or maxterm
has its index, which is the same in the truth table, in the Karnaugh map and in the
definition of the Boolean function, formulas (0401) and (0402).
𝑓(𝑥𝑛−1 … 𝑥1 , 𝑥0 ) = ∑ 𝑚(𝑖, 𝑗, … ) + ∑ 𝑑(𝑘, 𝑙, … ) (0401)
𝑓(𝑥𝑛−1 … 𝑥1 , 𝑥0 ) = ∏ 𝑀(𝑜, 𝑝, … ) ∗ ∏ 𝐷(𝑟, 𝑠, … ) (0402)
Where
• f(xₙ₋₁ … x₁, x₀) is a Boolean function with the definition of the variable order.
• m(i, j, …) is a list of indexes that correspond to the minterms.
• d(k, l, …) is a list of indexes that correspond to the minterms for value “don’t care”.
• M(o, p, …) is a list of indexes that correspond to the maxterms.
• D(r, s, …) is a list of indexes that correspond to the maxterms for value “don’t care”.
𝑓(𝑐, 𝑏, 𝑎) = ∑𝑚(2, 4, 6) + ∑𝑑(1, 3)
𝑓(𝑐, 𝑏, 𝑎) = ∏𝑀(0, 5, 7) ∗ ∏𝐷(1, 3)
No. c b a f
0   0 0 0 0
1   0 0 1 X
2   0 1 0 1
3   0 1 1 X
4   1 0 0 1
5   1 0 1 0
6   1 1 0 1
7   1 1 1 0
Karnaugh map (cell indexes and values)
row c = 0, cells 0 1 3 2: 0 X X 1
row c = 1, cells 4 5 7 6: 1 0 0 1
Fig. 04-12 Enrolment of Boolean function into Karnaugh map
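The transfer of a function definition into the Karnaugh map can be sketched as a lookup of each cell index. In the following Python example the layout constant and the function name are assumptions of this sketch; the layout copies the cell indexes of the 3-variable map from Fig. 04-11.

```python
# Cell indexes of the 3-variable map, rows c = 0 and c = 1.
KMAP3 = [[0, 1, 3, 2],
         [4, 5, 7, 6]]

def fill_kmap(layout, minterms, dont_cares):
    """Replace every cell index by the function value '1', 'X' or '0'."""
    def value(index):
        if index in minterms:
            return "1"
        if index in dont_cares:
            return "X"
        return "0"
    return [[value(index) for index in row] for row in layout]

# f(c, b, a) = sum m(2, 4, 6) + sum d(1, 3)
for row in fill_kmap(KMAP3, {2, 4, 6}, {1, 3}):
    print(" ".join(row))
# 0 X X 1
# 1 0 0 1
```

The same function works for a 4-variable map if the layout with rows 0 1 3 2, 4 5 7 6, 12 13 15 14 and 8 9 11 10 is supplied.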
The arrangement of cells in the Karnaugh map ensures that adjacent cells differ in the
value of only one variable. Adjacent cells are neighbors horizontally or vertically (not
diagonally). The cells of the outer rows and columns also fulfil this condition,
[Katz_Borriello_2005]. It is possible to apply theorems T10 or T10D to adjacent cells or to
a group of adjacent cells. The number of cells in a group of adjacent cells is a power of 2.
The cell with index 2 has minterm d’c’ba’ and the cell with index 3 has minterm d’c’ba,
Fig. 04-13. If a sum of products is created, the corresponding Boolean expression is
d’c’ba’ + d’c’ba. Theorem T10 can be applied and the resulting expression is the product
term d’c’b. It is the blue circle in the Karnaugh map, Fig. 04-13. Similarly, it is possible
to derive the product term d’cb for cells 6 and 7. These two circles are also adjacent,
they differ in only one variable, and theorem T10 can be applied again. The result is the
product term d’b; it is the red circle in the Karnaugh map, Fig. 04-13.
Similarly, the above steps can be applied to maxterms, Fig. 04-13. Cells with indexes 4 and
6 are adjacent and theorem T10D can be applied to their maxterms,
(d + c’ + b + a) (d + c’ + b’ + a) = (d + c’ + a). The result is a sum term. Cells with
indexes 12 and 14 are adjacent to cells 4 and 6 and theorem T10D can be applied to the sum
terms, (d + c’ + a) (d’ + c’ + a) = (c’ + a).
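[Figure: two 4-variable Karnaugh maps. Left map: cells 2 and 3 give the product term d’c’b,
cells 6 and 7 give d’cb, and both circles together give d’b. Right map: cells 4 and 6 give
the sum term d + c’ + a, cells 12 and 14 give d’ + c’ + a, and both circles together give
c’ + a.]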
Fig. 04-13 Adjacent cells
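Adjacency can also be tested on the indexes themselves: two cells are adjacent when their binary indexes differ in exactly one bit. The following Python sketch (function names are illustrative) tests this condition and builds the product term that remains after the differing variable is removed, as theorem T10 does.

```python
def adjacent(i, j):
    """True if the indexes differ in exactly one bit."""
    diff = i ^ j
    return diff != 0 and (diff & (diff - 1)) == 0

def combine(i, j, names="dcba"):
    """Product term of two adjacent cells: the differing variable is dropped."""
    n = len(names)
    bits_i, bits_j = format(i, f"0{n}b"), format(j, f"0{n}b")
    return "".join(v if x == "1" else v + "'"
                   for v, x, y in zip(names, bits_i, bits_j) if x == y)

print(adjacent(2, 3), combine(2, 3))   # True d'c'b
print(adjacent(6, 7), combine(6, 7))   # True d'cb
print(adjacent(0, 8))                  # True, top and bottom outer rows
print(adjacent(0, 3))                  # False, the indexes differ in two bits
```

Cells 0 and 8 lie on opposite outer rows of the 4-variable map, and the test confirms that they are adjacent.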
For simplification and minimization of a Boolean function, circles are drawn in the
Karnaugh map. The number of cells in a circle is 1, 2, 4, 8, …, i.e. a power of 2, and the
cells must form a square or a rectangle. Once a circle is drawn in the Karnaugh map, the
corresponding product term or sum term can be derived directly from the map.
Note to adjacent cells on outer edges of Karnaugh map
• Cells on the left/right outer edges or on the top/bottom outer edges of the map
are adjacent.
• Consequently, the cells in the corners of the map are adjacent. □
For a product term the following holds:
• The circle must cover 1-cells, with suitable X-cells if needed.
• If the circle covers only areas of the map where the variable is 0, then the variable
is complemented in the product term.
• If the circle covers only areas of the map where the variable is 1, then the variable
is in direct form in the product term.
• If the circle covers areas of the map where the variable is 0 as well as areas where it
is 1, then the variable does not appear in the product term.
For a sum term the following holds:
• The circle must cover 0-cells, with suitable X-cells if needed.
• If the circle covers only areas of the map where the variable is 0, then the variable
is in direct form in the sum term.
• If the circle covers only areas of the map where the variable is 1, then the variable
is complemented in the sum term.
• If the circle covers areas of the map where the variable is 0 as well as areas where it
is 1, then the variable does not appear in the sum term.
Note to 1-cell, 0-cell and X-cell
• The notation 1-cell means that the cell has value 1.
• Consequently, a 0-cell has value 0 and an X-cell has value X. □
Note to the number of literals in a product term or a sum term
• If the Karnaugh map is for n variables and the circle contains 2^i cells, then the term
contains n - i literals.
• For example, the Karnaugh map is for 4 variables.
• If the circle contains 1 cell, then the term contains 4 literals.
• If the circle contains 2 cells, then the term contains 3 literals.
• If the circle contains 4 cells, then the term contains 2 literals.
• Etc.□
4.3 Minimization by Karnaugh map

The result of simplification and minimization by Karnaugh map is an expression in the form
of Sum of Products or Product of Sums. These forms are suitable for the realization of
2-level combinational logic. Sum of Products is derived for value 1 by the following
algorithm:
• Create all maximal circles that cover all 1-cells. A circle can include suitable X-cells
if they enlarge the circle. The possible number of cells in a circle is 1, 2, 4, 8, …,
i.e. a power of 2.
• Minimize the number of circles so that all 1-cells stay covered. The remaining circles
must have the minimum number of literals.
• Express the product terms for the remaining circles.
• Create the Sum of Products.
f(d, c, b, a) = ∑m(2, 3, 6, 10) + ∑d(1, 7, 9, 11, 14) (0403)
All practical minimizations will be performed on the incomplete Boolean function given by
formula (0403). Minimization for Sum of Products is shown in Fig. 04-14 and leads to 4
possible results. All results are equivalent with respect to the given incomplete Boolean
function. However, they correspond to 4 different complete Boolean functions. The resulting
functions are:
• Function f1 contains the yellow and green product terms and the minimum Sum of Products
is formula (0404).
• Function f2 contains the yellow and blue product terms and the minimum Sum of Products
is formula (0405).
• Function f3 contains the yellow and red product terms and the minimum Sum of Products
is formula (0406).
• Function f4 contains the red and green product terms and the minimum Sum of Products is
formula (0407).
Karnaugh map of function (0403) (rows from top to bottom, cells in the order of Fig. 04-11):
0 X 1 1
0 0 X 1
0 0 0 X
0 X X 1
Product terms: a’b (yellow), bd’ (green), ac’ (blue), bc’ (red); 4 possibilities
Fig. 04-14 Minimization for Sum of Products
f1 = a’b + bd’ (0404)
f2 = a’b + ac’ (0405)
f3 = a’b + bc’ (0406)
f4 = bd’ + bc’ (0407)
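That all four expressions realize the incomplete function (0403) can be checked by evaluating them in every cell that is not “don’t care”. The Python sketch below is an illustration only; the helper names are assumptions of this example, and the index is taken as 8d + 4c + 2b + a.

```python
ONES = {2, 3, 6, 10}
DONT_CARES = {1, 7, 9, 11, 14}

def bits(index):
    """Split a cell index into (d, c, b, a), with index = 8d + 4c + 2b + a."""
    return (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1

def inv(x):
    return 1 - x

SUMS_OF_PRODUCTS = {
    "f1": lambda d, c, b, a: (inv(a) & b) | (b & inv(d)),   # a'b + bd'
    "f2": lambda d, c, b, a: (inv(a) & b) | (a & inv(c)),   # a'b + ac'
    "f3": lambda d, c, b, a: (inv(a) & b) | (b & inv(c)),   # a'b + bc'
    "f4": lambda d, c, b, a: (b & inv(d)) | (b & inv(c)),   # bd' + bc'
}

def agrees_with_specification(func):
    """True if func is 1 on all 1-cells and 0 on all 0-cells; X-cells are free."""
    return all(func(*bits(i)) == (1 if i in ONES else 0)
               for i in range(16) if i not in DONT_CARES)

def on_set(func):
    """Indexes where the completed function has value 1."""
    return [i for i in range(16) if func(*bits(i))]

for name, func in SUMS_OF_PRODUCTS.items():
    print(name, agrees_with_specification(func), on_set(func))
```

Every expression passes the check, but the printed lists differ in the X-cells. This is the sense in which the four results are four different complete functions.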
One of the possible logical networks of combinational logic for the given Boolean function
is in Fig. 04-15. The logic gates AND and OR are chosen for the realization. The
application of these logic gates leads to 2-level combinational logic, and this realization
is also called AND-OR. The name is given by the order of the logic gates in the diagram.
The AND-OR network is natural for Sum of Products.
[Figure: two AND gates with inputs a’, b and b, d’ feed one OR gate with output
f1 = a’b + bd’]
Fig. 04-15 AND-OR circuit diagram for 2-level combinational logic
The next possibility is to create the minimum Product of Sums. The Product of Sums form is
derived for value 0 by the following algorithm:
• Create all maximal circles that cover all 0-cells. A circle can include suitable X-cells
if they enlarge the circle. The possible number of cells in a circle is 1, 2, 4, 8, …,
i.e. a power of 2.
• Minimize the number of circles so that all 0-cells stay covered. The remaining circles
must have the minimum number of literals.
• Express the sum terms for the remaining circles.
• Create the Product of Sums.
Karnaugh map of function (0403) (rows from top to bottom, cells in the order of Fig. 04-11):
0 X 1 1
0 0 X 1
0 0 0 X
0 X X 1
Sum terms: b, a’ + c’, c’ + d’, a’ + d’; 3 possibilities
Fig. 04-16 Minimization for Product of Sums
Minimization for Product of Sums is shown in Fig. 04-16 and leads to 3 possible results.
The function used is the same as in the previous example, formula (0403). All results are
equivalent with respect to the given incomplete Boolean function. However, they correspond
to 3 different complete Boolean functions. The resulting functions are:
• Function f5 contains the yellow and green sum terms and the minimum Product of Sums is
formula (0408).
• Function f6 contains the yellow and red sum terms and the minimum Product of Sums is
formula (0409).
• Function f7 contains the yellow and blue sum terms and the minimum Product of Sums is
formula (0410).
f5 = b (a’ + c’) (0408)
f6 = b (c’ + d’) (0409)
f7 = b (a’+ d’) (0410)
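The Product of Sums results can be checked against the incomplete function (0403) in the same way as the Sum of Products results. In this Python sketch the helper names are assumptions of the example, the index is 8d + 4c + 2b + a, and f5 is taken with the sum term a’ + c’ as listed among the sum terms of Fig. 04-16.

```python
ONES = {2, 3, 6, 10}
DONT_CARES = {1, 7, 9, 11, 14}

def bits(index):
    """Split a cell index into (d, c, b, a), with index = 8d + 4c + 2b + a."""
    return (index >> 3) & 1, (index >> 2) & 1, (index >> 1) & 1, index & 1

def inv(x):
    return 1 - x

PRODUCTS_OF_SUMS = {
    "f5": lambda d, c, b, a: b & (inv(a) | inv(c)),   # b (a' + c')
    "f6": lambda d, c, b, a: b & (inv(c) | inv(d)),   # b (c' + d')
    "f7": lambda d, c, b, a: b & (inv(a) | inv(d)),   # b (a' + d')
}

def agrees_with_specification(func):
    """True if func is 1 on all 1-cells and 0 on all 0-cells; X-cells are free."""
    return all(func(*bits(i)) == (1 if i in ONES else 0)
               for i in range(16) if i not in DONT_CARES)

def on_set(func):
    """Indexes where the completed function has value 1."""
    return [i for i in range(16) if func(*bits(i))]

for name, func in PRODUCTS_OF_SUMS.items():
    print(name, agrees_with_specification(func), on_set(func))
```

A sum term a’ + c would be 0 in cell 3 (binary 0011), which is a 1-cell, so only a’ + c’ is consistent with the function.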
One of the possible logical networks for Product of Sums uses OR and AND logic gates, in
this order. The circuit diagram of combinational logic for the given Boolean function is in
Fig. 04-17. The application of these logic gates leads to 2-level combinational logic, and
this realization is also called OR-AND. The name is given by the order of the logic gates
in the diagram. The OR-AND network is natural for Product of Sums.
[Figure: one OR gate with inputs c’, d’ feeds an AND gate together with input b; output
f6 = b (c’ + d’)]
Fig. 04-17 OR-AND circuit diagram for 2-level combinational logic
One incomplete Boolean function was minimized. Consider a “don’t care” cell, e.g. the cell
with index 7. The original value of the cell is “don’t care”. When the circles are created,
the value 0 or 1 is substituted into this cell. The Boolean function contains more cells
with the “don’t care” value. The result is 7 expressions for one incomplete Boolean
function. Four expressions are minimum Sums of Products and 3 expressions are minimum
Products of Sums. All 7 expressions correspond to the one incomplete Boolean function.
However, several different complete Boolean functions can be derived from it. This problem
is caused by the “don’t care” value.
If a complete Boolean function is minimized, it is also possible to obtain several minimum
expressions as Sum of Products or Product of Sums. In that case, all expressions correspond
to one complete function.
4.4 Realization by NAND and NOR logic gates

In the previous subchapter, the Boolean function is realized by AND and OR logic gates; the
required NOT gates are not drawn. The set of logic gates AND, OR and NOT is a complete set,
i.e. all Boolean functions can be realized by AND, OR and NOT logic gates. Other complete
sets are the set containing only the NAND logic gate and the set containing only the NOR
logic gate. It means that all Boolean functions can be realized only by NAND logic gates or
only by NOR logic gates. Likewise, all Boolean expressions can be rewritten to use only the
NAND or only the NOR operation, which can lead to combinational logic with more levels. If
only 2-level combinational logic is expected, then Sum of Products can be rewritten with
NAND operations and Product of Sums can be rewritten with NOR operations. The modification
is an application of the DeMorgan rules.
[Figure: NAND circuit for f1 = a’b + bd’. The 1st NAND has inputs a’, b and forms (a’b)’,
the 2nd NAND has inputs b, d’ and forms (bd’)’, and the 3rd NAND combines both outputs
into f1.]
Fig. 04-18 NAND combinational logic
NAND two-level combinational logic is derived from Sum of Products by the application of
theorem T5 and the DeMorgan rule. For an expression, e.g. f1, formula (0404), it is
possible to write:
f1 = a’b + bd’ = ((a’b + bd’)’)’ = ((a’b)’ · (bd’)’)’
The result is an expression that contains only NAND operations. The circuit diagram is in
Fig. 04-18 and it is 2-level combinational logic.
NOR two-level combinational logic is derived from Product of Sums by the application of
theorem T5 and the DeMorgan rule. For an expression, e.g. f6, formula (0409), it is
possible to write:
f6 = b (c’ + d’) = ((b (c’ + d’))’)’ = (b’ + (c’ + d’)’)’
The result is an expression that contains only NOR operations. The circuit diagram is in
Fig. 04-19 and it is 2-level combinational logic.
[Figure: NOR circuit for f6 = b (c’ + d’). The 1st NOR has inputs c’, d’ and forms
(c’ + d’)’, and the 2nd NOR combines this output with b’ into f6.]
Fig. 04-19 NOR combinational logic
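Both conversions can be confirmed by exhaustive comparison. The following Python sketch evaluates the AND-OR form of f1 against its NAND-only form and the OR-AND form of f6 against its NOR-only form. The gate helpers nand and nor are written for this example, and the complemented inputs are assumed to be available, as in the figures.

```python
from itertools import product

def nand(*inputs):
    """Output 0 only if all inputs are 1."""
    return 1 - int(all(inputs))

def nor(*inputs):
    """Output 1 only if all inputs are 0."""
    return 1 - int(any(inputs))

def conversions_hold():
    for d, c, b, a in product((0, 1), repeat=4):
        na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
        f1_and_or = (na & b) | (b & nd)            # f1 = a'b + bd'
        f1_nand = nand(nand(na, b), nand(b, nd))   # ((a'b)' . (bd')')'
        f6_or_and = b & (nc | nd)                  # f6 = b (c' + d')
        f6_nor = nor(nb, nor(nc, nd))              # (b' + (c' + d')')'
        if f1_and_or != f1_nand or f6_or_and != f6_nor:
            return False
    return True

print(conversions_hold())   # True
```

The NAND network keeps the structure of the AND-OR network gate for gate, and the NOR network needs the input b in complemented form because b enters the second level directly.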
4.5 Algorithm of synthesis

The most recent way to describe a Boolean function is by a hardware description language
(HDL). This description is the input of synthesis, which is performed by an algorithm. An
example of the architecture in VHDL notation for the incomplete Boolean function, formula
(0403), is in Fig. 04-20. Two algorithms of synthesis are used in EDA tools: the
Quine-McCluskey algorithm and the Espresso algorithm.
architecture Data_Flow of Incomplete_Boolean_function is
begin
  f <= '0' when d='0' and c='0' and b='0' and a='0' else -- index 0
       '1' when d='0' and c='0' and b='1' and a='0' else -- index 2
       'X'; -- remaining indexes
end architecture Data_Flow;
Fig. 04-20 VHDL notation of incomplete Boolean function
The Quine-McCluskey algorithm, developed in the mid-1950s, finds the minimized
representation of any Boolean expression. It provides a systematic procedure for generating
all prime implicants and then extracting a minimum set of primes covering the on-set. The
method finds the prime implicants by repeatedly applying the Uniting theorem. The
contribution of Quine-McCluskey is to provide a tabular method that ensures that all prime
implicants are found. More information is in the literature [Warkley_2006],
[Katz_Borriello_2005], [Roth_2004], [wiki_0411] and [Fristacky_1986].
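The first phase of the Quine-McCluskey method, the generation of prime implicants, can be sketched in a few lines of Python. This is a simplified illustration rather than the implementation used by any EDA tool: the function names are assumptions of this example, terms are strings over 0, 1 and “-” (variable removed), and the second phase, the selection of a minimum cover, is omitted.

```python
from itertools import combinations

def combine(t1, t2):
    """Uniting theorem: merge two terms that differ in exactly one 0/1 position."""
    diff = [k for k, (x, y) in enumerate(zip(t1, t2)) if x != y]
    if len(diff) == 1 and "-" not in (t1[diff[0]], t2[diff[0]]):
        k = diff[0]
        return t1[:k] + "-" + t1[k + 1:]
    return None

def prime_implicants(n_vars, minterms, dont_cares=()):
    """Repeat the merging until no term can be merged; unmerged terms are prime."""
    terms = {format(i, f"0{n_vars}b") for i in set(minterms) | set(dont_cares)}
    primes = set()
    while terms:
        merged, used = set(), set()
        for t1, t2 in combinations(sorted(terms), 2):
            m = combine(t1, t2)
            if m is not None:
                merged.add(m)
                used.update((t1, t2))
        primes |= terms - used
        terms = merged
    return primes

def to_product(term, names="dcba"):
    """Write a term as a product of literals, e.g. '--10' as ba'."""
    return "".join(v if bit == "1" else v + "'"
                   for v, bit in zip(names, term) if bit != "-")

# Function (0403): f(d, c, b, a) = sum m(2, 3, 6, 10) + sum d(1, 7, 9, 11, 14)
primes = prime_implicants(4, [2, 3, 6, 10], [1, 7, 9, 11, 14])
print(sorted(to_product(p) for p in primes))
```

For function (0403) the sketch finds the four prime implicants a’b, bd’, ac’ and bc’ (printed with the variables in the order d, c, b, a), i.e. the four product terms of Fig. 04-14.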
The Espresso algorithm is a program for two-level Boolean function minimization, developed
at the University of California at Berkeley, and now a common subroutine for many logic
minimization tools. It combines many of the best heuristic techniques developed in earlier
programs, such as MINI and PRESTO. Although a detailed explanation of the operation of
Espresso is beyond the scope of this book, the basic ideas employed by the program are not
difficult to understand, [Katz_Borriello_2005] and [wiki_0412].
4.6 Reference
[Ergovac_Lang_2004] Milos D. Ercegovac, Tomas Lang: Digital Arithmetic; Morgan Kaufmann Publishers, 2004, ISBN 1-55860-798-6
[Fristacky_1986] Frištacký N., Kolesár M., Kolenička J., Hlavatý J.: Logické systémy; Alfa a SNTL 1986
[Katz_Borriello_2005] Randy H. Katz, Gaetano Borriello: Contemporary Logic Design, Second Edition; Prentice Hall 2005, ISBN 0-201-30857-6
[Koren_2008] Israel Koren: Computer Arithmetic Algorithms; A. K. Peters 2008; ISBN 1-56881-160-8
[Roth_2004] Charles H. Roth, Jr.: Fundamentals of Logic Design; Thomson 2004, ISBN 0-534-37804-8
[wiki_0401] Logic synthesis; http://en.wikipedia.org/wiki/Logic_synthesis; on line 2014-10-21
[wiki_0402] Circuit diagram; http://en.wikipedia.org/wiki/Circuit_diagram; on line 2014-11-05
[wiki_0403] Combinational logic; http://en.wikipedia.org/wiki/Combinational_logic; on line 2014-11-05
[wiki_0404] Inverter (logic gate); http://en.wikipedia.org/wiki/Inverter_(logic_gate); on line 2014-11-05
[wiki_0405] AND gate; http://en.wikipedia.org/wiki/AND_gate; on line 2014-11-05
[wiki_0406] OR gate; http://en.wikipedia.org/wiki/OR_gate; on line 2014-11-05
[wiki_0407] NAND gate; http://en.wikipedia.org/wiki/NAND_gate; on line 2014-11-05
[wiki_0408] NOR gate; http://en.wikipedia.org/wiki/NOR_gate; on line 2014-11-05
[wiki_0409] XOR gate; http://en.wikipedia.org/wiki/XOR_gate; on line 2014-11-05
[wiki_0410] XNOR gate; http://en.wikipedia.org/wiki/XNOR_gate; on line 2014-11-05
[wiki_0411] Quine–McCluskey algorithm; http://en.wikipedia.org/wiki/Quine%E2%80%93McCluskey_algorithm; on line 2014-11-11
[wiki_0412] Espresso heuristic logic minimizer; http://en.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer; on line 2014-11-11
[wiki_0413] Karnaugh map; http://en.wikipedia.org/wiki/Karnaugh_map; on line 2014-11-11
5 Real numbers
All of us use numbers. This chapter deals with their division into groups and with the
possibility of representing them in the digital world, mainly in the computer. In
mathematics, a real number is any number on the real number line from minus infinity to
plus infinity, [wiki_0501]. The boldface symbol R or ℝ (double-struck capital R, U+211D)
denotes the set of real numbers. Examples of real numbers are +1; -1; +1.41; -5467.01.
The set of real numbers is divided into two groups, rational numbers and irrational
numbers.
A rational number is any number that can be expressed as a quotient or fraction p/q of two
integers with the denominator q not equal to zero, [wiki_0502]. It means that integers are
part of the rational numbers, with the denominator equal to one, e.g. 5/1 equals 5. The
numbers 25/100 and b101/2^3 are also rational numbers and can be written as 0.25 and
b0.101; in this case, the number of digits is finite. On the contrary, the number 1/3
belongs to the second group of rational numbers, where the fraction is the only precise
notation of the number. The notation with the radix point, e.g. 0.333…, is not precise.
The set of rational numbers is denoted by boldface Q or ℚ (double-struck capital Q,
U+211A). Examples of rational numbers are +1; -1/1; 0.25; 2/3.
Irrational numbers are the remaining real numbers, i.e. those that are not rational,
[wiki_0503]. For example, the square root of two (√2) cannot be expressed precisely as a
number with a radix point and a finite number of digits (1.41…) or as a fraction. Other
examples are the number π and Euler's number e. We only use their approximate values,
3.14 or 2.71.
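Whether a rational number has a finite notation with the radix point depends on the radix. The following Python sketch uses the standard fractions module; the helper has_finite_expansion is an assumption of this example. It checks the fractions mentioned above and one further value, 1/10, which is added here only for illustration.

```python
from fractions import Fraction
from math import gcd

def has_finite_expansion(value, radix):
    """True if the fraction has finitely many digits after the radix point."""
    denominator = value.denominator   # Fraction is always kept in lowest terms
    common = gcd(denominator, radix)
    while common > 1:                 # remove all prime factors shared with the radix
        denominator //= common
        common = gcd(denominator, radix)
    return denominator == 1

print(has_finite_expansion(Fraction(25, 100), 10))      # True,  0.25
print(has_finite_expansion(Fraction(0b101, 2**3), 2))   # True,  b0.101
print(has_finite_expansion(Fraction(1, 3), 10))         # False, 0.333...
print(has_finite_expansion(Fraction(1, 10), 2))         # False in binary
```

The last line shows that a number with a finite decimal notation, such as 0.1, need not have a finite binary notation.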
Integers are the part of the rational numbers that can be expressed by a fraction with the
denominator equal to 1, [wiki_0504]. This means that an integer does not use the fractional
part of a number or the radix point. Integers are in the range from minus infinity to plus
infinity. The set of integers is denoted by the boldface symbol Z or ℤ (double-struck
capital Z, U+2124); examples are … -2; -1; 0; +1; +2 … Opinions differ on the division of
the set of integers into subsets, [wiki_0504], [wiki_0505] and [wiki_0506]. In the first
view, the set of integers consists of the subsets of natural numbers {+1, +2, +3 …}, zero
{0} and the opposites of the natural numbers {−1, −2, −3 …}, which are negative. In the
second view, the subset of natural numbers is {0, +1, +2, +3 …} and the subset of negative
numbers is {-1, -2, -3 …}. The set of natural numbers is usually denoted by the boldface
symbol N or ℕ (double-struck capital N, U+2115).
The summary above represents the mathematical point of view on numbers, but computer
science uses a different terminology. There are also limitations that are given by the finite
number of bits available for representing numbers. Therefore, a minimum and a maximum
number are defined and there is a gap between two neighboring numbers. In mathematics,
where no such limitations exist, it is possible to use numbers from the range of minus infinity
to plus infinity and there is no gap between two neighboring numbers. In computer science,
mainly in the definition of data types in programming languages, different terms are used:
• Floating point numbers are numbers with a radix point and an exponent; they correspond
to real numbers.
• Fixed point numbers are numbers where the position of the radix point is defined in
advance; they correspond to rational numbers.
• Integer numbers correspond to mathematical integer numbers and two data types
are defined, signed and unsigned integers.
o Signed integers correspond to the set of integer numbers, ℤ.
o Unsigned integers correspond to natural numbers, set ℕ = {0, +1, +2, …}.
The representation of numbers and its precision is a discipline of its own. Computer science
offers many examples of wrong computation; some of them were caused by human error and
others follow from the way numbers are represented. For better understanding, some
examples known from the literature are shown below.
5.1 Some famous bugs

Pentium bug. In 1994, a flaw in the division was discovered in the Pentium processor,
[Muller_2010], [Janeba_1995], [wiki_0507], [Intel_0501]. Equation (0501) shows the
correct division; the quotient actually produced by the flawed Pentium processor is shown in
equation (0502). The difference starts at the weight \(10^{-4}\).
\[ \frac{4195835}{3145727} = 1.333820449136241002 \tag{0501} \]
\[ \frac{4195835}{3145727} = 1.333739068902037589 \tag{0502} \]
This flaw was present in the first Pentium models with frequencies of 60, 90 and 100 MHz. The
Intel Corporation acknowledged the flaw in the algorithm, repaired it very quickly and continued
the production of new Pentium processor models. Moreover, the Intel Corporation offered
every customer a replacement of the original processor by a new one without the flaw.
Excel bug. The flaw in Excel 2007 occurred in calculations whose result was equal to 65535 or
very close to 65536, [Microsoft_0501], [Muller_2010]. The displayed results of formula
(0503) and multiplication (0504) were wrong.
\(65536 - 2^{-37}\) was displayed as 100001 instead of 65536 (0503)
77.1 * 850 was displayed as 100000 instead of 65535 (0504)
The flaw was only in the displayed results; in further calculations the correct numbers were used.
Microsoft explains this flaw in article [Microsoft_0501],
http://blogs.office.com/b/microsoft-excel/archive/2007/09/25/calculation-issue-update.aspx,
and a patch is available from [Microsoft_0502],
http://blogs.msdn.com/excel/archive/2007/10/09/calculation-issue-update-fix-available.aspx.
5.2 Serious problems

Some computations with different or incorrect results are shown below.
• Chaotic bank
The chaotic bank flaw begins as a story, [Muller_2010] and [Internet_0501]. “Recently,
Mr. Gullible went to the Chaotic Bank Society, to learn more about
the new kind of account they offer to their best customers. He was told:
You first deposit $e − 1 on your account, where e = 2.7182818 · · · is the base of the
natural logarithms. The first year, we take $1 from your account as banking charges.
The second year is better for you: We multiply your capital by 2, and we take $1 of
banking charges. The third year is even better: We multiply your capital by 3, and
we take $1 of banking charges. And so on: The n-th year, your capital is multiplied
by n and we just take $1 of charges. Interesting, isn’t it?”
The question is how much money will be on the account after 25 years.
The bank officer started thinking and tried to simulate it. The program in C language is in
Fig. 05-01; it was compiled with mingw32-gcc version 4.7.1 and its result is in Fig. 05-02.
Interesting, isn’t it? The officer then checked the calculation in Excel 2010, with the result
equal to $-2242373259. This indicates a problem: which result is correct? None of them; the
correct value is about $0.04, i.e. practically $0 on the account.
#include <stdio.h>

int main(void)
{
    /* initial deposit e - 1 in three floating point formats */
    float       single_account = 1.71828182845904523536028747135f;
    double      double_account = 1.71828182845904523536028747135;
    long double long___account = 1.71828182845904523536028747135L;
    int i;

    /* year i: the capital is multiplied by i and $1 of charges is taken */
    for (i = 1; i <= 25; i++)
    {
        single_account = i*single_account - 1;
        double_account = i*double_account - 1;
        long___account = i*long___account - 1;
    }
    printf("You will have $%+1.17e on your account. (Single precision)\n",
           single_account);
    printf("You will have $%+1.17e on your account. (Double_precision)\n",
           double_account);
    /* a long double argument needs the L length modifier */
    printf("You will have $%+1.17Le on your account. (Long precision) \n",
           long___account);
    return 0;
}
Fig. 05-01 Chaotic Bank
You will have $+5.68654735142289410e+017 on your account.(Single precision)
You will have $+1.20180724741044860e+009 on your account.(Double_precision)
You will have $-3.97983188024993510e-235 on your account.(Long precision)
Fig. 05-02 Result of Chaotic Bank simulation in C language
• Rump’s problem
The following formula (0505) was designed by Siegfried Rump in 1988, [Rump_1988], and
processed on an IBM 370 computer. The C program is in Fig. 05-03. The problem is also
described in [Muller_2010] and [Internet_0501].
\[ f(a, b) = 333.75\,b^{6} + a^{2}\,(11 a^{2} b^{2} - b^{6} - 121 b^{4} - 2) + 5.5\,b^{8} + \frac{a}{2b} \tag{0505} \]
#include <stdio.h>

int main(void)
{
    double a = 77617.0;
    double b = 33096.0;
    double b2,b4,b6,b8,a2,firstexpr,f;

    /* powers of a and b needed by formula (0505) */
    b2 = b*b;
    b4 = b2*b2;
    b6 = b4*b2;
    b8 = b4*b4;
    a2 = a*a;
    firstexpr = 11*a2*b2-b6-121*b4-2;
    f = 333.75*b6 + a2 * firstexpr + 5.5*b8 + (a/(2.0*b));
    // The same notation is used for single and long precision
    // printf("Single precision result: $ %+1.17e \n",ff);
    printf("Double precision result: $ %+1.17e \n",f);
    // printf("Long precision result: $ %+1.17Le \n",lf);
    return 0;
}
Fig. 05-03 Rump’s example
Single precision result: $ +2.03172224360162030e+029
Double precision result: $ +5.96060417881739890e+020
Long precision result: $ -9.38724727098368430e-323
Fig. 05-04 The result of Rump’s example
The program was compiled with mingw32, version 4.7.1, and was run with operands
a = 77617 and b = 33096. The results are in Fig. 05-04. The checking calculation was
produced in Excel 2010 with the result -1.180591621E+21. What is the correct result? The
correct result is -0.8273960599 …, [Muller_2010]. Interesting, isn’t it?
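#include <stdio.h>

int main(void)
{
    float a;
    int i;
    a = 0.2; a += 0.1; a -= 0.3;        /* mathematically, a is zero now */

    for(i = 0; a < 1.0; i++) a += a;    /* a is doubled until it is at least 1 */
    printf("i = %d a = %f\n", i, a);
    return 0;
}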
Fig. 05-05 Simple example
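• A simple example
A simple example is the program in Fig. 05-05, [wiki_0508]. The questions are:
• Is the program meaningful?
• How does this program finish? Is it an infinite loop?
The program will print the text “i = 27 a = 1.600000”. In exact arithmetic, a would be zero
and the loop would never end. In binary floating point, however, 0.2 + 0.1 − 0.3 leaves a small
rounding residue of about \(1.6 \cdot 2^{-27}\), which reaches 1.6 after 27 doublings.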
5.3 The representation of the number is not easy

The problems mentioned above occur in digital systems. Another category includes the results
of functions or expressions whose arguments are out of limits. For instance, arcsin(10) and
arccos(10) cannot be calculated, because the argument of arcsin or arccos must lie in the range
from -1 to 1; log(0) does not exist, its limit equals minus infinity; division by zero; the square
root of minus 2 is not defined in real numbers, and so on. In all these situations, every digital
system must produce the same result, because a calculation must be transferable between
systems. Scientists have studied these problems, as well as algorithms for the calculation of
complicated functions, and have found a satisfactory solution. The conclusions of
these studies for floating point numbers are expressed in the IEEE Standard for Floating-Point
Arithmetic, IEEE 754.
5.4 References
[Internet_0501] http://perso.ens-lyon.fr/jean-michel.muller/chapitre1.pdf; on line 2014-06-29
[Intel_0501] http://www.intel.com/support/processors/pentium/sb/CS-012748.htm; on line 2013-06-10
[Janeba_1995] M. Janeba: The Pentium Problem; http://www.willamette.edu/~mjaneba/pentprob.html; on line 2013-06-10
[Microsoft_0501] http://blogs.office.com/b/microsoft-excel/archive/2007/09/25/calculation-issue-update.aspx; on line 2013-06-12
[Microsoft_0502] http://blogs.msdn.com/excel/archive/2007/10/09/calculation-issue-update-fix-available.aspx; on line 2013-06-12
[Muller_2010] J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefevre, G. Melquiond, N. Revol, D. Stehle, S. Torres: Handbook of Floating-Point Arithmetic; Birkhauser Boston, a part of Springer Science+Business Media, LLC 2010; ISBN 978-0-8176-4704-9; e-ISBN 978-0-8176-4705-6
[Rump_1988] S. M. Rump: Algorithms for verified inclusion; in R. Moore, editor, Reliability in Computing, Perspectives in Computing, pages 109–126; Academic Press, New York, 1988
[wiki_0501] Real number; http://en.wikipedia.org/wiki/Real_number; on line 2013-06-07
[wiki_0502] Rational number; http://en.wikipedia.org/wiki/Rational_number; on line 2013-06-07
[wiki_0503] Irrational number; http://en.wikipedia.org/wiki/Irrational_number; on line 2013-06-07
[wiki_0504] Integer; http://en.wikipedia.org/wiki/Integer; on line 2013-06-07
[wiki_0505] Number; http://en.wikipedia.org/wiki/Number; on line 2014-06-23
[wiki_0506] Natural number; http://en.wikipedia.org/wiki/Natural_number; on line 2014-06-23
[wiki_0507] Pentium bug; http://en.wikipedia.org/wiki/Pentium_Bug; on line 2013-06-10
[wiki_0508] Guard digit; http://en.wikipedia.org/wiki/Guard_digit; on line 2014-08-21
[wiki_0509] Number line; http://en.wikipedia.org/wiki/Number_line; on line 2014-08-21
[wiki_0510] Real line; http://en.wikipedia.org/wiki/Real_line; on line 2014-08-21
6 Integer numbers
Integer numbers, designated ℤ, are numbers without a fractional part and their range is
from minus infinity to plus infinity. There are two opinions on the division of integer numbers
into subsets, but in computer science the preferred opinion is that the natural numbers
contain zero, {0, +1, +2, +3 …}, [wiktionary_0601], [wiki_0608] and [Internet_0601].
Reference [proofwiki_0601] even states “However, using ℕ = {0, 1, 2, 3 …} is a more modern
approach, particularly in the field of computer science, where starting the count at zero
is usual.” The application of integer numbers in computer science or informatics has some
limitations that are given by the binary numeral system and the limited word size. Firstly,
the finite word size causes a limited range of integer number representation. Secondly,
integer numbers use a sign (plus or minus); the sign is a special glyph and it does not belong
among the digits of the numeral system. The binary numeral system has only two digits, so
special techniques must be used to represent negative numbers.
Integer numbers contain natural numbers and negative numbers. □
The set of natural numbers is {0, +1, +2, +3 …} □
Computer science uses these terms related to integer numbers:
• Integer or signed integer; it is a positive or a negative number. In computers, negative
numbers are mainly represented by two’s complement; other principles are also used.
• Unsigned integer; a number without a sign, i.e. a natural or non-negative number,
set {0, 1, 2 …}.
• Binary numeral system; it is the basic way of representing integer numbers.
• Decimal numeral system; it is represented by BCD numbers – Binary Coded Decimal
numbers. There are special techniques to express decimal numbers and to use the
decimal numeral system in binary computing.
Integer numbers in the binary numeral system are the most frequently used in computers
and programming languages. Integer decimal numbers in BCD code are presented in a separate
section below. The following description concentrates on integer numbers in the binary
numeral system and the techniques for representing a sign.
The term integer, or int, also relates to programming languages, where the integer is a data
type. The range of representation is defined by the number of bits used in the word and it
depends on the implementation of the programming language. It does not have to be defined
by the computer architecture because it is possible to use a 64 bit integer on a 16 bit
architecture. Programming languages use terms such as short integer, long, long long and
double long integer, and the corresponding number of bits depends on the language and its
implementation.
Modern programming languages have begun to use new names in declarations; these names
contain the number of bits in the word, for example int8_t and uint8_t, and the same is valid
for 16, 32 and 64 bits. The notation int is meant for a signed and uint for an unsigned integer
declaration. These declarations were introduced by standard ISO/IEC 9899:1999. This
standard is known as C99, C language version 1999, [wiki_0601]. The same declarations
can be used in the C++ language since version C++11, [cppref_0601] and [Microsoft_0601].
Endianness, or endian, is also a common term; it defines the way a number is placed in the
memory. It specifies the order of the atomic elements of an n-bit object. For example, for a
32 bit number with the byte as the atomic element, the endianness defines the order of the
bytes in the memory, i.e. whether the most significant byte of the 32 bit number is placed at
a lower or a higher memory address.
6.1 Unsigned integer

It is a number without a sign, i.e. a non-negative number; the set is {0, 1, 2 …}. In
computer science, an unsigned integer in the binary numeral system has a limited range of
representation that is given by the number of bits in the word, formula (0601). The minimum
number is always zero. A typical placement in the computer word is in Fig. 06-01.
\[ 0 \ \text{to}\ 2^{n} - 1 \tag{0601} \]
Range of unsigned integer is 0 to \(2^{n} - 1\). □
Where
• n is the number of bits in the word.
MSB                  LSB
n-1                    0
 0  1  0  1  0  0  1  1
For 8 bits the range is from 0 to 255.
Number 129D is coded as 1000 0001B.
Fig. 06-01 Unsigned integer
In declarations in programming languages, it is possible to see the following notations for
unsigned integer data types:
• unsigned integer or unsigned int, where the number of bits in the word depends
on the language and its implementation.
• uintx_t, where x is 8, 16, 32 or 64, e.g. uint32_t is the declaration for a 32 bit
unsigned integer; it is valid for C99/C++11 and above.
• unsigned short int, unsigned long int or unsigned long long int.
6.2 Signed integer

A signed integer is the most important number type in computer science and it uses special
techniques or principles for sign representation. The range of representation then depends
on these techniques or principles. The main principle for expressing the sign is two’s
complement, which practically all current computers and programming languages use.
Signed integer. □
The possibility of representing negative numbers is useful for subtraction, where subtraction
is replaced by adding a negative number. Another issue of the representation is how many
zeros there are. Some techniques or principles have two zeros, plus and minus, and the
others have only one zero, a plus zero.
In programming languages, the integer number is a basic data type and in the declaration it
is possible to use the following notations, similar to those of an unsigned integer. In practice,
these declarations assume the application of two’s complement:
• signed integer, integer or int, where the number of bits in the word depends on the
language and its implementation.
• intx_t, where x is 8, 16, 32 or 64, e.g. int32_t for a 32 bit signed integer; it is valid for
C99/C++11 and above.
• signed short int, signed long int or signed long long int.
6.3 Sign and magnitude

This is a basic principle for the representation of the sign, where a special bit is used for the
sign, typically it is an MSB bit. Then, the word is divided into two parts, the first part is a
sign and the second part is a magnitude, where magnitude is the absolute value of a signed
integer. This principle has two zeros, plus and minus. The range of representation is given
by formula (0602) and a typical organization of the word is in Fig. 06-02, literature
[wiki_0602]. Signed magnitude is the most common way of representing the significand in
floating point.
\[ -(2^{n-1} - 1) \ \text{to}\ +(2^{n-1} - 1) \tag{0602} \]
Range of sign and magnitude is \(-(2^{n-1} - 1)\) to \(+(2^{n-1} - 1)\). □
Where
• n is the number of bits in the word.
MSB                  LSB
n-1                    0
 0  1  0  1  0  0  1  1
Sign (bit n-1), Magnitude (bits n-2 to 0)
For 8 bits the range is from -127 to +127.
Number +11D is coded as 0000 1011B.
Number -11D is coded as 1000 1011B.
Plus zero (+0) is 0000 0000B.
Minus zero (-0) is 1000 0000B.
Fig. 06-02 Sign and magnitude
6.4 1’s complement

Ones’ complement is a code for expressing a negative number, [wiki_0602]. It can be
defined in two equivalent ways, mathematically (0603) and logically (0604). The
formulas are defined for an n-bit word and in this word the MSB is always the sign bit,
Fig. 06-03. The sign bit is equal to 0 for a positive number and to 1 for a negative number.
The range of representation is symmetric (0605) and ones’ complement has two zeros, plus
and minus. Today, ones’ complement is used only sporadically because in the operations of
addition or subtraction it is necessary to correct the result by adding one; more details are
in [wiki_0603].
Ones’ complement: \({}^{1}A = 2^{n} - 1 - A\); \({}^{1}A = {\sim}A\). □
\[ {}^{1}A = 2^{n} - 1 - A \tag{0603} \]
\[ {}^{1}A = {\sim}A \tag{0604} \]
\[ -(2^{n-1} - 1) \ \text{to}\ +(2^{n-1} - 1) \tag{0605} \]
Range of ones’ complement is \(-(2^{n-1} - 1)\) to \(+(2^{n-1} - 1)\). □
Where
• \({}^{1}A\) is the denotation of ones’ complement; it is the ones’ complement of the
(positive) number A.
• A is a positive value, for which ones’ complement is calculated.
• ~ is bitwise negation, an operator of the C language.
MSB                  LSB
n-1                    0
 0  1  0  1  0  0  1  1
Sign (bit n-1)
For 8 bits the range is from -127 to +127.
Number +11D is coded as 0000 1011B.
Number -11D is coded as 1111 0100B.
Plus zero (+0) is 0000 0000B.
Minus zero (-0) is 1111 1111B.
Fig. 06-03 Ones’ complement
6.5 2’s complement

Two’s complement is also a way to represent a signed number in the binary numeral system.
It is defined by the mathematical formula (0606) and the logical formula (0607) for an n-bit
word, [wiki_0602] and [wiki_0604]. The MSB is the sign bit, 1 for a negative
number and 0 for a positive number, Fig. 06-04. Two’s complement has only one zero, a
plus zero. Two’s complement is widely used in computers and programming languages.
\[ {}^{2}A = 2^{n} - A \tag{0606} \]
\[ {}^{2}A = {}^{1}A + 1 = {\sim}A + 1 \tag{0607} \]
\[ -(2^{n-1}) \ \text{to}\ +(2^{n-1} - 1) \tag{0608} \]
Two’s complement: \({}^{2}A = 2^{n} - A\); \({}^{2}A = {}^{1}A + 1 = {\sim}A + 1\). □
Range of two’s complement is \(-(2^{n-1})\) to \(+(2^{n-1} - 1)\). □
Only one zero, plus 0. □
Where
• \({}^{2}A\) is the denotation of two’s complement; it is the two’s complement of the
(positive) number A.
• \({}^{1}A\) is the denotation of ones’ complement.
• A is a positive number for which two’s complement is calculated.
• ~ is bitwise negation, an operator of the C language.
MSB                  LSB
n-1                    0
 0  1  0  1  0  0  1  1
Sign (bit n-1): 0 plus, 1 minus
For 8 bits the range is from -128 to +127.
Number +11D is coded as 0000 1011B.
Number -11D is coded as 1111 0101B.
Only one zero, 0000 0000B (plus zero).
Fig. 06-04 Two’s complement
Note about the terminology of complements.
• The complement is a code, where the complement represents a negative number.
• Therefore, all formulas mentioned above define ones’ or two’s complement for a
positive number.
• For example, in a 4-bit word 0101B represents number +5 and 1011B represents number -5. □
6.6 Conversion to two’s complement

Two’s complement is defined by two formulas (0606), (0607), which can only be used if the
number is in the range of representation. It means that one bit is used for the sign and the
rest for the value. The computer has a word size of 8, 16 … bits. On the logical level, or in
a computer where the operation of subtraction is missing, it is necessary to use formula
(0607), which is known as “not B + 1”. In this formula, “not B” is ones’ complement, i.e. the
bitwise not that is performed by inverter gates. “Plus 1” is the addition of 1-bit information,
which is mostly performed through the input carry of the adder. The conversion
from a decimal number to two’s complement is in Fig. 06-05. First, the absolute value is
converted from the decimal to the binary numeral system in the well-known way, and then
two’s complement is calculated by means of the formula “not B plus 1”.
Decimal number -50 to two’s complement, 8 bit
• 50D = 0011 0010B
• Not B, ones’ complement is 1100 1101B
• Plus 1 is 1100 1110B
• -50D = \({}^{2}\)1100 1110B = 0xCE □
Not B plus 1 □
Fig. 06-05 Conversion to two’s complement by “not B plus 1”
Decimal number -50 to two’s complement, 8 bit
• \(2^{8}\) - 50D = 206D
• 206D = 1100 1110B = 0xCE
• -50D = \({}^{2}\)1100 1110B = 0xCE □
Fig. 06-06 Conversion to two’s complement by \(2^{n} - A\)
Secondly, formula (0606), the mathematical definition of two’s complement, can be used on
a higher level. It is preferred in decimal arithmetic as well. Two’s complement is calculated
in the decimal system and then it is converted to the binary numeral system.
\(2^{n} - A\) □
6.7 Conversion from two’s complement

Again, there are two ways. The first way is to change the negative number (two’s complement)
to the positive binary number (not B plus 1) and then to use the polynomial of the numeral
system for the conversion from the binary to the decimal numeral system, Fig. 06-07.
For positive numbers, the change from the negative to the positive number is skipped. On
the binary level of two’s complement, formula (0607) is used for the change from the positive
to the negative number and back. Formula (0607) has the meaning of “not B plus 1”,
Fig. 06-07.
Two’s complement 0x8C to decimal, 8 bit
• 0x8C = 1000 1100B
• Not B is 0111 0011B, 0x73
• Plus 1 is 0111 0100B, 0x74
• 0111 0100B = 74H = 7*16 + 4 = 116D
• Two’s complement \({}^{2}\)74H = -116D □
Not B plus 1 □
Fig. 06-07 Conversion of two’s complement to decimal number by “not B plus 1”
The second way is a direct conversion from two’s complement to the decimal numeral system,
given by formula (0609). It looks like the classical polynomial of the numeral system, but
the first element of the polynomial, the one for the MSB, has a minus sign, Fig. 06-08.
\[ -a_{n-1} 2^{n-1} + a_{n-2} 2^{n-2} + \cdots + a_{0} 2^{0} = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_{i} 2^{i} \tag{0609} \]
Where
• \(a_{i}\) is a binary digit.
Two’s complement 0x8C to decimal, 8 bit
• 0x8C = 1000 1100B
• \(-1 \cdot 2^{7} + 0 \cdot 2^{6} + 0 \cdot 2^{5} + 0 \cdot 2^{4} + 1 \cdot 2^{3} + 1 \cdot 2^{2} + 0 \cdot 2^{1} + 0 \cdot 2^{0}\)
= -128 + 12 = -116
• \({}^{2}\)8CH = -116D □
Polynomial \(-a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_{i} 2^{i}\) □
Fig. 06-08 Conversion of two’s complement to decimal number by polynomial
6.8 Offset binary

Offset binary is also called excess-K. It is a special scheme for the representation of a signed
or unsigned integer, where the minimal number corresponds to zero of the offset binary
number and the maximal number corresponds to the maximal offset binary number,
[wiki_0602] and [wiki_0605]. The basic principle is in Fig. 06-09. Offset binary is used
in digital to analog and analog to digital converters, and it is commonly used in floating
point for representing the exponent.
Offset binary or excess-K or biased number. □
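[Figure: the range from the minimal number to the maximal number on the number line
(running towards plus infinity) is shifted by the offset b onto the offset binary range
0 to \(2^{n} - 1\), for an n-bit representation.]
Fig. 06-09 Basic principle of offset binary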
The mathematical definition is given by formula (0610) and it is defined by the number b,
which is called the offset. The offset b may be any number and it moves the range of
representation along the number line. In computer science, two definitions of the offset are
used, \(2^{n-1}\) and \(2^{n-1} - 1\) for an n-bit word. The second definition is used in
floating point according to IEEE 754, [IEEE 754-2008]. The range of representation is given
by formula (0611).
Biased exponent of floating point is offset binary. □
\[ {}^{B}A = A + b \tag{0610} \]
\[ -(b) \ \text{to}\ +(-b + 2^{n} - 1) \tag{0611} \]
\({}^{B}A = A + b\). □
Where
• \({}^{B}A\) is an offset binary number and must be a natural number ℕ, (\({}^{B}A \geq 0\)).
• A is the integer number for which the offset binary is calculated; A is positive or
negative.
• b is the offset or bias; in standard IEEE 754 for floating point, the bias is \(2^{n-1} - 1\)
for an n-bit exponent.
Biased number \({}^{B}A\) is unsigned integer. □
The definition of the offset as \(2^{n-1}\) is very useful: on the binary level the MSB is the
sign, with the opposite meaning to the usual one, 0 is minus and 1 is plus. It is then possible
to change offset binary to two’s complement by inverting the sign bit (MSB), see Fig. 06-10.
The conversion to two’s complement is suitable for arithmetic operations.
Decimal number   Offset binary for 8 bit   Two’s complement
+127             1111 1111                 0111 1111
+126             1111 1110                 0111 1110
:                :                         :
+1               1000 0001                 0000 0001
0                1000 0000                 0000 0000
-1               0111 1111                 1111 1111
:                :                         :
-127             0000 0001                 1000 0001
-128             0000 0000                 1000 0000
Bias b is 128. The range of representation is \(-(2^{n-1})\) to \(+(2^{n-1} - 1)\); for an
8 bit word, the range is -128 to +127.
Offset is \(2^{n-1}\). □
Fig. 06-10 Relation between offset binary and two’s complement
Decimal number   Offset binary for 8 bit   Decimal value of offset binary
+128             1111 1111                 255
+127             1111 1110                 254
:                :                         :
+1               1000 0000                 128
0                0111 1111                 127
-1               0111 1110                 126
:                :                         :
-126             0000 0001                 1
-127             0000 0000                 0
Offset b is 127. The range of representation is \(-(2^{n-1} - 1)\) to \(+(2^{n-1})\); for an
8 bit word the range is -127 to +128.
Offset is \(2^{n-1} - 1\). □
Bias as \(2^{n-1} - 1\) is used by biased exponent in floating point. □
Fig. 06-11 Values for offset binary as biased exponent in 32-bit binary floating point word
One of the three fields of the floating point word is called the biased exponent and this
field is coded in offset binary. The IEEE standard prefers the term biased exponent and it
uses the term bias instead of offset. The range of representation is given by formula
(0612) for an n-bit word, and an example of biased exponent coding for the 32-bit binary
floating point word is in Fig. 06-11.
\[ -(2^{n-1} - 1) \ \text{to}\ +(2^{n-1}) \tag{0612} \]
Where
• n is the number of bits in the biased exponent.
6.9 Conversion to and from offset binary

The conversion to and from offset binary is given only by formula (0610), \({}^{B}A = A + b\).
The conversion may be calculated in any numeral system and an example is in Fig. 06-12.
Used formula \({}^{B}A = A + b\), with b = 127:
• A = +38: \({}^{B}A\) = 127 + 38 = 165
• A = -38: \({}^{B}A\) = 127 + (-38) = 89
Used formula \(A = {}^{B}A - b\), with b = 127:
• \({}^{B}A\) = 240: A = -127 + 240 = 113
• \({}^{B}A\) = 14: A = -127 + 14 = -113
Fig. 06-12 Example of conversion to and from offset binary
6.10 BCD numbers

The decimal numeral system is natural for humans and it is used in everyday life. The
computer, however, uses the binary numeral system for its computation, and the calculation
is limited by the finite word width for representing numbers. The introductory chapter
concerning real numbers shows errors and bugs that are generated by computers. The error in
the simple example (see the chapter above) can be removed by means of decimal arithmetic
and decimal numbers in binary computers. First of all, it is necessary to ensure the
representation of decimal numbers in the BCD code, Binary Coded Decimal. The BCD code is
the basic code; in the literature, it is possible to find other modifications, [wiki_0606],
[DEC_PDP11] and [DEC_VAX].
Decimal digit   BCD code
0               0000
1               0001
2               0010
3               0011
4               0100
5               0101
6               0110
7               0111
8               1000
9               1001

Sign       Binary   Hex
+          1010     A
-          1011     B
+          1100     C     Preferred
-          1101     D     Preferred
+          1110     E
unsigned   1111     F
Fig. 06-13 Definition of BCD code
The BCD code is defined by the table in Fig. 06-13, where each decimal digit in the range 0 to 9
is expressed by a 4-bit binary number. Ten combinations are used for the digits; the remaining
combinations are not used for coding decimal digits but for coding a possible sign. This group
of four bits is called a nibble; a byte has 2 nibbles and so on. When a decimal number has
more digits, each decimal digit is coded separately into one nibble; all nibbles are arranged
side by side and form a string, Fig. 06-14. Decoding means that the string is divided into
nibbles and each nibble is converted into a decimal digit.
Nibble is a 4-bit group and BCD is placed to the nibbles. □
BCD number has a string format. □
Note about the principle of 8-4-2-1.

Each number from the range 0 to 15 can be expressed as the addition of 8, 4, 2 and 1. For
example, 13 equals 8 + 4 + 1 and the binary code is 1101. □
Coding: 691 → 0110 1001 0001 → 011010010001
Decoding: 0111100000110100 → 0111 1000 0011 0100 → 7 8 3 4 → 7834
Fig. 06-14 BCD coding and decoding scheme
In computer science, BCD numbers use the BCD code and the nibbles are placed into bytes
in two ways, the packed or the unpacked format, with or without a sign. BCD numbers
are understood as a string, therefore the number of digits is variable, [DEC_VAX].
The terms related to BCD numbers in computer science are:
• Packed BCD means that each nibble of the byte or the word is used, Fig. 06-15. A
byte holds 2 decimal digits; both nibbles are used. In the word, each nibble is used for
the BCD code of a decimal digit or a sign.
• Unpacked BCD means that only one digit is placed into one byte, in the lower nibble,
Fig. 06-16. The higher nibble equals zero. In the word, each byte is used for one BCD
code of a decimal digit or a sign.
• Signed BCD numbers use the principle of sign and magnitude. One nibble is the sign
and it is mostly placed as the least significant nibble. For the sign, a combination
higher than 9 is used. The preferred combination for a plus sign is hex C, and for a
minus sign it is hex D, Fig. 06-13. This convention was derived from the accounting
terms (Credit and Debit).
• Unsigned BCD numbers. The string usually includes the combination hex F in the
least significant nibble as the expression of the unsigned format, [IBM_370],
[DEC_PDP11] and [DEC_VAX].
MSB                                              LSB
15                                                 0
 1  0  0  1   0  0  1  1   0  1  0  1   1  0  1  1
The least significant nibble is the sign. Decimal number is -935.
Preferred sign code: Hex C - plus, Hex D - minus, Hex F - unsigned.
Packed format. □
Fig. 06-15 Packed format with a sign in 16-bit word
MSB                                              LSB
15                                                 0
 0  0  0  0   0  0  1  1   0  0  0  0   1  0  1  0
The least significant byte is the sign. Decimal number is +3.
Preferred sign code: Hex C - plus, Hex D - minus, Hex F - unsigned.
Unpacked format. □
Fig. 06-16 Unpacked format with a sign in 16-bit word
The range of representation is given by the number of used nibbles. In the unpacked format,
it is the number of bytes minus the sign byte. In the unsigned format, it is necessary to
pay attention to whether the value hex F is used as the sign or not. Formulas (0613) and
(0614) define the range of representation.
\[ \text{For unsigned:}\quad 0 \ \text{to}\ (10^{n-1} - 1) \tag{0613} \]
\[ \text{For signed:}\quad -(10^{n-1} - 1) \ \text{to}\ +(10^{n-1} - 1) \tag{0614} \]
Where
• n is the number of used nibbles including or excluding the sign.
6.11 Ten’s complement

Ten’s complement is suitable for representing negative numbers or for subtraction in the
decimal numeral system. Ten’s complement is defined logically by formula (0615) and
mathematically by formula (0617), [wiki_0607].
\[ {}^{10}A = {}^{9}A + 1 \tag{0615} \]
\[ {}^{9}A:\ 0 \to 9,\ 1 \to 8,\ 2 \to 7,\ 3 \to 6,\ 4 \to 5,\ 5 \to 4,\ 6 \to 3,\ 7 \to 2,\ 8 \to 1,\ 9 \to 0 \tag{0616} \]
\[ {}^{10}A = 10^{n} - A \tag{0617} \]
\({}^{10}A = {}^{9}A + 1\); \({}^{10}A = 10^{n} - A\). □
Where
• \({}^{9}A\) is nine’s complement of A; it is the representation defined by formula (0616),
where digit 0 is replaced by 9, digit 1 is replaced by 8, digit 2 is replaced by 7, …
• \({}^{10}A\) is ten’s complement of A.
• n is the order, i.e. the number of decimal digits.
Ten’s complement may be used in arithmetic or for representing a signed decimal number. When it is used in arithmetic, the algorithm takes this fact into account. When the ten’s complement is used for representing a signed number, it is necessary to define the sign nibble, whose value determines whether the number is positive or negative. If the value is 0, 1, 2, 3 or 4, the sign is plus; the values 5, 6, 7, 8 or 9 define a minus sign.
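The two definitions of the ten’s complement can be compared with a short Python sketch. The function names and the use of decimal digit strings are assumptions made only for this illustration; they do not come from the cited literature.

```python
def nines_complement(digits: str) -> str:
    """Replace every decimal digit d by 9 - d, formula (0616)."""
    return "".join(str(9 - int(d)) for d in digits)


def tens_complement(digits: str) -> str:
    """Logical definition (0615): nine's complement plus 1, kept on n digits."""
    n = len(digits)
    value = (int(nines_complement(digits)) + 1) % 10**n
    return str(value).zfill(n)


def tens_complement_math(digits: str) -> str:
    """Mathematical definition (0617): 10^n - A, kept on n digits."""
    n = len(digits)
    return str((10**n - int(digits)) % 10**n).zfill(n)


print(tens_complement("095"))       # 905
print(tens_complement_math("095"))  # 905
```

Both functions give the same digit string; the leading digit 9 of 905 marks the value as negative under the sign rule described above.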
6.12 References
[DEC_VAX] VAX780 Architecture handbook; Digital Equipment Corporation, 1977; http://bitsavers.trailing-edge.com/pdf/dec/vax/VAX_archHbkVol1_1977.pdf; on line 2013-09-24
[DEC_PDP11] PDP11 processor handbook, PDP11/04/34a/44/60/70; Digital Equipment Corporation, 1979; http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp11/handbooks/PDP11_Handbook1979.pdf; on line 2013-09-24
[cppref_0601] Fixed width integer types (since C++11); http://en.cppreference.com/w/cpp/types/integer; on line 2014-07-07
[IBM_370] IBM System/370 Principles of Operation, IBM, March 1980
[IEEE 754-2008] IEEE Std 754™-2008, IEEE Standard for Floating-Point Arithmetic, 29 August 2008, revision of IEEE 754 – 1985
[Internet_0601] Natural numbers; http://www.abstractmath.org/MM/MMNaturalNumbers.htm; on line 2014-07-07
[Microsoft_0601] XMINT4.XMINT4(int32_t, int32_t, int32_t, int32_t) constructor; http://msdn.microsoft.com/en-us/library/windows/desktop/hh404668(v=vs.85).aspx; on line 2014-07-07
[proofwiki_0601] Definition: Natural Numbers; https://proofwiki.org/wiki/Definition:Natural_Numbers; on line 2014-07-07
[wiktionary_0601] Numeral numbers; http://en.wiktionary.org/wiki/natural_numbers; on line 2014-07-07
[wiki_0601] C data types; http://en.wikipedia.org/wiki/Inttypes.h#inttypes.h; on line 2013-09-18
[wiki_0602] Signed number representations; http://en.wikipedia.org/wiki/Signed_number_representations; on line 2013-09-19
[wiki_0603] Ones' complement; http://en.wikipedia.org/wiki/Ones%27_complement; on line 2013-09-19
[wiki_0604] Two's complement; http://en.wikipedia.org/wiki/Two%27s_complement; on line 2013-09-19
[wiki_0605] Offset binary; http://en.wikipedia.org/wiki/Offset_binary; on line 2013-09-22
[wiki_0606] Binary-coded decimal; http://en.wikipedia.org/wiki/Binary-coded_decimal; on line 2013-09-22
[wiki_0607] Method of complements; http://en.wikipedia.org/wiki/Method_of_complements; on line 2013-09-22
[wiki_0608] Natural numbers; http://en.wikipedia.org/wiki/Natural_numbers; on line 2014-07-07
7 Arithmetic operations on integer numbers
The basic arithmetic operations are addition, subtraction, multiplication and division. In a computer, these operations are performed by the ALU (Arithmetic Logic Unit), which is a part of the processor. The algorithm for performing these operations depends on the representation of negative numbers and on the finite word size of the ALU. Because the word size is finite, some results may be out of the range of representation; this situation is called overflow. Each processor has a status register or a similar register that contains the flags characterizing the properties of the result.
In a computer, the basic arithmetic operations are realized by logic circuits, either as combinational circuits or as synchronous digital systems based on an FSM – Finite State Machine. The ALU may be from one to n bits wide. For example, a one-bit arithmetic unit is used when the operands arrive as a serial stream, and then the arithmetic logic unit must be designed as a synchronous digital system. The addition, subtraction and multiplication may be realized as combinational circuits and/or synchronous digital systems, but the division is always a synchronous digital system with an FSM.
Operands may be unsigned or signed, in different codes. However, every processor has a binary adder, so it is necessary to define how the arithmetic operations on operands in different codes are performed on the binary adder. For example, operands in offset binary are added by a binary adder, BCD numbers are also added by a binary adder, and so on.
The choice of realization depends on the definition of the processor architecture and also on the time needed to calculate the operation. This time is called the propagation delay of the operation. Therefore, there are many possible realizations.
7.1 Flag bits of operations

These flags can be found in every computer. They characterize the results of arithmetic operations, [wiki_0701]. Flags are placed in the status register of the processor. The exact terminology depends on the processor architecture and the manufacturer. The basic flags are the negative, zero, overflow and carry flags. The setting of these flags is defined in detail by the processor architecture and the instruction set. The following description concentrates only on the addition, because the other arithmetic and logical operations differ.
Note to the range of representation
The ranges of representation are from 0 to $2^{n} - 1$ for unsigned integers and from $-2^{n-1}$ to $+(2^{n-1} - 1)$ for two’s complement.□
 N flag or S flag: the N flag is a negative flag and the S flag is a sign flag, [wiki_0702]. This flag is always the MSB bit of the result. When the result is understood as two’s complement, the N or S flag is the sign bit.
 Z flag is a zero flag, [wiki_0703]. When it is set, the result equals zero. It means that all bits of the result equal 0.
 V flag is an overflow flag, [wiki_0704], [DEC_PDP11], [Internet_0701]. It refers to the overflow in the addition or subtraction of two’s complement numbers. It means that the result is out of the representation range.
 In case of the addition A + B and two’s complement numbers, a situation
may occur when a negative result is produced by the addition of positive
operands. This is also valid vice versa: the overflow occurs if (+A) + (+B) = −C
or (−A) + (−B) = +C.
 In case of the subtraction A - B and two’s complement numbers, the over-
flow occurs in the situation when the operands have opposite signs and the
sign of operand B is the same as the sign of the result. The overflow occurs
when (+A) − (−B) = −C or (−A) − (+B) = +C.
 C flag is a carry flag, [wiki_0705]. The carry flag is used and generated by many instructions, typically a shift instruction with carry and so on. The following explanation concentrates only on the addition. In case of subtraction, the carry flag has two definitions according to the processor architecture. For the addition, the following rules apply:
 In the addition, the carry flag is the carry out of the MSB bit. This carry out can be used for the addition of higher weights or higher words, where the carry flag is the carry in for this addition.
 In the addition, the carry flag is the unsigned overflow; it means that the result is out of the range of representation for unsigned integers.
Note to the calculation of V flag, overflow
V flag is the exclusive or operation of the carry into the MSB bit and the carry out of the
MSB bit.□
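The flag rules for the addition can be tried out with a small Python model of an n-bit adder. This is only a sketch: the function name and the dictionary of flags are assumptions for the illustration, and a real processor may set the flags differently for other instructions.

```python
def add_with_flags(a: int, b: int, n: int = 4, carry_in: int = 0):
    """Add two n-bit patterns and return (sum, flags) as described in 7.1."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    total = a + b + carry_in
    s = total & mask                       # n-bit result
    c = (total >> n) & 1                   # carry out of the MSB bit
    low_mask = mask >> 1                   # all bits below the MSB
    # Carry into the MSB comes from adding the lower n-1 bits only.
    carry_into_msb = ((a & low_mask) + (b & low_mask) + carry_in) >> (n - 1)
    flags = {
        "N": (s >> (n - 1)) & 1,           # MSB bit of the result
        "Z": int(s == 0),                  # all result bits are zero
        "V": carry_into_msb ^ c,           # xor of carry into and out of MSB
        "C": c,
    }
    return s, flags


print(add_with_flags(0b0111, 0b1001))  # (0, {'N': 0, 'Z': 1, 'V': 0, 'C': 1})
print(add_with_flags(0b0111, 0b0010))  # (9, {'N': 1, 'Z': 0, 'V': 1, 'C': 0})
```

The first call is an unsigned overflow (7 + 9 does not fit into 4 bits), the second one is a two’s complement overflow (7 + 2 does not fit into the signed 4-bit range).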
7.2 Sign extension

Sign extension is an operation on the sign only. It is used when an n-bit number is placed into an m-bit word and m is higher than n, [wiki_0706]. For example, an 8-bit signed number is placed into a 16-bit word. The value must stay the same, which means that the sign bit of the number must be copied to the new higher bits, Fig. 07-01. When the number is negative, filling the higher bits with 1s does not change the value.
8-bit number, bits 7..0 (MSB..LSB):              1101 1011
16-bit word, bits 15..0 (MSB..LSB):    1111 1111 1101 1011
                                       (bits 15..8 are the sign extension)
Fig. 07-01 Sign extension
16-bit word, bits 15..0 (MSB..LSB):    1101 1011 0000 0000
After arithmetic shift right by one:   1110 1101 1000 0000
Fig. 07-02 Sign extension by arithmetic shift right
One possibility for calculating the sign extension is to use the arithmetic shift right. This shift copies the sign to the new sign bit, so the value of the sign remains the same. When an 8-bit signed number is placed into a 16-bit word, the 8-bit signed number is first placed into the higher byte of the word and then the arithmetic shift right is performed on the word 8 times, Fig. 07-02. As a result, the higher byte contains the sign extension and the lower byte contains the original 8-bit signed number.
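Both ways of calculating the sign extension can be sketched in Python. The function names are hypothetical, and because Python integers are unbounded, the m-bit word is modelled with explicit masks.

```python
def sign_extend(value: int, n: int, m: int) -> int:
    """Copy the sign bit of an n-bit number into bits n..m-1 of an m-bit word."""
    if m < n:
        raise ValueError("m must not be smaller than n")
    value &= (1 << n) - 1
    if (value >> (n - 1)) & 1:                      # negative number
        value |= ((1 << m) - 1) & ~((1 << n) - 1)   # fill higher bits with 1s
    return value


def sign_extend_by_shift(value: int, n: int, m: int) -> int:
    """Place the number into the top bits, then shift right arithmetically."""
    word = (value & ((1 << n) - 1)) << (m - n)
    for _ in range(m - n):
        msb = word & (1 << (m - 1))                 # keep a copy of the sign
        word = (word >> 1) | msb                    # arithmetic shift right
    return word


print(format(sign_extend(0b11011011, 8, 16), "016b"))           # 1111111111011011
print(format(sign_extend_by_shift(0b11011011, 8, 16), "016b"))  # 1111111111011011
```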
7.3 Addition, unsigned and two’s complement

The addition in the binary numeral system has the same principles as in the decimal numeral system. Just remember that 1 + 1 equals 10 in the binary numeral system and 10 binary is 2 decimal. Each bit position adds three values: the two values of the operands plus the carry from the previous order or bit. Also, each bit position generates two outputs, the sum and the carry to the next position. This scheme is the same for each bit position and this is the basic idea of the hardware realization, see the separate chapter.
$S = (A + B) \bmod 2^{n}$ (0701)
Where
 S is the result of the addition in an n-bit word.
 A, B are operands, sign extended to an n-bit word.
 n is the number of bits in the word.
In a computer, the adder size is defined as n bits and only these n bits may be used to hold the numbers. The addition in a computer is then defined by formula (0701). The formula generates the n-bit result, and each addition generates the sum and the flags that describe this result. This principle does not depend on the number of bits in the word, therefore 4-bit arithmetic and 4-bit numbers are used in the following examples. Below, there are examples of the addition with flag setting, mainly with comments on the V and C flags.
Note to the binary addition
In the decimal numeral system:    In the binary numeral system:
1 + 1 = 2                         1 + 1 = 10
2 + 1 = 3                         10 + 1 = 11
3 + 1 = 4                         11 + 1 = 100
:                                 :
The number of items in the result of the addition must be the same and does not depend on the numeral system.□
 In Fig. 07-03, when the operands are unsigned, the result has the overflow: the carry flag is set and the number 16 is out of the range. When the operands are in two’s complement, the result is correct and the overflow flag V is not set.
            Binary   Flags                 Check in decimal
                                           Unsigned   2’s com.
Operand A     0111                             7          7
Operand B   + 1001                           + 9       + (-7)
Sum S         0000   N=0, Z=1, V=0, C=1       ?0          0
Carry out     1
Fig. 07-03 Unsigned integer overflow
 In Fig. 07-04, when the operands are unsigned, the result is correct and the C flag is not set. When the operands are in two’s complement, the result is wrong: the overflow occurs and the V flag is set. The correct result of the addition is +9, but this value is out of the range of two’s complement representation. In this situation, the addition of two positive numbers generates a negative result. And vice versa, the addition of two negative numbers may generate a positive result. E.g., consider the 4-bit binary addition 0011 + 0110 = 1001. The result is correct for unsigned numbers, 3 + 6 = 9. However, the result has the overflow if the numbers are understood as two’s complement: (+3) + (+6) = -7 is wrong.
            Binary   Flags                 Check in decimal
                                           Unsigned   2’s com.
Operand A     0111                             7          7
Operand B   + 0010                           + 2        + 2
Sum S         1001   N=1, Z=0, V=1, C=0        9        ?-7
Carry out     0
Fig. 07-04 Overflow in case of two’s complement
 Fig. 07-05 shows the addition of 8-bit numbers on a 4-bit adder. The carry from the addition of the lower nibbles must be added to the addition of the higher nibbles. The least significant addition must start with a zero carry in. The realization of this principle of the addition is a synchronous digital system with an FSM.
8-bit operand A: 0100 1010
8-bit operand B: 0010 1001
            Addition of lower nibbles      Addition of higher nibbles
            Binary   Flags                 Binary   Flags
Carry in      0                              1
Operand A     1010                           0100
Operand B   + 1001                         + 0010
Sum S         0011   N=0, Z=0, V=1, C=1      0111   N=0, Z=0, V=0, C=0
Carry out     1                              0
Fig. 07-05 The 8-bit addition as sequence of two 4-bit additions
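The chaining of the carry between two 4-bit additions can be sketched as follows. The helper names are assumptions for this example, and the two calls stand for the two steps that an FSM would sequence in hardware.

```python
def add_nibble(a: int, b: int, carry_in: int):
    """4-bit adder: return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def add_8bit_on_4bit_adder(a: int, b: int):
    """Add two 8-bit numbers as a sequence of two 4-bit additions."""
    low, carry = add_nibble(a & 0xF, b & 0xF, 0)        # starts with zero carry in
    high, carry_out = add_nibble(a >> 4, b >> 4, carry)  # carry from lower nibbles
    return (high << 4) | low, carry_out


result, carry_out = add_8bit_on_4bit_adder(0b01001010, 0b00101001)
print(format(result, "08b"), carry_out)  # 01110011 0
```

The printed result matches Fig. 07-05: the lower nibbles give 0011 with carry 1, the higher nibbles give 0111 with carry 0.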
7.4 Subtraction, unsigned and two’s complement

The binary subtraction as a mathematical operation is given by the same algorithm as in the decimal numeral system. The subtraction where the sign is expressed separately is described in the subchapter dealing with the addition in the sign and magnitude. Whereas the addition uses the term carry for the value passed to the next order, the subtraction uses the term borrow. The subtraction can also be defined as the addition of a negative number, formula (0702), where the negative number is expressed as a complement. This method of the complement is preferred in computers. In the binary numeral system, the two’s complement is used: the negative number (-B) is converted into two’s complement.
$A - B = A + (-B) = A + {}^{2}B = A + \text{not}\,B + 1$ (0702)
Where
 A, B are binary operands.
 ${}^{2}B$ is the two’s complement of B.
 not B is the bitwise negation of B.
              Binary              Binary   Flags       Check in decimal
                                                       Unsigned   2’s com.
Operand A       0100                0100                   4         +4
Operand B     - 0010 >2’s comp>   + 1110                 - 2      - (+2)
Sum S                               0010   N=0, Z=0,       2          2
Carry out                           1      V=0, C=?
Fig. 07-06 Subtraction
              Binary              Binary   Flags       Check in decimal
Operand A       0100                0100                   4         +4
Operand B     - 0110 >2’s comp>   + 1010                 - 6      - (+6)
Sum S                               1110   N=1, Z=0       ??         -2
Fig. 07-07 Subtraction, result as a negative number
              Binary              Binary   Flags       Check in decimal
Operand A       0111                0111                   7         +7
Operand B     - 1010 >2’s comp>   + 0110                - 10      - (-6)
Sum S                               1101   N=1, Z=0       ??         ??
Fig. 07-08 Subtraction and overflow
Figures 07-06, 07-07 and 07-08 show the subtraction, where operand B is converted by means of the two’s complement algorithm and then added to operand A. If the operands are in two’s complement and the overflow does not occur, the result is correct and it is in two’s complement. When the overflow occurs, the result is incorrect. For unsigned operands, the carry flag defines that the result is out of the range of representation. However, the carry flag has two definitions for the subtraction and its setting is defined by the processor architecture.
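Formula (0702) and the overflow rule for the subtraction from section 7.1 can be checked with a short Python sketch. The function name is an assumption; the returned carry is the raw carry out of the adder, and whether a processor stores it directly or inverted as a borrow depends on the architecture, as stated above.

```python
def subtract_twos_complement(a: int, b: int, n: int = 4):
    """Return (difference, carry_out, v_flag) of a - b on an n-bit binary adder."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    not_b = ~b & mask                  # bitwise negation limited to n bits
    total = a + not_b + 1              # A + not B + 1, formula (0702)
    diff = total & mask
    carry_out = total >> n             # raw carry out of the adder

    def sign(x: int) -> int:
        return (x >> (n - 1)) & 1

    # Overflow: operands have opposite signs and the sign of B
    # is the same as the sign of the result.
    v_flag = int(sign(a) != sign(b) and sign(b) == sign(diff))
    return diff, carry_out, v_flag


print(subtract_twos_complement(0b0100, 0b0010))  # (2, 1, 0)   Fig. 07-06
print(subtract_twos_complement(0b0100, 0b0110))  # (14, 0, 0)  Fig. 07-07
print(subtract_twos_complement(0b0111, 0b1010))  # (13, 0, 1)  Fig. 07-08
```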
7.5 Addition and subtraction in the sign and magnitude

The sign and magnitude numbers have one bit for the sign and the remaining bits for the magnitude. How the addition and/or subtraction is performed depends on the available adder. The first option is to use a binary adder and two’s complement; the second is a special adder with an algorithm for the addition and/or subtraction in the sign and magnitude code. The difference between the sign and magnitude code and the two’s complement for a 4-bit word is shown in Fig. 07-09. A negative number in the sign and magnitude code is easily converted to two’s complement by the algorithm not B plus 1; before the conversion, it is only necessary to clear the sign bit, Fig. 07-10.
Signed decimal   Binary sign and magnitude   Two’s complement
:
+2               0010                        0010
+1               0001                        0001
0                0000                        0000
-1               1001                        1111
-2               1010                        1110
:
Fig. 07-09 Comparison of the sign magnitude code and two’s complement
If $A_S = 0$ then ${}^{2}A = {}^{S}A$
If $A_S = 1$ then $A_S = 0$, ${}^{2}A = \text{not}({}^{S}A) + 1$
Where
 ${}^{S}A$ is the sign and magnitude number A.
 $A_S$ is the sign of the sign and magnitude number.
 ${}^{2}A$ is the two’s complement.
Fig. 07-10 Algorithm of the conversion
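The conversion algorithm can be written as a few lines of Python. The function name and the default word size of 4 bits are assumptions chosen to match the table in Fig. 07-09.

```python
def sign_magnitude_to_twos(sa: int, n: int = 4) -> int:
    """Convert an n-bit sign and magnitude pattern to two's complement."""
    mask = (1 << n) - 1
    sign = (sa >> (n - 1)) & 1
    if sign == 0:
        return sa & mask               # positive numbers are identical
    magnitude = sa & (mask >> 1)       # clear the sign bit
    return (~magnitude + 1) & mask     # not B plus 1, limited to n bits


for pattern in (0b0010, 0b0001, 0b0000, 0b1001, 0b1010):
    print(format(pattern, "04b"), "->", format(sign_magnitude_to_twos(pattern), "04b"))
```

Note that the negative zero 1000 of the sign and magnitude code maps to 0000, because two’s complement has only one zero.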
The direct addition and/or subtraction in the sign and magnitude code is described by a graph, Fig. 07-11, literature [Kaps_2013]. The practical realization of this algorithm may be a combinational circuit or a synchronous digital system, where the graph describes the behavior of the control unit as an FSM. The graph describes the addition and the subtraction both for manual calculation and for a digital system. The basic steps are:
 Use the correct input; in case of subtraction, the sign of operand B is inverted.
 Find the correct path after the branches.
 First path: the addition of two operands with the same sign.
 Second path: the subtraction of two operands, where the magnitude of A is higher than the magnitude of B.
 Third path: the magnitude of A is equal to the magnitude of B and the result is the positive zero.
 Fourth path: the subtraction of two operands, where the magnitude of A is less than the magnitude of B.
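Input: for the addition A + B, the operands are used as they are; for the subtraction A - B, the sign of B is inverted first, $B_S = \text{not}\,B_S$.
If $A_S = B_S$ (T): $S_M = A_M + B_M$, $S_S = A_S$
Else (F), if $A_M > B_M$ (T): $S_M = A_M - B_M$, $S_S = A_S$
Else (F), if $A_M = B_M$ (T): $S_M = 0$, $S_S = 0$
Else (F): $S_M = B_M - A_M$, $S_S = B_S$
Done
Where
 A, B are operands in the sign and magnitude.
 S is the result of the addition or subtraction.
 $A_S$, $B_S$, $S_S$ are the signs of the operands and of the result.
 $A_M$, $B_M$, $S_M$ are the magnitudes of the operands and of the result.
Fig. 07-11 Algorithm of the addition and subtraction in the sign and magnitude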
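The four paths of the graph translate directly into a chain of conditions. In this Python sketch the function name and the representation of a number as a pair (sign, magnitude) are assumptions; the magnitude is a plain integer, so an overflow of the magnitude field is not checked.

```python
def sm_add_sub(a_s: int, a_m: int, b_s: int, b_m: int, subtract: bool = False):
    """Add or subtract sign and magnitude numbers; return (sign, magnitude)."""
    if subtract:
        b_s ^= 1                       # subtraction: invert the sign of B
    if a_s == b_s:                     # first path: same signs, add magnitudes
        return a_s, a_m + b_m
    if a_m > b_m:                      # second path: |A| > |B|
        return a_s, a_m - b_m
    if a_m == b_m:                     # third path: result is the positive zero
        return 0, 0
    return b_s, b_m - a_m              # fourth path: |A| < |B|


print(sm_add_sub(0, 5, 1, 3))                 # (+5) + (-3) -> (0, 2)
print(sm_add_sub(0, 3, 0, 5, subtract=True))  # (+3) - (+5) -> (1, 2)
```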
7.6 Addition and subtraction in offset binary

Offset binary, or excess-n, is the code for representing a signed number, where the same offset, or bias, is added to all numbers, ${}^{B}A = A + b$. The biased number ${}^{B}A$ is a natural number, ${}^{B}A \ge 0$. These numbers are without a sign; they are only non-negative numbers. In computer terminology, a biased number ${}^{B}A$ is an unsigned integer. The addition and/or subtraction in excess-n is mathematically defined by formulas (0703) and (0704). The word mathematically means that no overflow occurs and the result must be a natural number. An example of the addition is in Fig. 07-12.
Decimal number   Operation      Excess-n in decimal   Excess-n in binary   Note
    28           → Add bias →         41                   10 1001         Bias is 13 decimal
  + 14           → Add bias →       + 27                 +  1 1011
    42                                68                  100 0100
                                    - 13                 -    1101
    42           ← Sub bias ←         55                   11 0111
Fig. 07-12 Mathematical addition in excess-n, bias is 13 decimal

${}^{B}Sum = (A + B) + b = {}^{B}A + {}^{B}B - b$ (0703)
${}^{B}Sub = (A - B) + b = {}^{B}A - {}^{B}B + b$ (0704)
Where
 ${}^{B}Sum$, ${}^{B}Sub$ are results in offset binary and must be natural numbers (${}^{B}S \ge 0$).
 A, B are operands.
 ${}^{B}A$, ${}^{B}B$ are operands in offset binary.
 b is the bias or offset.
In a computer, where numbers are placed in an n-bit word, the overflow may occur in the sub-results or in the final result. The overflow then means that formulas (0703) and (0704) are no longer valid. The overflow may occur, e.g., in the addition ${}^{B}A + {}^{B}B$, where the result is higher than $2^{n} - 1$, or in the subtraction ${}^{B}A - {}^{B}B$, where the result is less than zero. Therefore, it is necessary to modify the original formulas (0703) and (0704) by taking into account the n-bit word. The formulas (0705) and (0708) are valid for any bias and are not limited by the overflow of sub-results. Floating point numbers use the floating point bias $2^{n-1} - 1$; then formulas (0706), (0707) and (0709) are valid and are not limited by the overflow of sub-results. The overflow may only occur in the final addition and may be detected by the C flag as the carry out.
In these formulas, the subtraction is replaced by the addition of a negative number in two’s complement.
For the addition
${}^{B}Sum = ({}^{B}A + {}^{B}B) + \text{not}(b) + 1$, valid for any bias (0705)
${}^{B}Sum = ({}^{B}A + {}^{B}B) + b + 2$, valid only for bias $2^{n-1} - 1$ (0706)
${}^{B}Sum = ({}^{B}A + {}^{B}B) + 2^{n-1} + 1$, valid only for bias $2^{n-1} - 1$ (0707)
For the subtraction
${}^{B}Sub = ({}^{B}A + \text{not}({}^{B}B)) + 1 + b$ (0708)
${}^{B}Sub = ({}^{B}A + \text{not}({}^{B}B)) + 2^{n-1}$, valid only for bias $2^{n-1} - 1$ (0709)
Where
 ${}^{B}Sum$ is the sum in n-bit offset binary.
 ${}^{B}Sub$ is the difference in n-bit offset binary.
 ${}^{B}A$, ${}^{B}B$ are operands in n-bit offset binary.
 b is the bias or offset.
 not(b) is the bitwise not of bias b.
 n is the number of bits of the word used for the representation.
 $2^{n-1} - 1$ is the bias for floating point according to IEEE 754.
Mathematical proof for ${}^{B}S = {}^{B}A + {}^{B}B + b + 2$, where $b = 2^{n-1} - 1$
The proof is valid for an n-bit word, i.e. modulo $2^{n}$.
$0 \equiv 2^{n} = 2 \cdot 2^{n-1} - 2 + 2 = 2 \cdot (2^{n-1} - 1) + 2 = 2b + 2$
${}^{B}S = (A + B) + b = A + B + b + 0 = A + B + b + 2b + 2$
${}^{B}S = (A + b) + (B + b) + b + 2 = {}^{B}A + {}^{B}B + b + 2$□
Decimal number   Operation      Excess-n in decimal   Excess-n in binary   Note
     4           → Add bias →         11                    1011
   + 3           → Add bias →         10                  + 1010
     7                                                      0101           Overflow occurs
                                                          + 1001           Not(b) plus 1
     7           ← Sub bias ←         14                    1110
Fig. 07-13 Example for 4-bit word where bias is $2^{n-1} - 1 = 7$
Fig. 07-13 presents the addition, where the first addition generates an overflow and the 4-bit result corresponds to $({}^{B}A + {}^{B}B) \bmod 16 = 21 \bmod 16 = 5$. However, the addition of the correction “not b plus 1” generates the correct result without an overflow.
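Formulas (0705) and (0708) can be verified numerically with a short Python sketch. The function names are assumptions; the n-bit word is modelled by masking every result with $2^{n} - 1$.

```python
def excess_add(ba: int, bb: int, b: int, n: int) -> int:
    """Sum of two biased numbers on an n-bit adder, formula (0705)."""
    mask = (1 << n) - 1
    return (ba + bb + (~b & mask) + 1) & mask      # + not(b) + 1


def excess_sub(ba: int, bb: int, b: int, n: int) -> int:
    """Difference of two biased numbers on an n-bit adder, formula (0708)."""
    mask = (1 << n) - 1
    return (ba + (~bb & mask) + 1 + b) & mask      # + not(BB) + 1 + b


bias, n = 7, 4
print(excess_add(4 + bias, 3 + bias, bias, n) - bias)  # 7
print(excess_sub(4 + bias, 3 + bias, bias, n) - bias)  # 1
```

The first line reproduces Fig. 07-13: the biased operands 11 and 10 give the biased sum 14, which is 7 after the bias is removed.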
7.7 Addition and subtraction in BCD code

Performing these arithmetic operations in the decimal numeral system is no problem, but
the computer has the binary adder. Therefore, the following text explains the algorithm for
performing decimal operations on the binary adder. Fig. 07-14 shows the basic principle
and three possible situations. In the first situation, the result in the nibble is correct. In the
second one, the result is higher than 9 and must be corrected by adding 6. In the third situ-
ation, the addition in the nibble generates the carry to the next nibble and the result must
also be corrected by adding 6.
Result is correct     Sum is higher than 9     Carry to next nibble
   0th nibble             0th nibble               0th nibble
     0100                   1000                     1000
   + 0101                 + 0011                   + 1001
     1001                   1011                   1 0001
                          + 0110                   + 0110
                          1 0001                   1 0111
                     (1 goes to 1st nibble)   (1 goes to 1st nibble)
Fig. 07-14 BCD addition on binary adder
It is worth mentioning that the result of the addition may have one order more than the maximal order of the operands, formula (0710). This is a typical property of every addition. In a computer, integer numbers in the BCD code are represented by a string. When the string length is not limited, the overflow cannot occur. When the string length is restricted, the BCD overflow can occur. It means that the destination string does not have a sufficient length, [DEC_VAX].
$n_r = \max(n_1, n_2) + 1$ (0710)
Where
 $n_r$ is the maximal order of the sum.
 $n_1$, $n_2$ are the orders of the operands in the addition.
When the BCD addition is performed on, e.g., 3 BCD digits, the correction must be done after each addition, Fig. 07-15. The first correction is performed when the sum in a nibble is higher than 9 or when the carry to the next nibble is generated. Performing the first correction by the addition may again generate nibbles higher than 9, therefore it is necessary to continue by adding 6 in the second correction. And again, this correction may generate a nibble higher than 9 and so the correction continues. The correction is repeated until no nibble is higher than 9.
3rd nibble  carry  2nd nibble  carry  1st nibble  carry  0th nibble   Note                     Check in decimal
   0000              1000               0101               0110                                    856
 + 0000            + 1001             + 0100             + 1000                                  + 948
              1                  0                  0                 Carry to next nibble
   0001              0001               1001               1110       First binary addition
                   + 0110             + 0000             + 0110       First correction
   0001              0111               1010               0100       Second binary addition
 + 0000            + 0000             + 0110             + 0000       Second correction
   0001              1000               0000               0100       Final result                1804
Fig. 07-15 BCD addition for more nibbles
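The correction by adding 6 can be sketched in Python on packed BCD numbers. The function name is an assumption, and the sketch differs from Fig. 07-15 in one respect: it corrects each nibble immediately and passes the carry on, instead of repeating whole-word correction passes. The final result is the same.

```python
def bcd_add(a: int, b: int, nibbles: int):
    """Add two packed BCD numbers nibble by nibble; return (result, carry out)."""
    result = 0
    carry = 0
    for i in range(nibbles):
        digit_a = (a >> (4 * i)) & 0xF
        digit_b = (b >> (4 * i)) & 0xF
        s = digit_a + digit_b + carry    # binary addition in one nibble
        if s > 9:                        # sum higher than 9 or binary carry
            s += 6                       # correction by adding 6
        carry = s >> 4                   # carry to the next nibble
        result |= (s & 0xF) << (4 * i)
    return result, carry


result, carry = bcd_add(0x0856, 0x0948, 4)
print(format(result, "04X"), carry)  # 1804 0
```

A carry of 1 returned for the last nibble corresponds to the BCD overflow of a destination string with a restricted length.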
The subtraction can be performed by adding a negative number, Fig. 07-16. The negative number is expressed by ten’s complement. For signaling the sign, the best practice is to use the most significant digit, where 0 is plus and 9 is minus. It means that one nibble is added for the sign. After performing the subtraction, the most significant digit determines the sign of the result.
Definition of subtraction in decimal   Operation   BCD number       Carry in   Note
    47                                             0000 0100 0111
  - 95                                     -       0000 1001 0101
                                                   0000 0100 0111              First operand
                                           +       1001 0000 0100      1       ${}^{10}A = {}^{9}A + 1$, second operand
                                                   1001 0100 1100              First addition
                                           +       0000 0000 0110              Correction of addition
                                                   1001 0101 0010              Result is negative
                                                   0000 0100 0111      1       ${}^{10}A = {}^{9}A + 1$
   -48                                     -            0100 1000              Final result in BCD and sign
Fig. 07-16 Performing subtraction by addition
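The procedure of Fig. 07-16 can be followed on ordinary decimal integers. In this Python sketch the function name and the parameter `digits` (the number of decimal digits including the sign digit) are assumptions; the BCD encoding itself is left out to keep the example short.

```python
def subtract_by_tens_complement(a: int, b: int, digits: int) -> int:
    """Compute a - b by adding the ten's complement of b on `digits` digits.

    The most significant digit is the sign digit, so a and b must be
    non-negative and smaller than 10**(digits - 1).
    """
    modulus = 10 ** digits
    b_complement = (modulus - b) % modulus       # ten's complement of b
    raw = (a + b_complement) % modulus           # addition on `digits` digits
    sign_digit = raw // 10 ** (digits - 1)
    if sign_digit >= 5:                          # 5..9 means a negative result
        return -((modulus - raw) % modulus)      # complement again for magnitude
    return raw


print(subtract_by_tens_complement(47, 95, 3))  # -48
print(subtract_by_tens_complement(95, 47, 3))  # 48
```

For 47 - 95 the intermediate value is 952, whose sign digit 9 marks a negative result, as in the figure.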
Note to ten’s complement
The mathematical definition is ${}^{10}A = 10^{n} - |A|$, where n is the number of digits. In the logical definition, ten’s complement is nine’s complement plus 1, where nine’s complement is the replacement 0→9, 1→8, 2→7, 3→6 …
${}^{10}A = {}^{9}A + 1$
${}^{9}A$: 0→9, 1→8, 2→7, 3→6, 4→5, 5→4, 6→3, 7→2, 8→1, 9→0 □
7.8 Multiplication
The multiplication of unsigned integers is the basic arithmetic operation and the algorithm for the binary unsigned integers is the same as for the decimal unsigned integers, Fig. 07-17. The algorithm is valid both from the mathematical and the computer point of view and the application of this algorithm must take into account the following:
 The algorithm is only valid for unsigned integers.
 The product size in bits is the sum of the operand sizes, formula (0711).
$n_r = n_1 + n_2$ (0711)
Where
 $n_1$, $n_2$ are the sizes of the operands in bits.
 $n_r$ is the size of the product in bits.
Multiplication          Size in bits   Check in decimal
        1 0 1 1 0       5 bits             22
      *     1 0 1       3 bits            * 5
        1 0 1 1 0
      0 0 0 0 0
    1 0 1 1 0
    1 1 0 1 1 1 0       8 bits            110
Fig. 07-17 Multiplication of unsigned integer
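Two’s complement   Signs        Absolute values   Check in decimal
     0101          0 plus            0101              (+5)
   * 1001          1 minus         * 0111            * (-7)
                                     0101
                                    0101
                                   0101
                                  0000
   1101 1101       1 minus        0010 0011            -35  OK
Step 1: the xor operation on the signs gives the sign of the product (1, minus).
Step 2: the absolute values are multiplied as unsigned numbers (0010 0011).
Step 3: the product is converted to two’s complement (1101 1101).
Fig. 07-18 Signed multiplication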
In case of a signed number, the multiplication must be performed by a special algorithm. The basic one is the application of the algorithm for unsigned numbers with a modification, Fig. 07-18:
 Separate the signs of both operands and calculate the sign of the product. The sign of the result is the xor operation on both signs, step 1.
 Separate the significant bits of the operands as the absolute values of the operands and perform the multiplication for unsigned numbers, step 2.
 According to the sign of the product, convert the product to the defined form of the representation, step 3.
It is also possible to perform the multiplication of signed numbers in two’s complement directly by a special algorithm, e.g. Booth's multiplication algorithm in [wiki_0707]; more algorithms are in [Ercegovac_2004], [Koren_2002] and [Stine_2012].
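The three steps can be sketched in Python for n-bit two’s complement operands and a 2n-bit product. The function name is an assumption, and the unsigned multiplication in step 2 is written as a shift-and-add loop to mirror the partial products of Fig. 07-17.

```python
def signed_multiply(a: int, b: int, n: int = 4) -> int:
    """Multiply two n-bit two's complement patterns; return a 2n-bit pattern."""
    mask = (1 << n) - 1
    a &= mask
    b &= mask
    sign_a = (a >> (n - 1)) & 1
    sign_b = (b >> (n - 1)) & 1
    sign = sign_a ^ sign_b                       # step 1: sign of the product
    abs_a = (-a & mask) if sign_a else a         # absolute values of operands
    abs_b = (-b & mask) if sign_b else b
    product = 0
    for i in range(n):                           # step 2: unsigned multiplication
        if (abs_b >> i) & 1:
            product += abs_a << i                # add the shifted partial product
    if sign:                                     # step 3: back to two's complement
        product = -product & ((1 << (2 * n)) - 1)
    return product


print(format(signed_multiply(0b0101, 0b1001), "08b"))  # 11011101, i.e. -35
```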
7.9 Division
Division is a very arduous arithmetic operation because the result can take several different forms. The division is also a time-consuming operation. The result of the division is a rational number. When the result is a number with a radix point, it is the floating point division. This subchapter describes only the division of integer numbers, the integer division. In this case, the result can have two forms: either a fraction, or a quotient and a remainder. A fraction is an exact result of the division. A lot of programming languages use the fraction as the result, and they have the rational number as a data type, [Matlab_0701] and [wiki_0708].
Rational numbers ℚ are, e.g., 1.1, 1/8 … □
Natural numbers ℕ are the set {0, 1, 2, 3 …} □
Formula (0712) defines the quotient and/or remainder as the result of division. This formula is uniquely defined for positive numbers only; these numbers are called the numerator and the denominator, and the quotient is equal to numerator/denominator. One of the first division algorithms is the Euclidean division for positive integer numbers, which was later extended to positive and negative numbers. The basic idea of this algorithm is that the remainder is never negative, formula (0713), [wiki_0709]. Another algorithm of the integer division admits a positive and/or negative remainder, [Koren_2002] and [Ercegovac_2004]. This definition is used by a lot of mathematical systems and programming languages.
$n = d \cdot q + r$ (0712)
$0 \le r < |d|$ (0713)
Where
 n is the numerator or dividend as an integer number.
 d is the denominator or divisor as an integer not equal to zero ($d \ne 0$).
 q is the quotient as an integer.
 r is the remainder.
These different approaches to the integer division are shown in Fig. 07-19. Most of the reported systems have the integer division with truncation, and the remainder can be either positive or negative. Only the Euclidean algorithm and the Python programming language give different results. Only the MS Excel 2010 spreadsheet has no function defined as the remainder, and its modulo operation is defined in a different way. The Octave system has different functions for the remainder and the modulo operation, which give different results. However, the present literature states that the remainder is the modulo calculation, [ISO/IEC_0701], [wiki_0710] and [wiki_0711].
System           Operator or function        7 and 3   7 and -3   -7 and 3   -7 and -3
Euclidean        Quotient                       2         -2         -3          3
algorithm        Remainder                      1          1          2          2
MS Excel 2010    QUOTIENT                       2         -2         -2          2
                 No remainder
                 MOD                            1         -2          2         -1
Octave v3.2.4    Idivide                        2         -2         -2          2
                 Remainder                      1          1         -1         -1
                 Modulo                         1         -2          2         -1
GNU bash,        / (integer division)           2         -2         -2          2
v4.2.45          % (remainder)                  1          1         -1         -1
C language       / (integer division)           2         -2         -2          2
(gcc compiler)   % (as modulo, remainder)       1          1         -1         -1
C++              / (integer division)           2         -2         -2          2
(gcc compiler)   % (as modulo, remainder)       1          1         -1         -1
Python 2.7.6     // (integer division)          2         -3         -3          2
                 % (as remainder, modulo)       1         -2          2         -1
Fig. 07-19 The results of integer division in different systems
The latest view on the integer division is formulated by the standard ISO/IEC 10967-1:2012, Information technology - Language independent arithmetic - Part 1: Integer and floating point arithmetic, known as LIA. This standard states the definition of the quotient and the remainder. The LIA standard states the integer division as:
“quotI (-3; 2) = -2 round toward minus infinity, specified in LIA-2”
$q = \text{floor}(n/d)$ (0714)
“divtI (-3; 2) = -1 round toward zero, no longer specified by any part of LIA”
$q = \text{truncation}(n/d)$ (0715)
It means that the preferred algorithm is quotI (0714) with rounding toward minus infinity. This definition is also known as the floor division, q = floor (n/d). The truncation division (0715) is considered historical, but it is still often used. The standard ISO/IEC 10967-1:2012 defines the remainder, which is calculated by the modulo function modI (specified in LIA-2). The standard writes: “It is coupled to division by the following identities:
x = quotI (x; y) * y + modI (x; y) if y ≠ 0, and no overflow occurs
y < modI (x; y) ≤ 0 if y < 0
0 ≤ modI (x; y) < y if y > 0” (0717)
Where
 x is the numerator or dividend.
 y is the denominator or divisor.
 modI is the modulo function and calculates the remainder of the integer division.
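The three definitions of the integer division discussed in this subchapter can be compared in Python. The function names are assumptions for this sketch; only the built-in operators // and divmod are used, and every function returns the pair (quotient, remainder) that satisfies formula (0712).

```python
def trunc_div(n: int, d: int):
    """Truncation division (0715): the quotient is rounded toward zero."""
    q = abs(n) // abs(d)
    if (n < 0) != (d < 0):
        q = -q
    return q, n - d * q


def floor_div(n: int, d: int):
    """Floor division (0714): the quotient is rounded toward minus infinity."""
    q = n // d
    return q, n - d * q


def euclid_div(n: int, d: int):
    """Euclidean division: the remainder is never negative, formula (0713)."""
    q, r = divmod(n, d)          # floor division, r has the sign of d
    if r < 0:                    # happens only for a negative divisor
        q += 1
        r -= d
    return q, r


for n, d in ((7, 3), (7, -3), (-7, 3), (-7, -3)):
    print(n, d, trunc_div(n, d), floor_div(n, d), euclid_div(n, d))
```

The printed pairs correspond to the rows for the C language, Python and the Euclidean algorithm in Fig. 07-19.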
The Python programming language, from version 2.2 and including the Python 3.x versions, uses the floor division, [Python_0701], [Python_0702], and it defines a new // operator (double slash) as the integer floor division. The modulo operator remains the same, % (percent), and produces the remainder. In Python 2.x, the original / operator (slash) applied to integers gives the quotient according to the floor division; in Python 3.x, the / operator performs true division and returns a floating point result.
The programming language C is defined by the standard ISO/IEC 9899:2011 - Programming languages — C, and this standard states that the quotient of the integer division by the / operator (slash) is calculated by the truncation division. Let’s quote:
“When integers are divided, the result of the / operator is the algebraic quotient with any fractional part discarded. If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a. This is often called ‘truncation toward zero’.”
The programming language C++ is defined by ISO/IEC 14882:2011 - Programming Language C++, which contains similar statements for the integer division by the / operator. It means that C and C++ use the truncation integer division by the / operator, and the value of the remainder corresponds to this definition. When the floor division is desired, the floor function must be applied to the floating point quotient, e.g. floor((double)a / b), because the integer expression a/b is already truncated; the remainder is then calculated separately.
The division can be realized by many algorithms, which are described in the literature [Ercegovac_2004], [Internet_0701], [Koren_2002], [wiki_0712] or [Muller_2010]. The algorithms of division are realized only by an FSM – Finite State Machine.
7.10 References
[DEC_PDP11] PDP11 processor handbook, PDP11/04/34a/44/60/70, instruction set and instruction SUB; Digital Equipment Corporation, 1979; http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp11/handbooks/PDP11_Handbook1979.pdf; on line 2013-09-24
[Ercegovac_2004] M. D. Ercegovac, M. Lang; Digital Arithmetic; Morgan Kaufmann Publishers 2004; ISBN 1-55860-798-6
[Internet_0701] Arithmetic Operations on Binary Numbers; http://www.doc.ic.ac.uk/~eedwards/compsys/arithmetic/; on line 2013-09-26
[Internet_0702] Multiplication in FPGAs; http://www.andraka.com/multipli.htm; on line 2013-10-31
[ISO/IEC_0701] ISO/IEC 10967-1:2012 - Information technology - Language independent arithmetic - Part 1: Integer and floating point arithmetic
[ISO/IEC_0702] ISO/IEC 9899:2011 - Programming languages — C
[ISO/IEC_0703] ISO/IEC 14882:2011 - Programming Language C++
[Kaps_2013] Jens-Peter Kaps; Digital System Design, Signed Magnitude Addition – Subtraction Algorithm; George Mason University; http://ece.gmu.edu/~jkaps/courses/ece331-s07/resources/signedinteger.pdf
[Koren_2002] I. Koren; Computer Arithmetic Algorithm; A. K. Peters Ltd. 2002; ISBN 1-56881-160-8
[Matlab_0701] Dom::Rational Field of rational number; http://www.mathworks.com/help/symbolic/mupad_ref/dom-rational.html; on line 2014-07-12
[Muller_2010] Jean-Michel Muller, Nicolas Brisebarre, Florent de Dinechin, Claude-Pierre Jeannerod, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Damien Stehlé, Serge Torres: Handbook of Floating-Point Arithmetic; Birkhauser Boston, a part of Springer Science+Business Media, LLC 2010; ISBN 978-0-8176-4704-9; e-ISBN 978-0-8176-4705-6
[Python_0701] Why Python's Integer Division Floors; http://python-history.blogspot.cz/2010/08/why-pythons-integer-division-floors.html; on line 2014-07-11
[Python_0702] PEP 238 -- Changing the Division Operator; http://legacy.python.org/dev/peps/pep-0238/; on line 2014-07-11
[Stine_2012] J. E. Stine; Digital Computer Arithmetic Datapath Design Using Verilog HDL; Springer 2012; ISBN-13 978-1461347255
[wiki_0701] Status register; http://en.wikipedia.org/wiki/Status_register; on line 2013-09-26
[wiki_0702] Negative flag; http://en.wikipedia.org/wiki/Sign_flag; on line 2013-09-26
[wiki_0703] Zero flag; http://en.wikipedia.org/wiki/Zero_flag; on line 2013-09-26
[wiki_0704] Overflow flag; http://en.wikipedia.org/wiki/Overflow_flag; on line 2013-09-26
[wiki_0705] Carry flag; http://en.wikipedia.org/wiki/Carry_flag; on line 2013-09-26
[wiki_0706] Sign extension; http://en.wikipedia.org/wiki/Sign_extension; on line 2013-09-26
[wiki_0707] Booth's multiplication algorithm; http://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm; on line 2013-10-21
[wiki_0708] Rational data type; http://en.wikipedia.org/wiki/Rational_data_type; on line 2014-07-12
[wiki_0709] Euclidean division; http://en.wikipedia.org/wiki/Euclidean_division; on line 2014-07-10
[wiki_0710] Remainder; http://en.wikipedia.org/wiki/Remainder; on line 2014-07-10
[wiki_0711] Modulo operation; http://en.wikipedia.org/wiki/Modulo_operation; on line 2014-07-10
[wiki_0712] Division algorithm; http://en.wikipedia.org/wiki/Division_algorithm; on line 2014-08-18
8 Fixed point arithmetic
The term fixed point (or fixed point numbers) is used mainly in computer science. Fixed point numbers can be understood as numbers with the radix point at a place defined beforehand. From the mathematical point of view, rational numbers ℚ (U+211A) are the numbers that are expressed as a quotient or fraction n/d. We begin by thinking about the fraction and the possibilities for writing and reading it. It is possible to write the real number 1.23 as e.g. 0.00123 * 1000 ($0.00123/10^{-3}$) … 1.23 * 1 and also as 1.23 * 1/1 … 1230 * 1/1000 etc. The multiplier is called a scaling factor, [wiki_0803]. The multiplier is either an integer or a fraction in the form 1/denominator. Then, it is possible to read the fraction as: 1.23 is 123 with a scaling factor of 1/100, 123 is 1.23 with a scaling factor of 100 etc., or 123 scaled by 1/100 is 1.23 and so on, [wiki_0803]. The same principles are valid for negative numbers and any scaling factor, Fig. 08-01.
Fixed point numbers are a subset of real numbers, not vice versa. □
Scaling factor. □
1.23 is 123 with a scaling factor of 1/100. □
123 scaled by 1/100 is 1.23. □
Fraction is numerator/denominator. □
1.23 = 12.3/10 = 123/100 = 1 230/1 000          -1.1 = -22/20 = -33/30
1.23 is 123 with a scaling factor of 1/100      -1.1 is -33 with a scaling factor of 1/30
123 scaled by 1/100 is 1.23                     -22 scaled by 1/20 is -1.1
Fig. 08-01 Scaling
The useful scaling factors are the powers of 2 or 10 and they are chosen so that the numerator is an integer number. Then, the fixed point numbers are represented by the division of integer numbers and the numerator of the fraction is used in the computation. It means that integer arithmetic is used instead of floating point arithmetic or a special arithmetic unit. The basic arithmetic operations of the fixed point numbers are based on fraction arithmetic, more details later. The scaling factor is used to express the fixed point numbers, e.g. the scaling factor of 1/3600 is used for the calculation of hours from seconds, or a scaling factor is used for transforming angles, where the angle 2π radians corresponds to the number 65536.
Fixed point is the theory of the transformation of real numbers to fraction of the integer numbers. □
The meaning of fixed point lies in the fact that integer arithmetic is used instead of floating point arithmetic. Integer arithmetic is faster than floating point arithmetic. Moreover, not every processor has a hardware floating point unit, e.g. DSP - digital signal processor. In this situation, floating point operations are simulated by a software library and this is a slower computation than integer arithmetic. The second reason is the precision: with a suitable definition of the fixed point it is possible to reach a higher precision than by using floating point with the same word size. More details later.
Fixed point is faster than floating point. □
The significance of the fixed point numbers lies in:
• Transferring a number with radix point to an integer.
• With a suitable definition, the fixed point number can be more precise than the floating point number, under the condition of the same word size. Example later.
• The integer arithmetic operations are faster than the floating point operations.
• The range of fixed point representation is smaller than the floating point range. It is a negative property.
The fixed point numbers and binary scaling theory are mainly used in digital signal processing and other areas, literature [wiki_0804]:
• Digital signal processing – DSP. A lot of DSP processors are only of integer type and floating point operations are only simulated by software. DSP covers the applications of the digital filter, digital image processing, speech to text and text to speech conversion, and so on.
• Binary angle, where the angle 2π corresponds to e.g. 65536 = $2^{16}$.
• In the 1970’s and 1980’s, fixed point was used in intensive real time computing, such as the flight simulator.
• Some programs, where DCT – Discrete Cosine Transformation is used to compress JPEG images.
• Computer graphics.
Support for rational numbers can be found in programming languages and algebraic computation systems, such as Mathematica and Maple, [wiki_0809]. In the programming languages, the support is based on software libraries. The better known languages are Common Lisp, Perl, Ruby, C/C++ and VHDL; others are mentioned in literature [wiki_0809], [vhdl_0801]. For the languages C/C++, it is the project of the GNU Multiple Precision Arithmetic Library, [wiki_0809]. The Python programming language has the fractions module, which provides the support for rational number arithmetic, [Python_0801]. The libfxmath library provides platform-independent fixed point maths in the format ℚ16.16 under the MIT license, [Google_0801] and [wiki_0808].
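As a small illustration of the rational number support mentioned above, the following sketch uses the fractions module of the Python standard library. The variable names x and y are chosen only for this example, and the values are the ones used in the scaling examples of this chapter.

```python
from fractions import Fraction

x = Fraction(123, 100)     # 1.23 is 123 with a scaling factor of 1/100
y = Fraction(-33, 30)      # -1.1 is -33 with a scaling factor of 1/30

print(x, float(x))         # 123/100 1.23
print(y)                   # -11/10, the fraction is reduced automatically
print(x + y)               # 13/100, exact rational arithmetic
```

The module keeps the numerator and the denominator as integers, which is the same idea as an integer number with a scaling factor, only with an arbitrary denominator.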
8.1 Binary scaling

Binary scaling is the classical scaling with the preferred scaling factor of $1/2^{n}$. In the following text, the scaling factor will be written in the decimal or the hexadecimal numeral system. The scaling factor cannot be zero but the fixed point number can be zero. Fig. 08-02 shows the terminology on an 8-bit number. A fixed point number may be with or without a sign. In case of a signed fixed point number, the leftmost bit or MSB is the sign. In case of a negative fixed point number, the two’s complement is used.
If the fixed point number is understood as an integer, Fig. 08-02, it is a classical signed or unsigned integer. If the number is understood as a fixed point number with radix point, Fig. 08-02, it is divided into parts: a sign, an integer and a fraction. The number may be unsigned or signed. The following parts are the integer and the fraction part and they are the absolute value of a fixed point number. The fractional part is the part after the radix point and the scaling factor defines the size in bits of this fractional part. In the example in Fig. 08-02, the scaling factor is $1/2^{3}$, which means that the fractional part has 3 bits, counted from the radix point to the right. The weight of the LSB bit is $2^{-n}$ for the scaling factor of $1/2^{n}$. The fixed point value in Fig. 08-02 is IN = 53H as an integer and FX = 53H * $1/2^{3}$ = A.6H as a number with radix point.
Fixed point number may be understood as the integer number with the scale factor. □
Fixed point has a sign, an integer and a fraction part. □
Fixed point is a subset of real numbers. □
Integer number - IN: 0 1 0 1 0 0 1 1 (bits 7 to 0; bit 7 is the sign, bits 7 to 4 are nibble 1, bits 3 to 0 are nibble 0)
Scaling factor - SF: * $1/2^{3}$ =
Fixed point number - FX or real number - Re: 0 1 0 1 0 . 0 1 1 (bits 7 to 0; bit 7 is the sign, followed by the integer part, the radix point and the fraction part)
Size of fraction part in bits is derived from scaling factor. □
Radix point is defined by scaling factor. □
Fig. 08-02 Designation of parts in fixed point
Fig. 08-03 shows the weights for the representation of the integer or fixed point in the byte. In case of integer, the LSB has the weight of $2^{0}$ and the MSB bit is a sign or not. In case of fixed point, the LSB bit has the weight of $2^{-n}$ and the MSB bit is a sign or not. The $2^{0}$ weight of fixed point is determined by the radix point. The sign bit has minus weight and it is used in two’s complement. It is suitable to use this minus weight for the conversion to the decimal numeral system.
Byte 0 1 0 1 0 0 1 1, bits 7 (MSB) to 0 (LSB)
Integer number - IN
  Weights for unsigned: $b^{7}\; b^{6}\; b^{5}\; b^{4}\; b^{3}\; b^{2}\; b^{1}\; b^{0}$
  Weights for signed: $-b^{7} + b^{6} + b^{5} + b^{4} + b^{3} + b^{2} + b^{1} + b^{0}$
Fixed point - Fx, real number
  Weights for unsigned: $a^{4}\; a^{3}\; a^{2}\; a^{1}\; a^{0}\; a^{-1}\; a^{-2}\; a^{-3}$
  Weights for signed: $-a^{4} + a^{3} + a^{2} + a^{1} + a^{0} + a^{-1} + a^{-2} + a^{-3}$
Fig. 08-03 Weights of bits
Format ℚ3.3     Integer number - IN (bits 7 to 0)     Fixed point - FX (bits 7 to 0)
+ 2.4           0 0 0 1 0 0 1 1                       0 0 0 1 0 0 1 1
- 2.4           1 1 1 0 1 1 0 1                       1 1 1 0 1 1 0 1
The bits above the ℚ3.3 number are filled by sign extension.
Fig. 08-04 Fixed point placement with sign extension
It is necessary to note that the required fixed point size in bits must be less than or equal to
the computer word size. When the fixed point size in bits is less than the computer word
size, then the fixed point number is placed to the word from LSB bit and the sign extension
is performed to the highest remaining significant bits, Fig. 08-04.
The position of the radix point, or the binary scaling factor, is defined by the user or by the software library. In literature, we can find many notations for the fixed point definition:
• ℚm.f format or ℚm.n format or ℚ number format, more details below.
• ℚf format, where ℚ is the prefix and designation for rational numbers and f is the number of bits in the fractional part.
• Bn format, it means that the integer part has n bits plus sign and the fractional part has n - 1 bit, literature [wiki_0804].
• Signed or unsigned m.n.
• fxm.b, where fx is the abbreviation for fixed point, m is the number of bits of the integer part and b is the number of bits of the whole word. For example, fx3.16 is the notation for 3 bits of integer part in a 16-bit word; the fractional part has 13 bits.
• s:m:f format, where every item means a number of bits: s is the sign, m is the size of the integer part and f is the size of the fractional part.
8.2 Format m.n

Format m.n is the definition of fixed point, where m is the number of integer bits including the sign bit and n is the number of bits in the fractional part. This definition may be signed or unsigned and the variant is generally stated. This definition can be found in real practice, literature [Malapeti_082010], [Oberstar_082007ti] and [Yates_082009]. Well-known definitions are the formats without the integer part, 0.16 or 0.32 for the unsigned version and 1.15 or 1.31 for the signed version. These definitions without the integer part have the advantage that the fractional part has the maximum number of bits. This is the situation when fixed point is more accurate than floating point: the fractional part has more bits than the significand of floating point in the same word size.
8.3 ℚ number format

ℚ number format is the format for fixed point and the letter ℚ is also the symbol for rational numbers in mathematical theory. In computer science, it is a denotation for the fixed point format in the variants ℚm.f, ℚf, ℚm.n or ℚn. This format will be used in the following texts. The ℚ format is always defined with the sign, where two’s complement is used for a negative number and the MSB bit is the sign. The letter n or f is the number of bits in the fractional part and m is the number of bits in the integer part without the sign, [wiki_0805] and [TI_082003]. Then the minimum required word size is the sum of the number of bits of the integer part, the fractional part and the sign, formula (0801). For example, ℚ2.13 has a sign bit, 2 bits in the integer part and 13 bits in the fractional part. It is also possible to find literature where m is the number of bits in the integer part including the sign, [Oberstar_082007], [wiki_0808] and [Google_0801]. Then m + n is the required minimum number of bits in the word.
Rational numbers are typically numbers in the form of fraction. □
Two’s complement is used for negative numbers. □
This ℚ format will be used in the following texts. □
Word size is m + n + 1 (0801)
Where
• +1 is the sign bit.
• m is the number of bits in the integer part.
• n or f is the number of bits in the fractional part.
8.4 Range of representation of fixed point

The range of representation is given by the position of the radix point and by whether the definition is signed or unsigned. The range can be expressed in two ways, first as an integer with a scaling factor and then as a real number. Formulas (0802) and (0803) are for the unsigned format m.n and formulas (0804) and (0805) are for the signed format ℚm.f.
Range for unsigned fixed point
$0$ to $(2^{m+f} - 1)/2^{f}$ (0802)
$0$ to $2^{m} - 1/2^{f}$ (0803)
Range for signed fixed point, m is without a sign bit
From $-2^{m+f}/2^{f}$ to $+(2^{m+f} - 1)/2^{f}$ (0804)
From $-2^{m}$ to $+(2^{m} - 1/2^{f})$ (0805)
Range of the fixed point representation, unsigned: $0$ to $(2^{m+f} - 1)/2^{f}$. □
Range of the fixed point representation, signed, sign bit outside of m: $-2^{m+f}/2^{f}$ to $(2^{m+f} - 1)/2^{f}$. □
The gap, resolution
$\varepsilon = 1/2^{f}$ (0806)
Where
• m is the number of bits in the integer part.
• f is the number of bits in the fractional part.
• ε is the resolution.
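The range formulas can be evaluated for any format with a few lines of Python. This is a minimal sketch; the function name q_range and its parameters are assumptions made for this illustration, and exact fractions are used so that no rounding enters the result.

```python
from fractions import Fraction

def q_range(m: int, f: int, signed: bool = True):
    """Return (minimum, maximum, resolution) of a fixed point format.

    m: integer bits (without the sign bit), f: fractional bits.
    """
    eps = Fraction(1, 2**f)                 # resolution, formula (0806)
    if signed:
        lo = Fraction(-(2**(m + f)), 2**f)  # lower limit, formula (0804)
    else:
        lo = Fraction(0)                    # lower limit, formula (0802)
    hi = Fraction(2**(m + f) - 1, 2**f)     # upper limit, (0802) and (0804)
    return lo, hi, eps

lo, hi, eps = q_range(3, 3)
print(float(lo), float(hi), float(eps))     # -8.0 7.875 0.125
```

For the signed format ℚ3.3 the sketch gives the range from -8 to 7.875 with a gap of 0.125 between neighboring numbers.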
Note about precision floating point versus fixed point
Convert number 1.4 to signed fixed point in format ℚ1.30
• 1.4D * $2^{30}$ = 1 503 238 553.6 ≈ 1 503 238 554D = 5999 999AH.
• 5999 999AH scaled by $1/2^{30}$ is 1.400 000 000 372 529 029 846 191 406 25D.
• Error of representation is +0.000 000 026 609 216 417 585 100 446 43%.
• 0x3FB3 3333 as a 32-bit floating point number is 1.399 999 976 158 142 089 843 75.
• Error of representation is -0.000 001 702 989 850 725 446 428 571 429%.
Both definitions use a 32-bit word and the fixed point representation is more precise than floating point by about two orders of magnitude. The 32-bit floating point format uses only 23 bits for the fraction of the significand, but the signed ℚ1.30 fixed point format uses 30 bits for the fraction part. □
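The comparison in this note can be reproduced with the Python standard library. The sketch below is only an illustration under the assumption that the 32-bit floating point number is an IEEE 754 single precision value, which is what the struct format code f produces; the variable names are chosen for this example.

```python
import struct

value = 1.4

# Signed Q1.30: multiply by 2**30 and round to the nearest integer
q30 = round(value * 2**30)
fixed = q30 / 2**30                 # exact, the divisor is a power of 2

# IEEE 754 single precision: pack to 4 bytes and unpack again
raw = struct.pack(">f", value)
single = struct.unpack(">f", raw)[0]

print(hex(q30))                     # integer stored in the Q1.30 word
print(raw.hex())                    # bit pattern of the 32-bit float
print(abs(fixed - value) < abs(single - value))   # True: fixed point is closer
```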
Each unsigned and signed range is defined by two equivalent formulas. A small gap exists between two neighboring numbers and it is defined by the number of fractional bits, $1/2^{f}$, formula (0806). This gap is also called the resolution ε (epsilon), [Oberstar_082007].
The range of representation is also influenced by the computer word size of 16, 32 or 64 bits. In some cases, when the defined ℚ format does not use all bits of the word, the remaining bits of the word can increase the range of representation. Also, each arithmetic operation changes the format of the result; typically the number of bits in the integer and fractional part increases. These new bits increase the accuracy and are used for rounding. They are called round and sticky bits and have the same role as in floating point.
A limited number of bits for computation can cause an overflow. The overflow means that the number is out of the range of representation. The overflow is signalized by the flags C (carry) and V (overflow), depending on whether the computation is signed or unsigned. The overflow of fixed point must be derived from these flags.
Overflow. □
8.5 Conversion to and from fixed point

Conversion is based on the basic definition of the binary numeral system for unsigned and signed numbers and on binary scaling, formulas (0807), (0808) and (0809). Formula (0807) is valid for unsigned numbers and formula (0808) is valid for signed numbers. The difference is only in the sign of the term with the coefficient $a_{m-1}$; for unsigned numbers this term is positive. For signed numbers, the coefficient matches the sign, 0 for positive and 1 for negative numbers. Formula (0809) defines the value of fixed point that is calculated from the integer number and the scaling factor. The vertical line only indicates the radix point position.
$UN = + a_{m-1} B^{m-1} + \dots + a_{1} B^{1} + a_{0} B^{0} \;|\; + a_{-1} B^{-1} + a_{-2} B^{-2} + \dots + a_{-f} B^{-f}$ (0807)
$SN = - a_{m-1} B^{m-1} + \dots + a_{1} B^{1} + a_{0} B^{0} \;|\; + a_{-1} B^{-1} + a_{-2} B^{-2} + \dots + a_{-f} B^{-f}$ (0808)
$FX = IN \cdot 1/2^{f}$ (0809)
Where
• UN is any unsigned number.
• SN is any signed number.
• B is the radix of the numeral system, B is a natural number greater than 1.
• $a_{i}$ is the coefficient from the set {0, 1, … B-1}, i is in the range from m-1 to -f.
• $B^{i}$ is the weight, i is in the range from m-1 to -f.
• m is the number of coefficients of the integer part including the sign.
• f is the number of coefficients of the fractional part.
• FX is a fixed point number and also may be understood as a real number.
• IN is an integer number, unsigned or signed.
• $1/2^{f}$ is the scaling factor.
• The radix point is between the coefficients $a_{0}$ and $a_{-1}$ and it is indicated by the vertical line.
For a signed number, the coefficient $a_{m-1}$ is the sign bit. □
For the conversion it is necessary to know: the polynomial of the numeral system, the algorithm of two’s complement and scaling. The order of the application of the rules is arbitrary. □
Signed numbers use two’s complement, which is briefly defined by the words bitwise NOT plus 1, formula (0810), or by the mathematical definition, formula (0811), which can be used in any numeral system.
${}^{2}A = (\sim A) + 1$ (0810)
${}^{2}A = 2^{n} - A$ (0811)
Where
• ${}^{2}A$ is the two’s complement of number A and corresponds to number -A.
• A is a natural number.
• n is the number of bits used for the representation.
The number of bits in the integer part of fixed point number is given by definition ℚm.f and
the number of bits in the integer part of the computer word can be different. Then the
fixed point number as integer also has a different value. This difference can be seen in case
of two’s complement, Fig. 08-03. However, formulas (0807), (0808) and (0809) give the
same value. Therefore in the examples of the conversion, the fixed point value and the
value derived from the computer word will also be calculated.
The conversion to fixed point means to calculate the integer number (IN) or the fixed point number with the radix point and the corresponding value in the computer word. The conversion to the integer number and fixed point value can be performed in two ways; first by using a scaling factor and then by using the classical conversion to the binary or hexadecimal numeral system. After that, the number is placed into the computer word and sign extension is used. The format ℚm.f and the word size must be known and the conversion to fixed point begins from the known decimal real value. The conversion based on the scaling factor, formula (0809), and the subsequent placement into the computer word can be made by the following algorithm, Fig. 08-05.
Conversion to fixed point. □
• Check if the number is in the range of representation, otherwise it is the overflow.
• Multiply the real number by $2^{f}$.
• Round this product to an integer.
• Convert the decimal integer number to the binary or hexadecimal numeral system.
• Define and set the sign bit,
  o to 0 for positive numbers; the integer stays without any changes.
  o to 1 for negative numbers; two’s complement is calculated, formula (0810) or (0811).
• The number is placed into the computer word with sign extension.
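The steps above can be sketched in Python. The function name to_fixed and its parameters are assumptions made for this illustration; m is taken as the number of integer bits without the sign bit, as in the ℚm.f format, and the built-in round function is used, which rounds ties to the even integer.

```python
def to_fixed(x: float, m: int, f: int, word_bits: int = 16) -> int:
    """Convert a real number to a signed Qm.f value stored in a word.

    The result is the unsigned content of the word (two's complement).
    """
    if m + f + 1 > word_bits:
        raise ValueError("format does not fit into the word")
    lo, hi = -(2**m), 2**m - 2**-f       # range of representation, (0805)
    if not lo <= x <= hi:
        raise OverflowError("number is out of the range of representation")
    n = round(x * 2**f)                  # scale by 2**f and round to integer
    # Masking a negative Python integer yields its two's complement,
    # already sign-extended to the whole word.
    return n & (2**word_bits - 1)

print(hex(to_fixed(1.23, 2, 11)))    # 0x9d7
print(hex(to_fixed(-2.36, 2, 10)))   # 0xf68f
```

The two calls use the values 1.23 in ℚ2.11 and -2.36 in ℚ2.10 in a 16-bit word, so the output can be compared with the hand calculation.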
As you see in Fig. 08-05, the conversion is not accurate and the difference is caused by the limited number of bits in the fractional part and by rounding. Only multiples of the LSB weight are represented accurately. Fig. 08-05 also shows the reverse conversion as a check made on the basis of formula (0808).
Another way of the conversion is to convert the integer and fractional part separately to the hexadecimal or binary numeral system. The number of bits in the fractional part is f, which is stated in the definition ℚm.f, and therefore the calculation of the fractional part must be made to f+2 bits and rounding must be performed. This algorithm is used for checking in Fig. 08-06.
Positive real number 1.23 to ℚ2.11 in the 16-bit word
• 1.23 * $2^{11}$ = 2 519.04
• 2 519.04 ≈ 2 519
• 2 519D = 09D7H
• IN = 09D7H
• Fx = 0 0001.0011 1010 111B = 1.3AEH
• Placement to word is 0x09D7
• Value in the word
• IN = 09D7H
• Fx = 1.3AEH □
Check
• 1.3AEH = 1*$16^{0}$ + 3*$16^{-1}$ + A*$16^{-2}$ + E*$16^{-3}$ = 1.229 980 468 75D
• Real number is 1.229 980 468 75 □

Negative real number -2.36 to ℚ2.10 in the 16-bit word
• -2.36 * $2^{10}$ = -2 416.64
• -2 416.64 ≈ -2 417
• -2 417D = -971H
• ($2^{13}$)D - 971H = 2000H - 971H = ${}^{2}$168FH
• IN = -971H, FX = -10.0101 1100 01B = -2.5C4H
• In two’s complement
• IN = 168FH, Fx = 101.1010 0011 11B = 5.A3CH
• Placement to word is 0xF68F, two’s complement
• Value in the word
• IN = F68FH
• F68FH * $1/2^{10}$ (scaled by) = ${}^{2}$11 1101.1010 0011 11B = ${}^{2}$3D.A3CH
• Fx = 3D.A3CH, 2’s complement □
Check
• 3D.A3CH = -1*$2^{5}$ + 1*$2^{4}$ + D*$16^{0}$ + A*$16^{-1}$ + 3*$16^{-2}$ + C*$16^{-3}$ = -2.360 351 562 5
• Real number is -2.360 351 562 5 □
Fig. 08-05 Conversion to fixed point
Conversion from fixed point is also given by the above-mentioned formulas (0807) to (0811) and the known format ℚm.f. The order of application of the formulas is arbitrary, but it is necessary to calculate in the binary or hexadecimal system in the same way as in the decimal numeral system. In the following text, two basic algorithms are described. The first possibility calculates the signed decimal integer number and then the scaling factor $1/2^{f}$ is applied, Fig. 08-06. The algorithm is:
• Convert the given number to the signed decimal integer. Note that the sign is the MSB bit and the theory of 2’s complement is used, formula (0810) or (0811).
• Multiply the obtained integer number by the scaling factor $1/2^{f}$. The result is a real number.
• A variation is to calculate two’s complement in the decimal numeral system.
The second way first applies the scaling factor $1/2^{f}$ and then the polynomial of the numeral system, formula (0808). All calculations are made on the word, which contains the sign-extended fixed point number, Fig. 08-07.
• The given integer number is multiplied by the scaling factor of $1/2^{f}$, in the hexadecimal numeral system.
• Apply formula (0808) and the result is a real number.
Conversion from fixed point. □
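The first algorithm, a signed integer followed by the scaling factor, can be sketched in Python as follows. The function name from_fixed and its parameters are assumptions for this illustration; the word is expected to contain the sign-extended two's complement number.

```python
def from_fixed(word: int, f: int, word_bits: int = 16) -> float:
    """Convert a sign-extended two's complement word to a real number."""
    if word & (1 << (word_bits - 1)):   # MSB is set: the number is negative
        word -= 1 << word_bits          # from formula (0811): -A = word - 2**n
    return word / 2**f                  # apply the scaling factor 1/2**f

print(from_fixed(0x062B, 8))     # 6.16796875
print(from_fixed(0xDCBA, 11))    # -4.4091796875
```

Only the number of fractional bits f is needed here, because sign extension makes the MSB of the word equal to the sign of the fixed point number.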
The 16 bit number 0x062B with format ℚ3.8
• 0x062B is a positive number, MSB = 0
• In the word IN = +062BH, Fx = +06.2BH
• 062BH = 1 579D
• 1 579D scaled by $1/2^{8}$ = 6.167 968 75D
• Real number is 6.167 968 75 □
Check
• 6D = 110B
• 0.167 968 75D * 16 = …
• 0.167 968 75D = 0.2BH
• 6.2BH * $2^{8}$ = 62BH
• With sign extension to the word it is 0x062B
• Word contains 0x062B □

The 16 bit number 0xDCBA with format ℚ4.11
• 0xDCBA is a negative number, MSB = 1
• In the word, IN = DCBAH, FX = 1B.974H
• ${}^{2}$DCBAH = - (($2^{16}$)D - DCBAH) = - 2346H
• - 2346H = - 9 030D
• - 9 030D scaled by $1/2^{11}$ is - 4.409 179 687 5D
• Real number is - 4.409 179 687 5 □
Check
• 4D = 100B
• 0.409 179 687 5D * 2 = …
• 0.409 179 687 5D = 0.68CH
• - 4.409 179 687 5D = - 4.68CH
• - 4.68CH * $2^{11}$ = - 100.0110 1000 110B * $2^{11}$ = - 2346H
• Two’s complement with sign extension: (~2346H) + 1 = ${}^{2}$DCBAH = 0xDCBA
• Word contains 0xDCBA □
Fig. 08-06 Conversion from fixed point
The 16 bit number 0x2ED8 with format ℚ7.7
• Note sign extension is used
• FX = 2ED8H * $1/2^{7}$ = 0 0101 1101.1011 000B
• FX = 05D.B0H
• SN = -0*$2^{8}$ + 5*$16^{1}$ + D*$16^{0}$ + B*$16^{-1}$ + 0*$16^{-2}$
• SN = 93.687 5D
• Real number is 93.687 5D □
Check
• 93D = 5DH
• 0.687 5D * 16 = …
• 0.687 5D = 0.BH
• 93.687 5D = 5D.BH
• Fx = 5D.BH = 0 0101 1101.1011 000B
• Integer number in the word is 0x2ED8 □

The 16 bit number 0xE9AB with format ℚ5.9
• Note sign extension is used
• FX = E9ABH * $1/2^{9}$ = 111 0100.1101 0101 1B
• FX = 74.D58H
• SN = -1*$2^{6}$ + 1*$2^{5}$ + 1*$2^{4}$ + 4*$16^{0}$ + D*$16^{-1}$ + 5*$16^{-2}$ + 8*$16^{-3}$
• SN = -11.166 015 625D
• Real number is -11.166 015 625D □
Check
• -11.166 015 625D * $2^{9}$ = -5 717D
• ($2^{16}$)D - 5 717D = ${}^{2}$59 819D (two’s complement in decimal, 16 bits)
• 59 819D = E9ABH
• Integer number in the word is 0xE9AB □
Fig. 08-07 Conversion from fixed point by means of the polynomial of numeral system
8.6 Arithmetic operations

The significance of fixed point lies in the usage of the integer binary arithmetic. The fixed point theory changes a real number to an integer number with a scaling factor. Therefore, the arithmetic operations are the arithmetic operations with fractions. Formulas (0812), (0813) and (0814) define the basic principle of arithmetic operations. As you can see, the classical definition of fixed point is used, where the numerator is an integer number and the denominator is a scaling factor. The arithmetic operations can be performed with the same or different scaling factor, however, the same scaling factor is preferred in real practice. Each fixed point number is defined by the format ℚm.f and the basic arithmetic operations may change the format of the result. Therefore, it is necessary to define a further operation that changes the scaling factor and moves the radix point to the required position. This is performed by multiplication or division by a power of 2. This operation is called the adjustment. It is also necessary to realize that the operations are performed on the defined size of the arithmetic unit, i.e. 16, 32 or 64 bits. Fixed point numbers are defined by the format and can be placed into every word size.
Arithmetic operation – the same scaling factor. □
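$\frac{a}{2^{f}} + \frac{b}{2^{f}} = \frac{a+b}{2^{f}}$ (0812)
$\frac{a}{2^{f1}} \cdot \frac{b}{2^{f2}} = \frac{a \cdot b}{2^{f1+f2}}$ (0813)
$\frac{a/2^{f1}}{b/2^{f2}} = \frac{a}{b} \cdot \frac{2^{f2}}{2^{f1}}$ (0814)
Where
• a, b are integer numbers.
• $1/2^{f}$ is the scaling factor.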
The adjustment of the number means changing the scaling factor and subsequently the position of the radix point. The change of the position of the radix point is performed by multiplication or division by a power of 2. The multiplication by a power of 2 can be performed by an arithmetic shift left, formula (0815). This shift changes the position of the radix point and can change the sign. This situation is the overflow.
A * $2^{i}$ = A << i (0815)
A // $2^{i}$ = A >> i (0816)
Where
• A is the signed integer number.
• i is the exponent of the power.
Division by a power of 2 can be performed by an arithmetic shift right, formula (0816). The quotient corresponds to floor division, where rounding is towards minus infinity, [ISO/IEC_0801]. This shift changes the position of the radix point and the sign bit stays unchanged.
Adjustment of the number. □
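A minimal Python sketch of the adjustment follows. The function name adjust and its parameters are assumptions for this illustration. Python integers are not limited to a word size, so the overflow caused by a shift left is not modelled here.

```python
def adjust(a: int, f_old: int, f_new: int) -> int:
    """Move the radix point of the signed integer a from f_old to f_new bits."""
    if f_new >= f_old:
        return a << (f_new - f_old)   # formula (0815), multiplication by 2**i
    return a >> (f_old - f_new)       # formula (0816), floor division by 2**i

print(adjust(6, 3, 4), adjust(-6, 3, 4))   # 12 -12
print(adjust(6, 3, 2), adjust(-6, 3, 2))   # 3 -3
print(-7 >> 1, -7 // 2)                    # -4 -4
```

The last line shows that the arithmetic shift right of a negative number gives the same result as floor division.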
Before ℚ3.3 (bits 7 to 0)     After ℚ3.4 (bits 7 to 0)
0 0 0 0 0 1 1 0               0 0 0 0 1 1 0 0               $6/2^{3}$ * 2/2 = $12/2^{4}$
1 1 1 1 1 0 1 0               1 1 1 1 0 1 0 0               $-6/2^{3}$ * 2/2 = $-12/2^{4}$
Fig. 08-08 Adjustment of the number by multiplication (arithmetic shift left)
Before ℚ3.3 (bits 7 to 0)     After ℚ3.2 (bits 7 to 0)
0 0 0 0 0 1 1 0               0 0 0 0 0 0 1 1               ($6/2^{3}$) / (2/2) = $3/2^{2}$
1 1 1 1 1 0 1 0               1 1 1 1 1 1 0 1               ($-6/2^{3}$) / (2/2) = $-3/2^{2}$
Fig. 08-09 Adjustment of the number by division (arithmetic shift right)
Note about the multiplication and the division by the power of 2
• Arithmetic shift left is the multiplication by 2.
• Arithmetic shift right is the division by 2.
• Division for two’s complement is the floor division (round towards minus infinity), according to standard ISO/IEC 10967-1:2012 Language Independent Architecture.
• Example in Python: -7//2 = -4, -8//2 = -4. □
8.7 Addition and subtraction

The addition and the subtraction are performed on the integer number in two’s complement, formula (0812). Both operands have the same size of the fractional part. It is important to realize that the sum and the difference can have a new format, which depends on the value of the operands, formula (0817). The new format has the same fractional part but the integer part has one bit more than the maximum of both integer parts.
ℚm1.f ± ℚm2.f = ℚ(max(m1, m2) + 1).f (0817)
Where
• m1, m2 are sizes of the integer part of operands.
• max(m1, m2) is a function that chooses the maximum value from m1 and m2.
• f is the size of the fractional part of operands.
$a/2^{f} + b/2^{f} = (a+b)/2^{f}$ □
The difference is the result of subtraction. □
The sum is the result of addition. □
Format     Binary representation in byte     Integer arithmetic     Fraction          Decimal rational number
           (bits 7 to 0, MSB to LSB)
ℚ0.5       0 0 0 1 1 0 0 0                   24                     0x18/$2^{5}$      0.75
ℚ0.5 +     0 0 0 1 0 1 0 0                   20                     0x14/$2^{5}$      0.625
ℚ1.5       0 0 1 0 1 1 0 0                   44                     0x2C/$2^{5}$      1.375
Check: 0x2C/$2^{5}$ = 44/$2^{5}$ = 1.375
Fig. 08-10 Example, when bit is generated to integer part
Format     Binary representation in byte     Integer arithmetic     Fraction          Decimal rational number
           (bits 7 to 0, MSB to LSB)
ℚ0.3       1 1 1 1 1 0 1 1                   -5                     0xFB/$2^{3}$      -0.625
ℚ0.3 +     0 0 0 0 0 1 0 0                   4                      0x04/$2^{3}$      0.5
ℚ1.3       1 1 1 1 1 1 1 1                   -1                     0xFF/$2^{3}$      -0.125
Check: 0xFF/$2^{3}$ = ${}^{2}$FFH/$2^{3}$ = -1/$2^{3}$ = -0.125
Fig. 08-11 Example of the addition of negative and positive number
An example in which the addition changes the format of the result is in Fig. 08-10. It is the situation (0.1b + 0.1b = 1.0b), where one bit is generated to the integer part and increases the size of the integer part. The size of the fractional part stays without any change. An example of the addition of a negative and a positive number is in Fig. 08-11. The negative number is represented by two’s complement.
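The addition can be followed in Python with plain integers. The function name add_fixed is an assumption for this sketch; negative operands are passed as signed Python integers instead of their two's complement bit pattern.

```python
def add_fixed(a: int, b: int, f: int) -> tuple[int, float]:
    """Add two fixed point numbers given as integers with f fractional bits."""
    s = a + b                 # integer addition, formula (0812)
    return s, s / 2**f        # integer result and its real value

print(add_fixed(0x18, 0x14, 5))   # (44, 1.375)
print(add_fixed(-5, 4, 3))        # (-1, -0.125)
```

In the first call the sum 44 = 0x2C no longer fits into 5 bits, which is the extra integer bit described by formula (0817).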
8.8 Multiplication
Fixed point multiplication is defined by the multiplication of fractions (0813) and the same scaling factor is not so important. Operands can have different formats of ℚm.f, therefore the format of the result is calculated by formula (0818) for unsigned and formula (0819) for signed format, literature [Yates_082009] and [Oberstar_082007]. The result can be changed to the desired format by first rounding and then adjustment.
$a/2^{f1} \cdot b/2^{f2} = (a \cdot b)/2^{f1+f2}$ □
ℚm1.f1 * ℚm2.f2 = ℚ(m1 + m2).(f1 + f2) (0818)
ℚm1.f1 * ℚm2.f2 = ℚ(m1 + m2 + 1).(f1 + f2) (0819)
Where
• m1 and m2 are numbers of integer bits, for each operand.
• f1 and f2 are numbers of fractional bits, for each operand.
A basic algorithm for integer multiplication is only defined for positive operands. If an operand is negative, it is necessary to change the operand to a positive one and then multiply. The sign of the result is calculated separately; when the result is negative, then 2’s complement is calculated. Special algorithms exist for the multiplication of operands in 2’s complement, literature [wiki_0806] and [wiki_0807].
Format       Binary representation (MSB to LSB)        Fraction
ℚ2.5         0 0 1 0 1 0 1 1                           2BH/$2^{5}$
ℚ2.5 *       0 1 0 0 1 1 1 0                           4EH/$2^{5}$
ℚ5.10        0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0           0D1AH/$2^{10}$
The operands are placed into the byte with sign extension.
Decimal rational number: $43/2^{5}$ * $78/2^{5}$ = $3354/2^{10}$ = 3.275 390 625
Fig. 08-12 Example of multiplication
An example of the multiplication is in Fig. 08-12 and the product has a new format. When
the same format is expected, the adjustment with rounding is made, Fig. 08-13. The round-
ing is performed by the principle of the rounding to nearest, ties to even, [IEEE 754-2008]
and [wiki_0810].
ℚ5.10 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 0
R bit = 1
S bit = 1
Then add 1/2ulp
ℚ5.5 0 0 0 0 1 1 0 1 0 0 1
Fig. 08-13 Adjustment to a new format with rounding
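The multiplication in Fig. 08-12 and the adjustment in Fig. 08-13 can be retraced with a short Python sketch. It is only an illustration under stated assumptions: the function names `q_mul` and `round_to_nearest_even` are hypothetical, the operands are assumed to be non-negative raw integers, and the R and S bits are taken as the first removed bit and the logical OR of the remaining removed bits.

```python
def q_mul(a: int, b: int) -> int:
    """Multiply two raw fixed-point values a/2^f1 and b/2^f2.

    The returned raw value has f1 + f2 fractional bits.
    """
    return a * b


def round_to_nearest_even(value: int, drop: int) -> int:
    """Remove `drop` low-order bits, rounding to nearest, ties to even.

    Assumes value >= 0 and drop >= 1.
    """
    kept = value >> drop                              # truncated result
    r_bit = (value >> (drop - 1)) & 1                 # first removed bit (R)
    s_bit = (value & ((1 << (drop - 1)) - 1)) != 0    # OR of the other removed bits (S)
    if r_bit and (s_bit or (kept & 1)):               # above half, or a tie with an odd kept part
        kept += 1
    return kept


product = q_mul(0x2B, 0x4E)                  # raw product with 5 + 5 = 10 fractional bits
rounded = round_to_nearest_even(product, 5)  # back to 5 fractional bits
print(hex(product), rounded, rounded / 2**5)
```

With the operands of Fig. 08-12 the raw product is 0x0D1A, and removing five bits with R = 1 and S = 1 rounds the kept part up by one unit in the last place.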
8.9 Division
Division is given by formula (0814), and it changes the scaling factor. The calculation of the new format of the quotient is complicated; more information can be found in literature [Yates_082013]. When the numerator and the denominator have the same scaling factor, the quotient has no scaling factor. If the result is expected with the same scaling factor, then formula (0814) is modified to formula (0820).
$(a/2^{f_1}) / (b/2^{f_2}) = (a/b) \cdot (2^{f_2}/2^{f_1}) = (a/b) \cdot 1/2^{f_1 - f_2}$ □
$(A/2^{f}) \div (B/2^{f}) = (A/B) \cdot (2^{f}/2^{f}) = ((A \cdot 2^{f})/B) \cdot 1/2^{f}$ (0820)
Where
 A, B are operands of division.

 1/2f is the scaling factor.
Fig. 08-14 shows the example of the integer division of positive operands. The same scaling
factor for operands and result is used. In case of one negative operand, the quotient de-
pends on the algorithm used. The detailed description of the division algorithms is in litera-
ture [Ercegovac_2004], [Koren_2002] and [wiki_0811].
1st step, original definition
    Numerator (bits 7 ... 0)              Denominator (bits 7 ... 0)
    0 1 1 1 1 0 1 0                       0 0 0 0 0 0 1 1
2nd step, numerator is multiplied by $2^5$
    Numerator (bits 7 ... 0, 4 ... 0)     Denominator (bits 7 ... 0)
    0 1 1 1 1 0 1 0  0 0 0 0 0            0 0 0 0 0 0 1 1
3rd step, quotient
    0 0 0 0 0 1 0 1  0 0 0 1 0 1 0 1
Calculation in hexadecimal:
(7A/$2^5$) ÷ (3/$2^5$) = ((7A * $2^5$)/3) * (1/$2^5$) = (F40/3) * (1/$2^5$) = 515 * 1/$2^5$
Check: (0x7A/$2^5$) / (0x03/$2^5$) = (122D/$2^5$) / (3D/$2^5$) = 3.812 5/0.093 75 = 40.666 666 6…
Computed result is 0x0515/$2^5$ = 1301D/$2^5$ = 40.656 25
Fig. 08-14 Example of fixed point division with the same scaling factor
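The three steps of Fig. 08-14 can be sketched in Python as follows. This is a minimal illustration, assuming a non-negative numerator, a positive denominator and a truncating integer division; the name `q_div` is hypothetical. The exact quotient is computed with the standard `fractions` module for comparison.

```python
from fractions import Fraction


def q_div(a: int, b: int, f: int) -> int:
    """Divide raw fixed-point values that share the scaling factor 1/2^f.

    The numerator is multiplied by 2^f first, so the truncated integer
    quotient keeps f fractional bits, as in formula (0820).
    """
    return (a << f) // b


f = 5
quotient = q_div(0x7A, 0x03, f)                       # raw quotient, scaled by 1/2^5
exact = Fraction(0x7A, 2**f) / Fraction(0x03, 2**f)   # exact rational quotient
print(hex(quotient), quotient / 2**f, float(exact))
```

The difference between the computed and the exact quotient is smaller than one unit in the last place of the result format, here 1/$2^5$.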
8.10 References
[Ercegovac_2004] M. D. Ercegovac, T. Lang; Digital Arithmetic; Morgan Kaufmann Publishers 2004; ISBN 1-55860-798-6
[Google_0801] libfixmath, wiki; https://code.google.com/p/libfixmath/w/list; on line 2014-07-15
[IEEE 754-2008] IEEE Std 754™-2008, IEEE Standard for Floating-Point Arithmetic, 29 August 2008, revision of IEEE 754-1985
[Koren_2002] I. Koren; Computer Arithmetic Algorithms; A. K. Peters Ltd. 2002; ISBN 1-56881-160-8
[Malapeti_082010] H. Malapeti: Digital media processing, Appendix B: Mathematical computations on fixed-point processors; Elsevier 2010; ISBN 978-1-85617-678-1
[Oberstar_082007] E. L. Oberstar: Fixed-point Representation & Fractional Math, Revision 1.2; published by Oberstar Consulting 2007
[Python_0801] 9.5. fractions — Rational numbers; https://docs.python.org/2/library/fractions.html#module-fractions; on line 2014-07-17
[TI_082003] TMS320C64x DSP Library Programmer's Reference; Texas Instruments, October 2003, SPRU565B
[vhdl_0801] D. Bishop: Fixed point package user's guide; http://www.vhdl.org/fphdl/Fixed_ug.pdf; on line 2014-07-18
[wiki_0801] Division (mathematics); http://en.wikipedia.org/wiki/Division_(mathematics); on line 2013-08-03
[wiki_0802] Fraction (mathematics); http://en.wikipedia.org/wiki/Fraction_(mathematics); on line 2013-08-03
[wiki_0803] Fixed-point arithmetic; http://en.wikipedia.org/wiki/Fixed-point_arithmetic; on line 2013-08-03
[wiki_0804] Binary scaling; http://en.wikipedia.org/wiki/Binary_scaling; on line 2014-07-15
[wiki_0805] Q (number format); http://en.wikipedia.org/wiki/Q_(number_format); on line 2014-07-15
[wiki_0806] Binary multiplier; http://en.wikipedia.org/wiki/Multiplication_ALU; on line 2013-09-13
[wiki_0807] Booth's multiplication algorithm; http://en.wikipedia.org/wiki/Booth%27s_multiplication_algorithm; on line 2013-09-13
[wiki_0808] libfixmath; http://en.wikipedia.org/wiki/Libfixmath; on line 2014-07-15
[wiki_0809] Rational data type; http://en.wikipedia.org/wiki/Rational_data_type; on line 2014-07-17
[wiki_0810] Rounding; http://en.wikipedia.org/wiki/Rounding; on line 2014-07-18
[Yates_082013] R. Yates: Fixed-Point Arithmetic: An Introduction; Digital Signal Labs - signal processing systems 2013; http://www.digitalsignallabs.com/fp.pdf; on line 2014-07-18
9 Floating point numbers
The term floating point (also floating point numbers or floating point data) is mainly used in computing for writing and displaying real numbers, [wiki_0901]. Floating point numbers are illustrated in Fig. 09-01. A floating point number consists of significant digits with a sign, scaled by the base raised to the power of an exponent. The base of the scale is the base of the numeral system, e.g. 10 or 2. Significant digits can be a signed integer number or a signed number with a radix point. In this context, the term radix point is more suitable because it does not depend on the base of the numeral system as, for example, the decimal point or the binary point does. In general, the floating point format is given by formula (0901).
$\text{significant digits} \times \text{base}^{\text{exp}}$ (0901)
Where
• significant digits with a sign are digits from the range of 0 to b-1.
• base is the base b of the numeral system.
• exp is an exponent, which is an integer number.
• $\text{base}^{\text{exp}}$ is a scaling factor.
In the decimal numeral system
$123.0 \times 10^{0} = 12.3 \times 10^{1} = 1.23 \times 10^{2} = 0.123 \times 10^{3}$
$123 \times 10^{-3} = 12.3 \times 10^{-2} = 1.23 \times 10^{-1} = 0.123 \times 10^{0}$
Significant digits are defined by the base of the numeral system.
In the binary numeral system
$1101.0 \times 2^{0} = 110.1 \times 2^{1} = 11.01 \times 2^{2} = 1.101 \times 2^{3}$
$11 \times 2^{-3} = 1.1 \times 2^{-2} = 0.11 \times 2^{-1} = 0.011 \times 2^{0}$
In the binary numeral system, it is useful and usual to express the exponent as a decimal number.
Theoretic notation
Significant digits x $\text{base}^{\text{exp}}$; + is the concatenation operator
Fig. 09-01 Possibility of floating point notation
Note to writing the floating point numbers and radix point
Significant digits and a base are expressed in the defined numeral system but the expo-
nent is always in the decimal numeral system. The decimal exponent is more suitable
for moving the radix point.
The terms decimal point and binary point depend on the numeral system, but radix
point is the general term for all numeral systems.□
The reason for defining floating point numbers is the possibility of representing numbers in a large range. For instance, the distance between galaxies is given in light-years, whereas the wavelength of light is a small number given in nanometers. Both values may be used in one computation with the maximal precision.
In formula (0901), the significant digits are often called the significand or the mantissa. Significand is the newest term; mantissa is a historical term. One problem of floating point numbers is how to write them down in all situations. In literature, the classical typographic or mathematical convention is used, but computer science uses another format. The following terms are connected with the notation of floating point numbers: the scientific notation, the normalized representation and the engineering notation.
Scientific notation: $87.6 \times 10^{3}$, $10.01 \times 2^{7}$ □
• Scientific notation. This format uses the sign, significant digits, base and exponent, [wiki_0902]. The significand is any real number. Examples of scientific notation are: $2.0 \times 10^{2}$, $0.2 \times 10^{3}$, $123 \times 10^{45}$, $12.3 \times 10^{-67}$, $0.123 \times 10^{3}$, $1.23 \times 10^{2}$, $11.01 \times 2^{4}$, $1.101 \times 2^{5}$ ….
• Engineering notation is a variant of scientific notation with the base 10, where the exponent is a multiple of 3, [wiki_0903]. It corresponds to the SI prefixes, which are preferred by engineers. SI is the French abbreviation for the International System of Units, [wiki_0913]. For instance, $11.3 \times 10^{-3}$ m is 11.3 mm; $34.5 \times 10^{3}$ m is 34.5 km; 10 Gbps is $10 \times 10^{9}$ bps (bits per second); 1.5 TiB is $1.5 \times 2^{40}$ B.
Engineering notation: $87.6 \times 10^{3}$, $3.01 \times 10^{-12}$ □
• Normalized scientific notation. It is a special format, where the integer part has only one digit and this digit is not equal to zero, [wiki_0904]. It means that the integer part has the weight $b^{0}$. For instance, $7.8 \times 10^{2}$, $1.101 \times 2^{7}$ ….
Normalized scientific notation: $8.76 \times 10^{4}$, $1.001 \times 2^{8}$ □
• E notation, Fig. 09-02, writes the floating point number on one line, where all parts are in one row without the superscript of the exponent, such as $10^{6}$. E notation uses the letter e or E (small e or capital E) to express and separate the exponent, for instance, 12.3e-3, 12.3E-3, 11.1e2 …. This format is used by a lot of calculators, spreadsheets and other programs. Programming languages such as Ada, C++, MATLAB, Scilab, Perl, Java and Python use E notation, [wiki_0902].
E notation: 8.76E4, 1.001E8 □
Fig. 09-02 Example of E notation
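The notations above can be tried in Python, which accepts E notation directly. The following is only a sketch: the helper `to_engineering` is a hypothetical name introduced here, it assumes a non-zero input, and it relies on ordinary binary floating point, so it is meant for illustration rather than for exact decimal work.

```python
import math


def to_engineering(value: float) -> str:
    """Format a non-zero number with a base-10 exponent that is a multiple of 3."""
    exponent = math.floor(math.log10(abs(value)))   # exponent of the normalized form
    exponent3 = 3 * (exponent // 3)                 # round down to a multiple of 3
    significand = value / 10**exponent3
    return f"{significand:g} x 10^{exponent3}"


x = float("8.76E4")            # E notation is parsed by float()
print(x)                       # plain decimal notation
print(f"{x:.2e}")              # normalized scientific notation written in E notation
print(to_engineering(x))       # engineering notation
print(to_engineering(0.0113))  # exponent -3 corresponds to the SI prefix milli
```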
Note to radix point and exponent
$4.56 \times 10^{3} = 45.6 \times 10^{2}$          $7.89 \times 10^{1} = 0.789 \times 10^{2}$
• When the digits are shifted to the left (the radix point moves to the right), the exponent is decremented by one for each position.
• When the digits are shifted to the right (the radix point moves to the left), the exponent is incremented by one for each position. □
All these notations are used in real practice and are automatically understood as
floating point numbers in the decimal numeral system. A notation in a numeral system
different from the decimal one is solved by special notations and differs case by case.
In the history of computer science, a lot of different definitions and representations existed of how to represent a floating point number in bytes or words. The pioneers in this field were Leonardo Torres y Quevedo and Konrad Zuse. In 1914, Torres y Quevedo designed an electro-mechanical version of the Analytical Engine of Charles Babbage which included floating-point arithmetic. In 1938, Konrad Zuse, from Berlin, completed the Z1, the first mechanical binary programmable computer; this was, however, unreliable in operation. It worked with 24-bit binary floating-point numbers having a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit. More information is in literature [wiki_0901], [Randeli_1982] and [Rojas_1997].
After this period, new computer architectures were developed and each of them had its own format and properties of the floating point. These different definitions caused problems with the data exchange between users and different computer architectures. All historical experience led to the IEEE Standard for Floating-Point Arithmetic (IEEE 754). The first version of this standard was published in 1985 and covered only binary floating point arithmetic. Subsequently, in 1987, the standard IEEE 854-1987 was published for radix-independent floating point arithmetic, [wiki_0906]. The second version of IEEE 754 was published in 2008 and it includes the original versions IEEE 754-1985 and IEEE 854-1987. The standard IEEE 754-2008 is also the international standard ISO/IEC/IEEE 60559:2011.
IEEE 754, ISO/IEC/IEEE 60559:2011 □
The following explanation is based on this standard, which defines floating point data
for the base 2 and the base 10 of the numeral system. Literature [wiki_0906] states the
significance of the standard IEEE-754 as follows:
“The standard defines
 arithmetic formats: sets of binary and decimal floating-point data, which consist of
finite numbers (including signed zeros and subnormal numbers), infinities, and spe-
cial "not a number" values (NaNs)
 interchange formats: encodings (bit strings) that may be used to exchange floating-
point data in an efficient and compact form
 rounding rules: properties to be satisfied when rounding numbers during arithmetic
and conversions
 operations: arithmetic and other operations on arithmetic formats

 exception handling: indications of exceptional conditions (such as division by zero,
overflow, etc.)
The standard also includes extensive recommendations for advanced exception handling,
additional operations (such as trigonometric functions), expression evaluation and for
achieving reproducible results.”
Arithmetic format can be understood as a value of floating point operands and a result of
the operation. Arithmetic format is used for calculation with floating point. Interchange
format is defined by the fields in a word and by the encoding scheme for the purpose of
placing the floating number into the word. Interchange format is useful for exchanging the
floating point data between different computer architectures.
Note to significand, coefficient and mantissa
Mantissa was the first historical term for designating significant digits in the floating
point notation. Konrad Zuse used the term of mantissa in the period of 1939–1941 in the
computer Z-3, [Zuse_2008], Burks in 1946 [Burks_1946] and [RFC 0382] in 1972.
Subsequently, the term of significand or fraction is used by the standard IEEE 754-1985
but standard IEEE 754-2008 only uses the term of significand that is denoted either by
the small letters c or m. Both small letters c and m are referred to as numbers in a specif-
ic form. In the other sources, the small letter c is referred to as a coefficient.
The term of mantissa is discouraged by the IEEE floating-point standard committee, be-
cause it conflicts with the pre-existing use of mantissa for the fractional part of a loga-
rithm, [wiki_0905].□
9.1 Significand
The significand is the newest and official term for significant digits, and it is defined by standard IEEE 754-2008. The significand can be thought of as an integer or a fraction. The digits used in the significand are defined by the base and they are from the range of 0 to b-1. The term coefficient is also used by this standard but mantissa is not. Mantissa was officially used in the past, and many people in the computer field and in literature still use this term. More information is in literature [wiki_0905].
Significant digits x $b^{exp}$ □
The number 12.34 in the decimal numeral system can be written down in several ways:
• $1.234 \times 10^{+1}$, the significand is in the form with radix point. This is a normalized form of the notation.
• $0.1234 \times 10^{+2}$, the significand is in the form with radix point. This notation is allowed by LIA - Language Independent Arithmetic, [ISO/IEC_0901], and several programming language standards, [wiki_0905].
• $1234 \times 10^{-2}$, the significand is an integer.
• $(1234/10^{4}) \times 10^{+2}$, the significand is a fraction. Fixed point notation is used.
9.2 Precision
The precision is the maximum number of digits in the significand and it is a basic parameter of the interchange floating point format. The precision is denoted by the small letter p, IEEE 754. Fig. 09-03 shows the definition of the precision in the different notations of the significand.
Precision, letter p, is the maximum number of digits in the significand.□
In decimal: $1.234567 \times 10^{4} = \frac{1\,234\,567}{10^{6}} \times 10^{4}$
In binary: $1.111101 \times 2^{4} = \frac{111\,1101}{2^{6}} \times 2^{4}$
Generally, it is also possible to write:
$d_0 . d_1 d_2 \ldots d_{p-1} \times b^{exp} = \frac{d_0 d_1 d_2 \ldots d_{p-1}}{b^{p-1}} \times b^{exp}$
Fig. 09-03 Precision
9.3 Floating point values

All values of the floating point are defined by the triple (sign, exponent, significand) in the radix b. The possible values are a finite value, e.g. 3.14 or 0.005, infinity and Not a Number. Finite values defined by this triple are expressed by formula (0902). The number zero also belongs to the finite values, and floating point data have a plus and a minus zero. The sign is expressed by the power $(-1)^{sign}$, where sign can have the value 0 or 1. When sign has the value 0, the sign is plus, $(-1)^{0} = 1$. When sign has the value 1, the sign is minus, $(-1)^{1} = -1$.
Finite value.□
$v = (-1)^{sign} \times significand \times b^{exp}$ (0902)
Where
• v is a finite value of the floating point number.
• sign is the sign and can have the value 0 or 1.
• significand is a number in the numeral system with radix b.
• b is the radix of the numeral system.
• exp is an exponent.
Infinity is a value of the floating point data. It represents the situation when the result is out of the representation range. The floating point data have plus or minus infinity. It means that finite values have minimum and maximum values, which are defined by the interchange floating point formats. The infinity value can also be the input value of an operation.
Infinity.□
Not a Number, abbreviated NaN, is a special value for the situation when the result of an operation is not defined. Examples are the square root of a negative number when a real result is expected, e.g. $\sqrt{-2}$, or the inverse sine of a number higher than 1, e.g. arcsin(5). The standard IEEE 754 defines the result in these situations as Not a Number, NaN. The standard defines two kinds of NaN, signaling NaN and quiet NaN. More information about infinity and NaN is in standard IEEE 754.
NaN, Not a Number.□
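The special values can be observed in Python, whose `float` type is assumed here to be the binary64 format, which holds on common platforms. The sketch is only an illustration. Note that the Python functions `math.sqrt(-2)` and `math.asin(5)` raise a `ValueError` instead of returning NaN, so the NaN below is produced by subtracting infinities.

```python
import math

pos_inf = float("inf")
overflow = 1e308 * 10          # the result is out of the binary64 range
undefined = pos_inf - pos_inf  # no defined result, the operation yields NaN

print(overflow, -overflow)                     # plus and minus infinity
print(undefined, math.isnan(undefined))        # NaN
print(undefined == undefined)                  # NaN compares unequal even to itself
print(math.copysign(1.0, -0.0), -0.0 == 0.0)   # minus zero keeps its sign but equals plus zero
```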
9.4 Sets of floating-point data

The set of finite floating-point numbers representable within a particular format is deter-
mined by the following integer parameters, [IEEE 754-2008]:
 b, is the radix, 2 or 10.

 p is the number of digits in the significand (precision).
 emax is the maximum exponent.
 emin is the minimum exponent, emin shall be 1 − emax for all formats.
Formula (0902) contains a sign, a significand and an exponent. The significand can be expressed in two ways, either as a number with a radix point or as an integer number. The former shows the significand in the scientific form (0903); the latter shows the significand as a coefficient (0904). Within each format, the following values of the floating-point data can be represented:
Significand in the scientific form.□
• Signed zero and non-zero floating-point numbers in the form
$v = (-1)^{s} \times b^{e} \times m$ (0903)
Where
• v is a finite value of the floating point number.
• s is 0 or 1.
• b is the radix of the numeral system.
• e is any integer in the range of emin ≤ e ≤ emax.
• m is a number represented by a digit string in the form of $d_0 \bullet d_1 d_2 \ldots d_{p-1}$, where $d_i$ is an integer digit $0 \le d_i < b$; therefore $0 \le m < b$. The dot is the radix point; $d_0$ is MSB and $d_{p-1}$ is LSB. □
In the text above, the number m is a number with a radix point and the significand is expressed in the scientific form. The significand has the radix point between the digits with the orders of zero and minus one, $d_0$ and $d_1$. It means that the integer part of this form is in the range of 0 to b-1. In the second form (0904), the significand is expressed as a coefficient, small letter c. The coefficient is any unsigned integer, i.e. a non-negative number.
Significand as a coefficient.□
• Signed zero and non-zero floating-point numbers in the form
$v = (-1)^{s} \times b^{q} \times c$ (0904)
Where
• v is a finite value of the floating point number.
• s is a sign, 0 or 1.
• b is the radix of the numeral system.
• q is a quantum, any integer in the range of emin ≤ q + p − 1 ≤ emax, where p is the precision. The quantum is the exponent used when the significand is an unsigned integer number.
• c is a coefficient, an integer number represented by a digit string in the form of $d_0 d_1 d_2 \ldots d_{p-1}$, where $d_i$ is an integer digit $0 \le d_i < b$. The coefficient c is therefore an integer with $0 \le c < b^{p}$, where p is the precision. $d_0$ is MSB and $d_{p-1}$ is LSB. □
Both formulas (0903) and (0904) are equivalent and describe exactly the same finite values of the floating point number. These values are zero and non-zero numbers. The radix point is between digits $d_0$ and $d_1$ in formula (0903) and after digit $d_{p-1}$ in formula (0904). Formula (0905) defines the relation between the exponent e and the quantum q, and formula (0906) the relation between the significands m and c. Substituting (0906) into (0903) gives formula (0907).
Significand with scaling.□
$e = q + p - 1$ (0905)
$m = c / b^{p-1}$ (0906)
$v = (-1)^{s} \times \frac{c}{b^{p-1}} \times b^{e}$ (0907)
Where
• e is an exponent, formula (0903).
• q is a quantum, formula (0904).
• p is a precision.
• m is a significand in the scientific form, formula (0903).
• c is a significand as a coefficient, formula (0904).
• v is a finite value of a floating point number.
• $c/b^{p-1}$ is a significand with a scaling factor (0907).
• b is the radix of the numeral system.
Formula (0907) expresses the finite value of floating point data which is based on a scaling factor of $1/b^{p-1}$. Consequently, all formulas, (0903) - scientific form, (0904) - coefficient, and (0907) - scaling, are equivalent and they correspond to the same finite floating point value.
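The equivalence of formulas (0903), (0904) and (0907) can be checked numerically. The sketch below uses exact rational arithmetic from the Python standard library; the function names and the sample values (radix 10, precision 7, digits 1234567, exponent 4) are assumptions chosen only for illustration.

```python
from fractions import Fraction


def value_scientific(s: int, b: int, e: int, m: Fraction) -> Fraction:
    """Formula (0903): v = (-1)^s * b^e * m."""
    return (-1) ** s * Fraction(b) ** e * m


def value_coefficient(s: int, b: int, q: int, c: int) -> Fraction:
    """Formula (0904): v = (-1)^s * b^q * c."""
    return (-1) ** s * Fraction(b) ** q * c


def value_scaled(s: int, b: int, e: int, c: int, p: int) -> Fraction:
    """Formula (0907): v = (-1)^s * (c / b^(p-1)) * b^e."""
    return (-1) ** s * Fraction(c, b ** (p - 1)) * Fraction(b) ** e


# Sample values, chosen only for illustration.
s, b, p, e, c = 0, 10, 7, 4, 1234567
q = e - (p - 1)                  # from formula (0905)
m = Fraction(c, b ** (p - 1))    # formula (0906), here 1.234567

v1 = value_scientific(s, b, e, m)
v2 = value_coefficient(s, b, q, c)
v3 = value_scaled(s, b, e, c, p)
print(v1, v1 == v2 == v3)
```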
9.5 Formats defined by IEEE 754-2008

The new standard IEEE 754-2008 defines several formats for floating point data with radix 2 and radix 10. The name of each format is the combination of two expressions, the radix name and the word size in bits:
• Basic formats are binary32, binary64, binary128, binary{k}, decimal64, decimal128 and decimal{k}. Basic formats are recommended for the performance of arithmetic operations, [IEEE 754-2008].
Basic formats.□
• Interchange formats consist of basic formats plus binary16 and decimal32. The encoding of the interchange formats is fully specified as bit strings, and this allows data interchange between different platforms, [Muller_2010]. The standard does not specify the endianness. Binary16 and decimal32 formats can only be used for storage purposes where high precision is not needed and they cannot be used for arithmetic operations, [wiki_0910] and [wiki_0911].
Interchange formats.□
• Extended and extendable precision formats, whose encodings are not specified but may match those of interchange formats, [Muller_2010].
Basic and interchange formats are defined only by the radix, precision and maximum exponent. The remaining parameters needed are in Tables 09-01 and 09-02. The standard allows new formats to be defined for word sizes higher than 128 bits. The new word size k is a multiple of 32 and it must be higher than or equal to 128. In general, the new names are binary{k} and decimal{k}. It means that it is possible to define a new binary or decimal format, e.g. for a 320-bit word.
Parameters                                     binary16  binary32  binary64  binary128  binary{k} (k ≥ 128)
k, storage width in bits                       16        32        64        128        multiple of 32
p, precision                                   11        24        53        113        k - round(4 x log2(k)) + 13
emax, maximum exponent                         15        127       1 023     16 383     $2^{k-p-1}$ - 1
Encoding parameters
bias, E - e                                    15        127       1 023     16 383     emax
Sign bit                                       1         1         1         1          1
w, exponent field width in bits                5         8         11        15         round(4 x log2(k)) - 13
t, trailing significand field width in bits    10        23        52        112        k - w - 1
k, storage width in bits                       16        32        64        128        1 + w + t
Source: IEEE 754-2008
Explanation: Yellow field contains initiation parameters and blue field calculated parameters.
Table 09-01 Binary interchange format parameters
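The last column of Table 09-01 can be evaluated with a few lines of Python. This is a sketch under the assumption that the general formulas are applied only to k ≥ 128 and a multiple of 32, as stated in the table; the function name `binary_k_parameters` is hypothetical.

```python
import math


def binary_k_parameters(k: int) -> dict:
    """Parameters of binary{k} computed from the last column of Table 09-01."""
    if k < 128 or k % 32 != 0:
        raise ValueError("the general formulas apply to multiples of 32 with k >= 128")
    w = round(4 * math.log2(k)) - 13   # exponent field width in bits
    t = k - w - 1                      # trailing significand field width in bits
    p = t + 1                          # precision, equal to k - round(4 x log2(k)) + 13
    emax = 2 ** (k - p - 1) - 1        # maximum exponent
    return {"k": k, "p": p, "emax": emax, "emin": 1 - emax,
            "bias": emax, "w": w, "t": t}


print(binary_k_parameters(128))   # reproduces the binary128 column
print(binary_k_parameters(256))   # a wider format built by the same rules
```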
The standard IEEE 754-2008 specifies the encoding of floating point data into a sequence of
bits; it does not specify the placement in memory. The placement of floating point
data is given by the endianness, which specifies the rules of how to place a long word into
smaller atomic elements, e.g. a 128-bit word into a byte-oriented memory. The basic kinds of
endianness are big endian and little endian.
Note to formats
Standard defines an interchange and also an extended and an extendable precision for-
mat. Definitions of these formats according to IEEE 754-2008 are:
 2.1.33 interchange format: A format that has a specific fixed-width encoding de-
fined in this standard.
 2.1.20 extendable precision format: A format with precision and range that are
defined under user control.
 2.1.21 extended precision format: A format that extends a supported basic format
by providing wider precision and range.□
Parameters                                     decimal32  decimal64  decimal128  decimal{k} (k ≥ 32)
k, storage width in bits                       32         64         128         multiple of 32
p, precision                                   7          16         34          9 x k/32 - 2
emax, maximum exponent                         96         384        6 144       3 x $2^{k/16+3}$
Encoding parameters
bias, E - q                                    101        398        6 176       emax + p - 2
Sign bit                                       1          1          1           1
w + 5, combination field width in bits         11         13         17          k/16 + 9
t, trailing significand field width in bits    20         50         110         15 x k/16 - 10
k, storage width in bits                       32         64         128         1 + 5 + w + t
Source: IEEE 754-2008
Explanation:
• Yellow field contains initiation parameters and blue field calculated parameters.
• In decimal format, only the quantum q as exponent and the significand without radix point, $d_0 d_1 d_2 \ldots$, are used.
Table 09-02 Decimal interchange format parameters
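The formulas in the last column of Table 09-02 can be checked against the three listed formats in the same way. The function name `decimal_k_parameters` is hypothetical, and the sketch only evaluates the table; it does not encode any decimal value.

```python
def decimal_k_parameters(k: int) -> dict:
    """Parameters of decimal{k} computed from the last column of Table 09-02."""
    if k < 32 or k % 32 != 0:
        raise ValueError("k must be a multiple of 32 and at least 32")
    p = 9 * k // 32 - 2               # precision in decimal digits
    emax = 3 * 2 ** (k // 16 + 3)     # maximum exponent
    bias = emax + p - 2               # bias, E - q
    combination = k // 16 + 9         # combination field width, w + 5 bits
    t = 15 * k // 16 - 10             # trailing significand field width in bits
    assert 1 + combination + t == k   # sign bit + combination field + trailing field
    return {"k": k, "p": p, "emax": emax, "bias": bias,
            "combination": combination, "t": t}


for k in (32, 64, 128):
    print(decimal_k_parameters(k))
```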
“For instance, the double-precision number $-7.0868766365730135 \times 10^{-268}$ is encoded by the string of bytes 11 22 33 44 55 66 77 88 when the string is ordered in a memory (from the lowest address to the highest one) on x86 and Linux/IA-64 platforms (they are said to be little-endian) and by 88 77 66 55 44 33 22 11 on most PowerPC platforms (they are said to be big-endian). Some architectures, such as IA-64, ARM, and PowerPC are bi-endian, i.e., they may be either little-endian or big-endian depending on their configuration.” This is a citation from literature [Muller_2010].
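The byte orders in the citation can be reproduced with the `struct` module of the Python standard library. The sketch starts from the little-endian byte string, decodes it as binary64 and packs the same value again in big-endian order; no particular hardware platform is assumed.

```python
import math
import struct

little = bytes.fromhex("1122334455667788")   # byte order in memory, lowest address first

(value,) = struct.unpack("<d", little)       # "<d": little-endian binary64
big = struct.pack(">d", value)               # ">d": big-endian binary64

print(value)
print(little.hex(" "), "|", big.hex(" "))
```

The same 64-bit word is stored in both cases; only the order of the bytes in memory differs.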
In the literature, it is possible to find different names of formats for floating point numbers;
some of them are historical, some are used in real practice and some are newly defined. The decimal formats
were defined by standard IEEE 854-1987 for the base-independent numeral system,
[wiki_0906]. However, the new standard IEEE 754-2008 defines the floating point data for the
base 2 and base 10. The possible names are:
 Binary16, other names are half, half precision. This format is defined by IEEE-754-
2008 only for the storage purposes.
 Binary32, other names are single, single precision. Programming languages use the
declaration float or real. This format was defined by IEEE-754-1985.
 Binary64, other names are double, double precision. Programming languages use
the declaration double. This format was defined by IEEE-754-1985.
 Binary128, other names are quad, quad precision, double-double precision. This
format is defined by IEEE-754-2008.
 Decimal32, format is defined by IEEE 754-2008 only for storage purposes.
 Decimal64, format is defined by IEEE 754-2008.
 Decimal128, format is defined by IEEE 754-2008.
 8-bit binary format, called minifloat, is the format that is non-defined by standard.
It is mainly used for educational purposes and some special purposes, mostly in
computer graphics, [wiki_0912].
 80-bit format, called extended precision, was the binary format of floating point da-
ta that was used in some processor architectures and this format is not widespread.
This format was defined by IEEE-754-1985 and it was rejected in 2008.
9.6 Binary interchange format encodings

For better utilization of all combinations in the word, the standard defines two forms for the non-zero values, the normal form and the subnormal form, Fig. 09-04. Both forms use the hidden bit, which corresponds to the digit $d_0$ of the significand in the scientific form, i.e. the form with the radix point, $m = d_0 . d_1 d_2 \ldots d_{p-1}$. The hidden bit of the normal form is equal to 1 and the hidden bit of the subnormal form is equal to 0, Fig. 09-04. When it is impossible to express the value in the subnormal or the normal form, either an overflow or an underflow occurs.
Binary interchange formats use the normal and the subnormal form.□
• The significand of a binary normal number has the value of the MSB digit $d_0$ equal to 1. The significand in the scientific form m is in the range of 1 ≤ m < 2 and the exponent is in the range of emin ≤ e ≤ emax.
• The significand of a binary subnormal number has the value of the MSB digit equal to 0. The significand m is in the range of 0 < m < 1 and the exponent is emin, emin = 1 - emax.
emin = 1 - emax; for binary32, it is 1 - 127 = -126
$-1101 \times 2^{4} = -110.1 \times 2^{5} = -11.01 \times 2^{6} = -1.101 \times 2^{7}$      Normal (hidden bit is 1)
$+0.00001 \times 2^{-124} = +0.0001 \times 2^{-125} = +0.001 \times 2^{-126}$      Subnormal (hidden bit is 0)
Fig. 09-04 Normal form
Standard IEEE 754-2008 specifies the representations of floating-point data in the binary interchange formats, where each floating-point number has just one encoding in the binary interchange format. This property is called unique encoding.
Unique encoding □
In literature, it is also possible to find the term denormal number, or denormalized number, which is equivalent to the subnormal number. Subnormal numbers fill the underflow gap around zero and increase the range of representation, [wiki_0909]. Subnormal numbers are in the range from the minimum subnormal number to less than the minimum normal number.

“Representations of floating-point data in the binary interchange formats are encoded in k bits in the following three fields ordered as shown in Fig. 09-05:
• 1-bit sign S, the sign is 0 for positive and 1 for negative floating point data
• w-bit biased exponent, E = e + bias
• (t = p - 1)-bit trailing significand field digit string $T = d_1 d_2 \ldots d_{p-1}$; the leading bit of the significand, $d_0$, is implicitly encoded in the biased exponent E. MSB bit of significand $d_0$ is often referred as hidden bit.”
The text above is a citation from IEEE 754-2008.
Offset binary representation is used for the exponent e. □
Field        S        E                        T
Content      Sign     Biased exponent          Trailing significand field
Width        1 bit    w bits (MSB ... LSB)     t = p - 1 bits (MSB ... LSB)
Bits                  E0 ... Ew-1              d1 d2 ... dp-1

Format for binary32
Field        S        E                        T
Bit range    31       30 ... 23 (8 bits)       22 ... 0 (23 bits)
Bits                  E0 ... E7                d1 d2 ... d23
Fig. 09-05 Definition of fields for binary interchange floating point format
The trailing significand field contains significand without the digit d0, which is called the
hidden bit. The value of this bit is implicitly defined either by the normal or the subnormal
form. In the former one, hidden bit is 1, in the latter one, hidden bit is 0. Capital letter E is
biased exponent.
In the text above, the exponent e (small letter e) was used in formulas. The format of the word uses the biased exponent E (capital letter E). The exponent e is either a negative or a positive number, but the biased exponent E is only a non-negative number. The biased exponent E is the offset binary representation of an integer number.
Small letter e is the exponent. □
The format of the word that is shown in Fig. 09-05, is able to represent all floating point
values as: NaN, infinity, finite value and zero. The first rule of encoding is the value of the
biased exponent E; and sometimes, the second rule is used, i.e. the value of the trailing
significand field T, Fig. 09-06. The detailed description of the encoding binary values is in
the Annex 09A of this chapter.
Note to the offset binary or excess-K representation
The standard 754 uses the term biased exponent or biased number, which corresponds to the offset binary or excess-K representation of signed numbers. The bias is the offset that is added to the number and it moves the signed numbers to unsigned integers. Standard 754 defines the bias equal to $2^{w-1} - 1$, where w is the size of the exponent field.
Example:
• For w = 4, the bias is equal to 7.
• For number -5, the biased number is -5 + 7 = 2.□
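The example in the note can be written as a small Python sketch. The function names `bias`, `to_biased` and `from_biased` are hypothetical and serve only to illustrate the offset binary representation.

```python
def bias(w: int) -> int:
    """Bias of a w-bit exponent field: 2^(w-1) - 1."""
    return 2 ** (w - 1) - 1


def to_biased(e: int, w: int) -> int:
    """Biased exponent E = e + bias."""
    return e + bias(w)


def from_biased(E: int, w: int) -> int:
    """Exponent e = E - bias."""
    return E - bias(w)


print(bias(4), to_biased(-5, 4))      # the example above: bias 7, biased number 2
print(bias(8), from_biased(254, 8))   # binary32: bias 127, largest normal exponent
```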

S    E (E0 ... Ew-1)       T (d1 d2 ... dp-1)    Encoded value
     11…11                 1…                    Not a Number, quiet NaN (qNaN)
     11…11                 0…, T ≠ 0             Not a Number, signaling NaN (sNaN)
     11…11                 00…00                 Infinity, plus or minus
     11…10 to 00…01        any                   Finite non-zero values in normal form,
                                                 $v = (-1)^{S} \times (1 + T/2^{p-1}) \times 2^{E - bias}$
     00…00                 T ≠ 0                 Finite non-zero values in subnormal form,
                                                 $v = (-1)^{S} \times (0 + T/2^{p-1}) \times 2^{emin}$
     00…00                 T = 0                 Zero, plus or minus, $(-1)^{S} \times 0 \times 2^{emin}$.
                                                 Significand is zero.
Fig. 09-06 Encoding the values in the binary interchange format
The Not a Number is encoded with the biased exponent E having only ones and a trailing significand field T that is not zero. The NaN value has two sub-values, qNaN and sNaN, which are distinguished by the value of the trailing significand field T. When the trailing field T begins with 1 ($d_1 = 1$), it is a quiet NaN; when the field T begins with 0 ($d_1 = 0$), it is a signaling NaN. The remaining non-zero value of the trailing field T has a diagnostic purpose. The sign bit has no influence on the value NaN.
Encoding of NaN□
The infinity is encoded by the combination where the biased exponent E is all 1s and the trailing field T is zero. The infinity is plus or minus, according to the sign bit.
Encoding of infinity□
Note to the biased exponent E
The exponent e for the binary32 format has the parameters: emax is 127, emin is -126, bias is 127, the size of the biased exponent field w is 8, and the biased exponent is in the range $0 \le E \le 2^w - 1$, that is 0 ≤ E ≤ 255.
 When the biased exponent E is all ones, it is the number $2^w - 1 = 255$; then Not a Number or infinity is encoded, according to the value of the trailing field T.
 When the biased exponent E is all zeros, E = 0, then either a subnormal finite value or zero is encoded, according to the trailing field T.
 When the biased exponent E is in the range 1 ≤ E ≤ 254, then a normal finite value is encoded.
 Exponent emax corresponds to the biased exponent E with all ones minus one; it is $2^w - 2 = 254$, so emax = 254 - 127 = 127.
 Exponent emin corresponds to the biased exponent E with all zeros plus one; it is E = 1, so emin = 1 - 127 = -126.□
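The rules above can be sketched as a small classifier for binary32 words. The field widths (1 sign bit, w = 8, 23 trailing bits) are those of binary32; the function names and the returned labels are hypothetical and chosen only for this illustration.

```python
import struct


def classify_binary32(word):
    """Classify a 32-bit word using only the fields S, E and T."""
    S = (word >> 31) & 0x1
    E = (word >> 23) & 0xFF       # biased exponent, w = 8 bits
    T = word & 0x7FFFFF           # trailing significand field, p - 1 = 23 bits
    if E == 0xFF:                 # E is all ones: infinity or NaN
        if T == 0:
            return "-infinity" if S else "+infinity"
        d1 = (T >> 22) & 1        # first bit of T selects quiet or signaling
        return "qNaN" if d1 else "sNaN"
    if E == 0:                    # E is all zeros: zero or subnormal
        if T == 0:
            return "-zero" if S else "+zero"
        return "subnormal"
    return "normal"               # 1 <= E <= 254


def float_to_word(x):
    """Return the binary32 bit pattern of a Python float as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


print(classify_binary32(float_to_word(1.0)))
print(classify_binary32(0x7F800000))
print(classify_binary32(0x00000001))
```

The three calls classify the value 1.0, the word with E all ones and T = 0, and the word with E = 0 and T = 1.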

Encoding of normal numbers □
The normal value is encoded when the biased exponent E is in the range from 1 to $2^w - 2$, that is, when the biased exponent E is neither all zeros nor all ones. Then the exponent e (small letter e) is equal to E - bias. The trailing significand field T contains the bit string d1d2…dp-1, which is a part of the significand. The hidden bit is 1, d0 = 1. The whole significand in the scientific form is the concatenation of the hidden bit d0, the radix point and the bit string from the trailing field T. The finite value can be calculated by formula (0903), where the significand m is in the scientific form. Another possibility is to modify formula (0907), because we know that the hidden bit d0 is 1. So we get a new formula (0908) for the calculation of the normal value from the binary interchange format. The fraction in formula (0908) is a fixed point number with a scaling factor.
$$v = (-1)^S \cdot \left(1 + \frac{T}{2^{p-1}}\right) \cdot 2^{E - bias} \qquad (0908)$$
$$v = (-1)^S \cdot \left(0 + \frac{T}{2^{p-1}}\right) \cdot 2^{emin} \qquad (0909)$$
Where
 v is a finite value of the floating point number.
 S is a sign.
 T is the value of the trailing significand field as an integer.
 $2^{p-1}$ is a scaling factor.
 E is an unsigned integer value of the biased exponent.
 bias is the offset of the offset binary representation; it is $2^{w-1} - 1$, where w is the size of the biased exponent field.
 emin is the minimal value of the exponent and it is 1 - emax.
Encoding of subnormal numbers □
The subnormal value is encoded when the biased exponent E is all zeros and the trailing significand field T is not zero. It means that the exponent e is equal to emin. The trailing significand field T contains a part of the significand without the leading hidden bit d0, which is equal to 0. The whole significand in the scientific form is the concatenation of the hidden bit 0, the radix point and the value of the T field. The value of the subnormal form can be calculated by formula (0903). Another possibility is to use formula (0909), which is derived from formula (0907) knowing that the hidden bit is equal to 0.
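Formulas (0908) and (0909) can be evaluated exactly with rational arithmetic. The following is a minimal sketch for binary32 (p = 24, w = 8), assuming that emax equals the bias of 127 as stated for this format; the name `decode_binary32` is illustrative.

```python
import struct
from fractions import Fraction

P = 24                      # precision of binary32 in bits
W = 8                       # width of the biased exponent field
BIAS = 2 ** (W - 1) - 1     # 127
EMIN = 1 - BIAS             # emin = 1 - emax, with emax = 127 for binary32


def decode_binary32(word):
    """Return the exact finite value of a binary32 word as a Fraction."""
    S = (word >> 31) & 1
    E = (word >> 23) & 0xFF
    T = word & 0x7FFFFF
    if E == 2 ** W - 1:
        raise ValueError("NaN or infinity has no finite value")
    sign = (-1) ** S
    if E == 0:              # subnormal value or zero, formula (0909)
        return sign * Fraction(T, 2 ** (P - 1)) * Fraction(2) ** EMIN
    # normal value, formula (0908)
    return sign * (1 + Fraction(T, 2 ** (P - 1))) * Fraction(2) ** (E - BIAS)


word = 0xC0200000
print(decode_binary32(word))
print(struct.unpack(">f", struct.pack(">I", word))[0])   # cross-check via struct
```

The last two lines decode the same word twice, once with the formula and once through the `struct` module of the standard library, so the two results can be compared.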
Note to the border between the normal and subnormal numbers
The explanation is given for the format binary32, where emax is 127, emin is -126 and bias is 127.
 The smallest normal number with the exponent e = -126 is $1.0… \times 2^{-126}$; the biased exponent is E = 1.
 The next smaller number is $1.11… \times 2^{-127} = 0.111… \times 2^{-126}$. The second form is the subnormal number with the exponent e = -126 and a non-zero significand whose MSB is equal to zero. For this situation, the biased exponent is E = 0.
 When the biased exponent E = 0 and the significand is equal to zero, then the exponent is e = -126 and the value is $0.0 \times 2^{-126} = 0.0$; it is a plus zero.□

Encoding of zero □
Zero is encoded when the biased exponent E and the trailing significand field T are all zeros. Only the sign bit distinguishes the plus zero from the minus zero. When an operation on floating point numbers produces a zero result, the standard prefers a plus zero, so the whole word is zero.
Fig. 09-07 Composition of values on the number line (from minus to plus: NaN, -∞, normal, subnormal, 0.0, subnormal, normal, +∞, NaN; the points -1.0 and +1.0 lie in the normal ranges)
Fig. 09-07 shows the order of values on the number line for the binary floating point data.
There are numbers that cannot be represented. These are numbers between zero and the
area of subnormal numbers. Subnormal numbers are followed by normal numbers, then by
infinity and NaNs.
9.7 Decimal interchange floating point format

Decimal interchange format uses the quantum instead of the exponent. □
The decimal interchange format is, like the binary interchange format, defined by the radix of the numeral system (10), the precision p, the maximum exponent emax and the word size of k bits. The precision p indicates the number of decimal digits in the significand; it is not the number of bits as in the binary format. The standard IEEE 754-2008 defines the decimal interchange floating point formats decimal32, decimal64, decimal128 and decimal{k}. The decimal{k} is a format that is defined by a user, and k must be a multiple of 32 and higher than 128. The value composition of the decimal interchange floating point format on the real number line is in Fig. 09-08. The decimal interchange floating point format has the following values:
 NaN, Not a Number in two forms, the signaling NaN and the quiet NaN.
 Infinity, plus and minus.
 Finite numbers, it is the zero and the non-zero decimal numbers with a sign.
Note to the exponent and quantum
Two exponents, e and q, are used in the interchange floating point formats and the relation between them is e = q + p - 1. To show the difference between them, an example is given for the binary32 and decimal32 formats. The minimum exponent is defined by emin = 1 - emax.
The binary32 format is defined by the parameters: precision p2 = 24 bits, emax2 = 127, bias2 = 127. The range of the exponent e is -126 ≤ e ≤ 127 and the biased exponent E has the range 1 ≤ E ≤ 254. This range is used by normal numbers to represent finite values.
The decimal32 format is defined by the parameters: precision p10 = 7 digits, emax10 = 96, bias10 = 101. This means that the exponent e is in the range -95 ≤ e ≤ +96, but the quantum q is in the range -95 ≤ q + p10 - 1 ≤ 96, hence -101 ≤ q ≤ 90, and the biased exponent E has values in the range 0 ≤ E ≤ 191. This range of the quantum q is used by finite values. Zero, infinity and NaN are not coded by the biased exponent.□
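The ranges in the note follow directly from emin = 1 - emax, e = q + p - 1 and E = q + bias. The sketch below recomputes them for decimal32; the function name `exponent_ranges` is illustrative only.

```python
def exponent_ranges(p, emax, bias):
    """Return the ranges of e, q and E for a decimal interchange format."""
    emin = 1 - emax
    q_min = emin - (p - 1)        # from e = q + p - 1
    q_max = emax - (p - 1)
    return {
        "e": (emin, emax),
        "q": (q_min, q_max),
        "E": (q_min + bias, q_max + bias),   # biased exponent E = q + bias
    }


print(exponent_ranges(p=7, emax=96, bias=101))   # decimal32 parameters
```

With the decimal32 parameters the sketch gives the same limits as the note: e from -95 to 96, q from -101 to 90 and E from 0 to 191.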

Fig. 09-08 Values on the real number line for the decimal format (from minus to plus: NaN, -∞, finite negative numbers, 0.0, finite positive numbers, +∞, NaN)
The formula (0910) defines the finite values of the decimal interchange floating point format and it is derived from formula (0904). The significand in the decimal interchange format is in the form of the coefficient C.
$$v = (-1)^S \cdot C \cdot 10^{q} \qquad (0910)$$
Where
 S is a sign.
 C is a coefficient as significand and it is a decimal or a binary number.
 q is a quantum as a binary number.
Field                            Width              Content
S, sign                          1 bit              sign bit, MSB of the word
G, combination field             w + 5 bits         G0 … Gw+4
T, trailing significand field    t = J × 10 bits    decimal encoding: J declets give 3 × J = p - 1 digits; binary encoding: t bits give values from 0 through $2^t - 1$
Fig. 09-09 Definition fields for the decimal floating point format
The fields of the word, Fig. 09-09, for the decimal interchange floating point format are defined by the standard IEEE 754-2008 as:
“Representations of floating-point data in the decimal interchange formats are encoded in k bits in the following three fields, whose detailed layouts and canonical (preferred) encodings are described below.
a) 1-bit sign S.
b) A w + 5 bit combination field G encoding classification and, if the encoded datum is a finite number, the exponent q and four significand bits (1 or 3 of which are implied). The biased exponent E is a w + 2 bit quantity q + bias, where the value of the first two bits of the biased exponent taken together is either 0, 1, or 2.
c) A t-bit trailing significand field T that contains J × 10 bits and contains the bulk of the significand. When this field is combined with the leading significand bits from the combination field, the format encodes a total of p = 3 × J + 1 decimal digits.”
Biased representation is used for quantum. □
Declet is a code, where three decimal digits are encoded into ten bits. □
The format of the word does not contain the value of the biased exponent E and the significand C in a direct form; these values are encoded in the combination field G and the trailing significand field T. The new term “declet” is defined and used in conjunction with the decimal encoding of the significand C. The declet contains 3 decimal digits and the width of the declet is 10 bits. Therefore, the size of the trailing significand field is always defined as a multiple of 10 bits.
Note to the declet, IEEE 754-2008
Declet - the encoding of three decimal digits into ten bits using the densely-packed-decimal encoding scheme. From the 1 024 possible combinations, 1 000 combinations correspond to canonical declets, while 24 combinations are non-canonical declets. The canonical declets are the result of the operation and non-canonical declets are not produced by computational operations, but are accepted in operands. □
Canonical form. □
The canonical form is a new term which is related to the decimal interchange floating point format. Canonical means that the combination in a field is one defined by the standard IEEE 754-2008. Not all combinations in the fields or declets are used, and the unused combinations are called non-canonical. The canonical form is produced by any floating point operation and the non-canonical form is accepted in operands. The canonical form relates to all fields, not only to declets. Therefore, some combinations of the combination field G and some combinations of the field G with the trailing significand field T are non-canonical.
The values of the decimal interchange floating point format are inferred from the sign,
combination field G and trailing significand field T. The encoding begins with the leading
bits of the combination field G, Fig. 09-10. The detailed description of the encoding of decimal values is in Annex 09B of this chapter.
Combination field G (G0 … Gw+4)                                  Encoded value
G0…G5 = 1111 11…                                                  Not a Number, sNaN
G0…G5 = 1111 10…                                                  Not a Number, qNaN
G0…G4 = 1111 0…                                                   Infinity, plus or minus
G encodes E and d0 (decimal) or d0d1d2d3 (binary)                 Finite non-zero numbers, $v = (-1)^S \cdot C \cdot 10^{E - bias}$
d0 = 0 (decimal) or d0d1d2d3 = 0000 (binary), trailing field T = 0   Zero, plus or minus, $(-1)^S \cdot 0 \cdot 10^{q}$; the significand is zero
Fig. 09-10 Encoding the value in the decimal interchange format

Encoding of NaN. □
The value Not a Number is encoded by the combination (G0…G4) = 11111, and the bit G5 distinguishes a quiet NaN from a signaling NaN. The sign and the remaining bits of the G field have no influence on the NaN. The T field contains the payload for distinguishing various values of NaN. The remaining bits (G6 to Gw+4) of NaN in canonical form are equal to zero and the encoding of the payload is canonical.
Encoding of infinity. □
The infinity is encoded by the combination (G0…G4) = 11110, regardless of the remaining bits of field G and field T. The sign determines the plus or minus infinity. The canonical form of infinity is defined in such a way that the remaining bits of the combination field (G5…Gw+4) are equal to zero and the trailing significand field T is equal to 0.
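The first step of decoding, based only on the leading bits of the combination field, can be sketched for decimal32. The layout used here (1 sign bit, an 11-bit combination field, a 20-bit trailing field) is the decimal32 layout of IEEE 754-2008; the meaning of G5 (1 for sNaN, 0 for qNaN) is taken from Fig. 09-10. The function name is hypothetical, and the sketch does not decode finite values.

```python
def classify_decimal32(word):
    """Classify a decimal32 word from the leading bits of its combination field."""
    S = (word >> 31) & 1
    G = (word >> 20) & 0x7FF      # 11-bit combination field, G0 is its MSB
    g0_g4 = G >> 6                # leading five bits G0..G4
    g5 = (G >> 5) & 1             # bit G5
    if g0_g4 == 0b11111:          # Not a Number, the sign is ignored
        return "sNaN" if g5 else "qNaN"
    if g0_g4 == 0b11110:          # infinity, the sign selects plus or minus
        return "-infinity" if S else "+infinity"
    return "finite"               # zero or non-zero finite number


print(classify_decimal32(0x7C000000))
print(classify_decimal32(0xF8000000))
print(classify_decimal32(0x22500001))
```

Distinguishing zero from non-zero finite numbers needs the significand, which requires the full decoding of the fields G and T.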
Encoding of zero. □
The value zero is encoded when the significand is equal to zero, regardless of the quantum. The sign bit determines a plus or minus zero. The significand is equal to zero when both the trailing significand field T and the leading bits or the leading digit are equal to zero. The leading bits or the leading digit are encoded in the combination field G.
Coefficient C (capital C) is a significand. □
The finite value for the decimal interchange format is given by formula (0911), which is derived from formula (0910). The significand as the coefficient C can be expressed by a decimal or a binary integer number. The form of the significand is given by the implementation or it is agreed beforehand. It is impossible to distinguish from the decimal interchange floating point format whether the coefficient is a binary or a decimal number.
Significand C is the coefficient either in the binary or decimal numeral system. □
$$v = (-1)^S \cdot C \cdot 10^{E - bias} \qquad (0911)$$
Where
 S is a sign.
 E is a biased exponent that is encoded in the binary form.
 bias is a constant defined by the format.
 E - bias is the quantum q; it is the exponent in case the significand is an integer.
 C (capital C) is the significand as a coefficient, a decimal or binary unsigned integer.
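Formula (0911) can be evaluated exactly with the `decimal` module of the Python standard library. This is a minimal sketch that takes the already decoded values S, C and E as inputs; the bias of 101 is the decimal32 value, and the helper name `decimal_value` is illustrative.

```python
from decimal import Decimal

DECIMAL32_BIAS = 101


def decimal_value(S, C, E, bias=DECIMAL32_BIAS):
    """Evaluate v = (-1)**S * C * 10**(E - bias) exactly."""
    q = E - bias                                  # quantum
    digits = tuple(int(ch) for ch in str(C))      # coefficient as decimal digits
    return Decimal((S, digits, q))                # (sign, digits, exponent)


print(decimal_value(0, 12345, 99))    # quantum q = -2
print(decimal_value(1, 1, 101))       # quantum q = 0
```

The tuple form of the `Decimal` constructor stores the sign, the coefficient digits and the exponent without rounding, so the result corresponds term by term to the formula.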
Fig. 09-11 shows the encoding of the decimal significand C10 as a coefficient which has p decimal digits in the BCD code. The combination field G encodes the biased exponent E and the MSB decimal digit d0 of the significand. The trailing significand field T contains the J declets of the significand, where each declet encodes 3 BCD digits. Therefore, the trailing significand field T contains the p-1 decimal digits d1d2d3…dp-1. Then the decimal significand C10 as a coefficient is the concatenation of the MSB digit d0 and the decimal digits from the trailing field, d1d2d3…dp-1. The declet uses the densely-packed decimal encoding, which is described in the subchapter below. The detailed description of the combination field encoding is in Annex 09C of this chapter.
Fig. 09-12 shows a similar situation for the binary significand as a coefficient, which has t+4 bits. The combination field G encodes the biased exponent E and the leading 4 bits of the significand, d0d1d2d3. The trailing significand field T contains the remaining bits of the binary significand, d4d5d6…dt+3. Then the binary significand C2 as a coefficient is an unsigned integer number and it is the concatenation of the leading 4 bits and the remaining bits from the trailing field, d0d1d2d3 + d4d5d6…dt+3. The detailed description of the combination field encoding is in Annex 09C of this chapter.
Biased representation is used for exponent. □
Storage word with k bits: S, G (G0 … Gw+4), T (J declets give 3 × J = p - 1 digits, densely-packed decimal encoding).
Decoded fields: sign S; biased exponent E (E0 … Ew+1, w + 2 bits); decimal significand C (d0 d1 d2 … dp-1).
C10 = d0 d1 d2 … dp-1, where di is a decimal digit in BCD code.
$v = (-1)^S \cdot C_{10} \cdot 10^{(E_2 - bias)}$
Fig. 09-11 Basic encoding scheme for the decimal significand
Storage word with k bits: S, G (G0 … Gw+4), T (t bits, binary number).
Decoded fields: sign S; biased exponent E (E0 … Ew+1, w + 2 bits); significand C (d0d1d2d3 from the field G, d4d5d6d7 … dtdt+1dt+2dt+3 from the field T, 4 + J × 10 bits).
C2 = d0 d1 d2 d3 d4 … dt dt+1 dt+2 dt+3, where di is a binary digit (bit), 0 or 1.
$v = (-1)^S \cdot C_{2} \cdot 10^{(E_2 - bias)}$
Fig. 09-12 Basic encoding scheme for a binary significand

Note to the relation between the binary and decimal significand
 Decimal32 has the precision p = 7, which means 7 decimal digits, and the range of the significand is from 0 to $9\,999\,999_{10}$.
 The corresponding range in the binary numeral system is from 0 to $98967F_{16}$. The higher binary numbers are non-canonical. The important point is that the most significant 4 bits have a value only from $0000_2$ to $1001_2$.
 Therefore, the combination field G of the interchange format encodes only numbers from 0 to 9.□
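The numbers in the note can be verified quickly. This is a short check for decimal32 only; the variable names are illustrative.

```python
P = 7                       # precision of decimal32 in decimal digits
C_MAX = 10 ** P - 1         # largest canonical coefficient, 9 999 999

print(hex(C_MAX))           # the same value in the hexadecimal system
print(C_MAX.bit_length())   # number of bits of the binary significand (t + 4)
top4 = C_MAX >> 20          # the leading 4 bits d0d1d2d3
print(bin(top4))
```

The largest coefficient needs 24 bits and its leading four bits are 1001, so the leading bits of a canonical binary significand never exceed the value 9.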
Number of decimal    Number of         Number of used bits    Number of possible    Redundancy
digits n1            combinations      in BCD code            combinations
                     10^n1             n2 = n1 × 4            2^n2                  2^n2 - 10^n1
1                    10                4                      16                    6
2                    100               8                      256                   156
3                    1 000             12                     4 096                 3 096
Table 09-03 Reason for declet
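The rows of Table 09-03 can be recomputed with a few lines. The sketch also compares them with a 10-bit group for three digits; the function name `bcd_redundancy` is illustrative.

```python
def bcd_redundancy(n1):
    """Return one row of Table 09-03 for n1 decimal digits stored in BCD."""
    n2 = 4 * n1                       # bits used by the BCD code
    used = 10 ** n1                   # combinations that carry a value
    possible = 2 ** n2                # all combinations of n2 bits
    return n1, used, n2, possible, possible - used


for n1 in (1, 2, 3):
    print(bcd_redundancy(n1))

declet_redundancy = 2 ** 10 - 10 ** 3   # three digits packed into 10 bits
print(declet_redundancy)
```

For three digits the 12-bit BCD form wastes 3 096 combinations, while a 10-bit group wastes only 24.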
9.8 Declet and densely-packed decimal

Declet is the encoding of three decimal digits into ten bits. □
The Binary-Coded Decimal, BCD code, is the code for the decimal numeral system in the binary digital system. BCD uses 4 bits and this 4-bit tuple defines 16 possible combinations. Only 10 combinations out of 16 are used and 6 combinations are redundant. Table 09-03 shows these calculations and, in the case of 3 decimal digits, the redundancy (3 096) is higher than the number of useful combinations (1 000). A 10-bit tuple has 1 024 combinations, so it can hold the 1 000 useful combinations with a redundancy of only 24. This is the basic idea of the declet. The standard IEEE 754-2008 defines a declet as an encoding of three decimal digits into ten bits by means of the densely-packed decimal encoding scheme. Table 09-04 shows the encoding rules, [wiki_0907].
Densely-packed decimal        Decimal digits       Values encoded       Description
encoded value
b9 b8 b7 b6 b5 b4 b3 b2 b1 b0 d2   d1   d0
a  b  c  d  e  f  0  g  h  i  0abc 0def 0ghi       (0-7)(0-7)(0-7)      Three small digits
a  b  c  d  e  f  1  0  0  i  0abc 0def 100i       (0-7)(0-7)(8-9)      Two small digits, one large
a  b  c  g  h  f  1  0  1  i  0abc 100f 0ghi       (0-7)(8-9)(0-7)      Two small digits, one large
g  h  c  d  e  f  1  1  0  i  100c 0def 0ghi       (8-9)(0-7)(0-7)      Two small digits, one large
a  b  c  1  0  f  1  1  1  i  0abc 100f 100i       (0-7)(8-9)(8-9)      One small digit, two large
d  e  c  0  1  f  1  1  1  i  100c 0def 100i       (8-9)(0-7)(8-9)      One small digit, two large
g  h  c  0  0  f  1  1  1  i  100c 100f 0ghi       (8-9)(8-9)(0-7)      One small digit, two large
x  x  c  1  1  f  1  1  1  i  100c 100f 100i       (8-9)(8-9)(8-9)      Three large digits
Source http://en.wikipedia.org/wiki/Densely_packed_decimal
Table 09-04 Densely-packed decimal encoding scheme
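The rows of Table 09-04 can be turned into a decoder with plain bit tests and no arithmetic conversion. The following is a minimal sketch of the decoding direction only; the function and helper names are illustrative, and a practical implementation might simply use a 1 024-entry look-up table instead.

```python
def decode_declet(declet):
    """Decode one 10-bit declet into the decimal digits (d2, d1, d0)."""
    if not 0 <= declet < 1024:
        raise ValueError("a declet has exactly 10 bits")
    b = [(declet >> i) & 1 for i in range(10)]   # b[i] is bit b_i of the table

    def small(x, y, z):                          # digit 0..7 from three bits
        return (x << 2) | (y << 1) | z

    def large(z):                                # digit 8 or 9 from its last bit
        return 8 | z

    if b[3] == 0:                                # three small digits
        return (small(b[9], b[8], b[7]), small(b[6], b[5], b[4]),
                small(b[2], b[1], b[0]))
    selector = (b[2], b[1])
    if selector == (0, 0):                       # only d0 is large
        return small(b[9], b[8], b[7]), small(b[6], b[5], b[4]), large(b[0])
    if selector == (0, 1):                       # only d1 is large
        return small(b[9], b[8], b[7]), large(b[4]), small(b[6], b[5], b[0])
    if selector == (1, 0):                       # only d2 is large
        return large(b[7]), small(b[6], b[5], b[4]), small(b[9], b[8], b[0])
    inner = (b[6], b[5])                         # two or three large digits
    if inner == (1, 0):                          # d2 is the small digit
        return small(b[9], b[8], b[7]), large(b[4]), large(b[0])
    if inner == (0, 1):                          # d1 is the small digit
        return large(b[7]), small(b[9], b[8], b[4]), large(b[0])
    if inner == (0, 0):                          # d0 is the small digit
        return large(b[7]), large(b[4]), small(b[9], b[8], b[0])
    return large(b[7]), large(b[4]), large(b[0])  # three large digits, b9 b8 ignored


print(decode_declet(0b0000000101))
print(decode_declet(0b0011111111))
print(len({decode_declet(d) for d in range(1024)}))   # number of distinct values
```

Decoding all 1 024 bit patterns yields 1 000 distinct digit triples; the surplus 24 patterns are the rows with three large digits in which the bits marked x are not both zero.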

The basic approach to the design of the densely packed format is described in literature [Cowlishaw_2000]:
“The primary advantage of the encoding over a pure binary representation in ten bits is
that no arithmetic is needed for conversion to or from BCD. Only a very few Boolean opera-
tions are needed for conversions – in hardware, encoding or decoding can be achieved with
only 2–3 gate delays; in software, a simple table look-up suffices. In addition, the encoding
has other advantages, for example, the least-significant bit of each digit remains unencod-
ed, which allows bit-per-digit operations to be effected directly.”
9.9 Rounding
The floating point format can represent exactly only some numbers that can be drawn on
the real number line as points. In Fig. 09-13, they are green points and the width of the gap
depends on the precision and the value of the exponent. Between two nearest green
points, there is an infinite amount of numbers that cannot be represented. However, the
non-representable numbers can be the result of operations and it is desirable to place them
to the interchange floating point formats. Therefore, the non-representable numbers must
be rounded to one of the nearest green points.
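Fig. 09-13 Representable numbers (real number line from -∞ to +∞; for half precision, the neighbouring representable points $1.0000\,0000\,00 \times 2^e$ and $1.0000\,0000\,01 \times 2^e$ are separated by the gap $0.0000\,0000\,01 \times 2^e$)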
In mathematics, there are more rules and also names for rounding. The basic rules for rounding in the floating point arithmetic are in the standard IEEE 754, see [wiki_0906] and [IEEE 754-2008].
 Round to nearest. The number is rounded to the nearest representable number. A problem occurs when the number lies exactly midway, because the distance to both nearest points is the same. In the decimal numeral system, this is the case of the final digit 5; e.g. for 12.345, the distance to 12.34 and to 12.35 is the same and equal to 0.005. Therefore, two definitions exist for this case of rounding.
 Round to nearest, ties to even. The number is rounded to the nearest value. When the number falls midway, it is rounded to the even value. This rounding is the default for binary floating point and it is the recommended default for decimal floating point.
 Round to nearest, ties away from zero. The number is rounded to the nearest value. When the number falls midway, it is rounded to the nearest value in the direction away from zero. For positive numbers, it is the higher number, and for negative numbers, it is the lower number. This rounding is intended as an option for the decimal floating point.
 Directed rounding. The direction is defined and all numbers in the gap are rounded in this direction.
 Round toward 0 - directed rounding towards zero; it is also known as truncation.
 Round toward +∞ - directed rounding towards positive infinity; it is also known as rounding up or ceiling.
 Round toward −∞ - directed rounding towards negative infinity; it is also known as rounding down or floor.
For better understanding, Table 09-05 shows the results for the different rules of rounding. Directed rounding causes no problems. For rounding to nearest, the differences occur for numbers lying midway and they are caused by the rules ties to even and ties away from zero. More information on the practical rounding of results is in the following chapter about performing arithmetic operations on the floating point numbers.
           Round to nearest                   Directed rounding
Number     Ties to even   Ties away           Round        Round        Round
                          from zero           toward 0     toward +∞    toward -∞
+ 20.4     + 20           + 20                + 20         + 21         + 20
+ 20.5     + 20           + 21                + 20         + 21         + 20
+ 20.6     + 21           + 21                + 20         + 21         + 20
+ 21.4     + 21           + 21                + 21         + 22         + 21
+ 21.5     + 22           + 22                + 21         + 22         + 21
+ 21.6     + 22           + 22                + 21         + 22         + 21
- 20.4     - 20           - 20                - 20         - 20         - 21
- 20.5     - 20           - 21                - 20         - 20         - 21
- 20.6     - 21           - 21                - 20         - 20         - 21
- 21.4     - 21           - 21                - 21         - 21         - 22
- 21.5     - 22           - 22                - 21         - 21         - 22
- 21.6     - 22           - 22                - 21         - 21         - 22
The two rules of rounding to nearest differ only in the rows + 20.5 and - 20.5 (green cells in the original table).
Tab. 09-05 Results of rounding for different principles
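The table can be reproduced with the rounding modes of the `decimal` module of the Python standard library. The mapping of the five rules to the constants `ROUND_HALF_EVEN`, `ROUND_HALF_UP`, `ROUND_DOWN`, `ROUND_CEILING` and `ROUND_FLOOR` is the assumption of this sketch; the helper `round_all` is illustrative.

```python
from decimal import (Decimal, ROUND_CEILING, ROUND_DOWN, ROUND_FLOOR,
                     ROUND_HALF_EVEN, ROUND_HALF_UP)

# Column order of Tab. 09-05
MODES = [
    ("ties to even", ROUND_HALF_EVEN),
    ("ties away from zero", ROUND_HALF_UP),
    ("toward 0", ROUND_DOWN),
    ("toward +inf", ROUND_CEILING),
    ("toward -inf", ROUND_FLOOR),
]


def round_all(text):
    """Round the decimal number given as text to an integer in all five modes."""
    x = Decimal(text)
    return [int(x.quantize(Decimal("1"), rounding=mode)) for _, mode in MODES]


for number in ("20.4", "20.5", "20.6", "21.5", "-20.5", "-21.5"):
    print(number, round_all(number))
```

The numbers are passed as text so that the midway values such as 20.5 are exact decimal values and not approximated binary floats.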
9.10 Not a Number

Not a Number, abbreviation NaN, is a symbolic value that is produced by arithmetic operations and functions. NaN means that something in the operation or function is not correct from the mathematical point of view. Typical examples are the square root of a negative number, the division of infinity by infinity, the multiplication of zero by infinity and so on. Literature [wiki_0908] and [Goldberg_1991] states the situations in which NaN is used.
There are three kinds of operations that can return NaN:
 Operations with a NaN as at least one operand. In this situation the input value
NaN is produced by a previous arithmetic operation.
 Indeterminate forms:
 Divisions, 0/0 and ±∞/±∞. Notice that the division of finite number by zero
results in infinity.
 Multiplications, 0 × ±∞ and ±∞ × 0.
 Additions, ∞ + (−∞), (−∞) + ∞ and equivalent subtractions.
 The standard has alternative functions for powers:
 The standard pow function and the integer exponent pown function define $0^0$, $1^\infty$ and $\infty^0$ as 1.
 The powr function defines all three indeterminate forms as invalid operations and so returns NaN.
 Real operations with complex results, for example:
 The square root of a negative number.
 The logarithm of a negative number.
 The inverse sine or cosine of a number that is less than −1 or greater than
+1.
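Some of these cases can be observed with ordinary Python floats, which are binary64 values on common platforms. The sketch is limited to operations for which Python returns NaN; note that Python itself raises an exception for `0.0 / 0.0` and for `math.sqrt(-1.0)` instead of returning NaN, so these two cases are not shown.

```python
import math

inf = math.inf

results = {
    "inf - inf": inf - inf,          # addition of opposite infinities
    "0 * inf": 0.0 * inf,            # multiplication of zero by infinity
    "inf / inf": inf / inf,          # division of infinity by infinity
    "nan + 1": math.nan + 1.0,       # NaN as an operand propagates
}
for name, value in results.items():
    print(name, "->", value, math.isnan(value))

# The ** operator follows the pow convention for the indeterminate powers
powers = (0.0 ** 0, 1.0 ** inf, inf ** 0)
print(powers)
```

A NaN compares unequal even to itself, which is why the check uses `math.isnan` and not the operator `==`.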
The standard IEEE 754 defines two NaN values, quiet NaN and signaling NaN. The basic difference is in setting the exception: only the signaling NaN sets the invalid exception, followed by a trap (interrupt service routine) in case it is enabled. More information about traps and interrupt handlers is in literature [wiki_0914], [wiki_0915], [wiki_0916], [Muller_2010] and [Ergovac_Lang_2004]. More information about NaN is in literature [IEEE 754-2008], [wiki_0908], [Muller_2010], [Ergovac_Lang_2004] and [Goldberg_1991].
9.11 Infinity
Infinity is a normal result of some operations. □
Infinity is a normal mathematical term and its use is related to mathematical limits. In floating point, infinity can be understood as a number that lies outside the finite numbers. The symbol of infinity is ∞, U+221E. Infinity can be produced by arithmetic operations or functions and it can also be an input operand. The following operations with infinity do not cause an exception with a trap handler, [IEEE 754-2008]:
 Addition or subtraction of the finite number with infinity.

 Multiplication of infinity and finite numbers that are not equal to zero.
 Division of infinity by a finite number, or of a finite number by infinity.
 Square root of plus infinity (+∞).
 Remainder (x, ∞). The numerator is a finite normal number x and the denominator is infinity. The result is the finite number x.
 Conversion of infinity into the same infinity in another format.
The exceptions are set, and the following trap handler can be run, when [IEEE 754-2008]:

 Infinity is an invalid operand.
 Infinity is created from finite operands by overflow or division by zero.
 Remainder (subnormal, ∞) signals the underflow. The numerator is a subnormal number and the denominator is infinity.
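The exact operations on infinite operands can be tried with Python floats. This is a sketch only; it relies on `math.remainder`, which is available from Python 3.7, for the IEEE style remainder.

```python
import math

inf = math.inf

added = inf + 5.0                       # addition of a finite number and infinity
multiplied = inf * -2.0                 # multiplication by a non-zero finite number
divided = (inf / 3.0, 3.0 / inf)        # division with infinity and a finite number
root = math.sqrt(inf)                   # square root of plus infinity
rem = math.remainder(5.0, inf)          # remainder(x, inf) for a finite x

print(added, multiplied, divided, root, rem)
```

All five operations return a result without raising a Python exception: infinity is returned where an infinity is expected, 3.0 / inf gives zero and the remainder returns the finite operand 5.0 unchanged.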
9.12 Default exceptions

Exceptions are special situations that can occur in the floating point arithmetic operations or functions. These exceptions are also called flags, and for each of them some way of handling is defined. Handling means running the trap handler or the interrupt service routine, when the trap is enabled. The standard defines only the default exceptions, and the following list introduces typical situations in which each exception occurs. More information is in literature [wiki_0906], [Muller_2010], [Ergovac_Lang_2004] and [IEEE 754-2008]. An actual implementation of the floating point arithmetic can define other exceptions. The default exceptions according to the standard are below.
 Invalid operation. Typically, invalid operation flag is set in these situations:

 Multiplication of zero by infinity.
 Addition of plus infinity and minus infinity.
 Division of zero by zero or division of infinity by infinity.
 Square root, if the operand is less than zero.
 And others.
 Division by zero. The flag is set in these situations:
 The divisor is zero and the dividend is a finite non-zero number; the result is infinity and its sign is the exclusive OR of the signs of both operands.
 Logarithm of zero; the result is minus infinity.
 Overflow. The result is too large to be represented correctly. Rounding can also produce the overflow. The default result is then plus or minus infinity.
 Underflow. The result is very small and it is inexact. Such numbers lie between zero and the range of normal numbers. Subnormal numbers also belong to this gap. In case of subnormal numbers, the inexact flag is also set.
 Inexact. The result is rounded.
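The flags can be observed in software with the `decimal` module of the Python standard library, whose signals include InvalidOperation, DivisionByZero, Overflow, Underflow and Inexact. The sketch assumes a context with the decimal32 parameters (7 digits, emax 96, emin -95) and with all traps disabled, so that every operation returns its default result and only sets flags; the helper `flags_after` is illustrative. The module also reports further signals, such as Rounded, that are not in the list above.

```python
from decimal import Context, Decimal


def flags_after(operation):
    """Run one operation in a fresh context and return (result, set of flag names)."""
    ctx = Context(prec=7, Emax=96, Emin=-95, traps=[])   # no trap is enabled
    result = operation(ctx)
    raised = {signal.__name__ for signal, is_set in ctx.flags.items() if is_set}
    return result, raised


cases = {
    "0 * inf": lambda c: c.multiply(Decimal(0), Decimal("Infinity")),
    "1 / 0": lambda c: c.divide(Decimal(1), Decimal(0)),
    "9E96 * 10": lambda c: c.multiply(Decimal("9E96"), Decimal(10)),
    "1E-95 * 1E-10": lambda c: c.multiply(Decimal("1E-95"), Decimal("1E-10")),
    "1 / 3": lambda c: c.divide(Decimal(1), Decimal(3)),
}
for name, operation in cases.items():
    print(name, flags_after(operation))
```

Enabling a trap in the context, for example `ctx.traps[DivisionByZero] = True` after importing `DivisionByZero` from `decimal`, makes the same operation raise a Python exception instead of returning the default result.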
9.13 Implementation
The floating point arithmetic can be implemented in hardware or in software. A hardware implementation is faster; the execution time of operations is minimal. Conversely, a software implementation is slower and the execution time of operations is longer.
The hardware realization is known as the FPU, Floating Point Unit, or coprocessor. The FPU can be made by the producer of the processor as a separate unit; most currently used processors have the FPU implemented directly. The instruction set of the FPU typically contains instructions for the basic floating point arithmetic, such as addition, subtraction, multiplication and division. More complex functions, such as logarithm or trigonometric functions, are implemented by software.

Software implementation depends on the hardware support. Without hardware support, all definitions of floating point formats and the execution of the basic floating point operations are realized by software. With existing hardware support, software libraries implement the missing operations and functions.
Note to the slowest operation
Division belongs to the slowest operations of a processor for all data types. Division is typically not implemented as a purely combinational circuit. It is performed as a sequence of additions and subtractions that is given by the algorithm of the division. It is a classical digital synchronous system and the sequence is generated by an FSM. □
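One well-known algorithm of this kind is restoring division, in which every step shifts in one bit of the dividend and tries one subtraction. The sketch below models it in software for unsigned integers; it is only an illustration of the step sequence and not necessarily the algorithm used in any particular processor. The function name and the parameter `n_bits` are illustrative.

```python
def restoring_divide(dividend, divisor, n_bits):
    """Divide two unsigned integers with one shift and one trial subtraction per bit."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    if not 0 <= dividend < (1 << n_bits):
        raise ValueError("dividend does not fit into n_bits")
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):            # one step per dividend bit, MSB first
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor                # trial subtraction
        quotient <<= 1
        if trial >= 0:                             # subtraction succeeded
            remainder = trial
            quotient |= 1
        # otherwise the old remainder is kept (restored)
    return quotient, remainder


print(restoring_divide(13, 3, 4))
```

The loop always takes n_bits steps, which corresponds to the fixed sequence of states that a controlling FSM would generate in hardware.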
The first version of the standard IEEE 754 was issued in 1985 and its revised version in 2008. Today, floating point arithmetic according to this standard is implemented in many processors and systems. The implementation of binary floating point arithmetic has long been known and it depends on the producer of a processor. The decimal floating point arithmetic according to the standard IEEE 754-2008 is newer and it is being introduced into practice. The website speleotrove.com lists the following implementations of decimal floating point arithmetic according to the standard IEEE 754-2008, [spel_0901]:
“The decimal-encoded formats and arithmetic described in the new standard now have
many implementations in hardware and software, including:
 the hardware decimal floating-point unit in the IBM POWER6 and POWER7 proces-
sors, the firmware (with assists) in the IBM System z9 mainframe, and the hardware
decimal floating-point unit in the IBM System z10 mainframe (see this paper for de-
tails)
 SilMinds’ Decimal Floating Point Arithmetic hardware IP Cores Family (see also their
presentation for some details)
 Fujitsu’s decimal instructions in the SPARC64 X processor (see presentation, charts
5 & 6).
 IBM XL C/C++ for AIX, Linux and z/OS, DB2 for z/OS, Linux, UNIX, and Windows, and
Enterprise PL/I for z/OS; IBM is also adding support to many other software prod-
ucts including z/VM V5.2, System i/OS, the dbx debugger, and Debug Tool Version
8.1
 SAP NetWeaver 7.1, which includes the new DECFLOAT datatype in ABAP, with
support for hardware decimal floating-point on Power6
 GCC 4.2 and later includes support for the proposed ISO C extensions for decimal
floating point.”
9.14 References
[Burks_1946] Burks, Arthur W.; Goldstine, Herman H.; Von Neumann, John (1946).
Preliminary discussion of the logical design of an electronic computing
instrument. Technical Report, Institute for Advanced Study, Princeton, NJ.
In Von Neumann, Collected Works, Vol. 5, A. H. Taub, ed., MacMillan, New
York, 1963, p. 42:

5.3. 'Several of the digital computers being built or planned in this

country and England are to contain a so-called "floating decimal
point". This is a mechanism for expressing each word as a
characteristic and a mantissa—e.g. 123.45 would be carried in the
machine as (0.12345,03), where the 3 is the exponent of 10
associated with the number.'
[Cowlishaw_2000] Cowlishaw, M. F. (2000-10-03). "Summary of Densely Packed Decimal encoding". Retrieved 2008-09-10; http://speleotrove.com/decimal/DPDecimal.html; on line 2013-07-12
[Cowlishaw_2002] Cowlishaw, M. F. (May 2002). "Densely packed decimal encoding". IEE Proceedings – Computers and Digital Techniques (Institution of Electrical Engineers) 149 (3): 102–104. doi:10.1049/ip-cdt:20020407. ISSN 1350-2387.
[IEEE 754-1985] IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985

[Goldberg_1991] David Goldberg: What Every Computer Scientist Should Know About
Floating-Point Arithmetic; published in March, 1991 issue of Computing
Surveys. Copyright 1991, Association for Computing Machinery Inc.
[IEEE 754-2008] IEEE Std 754™-2008, IEEE Standard for Floating-Point Arithmetic, 29
August 2008, revision of IEEE 754 – 1985


8176-4704-9; e-ISBN 978-0-8176-4705-6
[Randeli_1982] B. Randell (1982). From analytical engine to electronic digital computer: the
contributions of Ludgate, Torres, and Bush. IEEE Annals of the History of
Computing, 04(4). pp. 327–341.
[RFC 387] SOME EXPERIENCES IN IMPLEMENTING NETWORK GRAPHICS PROTOCOL LEVEL 0; 1972; http://tools.ietf.org/html/rfc387; on line 2013-06-13
[Rojas_1997] R. Rojas: "Konrad Zuse’s Legacy: The Architecture of the Z1 and Z3". IEEE Annals of the History of Computing 19 (2): 5–15. 1997; http://ed-thelen.org/comp-hist/Zuse_Z1_and_Z3.pdf; on line 2013-06-18
[spel_0901] General Decimal Arithmetic; http://speleotrove.com/decimal/; on line 2013-07-18

[wiki_0901] Floating point; http://en.wikipedia.org/wiki/Floating_point; on line 2013-06-13
[wiki_0902] Scientific notation; http://en.wikipedia.org/wiki/Scientific_notation; on line 2013-06-13
[wiki_0903] Engineering notation; http://en.wikipedia.org/wiki/Engineering_notation; on line 2013-06-13
[wiki_0904] Normalized number; http://en.wikipedia.org/wiki/Normalized_number; on line 2013-06-13
[wiki_0905] Significand; http://en.wikipedia.org/wiki/Significand; on line 2013-06-13
[wiki_0906] IEEE floating point; http://en.wikipedia.org/wiki/IEEE_floating_point; or http://en.wikipedia.org/wiki/IEEE_754; on line 2013-06-13
[wiki_0907] Densely packed decimal; http://en.wikipedia.org/wiki/Densely_packed_decimal; on line 2013-07-12
[wiki_0908] NaN; http://en.wikipedia.org/wiki/NaN; on line 2013-07-17
[wiki_0909] Denormal number; http://en.wikipedia.org/wiki/Subnormal_number; on line 2013-07-18
[wiki_0910] Half-precision floating-point format; http://en.wikipedia.org/wiki/Binary16; on line 2014-07-22
[wiki_0911] decimal32 floating-point format; http://en.wikipedia.org/wiki/Decimal32; on line 2014-07-22
[wiki_0912] Minifloat; http://en.wikipedia.org/wiki/Minifloat; on line 2014-07-22
[wiki_0913] International System of Units; http://en.wikipedia.org/wiki/SI; on line 2014-08-21
[wiki_0914] Trap (computing); http://en.wikipedia.org/wiki/Trap_(computing); on line 2014-08-21
[wiki_0915] Interrupt handler; http://en.wikipedia.org/wiki/Interrupt_service_routine; on line 2014-08-21
[wiki_0916] Interrupt; http://en.wikipedia.org/wiki/Interrupt; on line 2014-08-21
[Zuse_2008] H. Zuse: Konrad Zuses Z3 in Detail; April 2008; http://staffweb.worc.ac.uk/DrC/Courses%202008-9/Comp%203104/Reading%20Materials/Z3-detail-english.pdf; on line 2013-06-13

9.15 Annex 09A

Each floating-point number has just one encoding in a binary interchange format. To make the encoding unique, in terms of the parameters in 3.3, the value of the significand m is maximized by decreasing e until either e = emin or m ≥ 1. After this process is done, if e = emin and 0 < m < 1, the floating-point number is subnormal. Subnormal numbers (and zero) are encoded with a reserved biased exponent value.
Representations of floating-point data in the binary interchange formats are encoded in k bits in the following three fields ordered as shown in Figure 3.1:
• 1-bit sign S
• w-bit biased exponent, E = e + bias
• (t = p − 1)-bit trailing significand field digit string T = d_1 d_2 … d_(p−1); the leading bit of the significand, d_0, is implicitly encoded in the biased exponent E.

| 1 bit | w bits (MSB … LSB) | t = p − 1 bits (MSB … LSB) |
|---|---|---|
| S | E | T |
| | E_0 … E_(w−1) | d_1 d_2 … d_(p−1) |

Figure 3.1. Binary interchange floating-point format
The values of k, p, t, w, and bias for binary interchange formats are listed in Table 3.5 (see 3.6). The range of the encoding’s biased exponent E shall include:
- every integer between 1 and 2^w − 2, inclusive, to encode normal numbers
- the reserved value 0 to encode ±0 and subnormal numbers
- the reserved value 2^w − 1 to encode ±∞ and NaNs.
The representation r of the floating-point datum, and value v of the floating-point datum represented, are inferred from the constituent fields as follows:
a) If E = 2^w − 1 and T ≠ 0, then r is qNaN or sNaN and v is NaN regardless of S (see 6.2.1).
b) If E = 2^w − 1 and T = 0, then r and v = (−1)^S × (+∞).
c) If 1 ≤ E ≤ 2^w − 2, then r is (S, (E − bias), (1 + 2^(1−p) × T)); the value of the corresponding floating-point number is v = (−1)^S × 2^(E−bias) × (1 + 2^(1−p) × T); thus normal numbers have an implicit leading significand bit of 1.
d) If E = 0 and T ≠ 0, then r is (S, emin, (0 + 2^(1−p) × T)); the value of the corresponding floating-point number is v = (−1)^S × 2^emin × (0 + 2^(1−p) × T); thus subnormal numbers have an implicit leading significand bit of 0.
e) If E = 0 and T = 0, then r is (S, emin, 0) and v = (−1)^S × (+0) (signed zero, see 6.3).
End of the citation of IEEE 754-2008.

The value v of a normal or subnormal number can be calculated with the scaling factor 1/2^(p−1), as in the citation above. The second possibility is to use the scientific form of the significand. The trailing significand field contains the bit string d_1 d_2 … d_(p−1), which is the fraction part of the significand. The biased exponent E determines whether the form is normal or subnormal and, accordingly, whether the leading (MSB) bit d_0 is 1 or 0. The formulas for the value are then v = (−1)^S × 2^(E−bias) × (1.T) for normal numbers and v = (−1)^S × 2^emin × (0.T) for subnormal numbers.
Note to the border between normal and subnormal numbers

The explanation is shown for the format binary32, where emax is 127, emin is −126 and bias is 127.
• The smallest normal number has the exponent e = −126 and is 1.0… × 2^−126; its biased exponent is E = 1.
• The next smaller number is 1.11… × 2^−127 = 0.111… × 2^−126. It is a subnormal number with the exponent e = −126 and a nonzero significand whose MSB bit is equal to zero. In this case the biased exponent is E = 0.
• When the biased exponent is E = 0 and the significand is equal to zero, the exponent is e = −126 and the resulting value is 0.0 × 2^−126 = 0.0; with S = 0 it is a plus zero. □
For better understanding, all values of the binary interchange floating point format are shown in Table 09-A01. The important part is the biased exponent field: if this field has all ones, the value is NaN or plus/minus infinity, according to the trailing significand field. The value zero is a very simple combination because the biased exponent and the trailing significand field have all zeros; only the sign distinguishes a plus zero from a minus zero. The preferred combination for zero is a plus zero; in this case all three fields have all zeros. The remaining combinations correspond to normal or subnormal values.
| Sign | Biased exponent | Trailing field | Value |
|---|---|---|---|
| x | All 1 | 1xxxx | Not a Number, NaN: qNaN |
| x | All 1 | 0xxxx and not all 0 | Not a Number, NaN: sNaN |
| 0/1 | All 1 | All 0 | Plus or minus infinity |
| 0/1 | Remaining combinations | Value T | Normal value v = (−1)^S × 2^(E−bias) × (1.T) |
| 0/1 | All 0 | Not zero | Subnormal value v = (−1)^S × 2^emin × (0.T) |
| 0/1 | All 0 | All 0 | Plus or minus zero |

Table 09-A01 Value of the binary interchange format
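The classification in Table 09-A01 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the standard: the function names `decode_binary` and `binary32_word` are hypothetical, and the bias is assumed to be 2^(w−1) − 1, as it is for binary32 (w = 8, p = 24).

```python
import struct

def decode_binary(word: int, w: int = 8, p: int = 24):
    """Classify a binary interchange word; return (kind, value)."""
    t = p - 1                       # width of the trailing significand field T
    bias = (1 << (w - 1)) - 1       # assumed bias, 127 for binary32
    emin = 1 - bias                 # -126 for binary32
    S = (word >> (w + t)) & 1
    E = (word >> t) & ((1 << w) - 1)
    T = word & ((1 << t) - 1)
    sign = -1.0 if S else 1.0
    if E == (1 << w) - 1:           # biased exponent has all ones
        if T == 0:
            return "infinity", sign * float("inf")
        # MSB of T set -> quiet NaN, otherwise signaling NaN
        return ("qNaN" if T >> (t - 1) else "sNaN"), float("nan")
    if E == 0:                      # biased exponent has all zeros
        if T == 0:
            return "zero", sign * 0.0
        return "subnormal", sign * 2.0 ** emin * (T / 2 ** t)      # 0.T
    return "normal", sign * 2.0 ** (E - bias) * (1 + T / 2 ** t)   # 1.T

def binary32_word(x: float) -> int:
    """Helper: the 32-bit pattern of x after rounding to binary32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]
```

For example, `decode_binary(binary32_word(1.0))` falls into the "remaining combinations" row of the table and is reported as a normal value.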

9.16 Annex 09B
Representations of floating-point data in the decimal interchange formats are encoded in k bits in the following three fields, whose detailed layouts and canonical (preferred) encodings are described below.
a) 1-bit sign S.
b) A w + 5 bit combination field G encoding classification and, if the encoded datum is a finite number, the exponent q and four significand bits (1 or 3 of which are implied). The biased exponent E is a w + 2 bit quantity q + bias, where the value of the first two bits of the biased exponent taken together is either 0, 1, or 2.
c) A t-bit trailing significand field T that contains J × 10 bits and contains the bulk of the significand. When this field is combined with the leading significand bits from the combination field, the format encodes a total of p = 3 × J + 1 decimal digits.

| 1 bit | w + 5 bits | t = J × 10 bits |
|---|---|---|
| S | G | T |
| | G_0 … G_(w+4) | Decimal encoding: J declets give 3 × J = p − 1 digits |

Figure 3.2. Decimal interchange floating-point formats
The representation r of the floating-point datum, and value v of the floating-point datum represented, are inferred from the constituent fields as follows:
a) If G_0 through G_4 are 11111, then v is NaN regardless of S. Furthermore, if G_5 is 1, then r is sNaN; otherwise r is qNaN. The remaining bits of G are ignored, and T constitutes the NaN’s payload, which can be used to distinguish various NaNs.
   The NaN payload is encoded similarly to finite numbers described below, with G treated as though all bits were zero. The payload corresponds to the significand of finite numbers, interpreted as an integer with a maximum value of 10^(3×J) − 1, and the exponent field is ignored (it is treated as if it were zero). A NaN is in its preferred (canonical) representation if the bits G_6 through G_(w+4) are zero and the encoding of the payload is canonical.
b) If G_0 through G_4 are 11110 then r and v = (−1)^S × (+∞). The values of the remaining bits in G, and T, are ignored. The two canonical representations of infinity have bits G_5 through G_(w+4) = 0, and T = 0.
c) For finite numbers, r is (S, E − bias, C) and v = (−1)^S × 10^(E−bias) × C, where C is the concatenation of the leading significand digit or bits from the combination field G and the trailing significand field T, and where the biased exponent E is encoded in the combination field. The encoding within these fields depends on whether the implementation uses the decimal or the binary encoding for the significand.
   1. If the implementation uses the decimal encoding for the significand, then the least significant w bits of the exponent are G_5 through G_(w+4). The most significant two bits of the biased exponent and the decimal digit string d_0 d_1 … d_(p−1) of the significand are formed from bits G_0 through G_4 and T as follows:
      i. When the most significant five bits of G are 110xx or 1110x, the leading significand digit d_0 is 8 + G_4, a value 8 or 9, and the leading biased exponent bits are 2 × G_2 + G_3, a value 0, 1, or 2.
      ii. When the most significant five bits of G are 0xxxx or 10xxx, the leading significand digit d_0 is 4 × G_2 + 2 × G_3 + G_4, a value in the range of 0−7, and the leading biased exponent bits are 2 × G_0 + G_1, a value 0, 1, or 2. Consequently if T is 0 and the most significant five bits of G are 00000, 01000, or 10000, then v = (−1)^S × (+0).
      The p − 1 = 3 × J decimal digits d_1 … d_(p−1) are encoded by T which contains J declets encoded in densely-packed decimal.
      A canonical significand has only canonical declets, as shown in Tables 3.3 and 3.4. Computational operations produce only the 1000 canonical declets, but also accept the 24 non-canonical declets in operands.
   2. Alternatively, if the implementation uses the binary encoding for the significand, then:
      i. If G_0 and G_1 together are one of 00, 01, or 10, then the biased exponent E is formed from G_0 through G_(w+1) and the significand is formed from bits G_(w+2) through the end of the encoding (including T).
      ii. If G_0 and G_1 together are 11 and G_2 and G_3 together are one of 00, 01, or 10, then the biased exponent E is formed from G_2 through G_(w+3) and the significand is formed by prefixing the 4 bits (8 + G_(w+4)) to T.
      The maximum value of the binary-encoded significand is the same as that of the corresponding decimal-encoded significand; that is, 10^(3×J+1) − 1 (or 10^(3×J) − 1 when T is used as the payload of a NaN). If the value exceeds the maximum, the significand c is noncanonical and the value used for c is zero.
   Computational operations generally produce only canonical significands, and always accept noncanonical significands in operands.
End of the citation of IEEE 754-2008.

Table 09-B01 shows the encoding of the combination field G for Not a Number and infinity. In the case of NaN, the trailing significand field contains a payload, which describes the NaN in detail. In the case of infinity, the trailing field has no meaning and is ignored. Table 09-B02 continues with the encoding of finite values. In this case, the combination field contains the biased exponent and the leading digit or bits of the significand. A decimal significand uses the densely packed decimal encoding in the trailing field, whereas a binary significand is placed into the trailing field directly without encoding. Zero is encoded by a significand equal to zero.
The form of the significand is given by the implementation or is agreed beforehand. From the bits of a decimal interchange floating point value alone, it is impossible to tell whether the coefficient is encoded as a binary or a decimal number.
| Sign | Combination field G | Trailing significand field T | Value |
|---|---|---|---|
| x | 11111 …, G_5 = 1 | payload | Not a Number, NaN: sNaN |
| x | 11111 …, G_5 = 0 | payload | Not a Number, NaN: qNaN |
| 0/1 | 11110 … | | Plus or minus infinity |

Table 09-B01 Values of NaN and infinity in the decimal interchange format
| Sign | Combination field G | Trailing field | Significand | Value |
|---|---|---|---|---|
| 0/1 | E, d_0 | d_1 … d_(p−1) | C_10 = d_0 + d_1 … d_(p−1) | v = (−1)^S × C_10 × 10^(E−bias); significand C is not zero |
| 0/1 | E, d_0d_1d_2d_3 | d_4d_5d_6d_7 … d_(t+3) | C_2 = d_0d_1d_2d_3 + d_4 … d_(t+3) | v = (−1)^S × C_2 × 10^(E−bias); significand C is not zero |
| 0/1 | 0 | 0 | C = 0 | Plus or minus zero; significand C is zero |

Explanation:
• Operator “+” is the overloaded operator and it means the concatenation.
Table 09-B02 Finite values of the decimal interchange format

9.17 Annex 09C

The decimal interchange floating point format is defined by three fields: the sign S, the combination field G and the trailing significand field T, Fig. 09C-01. Finite numbers use a coefficient in the binary or the decimal numeral system. In the case of the decimal numeral system, the least significant part of the coefficient is placed in the trailing significand field as declets. The leading decimal digit is encoded in the combination field. The biased exponent is also encoded in the combination field.

| S | G | T |
|---|---|---|
| | G_0 … G_(w+4) | Decimal encoding: J declets give 3 × J = p − 1 digits |

Fig. 09C-01 Definition of fields in the decimal floating point format
If the coefficient is in the binary numeral system, the trailing significand field contains the least significant t bits of the coefficient. The most significant 4 bits are encoded in the combination field.
The following tables show the encoding scheme of the biased exponent E and of the leading bits or the leading decimal digit. Table 09C-01 shows the detailed encoding of the leading decimal digit and the biased exponent in the combination field. Table 09C-02 shows the encoding scheme for the value zero.
Table 09C-03 shows the detailed encoding scheme of the leading bits of the binary coefficient and the biased exponent. Table 09C-04 shows the encoding scheme for the value zero.

Tables for decimal coefficient.
| Combination field G (G_0 … G_4, then G_5 … G_(w+4)) | Biased exponent E (E_0 … E_(w+1)) | Leading decimal digit d_0 | Note |
|---|---|---|---|
| 1111 … | | | NaN or infinity |
| 11101 … | 10 + G_5 … G_(w+4) | 9 | Decimal digit is 8 plus G_4. These are all combinations for digits 8 and 9, with all combinations of the most significant 2 bits of the biased exponent. |
| 11100 … | 10 + G_5 … G_(w+4) | 8 | |
| 11011 … | 01 + G_5 … G_(w+4) | 9 | |
| 11010 … | 01 + G_5 … G_(w+4) | 8 | |
| 11001 … | 00 + G_5 … G_(w+4) | 9 | |
| 11000 … | 00 + G_5 … G_(w+4) | 8 | |
| 10111 … | 10 + G_5 … G_(w+4) | 7 | Decimal digit is 0G_2G_3G_4. These are all combinations for digits 0 to 7, with the combination 10B as the most significant 2 bits of the biased exponent. |
| 10110 … | 10 + G_5 … G_(w+4) | 6 | |
| 10101 … | 10 + G_5 … G_(w+4) | 5 | |
| 10100 … | 10 + G_5 … G_(w+4) | 4 | |
| 10011 … | 10 + G_5 … G_(w+4) | 3 | |
| 10010 … | 10 + G_5 … G_(w+4) | 2 | |
| 10001 … | 10 + G_5 … G_(w+4) | 1 | |
| 10000 … | 10 + G_5 … G_(w+4) | 0 | |
| 01111 … | 01 + G_5 … G_(w+4) | 7 | Decimal digit is 0G_2G_3G_4. These are all combinations for digits 0 to 7, with the combination 01B as the most significant 2 bits of the biased exponent. |
| 01110 … | 01 + G_5 … G_(w+4) | 6 | |
| 01101 … | 01 + G_5 … G_(w+4) | 5 | |
| 01100 … | 01 + G_5 … G_(w+4) | 4 | |
| 01011 … | 01 + G_5 … G_(w+4) | 3 | |
| 01010 … | 01 + G_5 … G_(w+4) | 2 | |
| 01001 … | 01 + G_5 … G_(w+4) | 1 | |
| 01000 … | 01 + G_5 … G_(w+4) | 0 | |
| 00111 … | 00 + G_5 … G_(w+4) | 7 | Decimal digit is 0G_2G_3G_4. These are all combinations for digits 0 to 7, with the combination 00B as the most significant 2 bits of the biased exponent. |
| 00110 … | 00 + G_5 … G_(w+4) | 6 | |
| 00101 … | 00 + G_5 … G_(w+4) | 5 | |
| 00100 … | 00 + G_5 … G_(w+4) | 4 | |
| 00011 … | 00 + G_5 … G_(w+4) | 3 | |
| 00010 … | 00 + G_5 … G_(w+4) | 2 | |
| 00001 … | 00 + G_5 … G_(w+4) | 1 | |
| 00000 … | 00 + G_5 … G_(w+4) | 0 | |

Explanation:
• Blue colored bits belong to the biased exponent E.
• Red colored bits determine the leading decimal digit d_0.
• Operator “+” is the overloaded operator and it means the concatenation of strings.
Table 09C-01 Encoding scheme of biased exponent and leading digit for decimal coefficient
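The rules of Table 09C-01 can be written as a short decoding routine. The following Python sketch is an illustration only; the function name `decode_combination` and its return convention are hypothetical. It takes the combination field G as an unsigned integer of w + 5 bits, with G_0 as the most significant bit.

```python
def decode_combination(G: int, w: int):
    """Decode a (w+5)-bit combination field for the decimal encoding.

    Returns ("NaN", None, None), ("infinity", None, None) or
    ("finite", E, d0) with the biased exponent E and the leading digit d0.
    """
    top5 = G >> w                    # bits G0..G4
    low = G & ((1 << w) - 1)         # bits G5..G(w+4): low w exponent bits
    if top5 == 0b11111:
        return ("NaN", None, None)
    if top5 == 0b11110:
        return ("infinity", None, None)
    if top5 >> 3 == 0b11:            # 110xx or 1110x: digit 8 or 9
        exp_hi = (top5 >> 1) & 0b11  # G2 G3 are the leading exponent bits
        d0 = 8 + (top5 & 1)          # 8 + G4
    else:                            # 0xxxx or 10xxx: digit 0..7
        exp_hi = top5 >> 3           # G0 G1 are the leading exponent bits
        d0 = top5 & 0b111            # 0 G2 G3 G4
    return ("finite", (exp_hi << w) | low, d0)
```

For decimal32 the combination field has 11 bits, so the routine is called with w = 6.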

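| Combination field G (G_0 to G_4, then G_5 … G_(w+4)) | Leading decimal digit d_0 | Trailing significand field T | Value | Note |
|---|---|---|---|---|
| 10000 …, 01000 … or 00000 … | 0 | = 0 | (−1)^S × 0 | Significand C is equal to 0 |
| 11101 … through 10000 … | 9 through 0 | ≠ 0 (for d_0 = 0) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| 01111 … through 01000 … | 7 through 0 | ≠ 0 (for d_0 = 0) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| 00111 … through 00000 … | 7 through 0 | ≠ 0 (for d_0 = 0) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |

Note: declet 000 is encoded by 10-bit tuple (00 0000 0000)B
Explanation:
• Red colored bits determine the leading decimal digit d_0.
Table 09C-02 Encoding scheme for zero and finite numbers for decimal coefficient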

Tables for binary coefficient.
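| Bits shown | Combination field G (G_0 … G_(w+4)) | Biased exponent E (E_0 … E_(w+1)) | Leading 4 bits d_0d_1d_2d_3 |
|---|---|---|---|
| G_0 G_1 G_2 G_3 … G_(w+4) | 1111 ……… | | NaN or infinity |
| | 1110 ……… 1 | 10 + G_4 … G_(w+3) | 1001 |
| | 1110 ……… 0 | 10 + G_4 … G_(w+3) | 1000 |
| | 1101 ……… 1 | 01 + G_4 … G_(w+3) | 1001 |
| | 1101 ……… 0 | 01 + G_4 … G_(w+3) | 1000 |
| | 1100 ……… 1 | 00 + G_4 … G_(w+3) | 1001 |
| | 1100 ……… 0 | 00 + G_4 … G_(w+3) | 1000 |
| G_0 G_1 … G_(w+2) … G_(w+4) | 10 ……… 111 | 10 + G_2 … G_(w+1) | 0111 |
| | 10 ……… 110 | 10 + G_2 … G_(w+1) | 0110 |
| | 10 ……… 101 | 10 + G_2 … G_(w+1) | 0101 |
| | 10 ……… 100 | 10 + G_2 … G_(w+1) | 0100 |
| | 10 ……… 011 | 10 + G_2 … G_(w+1) | 0011 |
| | 10 ……… 010 | 10 + G_2 … G_(w+1) | 0010 |
| | 10 ……… 001 | 10 + G_2 … G_(w+1) | 0001 |
| | 10 ……… 000 | 10 + G_2 … G_(w+1) | 0000 |
| G_0 G_1 … G_(w+2) … G_(w+4) | 01 ……… 111 | 01 + G_2 … G_(w+1) | 0111 |
| | 01 ……… 110 | 01 + G_2 … G_(w+1) | 0110 |
| | 01 ……… 101 | 01 + G_2 … G_(w+1) | 0101 |
| | 01 ……… 100 | 01 + G_2 … G_(w+1) | 0100 |
| | 01 ……… 011 | 01 + G_2 … G_(w+1) | 0011 |
| | 01 ……… 010 | 01 + G_2 … G_(w+1) | 0010 |
| | 01 ……… 001 | 01 + G_2 … G_(w+1) | 0001 |
| | 01 ……… 000 | 01 + G_2 … G_(w+1) | 0000 |
| G_0 G_1 … G_(w+2) … G_(w+4) | 00 ……… 111 | 00 + G_2 … G_(w+1) | 0111 |
| | 00 ……… 110 | 00 + G_2 … G_(w+1) | 0110 |
| | 00 ……… 101 | 00 + G_2 … G_(w+1) | 0101 |
| | 00 ……… 100 | 00 + G_2 … G_(w+1) | 0100 |
| | 00 ……… 011 | 00 + G_2 … G_(w+1) | 0011 |
| | 00 ……… 010 | 00 + G_2 … G_(w+1) | 0010 |
| | 00 ……… 001 | 00 + G_2 … G_(w+1) | 0001 |
| | 00 ……… 000 | 00 + G_2 … G_(w+1) | 0000 |

Explanation:
• Red colored bits determine the leading 4-bit tuple of binary coefficient, d_0d_1d_2d_3.
• Operator “+” is the overloaded operator and it means the concatenation of strings.
Table 09C-03 Encoding scheme of biased exponent and leading bits for binary coefficient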

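| Bits shown | Combination field G (G_0 … G_(w+4)) | Leading 4 bits d_0d_1d_2d_3 | Trailing significand field T | Value | Note |
|---|---|---|---|---|---|
| G_0 G_1 … G_(w+2) … G_(w+4) | 10 … 000, 01 … 000 or 00 … 000 | 0000 | = 0 | (−1)^S × 0 | Significand C is equal to 0 |
| G_0 to G_1 | 11 ……… 1 | 1001 | | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| | 11 ……… 0 | 1000 | | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| | 10 … 111 through 10 … 000 | 0111 through 0000 | ≠ 0 (for 0000) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| | 01 … 111 through 01 … 000 | 0111 through 0000 | ≠ 0 (for 0000) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |
| | 00 … 111 through 00 … 000 | 0111 through 0000 | ≠ 0 (for 0000) | (−1)^S × C × 10^(E−bias) | Finite values; significand C is not equal to zero |

Explanation:
• Red colored bits determine the leading 4-bit tuple of binary coefficient, d_0d_1d_2d_3.
Table 09C-04 Encoding scheme for zero and finite numbers for binary coefficient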
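As an illustration of Tables 09C-03 and 09C-04, the following Python sketch decodes a 32-bit word with a binary coefficient. It assumes the decimal32 parameters w = 6, t = 20, p = 7 and bias = 101; the function name `decode_decimal32_bid` and the returned strings are hypothetical choices for this sketch.

```python
from fractions import Fraction

W, T_BITS, BIAS, P = 6, 20, 101, 7        # assumed decimal32 parameters

def decode_decimal32_bid(word: int):
    """Return the exact value of a decimal32 word with a binary coefficient,
    or a string for NaN and infinity."""
    S = (word >> 31) & 1
    G = (word >> T_BITS) & 0x7FF           # 11-bit combination field
    T = word & ((1 << T_BITS) - 1)
    top5 = G >> W                          # bits G0..G4
    if top5 == 0b11111:
        return "NaN"
    if top5 == 0b11110:
        return "-infinity" if S else "+infinity"
    if G >> 9 != 0b11:                     # G0 G1 is 00, 01 or 10
        E = G >> 3                         # G0..G(w+1)
        C = ((G & 0b111) << T_BITS) | T    # 0 + G(w+2)..G(w+4) + T
    else:                                  # G0 G1 is 11
        E = (G >> 1) & 0xFF                # G2..G(w+3)
        C = ((8 | (G & 1)) << T_BITS) | T  # 100 + G(w+4) + T
    if C > 10 ** P - 1:
        C = 0                              # non-canonical coefficient counts as zero
    return (-1) ** S * C * Fraction(10) ** (E - BIAS)
```

A word with E = 101 and C = 1 decodes to the value 1, because E − bias = 0.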

10 Floating point arithmetic
Floating point arithmetic covers the basic mathematical operations and also functions such as trigonometric functions, logarithms and exponentiation. The inputs of these operations are operands in the interchange floating point format, and the result must be in the canonical interchange floating point format. While an operation or function is being performed, an internal format is used to ensure the highest accuracy of the result. The calculated result can therefore have a higher precision than the interchange floating point format requires. The growth of the number of bits in the result can be seen in the following examples. Addition can increase the size of the integer part by one position: 1.01B + 1.01B = 10.1B, 9D + 4D = 13D. The maximum size of a product is the sum of the sizes of both operands: 1.01B * 1.01B = 1.1001B or 8D * 16D = 128D. These principles are valid for any numeral system. However, the result is expected to be in the canonical interchange floating point format. After each floating point operation, it is therefore necessary to perform normalization, rounding and the setting of exceptions.
Margin note: Canonical format is a format defined by IEEE 754. □
Margin note: Normalization in binary. □
Normalization is an adjustment to the canonical binary interchange format. The standard IEEE 754-2008 defines a normal or a subnormal form of the significand in the binary interchange format. Both forms use the significand in the scientific notation. The values of these forms are defined by formulas (1001) and (1002).

$$v = (-1)^{S} \times \left(1 + \frac{T}{2^{p-1}}\right) \times 2^{e} \qquad (1001)$$

$$v = (-1)^{S} \times \left(0 + \frac{T}{2^{p-1}}\right) \times 2^{emin} \qquad (1002)$$

$$v = (-1)^{S} \times C \times 10^{q} \qquad (1003)$$
Where

• S is the sign.
• T is the value of the trailing significand field as an unsigned integer.
• 1/2^(p−1) is the scaling factor.
• C is the significand in the decimal format, given as a coefficient.
• e is the exponent in the binary format; it is equal to E − bias.
• q is the quantum, the exponent in the decimal format; it is equal to E − bias.
• E is the unsigned integer value of the biased exponent.
• bias is the offset of the offset binary representation.
• emin is the minimal value of the exponent; it is equal to 1 − emax.

Note to a normal and subnormal form
The normal form has the leading bit of the significand equal to 1, e.g. 1.0001 * 2^3. The subnormal form has the leading bit of the significand equal to 0 and the exponent e equal to emin. An example of a subnormal number in binary32 is 0.001 * 2^−126. □
The decimal interchange floating point format has values that are defined by formula (1003). The significand is a coefficient, which is an unsigned decimal or an unsigned binary integer. The significand in the decimal interchange format is stored in a preferred form, Fig. 10-01. When the number of valid digits is less than the precision p, several forms with the same value are possible. Which of them is preferred depends on the operation; the standard IEEE 754-2008 specifies the preferred forms in detail. When the number of digits is equal to the precision p, the number stays unchanged. If the number of valid digits is higher than the precision p, there is only one preferred form: the number has to be rounded to p digits, and the exponent is incremented accordingly, [Internet_1001].
Margin note: Preferred significand. □

| Number | 1234 * 10^0 | 1234567 * 10^0 | 1234567111 * 10^0 |
|---|---|---|---|
| Significand with p = 7 | 0001234 * 10^0 | 1234567 * 10^0 | 1234567 * 10^3 |
| | 0012340 * 10^−1 | | |
| | 0123400 * 10^−2 | | |
| | 1234000 * 10^−3 | | |

Fig. 10-01 Preferred significand for the decimal format with precision p = 7
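The forms in Fig. 10-01 can be generated mechanically. The following Python sketch is an illustration only; the function name `preferred_forms` is hypothetical, the coefficient is assumed to be a non-negative integer, and round to nearest, ties to even is assumed for the case with too many digits.

```python
def preferred_forms(coefficient: int, exponent: int, p: int = 7):
    """Return the (coefficient, exponent) pairs of the value that fit into p digits."""
    digits = len(str(coefficient))
    if digits <= p:
        # Each shift of the coefficient by one digit to the left
        # is compensated by decrementing the exponent.
        return [(coefficient * 10 ** k, exponent - k)
                for k in range(p - digits + 1)]
    drop = digits - p                        # digits that do not fit
    kept, rest = divmod(coefficient, 10 ** drop)
    half = 5 * 10 ** (drop - 1)
    if rest > half or (rest == half and kept % 2 == 1):
        kept += 1                            # round to nearest, ties to even
        if kept == 10 ** p:                  # carry out of p digits
            kept //= 10
            drop += 1
    return [(kept, exponent + drop)]
```

Calling `preferred_forms(1234, 0)` lists the four forms of the first column of the figure, and `preferred_forms(1234567111, 0)` yields the single rounded form with the exponent 3.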
After the normalization, rounding and the setting of exceptions are performed. Not all bits of the calculated result are needed for these operations. Therefore, the calculated result is transformed to a format that has a significand with p bits or p digits plus auxiliary bits or digits. The auxiliary bits are called the guard bit, the round bit and the sticky bit, [Koren_2008]. Different names can be found in the literature; the meaning of these bits is as follows:
• Additional integer bit. This bit is used in the normalization of binary floating point data when the result is shifted to the right. It is only significant when the integer part of the result can have 2 bits. After the right shift, new values are assigned to the guard, round and sticky bits.
• Guard bit. The guard bit is significant for the binary floating point format and for normalization. It is the bit at position p of the significand, d_p. Some operations can produce a result whose integer part is equal to zero. In this case, a logical left shift is performed as the normalization. After this normalization, the guard bit is not needed and only the round and sticky bits remain. When normalization is not necessary, the guard bit is removed and the round and sticky bits are shifted to the left by one position. A new value of the sticky bit is calculated. More information is in the literature [wiki_1001], [Muller_2010] and [Koren_2008].
• Round bit or round digit. It is used for rounding, [wiki_1001] and [Koren_2008]. At the beginning of the final operations, the round bit follows the guard bit of a binary floating point number. For a decimal floating point number, the round digit is at position p+1 counted from the leading digit.
• Sticky bit. It is calculated and used for rounding. The sticky bit is placed behind the round bit. It is always the logical OR of the remaining least significant bits of the result behind the round bit, [wiki_1001] and [Koren_2008]. The sticky bit determines whether or not the result lies exactly in the middle between two neighboring representable numbers, that is, at half of the ulp (unit in the last place). When the round bit is one and the sticky bit is zero, the result lies exactly in the middle; when the sticky bit is nonzero, the result lies off the middle. Consider, for example, the number 2.500…, which should be rounded to nearest. This number lies exactly in the middle, therefore it can be rounded either down to 2 or up to 3. The number 2.500…01, however, has the sticky bit equal to one, and then the nearest number is 3.
Margin note: Guard, round and sticky bit. □
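[Figure: the calculated result, from MSB to LSB, consists of the additional integer bit, the significand bits 0 to p−1 (precision p) and the auxiliary bits G (guard bit), R (round bit) and S (sticky bit); the sticky bit is the logical OR of the remaining low-order bits.]
Fig. 10-02 Auxiliary bits in binary format, guard, round and sticky bit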
All these bits are used in binary floating point arithmetic, where the scientific form of the significand is used and results have more bits than the precision p. The guard bit and the round bit are obtained simply by labelling the bits in the corresponding positions. The sticky bit is calculated as the logical OR of the remaining bits, Fig. 10-02. Decimal floating point arithmetic creates the preferred form instead. When the number of digits in the result is higher than the precision p, rounding to the leading p digits is applied. Therefore, the round digit and the sticky digit are significant.
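The labelling of the auxiliary bits can be sketched as follows. This Python fragment is only an illustration under stated assumptions: the function name `auxiliary_bits` is hypothetical, and the result is assumed to be given as a string of bits that starts with the leading significand bit d_0 (without an additional integer bit).

```python
def auxiliary_bits(result: str, p: int):
    """Split a result bit string into the p-bit significand
    and the guard, round and sticky bits."""
    result = result.ljust(p + 3, "0")        # pad a short result with zeros
    significand = result[:p]                 # bits d0 .. d(p-1)
    guard = int(result[p])                   # bit d(p)
    round_bit = int(result[p + 1])           # bit behind the guard bit
    sticky = int("1" in result[p + 2:])      # logical OR of all remaining bits
    return significand, guard, round_bit, sticky
```

With p = 4, the bit string 1001100110 is split into the significand 1001, the guard bit 1, the round bit 0 and the sticky bit 1, because at least one of the remaining bits 0110 is set.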
10.1 Rounding
Rounding ensures that the result in the canonical floating point form is as accurate as possible. More information about rounding is in one of the previous chapters. The standard IEEE 754-2008 defines 5 rounding rules:
• Round to nearest:
   - Round to nearest, ties to even.
   - Round to nearest, ties away from zero.
• Directed rounding:
   - Round toward zero.
   - Round toward plus infinity.
   - Round toward minus infinity.
Margin note: Round to nearest, ties to even is the default rounding. □
Rounding in computers is often implemented by adding a constant to the number being rounded. For this purpose, the term ulp is defined, Fig. 10-03. The ulp is the abbreviation of unit in the last place or unit of least precision, [wiki_1004], but other definitions are in the literature, [Harrison_1999], [Muller_2005] and [Muller_2010]. For rounding, it is sufficient to define the ulp as the distance between two neighboring floating-point numbers, [wiki_1001].
[Figure: a real number line with two neighboring floating point numbers in precision p; the distance between them is the ulp, and the point in the middle lies ½ ulp from each of them.]
Fig. 10-03 Definition of ulp - unit in the last place
Formula (1004) defines the ulp for the binary format, where the significand is expressed in the scientific form, and formula (1005) defines the ulp for the decimal format, where the significand is expressed as a coefficient. All decimal formats have the same significand of the ulp.

For binary: $$\mathrm{ulp} = \frac{1}{2^{p-1}} \times 2^{e} \qquad (1004)$$

For decimal: $$\mathrm{ulp} = 1 \times 10^{q} \qquad (1005)$$

Where
• p is the precision of the floating point format.
• e is the exponent.
• q is the quantum.
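Both ulp formulas can be evaluated directly. The following Python sketch is an illustration only; the function names `ulp_binary` and `ulp_decimal` are hypothetical, and the binary variant assumes a nonzero number in the normal range whose exponent e is taken from the scientific form 1.f * 2^e.

```python
import math
from fractions import Fraction

def ulp_binary(x: float, p: int) -> float:
    """Formula (1004): ulp = 1/2**(p-1) * 2**e for a nonzero normal x."""
    _, exp = math.frexp(abs(x))      # abs(x) = m * 2**exp with 0.5 <= m < 1
    e = exp - 1                      # exponent of the scientific form 1.f * 2**e
    return math.ldexp(1.0, e - (p - 1))

def ulp_decimal(q: int) -> Fraction:
    """Formula (1005): ulp = 1 * 10**q."""
    return Fraction(10) ** q
```

For binary32 (p = 24) the ulp of 1.0 is 2^−23, and for a decimal number with the quantum q = −4 the ulp is 0.0001.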
Rounding introduces errors. The first error is that the result loses bits and thus accuracy. Further errors arise when the associative and distributive rules are applied in a computation, and they also depend on the number of rounding steps. Fig. 10-04 shows an example of a decimal computation with the precision p = 7. Applying the associative rule to the addition of three numbers a, b, c, with rounding after each addition, generates different results. The rule round to nearest was used, [wiki_1005].
Note to value ulp and ½ ulp

The binary32 format has the precision p = 24, therefore
• ulp = 1/2^(p−1) * 2^e = 0.0000 0000 0000 0000 0000 001B * 2^e = 0.0000 02H * 2^e.
• ½ ulp = 0.0000 0000 0000 0000 0000 0001B * 2^e = 0.0000 01H * 2^e.
All decimal formats have the same value of the significand:
• ulp = 1 * 10^q and ½ ulp = 0.5 * 10^q. □

a = 1 234 567 * 10^−4 = 123.456 7
b = 3 456 746 * 10^−5 = 34.567 46
c = 1 000 088 * 10^−6 = 1.000 088

| Step | a + b + c (one rounding) | (a + b) + c | (a + c) + b | (b + c) + a |
|---|---|---|---|---|
| First sum | 159.024 248 | 158.024 16 | 124.456 788 | 35.567 548 |
| Round | | 158.024 2 | 124.456 8 | 35.567 55 |
| Plus the third operand | | 159.024 288 | 159.024 26 | 159.024 25 |
| Round | 159.024 2 | 159.024 3 | 159.024 3 | 159.024 2 |

Fig. 10-04 Possible errors of rounding
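The effect of the grouping can be reproduced with the Python `decimal` module. This is only a sketch: it assumes a context with the precision p = 7 and the rounding rule round to nearest, ties to even; the variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

p7 = Context(prec=7, rounding=ROUND_HALF_EVEN)   # decimal arithmetic with p = 7
wide = Context(prec=28)                          # wide enough to be exact here

a, b, c = Decimal("123.4567"), Decimal("34.56746"), Decimal("1.000088")

exact_sum = wide.add(wide.add(a, b), c)          # no rounding error
results = {
    "a+b+c, one rounding": p7.plus(exact_sum),   # a single final rounding
    "(a+b)+c": p7.add(p7.add(a, b), c),          # rounding after each addition
    "(a+c)+b": p7.add(p7.add(a, c), b),
    "(b+c)+a": p7.add(p7.add(b, c), a),
}
for name, value in results.items():
    print(f"{name:20s} {value}")
```

The grouping (a+b)+c ends in the digit 3, whereas the single rounding of the exact sum ends in the digit 2, so the order of the additions and the number of rounding steps both influence the result.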

Margin note: MAC operations. □
Applying one rounding instead of two roundings also generates a result that differs from the previous ones. These problems with rounding are known and are described in detail in the literature [wiki_1002], [wiki_1003] and [Muller_2010]. They concern the MAC (Multiply-ACcumulate) operations, which have three or more input operands. A typical example is the FMA (Fused Multiply-Add) or FMAC (Fused Multiply-Accumulate) operation, [wiki_1002]. The FMA operation is defined as a ← a*b + c and it is frequently used in digital signal processing.
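The difference between one and two roundings can be illustrated in Python. The sketch below assumes that the Python float is the binary64 format; the fused result is emulated with exact rational arithmetic followed by a single conversion to float, which is a stand-in for a hardware FMA and not a call to one.

```python
from fractions import Fraction

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Two roundings: the product a*b = 1 - 2**-60 is rounded to 1.0 first,
# and the following addition then gives exactly zero.
two_roundings = a * b + c

# One rounding: the exact value of a*b + c is computed with rational
# numbers and rounded only once when it is converted to float.
one_rounding = float(Fraction(a) * Fraction(b) + Fraction(c))

print(two_roundings, one_rounding)
```

With two roundings the small term −2^−60 is lost completely, whereas the single rounding preserves it.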
Margin note: Rounding as addition. □
Rounding is performed by adding 0, ½ ulp or 1 ulp to the result. The addition may change all digits of the significand because a carry can be generated, [Koren_2008]. Therefore, after rounding it is necessary to check the interchange format again and to set the exceptions. Fig. 10-05 shows the table for binary rounding to nearest, ties to even, where the rounding depends on the LSB bit of the significand and on the round and sticky bits, [Koren_2008]. Fig. 10-06 shows the table for binary directed rounding, where the rounding depends on the sign of the significand and on the round and sticky bits, [Koren_2008].
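| LSB | R | S | Operation | Note |
|---|---|---|---|---|
| 0 | 0 | 0 | +0 | |
| 0 | 0 | 1 | +0 | |
| 0 | 1 | 0 | +0 | |
| 0 | 1 | 1 | + ½ ulp | One on p-th position |
| 1 | 0 | 0 | +0 | |
| 1 | 0 | 1 | +0 | |
| 1 | 1 | 0 | + ½ ulp | One on p-th position |
| 1 | 1 | 1 | + ½ ulp | One on p-th position |

Fig. 10-05 Rounding to nearest, ties to even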
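The decision for rounding to nearest, ties to even reduces to one logical expression. The following Python sketch uses the hypothetical function name `rne_increment`; it returns 1 when ½ ulp (a one in the round bit position) has to be added before the auxiliary bits are dropped, and 0 otherwise.

```python
def rne_increment(lsb: int, r: int, s: int) -> int:
    """Round to nearest, ties to even: 1 means add 1/2 ulp, 0 means add nothing."""
    # Below the middle (r = 0): nothing is added.
    # Above the middle (r = 1, s = 1): 1/2 ulp is added.
    # Exactly in the middle (r = 1, s = 0): 1/2 ulp is added only
    # when the LSB is 1, so that the rounded LSB becomes even.
    return int(r == 1 and (s == 1 or lsb == 1))
```

Adding a one in the round bit position when R = 1 produces a carry into the LSB, so the kept significand grows by one ulp after the auxiliary bits are removed.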
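| Sign | R | S | To zero | To +∞ | To −∞ | Note |
|---|---|---|---|---|---|---|
| + | 0 | 0 | +0 | +0 | +0 | |
| + | 0 | 1 | +0 | + 1 ulp | +0 | |
| + | 1 | 0 | +0 | + 1 ulp | +0 | |
| + | 1 | 1 | +0 | + 1 ulp | +0 | |
| − | 0 | 0 | +0 | +0 | +0 | |
| − | 0 | 1 | +0 | +0 | + 1 ulp | |
| − | 1 | 0 | +0 | +0 | + 1 ulp | |
| − | 1 | 1 | +0 | +0 | + 1 ulp | |

Fig. 10-06 Directed rounding scheme (operation for directed rounding)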
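The directed rounding scheme can be sketched in the same way. The function name `directed_increment` and the mode strings are hypothetical; the sketch assumes a sign-and-magnitude representation, so the returned number of ulps is added to the magnitude of the significand.

```python
def directed_increment(sign: int, r: int, s: int, mode: str) -> int:
    """Return 1 when 1 ulp must be added to the magnitude, otherwise 0.

    sign: 0 for plus, 1 for minus; mode: "zero", "+inf" or "-inf".
    """
    inexact = bool(r or s)           # some discarded bit is nonzero
    if mode == "zero":
        return 0                     # the discarded bits are simply dropped
    if mode == "+inf":
        return int(inexact and sign == 0)
    if mode == "-inf":
        return int(inexact and sign == 1)
    raise ValueError(f"unknown rounding mode: {mode}")
```

For a negative number, rounding toward minus infinity increases the magnitude, which is why 1 ulp is added in that column of the table.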

10.2 Exception
The exceptions characterize the result of a floating point operation, [Muller_2010] and [IEEE 754-2008]. Practical implementations use more exceptions than the standard IEEE 754-2008 defines. Exceptions are set first by the operations or functions themselves and then by normalization and rounding. More information is in the previous chapter. The exceptions according to the standard are:
• Invalid operation.
• Division by zero.
• Overflow.
• Underflow.
• Inexact.
Normalization and rounding can cause an overflow or an underflow. Normalization is a shift with a correction of the exponent, and rounding is an addition. The result of both operations can be out of the range of the representation. Rounding sets the inexact exception whenever the rounded result differs from the exact result.
10.3 Operation on result

The calculated result does not have to correspond to the canonical interchange floating point format. Normalization, rounding and the setting of exceptions are performed on the calculated result. These operations use the guard, round and sticky bits. The final steps of each floating point operation are, [xxxxxx mudawar ne]:
• Post-normalization. The result is adjusted to the normal form by shifting the significand with a correction of the exponent. In some cases, the guard bit is used.
• Checking exceptions.
• Rounding. The post-normalized result is rounded. The standard IEEE 754 defines the possible rounding rules. The default is rounding to nearest, ties to even.
• Repeat from the first step until the result and the exceptions no longer change.
| | Before the shift | After the shift |
|---|---|---|
| In decimal | 82.345 × 10^123 | 8.2345 × 10^124 |
| | 0.082345 × 10^126 | 8.2345 × 10^124 |
| In binary | 10.101 × 2^36 | 1.0101 × 2^37 |
| | 0.010101 × 2^39 | 1.0101 × 2^37 |

Fig. 10-07 Principles of the exponent adjustment when a number is shifted

The normalization, or the creation of the preferred significand, is realized by shifting with a correction of the exponent. Fig. 10-07 shows the basic principles of the normalization:
• When the number is shifted to the right, the exponent is incremented by one for each position.
• When the number is shifted to the left, the exponent is decremented by one for each position.
The sticky bit is the logical OR of the previously calculated bits, if they exist.
G - Guard bit, R - Round bit, S - Sticky bit

| Step | Bits | Description |
|---|---|---|
| 1 | 10.011, G R S = 1 0 0 | This is the calculated result with the guard, round and sticky bits marked. |
| 2 | 1.001, followed by 1 1 0 0 | Normalization by shifting to the right, after which the guard bit is rejected. A new round bit and a new sticky bit are calculated. |
| 3 | 1.001, R S = 1 1 | The position of the round bit is moved by one to the left and the new sticky bit is the logical OR of the remaining bits. |
| 4 | 1.001 1 + 0.000 1 = 1.010 | Rounding to nearest, ties to even means adding ½ ulp to the number, that is, 1 in the R position. |

Fig. 10-08 Shifting and calculation of a round and a sticky bit
The example of the final operation is in Fig. 10-08. The intended precision p is 4 bits. The
result has 2 bits in the integer part and a guard, a round and a sticky bit are labeled. The
first operation is the logical right shift to normalize the number. A guard bit is rejected be-
cause it is not needed. A round bit has a new shifted value. The new value of the sticky bit
is the logical OR of the previous values of the round and sticky bits. The next step is round-
ing. The result is positive and lies in the upper half of ulp. Fig. 10-05 contains the rules for
rounding to nearest, ties to even. The LSB, round and sticky bits are equal to 1, therefore
1/2ulp is added. After these operations, the auxiliary bits lose their function.
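
The final steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the significand is held as a positive integer, the function name round_nearest_even is hypothetical, and exceptions are not modelled. The call at the end uses the value from Fig. 10-08.

```python
def round_nearest_even(sig: int, p: int) -> tuple[int, int]:
    """Round a positive integer significand to p bits.

    Returns (kept, shift) so that the rounded value is kept * 2**shift.
    """
    shift = sig.bit_length() - p          # number of low bits to drop
    if shift <= 0:
        return sig, 0                     # already fits into p bits
    kept = sig >> shift                   # the p most significant bits
    round_bit = (sig >> (shift - 1)) & 1  # first dropped bit
    sticky = int(sig & ((1 << (shift - 1)) - 1) != 0)  # OR of the remaining bits
    # Add one ulp above the midpoint, or exactly at the midpoint when LSB is odd.
    if round_bit and (sticky or (kept & 1)):
        kept += 1
    if kept.bit_length() > p:             # rounding produced a carry: renormalize
        kept >>= 1
        shift += 1
    return kept, shift


# 10.011100 (binary) from Fig. 10-08, written without the radix point, p = 4
print(round_nearest_even(0b10011100, 4))
```

The returned pair (10, 4) stands for the bit pattern 1010 with four dropped positions, which corresponds to 1.010 after the radix point is restored.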
10.4 Minifloat floating point format

Following text deals with the basic mathematical operations. For better understanding the
examples, the minifloat format of floating point is used, Fig. 10-09. A short definition of
minifloat format is:
 Binary numeral system.

 Word size k is 8 bits. Byte is used.
 Precision p is 4.

For 8 bits, p = 4

Bit layout:   S    | E E E E         | T T T
              Sign | Biased exponent | Trailing field

 The bias is b=7, emax is +7 and emin is -6.
 The precision is p = 4, trailing field has 3 bits and MSB bit of the significand is hidden.

Plus numeral line:
Values                                     Form        Biased exponent
0.0, 0.001 x 2^-6 to 0.111 x 2^-6          Subnormal   E = 0x0
1.000 x 2^-6 to 1.111 x 2^+7               Normal      E is in the range from 1 to 0xE
Infinity (T = 0)                           Infinity    E = 0xF
NaN (T ≠ 0)                                NaN         E = 0xF
Fig. 10-09 Minifloat definition
 Exponent. The biased exponent has 4 bits. The bias is 7, maximal exponent emax is
7, minimal exponent emin is -6, (emin = 1 - emax).
 NaN. The biased exponent E is 0xF and the trailing field is not zero.
 Infinity. The biased exponent E is 0xF and the trailing field is zero.
 Normal form of finite numbers. The MSB bit is 1 as a hidden bit. The biased expo-
nent E is in the range from 1 to 0xE. Then, the exponent e is in the range
from -6 to +7.
 Subnormal form of finite numbers. The MSB bit is 0 as a hidden bit. The biased ex-
ponent E is 0x0, the exponent e is -6.
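
The definition above can be checked with a small decoder. The following Python sketch is an illustration only; the function name decode_minifloat is hypothetical and the bit layout S EEEE TTT with bias 7 is taken from Fig. 10-09.

```python
import math

BIAS = 7  # bias of the 4-bit biased exponent


def decode_minifloat(byte: int) -> float:
    """Decode an 8-bit minifloat (S EEEE TTT) into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    E = (byte >> 3) & 0xF   # biased exponent
    T = byte & 0x7          # trailing field
    if E == 0xF:            # infinity or NaN
        return sign * math.inf if T == 0 else math.nan
    if E == 0:              # subnormal: hidden bit is 0, exponent e is -6
        return sign * (T / 8) * 2.0 ** -6
    return sign * (1 + T / 8) * 2.0 ** (E - BIAS)  # normal: hidden bit is 1


print(decode_minifloat(0x38))  # 1.000 x 2^0
print(decode_minifloat(0x77))  # 1.111 x 2^+7, the largest finite value
print(decode_minifloat(0x01))  # 0.001 x 2^-6, the smallest subnormal value
```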
The addition of 12333 * 10^1 + 12665 * 10^-2 in decimal32

  12333000 * 10^-2     Alignment of exponents to the smaller one.
     12665 * 10^-2

  12333000             Addition, the exponents are not needed. Significand is understood
+    12665             as an integer, radix point is not needed.
  12345665

  1234566              The sum has 8 digits and the required precision is 7 digits.
                       Decimal32 format is used. Exponent has to be incremented.

            R S        G - Guard bit, R - Round bit, S - Sticky bit
  1234566 | 5 0        A round bit and a sticky bit are marked. A sticky bit is zero.

  1234566              Rounding to nearest, ties to even. The number lies in the
                       middle of ulp. No addition.

  1234566 * 10^-1      Result with exponent. Exponent was incremented.
Fig. 10-10 Decimal addition

10.5 Addition and subtraction

Addition and subtraction require that the exponents of both floating point numbers are the
same. Therefore the exponents have to be aligned before the addition or the subtraction. For
the binary floating point, the smaller exponent is increased to the higher exponent by
shifting the significand, [Koren_2008]. It means that the significand is shifted to right and
the exponent is incremented. For the decimal floating point, it is better to decrease the
higher exponent to the smaller one. After that, the operation is performed and in the end, the
normalization, rounding and setting of exceptions are performed. The size of the result
depends on the difference of operand exponents; therefore the result can have more bits or
digits than the operands. Fig. 10-10 and Fig. 10-11 show examples of addition and subtraction.
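
The decimal addition of Fig. 10-10 can be reproduced with the decimal module of Python. This is a sketch under an assumption: a context with a precision of 7 digits and rounding to nearest, ties to even, imitates the significand of decimal32, but the module does not implement the decimal32 encoding or its exponent range.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# 7 significant digits and ties-to-even, as used for decimal32 in the text
ctx = Context(prec=7, rounding=ROUND_HALF_EVEN)

total = ctx.add(Decimal("12333E+1"), Decimal("12665E-2"))
print(total)             # the exact sum 123456.65 rounded to 7 digits
print(total.as_tuple())  # digits of the significand and the exponent
```

The exact sum lies in the middle between two representable values, so the even significand 1234566 is kept and the exponent is -1.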
The subtraction of 1.001 * 2^-3 - 1.011 * 2^-2 in minifloat

   0.1001 * 2^-2       Alignment of exponents, an own format is used.
  -1.0110 * 2^-2

     0 1001            Radix point is not needed. The following calculation is made
   - 1 0110            with the integer number, fixed point is used. It is necessary to
                       change the second operand into two's complement. Two bits
   000 1001            are added as leading bits, one is for the sign and the second
  +110 1010            one is for the increased size of the sum. After that, the
                       addition is calculated.
   111 0011            The result is negative, two's complement is used.
   000 1101            The sign of the result is minus.

       G R S           G - Guard bit, R - Round bit, S - Sticky bit
 0 110 1 0 0           A guard bit, a round bit and a sticky bit are marked. The round
                       and sticky bits are zero.
 1 101 0 0             The normalization is applied. The significand is shifted to left,
                       and the exponent is decremented.
 1 101                 Rounding to nearest, ties to even is applied. No adding.
-1.101 * 2^-3          The result with the exponent.
Fig. 10-11 Subtraction
The mathematical definition of the addition and the subtraction is given by formulas (1006)
and (1007). In both formulas, the finite value of floating point data is supposed. The opera-
tions with infinite number and NaN are described in detail by [IEEE 754-2008].
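$$\mathrm{sum} = (\pm m_1 \times 2^{E_1}) + (\pm m_2 \times 2^{E_2}) = (\pm m_1 \times 2^{E_1}) + (\pm m_3 \times 2^{E_2 + (E_1 - E_2)}) = \big((\pm m_1) + (\pm m_3)\big) \times 2^{E_1} \qquad (1006)$$

$$\mathrm{dif} = (\pm m_1 \times 2^{E_1}) - (\pm m_2 \times 2^{E_2}) = (\pm m_1 \times 2^{E_1}) - (\pm m_3 \times 2^{E_2 + (E_1 - E_2)}) = \big((\pm m_1) - (\pm m_3)\big) \times 2^{E_1} \qquad (1007)$$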

Where
 Both floating point numbers are rational numbers.

 sum is a result of the addition.
 dif is a result of the subtraction.
 m1, m2 are significands.
 m3 is the significand m2 shifted to right by (E1-E2) positions, so that its biased
exponent is aligned to E1, (E1 = E2+(E1-E2)).
 E1 is a biased exponent which is higher than or equal to E2.
 E2 is a biased exponent which is less than or equal to E1.
 E2 + (E1-E2) is a biased exponent which is equal to E1.
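
Formulas (1006) and (1007) can be sketched in Python with integer significands. The sketch rests on assumptions that are not part of the text: the function name add_fp is hypothetical, the significands are signed integers, the exponents are plain (not biased) exponents of the integer significand, and the alignment is done exactly to the smaller exponent, which avoids losing bits before rounding. Subtraction is addition with a negated second significand.

```python
def add_fp(m1: int, e1: int, m2: int, e2: int, p: int = 4) -> tuple[int, int]:
    """Add m1 * 2**e1 and m2 * 2**e2; return (m, e) with m rounded to p bits."""
    e = min(e1, e2)                               # common exponent after alignment
    total = (m1 << (e1 - e)) + (m2 << (e2 - e))   # exact integer sum
    sign = -1 if total < 0 else 1
    mag = abs(total)
    shift = mag.bit_length() - p                  # bits that do not fit into p
    if shift > 0:
        kept = mag >> shift
        round_bit = (mag >> (shift - 1)) & 1
        sticky = mag & ((1 << (shift - 1)) - 1) != 0
        if round_bit and (sticky or kept & 1):    # nearest, ties to even
            kept += 1
        if kept.bit_length() > p:                 # carry caused by rounding
            kept >>= 1
            shift += 1
        mag, e = kept, e + shift
    return sign * mag, e


# 1.001 x 2^-3 - 1.011 x 2^-2, written as 1001 x 2^-6 and -(1011 x 2^-5)
print(add_fp(0b1001, -6, -0b1011, -5))
```

The printed pair (-13, -6) is the bit pattern 1101 with exponent -6, that is -1.101 x 2^-3.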
Hardware realization of the floating point addition and subtraction has two parts. The first
part deals with the exponent and the second part deals with the addition and the subtraction
of significands. The binary floating point addition and subtraction use the integer binary
adder and two’s complement. The realization is described in detail in literature
[Muller_2010], [Koren_2008] and [Ergovac_Lang_2004].
The multiplication of (-1.11 * 2^-5) * (+1.1 * 2^+3) in minifloat

  (-1.110 * 2^-5)      The sign of the product is minus.
 *(+1.100 * 2^+3)      The exponent of the product is -2.

       1110
     * 1100            Multiplication of integer numbers, where operands have the
     111000            scaling factor of 1/2^3.
    1110
   10101000

  10.101000            The product has the scaling factor of 1/2^6.

  1.010 1 00           The normalization is applied. The significand is shifted to
                       right, and the exponent is incremented.

        R S            G - Guard bit, R - Round bit, S - Sticky bit
  1.010 1 0            The round bit and sticky bit are marked. The sticky bit is
                       logical OR of the remaining bits.

  1.010                Rounding to nearest, ties to even is applied. No adding.

 -1.010 * 2^-1         Result with the exponent. The exponent was incremented.
Fig. 10-12 Binary multiplication
10.6 Multiplication
Multiplication of two floating point numbers has several parts. The first one is a separate
calculation of the resulting sign. On the bit level, it is logical XOR operation of both sign
bits. The second part is a separate calculation of the exponent of the result. It is the
addition of both exponents. The next step is a separate multiplication of the significands.
The fixed point principles are used, therefore the binary multiplication of integer numbers
is applied. The operands have the scaling factor of 1/2^(p-1), then the product has the
scaling factor of 1/2^(2(p-1)). The scaling factor of the product determines the position of
radix point. In the next steps, the normalization, rounding and setting of exceptions are
made. Fig. 10-12 and Fig. 10-13 show the examples of multiplication in the binary and decimal
numeral systems.
Multiplication of floating point numbers is defined by mathematical formula (1008), where
both floating point data are finite values. When an operand is equal to infinity or NaN,
then the multiplication is described by [IEEE 754-2008]. The product m1*m2 has a size of 2p
digits, where p is precision.

$$\mathrm{product} = \big((-1)^{S_1}\, m_1 \times 2^{E_1}\big) \cdot \big((-1)^{S_2}\, m_2 \times 2^{E_2}\big) = (-1)^{S_1 \,\mathrm{xor}\, S_2}\, (m_1 \cdot m_2) \times 2^{E_1 + E_2} \qquad (1008)$$
Where
 product is a result of multiplication.

 S1, S2 are signs.
 m1 and m2 are significands.
 E1 and E2 are biased exponents.
 xor is logical operation xor.
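
Formula (1008) can be sketched in Python as follows. The names are hypothetical and several assumptions are made: the significands are positive integers, the exponents are plain (not biased) exponents of the integer significand, and only rounding to nearest, ties to even is modelled. The call at the end uses the operands of Fig. 10-12, written as 1110 x 2^-8 and 1100 x 2^0.

```python
def multiply_fp(s1: int, m1: int, e1: int,
                s2: int, m2: int, e2: int, p: int = 4) -> tuple[int, int, int]:
    """Multiply (-1)**s1 * m1 * 2**e1 by (-1)**s2 * m2 * 2**e2, rounded to p bits."""
    sign = s1 ^ s2                 # XOR of the sign bits
    product = m1 * m2              # up to 2p bits
    e = e1 + e2                    # exponents are added
    shift = product.bit_length() - p
    if shift > 0:
        kept = product >> shift
        round_bit = (product >> (shift - 1)) & 1
        sticky = product & ((1 << (shift - 1)) - 1) != 0
        if round_bit and (sticky or kept & 1):   # nearest, ties to even
            kept += 1
        if kept.bit_length() > p:                # carry caused by rounding
            kept >>= 1
            shift += 1
        product, e = kept, e + shift
    return sign, product, e


print(multiply_fp(1, 0b1110, -8, 0, 0b1100, 0))
```

The result (1, 10, -4) is a negative number with the bit pattern 1010 and exponent -4, that is -1.010 x 2^-1.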
The multiplication of (+123123 * 10^-15) * (-789 * 10^+9) in decimal32

  (+123123 * 10^-15)   The sign of the product is minus.
 *(   -789 * 10^+9)    The exponent of the product is -6.

    123123
  *    789             Multiplication.
  97144047

  9714404              The product has 8 digits, the required precision is p = 7.
                       The product is shifted to right with incrementing exponent.

            R S        G - Guard bit, R - Round bit, S - Sticky bit
  9714404 | 7 0        The round and the sticky are derived from the original product.

  9714404              Rounding to the nearest is performed by adding ulp.
+ 0000001
  9714405

 -9714405 * 10^-5      Result with the exponent. Exponent was incremented.
Fig. 10-13 Decimal multiplication
The hardware realization of the binary multiplication can be a combinational logical circuit,
which can contain p-1 binary ripple-carry adders. For the binary64 format, where the
precision is p = 53, the multiplier has 52 binary ripple-carry adders. This realization has a
high propagation delay and the design of the binary multiplier with a small propagation delay
is described in literature [Muller_2010], [Koren_2008] and [Ergovac_Lang_2004].

10.7 Division
Division of floating point numbers is defined by mathematical formula (1009), where both
floating point data are finite values. The division with the zero, infinity and NaN is described
by [IEEE 754-2008].
$$\mathrm{quotient} = \frac{(-1)^{S_1}\, m_1 \times 2^{E_1}}{(-1)^{S_2}\, m_2 \times 2^{E_2}} = (-1)^{S_1 \,\mathrm{xor}\, S_2}\, (m_1 / m_2) \times 2^{E_1 - E_2} \qquad (1009)$$
Where
 quotient is a result of division.

 S1, S2 are signs.
 m1 and m2 are significands.
 E1 and E2 are biased exponents.
 xor is logical operation xor.
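
Formula (1009) can be sketched in Python with integer division. This is not one of the hardware algorithms from the cited literature; it is a software illustration under assumptions: the function name divide_fp is hypothetical, the significands are positive integers with plain (not biased) exponents, the dividend is scaled so that the quotient has more than p bits, and a nonzero remainder is recorded as a sticky bit.

```python
def divide_fp(s1: int, m1: int, e1: int,
              s2: int, m2: int, e2: int, p: int = 4) -> tuple[int, int, int]:
    """Divide (-1)**s1 * m1 * 2**e1 by (-1)**s2 * m2 * 2**e2, rounded to p bits."""
    if m2 == 0:
        raise ZeroDivisionError("division by zero is handled by IEEE 754 rules")
    sign = s1 ^ s2
    # Scale the dividend so that the integer quotient has at least p + 1 bits.
    k = p + 1 + max(0, m2.bit_length() - m1.bit_length())
    q, r = divmod(m1 << k, m2)
    q = (q << 1) | int(r != 0)        # append the sticky bit from the remainder
    e = e1 - e2 - k - 1
    shift = q.bit_length() - p
    if shift > 0:
        kept = q >> shift
        round_bit = (q >> (shift - 1)) & 1
        sticky = q & ((1 << (shift - 1)) - 1) != 0
        if round_bit and (sticky or kept & 1):   # nearest, ties to even
            kept += 1
        if kept.bit_length() > p:
            kept >>= 1
            shift += 1
        q, e = kept, e + shift
    return sign, q, e


# 1.000 x 2^0 / 1.100 x 2^0, written as 1000 x 2^-3 and 1100 x 2^-3
print(divide_fp(0, 0b1000, -3, 0, 0b1100, -3))
```

The result (0, 11, -4) is the bit pattern 1011 with exponent -4, that is 1.011 x 2^-1 = 0.6875, the 4-bit value nearest to 2/3.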
The floating point division only has the quotient, not the remainder. The algorithms of the
floating point division are described in detail in literature [Muller_2010], [Koren_2008],
[Ergovac_Lang_2004] and [wiki_1007]. The hardware realization of the floating point division
is a digital synchronous system, where the algorithm is implemented by FSM – Finite
State Machine. Among the basic arithmetic operations, the division is considered the slowest
one.
10.8 References
[EETimes_1001] Clive Maxfield: Design How-To, An introduction to different round-
ing algorithms; EETimes 1/4/2006,
http://www.eetimes.com/document.asp?doc_id=1274485&page_number=
1; on line 2014-08-04

[IEEE 754-2008] IEEE Std 754™-2008, IEEE Standard for Floating-Point Arithmetic, 29 August
2008, revision of IEEE 754 – 1985
[Internet_1001]Decimal Arithmetic Specification, Arithmetic operations, material of IBM,

http://speleotrove.com/decimal/daops.html; on line 2014-08-04
[Harrison_1999] John Harrison; A machine-checked theory of floating point arithme-

tic; Proceedings of the 1999 International Conference on Theorem Proving
in Higher Order Logics, Nice, France, 1999, TPHOLs'99. Springer LNCS 1690,
pp. 113-130, 1999; http://www.cl.cam.ac.uk/~jrh13/papers/fparith.pdf; on
line 2014-08-05
[Maxfield_2005] Clive “MAX” Maxfield, Alvin Brown: How Computers Do Math;

Wiley Interscience, A John Wiley & Sons, INC., publication, 2005,
ISBN-13-978-0471-73278-5
[Muller_2005] Jean-Michel Muller: On the definition of ulp (x); ACM Transactions on

Mathematical Software, Vol. V, No. N, November 2005;

http://ljk.imag.fr/membres/Carine.Lucas/TPScilab/JMMuller/ulp-toms.pdf;
on line 2014-08-05

8176-4704-9; e-ISBN 978-0-8176-4705-6
[Koren_2008] Israel Koren: Computer Arithmetic Algorithms; A. K. Peters 2008;

ISBN 1-56881-160-8
[Mudawar_2014] Muhamed Mudawar: Floating point; presentation for subject Computer

architecture; King Fahd University of Petroleum and Minerals;
http://opencourseware.kfupm.edu.sa/colleges/ccse/coe/coe308/files%5C2
-Lecture_Notes_06-FloatingPoint.pdf; on line 29-01-2014
[wiki_1001] Floating point; http://en.wikipedia.org/wiki/Floating_point; on line 2014-

08-06
[wiki_1002] Multiply–accumulate operation;

http://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation;
on line 2014-08-05
[wiki_1003] Accuracy problems;

http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems; on line
2014-08-06
[wiki_1004] Unit in the last place; http://en.wikipedia.org/wiki/Unit_in_the_last_place;

on line 2014-01-31
[wiki_1005] Rounding; http://en.wikipedia.org/wiki/Rounding; on line 2014-08-06
[wiki_1006] IEEE floating point; http://en.wikipedia.org/wiki/IEEE_floating_point; on

line 2014-08-06

2014-08-26

11 Characters and Unicode
At the beginning of the communication based on the electricity principles, the text was
transferred by electrical impulses and each letter of the alphabet was defined by a
sequence of impulses. Morse alphabet was the first widespread code and it was used for
transferring the text, [wiki_1101]. The following important code was the 5-bit code that
was used in the telex (teletype machine) for transferring the text, [wiki_1102], [wiki_1133].
And in the 1960s, the ASCII (American Standard Code for Information Interchange) was defined
and standardized. This is an important milestone in the history of encoding the characters.
The ASCII code is used in the communication and computers. The original ASCII code only
contains the American alphabet. Later, when personal computers were introduced, the ASCII
code was modified by adding national alphabets. Today, the Unicode is a successor.
A current text on the monitor not only has text information but a graphical meaning and
properties as well. This text can be colored and different fonts, typeface and other features
can be used. A displayed text is not only a technical matter but rather a graphical design
that comes from the printing industry. In the time of computers, some terms from the
printing industry have changed or have a new meaning.
11.1 Terminology
The typography is a predecessor of today’s display of information by a computer. In the
information technology, a new terminology is used or the old terms have a new meaning. A
lot of meanings are taken from the Unicode.
ASCII    Code page  Unicode   Textual definition                      Basic  Other possible
(7-bit)  437        position  Unicode name                            glyph  glyphs
x61      x61        U+0061    LATIN SMALL LETTER A                    a      a, a, a, a, a, a
x39      x39        U+0039    DIGIT NINE                              9      9, 9, 9, 9, 9
x07      x07        U+0007    BELL, control character
x0A      x0A        U+000A    LINE FEED (LF), control character
         xBD        U+255C    BOX DRAWINGS UP DOUBLE AND LEFT SINGLE  ╜      ╜
         xAC        U+00BC    VULGAR FRACTION ONE QUARTER             ¼      ¼, ¼, ¼, ¼
(In the original figure, the other possible glyphs are the same character in different fonts.)
Fig. 11-01 Characters and their definition and encoding

Character is a term which has changed its meaning. In the information technology, it is a
basic information unit that corresponds to a symbol that has, typically, a phonetic or a
pictographic meaning. It may be a Latin letter, a Chinese sinogram (letter), a digit, a
punctuation mark, a dingbat and so on. A character can also have a control meaning, e.g. in
printing, such as line feed, tabulator and so on. Sometimes, these characters are called
format characters. More information is in literature [Internet_1102], [wiki_1103] and
[wiki_1104]. Examples of characters are in Fig. 11-01. Unicode defines a character by 4
sentences, [Unicode_1101]:
 “Character is the smallest component of written language that has semantic value;
refers to the abstract meaning and/or shape, rather than a specific shape (see also
glyph), though in code tables some form of visual representation is essential for
the reader’s understanding.
 Synonym for abstract character.
 The basic unit of encoding for the Unicode character encoding.
 The English name for the ideographic written elements of Chinese origin.”
Note to the character
A character has a name and a basic glyph. It carries no information about the properties,
for example fonts, color, size and so on. □
Glyph is a way of representing a character. Glyph defines the shape of a character,
literature [Internet_1101], [Unicode_1102] and [wiki_1105]. The difference between the
character and the glyph is shown in Fig. 11-01 and Fig. 11-02, [Unicode_1103]. One character
can have one or more glyphs and vice versa.
Character Other possible glyphs Note

a a, a, a, a, a, a One character and more glyphs
9 9, 9, 9, 9, 9 One character and more glyphs
(c) © More characters transformed to one glyph
Pts ₧ More characters transformed to one glyph
Ã ~a One character and two glyphs
Fig. 11-02 Characters and their definition and encoding
Character encoding is the assignment of one element of some kind of encoding system to a
character. A character can be encoded by a number, a sequence of electrical pulses or flags
and so on. In computer, character encoding is an assignment of a number that is called the
code of a character. Then each character is defined by its code, textual definition and by a
basic glyph, [wiki_1106] and [Unicode_1104].
[Wiki_1135] defines the character encoding in this way: “Computers and communication
equipment represent characters using a character encoding that assigns each character to
something — an integer quantity represented by a sequence of bits, typically — that can be
stored or transmitted through a network.”
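
The relation between a character, its Unicode name and its code in different encodings can be inspected with the standard library of Python. The sketch below is an illustration; the function name describe is hypothetical, and the characters are those listed in Fig. 11-01. The codecs "cp437" and "utf-8" are standard Python codecs.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, bytes, bytes]:
    """Return the code point, the Unicode name and two encodings of one character."""
    return (
        f"U+{ord(ch):04X}",        # Unicode position
        unicodedata.name(ch),      # textual definition (Unicode name)
        ch.encode("cp437"),        # code in code page 437
        ch.encode("utf-8"),        # code in UTF-8
    )


for ch in ["a", "9", "\u255c", "\u00bc"]:
    print(describe(ch))
```

The same character has one name and one code point, but its code differs between encodings, for example one byte in code page 437 and two or three bytes in UTF-8 for the last two characters.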
Character set is a collection of characters and their encoding scheme that is used for
representing information. The ASCII character set is famous; the next sets are the Unicode
set and others, [Unicode_1105]. Some literature does not make differences between the
character encoding and the character set.

Font is “a collection of glyphs that are used for the visual depiction of character. A font is
often associated with a set of parameters (for example, size, posture, weight, and
serifness), which, when set to particular values, generate a collection of imagable glyphs”,
[Unicode_1106]. Wikipedia defines a computer font as a file which has a set of glyphs,
[wiki_1107].
Script is “a collection of letters and other written signs or diacritics that are used to
represent textual information in one or more writing systems (languages)”, [Unicode_1107].
For example, the Czech script is defined by the Czech alphabet, German script is defined by
the German alphabet and so on. In result, all these national scripts are subsets of one
script, the Latin script. It means that the Latin script contains the definition of all
national letters in the languages where the Latin alphabet is a base. The same applies to the
Cyrillic script, where Russian is written with a subset of the Cyrillic script; Ukrainian is
written with a different subset. Some countries have more scripts, e.g. the Japanese writing
system uses several scripts, [Unicode_1107].
Typeface, [wiki_1108], in typography, means a group of fonts where all glyphs of a character
have the same properties, signs or slope. In other words, the typeface defines common
design features that are shared by all fonts with the same typeface. Therefore more fonts
belong to a typical typeface and each kind of typeface has its name. Famous typefaces are
Serif, Sans Serif (also known as gothic), handwriting, calligraphy, console and others. The
examples of the typefaces are in Fig. 11-03.
Text Typeface Font

The quick brown fox jumps over the lazy dog Serif Times New Roman
The quick brown fox jumps over the lazy dog Sans Serif Arial
The quick brown fox jumps over the lazy dog Handwriting Blackadder ITC
The quick brown fox jumps over the lazy dog Console Consolas
The quick brown fox jumps over the lazy dog Calligraphy Lucida calligraphy
Fig. 11-03 Typefaces
Each font has four basic typefaces, normal, italic, bold and italic-bold, Fig. 11-04. These
typefaces are historical but, in computer, these typefaces are well-known and they are
assigned to each font. They are defined as separate files or by means of mathematics for
vector fonts. The definition of these typefaces as separate fonts is preferred to reach a
better quality of displayed glyphs.
Text Typeface Font

The quick brown fox jumps over the lazy dog Normal Calibri
The quick brown fox jumps over the lazy dog Italic or cursive Calibri
The quick brown fox jumps over the lazy dog Bold Calibri
The quick brown fox jumps over the lazy dog Italic-Bold Calibri
Fig. 11-04 Typefaces in each font
Proportional and non-proportional typeface is a basic feature of each font, [wiki_1108].
Fig. 11-05 shows a difference between these two terms. A proportional typeface means
that each letter has a different width and a non-proportional one has a constant width.
Other terms for a non-proportional typeface are the monospaced, fixed space and console
typeface.
Text Typeface
imlw imlw Proportional
imlw imlw Non-proportional or monospaced or console
Fig. 11-05 The width of glyphs
Letterpress printing is a relief printing and it is the oldest technology in the printing
industry, [wiki_1109] and [wiki_1110]. This principle is used in a classical typewriter,
Fig. 11-06, where the text is produced by pressing the matrix onto paper. This technology was
used at the beginning of computers and today it is used sporadically.
Fig. 11-06 Relief printing in the typewriter
Source: http://en.wikipedia.org/wiki/File:Typewriters.jpg
Dot is the smallest element of graphics from the technical point of view. Dot is
controllable; it means each dot has its address, intensity of color and other properties. Dot
has different meanings according to the computer equipment. In printer, a dot is the smallest
point of one color that can be printed. In LCD monitor, a dot is the smallest point of one
color and a pixel has three dots, red, green and blue. A monochrome monitor only has dots.
Pixel is the smallest element of graphics, a picture or digital art. Each pixel is
controllable, it means each pixel has its address, color and other properties, [wiki_1111].
Pixels are typically used as elements of a graphic file and for defining the properties of
LCD monitors, scanners and cameras.
Fig. 11-07 Pixel in graphics
DPI, Dots per Inch, means the number of dots in one inch. This parameter is indicated in the
specification of printers or scanners. In case of inkjet printers, dot is a drop.
PPI, Pixels per Inch, means the number of pixels in one inch. This parameter can be found in
the specification of monitors, cameras and as a parameter of raster graphic files or programs.

Pixel graphics, this term relates to the definition of graphics in the information technology,
where all graphical objects are defined by pixels. The term of raster graphics is also used for
this principle.
Vector graphics, this term relates to the definition of graphics, where all graphical objects
are defined by geometrical primitives such as points, lines, curves, circles, and so on. Every
geometrical primitive can be colored and it has other properties. It means that all graphical
objects are defined by maths as lines, vectors, Bezier curves and so on. A vector graphic
principle as an output representation is used by a cutting plotter.
3D is the newest 3-dimensional technology and it can be defined by pixels or vectors. To-
day, 3D devices are 3D scanners, 3D monitors, 3D cameras and 3D printers.
11.2 Fonts
Font is a set that defines glyphs for any character. The first fonts were used in the
typography industry, where the font is defined mechanically by a relief. In the computer
area, a letterpress font was used at the beginning of printing for electric typewriters, raw
and daisywheel printers. When terminals or monitors began to be used, the font was defined by
a file. Today, it is possible to find three main definitions of fonts, a bitmap and two
vector definitions, [wiki_1107].
 Bitmap fonts consist of the definition of dots or pixels in the matrix for representing
each glyph. A bitmap font is also called a raster font.
 Outline fonts define each glyph by its outer curves. Bezier curves are used for the
mathematical definition of glyphs. An outline font is also called a vector font.
 Stroke fonts use a series of specified lines, shapes and additional information to de-
fine a final glyph. A glyph consists of more shapes.
Fig. 11-08 Examples of raster letters and semi-graphic symbols
11.3 Bitmap font

Bitmap fonts are also known as raster fonts. Raster fonts are monospaced fonts where each
glyph has the same width that is defined by the size of the matrix. The size of the matrix is
defined by the type of a device or by the company. The first usable size of the matrix was 5
times 7 dots and today the normal size is 12 times 16 dots. Each glyph is rendered in this
matrix. The matrix 5 times 7 is suitable for the capital English alphabet and it has little
space for the punctuation and for the small and capital letters together. Today, a larger
matrix is used, with more space for small and capital letters, punctuation, etc. The spaces
between the glyphs and rows are composed into the matrix. Examples of possible glyphs are in
Fig. 11-08.
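
The principle of a bitmap glyph can be illustrated by a short Python sketch. The glyph data below is a hypothetical 5 times 7 pattern of the capital letter A made up for this example, not data of a real font file, and the function name render is also hypothetical. Each row of the matrix is stored as one integer and each bit is one dot.

```python
# Hypothetical 5 x 7 bitmap of the capital letter A, one integer per row
GLYPH_A = [
    0b01110,
    0b10001,
    0b10001,
    0b11111,
    0b10001,
    0b10001,
    0b10001,
]


def render(rows: list[int], width: int = 5, on: str = "#", off: str = ".") -> str:
    """Convert the rows of a bitmap glyph into text, the MSB is the left dot."""
    return "\n".join(
        "".join(on if (row >> (width - 1 - col)) & 1 else off for col in range(width))
        for row in rows
    )


print(render(GLYPH_A))
```

Scaling such a glyph means repeating or dropping dots, which is the reason for the deformation of scaled raster fonts.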
Each raster font has one typeface and its height of glyphs. When a different height is
required, then it is suitable to define a separate font for each height. The scaling of a
raster font is problematic and it leads to the deformation of glyphs. Each glyph in the
matrix has properties that define color, light intensity, inversion, flashing and so on.
These properties in the text terminal are defined by the ANSI escape code, [wiki_1112].
The definitions of raster fonts are typically placed into the file in a specific format. These
formats are Portable Compiled Format (PCF), Glyph Bitmap Distribution Format (BDF),
Server Normal Format (SNF) and others. Raster fonts were the first fonts in computer area
and they are used till today in terminals and a lot of dot matrix printers or inkjet printers as
default fonts.
11.4 Outline fonts

Outline fonts are modern fonts where each glyph is defined by an outer line. It is a
mathematical definition and Bezier curves are used. Some properties of each glyph are
changed by the alteration of the parameters in algorithm. The modification of default
parameters of Bezier curves and color properties of a glyph are shown in Fig. 11-09. All
modifications were performed in a mathematical way.
Fig. 11-09 Principle of outline font
The first collection of outline fonts was destined for desktop publishing in the 1980s and
Adobe Systems with its PostScript Type 1 Font was the first. PostScript is a language for
vector graphics. Today, PostScript fonts are used in pdf documents. As a competition to this,
Apple and Microsoft created their own format of TrueType, also in the 1980s. At the beginning
of the 1990s, the collection of TrueType fonts was first applied in the operating systems
Mac System 7 and Windows 3.1. Later, the OpenType format was designed by Microsoft and Adobe
as a successor of TrueType and the OpenType format was issued as the standard
ISO/IEC 14496-22:2009, Information technology – Coding of audio-visual objects – Part 22:
Open Font Format. More information is in literature [wiki_1113], [wiki_1114] and
[wiki_1115].
PostScript is a registered trademark of Adobe Systems Incorporated. OpenType is a registered
trademark of Microsoft Corporation.
The main advantage of the outline fonts is their mathematical definition and some properties
can be modified by the change of parameters in algorithm. Most outline fonts are
proportional fonts and they are used in word processing and desktop publishing. Some outline
fonts are monospaced and these are used in specific cases. Typical uses are emulators of
terminals, listings of programs in articles or books, plain text editors, etc.
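
The mathematical definition of an outline can be illustrated by the evaluation of one Bezier segment. The following Python sketch uses the de Casteljau construction for a cubic curve; the function name and the control points are hypothetical and they are not taken from any font format.

```python
Point = tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t (0 <= t <= 1) by de Casteljau."""
    def lerp(a: Point, b: Point) -> Point:
        # Linear interpolation between two points
        return (a[0] + (b[0] - a[0]) * t, a[1] + (b[1] - a[1]) * t)

    a, b, c = lerp(p0, p1), lerp(p1, p2), lerp(p2, p3)  # first level
    d, e = lerp(a, b), lerp(b, c)                       # second level
    return lerp(d, e)                                   # point on the curve


# Hypothetical control points of one segment of a glyph outline
segment = ((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))
outline = [cubic_bezier(*segment, t=i / 8) for i in range(9)]
print(outline[4])  # the point in the middle of the segment
```

Because the curve is computed from its control points, scaling a glyph only means scaling the control points; the curve is then evaluated again at the new size.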
11.5 Stroke fonts

Stroke font is a font where each glyph is divided into small elements. These elements are
called the strokes. Then, each glyph is a composition of small strokes in a defined order.
Each stroke is described by mathematics. Therefore, some properties of a stroke glyph can
be modified by the change of parameters in algorithm. Stroke fonts are suitable for
ideograms (pictograms) and CJKV symbols. CJKV is the abbreviation for China, Japan, Korea and
Vietnam. The best example from literature is the CJKV character 永 (pinyin: yǒng, "forever",
"permanence"), Fig. 11-10, [wiki_1116]. This glyph is composed of eight calligraphic strokes.
Each stroke has a name. Other examples are ideograms (pictograms) which are composed of more
pictograms. The ideogram “no dog”, Fig. 11-11, is composed of two pictograms, dog and not
allowed. Pictograms have different colors because the order is defined, the pictogram of
“dog” covers the pictogram of “not allowed”, [wiki_1117].
Fig. 11-10 Strokes in CJK character
Source: http://en.wikipedia.org/wiki/File:8_Strokes_of_Han_Characters.svg
Fig. 11-11 Strokes in ideogram
Source: http://en.wikipedia.org/wiki/File:Perros_No.svg
11.6 ASCII
American Standard Code for Information Interchange – ASCII is a character encoding
scheme that is used by computers, communication devices and other devices with text
processing. This standard was created in the 1960s and its last modification is from 1986.
The original ASCII is a 7-bit encoding scheme and it has two parts, 33 control characters and
95 printable characters. The printable characters are small and capital letters of American
alphabet, digits and special characters, Fig. 11-12.
LSB
MSB
0 1 2 3 4 5 6 7 8 9 A B C D E F
0 NUL SOH STX ETX EOT ENQ ACK BEL BS HT LF VT FF CR SO SI
1 DLE DC1 DC2 DC3 DC4 NAK SYN ETB CAN EM SUB ESC FS GS RS US
2 ! " # $ % & ' ( ) * + , - . /
3 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
4 @ A B C D E F G H I J K L M N O
5 P Q R S T U V W X Y Z [ \ ] ^ _
6 ` a b c d e f g h i j k l m n o
7 p q r s t u v w x y z { | } ~ DEL
Fig.11-12 US ASCII encoding scheme
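
The division of the 7-bit ASCII code into control and printable characters, and the position of a character in the table of Fig. 11-12, can be checked by a few lines of Python. The variable and function names are chosen for this illustration only.

```python
# Control codes are 0x00 to 0x1F and 0x7F, printable characters are 0x20 to 0x7E
control = [c for c in range(128) if c < 0x20 or c == 0x7F]
printable = [c for c in range(128) if 0x20 <= c <= 0x7E]
print(len(control), len(printable))


def table_position(ch: str) -> tuple[int, int]:
    """Return (row, column) of a character in the table, row is the upper hex digit."""
    return divmod(ord(ch), 16)


print(table_position("A"))  # row 4, column 1, the code is 0x41
```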
The newest standard for encoding is the Unicode, where the 7-bit ASCII code is the first
part of Unicode and it is called the Basic Latin Unicode block. In Unicode, each character
has a name, therefore, in the following text, this Unicode name will be used and in some
cases the Unicode name will be supplemented by a slang name or a very popular historical
name. In the Internet world, IANA prefers the name US-ASCII for this encoding scheme. ASCII
was the most common character encoding on the World Wide Web till 2007, then ASCII was
surpassed by Unicode in format UTF-8, [wiki_1118].
The codes from 0 to 0x1F are the control codes and this area is also called the control
code C0. The code 0x7F also belongs to the control codes and it means delete. These control
codes were designed for controlling the peripheral equipment of computers and communication
devices, and for the flow control of transmission. In Fig. 11-13, it is possible to see the
way of generating these codes as the caret notation and the notation in programming
language, mainly C language. Fig. 11-13 shows the codes used and their meaning. The codes in
blue are frequently used. The unlisted codes can be considered obsolete and their meaning can
be found in literature [wiki_1118].
ASCII  Unicode  Caret     C         Acronym     Unicode name [Unicode_1108]                  Description
code   point    notation  language              or other name
00     U+0000   ^@        \0        NUL         NULL                                         Today, it is a string terminator in C language. Originally, NUL had a different meaning.
07     U+0007   ^G        \a        BEL         BELL                                         Causes the sound of the bell in terminal.
08     U+0008   ^H        \b        BS          BACKSPACE                                    Moves the cursor one position leftwards. In terminal, deletes one character leftwards.
09     U+0009   ^I        \t        HT          CHARACTER TABULATION; Horizontal Tabulation  Moves the cursor horizontally to the next tab position.
0A     U+000A   ^J        \n        LF, NL, EOL LINE FEED (LF); New line; End of line        Originally, it moved the cursor down one row without affecting its column position. More in the New Line subchapter.
0B     U+000B   ^K        \v        VT          LINE TABULATION; Vertical Tabulation         Moves the cursor vertically to the next tab position.
0C     U+000C   ^L        \f        FF          FORM FEED                                    In printers, it loads the next sheet; in terminals, it clears the screen.
0D     U+000D   ^M        \r        CR          CARRIAGE RETURN (CR)                         Originally, it moved the cursor to the first column in the same row. More in the New Line subchapter.
1B     U+001B   ^[        \e        ESC         ESCAPE                                       Note below
7F     U+007F   ^?        \x7F      DEL         DELETE                                       Delete
Note: \e is a non-standard escape supported by some C compilers. The C language has no dedicated
escape for DEL, so the hexadecimal escape \x7F is used; \? denotes the question mark.
Fig. 11-13 Meaning of some control codes
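The caret notation in Fig. 11-13 follows a simple rule that can be sketched in Python. The sketch
assumes the usual convention that the control code is obtained by inverting bit 6 (value 0x40) of
the character written after the caret; the function name caret_to_code is hypothetical.

```python
def caret_to_code(notation: str) -> int:
    """Convert caret notation such as '^G' to the control code it denotes."""
    if len(notation) != 2 or notation[0] != "^":
        raise ValueError("expected a caret followed by exactly one character")
    # Assumption: the code is the character after the caret with bit 0x40 inverted.
    return ord(notation[1].upper()) ^ 0x40


for caret in ("^@", "^G", "^J", "^[", "^?"):
    print(caret, f"0x{caret_to_code(caret):02X}")
```

Under this rule, ^@ gives 0x00, ^G gives 0x07, ^[ gives 0x1B and ^? gives 0x7F, which agrees with
the rows of the table.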
ESCAPE is used in two ways: as a key on the keyboard or as the start of an escape sequence.
After the Esc key is pressed, the code 0x1B is sent to the operating system. The resulting
behavior depends on the program; in some situations, ESCAPE may cause the program to exit,
[wiki_1137]. An escape sequence is a series of characters used to change the state of computers
and their attached peripheral devices. An escape sequence begins with the ESCAPE code, which is
followed by other codes, [wiki_1126]. Well-known escape codes are the Hayes command set, the
ANSI escape code and ESC/P.
The Hayes command set is a set of sequences used for controlling a modem, [wiki_1138]. These
sequences can perform actions such as dialing a phone number, answering a call, setting the
parameters of a transfer and so on.
The ANSI escape code, or ANSI escape sequence, is a method of controlling text terminals,
[wiki_1112]. A sequence can change the properties of a single glyph, of all the text, or of the
screen. ANSI escape sequences are still used in the Linux and UNIX operating systems.
ESC/P is an escape sequence defined by Epson Corporation for controlling printers, [wiki_1139].
It was mainly used in dot matrix printers and some inkjet printers, and it is still widely used
in many receipt printers. This sequence can perform actions like setting a text mode or a
graphic mode, setting normal or bold typeface, setting colors, etc.
The codes from 0x20 to 0x7E are printable codes; they include the letters of the American
alphabet. This definition is also a part of Unicode, in the Basic Latin block. The names of some
alphabet characters are shown in Fig. 11-14. The remaining names can be derived from this table.
ASCII code   Unicode  Character  Unicode name
hexadecimal  point               [Unicode_1108]
41           U+0041   A          LATIN CAPITAL LETTER A
42           U+0042   B          LATIN CAPITAL LETTER B
:            :
61           U+0061   a          LATIN SMALL LETTER A
62           U+0062   b          LATIN SMALL LETTER B
:            :
Fig. 11-14 Table of codes and names for alphabet characters
Fig. 11-15 shows special printable characters with official Unicode names [Unicode_1108].
But some glyphs are known by their slang names (blue word) or by names used in the past
or in another technology.
ASCII  Unicode  HTML      Character  Unicode name          Other codes or names
code   point    code                 [Unicode_1108]
20     U+0020                        SPACE
21     U+0021             !          EXCLAMATION MARK
22     U+0022   &quot;    "          QUOTATION MARK        Double quotes or inverted commas
23     U+0023             #          NUMBER SIGN           Hash or hash key on the telephone. Sharp - ♯ is a different glyph and it is used in music
24     U+0024             $          DOLLAR SIGN
25     U+0025             %          PERCENT SIGN
26     U+0026   &amp;     &          AMPERSAND             Logical and
27     U+0027   &apos;    '          APOSTROPHE
28     U+0028   &lparen;  (          LEFT PARENTHESIS      Bracket, parentheses
29     U+0029   &rparen;  )          RIGHT PARENTHESIS
2A     U+002A             *          ASTERISK
2B     U+002B             +          PLUS SIGN             %2B in URL. Plus
2C     U+002C             ,          COMMA                 Decimal separator
2D     U+002D             -          HYPHEN-MINUS          %2D in URL. Minus, hyphen
2E     U+002E             .          FULL STOP             Dot, very popular term; period, baseline dot
2F     U+002F             /          SOLIDUS               Slash, very popular term; fraction slash, division slash
3A     U+003A             :          COLON
3B     U+003B             ;          SEMICOLON
3C     U+003C   &lt;      <          LESS-THAN SIGN        In programming languages, it is an abbreviation lt or .lt.
3D     U+003D             =          EQUALS SIGN
3E     U+003E   &gt;      >          GREATER-THAN SIGN     In programming languages, it is an abbreviation gt or .gt. In shell script (Linux) it means the redirection
3F     U+003F             ?          QUESTION MARK
40     U+0040             @          COMMERCIAL AT         At sign
5B     U+005B             [          LEFT SQUARE BRACKET
5C     U+005C             \          REVERSE SOLIDUS       Backslash
5D     U+005D             ]          RIGHT SQUARE BRACKET
5E     U+005E             ^          CIRCUMFLEX ACCENT     In ASCII caret; the Unicode caret is the ‸ glyph, (U+2038) CARET
5F     U+005F             _          LOW LINE              Underscore, Understrike; Underbar; Underline
60     U+0060             `          GRAVE ACCENT
7B     U+007B             {          LEFT CURLY BRACKET
7C     U+007C             |          VERTICAL LINE         Vertical bar, logical or, pipe
7D     U+007D             }          RIGHT CURLY BRACKET
7E     U+007E             ~          TILDE                 Logical not
Fig. 11-15 Table of codes and names for non-alphabet characters
11.7 Code pages

At the beginning of the era of personal computers, an 8-bit encoding scheme was defined. This
new encoding was built on the classical 7-bit ASCII coding, and new characters and control codes
were added in the range from 0x80 to 0xFF. The range from 0x20 to 0x7F of the 8-bit coding
corresponds to the 7-bit ASCII code. Only the range from 0 to 0x1F has two meanings, depending
on the equipment that receives the code. When the code is sent to the video adapter of a PC in
text mode, it generates a visual glyph, Fig. 11-16. When the code is sent to the peripherals of
a personal computer, it is interpreted as a control code. The upper area from 0x80 to 0xFF
contains characters with diacritics, the Greek alphabet and semi-graphic symbols. This 8-bit
definition is called a code page, and Fig. 11-16 shows code page 437, [wiki_1119]. Code page 437
is the default in many systems.
Source: http://en.wikipedia.org/wiki/File:Codepage-437.png
Fig. 11-16 Code page 437
Personal computers spread around the whole world, and many national characters were missing,
such as Czech diacritics, the Cyrillic alphabet, etc. Therefore, other code pages were defined
that contained the missing national characters. These definitions only changed the upper area,
the range from 0x80 to 0xFF, [wiki_1120]. The code pages 437, 850, 852… were defined in the era
of the MS-DOS operating system, [wiki_1139]. The code pages Windows 1250, Windows 1252, … were
defined in the era of the Windows operating system, [wiki_1140]. These code pages were
international and were issued as standards. Besides these, many code pages were defined locally,
which caused mutual incompatibility.
The International Organization for Standardization and the IEC (ISO/IEC) defined 8-bit codes for
world languages; these codes are called code pages ISO 8859-1, ISO 8859-2, ISO 8859-3, etc. This
standard follows the previous definitions and the 7-bit ASCII code. The whole range of a code
page is divided into the following areas:
• The range from 0x20 to 0x7E contains the Latin alphabet and corresponds to the ASCII
definition. This definition is the same in all pages.
• The range from 0xA0 to 0xFF contains the national characters. The definition depends on the
language and the region of the world.
• The ranges from 0 to 0x1F and from 0x80 to 0x9F are not defined by this standard. The codes in
these ranges correspond to the codes defined by the standard ISO/IEC 6429. They are the control
codes C0 and C1.
The problem of incompatibility also existed in the Czech Republic, where users worked with the
following pages:
• Code page 437 contains the American alphabet. This code page is also called the Basic Latin
Alphabet. It was the first code page for personal computers, [wiki_1119].
• Code page 852 is the page for Central European languages that use the Latin script. This page
contains the Czech alphabet. It is also known as Latin-2, [wiki_1120].
• Windows 1252 and ISO 8859-1 are similar pages with the Latin alphabet for Western Europe,
[wiki_1121], [wiki_1123].
• Windows 1250 and ISO 8859-2 are similar pages with the Latin alphabet for Central Europe,
[wiki_1122], [wiki_1124].
• Code pages defined nationally and incompatible with each other, [wiki_1145].
Each typeface has a separate font file, so one code page has many font files, one for each
typeface. When a document is written in several languages, the corresponding font files have to
be installed. This is not a problem on a single personal computer. The problem occurs when the
document is sent to a recipient or is published on the Internet. The reader may not have all the
required code pages installed, and the document becomes unreadable. This problem and others are
solved by Unicode.
11.8 C0 and C1 control codes

C0 and C1 control codes are used for controlling peripheral equipment. The C0 control codes were
first defined by ASCII, and the C1 control codes were defined later. Subsequently, the C0 and C1
control codes became the standard ISO/IEC 6429. The C0 control codes correspond to the ASCII
definition in the range from 0 to 0x1F of the 8-bit encoding scheme, [wiki_1125].
The C1 control codes are a new set of codes located in the range from 0x80 to 0x9F of the 8-bit
encoding scheme. All the definitions and explanations of these new codes can be found in
[wiki_1125]. The NEL code is a code for the next line; it is an attempt to solve the ambiguity
of the CR+LF sequence. Another code worth mentioning is CSI - CONTROL SEQUENCE INTRODUCER. This
code is the leading code of an ANSI escape sequence and is followed by parameters. ANSI escape
sequences are used for controlling text terminals in the Linux and UNIX operating systems. The
CSI code can be replaced by the sequence of codes ESC + [, hexadecimal 0x1B 0x5B: the ESC code
from the C0 control codes followed by the left square bracket.
ASCII  Unicode  ESC+   Acronym  Unicode name [Unicode_1108]  Description
code   point                    or other name
85     U+0085   ESC+E  NEL      NEXT LINE (NEL)              Equivalent to CR+LF. Used to mark end-of-line on some IBM mainframes.
9B     U+009B   ESC+[  CSI      CONTROL SEQUENCE INTRODUCER  Used to introduce control sequences that take parameters. ANSI escape sequence
Fig. 11-17 C1 control code
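The replacement of CSI by ESC + [ can be illustrated by a short Python sketch that builds an ANSI
escape sequence. It is a minimal illustration, not a complete terminal library. The helper name
sgr and the parameter values 1 (bold), 31 (red foreground) and 0 (reset) are assumptions that are
not taken from this text; they follow the common usage of sequences ending with the letter m.

```python
ESC = "\x1b"        # ESCAPE from the C0 control codes, code 0x1B
CSI = ESC + "["     # 7-bit replacement of the C1 code CSI (0x9B)


def sgr(*params: int) -> str:
    """Build a CSI sequence with numeric parameters, terminated by 'm'."""
    return CSI + ";".join(str(p) for p in params) + "m"


bold_red = sgr(1, 31)   # assumed meaning: 1 = bold, 31 = red foreground
reset = sgr(0)          # assumed meaning: 0 = reset all attributes

print(repr(bold_red))                   # shows the raw codes of the sequence
print(bold_red + "warning" + reset)     # a supporting terminal changes the text style
```

Whether the text is really displayed in bold red depends on the terminal that receives the codes.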
11.9 Unicode
Unicode is a code that aims to encode every alphabet of the world. The previous attempts were
very problematic. There are living and dead languages in the world. The living languages are
still used by people today. The dead languages are historical languages such as Egyptian
hieroglyphs, Indian languages, etc. Therefore, a universal coding of all alphabets of the world
is desirable. In addition, today's documents are not only classical texts; they also contain
symbols from science, graphics, art, etc. The work on a new coding of the world's alphabets
started in the 1990s. Two groups, Unicode and ISO/IEC, cooperated on the work, and Unicode was
also published as an ISO/IEC standard. The conclusions of both groups are very similar, but the
better-known name is Unicode. Unicode version 7 is from 2014.
“Unicode is a computing industry standard for the consistent encoding, representation and
handling of text expressed in most of the world's writing systems.”
Source: http://en.wikipedia.org/wiki/Unicode
Fig. 11-18 Logo Unicode
U+xxxx NAME OF THE CHARACTER basic glyph, for example U+0031 DIGIT ONE 1
• U+xxxx: the prefix U+ followed by the code point as a hexadecimal number
• NAME OF THE CHARACTER: the Unicode name; the standard uses small capital letters for the name
• basic glyph
The order is not obligatory.
Fig. 11-19 Notation of Unicode character
The Unicode space comprises 1,114,112 code points, and the corresponding range is from 0 to
0x10FFFF. It is a 21-bit space where each character is defined by its Unicode point (code
point), name and basic glyph, Fig. 11-19. Each part of this definition has rules for writing,
and the order of these parts is not obligatory.
• The Unicode point or code point is a hexadecimal number that corresponds to a character. The
minimal number of digits is 4. It means that code points from the Basic Multilingual Plane are
written with 4 digits.
• The Unicode name is a textual definition of a character; the standard uses small capital
letters for the notation.
• The basic glyph is a graphical representation of a character. The typeface can change the
graphical representation.
Note to coding space: Originally, Unicode was defined in 32-bit space and later the space was
reduced to 21 bits. Therefore, Unicode is also known as 32-bit encoding.
Example of graphical symbols:
U+25B3 - WHITE UP-POINTING TRIANGLE (trine) △
U+2624 - CADUCEUS ☤
U+1F638 - GRINNING CAT FACE WITH SMILING EYES 😸
The 21-bit Unicode space is divided into 17 planes, and each plane has 65,536 code points. It
means that each plane uses 16 bits, 2^16 = 65,536, and 17 * 65,536 = 1,114,112 code points.
Each plane has its number, name, abbreviation and range, Fig. 11-20.
Unicode planes
Group          Range             Plane          Abbreviation  Name
Basic          0000 – FFFF       Plane 0        BMP           Basic Multilingual Plane
Supplementary  1 0000 – 1 FFFF   Plane 1        SMP           Supplementary Multilingual Plane
Supplementary  2 0000 – 2 FFFF   Plane 2        SIP           Supplementary Ideographic Plane
Supplementary  3 0000 – D FFFF   Plane 3 – 13   -             Unassigned
Supplementary  E 0000 – E FFFF   Plane 14       SSP           Supplementary Special-purpose Plane
Supplementary  F 0000 – 10 FFFF  Plane 15 - 16  SPUA A/B      Supplementary Private Use Area
Fig. 11-20 Definition of Unicode planes
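Because each plane holds 2^16 code points, the plane number is simply the part of the code point
above the low 16 bits. The following Python sketch illustrates this; the function name plane_of
and the dictionary PLANE_ABBREVIATIONS are hypothetical helpers based on Fig. 11-20.

```python
PLANE_ABBREVIATIONS = {0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP", 15: "SPUA A", 16: "SPUA B"}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) of a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode space")
    return code_point >> 16      # drop the 16 bits that address a position inside the plane


for cp in (0x0041, 0x1F638, 0x24A3D, 0x10FFFF):
    plane = plane_of(cp)
    print(f"U+{cp:04X} lies in plane {plane} ({PLANE_ABBREVIATIONS.get(plane, 'unassigned')})")
```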
Basic plane 0 has the name BMP – Basic Multilingual Plane. This plane is important; it contains
the scripts for all living world languages and other graphical symbols. The first 256 code
points, 0 to 0x00FF, correspond to the ISO 8859-1 standard and to the C0 and C1 control codes.
The first 128 code points, 0 to 0x7F, correspond to the 7-bit ASCII code, because the ASCII code
is a subset of ISO 8859-1. The notation of the code points of the ASCII range in UTF-8 is the
same as the 7-bit ASCII code. This definition in UTF-8 ensures compatibility with the ASCII
encoding.
The remaining code points contain scripts for all modern world languages. It means that the BMP
plane contains the scripts for Cyrillic, Greek, Arabic, Chinese, Japanese, Korean, many symbols,
etc. Most of the world's documents can therefore be written using the Basic Multilingual Plane.
Supplementary plane 1 has the name SMP - Supplementary Multilingual Plane. This plane contains
historical scripts such as Egyptian hieroglyphs, the Maya language, etc., and a lot of symbols.
The symbols are mathematical alphanumeric symbols, present and past music symbols, and game
symbols such as playing cards, Mahjong, domino tiles, etc.
Supplementary plane 2 has the name SIP - Supplementary Ideographic Plane. This plane contains
CJK ideograms that were not included in earlier character encoding standards. The BMP plane
defines only a part of the CJK characters; therefore, the 2nd plane is intended for further CJK
ideograms.
Supplementary planes 3 to 13 are unassigned planes.
Supplementary plane 14 has the name SSP - Supplementary Special-purpose Plane. This plane
currently contains non-graphical characters. It contains a block of tag characters, originally
intended as language tags for use when the language cannot be indicated through other protocols,
[wiki_1146].
Supplementary planes 15 and 16 have the name SPUA A/B - Supplementary Private Use Area A and B.
These planes are designated for private use by parties outside ISO and the Unicode Consortium.
11.10 Using Unicode

A special notation of Unicode points is defined for use in computer science. This notation
follows the ASCII encoding of characters and has new principles for covering the whole 21-bit
Unicode space. Unicode has three encoding forms, which use 8-bit, 16-bit and 32-bit units. These
are named UTF-8, UTF-16, and UTF-32, respectively. UTF stands for Unicode (or UCS)
Transformation Format; the Unicode standard admits both explanations of the letter U. The
ISO/IEC standard uses the abbreviations UCS1, UCS2 and UCS4, but the UTF abbreviations are more
widespread.
11.11 UTF-32
This notation uses a 32-bit word. A Unicode code point has 21 bits, and a zero prefix extends it
to 32 bits. It means that the UTF-32 notation corresponds directly to the Unicode point,
[wiki_1127]. The Python programming language, from version 3.3, stores strings with a fixed
number of bytes per code point, up to the 4 bytes of UTF-32, [wiki_1127].
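The direct correspondence between the code point and UTF-32 can be sketched in Python. The
function name utf32_units is hypothetical; the codec name "utf-32-be" is the big-endian UTF-32
codec of the Python standard library and is used here only for comparison.

```python
def utf32_units(text: str) -> list[int]:
    """One 32-bit unit per character: the code point extended by a zero prefix."""
    return [ord(ch) for ch in text]


sample = "A\u03b1\U0001F638"
for unit in utf32_units(sample):
    print(f"0x{unit:08X}")

# Writing each unit as 4 bytes, most significant byte first, gives the same bytes
# as the built-in big-endian codec.
manual = b"".join(unit.to_bytes(4, "big") for unit in utf32_units(sample))
print(manual == sample.encode("utf-32-be"))
```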
11.12 UTF-16
UTF-16 is a variable-length encoding scheme that covers the whole Unicode space. UTF-16 uses
16-bit words, and the length of a sequence is 1 or 2 words, [wiki_1128]. In the case of two
words, the UTF-16 encoding scheme uses a surrogate pair, i.e. two numbers from the range 0xD800
to 0xDFFF. These values lie in the BMP plane and have no characters assigned; they are reserved
for UTF-16 encoding only. The UTF-16 encoding depends on the value of the code point, and there
are two cases.
The code points from the BMP plane are used directly as UTF-16. The BMP is the basic plane and
lies in the range from 0 to 0xFFFF, so only 16 significant bits are used. No transformation is
used in this case.
When the code point lies outside the basic plane, a surrogate pair is used for encoding and the
code point is transformed into two words: the lead and the trail surrogate. The lead surrogate
is a number in the range from 0xD800 to 0xDBFF. The trail surrogate is a number in the range
from 0xDC00 to 0xDFFF. The algorithm is, [wiki_1128]:
• 0x010000 is subtracted from the code point; the result has at most 20 significant bits and
lies in the range from 0 to 0xFFFFF.
• The result of the subtraction is divided into two groups of 10 bits each. Both numbers are in
the range from 0 to 0x3FF.
• The number corresponding to the top 10 bits is added to 0xD800. The sum is the first code unit
or lead surrogate, which will be in the range from 0xD800 to 0xDBFF. (Previous versions of the
Unicode Standard referred to these as high surrogates.)
• The number corresponding to the low 10 bits is added to 0xDC00, in order to obtain the second
code unit or trail surrogate, which will be in the range from 0xDC00 to 0xDFFF. (Previous
versions of the Unicode Standard referred to these as low surrogates.)
The principle of the encoding scheme is shown in Fig. 11-21. The first surrogate sequence,
0xD800 and 0xDC00, corresponds to the code point 0x010000.
Lead\Trail DC00 DC01 DC02 … DFFF

D800 01 0000 01 0001 01 0002 … 01 03FF
D801 01 0400 01 0401 01 0402 … 01 07FF
D802 01 0800 01 0801 01 0802 … 01 0BFF
⋮ ⋮ ⋮ ⋮ ⋱ ⋮
DBFF 10 FC00 10 FC01 10 FC02 … 10 FFFF
Fig. 11-21 UTF-16 encoding by using surrogate pairs
An example of encoding using surrogate pairs is in Fig. 11-22 and the steps are:
• In the first step, subtract 0x010000 from the code point. The result has only 20 significant
bits.
• In the 2nd step, transform the difference to binary and divide it into two groups of 10 bits
each. Transform each group to hexadecimal. The top group belongs to the lead surrogate and the
low one to the trail surrogate.
• In the 3rd step, calculate the lead surrogate by adding 0xD800 to the top group.
• In the 4th step, calculate the trail surrogate by adding 0xDC00 to the low group.
• In the end, the UTF-16 sequence is the lead surrogate followed by the trail surrogate.
U+24A3D 𤨽 HAN IDEOGRAM, UTF-16: 0xD852 0xDE3D
Subtraction:     0x02 4A3D - 0x01 0000 = 0x01 4A3D
In binary:       0x01 4A3D = 0001 0100 1010 0011 1101
Leading part:    00 0101 0010 = 0x052
Trailing part:   10 0011 1101 = 0x23D
Lead surrogate:  0xD800 + 0x052 = 0xD852
Trail surrogate: 0xDC00 + 0x23D = 0xDE3D
Fig. 11-22 Example of UTF-16 encoding
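The steps of the surrogate pair calculation can be sketched in Python as follows. The function
name to_surrogates is hypothetical; the last line compares the result with the built-in
"utf-16-be" codec of the Python standard library.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Return (lead, trail) surrogates for a code point outside the BMP."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not lie in a supplementary plane")
    offset = code_point - 0x10000          # at most 20 significant bits
    lead = 0xD800 + (offset >> 10)         # top 10 bits
    trail = 0xDC00 + (offset & 0x3FF)      # low 10 bits
    return lead, trail


lead, trail = to_surrogates(0x24A3D)
print(f"0x{lead:04X} 0x{trail:04X}")

# Cross-check with the built-in codec (big endian, no byte order mark).
expected = chr(0x24A3D).encode("utf-16-be")
print(lead.to_bytes(2, "big") + trail.to_bytes(2, "big") == expected)
```

For the code point U+24A3D, the sketch gives the pair 0xD852 0xDE3D.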
The UTF-16 format is used in:
• Texts in the OS API of Microsoft Windows 2000/XP/2003/Vista/7/8/CE. Files and network data
tend to be a mix of UTF-16, UTF-8, and legacy byte encodings.
• The .NET environments.
• The Qt cross-platform graphical widget toolkit.
• Java, and Python only up to version 3.2.
11.13 UTF-8
UTF-8 is a variable-length encoding scheme that covers the whole Unicode space. This notation
uses an 8-bit byte as the basic element, [wiki_1129]. UTF-8 ensures backward compatibility with
ASCII and avoids the endianness problem of UTF-16 and UTF-32. The number of bytes used depends
on the value of the code point. RFC 3629 from 2003 restricts the maximum number of bytes to 4,
which is in accordance with UTF-32 and with two words of UTF-16. Four bytes of UTF-8 can encode
the code points in the range from 0 to 0x10FFFF. This range corresponds to a 21-bit number and
17 planes.
The basic features of this encoding scheme are, [wiki_1129]:
• “Backward compatibility: One-byte codes are used only for the ASCII values 0 through 127. In
this case the UTF-8 code has the same value as the ASCII code. The high-order bit of these codes
is always 0.
• Clear distinction between multi-byte and single-byte characters: Code points larger than 127
are represented by multi-byte sequences, composed of a leading byte and one or more continuation
bytes. The leading byte has two or more high-order 1s followed by a 0, while continuation bytes
all have '10' in the high-order position.
• Self synchronization: Single bytes, leading bytes, and continuation bytes do not share values.
This makes the scheme self-synchronizing, allowing the start of a character to be found by
backing up at most five bytes (three bytes in actual UTF‑8 per RFC 3629 restriction, see above).
• Clear indication of code sequence length: The number of high-order 1s in the leading byte of a
multi-byte sequence indicates the number of bytes in the sequence, so that the length of the
sequence can be determined without examining the continuation bytes.
• Code structure: The remaining bits of the encoding are used for the bits of the code point
being encoded, padded with high-order 0s if necessary. The high-order bits go in the lead byte,
lower-order bits in succeeding continuation bytes. The number of bytes in the encoding is the
minimum required to hold all the significant bits of the code point.”
A UTF-8 sequence of bytes consists of a leading byte and continuation bytes, Fig. 11-23. The
first byte of the sequence is the leading byte, which may be followed by one or more
continuation bytes. All continuation bytes are uniquely marked by the prefix 10B. The leading
byte is uniquely marked as follows:
• The prefix 0B is used when the sequence has only one byte, the leading byte.
• The prefixes 110B, 1110B and 11110B are used for longer sequences. In this case, the number of
1s in the prefix defines the number of bytes in the sequence, i.e. the sum of the leading and
continuation bytes.
Line  Number of bits  First code  Last code   Leading byte  Continuation bytes in binary
      in code point   point       point       in binary
                                              Byte 1        Byte 2     Byte 3     Byte 4
1     7               U+0000      U+007F      0xxx xxxx
2     11              U+0080      U+07FF      110x xxxx     10xx xxxx
3     16              U+0800      U+FFFF      1110 xxxx     10xx xxxx  10xx xxxx
4     21              U+1 0000    U+1F FFFF   1111 0xxx     10xx xxxx  10xx xxxx  10xx xxxx
Source: http://en.wikipedia.org/wiki/UTF-8
Fig. 11-23 UTF-8 encoding
UTF-8 has self-synchronizing properties. This is important when one or more bytes in a UTF-8
stream are missing. The UTF-8 decoder is able to find the next leading byte after the error and
to continue decoding; only a few characters of the text are lost.
UTF-8 has invalid codes. The invalid codes are those derived from surrogate pairs, codes outside
the Unicode space, and codes specified by a longer sequence than necessary. Each code point must
be encoded by the shortest possible sequence.
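The three kinds of invalid codes can be demonstrated with the strict UTF-8 decoder of Python. This
is only a sketch; the byte sequences in the dictionary INVALID_SAMPLES and the function name
is_valid_utf8 are illustrative choices, not part of any standard.

```python
INVALID_SAMPLES = {
    "code derived from a surrogate (U+D800)": bytes([0xED, 0xA0, 0x80]),
    "code outside the Unicode space (0x110000)": bytes([0xF4, 0x90, 0x80, 0x80]),
    "longer sequence than necessary for U+002F": bytes([0xC0, 0xAF]),
}


def is_valid_utf8(data: bytes) -> bool:
    """Return True when the strict built-in decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


for label, data in INVALID_SAMPLES.items():
    print(f"{label}: {data.hex(' ')} valid={is_valid_utf8(data)}")
print("shortest form of U+002F:", is_valid_utf8(bytes([0x2F])))
```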
U+13051 EGYPTIAN HIEROGLYPH B002, UTF-8: 0xF0 0x93 0x81 0x91
Code point in binary: 0001 0011 0000 0101 0001
Division from the LSB: leading part 000, continuation parts 01 0011, 00 0001, 01 0001
Leading byte:        “1111 0” + “000” = “1111 0000” => 0xF0
Continuation byte 2: “10” + “01 0011” = “1001 0011” => 0x93
Continuation byte 3: “10” + “00 0001” = “1000 0001” => 0x81
Continuation byte 4: “10” + “01 0001” = “1001 0001” => 0x91
The operator “+” means concatenation.
Fig. 11-24 Example of UTF-8 encoding
The UTF-8 encoding is defined by Fig. 11-23, and an example of UTF-8 encoding is in Fig. 11-24.
The example is only valid for Unicode points higher than 0x7F, i.e. outside the 7-bit ASCII
code. When the Unicode point is a 7-bit ASCII code, the UTF-8 code is directly this code. For
codes higher than 0x7F, the following algorithm is used:
• In the first step, transform the Unicode point to a binary number.
• In the 2nd step, divide the binary string into 6-bit parts starting from the LSB. The top part
may be shorter than 6 bits; it is padded with leading zeros to fill the free bits of the leading
byte. The top part belongs to the leading byte and the remaining parts belong to the
corresponding continuation bytes.
• In the 3rd step, choose the appropriate line in the table of Fig. 11-23 according to the
number of continuation parts.
• In the 4th step, assemble the leading byte as the concatenation of the leading prefix with the
leading part. The number of 1s in the prefix defines the total number of bytes in the sequence.
• In the 5th and following steps, assemble all continuation bytes as the concatenation of the
continuation prefix 10B with the corresponding continuation part.
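The algorithm can be sketched in Python as follows. The function name utf8_encode is hypothetical,
and the sketch is deliberately simplified: it follows the lines of the table in Fig. 11-23,
limited to the Unicode space, and does not reject the surrogate range 0xD800 to 0xDFFF, which a
complete encoder would have to do.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point according to the lines of the UTF-8 table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode space")
    if code_point <= 0x7F:                       # line 1: 7-bit ASCII, used directly
        return bytes([code_point])
    if code_point <= 0x7FF:                      # line 2: one continuation byte
        continuation_count, prefix = 1, 0b1100_0000
    elif code_point <= 0xFFFF:                   # line 3: two continuation bytes
        continuation_count, prefix = 2, 0b1110_0000
    else:                                        # line 4: three continuation bytes
        continuation_count, prefix = 3, 0b1111_0000
    parts = []
    for _ in range(continuation_count):
        parts.append(0b1000_0000 | (code_point & 0x3F))   # prefix 10 + 6-bit part from the LSB
        code_point >>= 6
    leading = prefix | code_point                # leading prefix + remaining top part
    return bytes([leading] + parts[::-1])


print(utf8_encode(0x13051).hex(" "))
print(utf8_encode(0x13051) == chr(0x13051).encode("utf-8"))
```

For the code point U+13051 the sketch produces the bytes F0 93 81 91, in agreement with the
built-in UTF-8 codec of Python.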
The importance of UTF-8 is shown by its use in practice, [wiki_1129]:
• UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more
than half of all Web pages.
• The Internet Mail Consortium (IMC) recommends that all e-mail programs be able to display and
create mail using UTF-8.
• The W3C recommends UTF-8 as the default encoding in their main standards (XML and HTML).
• UTF-8 is also increasingly being used as the default character encoding in operating systems,
programming languages, APIs, and software applications.
• UTF-8 is one of the possible default character sets on the Internet, [oracle_1101].
UTF          Sequence of bytes  Glyphs corresponding  Note
             in hexadecimal     to CP 1252
UTF-8        EF BB BF           ï»¿
UTF-16 (BE)  FE FF              þÿ                    Big endian
UTF-16 (LE)  FF FE              ÿþ                    Little endian
UTF-32 (BE)  00 00 FE FF        ␀ ␀ þÿ                Big endian; NULL is ASCII code 00
UTF-32 (LE)  FF FE 00 00        ÿþ ␀ ␀                Little endian; NULL is ASCII code 00
⦙            ⦙                  ⦙
Note:
• The byte is an atomic element in the endianness shown.
• The sequence is presented in the order as it can be seen in the editor.
Fig. 11-25 BOM and UTF
11.14 Byte order mark

BOM – Byte Order Mark is a Unicode character used to signal the endianness (byte order) of a
text file or stream. The Unicode point is U+FEFF, BYTE ORDER MARK (BOM). The use of the BOM is
optional, and, if used, it should appear at the start of the text stream. Beyond its specific
use as a byte-order indicator, the BOM character also indicates the UTF format. Fig. 11-25
defines the use of the BOM in UTF encodings, [wiki_1130]. The figure shows only the definitions
for UTF encodings; values for other and older encodings are not presented.
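The recognition of the UTF format from the BOM can be sketched in Python with the constants of
the standard codecs module. The function name detect_bom and the labels in the list BOMS are
hypothetical. The UTF-32 marks are tested first, because the UTF-32 (LE) mark FF FE 00 00 begins
with the UTF-16 (LE) mark FF FE; a UTF-16 (LE) text starting with the character U+0000 would
therefore be misjudged by this simple sketch.

```python
import codecs

# Longer marks first, see the note on FF FE 00 00 above.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 (BE)"),
    (codecs.BOM_UTF32_LE, "UTF-32 (LE)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 (BE)"),
    (codecs.BOM_UTF16_LE, "UTF-16 (LE)"),
]


def detect_bom(data: bytes):
    """Return the UTF format indicated by a leading BOM, or None when there is no BOM."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None


print(detect_bom(bytes.fromhex("EFBBBF41")))
print(detect_bom(bytes.fromhex("FFFE4100")))
print(detect_bom(b"plain ASCII text"))
```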
U+1F308 🌈 RAINBOW, UTF-16: 0xD83C 0xDF08
UTF-16 (BE): 0xD8 0x3C 0xDF 0x08
UTF-16 (LE): 0x3C 0xD8 0x08 0xDF
Addresses increase from left to right (i, i+1, …).
Fig. 11-26 Example of big and little endian in UTF-16
Examples of the application of big and little endian are in Fig. 11-26 and Fig. 11-27. The
atomic element is a byte, and the generated stream can be seen in a hexadecimal editor. In the
case of big endian, the most significant byte is placed at the lower address and the least
significant byte at the higher address. In the case of little endian, it is vice versa: the
least significant byte is placed at the lower address and the most significant byte at the
higher address.
U+FB62 ﭢ ARABIC LETTER TEHEH ISOLATED FORM, UTF-32: 0x0000FB62
UTF-32 (BE): 0x00 0x00 0xFB 0x62
UTF-32 (LE): 0x62 0xFB 0x00 0x00
Addresses increase from left to right (i, i+1, …).
Fig. 11-27 Example of big and little endian in UTF-32
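The byte streams of both figures can be reproduced with the codecs of the Python standard library.
The function name byte_stream is hypothetical; the codec names "utf-16-be", "utf-16-le",
"utf-32-be" and "utf-32-le" fix the byte order explicitly and do not write a BOM.

```python
def byte_stream(ch: str, encoding: str) -> str:
    """Return the bytes of one character in the order a hexadecimal editor shows them."""
    return ch.encode(encoding).hex(" ").upper()


print("UTF-16 (BE):", byte_stream("\U0001F308", "utf-16-be"))
print("UTF-16 (LE):", byte_stream("\U0001F308", "utf-16-le"))
print("UTF-32 (BE):", byte_stream("\uFB62", "utf-32-be"))
print("UTF-32 (LE):", byte_stream("\uFB62", "utf-32-le"))
```

In UTF-16 the bytes are swapped inside each 16-bit word, while the order of the lead and trail
surrogates stays the same; in UTF-32 all four bytes of the word are reversed.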
Code point  Name                  Code point  Name
U+0009      CHARACTER TABULATION  U+000A      LINE FEED (LF)
U+000B      LINE TABULATION       U+000C      FORM FEED (FF)
U+000D      CARRIAGE RETURN (CR)  U+0020      SPACE
U+0085      NEXT LINE (NEL)
Other white spaces are, [wiki_1141]:
• The range from U+2000 to U+200A
• U+2028, U+2029
• U+202F, U+205F, U+3000
Fig. 11-28 Whitespace characters
11.15 Whitespace character

Whitespace is any character that represents a horizontal or a vertical space. A whitespace
character is not represented by a readable glyph; it is only a space that makes the text more
readable. Typical examples of whitespace are U+0020 SPACE and U+0009 CHARACTER TABULATION.
Unicode defines 25 whitespace characters, Fig. 11-28.
11.16 Newline
Newline is a special character or a sequence of characters that signifies the end of the text
line. The synonyms for newline are line ending, end of line (EOL), or line break. The mean-
ing of newline is that cursor or head moves to the first column and by one line down. The
actual codes for newline have different behavior across the operating systems. It means
that the text file exchange between the operating systems is a problem. ASCII uses the
characters LINE FEED and CARRIAGE RETURN to perform the newline. The abbreviations are LF
and CR.
Literature [wiki_1136] states the following about LINE FEED and CARRIAGE RETURN: “The concepts
of line feed (LF) and carriage return (CR) are closely associated and can be either considered
separately or lumped together. In the physical media of typewriters and printers, two axes of
motion, "down" and "across", are needed to create another line (a new line) on the page.
Although the design of a machine (typewriter or printer) must consider them separately, the
abstract logic of software can lump them together as one event. This is why a newline in
character encoding can be defined as LF and CR combined into one (LF+CR, LFCR, CR+LF, CRLF).”
Programming languages have the format strings \n and \r corresponding to line feed and carriage
return. Using these strings does not guarantee that the codes 0x0A and 0x0D are actually
produced. The codes for newline differ across the operating systems.
• UNIX and UNIX-like operating systems use the LINE FEED code 0x0A for the new line. UNIX-like
operating systems are e.g. Linux, OS X (the operating system of Mac computers), Android, etc.
• The Windows operating system uses the sequence CARRIAGE RETURN and LINE FEED (0x0D 0x0A) for
the new line.
• On the Internet, the sequence CARRIAGE RETURN and LINE FEED (0x0D 0x0A) should be used on the
protocol level of most textual Internet protocols (FTP, HTTP…). However, it is recommended that
a tolerant application also recognizes a single LINE FEED as the newline.
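The exchange of text between the operating systems can be sketched as a simple conversion of the
newline codes. The function names to_unix_newlines and to_windows_newlines are hypothetical, and
the sketch handles only the codes CR and LF discussed above.

```python
def to_unix_newlines(text: str) -> str:
    """Replace CR+LF and a lone CR by a single LINE FEED (0x0A)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def to_windows_newlines(text: str) -> str:
    """Use the sequence CARRIAGE RETURN + LINE FEED (0x0D 0x0A) for every newline."""
    return to_unix_newlines(text).replace("\n", "\r\n")


mixed = "first\r\nsecond\nthird\r"
print(repr(to_unix_newlines(mixed)))
print(repr(to_windows_newlines(mixed)))
```

The CR+LF pair is replaced first, so that it is not converted into two separate newlines.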
Unicode keeps the previous codes for newline and defines new ones. The new codes have to ensure
the transformation between new and old documents. The codes associated with newline are:
• LINE FEED (LF), U+000A, graphical symbol in document is ␊, U+240A - SYMBOL FOR LINE FEED.
• LINE TABULATION, U+000B, original name was vertical tab (VT). Graphical symbol in document is
␋, U+240B - SYMBOL FOR VERTICAL TABULATION.
• FORM FEED (FF), U+000C, graphical symbol in text is ␌, U+240C - SYMBOL FOR FORM FEED.
• CARRIAGE RETURN (CR), U+000D, graphical symbol in text is ␍, U+240D - SYMBOL FOR CARRIAGE
RETURN.
• CR+LF: CR (U+000D) followed by LF (U+000A).
• NEXT LINE (NEL), U+0085, graphical symbol in text is ␤, U+2424 - SYMBOL FOR NEWLINE.
• LINE SEPARATOR, U+2028.
• PARAGRAPH SEPARATOR, U+2029.
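As an illustration, the method str.splitlines of Python treats all the codes listed above as line
boundaries (together with a few other separators that are not discussed here). The function name
split_unicode_lines is a hypothetical wrapper used only for this sketch.

```python
def split_unicode_lines(text: str) -> list[str]:
    """Split text at the newline codes; CR+LF counts as one boundary."""
    return text.splitlines()


sample = "1\n2\x0b3\x0c4\r5\r\n6\x857\u20288\u20299"
print(split_unicode_lines(sample))
```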
11.17 Possible notations of Unicode

Unicode in the UTF-8 format is the preferred character encoding scheme in many documents today.
Unicode contains a lot of characters that are not on the keyboard. How to enter characters
missing from the keyboard into a document depends on the operating system and the program.
Windows. In text documents, it is possible to use Alt codes. An Alt code is entered by holding
the Alt key and typing a decimal number on the numeric keypad, [wiki_1144]. The decimal number
is the decimal value of the hexadecimal Unicode code point. The way to enter a hexadecimal
number is described in [wiki_1144].
UNIX/Linux. There are three ways of entering the Unicode characters missing on the keyboard,
[wiki_1144]. Nevertheless, not all application programs allow all three methods of entering.
• Hold Ctrl + Shift and type U followed by up to eight hexadecimal digits (on the main keyboard
or numpad). Then release Ctrl + Shift.
• Hold Ctrl + Shift + U and type up to eight hexadecimal digits, then release Ctrl + Shift + U.
• Type Ctrl + Shift + U, then type up to eight hexadecimal digits, and then type Enter.
HTML has defined a notation that is able to write all Unicode characters, Fig. 11-29. The
notation is a string with the prefix “&#x” or “&#”, followed by a number and the ending
character “;”. The number in the middle corresponds to the code point. The prefix “&#x” is
for the hexadecimal number and “&#” for the decimal number of the code point, [wiki_1142].
Another possibility is to use a defined name instead of the number, written between “&” and
“;”. Not all Unicode characters have a defined name in HTML; the list of these names is in
[wiki_1143].
&#x3B1; is α          &#945; is α          &alpha; is α
Hexadecimal           Decimal              Defined
code point            code point           name
Fig. 11-29 Notation of Unicode code point in HTML
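The three HTML notations can be checked with the standard Python module html. The following
lines are a small sketch, assuming Python 3; the helper to_refs is a hypothetical name used
only for this illustration.

```python
import html

HEX_REF = "&#x3B1;"    # hexadecimal code point
DEC_REF = "&#945;"     # decimal code point (0x3B1 = 945)
NAMED_REF = "&alpha;"  # defined name

# html.unescape() converts all three notations to the same character.
for ref in (HEX_REF, DEC_REF, NAMED_REF):
    print(ref, "->", html.unescape(ref))


def to_refs(ch: str) -> tuple:
    """Return the hexadecimal and decimal numeric references of one character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};", f"&#{code_point};"


print(to_refs("α"))
```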

11.18 References
[Internet_1101] Glyph; http://whatis.techtarget.com/definition/glyph; on line 2013-12-04
[Internet_1102] Character; http://searchcio-midmarket.techtarget.com/definition/character; on line 2013-12-04
[Internet_1103] Unicode; http://searchcio-midmarket.techtarget.com/definition/Unicode; on line 2013-12-04
[Internet_1104] Coldewey D.: A quick PSA on "dots" versus "pixels" in LCDs; Jul 21, 2010; http://techcrunch.com/2010/07/21/a-quick-psa-on-dots-versus-pixels-in-lcds/; on line 2013-12-12
[Internet_1105] Bigman A.: PPI vs. DPI: what’s the difference? in Design Tips and Resources, on February 26, 2013; http://99designs.com/designer-blog/2013/02/26/ppi-vs-dpi-whats-the-difference/; on line 2013-12-12
[oracle_1101] 23.3 Specifying a Character Set in a JSP or XML File; http://docs.oracle.com/cd/E17904_01/bi.1111/b32121/pbr_nls003.htm; on line 2014-08-31
[Unicode_1101] Character; http://www.unicode.org/glossary/#character; on line 2013-12-06
[Unicode_1102] Glyph; http://www.unicode.org/glossary/#glyph; on line 2013-12-06
[Unicode_1103] Unicode Technical Report #17; UNICODE CHARACTER ENCODING MODEL; http://www.unicode.org/reports/tr17/; on line 2013-12-06
[Unicode_1104] Character Encoding Form; http://www.unicode.org/glossary/#character_encoding_form; on line 2013-12-06
[Unicode_1105] Character Set; http://www.unicode.org/glossary/#character_set; on line 2013-12-06
[Unicode_1106] Font; http://www.unicode.org/glossary/#font; on line 2013-12-06
[Unicode_1107] Script; http://www.unicode.org/glossary/#script; on line 2013-12-06
[Unicode_1108] C0 Controls and Basic Latin; http://www.unicode.org/charts/PDF/U0000.pdf; on line 2014-01-16
[wiki_1101] Morse code; http://en.wikipedia.org/wiki/Morse_code; on line 2013-11-21
[wiki_1102] Telex; http://en.wikipedia.org/wiki/Telex; on line 2013-11-21
[wiki_1103] Character (computing); http://en.wikipedia.org/wiki/Character_(computing); on line 2013-12-06
[wiki_1104] Character (symbol); http://en.wikipedia.org/wiki/Character_(symbol); on line 2013-12-06
[wiki_1105] Glyph; http://whatis.techtarget.com/definition/glyph; on line 2013-12-06
[wiki_1106] Character encoding; http://en.wikipedia.org/wiki/Character_encoding; on line 2013-12-06
[wiki_1107] Computer font; http://en.wikipedia.org/wiki/Computer_font; on line 2013-12-06
[wiki_1108] Typeface; http://en.wikipedia.org/wiki/Typeface; on line 2013-12-06
[wiki_1109] Letterpress printing; http://en.wikipedia.org/wiki/Letterpress_printing; on line 2013-12-12
[wiki_1110] Typewriter; http://en.wikipedia.org/wiki/Typewriter; on line 2013-12-12
[wiki_1111] Pixel; http://en.wikipedia.org/wiki/Pixel; on line 2013-12-12
[wiki_1112] ANSI escape code; http://en.wikipedia.org/wiki/ANSI_escape_code; on line 2013-12-27
[wiki_1113] PostScript fonts; http://en.wikipedia.org/wiki/PostScript; on line 2013-12-27
[wiki_1114] TrueType; http://en.wikipedia.org/wiki/TrueType; on line 2013-12-27
[wiki_1115] OpenType; http://en.wikipedia.org/wiki/OpenType; on line 2013-12-27
[wiki_1116] Stroke (CJKV character); http://en.wikipedia.org/wiki/Stroke (CJKV character); on line 2013-12-27
[wiki_1117] Ideogram; http://en.wikipedia.org/wiki/Ideogram; on line 2013-12-27
[wiki_1118] ASCII; http://searchcio-midmarket.techtarget.com/definition/ASCII; on line 2013-12-27
[wiki_1119] Code page 437; http://en.wikipedia.org/wiki/Code_page_437; on line 2014-01-21
[wiki_1120] Code page 852; http://en.wikipedia.org/wiki/Code_page_852; on line 2014-01-21
[wiki_1121] Windows-1252; http://en.wikipedia.org/wiki/Windows-1252; on line 2014-01-21
[wiki_1122] Windows-1250; http://en.wikipedia.org/wiki/Windows-1250; on line 2014-01-21
[wiki_1123] ISO/IEC 8859-1; http://en.wikipedia.org/wiki/8859-1; on line 2014-01-21
[wiki_1124] ISO/IEC 8859-2; http://en.wikipedia.org/wiki/8859-2; on line 2014-01-21
[wiki_1125] C0 and C1 control codes; http://en.wikipedia.org/wiki/C0_and_C1_control_codes; on line 2014-01-24
[wiki_1126] Escape sequence; http://en.wikipedia.org/wiki/Escape_sequence; on line 2014-09-01
[wiki_1127] UTF-32; http://en.wikipedia.org/wiki/UTF32; on line 2014-01-25
[wiki_1128] UTF-16; http://en.wikipedia.org/wiki/UTF16; on line 2014-01-25
[wiki_1129] UTF-8; http://en.wikipedia.org/wiki/UTF-8; on line 2014-01-25
[wiki_1130] Byte order mark; http://en.wikipedia.org/wiki/Byte_order_mark; on line 2014-01-26
[wiki_1131] Telegraphy; http://en.wikipedia.org/wiki/Telegrafy; on line 2014-01-27
[wiki_1132] Baudot code; http://en.wikipedia.org/wiki/Baudot_code; on line 2014-01-27
[wiki_1133] Teleprinter; http://en.wikipedia.org/wiki/Teleprinter; on line 2014-01-27
[wiki_1135] Character (computing)#Character_encoding; http://en.wikipedia.org/wiki/Character_(computing)#Character_encoding; on line 2014-08-27
[wiki_1136] Newline; http://en.wikipedia.org/wiki/Line_feed; on line 2014-08-27
[wiki_1137] Esc key; http://en.wikipedia.org/wiki/Esc_key; on line 2014-08-28
[wiki_1138] Hayes command set; http://en.wikipedia.org/wiki/Hayes_command_set; on line 2014-08-28
[wiki_1139] ESC/P; http://en.wikipedia.org/wiki/ESC/P#Modern_printers; on line 2014-08-01
[wiki_1139] Code page; http://en.wikipedia.org/wiki/Code_pages; on line 2014-08-28
[wiki_1140] Windows code page; http://en.wikipedia.org/wiki/Windows_code_page; on line 2014-08-28
[wiki_1141] Whitespace character; http://en.wikipedia.org/wiki/Whitespace_character; on line 2014-08-31
[wiki_1142] Character encodings in HTML; http://en.wikipedia.org/wiki/Character_encodings_in_HTML; on line 2014-08-31
[wiki_1143] List of XML and HTML character entity references; http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references; on line 2014-08-31
[wiki_1144] Alt code; http://en.wikipedia.org/wiki/Alt_code; on line 2014-08-31
[wiki_1145] Kód Kamenických; http://cs.wikipedia.org/wiki/K%C3%B3d_Kamenick%C3%BDch; on line 2014-09-22
[wiki_1146] Plane (Unicode); http://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_Special-purpose_Plane; on line 2014-09-24

12 Finite state machine
The FSM - Finite State Machine is a mathematical model which can be explained on a simple
example. Let’s imagine a counter that counts the cars entering and leaving a garage. The
output of the counter corresponds to the number of cars in the garage. The counter is
incremented by each incoming car and decremented by each outgoing car. At the beginning, it
is necessary to clear the counter so that the counting starts from zero. From the technical
point of view, it is a digital system that has the input impulse signal AUTO and the input
level signals DO and RESET. When the signal DO is active, a car is entering the garage and
the counter is incremented by one. Vice versa, when the signal DO is inactive, a car is
leaving and the counter is decremented by one. The increment or decrement is performed on
the leading edge of the impulse AUTO. While the RESET signal is active, the counter is held
in reset regardless of the other signals: incoming or outgoing cars have no influence and
the counter stays at zero. When it is desirable to start the counting of cars, the signal
RESET is deactivated.
Time is used for the description of the counter behavior. Fig. 12-01 shows the behavior of
the counter in time for the situation when 3 cars enter the garage. In the timing waveform,
the interesting areas are those where the input signals have the same values, but the
output is different. This behavior cannot be described by a Boolean function, which defines
an unambiguous mapping of the input to the output: the same input must always give the same
output. Another variable must be used for the description of this behavior.
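[Figure: block symbol of the counter with the inputs AUTO, DO and RESET and the output
OUTPUT; timing waveform of AUTO, DO and RESET, where OUTPUT takes the values 0, 1, 2, 3 in
time]
Fig. 12-01 Counter of cars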
Note to the increment and decrement
The increment means to add one. The counter is incremented; it means that one is add-
ed to the contents of the counter.
The decrement means to subtract one. The counter is decremented; it means that one
is subtracted from the contents of the counter.
□
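The behavior of the car counter can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class name CarCounter, the method step and the
sampled form of the signals are hypothetical, and the count is not protected against going
below zero. The sketch shows that the same input combination gives different outputs,
because the output depends on the stored value.

```python
class CarCounter:
    """Sketch of the garage counter: OUTPUT is equal to the stored count."""

    def __init__(self):
        self.count = 0        # stored value of the counter
        self._prev_auto = 0   # previous level of AUTO, needed to find the leading edge

    def step(self, auto, do, reset):
        """Apply one sample of the inputs AUTO, DO, RESET and return OUTPUT."""
        if reset:
            self.count = 0                          # active RESET overrides other signals
        elif auto == 1 and self._prev_auto == 0:    # leading edge of the impulse AUTO
            self.count += 1 if do else -1           # DO active: increment, else decrement
        self._prev_auto = auto
        return self.count


counter = CarCounter()
# Samples of (AUTO, DO, RESET): reset first, then three cars enter the garage.
samples = [(0, 1, 1), (0, 1, 0), (1, 1, 0), (0, 1, 0),
           (1, 1, 0), (0, 1, 0), (1, 1, 0), (0, 1, 0)]
outputs = [counter.step(*sample) for sample in samples]
print(outputs)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The input sample (0, 1, 0) appears four times and the output is 0, 1, 2 and 3, which is
exactly the situation that a Boolean function cannot describe.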
Let’s consider how many sequences can be drawn for three cars in the garage. The number of
transitions between the counts 0, 1 and 2 is not limited, Fig. 12-02. Theoretically, it is
possible to draw an infinite number of sequences that end with three cars in the garage.
An exhaustive description of the counter behavior in time is therefore impossible.
[Figure: timing waveform of AUTO, DO and RESET, where OUTPUT takes the values 0, 1, 2, 1, 0,
1, 2, 3 in time]
Fig. 12-02 Another sequence of cars
Therefore, another variable is defined, which allows the exhaustive description. This
variable is the state and the abstract model is the finite state machine. The machine works
in time and two states are distinguished, the present state and the next state. The present
state contains all information needed to derive the behavior in the future. The next state
is the state that will occur in the future.
In our example, the output is equal to the present state. The output and the present state
correspond to the number of cars in the garage. This number accumulates all events in the
past; it is the result of all incoming and outgoing cars. It does not say how many cars
entered the garage or how many left it. The present state holds only the final information
about the incoming and outgoing cars. When it is desirable to know the numbers of incoming
and outgoing cars, other separate machines (counters) must be used. The next state is the
future, and the moment of the transition is given by the moment when a car goes into or out
of the garage. The direction of the car determines the value of the next state.
From the digital system point of view, the machine can have either synchronous or
asynchronous behavior. The synchronous behavior uses the clock signal CLK and all actions
of the synchronous Moore machine are performed on the edge of the clock. In our example,
the impulse signal AUTO plays the role of the clock.
For the design of a real system, it is necessary to define the maximum number of cars in
the garage. This limitation leads to a finite number of states. The FSM - Finite State
Machine is a mathematical model which is used for the description of many systems, for
example computers and their parts, communication protocols, vending machines, checks of the
syntax of programming languages (language parsing), artificial intelligence, etc. In
non-technical disciplines, FSM has been used for the description of neurological systems
and, in linguistics, for the description of the grammars of natural languages, [wiki_1201].
From the realization point of view, both hardware and software realizations are possible.
The hardware realization is known as a sequential logic circuit or as the control unit of a
digital system.
The FSM – Finite State Machine is a special case of the Turing machine, which has more
possibilities of modeling, [wiki_1201]. The limited number of states is the first
limitation. Literature [Black_2008] says that “FSM is the Turing machine where head only
reads and moves from left to the right”.
12.1 Discrete time

Discrete time is the opposite of continuous time, [wiki_1204]. The whole world uses
continuous time and physicists denote it by the small letter t. On the contrary, discrete
time is used by digital systems and in the calculations of digital signal processing.
Discrete time consists of points on the time axis that are denoted by a sequence of integer
numbers. In general, the discrete time is denoted by the small letter i. Then, the discrete
time i means the present, the expression i-1 means the past and the expression i+1 means
the future, Fig. 12-03.
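[Figure: time axis with the discrete points i-2, i-1, i, i+1, i+2; the points before i are
the past, the point i is the present (present state) and the points after i are the future
(next state)]
Fig. 12-03 Discrete time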

The discrete time in FSM is defined by any change on the inputs. Any change of input can
produce some actions, as the change of output or the change of state. The change of out-
put can be caused by the change of input, when the state is constant, and vice versa.
12.2 Definitions of finite state machine

The finite-state machine (FSM) or finite-state automaton, or simply the state machine, is a
mathematical model of computation. This abstract model describes a sequence of output
events which is generated on the basis of input events. FSM reflects the behavior of the
system in time and therefore the output can be a sequence of output events in time. The
definition of FSM is based on the concepts of input, output and state and on the
representations (mappings) between them.
The state, or the value of the state, is a suitable summary of information about the past.
On the basis of this summary, the future is derived in conjunction with the future input.
The finite state machine is in exactly one state at a given time. This state is called the
present state. The state which will occur in the future is called the next state. The
change from the present state to the next state is called the transition. The transition
occurs on the basis of an event on the input.
“The state of a sequential circuit is a collection of state variables whose values at any one time
contain all the information about the past necessary to account for the circuit's future behavior.”
Source: Herbert Hellerman's book on Digital Computer System Principles (McGraw-Hill, 1997).□
The theory of finite state machines distinguishes two definitions of FSM, named after
Mealy and Moore. More definitions can be found in literature; the following one is taken
from Wikipedia, [wiki_1202], [wiki_1203], and from [Fristacky_1986]. The two mathematical
definitions that follow have a lot in common. They differ in only one representation, the
output function, and therefore both definitions are given together.
Mealy machine
Mealy FSM is a 6-tuple (S, S0, Σ, Λ, T, G), where
 S is a finite set of states
 S0 is a start state (also called initial state), which is an element of S
 Σ is a finite set of inputs
 Λ is a finite set of outputs
 T is a transition function defined by the representation T: S × Σ → S. The transition
function maps the present state and the input to the next state
 G is an output function defined by the representation G: S × Σ → Λ. The output function
maps the present state and the input to the present output
Moore machine
Moore FSM is a 6-tuple (S, S0, Σ, Λ, T, G), where
 S is a finite set of states
 S0 is a start state (also called initial state), which is an element of S
 Σ is a finite set of inputs
 Λ is a finite set of outputs
 T is a transition function defined by the representation T: S × Σ → S. The transition
function maps the present state and the input to the next state
 G is an output function defined by the representation G: S → Λ. The output function
maps the present state to the present output
To understand the representation of the transition function of FSM better, it is suitable
to write the discrete time as a superscript (upper index). Then, the transition function
with the denoted discrete time is given by formula (1201).
T: S^i × Σ^i → S^{i+1}    (1201)
Where
 T is the name of the representation and also of the transition function.
 S^i is the set of states at the present time.
 Σ^i is the set of the present input events.
 S^{i+1} is the set of states at the future time.
The representation T is the transition function, which is defined in time, from the present
to the future. The set of states is mapped into itself, which is only possible in time. In
detail, the present state and the present input generate the next state. The new value of
the state occurs at discrete time i+1. It means that the present state is unchanged between
discrete times i and i+1. The realization of FSM must therefore contain an element for
storing the present state. More information is in literature [Fristacky_1986],
[Katz_Borriello_2005], [Roth_2004], [Warkley_2006] and [Divis_2008].
The representation G is the output function. The mapping is defined between different
sets. Time is not needed; the representation works in the present. Mealy and Moore machines
have different definitions of the output function G:
 Mealy machine maps the present state and the present input to the present output.
 Moore machine maps only the present state to the present output. The simplest
representation is the situation when the present output is equal to the present state.
A typical example is a counter.
Mealy G: S × Σ → Λ    (1202)
Moore G: S → Λ    (1203)
Where
 G is the name of the representation and also of the output function.
 S is the set of states in the present.
 Σ is the set of inputs in the present.
 Λ is the set of outputs in the present.
The output of Mealy machine can change without a change of the present state: the
definition of the output function makes it possible to generate a new output when only the
input is changed. A new output of Moore machine can arise only from a change of the state.
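The two definitions can be compared in a short Python sketch. It is not taken from the
cited literature; the functions run_mealy and run_moore and the parity example are
hypothetical illustrations. The functions T and G are stored as dictionaries, so the
representations T: S × Σ → S, G: S × Σ → Λ (Mealy) and G: S → Λ (Moore) are visible directly
in the keys.

```python
def run_mealy(s0, T, G, inputs):
    """Mealy machine: the output is read from G[(state, input)]."""
    state, outputs = s0, []
    for x in inputs:
        outputs.append(G[(state, x)])   # G: S x Sigma -> Lambda
        state = T[(state, x)]           # T: S x Sigma -> S
    return outputs


def run_moore(s0, T, G, inputs):
    """Moore machine: the output is read from G[state] only."""
    state, outputs = s0, [G[s0]]        # the initial state already has an output
    for x in inputs:
        state = T[(state, x)]           # T: S x Sigma -> S
        outputs.append(G[state])        # G: S -> Lambda
    return outputs


# Hypothetical example: parity of the number of 1s received so far.
T = {("even", 0): "even", ("even", 1): "odd",
     ("odd", 0): "odd", ("odd", 1): "even"}
G_MOORE = {"even": 0, "odd": 1}
# Mealy output: the parity that is valid after the present input is included.
G_MEALY = {(s, x): G_MOORE[T[(s, x)]] for (s, x) in T}

print(run_mealy("even", T, G_MEALY, [1, 0, 1, 1]))  # [1, 1, 0, 1]
print(run_moore("even", T, G_MOORE, [1, 0, 1, 1]))  # [0, 1, 1, 0, 1]
```

In this sketch the Moore machine produces one more output value, the output of the initial
state, because its output does not wait for any input.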
12.3 Synchronous and asynchronous machine

The transition function of FSM is defined in time, and the present state must be stored for
correct behavior. The transition between the states of FSM is derived from the event or
condition on the input. According to the definition of the input event or condition, the FSM
has the synchronous or asynchronous behavior. The behavior has the influence on the defi-
nition of the transition function T.
The synchronous finite state machine uses a special input, from which it derives the transi-
tion between the states. This input is called a clock - CLK, [wiki_1205]. The clock ensures the
synchronization and the discrete points for transitions of states are either the leading edge
or the trailing edge or both edges.
The asynchronous finite state machine does not use the clock input, [wiki_1206]. The asyn-
chronous finite state machine derives the transition between the states on basis of any
input event. Then, the discrete points are any changes of inputs. The definition of the asyn-
chronous transition function is more demanding than that of the synchronous one. The
asynchronous definition of the transition function must eliminate possible infinite cycles of
the states. More information is in literature [Fristacky_1986]. Therefore, the synchronous
finite state machines are preferred for the hardware realization.
The output function G works only in the present, and the synchronous or asynchronous
behavior has no influence on it. Note only that a new output of Mealy machine can be
derived from a new input during one present state. This is impossible in the case of Moore
machine.
12.4 Block diagram of synchronous FSM

The block diagram of the synchronous FSM is a base for the realization, understanding and
description of FSM. The above mathematical definitions of the finite state machine
determine the realization. The transition and output functions are representations which
correspond to the definition of a Boolean function. The hardware realization of a Boolean
function is a combinational logic circuit. FSM works in time, in the present and the
future. The present state is unchanged between the transitions of states; therefore the
present state is stored in a register. The definitions of Mealy and Moore machine differ
only in the output function G. The transition functions T are the same. The block diagrams
of Mealy and Moore machine are therefore very similar. Mealy synchronous machine is in
Fig. 12-04, [wiki_1203], and Moore synchronous machine is in Fig. 12-05, [wiki_1202].
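[Figure: Mealy FSM. The transition function T has the inputs “Input (present)” and “Present
state” and produces “Next state”, which is stored in the state register with the inputs
Clock (CLK) and Initial state (Init, CLR). The output function G has the inputs “Present
state” and “Input (present)” and produces “Output (present)”.]
Fig. 12-04 Block diagram of Mealy FSM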
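[Figure: Moore FSM. The transition function T has the inputs “Input (present)” and “Present
state” and produces “Next state”, which is stored in the state register with the inputs
Clock (CLK) and Initial state (Init, CLR). The output function G has only the input
“Present state” and produces “Output (present)”. Remark in the figure: the blue line from
Mealy machine is missing.]
Fig. 12-05 Block diagram of Moore FSM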
The block diagrams of Mealy and Moore machine are very alike and they only differ in the
inputs of the output function. The realization of FSM has three basic blocks:
 State register stores the value of the present state. The register is synchronized by
the clock signal “Clock”. The signal “Initial state” is used for setting the initial state
on the output of the state register. This action can be synchronous or asynchronous. The
input of the state register is “Next state”, the output of the transition function.
 Transition function T corresponds to the representation which is given by formula
(1201). The output of this block is “Next state”, which is calculated on the basis of
“Input” and “Present state”. From the point of view of hardware realization, it is a
combinational circuit.
 Output function G corresponds to the representation which is given by formulas
(1202) or (1203). This block calculates the “Output”. From the point of view of
hardware realization, it is a combinational circuit. The output is defined in different
ways, according to the type of FSM:
 For Mealy machine, the representation of the output function is given by
formula (1202). The output is a function of “Present state” and “Input”.
 For Moore machine, the representation of the output function is given by
formula (1203). The output is a function of “Present state” only.
Note to the output function of Moore machine
If the output is equal to the present state, then it is the simplest output function of
Moore machine. A typical example is a counter. □
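The three blocks can also be mirrored in software. The following class is a minimal sketch,
assuming Python 3; the names SynchronousFSM, init, clock and output are hypothetical and
the modulo-4 counter is only an illustrative Moore machine whose output is equal to the
present state.

```python
class SynchronousFSM:
    """Software image of the block diagram: register, function T and function G."""

    def __init__(self, initial_state, transition, output_fn, mealy):
        self.initial_state = initial_state
        self.transition = transition    # T(present state, input) -> next state
        self.output_fn = output_fn      # Mealy: G(state, input); Moore: G(state)
        self.mealy = mealy
        self.register = initial_state   # state register holding the present state

    def init(self):
        """Signal "Initial state": load the initial state into the register."""
        self.register = self.initial_state

    def output(self, inp):
        """Output function G, a pure (combinational) function."""
        if self.mealy:
            return self.output_fn(self.register, inp)
        return self.output_fn(self.register)

    def clock(self, inp):
        """Active clock edge: the register takes over the next state."""
        self.register = self.transition(self.register, inp)


# Illustrative Moore machine: modulo-4 counter with an enable input.
counter = SynchronousFSM(
    initial_state=0,
    transition=lambda s, en: (s + 1) % 4 if en else s,
    output_fn=lambda s: s,              # the simplest output function: output = state
    mealy=False,
)
seen = []
for en in [1, 1, 0, 1, 1]:
    counter.clock(en)
    seen.append(counter.output(en))
print(seen)  # [1, 2, 2, 3, 0]
```

Only the method clock changes the register, which corresponds to the fact that the present
state is unchanged between the clock edges.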
12.5 Description of FSM behavior

The description of FSM behavior is given by the mathematical definition of the
representations (1201) and (1202) or (1203). The oldest forms of the description of FSM
behavior are the table and the state diagram. More modern forms are Petri nets, the UML
description or a description in a programming language. In the following text, the state
diagram is considered to be the basic form of description.
The state diagram is an oriented (directed) graph where nodes correspond to the states and
oriented edges represent the transition function, Fig. 12-06. Each oriented edge has an
arrow and a condition under which the edge is used for the transition between the states.
More information about oriented graphs is in [wiki_1209] and [wiki_1210].
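[Figure: state diagram with the states “I am home” and “I am calling”. The signal
“Beginning” leads to the state “I am home”; the edge “Telephone rings” leads from “I am
home” to “I am calling” and the edge “No ring” leads from “I am home” back to “I am home”.]
Fig. 12-06 State and transition function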
The names of states can correspond to a real situation. Fig. 12-06 shows the example of
picking up the telephone. The initial state is “I am home”. If somebody calls me, it is the
input “Telephone rings”; I change the state and pass to the next state “I am calling”. When
the telephone does not ring, the next state is “I am home”; I do not change the state. The
above sentences describe the transition function.
The state “I am home” is the initial state, from which FSM begins all activities and from
which it is meaningful to study the behavior of FSM. The initial state is entered by
activating the signal “Beginning”, regardless of the present state or other activities.
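[Figure: state diagram of Mealy machine with the states “I am home” and “I am calling”. The
signal “Beginning” leads to “I am home”; the edge “Telephone rings / Pick up” leads from “I
am home” to “I am calling” and the edge “No ring / Study” leads from “I am home” back to “I
am home”. The slash “/” is a separator between input and output.]
Fig. 12-07 State diagram of Mealy machine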
The output function has two definitions; the first one is for the output of Mealy machine,
Fig. 12-07. The Mealy output function depends on the input and the present state. The
output is written on each edge, separated by a slash “/” from the input. The output
function is expressed by the actions “Pick up” the phone and “Study”. The output action
“Pick up” is generated when FSM is in the present state “I am home” and the input is
“Telephone rings”. The output action “Study” is generated when FSM is in the present state
“I am home” and the input is “No ring”. The description of the transition and output
functions can be connected into one sentence:
 When FSM is in the present state “I am home” and the input is “Telephone rings”, then
the output is “Pick up” and FSM passes to the next state “I am calling”.
 When FSM is in the present state “I am home” and the input is “No ring”, then the
output is “Study” and FSM passes to the next state “I am home”.
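The two sentences can be written as one lookup table, where a pair (present state, input)
gives a pair (output, next state). The following Python lines are only a sketch; the names
MEALY_PHONE and mealy_step are hypothetical. The edges leaving the state “I am calling” are
not described in the text, and therefore they are not modeled.

```python
# Edges of the Mealy state diagram: (present state, input) -> (output, next state).
MEALY_PHONE = {
    ("I am home", "Telephone rings"): ("Pick up", "I am calling"),
    ("I am home", "No ring"): ("Study", "I am home"),
}


def mealy_step(state, event):
    """Return the output and the next state for one input event."""
    return MEALY_PHONE[(state, event)]


state = "I am home"  # initial state
for event in ["No ring", "No ring", "Telephone rings"]:
    action, state = mealy_step(state, event)
    print(f"input {event!r}: output {action!r}, next state {state!r}")
```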
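[Figure: state diagram of Moore machine with the states “I am home / Study”, “What to do /
Pick up” and “I am calling”. The signal “Beginning” leads to “I am home”; the edge
“Telephone rings” leads from “I am home” to “What to do”, the edge “No ring” leads from “I
am home” back to “I am home”, and an edge without an input leads from “What to do” to “I am
calling”. The slash “/” is a separator between state and output.]
Fig. 12-08 State diagram of Moore machine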
In the case of Moore machine, the output function depends only on the present state. The
output is written in each state, separated by a slash “/” from the name of the state. The
output function is expressed by the actions “Pick up” the phone and “Study”. The output
action “Pick up” is generated when FSM is in the present state “What to do”. The output
action “Study” is generated when FSM is in the present state “I am home”. The description
of the transition and output functions can be connected into one sentence:
 FSM is in the present state “I am home” and generates the output “Study”; when the
input is “Telephone rings”, FSM passes to the next state “What to do”.
 FSM is in the present state “I am home” and generates the output “Study”; when the
input is “No ring”, FSM passes to the next state “I am home”.
 Etc.
Mealy and Moore machine cannot be transformed into each other simply by moving the outputs
between edges and states; such a transformation generally changes the set of states. At the
beginning of designing, it is therefore necessary to choose either Mealy or Moore machine.
Mealy machine, Fig. 12-07, and Moore machine, Fig. 12-08, do not have the same number of
states, even if they describe the same problem. Moore machine needs three states to reach
the same state “I am calling”. The transition from the state “What to do” to the state “I
am calling” is without an input.
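The Moore variant of the telephone example needs two tables, one for the outputs of the
states and one for the transitions. The sketch below is an illustration only; the names
MOORE_OUTPUT, MOORE_NEXT and moore_run are hypothetical. Two assumptions are made: the value
None stands for the transition “without an input”, and the state “I am calling” has no
output, because the text does not give one.

```python
# Output function G: S -> Lambda (None: no output is given for this state).
MOORE_OUTPUT = {
    "I am home": "Study",
    "What to do": "Pick up",
    "I am calling": None,
}

# Transition function T: (present state, input) -> next state.
MOORE_NEXT = {
    ("I am home", "Telephone rings"): "What to do",
    ("I am home", "No ring"): "I am home",
    ("What to do", None): "I am calling",   # assumption: None means "without an input"
}


def moore_run(state, events):
    """Return the list of (state, output) pairs visited for the given inputs."""
    trace = [(state, MOORE_OUTPUT[state])]
    for event in events:
        state = MOORE_NEXT[(state, event)]
        trace.append((state, MOORE_OUTPUT[state]))
    return trace


for visited, action in moore_run("I am home", ["No ring", "Telephone rings", None]):
    print(f"state {visited!r}: output {action!r}")
```

The output table has three entries while the Mealy description of the same problem has two
states, which illustrates the different number of states mentioned above.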
12.6 Examples of finite state machine

Almost all current data transfers are serial, and therefore the task is to design a serial
binary adder. There are two serial binary input streams, denoted as “a” and “b”. The output
is a serial stream “s”, which is the sum of the input streams “a” and “b”. There is an
explicit condition that the addition is always performed starting from the LSB (least
significant bit).
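[Figure: state diagram with the states “No carry” (initial state, entered by Init) and
“Carry”; edges are labeled “input ab / output s”. “No carry” to “No carry”: 00/0 (edge 1),
01/1, 10/1. “No carry” to “Carry”: 11/0 (edge 2). “Carry” to “Carry”: 01/0, 10/0, 11/1
(edge 3). “Carry” to “No carry”: 00/1. The green numbers in the figure are remarks marking
the edges.]
Fig. 12-09 State diagram of the serial adder as Mealy machine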
The state diagram contains two states which carry the meaning of the carry. The initial
state is “No carry” and the second state “Carry” means that the carry has been generated.
The state diagram is fully understood and designed by means of sentences, which must be
assembled for all combinations of the inputs “a”, “b” and the carry. The sentences are:
 FSM is in the current state “No carry”; when the input combination is “a=0” and “b=0”,
FSM generates the output “0” and passes to the next state “No carry”. This sentence
corresponds to the edge 1. It is the addition “a + b + cin = 0 + 0 + 0 = 0”, all in binary,
where cin is the carry in.
 FSM is in the current state “No carry”; when the input combination is “a=1” and “b=1”,
FSM generates the output “0” and passes to the next state “Carry”. This sentence
corresponds to the edge 2. It is the addition “a + b + cin = 1 + 1 + 0 = 10”, all in
binary, where cin is the carry in. The output is “0” and the carry to the next bit position
is “1”.
 FSM is in the current state “Carry”; when the input combination is “a=1” and “b=1”,
FSM generates the output “1” and passes to the next state “Carry”. This sentence
corresponds to the edge 3. It is the addition “a + b + cin = 1 + 1 + 1 = 11”, all in
binary, where cin is the carry in. The output is “1” and the carry to the next bit position
is “1”.
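All sentences of this kind follow from the binary addition a + b + cin, so the whole Mealy
machine can be sketched in Python by computing the sum bit as the output and the carry as
the next state. The function name serial_add_mealy and the representation of the streams as
lists of bits, LSB first, are assumptions of this illustration.

```python
def serial_add_mealy(a_bits, b_bits):
    """Add two equally long bit streams, LSB first; the state is the carry."""
    carry = 0                       # initial state "No carry"
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        total = a + b + carry       # a + b + cin
        s_bits.append(total % 2)    # output written on the edge
        carry = total // 2          # next state: 0 = "No carry", 1 = "Carry"
    return s_bits, carry


# 6 + 7 = 13; LSB first: 6 = [0, 1, 1, 0], 7 = [1, 1, 1, 0], 13 = [1, 0, 1, 1]
print(serial_add_mealy([0, 1, 1, 0], [1, 1, 1, 0]))  # ([1, 0, 1, 1], 0)
```

The second returned value is the final state; the value 1 would mean that the carry from the
highest bit position has not been sent to the output stream yet.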
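[Figure: state diagram with four states written as “name of state / output”: “No carry
with 0 / 0” (initial state, entered by Init), “No carry with 1 / 1”, “Carry with 0 / 0” and
“Carry with 1 / 1”. Edges are labeled with the input value “ab”, the condition of the
transition. The edges marked by green remarks are: edge 1 from “No carry with 0” to “No
carry with 1” on 01 or 10, edge 2 from “Carry with 0” to “No carry with 1” on 00, edge 3
from “Carry with 0” to “Carry with 1” on 11.]
Fig. 12-10 State diagram of the serial adder as Moore machine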
The following machine is Moore machine with the same task. It is necessary to realize that
the carry can be generated together with the output equal to 0 or 1. This leads to four
combinations, and therefore Moore machine has four states. The names of the states are the
concatenation of the carry, yes or no, with the output value 0 or 1. The initial state is
“No carry with 0”; it means that the carry was not generated and the output is equal to
zero. The Moore state diagram cannot be obtained from the Mealy state diagram by a simple
relabeling.
The state diagram is fully understood and designed by means of sentences, which must be
assembled for all combinations of the inputs “a”, “b” in each state. The sentences are:
 FSM is in the current state “No carry with 0” and generates the output “0”; when the
input combination of the pair “ab” is “01” or “10”, FSM passes to the next state “No carry
with 1”. This sentence corresponds to the edge 1. It is the addition “a + b + cin = 0 + 1 +
0 = 1” or “a + b + cin = 1 + 0 + 0 = 1”, all in binary, where cin is the carry in.
 FSM is in the current state “Carry with 0” and generates the output “0”; when the input
combination of the pair “ab” is “00”, FSM passes to the next state “No carry with 1”. This
sentence corresponds to the edge 2. It is the addition “a + b + cin = 0 + 0 + 1 = 1”, all
in binary, where cin is the carry in.
 FSM is in the current state “Carry with 0” and generates the output “0”; when the input
combination of the pair “ab” is “11”, FSM passes to the next state “Carry with 1”. This
sentence corresponds to the edge 3. It is the addition “a + b + cin = 1 + 1 + 1 = 11”, all
in binary, where cin is the carry in.
These sentences are a pattern showing how to read a state graph and how to explain it. In
the design of a state graph, it is necessary to define the nodes and also the meaning of
each node. Then the proposed edges make sense and explain the behavior of the finite state
machine.
12.7 Table notation of FSM

The table notation of the behavior of a finite state machine is suitable for machines with
few states and it is important for a manual design of sequential circuits. The table notation
of FSM consists of two tables, the first one is the state transition table and the second one
is the output table. The state transition table corresponds to the transition function T,
[wiki_1211]. The output table corresponds to the output function G and it is the truth table
of a Boolean function. In practice, both tables are drawn as one table with two parts, a
transition part and an output part. Fig. 12-11 shows the table for the Mealy machine with
these parts marked.
                 State transition table            Output table
                 Inputs and their combinations     Inputs and their combinations
Current states   00    01    10    11              00    01    10    11
No carry         Area for the next states, which   Area for outputs, which are the
Carry            are the function of the current   function of the current states
                 states and inputs                 and inputs
Fig. 12-11 State and output tables of Mealy machine
Rows of the table correspond to the current states and columns correspond to the
combinations of inputs. The first part is the state transition table, where the area marked
green in the figure is reserved for the notation of the next states. The second part is the
table of the output function, and the output values are written in the area marked blue.
The initial state is not marked in a special way; typically, it is the current state in the
first row.
[State diagram of the Mealy serial adder: states “No carry” (initial state, marked Init) and
“Carry”; each edge is labeled input combination/output.]

                 Input combinations (next state)              Input combinations (output)
Current state    00         01         10         11          00   01   10   11
No carry         No carry   No carry   No carry   Carry       0    1    1    0
Carry            No carry   Carry      Carry      Carry       1    0    0    1
Fig. 12-12 Transcription of the state diagram to table notation for Mealy machine
Fig. 12-12 shows the full transcription of the state diagram to the table notation. It is the
Mealy machine of the serial adder, Fig. 12-09. The table for the Moore machine is simpler
because the output is a function of the current state only, Fig. 12-13.
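The two parts of the Mealy table in Fig. 12-12 can be stored directly as two arrays. The
following C sketch is only an illustration under stated assumptions: the names
“mealy_next”, “mealy_out” and “mealy_serial_add” are hypothetical and do not come from
the textbook, and the operands are assumed to arrive least significant bit first.

```c
#include <assert.h>
#include <stddef.h>

/* States of the Mealy serial adder, in the row order of Fig. 12-12. */
enum adder_state { NO_CARRY = 0, CARRY = 1 };

/* State transition table: mealy_next[current state][input combination "ab"]. */
static const enum adder_state mealy_next[2][4] = {
    /* 00        01        10        11    */
    { NO_CARRY, NO_CARRY, NO_CARRY, CARRY },   /* No carry */
    { NO_CARRY, CARRY,    CARRY,    CARRY },   /* Carry    */
};

/* Output table: mealy_out[current state][input combination "ab"]. */
static const unsigned char mealy_out[2][4] = {
    { 0, 1, 1, 0 },   /* No carry */
    { 1, 0, 0, 1 },   /* Carry    */
};

/* Adds two n-bit numbers given as arrays of bits, least significant bit first.
   Writes n sum bits to s and returns the final carry (0 or 1). */
static unsigned mealy_serial_add(const unsigned char *a, const unsigned char *b,
                                 unsigned char *s, size_t n)
{
    enum adder_state state = NO_CARRY;              /* initial state */
    for (size_t i = 0; i < n; i++) {
        assert(a[i] <= 1 && b[i] <= 1);
        unsigned in = (unsigned)((a[i] << 1) | b[i]);   /* input combination "ab" */
        s[i] = mealy_out[state][in];                /* output function G */
        state = mealy_next[state][in];              /* transition function T */
    }
    return state == CARRY;
}
```

Each loop pass corresponds to one clock period: the output is read from the row of the
current state before the state register takes the next state.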
[State diagram of the Moore serial adder: states “No carry with 0” (initial state, marked
Init), “No carry with 1”, “Carry with 0” and “Carry with 1”; each node is labeled
state/output and each edge is labeled with input combinations.]

                   Input combinations (next state)                                        Output
Current state      00                01                10                11
No carry with 0    No carry with 0   No carry with 1   No carry with 1   Carry with 0     0
No carry with 1    No carry with 0   No carry with 1   No carry with 1   Carry with 0     1
Carry with 0       No carry with 1   Carry with 0      Carry with 0      Carry with 1     0
Carry with 1       No carry with 1   Carry with 0      Carry with 0      Carry with 1     1
Fig. 12-13 Transcription of the state diagram to table notation for Moore machine
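The Moore table in Fig. 12-13 can be written in the same way, with the difference that the
output table has a single column indexed by the state only. The sketch below is an
assumption-based illustration; the names “moore_next”, “moore_out” and
“moore_serial_add” are hypothetical. The sum bit is read from the state reached after the
clock edge, which is why it is taken after the state update.

```c
#include <assert.h>
#include <stddef.h>

/* States in the row order of Fig. 12-13: carry status combined with the output. */
enum moore_state { NC0, NC1, C0, C1 };

/* State transition table: moore_next[current state][input combination "ab"]. */
static const enum moore_state moore_next[4][4] = {
    /* 00   01   10   11 */
    { NC0, NC1, NC1, C0 },   /* No carry with 0 */
    { NC0, NC1, NC1, C0 },   /* No carry with 1 */
    { NC1, C0,  C0,  C1 },   /* Carry with 0    */
    { NC1, C0,  C0,  C1 },   /* Carry with 1    */
};

/* Output table: the output is a function of the current state only. */
static const unsigned char moore_out[4] = { 0, 1, 0, 1 };

/* Adds two n-bit numbers given as arrays of bits, least significant bit first.
   Writes n sum bits to s and returns the final carry (0 or 1). */
static unsigned moore_serial_add(const unsigned char *a, const unsigned char *b,
                                 unsigned char *s, size_t n)
{
    enum moore_state state = NC0;                   /* initial state */
    for (size_t i = 0; i < n; i++) {
        assert(a[i] <= 1 && b[i] <= 1);
        unsigned in = (unsigned)((a[i] << 1) | b[i]);
        state = moore_next[state][in];              /* clock edge: state register */
        s[i] = moore_out[state];                    /* output of the new state */
    }
    return state == C0 || state == C1;
}
```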
12.8 Synchronization of input

A finite state machine is connected to its surrounding environment, which is asynchronous
to the FSM, Fig. 12-14. The surrounding environment does not know the clock and it provides
the inputs for the FSM at arbitrary times, regardless of the FSM clock. It is an asynchronous
connection, which can cause an incorrect behavior of the FSM.
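[Block diagram, upper part: Real world -> asynchronous inputs -> FSM -> synchronous output,
with the generator of clock connected to the FSM. Lower part: Real world -> asynchronous
inputs -> Sampling register -> synchronous inputs -> FSM -> synchronous output, with the
generator of clock (CLK) connected to the sampling register and the FSM.]
Fig. 12-14 Synchronization of FSM inputs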
It is advisable to synchronize the asynchronous inputs by sampling. The sampling register
then provides the synchronized inputs. This synchronization solves timing problems in the
realization.
In a hardware realization, non-synchronized inputs of FSM can cause a metastable behavior
of the state register. The metastability of the state register means that the output of the
state register can be undefined during a random length of time. After this time, the state
register passes to a random state. The problem of metastability is explained in literature
[TI_1997], [Zdralek_2006], [Zdralek_2008] and [wiki_1207].
In software realization, non-synchronized inputs of FSM can cause the situation in which
the transition function uses two different values of input in one calculation. This situation
could lead to a wrong next state. The sampling register solves this situation and it ensures
the constant input during the calculation of the transition and output functions.
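The software case can be sketched in C as follows. This is a minimal illustration, not a
definitive implementation: the structure “fsm_inputs”, the variable “live_inputs” and the
functions “sample_inputs” and “adder_step” are hypothetical names. It is assumed that the
environment (for example an interrupt routine) may change “live_inputs” at any time, so the
FSM copies the inputs once per cycle and computes only from the copy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical asynchronous inputs, written by the environment at any time. */
struct fsm_inputs { bool a; bool b; };
static volatile struct fsm_inputs live_inputs;

/* Sampling register: every input is read exactly once per FSM cycle. */
static struct fsm_inputs sample_inputs(void)
{
    struct fsm_inputs snapshot;
    snapshot.a = live_inputs.a;
    snapshot.b = live_inputs.b;
    return snapshot;
}

/* One cycle of the Mealy serial adder. The transition and output functions
   use only the sampled values, never live_inputs directly. */
static bool adder_step(bool *carry, struct fsm_inputs in)
{
    assert(carry != NULL);
    bool sum = in.a ^ in.b ^ *carry;                           /* output function */
    *carry = (in.a && in.b) || (*carry && (in.a || in.b));     /* transition function */
    return sum;
}
```

A typical cycle calls “sample_inputs” once and passes the result to “adder_step”. In a real
system the copy itself may also need protection, for example by briefly disabling the
interrupt that writes the inputs.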
12.9 Notation in programming languages

The notation of a finite state machine in a programming language is shown on an example.
RFC 793 defines the TCP protocol, and its behavior is described by the state diagram, Fig.
12-15. This state diagram is implemented in all nodes of the network which use the TCP/IP
protocol. Fig. 12-15 shows only the part of the state diagram that is responsible for
establishing the session between two nodes, a transmitter and a receiver. The session can
be initialized by either side, and this part of the state diagram takes this fact into
account. The state diagram does not solve error situations. All the activities start in the
present state “LISTEN” of the node. The first possibility of establishing the session is
that the node receives “rcv_SYN” and then sends the flags “snd_SYN” and “snd_ACK” to the
opposite side. The node goes to the state “SYN RCVD”. This is the situation when the
session is established by the opposite side.
                                V     |                      \   \
                             1+---------+            CLOSE    |    \
                              |  LISTEN |          ---------- |     |
                              +---------+          delete TCB |     |
                   rcv SYN      |     |     SEND              |     |
                  -----------   |     |    -------            |     V
 +---------+   2  snd SYN,ACK  /       \   snd SYN          +---------+
 |         |<-----------------           ------------------>|         |
 |   SYN   |                    rcv SYN                     |   SYN   |
 |   RCVD  |<-----------------------------------------------|   SENT  |
 |         |                    snd ACK                     |         |
 |         |------------------           -------------------|         |
 +---------+   rcv ACK of SYN  \       /  rcv SYN,ACK       +---------+
   |           --------------   |     |   -----------
   |                  x         |     |     snd ACK
   |                            V     V
   |  CLOSE                   +---------+
   | -------                  |  ESTAB  |
   | snd FIN                  +---------+
   |                   CLOSE    |     |    rcv FIN        Source: RFC 793
Fig. 12-15 Establishing of the session in TCP protocol
The second possibility is that an application requires establishing the session with the
destination node. The node in the state “LISTEN” receives the command “SEND” from the
application. After that, the node sends the flag “snd_SYN” to the destination and it goes
to the state “SYN_SENT”, where it waits for “rcv_SYN” from the destination.
:
05 enum state_type {LISTEN, SYN_RECEIVED, SYN_SENT, ESTABLISH};
06 enum state_type Current_state;
07 enum state_type Next_state;
08
09 bool SEND, INIT = false;
10 bool rcv_SYN, rcv_SYN_and_ACK, rcv_ACK_of_SYN = false;
11 bool snd_SYN, snd_SYN_and_ACK, snd_ACK;
12
13 int main()
14 {
15 void input (bool *inSYN, bool *inSYN_ACK, bool *inACK_of_SYN,
16 bool *inSEND, bool *inINIT);
17
18 Current_state = LISTEN;
19 Next_state = LISTEN;
20
21 while (true)
22 { snd_SYN = false; snd_SYN_and_ACK = false; snd_ACK = false;
23 rcv_SYN=false; rcv_SYN_and_ACK=false; rcv_ACK_of_SYN=false;
24 SEND=false; INIT=false;
25
26 // sampling of inputs
27 input(&rcv_SYN, &rcv_SYN_and_ACK, &rcv_ACK_of_SYN, &SEND, &INIT);
28
29 if (INIT) { Current_state = LISTEN;continue;
30 } // initialization of FSM
31
32 switch (Current_state) // transition and output functions
33 {case LISTEN:
34 if (rcv_SYN) {snd_SYN_and_ACK = true;
35 Next_state = SYN_RECEIVED;}
36 else if (SEND) {snd_SYN = true;
37 Next_state = SYN_SENT;}
38 else Next_state = LISTEN; break;
39
40 case SYN_RECEIVED:
41 if (rcv_ACK_of_SYN) Next_state = ESTABLISH;
42 else Next_state = SYN_RECEIVED; break;
43
44 case SYN_SENT:
45 if (rcv_SYN) {snd_ACK = true;
46 Next_state = SYN_RECEIVED;}
47 else if (rcv_SYN_and_ACK) {snd_ACK = true;
48 Next_state = ESTABLISH;}
49 else Next_state = SYN_SENT; break;
50
51 case ESTABLISH: break;
52 } // end of transition and output functions
53
54 Current_state = Next_state; // state register
55 }
56 return 0;
57 }
Fig. 12-16 Listing of FSM of TCP protocol
Fig. 12-16 shows the description of the state diagram from Fig. 12-15 in the C programming
language. The program uses the same names of states as the state diagram. These names are
declared as an enumeration type in row 5 of the listing. The variables “Current_state” and
“Next_state” have values corresponding to the names of states. The actual program begins in
row 21 with an infinite loop. In row 27, the function “input” is called to sample the
inputs. After that, all inputs of FSM are constant during the following calculation. In row
32, the switch statement on the variable “Current_state” begins the description of the
transition and output functions. In the individual cases of the current state, the next
state is assigned and the outputs are set. At the end, in row 54, the value of the next
state is written to the state register, and it becomes the current state for the next
cycle.
Two situations are marked in the state diagram in Fig. 12-15 and they are described by rows
from 32 to 38 in listing, Fig. 12-16. The following sentences are a basic model for the
description and the explanation of the state diagram in Fig. 12-15. They are:
• Situation 1 and rows 33 and 38. FSM is in the present state “LISTEN”, and the input
is neither “rcv_SYN” nor “SEND”, then no output is generated and FSM goes to the
next state “LISTEN”.
• Situation 2 and rows 33, 34 and 35. FSM is in the present state “LISTEN”, and the
input “rcv_SYN” is received, then the output “snd SYN, ACK” is activated and FSM
goes to the next state “SYN RCVD”.
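The same transition and output functions can also be written as one function without global
variables, which makes single situations easy to check. The following C sketch is only an
illustration: the structures “tcp_in” and “tcp_out” and the function “tcp_step” are
hypothetical names, while the states and signals are taken from the listing in Fig. 12-16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tcp_state { LISTEN, SYN_RECEIVED, SYN_SENT, ESTABLISH };

/* Sampled inputs and generated outputs of one FSM cycle. */
struct tcp_in  { bool rcv_SYN, rcv_SYN_and_ACK, rcv_ACK_of_SYN, SEND; };
struct tcp_out { bool snd_SYN, snd_SYN_and_ACK, snd_ACK; };

/* Transition and output functions: returns the next state and fills *out. */
static enum tcp_state tcp_step(enum tcp_state current, struct tcp_in in,
                               struct tcp_out *out)
{
    assert(out != NULL);
    *out = (struct tcp_out){ false, false, false };    /* no output by default */

    switch (current) {
    case LISTEN:
        if (in.rcv_SYN) { out->snd_SYN_and_ACK = true; return SYN_RECEIVED; }
        if (in.SEND)    { out->snd_SYN = true;         return SYN_SENT; }
        return LISTEN;
    case SYN_RECEIVED:
        return in.rcv_ACK_of_SYN ? ESTABLISH : SYN_RECEIVED;
    case SYN_SENT:
        if (in.rcv_SYN)         { out->snd_ACK = true; return SYN_RECEIVED; }
        if (in.rcv_SYN_and_ACK) { out->snd_ACK = true; return ESTABLISH; }
        return SYN_SENT;
    case ESTABLISH:
        return ESTABLISH;
    }
    return current;    /* not reached for valid states */
}
```

Situation 1 then corresponds to a call with all inputs false in the state “LISTEN”, and
situation 2 to a call with “rcv_SYN” true in the same state.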
12.10 References
Literature
[Black_2008] Black, Paul E. (12 May 2008). "Finite State Machine". Dictionary of Algorithms
and Data Structures (U.S. National Institute of Standards and Technology).
[Divis_2008] Zdeněk Diviš, Zdeňka Chmelíková, Jaroslav Zdrálek: Logické obvody; skripta
VŠB-TU Ostrava, ISBN 978-80-248-1734-8
[Fristacky_1986] Frištacký, N., Kolesár, M., Kolenička, J., Hlavatý, J.: Logické systémy;
ALFA 1986; ISBN 80-05-00414-1
[Katz_Borriello_2005] Randy H. Katz, Gaetano Borriello: Contemporary Logic Design, Second
Edition; Prentice Hall 2005, ISBN 0-201-30857-6
534-37804-8
[Hellerman_1977] Herbert Hellerman: Digital Computer System Principles; McGraw-Hill,
1. 1. 1973; (McGraw-Hill, 1997)
[TI_1997] Digital Design Seminar, Reference manual; literature of Texas Instruments,
SDYED01A; 1997
[wiki_1201] Finite-state machine; http://en.wikipedia.org/wiki/Finite-state_machine;
on line 2014-09-03
[wiki_1202] Moore machine; http://en.wikipedia.org/wiki/Moore_machine; on line 2014-09-03
[wiki_1203] Mealy machine; http://en.wikipedia.org/wiki/Mealy_machine; on line 2014-09-03
[wiki_1204] Discrete time and continuous time;
http://en.wikipedia.org/wiki/Discrete_time_and_continuous_time; on line 2014-09-03
[wiki_1205] Synchronous circuit; http://en.wikipedia.org/wiki/Synchronous_circuit;
on line 2014-09-03
[wiki_1206] Asynchronous circuit; http://en.wikipedia.org/wiki/Asynchronous_circuit;
on line 2014-09-03
[wiki_1207] Metastability in electronics;
http://en.wikipedia.org/wiki/Metastability_in_electronics; on line 2014-09-05
[wiki_1208] Hazard (logic); http://en.wikipedia.org/wiki/Hazard_(logic); on line 2014-09-03
[wiki_1209] State diagram; http://en.wikipedia.org/wiki/State_diagram; on line 2014-09-13
[wiki_1210] Directed graph; http://en.wikipedia.org/wiki/Directed_graph; on line 2014-09-13
[wiki_1211] State transition table; http://en.wikipedia.org/wiki/State_transition_table;
on line 2014-09-13
[Zdralek_2006] Jaroslav Zdrálek: Programovatelné logické prvky, skripta VŠB-TU Ostrava,
2006, ISBN 80-248-1060-3
[Zdralek_2008] Jaroslav Zdrálek: Programovatelné logické prvky; E-learningové prvky pro
podporu výuky odborných a technických předmětů, OP RLZ
CZ.04.1.03/3.2.15.2/0326; ISBN 978-80-248-1502-2
13 Synchronous digital system
The synchronous digital system is a system which is synchronized by a clock. This principle
ensures the stability of the system and makes the design simpler in comparison with
asynchronous systems. The basic block diagram of the synchronous digital system is in Fig.
13-01. This system has two basic blocks, the data unit and the control unit, which are
connected by signals both to each other and to the outside environment. The blocks and
signals are described as follows:
• The data unit is a block which can perform any required operation. It contains
combinational logic circuits and registers. From the point of view of a processor, the
data unit is the arithmetic logic unit. Combinational circuits perform logic operations,
binary arithmetic operations (addition, subtraction, multiplication), encoding,
multiplexing, etc. Registers are used for storing inputs, outputs and auxiliary data.
The composition of the data unit depends on the concrete design. The data unit is
connected with the control unit by control and condition signals and it is connected to
the outside environment by flags and bidirectional data.
• The control unit is a block which controls the data unit. The control unit is a finite
state machine which has the condition signals from the data unit as the input and the
control signals to the data unit as the output. The control unit is connected to the
outside environment by flags and commands.
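[Block diagram: the Control Unit and the Data Unit are connected by “Controls” (from the
control unit to the data unit) and “Conditions” (from the data unit to the control unit).
The control unit has “Flags” and “Commands” towards the outside environment, the data unit
has “Flags” and “Data”. “CLK” and “Init” are inputs of the system.]
Fig. 13-01 Block diagram of synchronous digital system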
• Control signals are signals
    • which define when the data must be written to a register,
    • which define the output of a multiplexer, for example to choose the register for
      the output,
    • which define an addition or a subtraction,
    • which define parameters of encoding,
    • etc.
• Condition signals are signals which are generated by the data unit and they describe
the results of performed actions. For example, “result is equal to zero”, “sign bit”,
“carry out”, “overflow”, “underflow” and so on.
• Data is a bidirectional bus for transferring data as the input and the output of the
data unit.
• Flags are signals which are generated by the control unit or the data unit and they
signal some information to the outside environment. Flags are read-only and they signal,
for example, “give the next command”, “set input data”, “system is busy”, “error” and
so on.
• Command signals are signals which are generated by the outside environment of the
digital system. For example, “start”, “termination of operation”, “parameters of
encoding” and so on.
• CLK is the synchronization signal for all actions in the system. The clock signal
defines the moment of the transition from the present state to the next state, and it is
the clock input of registers and flip-flops. All actions are derived from an edge of the
clock.
• Init is the initialization signal which ensures the transition to the initial state of
the finite state machine and which clears flags, registers and so on. It is desirable to
start the operation of the digital system from a defined state. Sometimes this state is
called the default state.
13.1 Decimal adder

An example of a synchronous digital system is described here for better understanding.
The example performs the decimal addition by using a binary adder. The decimal numbers
have 4 decimal digits and they are in the packed BCD format. The algorithm of decimal
addition is described in the previous chapter. The conditions of adding 6 to the sub-result
are:
• If the sub-result in the nibble is higher than 9 or the nibble generates the carry to
the next nibble, then 6 must be added to the nibble.
• If the addition of 6 to the nibble generates a carry to the next nibble, the next nibble
must be checked again and, if necessary, corrected by adding 6.
The data unit only has an 8-bit adder and 16-bit registers. The 4-digit decimal numbers are
stored in the registers, and therefore the addition must be performed twice, byte by byte,
with the carry passed from the first step to the second. The first addition adds the low
bytes and stores the carry out. The second addition performs the addition of the high bytes
with the carry from the previous addition.
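The correction rule and the byte-by-byte addition can be tried out in C. The sketch below is
an illustration under stated assumptions, not the design of this chapter: the names
“bcd_add_byte” and “bcd_add16” are hypothetical, and the corrections which the hardware
performs in separate steps are merged into one function.

```c
#include <assert.h>
#include <stdint.h>

/* Adds two packed BCD bytes and a carry in (0 or 1) with a binary addition
   and applies the corrections. Returns the BCD sum, *cout is the decimal carry. */
static uint8_t bcd_add_byte(uint8_t a, uint8_t b, unsigned cin, unsigned *cout)
{
    assert(cin <= 1 && cout != 0);
    unsigned c4  = ((a & 0x0Fu) + (b & 0x0Fu) + cin) > 0x0Fu;  /* carry from the low nibble */
    unsigned sum = (unsigned)a + b + cin;                       /* plain binary addition */
    unsigned c8  = sum > 0xFFu;                                 /* carry from the byte */
    sum &= 0xFFu;

    if ((sum & 0x0Fu) > 9 || c4) {      /* low nibble: add 0x06 */
        sum += 0x06u;
        c8 |= sum > 0xFFu;
        sum &= 0xFFu;
    }
    if ((sum >> 4) > 9 || c8) {         /* high nibble, checked after the first correction */
        sum += 0x60u;
        c8 |= sum > 0xFFu;
        sum &= 0xFFu;
    }
    *cout = c8;
    return (uint8_t)sum;
}

/* Adds two 4-digit packed BCD numbers in two steps, low bytes first. */
static uint16_t bcd_add16(uint16_t a, uint16_t b, unsigned *cout)
{
    unsigned c;
    uint8_t lo = bcd_add_byte((uint8_t)(a & 0xFFu), (uint8_t)(b & 0xFFu), 0, &c);
    uint8_t hi = bcd_add_byte((uint8_t)(a >> 8), (uint8_t)(b >> 8), c, &c);
    *cout = c;
    return (uint16_t)((hi << 8) | lo);
}
```

For example, the operands 0x0999 and 0x0025 (decimal 999 and 25) give 0x1024 with no carry
out of the fourth digit.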
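[Block diagram of the data unit: the 16-bit A register and B register (high byte H, low
byte L) and the constants 06, 60 and 66 are connected through “A mux” and “B mux” to the
8-bit binary adder with “Carry in” and “Carry out”. The adder produces 3 carry signals, C8,
A6 and C4. The sum is written to the high (H) or low (L) byte of the 16-bit accumulator,
which provides the output “Sum” and the 4 signals “Nibbles > 9”.]
Fig. 13-02 Block diagram of the data unit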
13.2 Data unit for decimal adder

The data unit has 3 registers, each with 16 bits. Two registers are used for storing the
input numbers A and B. The third register is an auxiliary register and it is called the
accumulator. The input of the 16-bit accumulator is connected to the 8-bit adder, and
therefore the writing to the accumulator is performed by bytes. The central block is the
8-bit binary adder with input multiplexers A and B. This adder can perform the addition of
different sources in various combinations. In our example, the used additions are:
• Low bytes of A and B register.
• High bytes of A and B register.
• Low byte of accumulator with one of constants 0x06, 0x60 and 0x66.
• High byte of accumulator with one of constants 0x06, 0x60 and 0x66.
The adder produces the carry outputs which can be used as the carry input of the adder, or
as the conditions in the control unit. The carry flags are:
• “C8” is a carry from the 7th bit of the adder, when low bytes are added. The carry “C8”
is used in the addition of high bytes as the carry in.
• “A6” is a carry from the 7th bit of the adder, when a low byte of the accumulator is
added with one of the correction constants. The “A6” carry is used as the carry in, when
a high byte of the accumulator is corrected.
• “C4” is a carry from the third bit of the adder. The “C4” carry expresses the carry from
the low nibble to the high nibble. It is used by the control unit as a condition for
branching.
The accumulator is a 16-bit register and the writing is performed by bytes. The accumulator
contains the circuits for decoding the situation when nibble is higher than 9. These 4 signals
“Nibbles > 9” are used as the conditions by the control unit.
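The condition signals of the data unit can be modeled in C as follows. This is a sketch
under stated assumptions: the structure “add8_result” and the function “nibbles_gt9” are
hypothetical names, and “add_8_bit” only borrows the name of the adder component used in
the VHDL annex, whose body is not part of the listing.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the 8-bit binary adder: sum, carry from bit 3 and carry from bit 7. */
struct add8_result { uint8_t sum; unsigned c4; unsigned cout; };

static struct add8_result add_8_bit(uint8_t a, uint8_t b, unsigned cin)
{
    assert(cin <= 1);
    struct add8_result r;
    unsigned low  = (a & 0x0Fu) + (b & 0x0Fu) + cin;   /* addition of the low nibbles */
    unsigned full = (unsigned)a + b + cin;             /* addition of the whole bytes */
    r.c4   = low >> 4;          /* carry from the low nibble to the high nibble */
    r.cout = full >> 8;         /* carry out of the byte, stored as C8 or A6 */
    r.sum  = (uint8_t)full;
    return r;
}

/* The 4 signals "Nibbles > 9": bit i is set when nibble i of the accumulator
   is higher than 9. */
static unsigned nibbles_gt9(uint16_t acc)
{
    unsigned flags = 0;
    for (int i = 0; i < 4; i++) {
        if (((acc >> (4 * i)) & 0x0Fu) > 9)
            flags |= 1u << i;
    }
    return flags;
}
```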
[Block diagram of the data unit from Fig. 13-02 with two groups of signals highlighted:
the data inputs and output (A register, B register, “Sum”) and the conditions for the
control unit (the carry signals of the adder and “Nibbles > 9”).]
Fig. 13-03 Highlighting of data inputs, output and conditions
[Block diagram of the data unit from Fig. 13-02 with the control signals from the control
unit (Load_A, Load_B, A_sel, B_sel, Clear, Load_xx of the carry flags, Cin_sel, Load_ACC,
Low/High) and the synchronization signal CLK of the registers added.]
Fig. 13-04 Highlighting of control signals and synchronization
Fig. 13-03 and Fig. 13-04 compare the data unit with the general block diagram and the
general definition of the data unit. The data unit has two data inputs A and B and one data
output, and it produces the conditions for the control unit. These signals are marked in
Fig. 13-03. Fig. 13-04 shows the control and synchronization signals, which are usually not
drawn. It is assumed that everybody knows the control signals of the blocks. For example,
the control signals of a multiplexer follow from the knowledge of its function. Likewise,
each register needs an enable signal and the synchronization clock signal, which together
define the time of writing. The data unit does not produce flags.
13.3 Control unit

The control unit is a block which is connected with the data unit by control and condition
signals, Fig. 13-05. The control unit is connected to the outside environment by the
command “Start” and the flag “End_cal”. The command “Start” begins the calculation and the
flag “End_cal” signals the end of the calculation and the correct result on the output. The
control unit has the synchronization signal “CLK” and the initialization signal “Init”.
[Block diagram of the control unit with its signals:]
Flags:        End_cal
Commands:     Start
Controls:     Load_A, Load_B, A_sel(2), B_sel(3), Load_C8, Load_A6, Load_C4, Clear,
              Cin_sel, Load_ACC, High/Low
Conditions:   C8, C4, Nibble_higher_9 (4)
Other inputs: CLK, Init
Fig. 13-05 Input and output of the control unit
The control unit is a finite state machine. The behavior is described by the state diagram
of a Moore machine, Fig. 13-06. The design and its description are:
• State “First”, FSM is waiting for “Start”.
    • Active “Init_L” sets the initial state “First”.
    • The flag “End_cal” is deactivated.
    • Branching according to “Start”.
• State “Begin”, the decimal numbers A and B must be set.
    • Writing new data into registers A and B.
    • Clearing carry flags “C8”, “A6”, and “C4”.
• State “Low_add”, the addition of low bytes. The carries “C8” and “A6” are equal to zero.
    • The addition of low bytes of A and B registers with the carry in, the carry in is
      equal to 0.
    • Writing the carry flags “C8” and “C4”.
    • Writing the sum into the low byte of ACC.
• Branching in state “Low_add”.
    • If (“nibble(0)>9” is true or “C4” is one) and (“nibble(1)>9” is false and “C8” is
      zero), then FSM continues by the next state “Low_06”;
    • if (“nibble(0)>9” is false and “C4” is zero) and (“nibble(1)>9” is true or “C8” is
      one), then FSM continues by the next state “Low_60”;
    • if (“nibble(0)>9” is true or “C4” is one) and (“nibble(1)>9” is true or “C8” is
      one), then FSM continues by the next state “Low_66”;
    • else FSM continues by the next state “High_add”.
• State “Low_06”, the low nibble in the low byte of ACC must be corrected by adding 0x06.
    • Add the constant 0x06 to the low byte of ACC with the carry in equal to zero.
    • Write the carry out to “A6”.
    • Write the correction into the low byte of ACC.
• Branching in state “Low_06”.
    • If (“nibble(1)>9” is true or “C8” is one), then FSM continues by the next state
      “Low_60”, else by the next state “High_add”.
• State “Low_60”, the high nibble in the low byte of ACC must be corrected by adding 0x60.
    • Add the constant 0x60 to the low byte of ACC with the carry in equal to 0.
    • Write into the low byte of ACC.
    • The FSM continues by the next state “High_add”.
• State “Low_66”, both nibbles in the low byte of ACC must be corrected by adding 0x66.
    • Add the constant 0x66 to the low byte of ACC with the carry in equal to 0.
    • Write the correction into the low byte of ACC.
    • The FSM continues by the next state “High_add”.
• State “High_add”, the addition of high bytes of A and B registers with a carry in from
the previous addition.
    • Add high bytes of A and B registers with the carry in from the previous addition.
      The carry in is equal to the logical OR of “C8” and “A6”.
    • Write the carries “C8” and “C4”.
    • Write the sum into the high byte of ACC.
• Branching in state “High_add”.
    • If (“nibble(2)>9” is true or “C4” is one) and (“nibble(3)>9” is false and “C8” is
      zero), then FSM continues by the next state “High_06”;
    • if (“nibble(2)>9” is false and “C4” is zero) and (“nibble(3)>9” is true or “C8” is
      one), then FSM continues by the next state “High_60”;
    • if (“nibble(2)>9” is true or “C4” is one) and (“nibble(3)>9” is true or “C8” is
      one), then FSM continues by the next state “High_66”;
    • else FSM continues by the next state “End_state”.
• State “High_06”, the low nibble in the high byte of ACC must be corrected by adding
0x06.
    • These actions are similar to those in the state “Low_06”.
• State “High_60”, the high nibble in the high byte of ACC must be corrected by adding
0x60.
    • These actions are similar to those in the state “Low_60”.
    • The FSM continues by the next state “End_state”.
• State “High_66”, both nibbles in the high byte of ACC must be corrected by adding 0x66.
    • These actions are similar to those in the state “Low_66”.
    • The FSM continues by the next state “End_state”.
• State “End_state”,
    • Activate the flag “End_cal”, the result on the output is correct.
    • Waiting for the deactivation of the command “Start”.
    • If “Start” is deactivated, then FSM continues by the first state “First”, the system
      is ready to add other numbers.
[State diagram, parts A to D: Init_L sets “First”; “First” stays on No Start and goes to
“Begin” (Clear, write A,B) on Start; “Begin” goes to “Low_add” (L_ACC<-L_A+L_B); “Low_add”
branches to “Low_06” (L_ACC+06), “Low_60” (L_ACC+60), “Low_66” (L_ACC+66) or “High_add”
(H_ACC<-H_A+H_B); “High_add” branches to “High_06” (H_ACC+06), “High_60” (H_ACC+60),
“High_66” (H_ACC+66) or “End_state” (End_cal); “End_state” stays on Start and goes to the
state “First” on No Start.]
Fig. 13-06 State diagram of the control unit, parts A to D
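The branching in the state “Low_add” reduces to two questions, whether the low nibble and
whether the high nibble of the low byte need a correction. The following C sketch
illustrates this; the names “ctl_state”, “conditions” and “branch_after_low_add” are
hypothetical and the function covers only this one branching, not the whole control unit.

```c
#include <stdbool.h>

/* Subset of the states of the control unit used in this branching. */
enum ctl_state { LOW_ADD, LOW_06, LOW_60, LOW_66, HIGH_ADD };

/* Conditions from the data unit after the addition of the low bytes. */
struct conditions { bool nibble0_gt9, nibble1_gt9, c4, c8; };

/* Next state after "Low_add", chosen from the condition signals. */
static enum ctl_state branch_after_low_add(struct conditions c)
{
    bool fix_low  = c.nibble0_gt9 || c.c4;   /* low nibble needs +0x06 */
    bool fix_high = c.nibble1_gt9 || c.c8;   /* high nibble needs +0x60 */

    if (fix_low && fix_high) return LOW_66;
    if (fix_low)             return LOW_06;
    if (fix_high)            return LOW_60;
    return HIGH_ADD;                         /* no correction is needed */
}
```

The branching in the state “High_add” has the same form with “nibble(2)”, “nibble(3)” and
the states “High_06”, “High_60”, “High_66” and “End_state”.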
Fig. 13-07 Simulation of 999 + 25 in decimal
13.4 Simulation and realization

This example of the addition of two decimal numbers was written in the programming
language VHDL. The listing of this program is in Annex 13A of this chapter. This VHDL
notation was processed by IDE tools (Integrated Development Environment), [wiki_xx01]. This
environment provides compilation, simulation and synthesis. Fig. 13-07 shows the waveform
of the addition 999 + 25 = 1024, in decimal, from the IDE simulation. In this simulation,
the current state and the next state show the path through the state diagram which was
used in this example of addition.
The IDE performs the synthesis, which is the realization in programmable logic devices.
The result of the synthesis is a file which is loaded into the PLD (Programmable Logic
Device). A PLD is a universal circuit which can realize any digital system. Programmable
devices are the CPLD and the well-known FPGA, [wiki_xx02], [wiki_xx03], [wiki_xx04],
[Zdralek_2006] and [Zdralek_2008].
13.5 Reference
[wiki_xx01] Integrated development environment;
http://en.wikipedia.org/wiki/Integrated_development_environment; on line 2014-09-23
[wiki_xx02] Field-programmable gate array;
http://en.wikipedia.org/wiki/Field-programmable_gate_array; on line 2014-09-23
[wiki_xx03] Programmable logic device;
http://en.wikipedia.org/wiki/Programmable_logic_device; on line 2014-09-23
[wiki_xx04] Complex programmable logic device;
http://en.wikipedia.org/wiki/Complex_programmable_logic_device; on line 2014-09-23
[Zdralek_2006] Jaroslav Zdrálek: Programovatelné logické prvky, skripta VŠB-TU Ostrava,
2006, ISBN 80-248-1060-3
[Zdralek_2008] Jaroslav Zdrálek: Programovatelné logické prvky; E-learningové prvky pro
podporu výuky odborných a technických předmětů, OP RLZ
CZ.04.1.03/3.2.15.2/0326; ISBN 978-80-248-1502-2
13.6 Annex 13A

Listing of the decimal adder in VHDL
---------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity BCD_adder is
Port ( A: in STD_LOGIC_VECTOR (15 downto 0);
B: in STD_LOGIC_VECTOR (15 downto 0);
Sum: out STD_LOGIC_VECTOR (15 downto 0);
Start: in STD_LOGIC;
End_cal: out STD_LOGIC;
CLK: in std_logic;
Reset_L: in std_logic
);
end BCD_adder;
architecture Behavioral of BCD_adder is

signal flags: std_logic_vector (5 downto 0);
signal word: std_logic_vector (13 downto 0);
component control_unit is
Port ( flags: in STD_LOGIC_VECTOR (5 downto 0);
Start: in std_logic;
contr_word: out STD_LOGIC_VECTOR (13 downto 0);
CLK: in STD_LOGIC;
Reset_L: in STD_LOGIC;
End_cal: out std_logic);
end component control_unit;
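component data_unit is
-- port list completed from the port map of IO2 below
Port ( A: in STD_LOGIC_VECTOR (15 downto 0);
B: in STD_LOGIC_VECTOR (15 downto 0);
Sum: out STD_LOGIC_VECTOR (15 downto 0);
CLK: in std_logic;
contr_word: in STD_LOGIC_VECTOR (13 downto 0);
flags: out std_logic_vector (5 downto 0)
);
end component data_unit;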
begin
IO1: control_unit port map (flags, start, word, clk, reset_L, end_cal);
IO2: data_unit port map (A, B, Sum, clk, word, flags);
end Behavioral;
---------------------------------------------------------------------
--- Data Unit
---------------------------------------------------------------------
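library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
entity data_unit is
-- port list completed from the port map in BCD_adder
Port ( A: in STD_LOGIC_VECTOR (15 downto 0);
B: in STD_LOGIC_VECTOR (15 downto 0);
Sum: out STD_LOGIC_VECTOR (15 downto 0);
clk: in std_logic;
contr_word: in STD_LOGIC_VECTOR (13 downto 0);
flags: out std_logic_vector (5 downto 0)
);
end data_unit;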
architecture Behavioral of data_unit is
signal A_reg, B_reg: std_logic_vector (15 downto 0);

signal ACC: std_logic_vector (15 downto 0):=(others => '0');
signal A_mux, B_mux, ADD: std_logic_vector (7 downto 0);
signal cin, cout, cc44, C8, C4, c_A6: std_logic;
alias A_sel is contr_word (11 downto 10);
alias B_sel is contr_word (9 downto 7);
alias cin_C8orA6 is contr_word (6);
alias sel_ACC_high is contr_word (5);
alias load_A_reg is contr_word (4);
alias load_B_reg is contr_word (3);
alias load_C4 is contr_word (2);
-- assumption: load_C8 is bit 1, the only bit of contr_word without another alias
alias load_C8 is contr_word (1);
alias load_c_A6 is contr_word (12);
alias clear_carry is contr_word (13);
alias load_ACC is contr_word (0);
alias nibble is flags (3 downto 0);

alias fC8 is flags (4);
alias fC4 is flags (5);
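component add_8_bit is
-- assumption: the two operand ports are named A and B (positional port map below)
Port ( A: in STD_LOGIC_VECTOR (7 downto 0);
B: in STD_LOGIC_VECTOR (7 downto 0);
S: out STD_LOGIC_VECTOR (7 downto 0);
cin: in STD_LOGIC;
c4: out STD_LOGIC;
cout: out STD_LOGIC);
end component add_8_bit;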
begin
Sum <= ACC;
-- A and B registers
A_reg <= A when clk='1' and clk'event and load_A_reg='1';
B_reg <= B when clk='1' and clk'event and load_B_reg='1';
-- AMUX
A_mux <= ACC(15 downto 08) when A_sel="00" else
ACC(07 downto 00) when A_sel="01" else
A_reg (15 downto 08) when A_sel="10" else
A_reg (07 downto 00) when A_sel="11";
-- B MUX
B_mux <= B_reg (15 downto 08) when B_sel="000" else
B_reg (07 downto 00) when B_sel="001" else
"00000000" when B_sel="010" else
"00000110" when B_sel="011" else
"01100000" when B_sel="100" else
"01100110";
-- cin mux
cin <= '0' when cin_C8orA6='0' else
C8 or c_A6;
IO_add: add_8_bit port map (A_mux,B_mux, ADD, cin, cc44, cout);
-- carry flags
process (clk) is
begin
if clk='1' and clk'event then
if load_C8='1' then C8 <= cout;
elsif clear_carry='1' then C8 <='0';
else C8 <= C8;
end if;
end if;
end process;
fC8 <= C8;
-- carry when correction of 6 is added
process (clk) is
begin
if clk='1' and clk'event then
if load_c_A6='1' then c_A6 <= cout;
elsif clear_carry='1' then c_A6 <='0';
else c_A6 <= c_A6;
end if;
end if;
end process;
-- C4 flag
process (clk) is
begin
if clk='1' and clk'event then
if load_C4='1' then C4 <= cc44;
elsif clear_carry='1' then C4 <='0';
else C4 <= C4;
end if;
end if;
end process;
fC4 <= C4;
-- ACC register and the encoding of the conditions
-- when nibbles are higher than 9
ACC(15 downto 08) <= ADD
when clk='1' and clk'event and load_ACC='1' and sel_ACC_high='1';
ACC(07 downto 00) <= ADD
when clk='1' and clk'event and load_ACC='1' and sel_ACC_high='0';
nibble(0) <= '1' when ACC(03 downto 00) > "1001" else '0';
nibble(1) <= '1' when ACC(07 downto 04) > "1001" else '0';
nibble(2) <= '1' when ACC(11 downto 08) > "1001" else '0';
nibble(3) <= '1' when ACC(15 downto 12) > "1001" else '0';
end Behavioral;
-----------------------------------------------------------
---- Control unit
-----------------------------------------------------------
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
entity control_unit is
Port ( flags: in STD_LOGIC_VECTOR (5 downto 0);
start: in std_logic;
contr_word: out STD_LOGIC_VECTOR (13 downto 0);
clk: in STD_LOGIC;
reset_L: in STD_LOGIC;
end_cal: out std_logic);
end control_unit;
architecture Behavioral of control_unit is
type state_type is
(First, Beginig, Low_add, Low_06, Low_60, Low_66,
High_add, High_06, High_60, High_66, End_state);
signal Next_state, Current_state: state_type;
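alias A_sel is contr_word (11 downto 10);
alias B_sel is contr_word (9 downto 7);
alias cin_C8orA6 is contr_word (6);
alias sel_ACC_high is contr_word (5);
alias load_A_reg is contr_word (4);
alias load_B_reg is contr_word (3);
alias load_C4 is contr_word (2); -- same bit as in the data unit
-- assumption: load_C8 is bit 1, the only bit of contr_word without another alias
alias load_C8 is contr_word (1);
alias load_c_add6 is contr_word (12);
alias clear_carry is contr_word (13);
alias load_ACC is contr_word (0);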
alias nibble is flags (3 downto 0);

alias fc8 is flags (4);
alias fc4 is flags (5);
begin
State_reg: process (clk, reset_L, next_state)is

begin
if reset_L='0' then current_state <= first;
elsif clk='0' and clk'event then current_state <= next_state;
end if;
end process State_reg;
State_diagram: process (current_state, start, fc8, fc4,

nibble(0), nibble(1),nibble(2),nibble(3), flags) is
begin
case current_state is
when first => contr_word <= (others => '0');
end_cal <= '0';
if start = '1' then next_state <= beginig;
else next_state <= first;
end if;
when beginig => contr_word <= (others => '0');
load_A_reg <= '1';
load_B_reg <= '1';
clear_carry <= '1';
next_state <= low_add;
when low_add => contr_word <= (others => '0');
A_sel <= "11"; -- A_reg_low
B_sel <= "001";
cin_C8orA6 <= '0'; -- cin is zero
load_ACC <= '1';
sel_ACC_high <= '0';
load_C4 <= '1';
load_C8 <= '1';
if (fc8='1' or nibble(1)='1') and (fc4='0') and (nibble(0)='0')
then next_state <= low_60;
elsif (fc8='0') and (nibble(1)='0') and (fc4='1' or nibble(0)='1')
then next_state <= low_06;
elsif (fc8='1' or nibble(1)='1') and (fc4='1' or nibble(0)='1')
then next_state <= low_66;
else next_state <= high_add;
end if;
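when low_66 => contr_word <= (others => '0');
A_sel <= "01";
B_sel <= "101";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
next_state <= high_add;
-- The labels of the next two states are missing in the source listing; they are
-- inferred from the remaining names in state_type and from the transitions in low_add.
when low_60 => contr_word <= (others => '0');
A_sel <= "01";
B_sel <= "100";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
next_state <= high_add;
when low_06 => contr_word <= (others => '0');
A_sel <= "01";
B_sel <= "011";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
if (fc8='1' or nibble(1)='1')
then next_state <= low_60; -- target inferred; this line is missing in the source listing
else next_state <= high_add;
end if;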
when high_add => contr_word <= (others => '0');
A_sel <= "10"; -- A_reg high
B_sel <= "000";
cin_C8orA6 <= '1';
load_ACC <= '1';
load_C4 <= '1';
load_C8 <= '1';
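if (fc8='1' or nibble(3)='1') and (fc4='0') and (nibble(2)='0')
then next_state <= high_60;
elsif (fc8='0') and (nibble(3)='0') and (fc4='1' or nibble(2)='1')
then next_state <= high_06; -- target inferred; this line is missing in the source listing
elsif (fc8='1' or nibble(3)='1') and (fc4='1' or nibble(2)='1')
then next_state <= high_66; -- target inferred; this line is missing in the source listing
else next_state <= end_state;
end if;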
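when high_66 => contr_word <= (others => '0');
A_sel <= "00";
B_sel <= "101";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
next_state <= end_state;
-- The labels of the next two states are missing in the source listing; they are
-- inferred from the remaining names in state_type and from the transitions in high_add.
when high_60 => contr_word <= (others => '0');
A_sel <= "00";
B_sel <= "100";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
next_state <= end_state;
when high_06 => contr_word <= (others => '0');
A_sel <= "00";
B_sel <= "011";
cin_C8orA6 <= '0';
load_ACC <= '1';
load_C4 <= '1';
load_c_add6 <= '1';
if (fc8='1' or nibble(3)='1') --and (fc4='0') and (flags(2)='0')
then next_state <= high_60; -- both branches inferred by symmetry with the low-byte states;
else next_state <= end_state; -- they are missing in the source listing
end if;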
when end_state => contr_word <= (others => '0');
end_cal <='1';
if start='0' then next_state <= first;
else next_state <= end_state; -- hold until start is released (no inferred latch on next_state)
end if;
end case;
end process State_diagram;
end Behavioral;