



Why 64-Bit Computing?

The question of why we need 64-bit computing is often asked but rarely answered in a
satisfactory manner. There are good reasons for the confusion surrounding the question.

So, first of all, let's look through the list of users who need 64-bit addressing and
64-bit calculations today:

• Users of CAD, design systems and simulators need more than 4 GB of RAM. Although
there are ways around this limitation (for example, Intel PAE), they hurt
performance. Thus, the Xeon processors support a 36-bit addressing mode in which
they can address up to 64 GB of RAM. The idea behind this support is that the RAM is
divided into segments, and an address consists of a segment number and a
location inside the segment. This approach causes an almost 30% performance loss
in memory operations. Besides, programming is much simpler and more
convenient with a flat memory model in a 64-bit address space: because the address
space is large, a location has a simple address that is processed in one pass. Many
design offices use quite expensive workstations based on RISC processors, where
64-bit addressing and large memory sizes have been in use for a long time.

• Users of databases. Any big company has a huge database, and a larger
maximum memory size and the ability to address data directly in the database are
very valuable. Although in special modes the 32-bit IA-32 architecture can address
up to 64 GB of memory, a transition to the flat memory model in the 64-bit space is
much more advantageous in terms of speed and ease of programming.

• Scientific calculations. Memory size, a flat memory model and no limitation for
processed data are the key factors here. Besides, some algorithms in the 64bit
representation have a much simpler form.

• Cryptography and security applications benefit greatly from 64-bit
integer calculations.

What is 64-bit computing?

The labels "16-bit," "32-bit" or "64-bit," when applied to a microprocessor, characterize the
processor's data stream. Although you may have heard the term "64-bit code," this
designates code that operates on 64-bit data.

In more specific terms, the labels "64-bit," "32-bit," etc. designate the number of bits that
each of the processor's general-purpose registers (GPRs) can hold. So when someone uses
the term "64-bit processor," what they mean is "a processor with GPRs that store 64-bit
numbers." And in the same vein, a "64-bit instruction" is an instruction that operates on
64-bit numbers.



In the diagram above black boxes are code, white boxes are data, and gray boxes are
results. The instruction and code "sizes" are not to be taken literally, since they're intended
to convey a general feel for what it means to "widen" a processor from 32 bits to 64 bits.

Not all the data either in memory, the cache, or the registers is 64-bit data. Rather, the
data sizes are mixed, with 64 bits being the widest.

Note that in the 64-bit CPU pictured above, the width of the code stream has not changed;
the same-sized opcode could theoretically represent an instruction that operates on 32-bit
numbers or an instruction that operates on 64-bit numbers, depending on what the
opcode's default data size is. On the other hand, the width of the data stream has doubled.
In order to accommodate the wider data stream, the sizes of the processor's registers and
the sizes of the internal data paths that feed those registers must be doubled.



Now let's take a look at two programming models, one for a 32-bit processor and another
for a 64-bit processor.

The registers in the 64-bit CPU pictured above are twice as wide as those in the 32-bit
CPU, but the size of the instruction register (IR) that holds the currently executing
instruction is the same in both processors. Again, the data stream has doubled in size, but
the instruction stream has not. Finally, the program counter (PC) has also doubled in size.

For the simple processor pictured above, the two types of data that it can process are
integer data and address data. Ultimately, addresses are really just integers that designate
a memory address, so address data is just a special type of integer data. Hence, both data
types are stored in the GPRs, and both integer and address calculations are done by the ALU.

Many modern processors support two additional data types: floating-point data and vector
data. Each of these two data types has its own set of registers and its own execution
unit(s). The following table compares all four data types in 32-bit and 64-bit processors:

Data Type         Register Type   Execution Unit   x86 width (bits)   x86-64 width (bits)
Integer           GPR             ALU              32                 64
Address           GPR             ALU or AGU       32                 64
Floating Point*   FPR             FPU              64                 64
Vector            VR              VPU              128                128

*x87 uses 80-bit registers to do double-precision floating-point. The floats themselves are
64-bit, but the processor converts them to an internal, 80-bit format for increased
precision when doing computations.

The table above shows that the difference the move to 64 bits makes is in the integer and
address hardware. The floating-point and vector hardware stays the same.



Now that we know what 64-bit computing is, let's take a look at the benefits of increased
integer and data sizes.

Dynamic range

The main thing that a wider integer gives you is increased dynamic range.

In the base-10 number system to which we're all accustomed, you can represent a
maximum of ten integers (0 to 9) with a single digit. This is because base-10 has ten
different symbols with which to represent numbers. To represent more than ten integers
you need to add another digit, using a combination of two symbols chosen from among the
set of ten to represent any one of 100 integers (00 to 99). The general formula that you
can use to compute the number of integers (dynamic range, or DR) that you can represent
with an n-digit base-ten number is:

DR = 10^n

So a 1-digit number gives you 10^1 = 10 possible integers, a 2-digit number 10^2 = 100
integers, a 3-digit number 10^3 = 1000 integers, and so on.

The base-2, or "binary," number system that computers use has only two symbols with
which to represent integers: 0 and 1. Thus, a single-digit binary number allows you to
represent only two integers, 0 and 1. With a two-digit (or "2-bit") binary number, you can
represent four integers by combining the two symbols (0 and 1) in any of the following four
ways:

00 = 0
01 = 1
10 = 2
11 = 3

Similarly, a 3-bit binary number gives you eight possible combinations, which you can use
to represent eight different integers. As you increase the number of bits, you increase the
number of integers you can represent. In general, n bits will allow you to represent 2^n
integers in binary. So a 4-bit binary number can represent 2^4 or 16 integers, an 8-bit
number gives you 2^8 = 256 integers, and so on.

So in moving from a 32-bit GPR to a 64-bit GPR, the range of integers that a processor can
manipulate goes from 2^32 = 4.3e9 to 2^64 = 1.8e19. The dynamic range, then, increases by
a factor of 4.3 billion. Thus a 64-bit integer can represent a much larger range of numbers
than a 32-bit integer.
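The dynamic-range formula above can be checked with a few lines of Python. This is only an illustrative sketch; the function name dynamic_range is made up for this example and does not come from the original text.

```python
def dynamic_range(digits: int, base: int = 2) -> int:
    """Number of distinct integers representable with `digits` digits in `base`."""
    return base ** digits


# Reproduce the figures quoted in the text.
for n in (1, 2, 3, 4, 8, 32, 64):
    print(f"{n:2d} bits -> {dynamic_range(n):,} integers")

# Growth factor when widening a register from 32 to 64 bits: 2^64 / 2^32 = 2^32.
growth = dynamic_range(64) // dynamic_range(32)
print(f"64-bit range is {growth:,} times the 32-bit range")
```

Python integers have arbitrary precision, so the 2^64 value is computed exactly rather than being limited by the host machine's own register width.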

The benefits of increased dynamic range, or, how the existing 64-bit computing market uses
64-bit integers

Since addresses are just special-purpose integers, an ALU and register combination that can
handle more possible integer values can also handle that many more possible addresses. With
all the recent press coverage that 64-bit architectures have garnered, it's fairly common
knowledge that a 32-bit processor can address at most 4GB of memory. (Remember our
2^32 = 4.3 billion number? That 4.3 billion bytes is about 4GB.) A 64-bit architecture
could, by contrast, theoretically address up to 18 million terabytes.

So, what do you do with over 4GB of memory? Well, caching a very large
database in it is a start. Back-end servers for mammoth databases are one place where 64
bits have long been a requirement, so it's no surprise to see upcoming 64-bit offerings
billed as capable database platforms.

On the media and content creation side of things, folks who work with very large 2D image
files also appreciate the extra RAM. A related and very interesting application domain
where large amounts of memory come in handy is simulation and modeling. Under this
heading you could put various CAD tools and 3D rendering programs, as well as things like
weather and scientific simulations, and even real-time 3D games. Though the current crop
of 3D games wouldn't benefit from greater than 4GB of RAM, it is quite possible that we'll
see a game that benefits from greater than 4GB RAM within the next five years.

Some applications, mostly in the realm of scientific computing (MATLAB, Mathematica, MAPLE,
etc.) and simulations, require 64-bit integers because they work with numbers outside the
dynamic range of 32-bit integers. When the result of a calculation exceeds the range of
possible integer values, you get a situation called either overflow (i.e. the result was
greater than the highest positive integer) or underflow (i.e. the result was less than the
lowest negative integer). When this happens, the number you get in the register isn't the
right answer. There's a bit in the x86's processor status word that allows you to check
whether an integer result has just exceeded the processor's dynamic range, so you
know that the result is bogus. Such situations are rare in integer applications.
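To make the overflow case concrete, here is a small Python model of a signed 32-bit register. It is a sketch only: Python integers never overflow, so the wraparound is simulated with a mask, and the returned flag merely plays the role of the status bit mentioned above. The names add32 and to_signed are invented for this illustration.

```python
from typing import Tuple

BITS = 32
MASK = (1 << BITS) - 1            # 0xFFFFFFFF
INT_MAX = (1 << (BITS - 1)) - 1   # highest positive 32-bit integer
INT_MIN = -(1 << (BITS - 1))      # lowest negative 32-bit integer


def to_signed(value: int) -> int:
    """Interpret the low 32 bits of `value` as a two's-complement integer."""
    value &= MASK
    return value - (1 << BITS) if value > INT_MAX else value


def add32(a: int, b: int) -> Tuple[int, bool]:
    """Add two signed 32-bit integers the way a 32-bit register would.

    Returns the wrapped result plus a flag that is True when the true sum
    did not fit in 32 bits (a stand-in for the processor's overflow bit).
    """
    true_sum = a + b
    wrapped = to_signed(true_sum)
    return wrapped, wrapped != true_sum


print(add32(1, 2))          # fits: flag is False
print(add32(INT_MAX, 1))    # too large: the result wraps to INT_MIN
print(add32(INT_MIN, -1))   # too small: the result wraps to INT_MAX
```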

Programmers who run into integer overflow or underflow problems on a 32-bit platform do
have the option of using a 64-bit integer construct provided by a higher level language like
C. In such cases, the compiler uses two registers per integer, one for each half of the
integer, to do 64-bit calculations in 32-bit hardware. This has obvious performance
drawbacks, making it less desirable than a true 64-bit integer implementation.
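The two-registers-per-integer scheme can be sketched as follows. This is an illustrative Python model, not actual compiler output: each 64-bit value is held as a high and a low 32-bit half, and the addition takes two 32-bit steps linked by a carry. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split64(value):
    """Split a 64-bit value into (high, low) 32-bit halves, one per register."""
    return (value >> 32) & MASK32, value & MASK32


def join64(high, low):
    """Recombine two 32-bit halves into one 64-bit value."""
    return (high << 32) | low


def add64_with_32bit_ops(a, b):
    """Add two unsigned 64-bit values using only 32-bit-wide additions."""
    a_hi, a_lo = split64(a)
    b_hi, b_lo = split64(b)
    low_sum = a_lo + b_lo
    carry = low_sum >> 32                  # 1 if the low halves overflowed
    low = low_sum & MASK32
    high = (a_hi + b_hi + carry) & MASK32  # second add, including the carry
    return join64(high, low)


print(hex(add64_with_32bit_ops(0x00000001FFFFFFFF, 1)))
```

Even this simplest operation needs two additions and carry handling, which gives a feel for where the performance drawback comes from; a 64-bit machine does the same work in a single add.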

Finally, there is another application domain for which 64-bit integers can offer real
benefits: cryptography. Most popular encryption schemes rely on the multiplication and
factoring of very large integers, and the larger the integers, the more secure the encryption.

64-bit integer code runs slowly on a 32-bit machine, due to the fact that the 64-bit
computations have to be split apart and processed as two separate 32-bit computations.
So you could say that there's a performance penalty for running 64-bit integer code on a
32-bit machine; this penalty is absent when running the same code on a 64-bit machine,
since the computation doesn't have to be split in two. The take-home point here is that
only applications that require and use 64-bit integers will see a performance increase on
64-bit hardware that is due solely to a 64-bit processor's wider registers and increased
dynamic range.

64-bit Architectures

Let's discuss the 64-bit architectures from the two leading processor manufacturers, AMD and
Intel: AMD's Opteron and Intel's Itanium.

Intel 64-bit architecture (IA-64)

IA-64 uses a technique called VLIW, which stands for "Very Long Instruction Word".
Processors that use this technique access memory by transferring long program words,
and many instructions are packed into each word. In the case of IA-64, three instructions
are packed into each 128-bit bundle. As each instruction has 41 bits, there are 5 bits left
that indicate the kinds of instruction that were packed. Figure 1 shows the
instruction packaging scheme. This packaging lessens the number of memory accesses,
leaving to the compiler the task of grouping the instructions in order to get the best out
of the architecture.


Instruction packaging used in the IA-64 architecture.

As already mentioned, the 5-bit field, named the "template", indicates the kinds
of instructions that are packed. Those 5 bits allow 32 possible kinds of packaging which, in
fact, are reduced to 24, since 8 are not used. Each instruction uses one of the CPU's
execution unit types, which are listed below and can be identified in the figure given below.

Unit I - integer data
Unit F - floating-point operations
Unit M - memory access
Unit B - branches
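The arithmetic of the packaging scheme (3 x 41 + 5 = 128 bits) can be illustrated with a small pack/unpack sketch in Python. It assumes the 5-bit field sits in the lowest bits with the three instruction slots above it, which the text here does not state; the function names and the sample values are purely hypothetical and do not encode real IA-64 instructions.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer.

    Assumed layout: template in bits 0-4, slot i starting at bit 5 + 41*i.
    """
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover (template, (slot0, slot1, slot2)) from a 128-bit bundle."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)
    )
    return template, slots


bundle = pack_bundle(0x10, (0x123, 0x456, 0x789))  # arbitrary sample values
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```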

The processor that Intel proposes to execute those instructions, called Itanium,
is versatile and promises performance by means of the simultaneous (parallel) execution of
up to 6 instructions. The figure shows a block diagram of this architecture, which uses a
10-stage pipeline.



Block diagram of the Itanium CPU (IA-64 architecture).

The basic structural unit of the Itanium looks like the picture above. According to Intel,
the data bus can cope with a data rate of 2.1GB/sec. The Itanium processor contains 4 integer
ALUs, 4 multimedia ALUs, 2 AGUs, 3 branch units and 4 FPUs for floating-point
arithmetic. The processor is theoretically capable of performing 20 operations in one
clock cycle by loading 16 operands and evaluating 4 ALU operations. This figure should
not be confused with the number of instructions possible within one clock cycle, namely
six. The instructions are retrieved from memory and bundled by a process called
bundle rotation, which prepares the parallel execution of instructions at the hardware level.
The instructions are fetched from the cache speculatively. All this is implemented with the
help of 128 floating-point registers, 128 integer registers and 8 branch registers, all of
which explicitly support 64 bits.

The IA-64 architecture goes by the acronym EPIC, which stands for "Explicitly Parallel
Instruction Computing". With this acronym, Intel wants to say that the compiler bears the
main responsibility for determining and exposing the parallelism present in the instructions
to be executed. EPIC is a combination of concepts called speculation, predication and
explicit parallelism.

Next, we will briefly study each one of them.

Explicit parallelism:

Instruction-level parallelism (ILP) is the ability to execute multiple instructions at the
same time. As we have seen, the IA-64 architecture allows independent instructions to be
packed for parallel execution and, in each clock period, is capable of processing
multiple bundles. Thanks to the large number of parallel resources, registers and
execution units, the compiler can manage and schedule the parallel computation.
Compilers for traditional architectures are limited in their ability to speculate because
there is not always a way to be sure that the speculation will be handled correctly by the
processor. The IA-64 architecture allows the compiler to exploit speculative information
without sacrificing the correct execution of an application.

The IA-64 architecture has mechanisms called instruction templates, branch hints and
cache hints, which allow the compiler to pass to the processor information obtained
at compile time. That information minimizes the penalties that come from
branches and cache misses.


Speculation:

Itanium can load instructions and data onto the CPU before they're actually needed, or even
if they prove not to be needed, effectively using the processor itself as a cache.
Presumably, this early loading is done when the processor is otherwise idle. Speculation
limits the effects of memory latency by loading data before it is needed, thus making it
ready to go the moment the processor can use it.

There are two kinds of speculation: data and control. With speculation, the
compiler moves an operation earlier so that its latency (time spent) is removed from the
critical path. Speculation is a way of letting the compiler keep slow
operations from spoiling the parallelism of the instructions. Control speculation is the
execution of an operation before the branch that precedes it. Data speculation, on the
other hand, is the execution of a memory load before a store that precedes it and to
which it may be related.

Speculation Benefits:
• Reduces impact of memory latency.
• Performance improvement of 79% when combined with predication.
• Greatest improvement to code with many cache accesses: large databases and operating
systems.
• Scheduling flexibility enables new levels of performance headroom.

Predication:

Branch prediction is widely used in today's processors. However, much processor time is
taken up by calculations for branches that end up being unneeded. Predication is a
compiler-based technique that reduces this waste by removing hard-to-predict branches
from the code instead of guessing which way they will go.

With predication, the compiler marks the instructions on every path of a conditional branch
with predicates. All paths are then sent for execution in parallel, but only the results of
the necessary one take effect. It is therefore possible to prepare the execution of the
instructions even before the conditional branch has been resolved. Besides the removal of
branches by means of predicates, the IA-64 architecture has a series of mechanisms that
should reduce branch mispredictions and the cost when a misprediction happens.
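The idea can be modeled in a few lines of Python. This is a conceptual sketch only: Python does not expose predicate registers, so the predicates are ordinary booleans and the function names are invented. The first function uses a branch; the second computes both paths and lets a pair of complementary predicates decide which result takes effect.

```python
def abs_diff_with_branch(a, b):
    """Conventional version: the processor must predict which path runs."""
    if a > b:
        return a - b
    else:
        return b - a


def abs_diff_with_predication(a, b):
    """Predicated version: both paths are computed, predicates pick the result."""
    # A single compare produces two complementary one-bit predicates.
    p_then = a > b
    p_else = not p_then
    # Both paths are evaluated; neither depends on the outcome of a branch.
    r_then = a - b
    r_else = b - a
    # Only the result whose predicate is true is kept (True == 1, False == 0).
    return r_then * p_then + r_else * p_else


print(abs_diff_with_branch(3, 7), abs_diff_with_predication(3, 7))
```

The predicated version does strictly more arithmetic, which is the trade-off: extra work in exchange for having no branch to mispredict.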

Predication Benefits:
• Reduces branches and mispredict penalties.
• Parallel compares further reduce critical paths.
• Greatly improves code with hard-to-predict branches: large server apps (capacity
limited); sorting, data mining and large database apps; data compression.
• Traditional architectures' "bolt-on" approach can't efficiently approximate predication.
Cmove: 39% more instructions, 30% lower performance; instructions must all be speculative.

The IA-64 architecture has a great number of registers. There are 128 integer registers,
128 floating-point registers, 64 one-bit predicate registers, and many other registers for
configuration, management and monitoring of the CPU's performance.

Rotating Registers

On top of the frames, there's register rotation, a feature that helps loop unrolling more
than parameter passing. With rotation, Itanium can shift up to 96 of its general-purpose
registers (the first 32 are still fixed and global) by one or more apparent positions. Why?
So that iterative loops that hammer on the same register(s) time after time can all be
dispatched and executed at once without stepping on each other. Each instance of the loop
actually targets different physical registers, allowing them all to be in flight at once.
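A toy model of rotation, written in Python, shows how the same register name lands on a different physical register in each loop iteration. It is a sketch under simplifying assumptions: all 96 upper registers rotate together, one position per iteration, and the class, method and field names (including rrb for the rotation offset) are chosen for this example rather than taken from the text.

```python
FIXED = 32       # r0-r31 never rotate
ROTATING = 96    # r32-r127 can rotate


class RotatingRegisterFile:
    def __init__(self):
        self.physical = [0] * (FIXED + ROTATING)
        self.rrb = 0  # current rotation offset (assumed name)

    def physical_index(self, reg):
        """Map the register name used by the program to a physical register."""
        if reg < FIXED:
            return reg
        return FIXED + (reg - FIXED + self.rrb) % ROTATING

    def read(self, reg):
        return self.physical[self.physical_index(reg)]

    def write(self, reg, value):
        self.physical[self.physical_index(reg)] = value

    def rotate(self):
        """Shift the rotating registers by one apparent position."""
        self.rrb = (self.rrb - 1) % ROTATING


rf = RotatingRegisterFile()
targets = []
for i in range(3):
    rf.write(32, i * 10)                  # every iteration writes "r32"
    targets.append(rf.physical_index(32))
    rf.rotate()                           # done once per loop iteration

print(targets)                            # three different physical registers
print(rf.read(33), rf.read(34), rf.read(35))
```

After the loop, the values written by the three iterations are all still alive and visible under the names r33, r34 and r35, which is what lets several iterations be in flight at once.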



If this sounds a lot like register renaming, it is. Itanium's register-rotation feature is less
generic than all-purpose register renaming like Athlon's, so it's easier to implement and
faster to execute. Chip-wide register renaming like Athlon's adds gobs of multiplexers,
adders, and routing, one of the big drawbacks of a massively out-of-order machine. On a
smaller scale, ARM used this trick with its ill-fated Piccolo DSP coprocessor. At the high
end, Cydrome also used this technique, a favorite feature that Cydrome alumnus and
Itanium team member Bob Rau apparently brought with him.

So IA-64 has two levels of indirection for its own registers: the logical-to-virtual mapping
of the frames and the virtual-to-physical mapping of the rotation. All this means that
programs usually aren't accessing the physical registers they think they are, but that's
nothing new to high-end microprocessors. Arcane as it seems, this method still uses less
hardware trickery than the full register renaming of Athlon, Pentium III, or P4.

Intel promises compatibility with 32-bit software (IA-32). Such software should run without
any change, since the operating system and the firmware have features for that. It should be
possible to run software in real mode (16 bits), protected mode (32 bits) and virtual 8086
mode (16 bits). This means that the CPU will be able to operate in IA-64 mode or IA-32 mode.
There are special instructions to go from one mode to the other, as shown in Figure 3.

Figure 3: Model of instruction sets transition.

The three instructions that make the transition between the instruction sets are:

JMPE (IA-32): jumps to a 64-bit instruction and changes to IA-64 mode;

br.ia (IA-64): branches to a 32-bit instruction and changes to IA-32 mode;

Interruptions switch the processor to IA-64 mode, so that all interruption conditions are
handled there;

rfi (IA-64): returns from an interruption; the return goes to either IA-32 or
IA-64 code, depending on which was running at the moment the
interruption was raised.
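The transitions listed above form a small state machine, sketched here in Python. This is only a model of the description in the text: the event names mirror the instruction names, while the helper functions and the saved-mode stack are assumptions made for the illustration.

```python
def next_mode(mode, event, saved_modes):
    """Return the instruction-set mode after `event` occurs in `mode`."""
    if event == "JMPE" and mode == "IA-32":
        return "IA-64"
    if event == "br.ia" and mode == "IA-64":
        return "IA-32"
    if event == "interruption":
        saved_modes.append(mode)   # remember what was interrupted
        return "IA-64"             # interruptions are always handled in IA-64
    if event == "rfi" and mode == "IA-64" and saved_modes:
        return saved_modes.pop()   # go back to whichever mode was interrupted
    raise ValueError(f"{event!r} is not valid in {mode} mode")


def run(start_mode, events):
    """Apply a sequence of events and return the final mode."""
    mode, saved = start_mode, []
    for event in events:
        mode = next_mode(mode, event, saved)
    return mode


print(run("IA-32", ["interruption", "rfi"]))   # returns to the interrupted mode
```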



Athlon 64 and AMD's 64-bit technology

64-bit architecture


To get a first idea of how the 64-bit architecture works, and of how it differs
from a 32-bit implementation, it is useful to consider one definition first:

"A 64-bit processor is a microprocessor with a word size of 64 bits, a requirement for
memory and data intensive applications such as computer-aided design (CAD) applications,
database management systems, technical and scientific applications, and high-performance
servers. 64-bit computer architecture provides higher performance than 32-bit architecture
by handling twice as many bits of information in the same clock cycle."

The key phrases of this definition give a
rough idea that one can now process not only 2^32 = 4294967296 basic units of
information, but 2^64 = 18446744073709551616 units. The numbers are quite impressive
and show that the architecture has to be updated accordingly.

Several companies have implemented 64-bit processors, but the two
main ones are AMD and Intel. Other enterprises certainly have their place in the
development of 64-bit processors, too, but the mainstream market is going to see
products by AMD and Intel. It is therefore reasonable to explain how those two companies
designed their 64-bit processors; only details need to be considered when
relating the two specific layouts and implementations to the general concept. There are
considerable differences in how the two companies chose to make 32-bit programs work
on the 64-bit architecture. Those differences will be outlined in the 32-bit part of this
document; the following part outlines the structure of a "pure" 64-bit architecture.
As not much public information is available about the physical structure
of current 64-bit processors, because neither AMD nor Intel wants to give
crucial information to its rival in the processor market, it is useful to focus
on the instruction set architecture (ISA) and the general differences between a 32-bit
processor and the new 64-bit one.

With the successful introduction of the Opteron processor, AMD completed one half of its
forecast entry into the 64-bit processing world. Based on an evolution of the x86
instruction set used by current 32-bit processors made by Intel and AMD, the Opteron is
targeted at the high to mid-range server and workstation market.

The second processor released under the AMD64 architecture will be the Athlon 64,
formerly known as 'ClawHammer,' which aims to bring 64-bit computing power to the
desktop and mobile markets. The Athlon 64 will be a slightly hobbled version of the
Opteron and, with its built-in compatibility with current software and operating systems,
will attempt to bridge the gap between 32-bit and 64-bit computing environments easily.

We will focus on the Athlon 64 and what it will offer to home users and PC enthusiasts, as
well as covering the important details of the AMD64 platform. The Opteron and the Athlon
64 share an identical base architecture.

AMD has positioned the Opteron as the solution to many system needs, with the primary
goal of providing a 64-bit physical architecture while supplying high-end performance for
both 64- and 32-bit software. This translates into architectural advantages such as 64-bit
data and address pathways, upgraded physical and virtual memory addressing, and a true
64-bit internal design.



The other main innovation has been to move key Northbridge functions from the system
chipset directly into the Opteron core. These include a memory controller, multiprocessing
control, and data flow, along with a bridge to peripheral data traffic. Traditional
Southbridge and AGP components are still present in the Opteron architecture, but AMD's
eighth-generation processor has absconded with the main performance and CPU-centric

Opteron Microarchitecture

The Opteron core resembles the basic design of the Athlon XP, but the move to a 64-bit
architecture has brought some inherent advantages. Both the Opteron and Athlon XP
contain a few similar features, such as 64K apiece of Level 1 data and instruction cache
and three apiece of integer and floating-point units, but there have been some noted
improvements elsewhere. In terms of basic features, the Opteron includes a full 1MB of
Level 2 cache on the inside, along with an integrated heat spreader and new Socket 940
packaging on the outside.

Looking a bit deeper, AMD has improved on its seventh-generation design in other ways. A
processor's registers are like miniature cache areas where crucial data is stored and
retrieved; the Opteron features eight more general-purpose registers, and these have been
extended to 64 bits. AMD has also added eight 128-bit Streaming SIMD Extension (SSE)
registers for multimedia instructions, as well as compatibility with the SSE2 instructions
that premiered in Intel's Pentium 4.

The chip's translation look-aside buffers are larger and offer lower latencies than those of
the Athlon XP. Branch prediction is also enhanced, including an increase to 16K
bimodal/history counters, or four times the level found on the Athlon XP.

This last note is important, because in order to provide higher frequencies and better
scalability, AMD has extended the Opteron pipelines. The Opteron features a 12-stage
integer operation pipeline (versus 10 stages for the Athlon XP) and a 17-stage floating-
point operation pipeline (versus 15 for the Athlon XP). While this pays dividends on higher
potential clock speeds, it also incurs a risk of increased prediction misses, so AMD has
adjusted the architecture to provide even higher pipeline efficiencies than the Athlon XP.

The Opteron also has built-in core logic to support multiprocessor systems without the
need for a Northbridge chip. Internal CPU data traffic is all routed through a crossbar
(XBAR) communications architecture, which shuttles command and data information
between the CPU, memory controller, and three HyperTransport links. This is a huge
technological leap for multiprocessor workstation and server designs, as it provides a true
standard for OEMs to work with, and takes the Northbridge component out of the equation.

Dual-Channel Memory, More Or Less



The AMD Opteron includes an integrated memory controller, capable of supporting DDR200
through DDR333 speeds and a maximum of eight DIMM memory modules per processor.
The controller provides up to 5.3GB/sec of memory bandwidth (with 333MHz DDR),
yielding higher memory performance, lower memory latencies, and performance levels that
can scale to processor frequencies.

Since each CPU has its own memory controller, memory bandwidth will also scale in
multiprocessor systems. For example, a 2-way Opteron workstation will yield 10.6GB/sec
of memory bandwidth, while a 4-way Opteron server will double this again to an incredible
21.3GB/sec, along with supporting up to 32 DDR DIMMs.

The Opteron's integrated memory controller has been referred to as a dual-channel design,
but this isn't the exact truth. It certainly delivers double the bandwidth of a single-channel
controller, but does so by taking two 64-bit DDR modules and viewing them as a single
128-bit DIMM with a corresponding 128-bit data path. This is similar to the design of Intel's
dual-channel DDR chipsets such as the E7205 and 875P, but different than the true dual-
channel memory architecture of the NVIDIA nForce2.
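The bandwidth figures quoted above follow from simple arithmetic: transfers per second times bytes per transfer, times the number of processors. The Python sketch below assumes that DDR333 performs about 333.33 million transfers per second and that bandwidth scales perfectly with CPU count; the function name is made up for this example.

```python
def peak_bandwidth_gb_per_sec(transfers_per_sec, bus_width_bits=128, cpus=1):
    """Theoretical peak memory bandwidth in GB/sec (1 GB = 10^9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return cpus * transfers_per_sec * bytes_per_transfer / 1e9


DDR333 = 333.33e6   # assumed transfer rate for 333MHz DDR

for cpus in (1, 2, 4):
    gb = peak_bandwidth_gb_per_sec(DDR333, cpus=cpus)
    print(f"{cpus}-way: {gb:.2f} GB/sec")
```

The results come out at roughly 5.3, 10.7 and 21.3 GB/sec; whether the two-way figure is written as 10.6 or 10.7 depends only on where the rounding is done.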

This is actually a smart call when it comes to building an integrated memory controller:
for all intents and purposes, the bandwidth and performance are equivalent, but the
128-bit memory bus is more streamlined. In the Opteron architecture, there is no need for an
arbiter chip to handle traffic along the dual physical memory channels, and no requirement
for extra controller hardware. Of course, due to the "single-channel 128-bit" memory
architecture, the pairs of DDR modules must be matched in size, speed, and chip count,
though not necessarily in manufacturer.

AMD's 64-bit platform

To access an area in the computer's physical memory (RAM) to store or retrieve data, the
processor needs the address of that location, which is an integer number representing one
byte of memory storage.

Suddenly, having 64-bit registers makes sense: while a 32-bit processor can access up
to 4.3 billion memory addresses (2^32) for a total of about 4GB of physical memory, a
64-bit processor could conceivably access over 18 exabytes of physical memory. This is the
one area that clearly shows why 64-bit processors are the future of computing, as
demanding applications such as databases have long been scraping against the 4GB memory
limit.

If you are a business with a database of a terabyte or more of information, 64-bit
processors look pretty good right now.

Formerly known as x86-64, the AMD64 architecture is AMD's method of implementing
64-bit processors.



AMD64 is massively different from Intel's approach to 64-bit processors as seen in their
Itanium line. While Intel used a completely different architecture for the Itanium chips,
forcing software developers to relearn in order to program for them, or use emulation
which slowed down performance, AMD decided to simply extend the existing x86
architecture (the foundation of all PCs since Intel developed the 8086 processor in 1978)
to accommodate 64-bit registers as mentioned above.

There are several advantages to this. First, obviously, reworking code for AMD 64-bit
processors should be considerably easier, since the basis is the same. Second, the
AMD64-based Opteron and Athlon 64 are fully compatible with 32-bit applications.

A system based on either of these processors can use a 32-bit operating system and
software without a hitch, providing a stress free upgrade path for businesses and opening
up the desktop market to 64-bit processors, and more specifically, AMD's Athlon 64.

AMD accomplishes this by enabling the AMD64 processors to run in one of two modes,
Legacy mode and Long mode. Legacy mode removes all 64-bit support and makes the
processor run strictly in 32-bit mode, which is necessary for running most current operating
systems, including Windows. Long mode comprises two submodes, Compatibility
mode and 64-bit mode.
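The mode hierarchy can be summarized as a small decision function. The Python sketch below assumes two inputs: the LMA (Long Mode Active) flag, and a per-code-segment flag, called cs_l here, that distinguishes 64-bit code from 32-bit code inside Long mode. The second flag is not described in the text and is included as an assumption for illustration; the function name is hypothetical.

```python
def amd64_operating_mode(lma, cs_l=False):
    """Classify the processor's operating mode from two flags.

    lma  -- True when Long mode is active (set up by a 64-bit operating system)
    cs_l -- True when the running code segment is marked as 64-bit code
            (assumed flag; only meaningful when lma is True)
    """
    if not lma:
        return "Legacy mode"
    return "64-bit mode" if cs_l else "Compatibility mode"


print(amd64_operating_mode(lma=False))             # 32-bit OS, 32-bit software
print(amd64_operating_mode(lma=True, cs_l=False))  # 64-bit OS, 32-bit application
print(amd64_operating_mode(lma=True, cs_l=True))   # 64-bit OS, 64-bit application
```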

Compatibility mode is designed for a 64-bit operating system, such as Microsoft's
impending 64-bit versions of XP and Server 2003 (due late this year or early in the next),
that runs 32-bit software such as current databases. The advantage of this is that each
32-bit application, though still limited by the 4GB memory limit, can have all of that 4GB to
itself with no overhead for the operating system, since the operating system uses 64-bit
addressing and can thus access additional memory space.

This provides some improved performance for demanding 32-bit apps before they are
ported over to 64-bit. 64-bit mode is intended for a pure 64-bit environment, operating
system and software, and offers one huge advantage.....

AMD - Instruction Set Architecture:

The most basic units of organization for the instructions are specified in the following
way (see the AMD manual again, pages 38/39):

1. General Purpose Instructions: The basic integer instructions, which are used nearly
everywhere. They are often referred to as the x86 instruction set and are easily
illustrated by examples such as integer addition, moves, loads, stores, shifts and so on.

2. 128-Bit Media Instructions: Named for their primary application, these
instructions operate on vectors of packed data (e.g. video, scientific
applications, games, etc.). Moreover, they operate in parallel: a single instruction
processes several data elements at once (SIMD). These instructions are designed for
speed in one special field of applications and are therefore not suited to
general-purpose tasks.

3. 64-bit Media Instructions: Also SIMD instructions, and not much different in use
from the 128-bit instructions.

4. Floating Point Instructions: As the general purpose instructions only work on
integers, these instructions provide a suitable tool for floating point operations.
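
The parallel behaviour of the media instructions can be illustrated with a small model.
The sketch below treats a 128-bit value as four independent 32-bit lanes and adds two such
values lane by lane, in the spirit of a packed integer add. The lane width and the function
name `packed_add_u32` are our choices for illustration; real media instructions support
several lane widths and data types.

```python
import struct

LANES = 4            # four 32-bit lanes make up one 128-bit value
MASK32 = 0xFFFFFFFF  # each lane wraps around independently


def packed_add_u32(a: bytes, b: bytes) -> bytes:
    """Add two 128-bit values lane by lane, each lane modulo 2**32."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("both operands must be exactly 16 bytes (128 bits)")
    lanes_a = struct.unpack(f"<{LANES}I", a)
    lanes_b = struct.unpack(f"<{LANES}I", b)
    # No carry crosses a lane boundary: an overflow in one lane leaves the others intact.
    sums = [(x + y) & MASK32 for x, y in zip(lanes_a, lanes_b)]
    return struct.pack(f"<{LANES}I", *sums)


if __name__ == "__main__":
    a = struct.pack("<4I", 1, 2, 3, 0xFFFFFFFF)
    b = struct.pack("<4I", 10, 20, 30, 1)
    print(struct.unpack("<4I", packed_add_u32(a, b)))  # (11, 22, 33, 0)
```

One call does the work of four scalar additions, which is where the speed advantage in
video or scientific code comes from.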

When LMA (long mode active) is set, which is usually done by the operating system, the
processor runs in long mode and the 64-bit features become available. Long mode running
64-bit code is the stage we would like to call "pure" 64-bit mode, and a comparable mode
can be recognized in both architectures, the one described here from AMD and the Intel
IA64 described later on this page. For the following part of the analysis we assume that
LMA is set and the processor is in "pure" 64-bit mode, which is not to be confused with
legacy mode or with the compatibility sub-mode of long mode; these are features to support
the transition from 32-bit machines and software to the new architecture, and they belong
in the 32-bit section.

The default size for operands in 64-bit mode is 32 bits, the same default as in 32-bit
protected mode. A REX prefix byte placed in front of an instruction specifies whether one
would like to accept this default value or to extend the operand size to 64 bits; the same
prefix supplies the extra register-number bits needed to reach the 8 new GPRs R8-R15. A
64-bit register is a widened version of the corresponding 32-bit register, not a
concatenation of two registers. To make room for the REX prefix, some opcodes had to be
redefined. Nevertheless, these are only minor changes and most of the opcode map is
carried over from a 32-bit processor.

The memory is a single flat address space starting at address 0 and addressed linearly
with 64-bit virtual addresses. The operating system can specify several levels of data
access/protection for the address space. The bases of the code, data and stack segments
are treated as 0, so segmentation is effectively switched off for ordinary memory
accesses. This is a real simplification compared to 32-bit processing and all the
compatibility modes offered by AMD: architecturally, addresses simply run from 0 to
2^64 - 1 without any specialties, although current implementations support fewer address
bits. This concept shows on the micro level what the goal of the complete architecture is:
the search for more simplicity, more raw computing power and preparation for large amounts
of data. Another cornerstone of this path is that the virtual address is translated to
physical memory by paging alone, performed on the virtual address directly, with no
segment translation in between.

The bytes themselves are stored in little-endian order, and this holds for data and
instructions alike. The instructions do not really "change" in the sense that a structural
redesign has happened; the size of the operands is the crucial factor. Consider for
example this instruction: 48 B8 1234567812345678. The 48 is a REX prefix that sets the
operand size to 64 bits. The opcode B8 is also used in the 32-bit architecture, and the
remaining part is just an 8-byte (64-bit) immediate value: we are computing with a 64-bit
processor.
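
The example instruction can be taken apart with a few lines of code. The sketch below
decodes only this one instruction form (a prefix byte in the range 40-4F with its W bit
set, followed by an opcode B8+r and eight immediate bytes). The bit layout of the prefix
and the register numbering are taken from AMD's documentation rather than from this
article, and the function name `decode_mov_imm` is hypothetical. Note that the immediate
is written above as a number; in memory its bytes appear in little-endian order.

```python
REG_NAMES_64 = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI",
                "R8", "R9", "R10", "R11", "R12", "R13", "R14", "R15"]


def decode_mov_imm(code: bytes) -> str:
    """Decode 'REX.W + B8+r imm64' (MOV r64, imm64); reject everything else."""
    if len(code) != 10:
        raise ValueError("expected prefix + opcode + 8 immediate bytes")
    rex, opcode = code[0], code[1]
    if rex & 0xF0 != 0x40:
        raise ValueError("expected a REX prefix (0x40-0x4F)")
    rex_w = (rex >> 3) & 1  # 1 selects the 64-bit operand size
    rex_b = rex & 1         # extra register-number bit, reaches R8-R15
    if not rex_w or opcode & 0xF8 != 0xB8:
        raise ValueError("this sketch handles only REX.W + B8+r")
    reg = (rex_b << 3) | (opcode & 0x07)        # 4-bit register number
    imm = int.from_bytes(code[2:], "little")    # immediate is stored little-endian
    return f"MOV {REG_NAMES_64[reg]}, 0x{imm:016X}"


if __name__ == "__main__":
    code = bytes.fromhex("48B8") + (0x1234567812345678).to_bytes(8, "little")
    print(" ".join(f"{b:02X}" for b in code))  # 48 B8 78 56 34 12 78 56 34 12
    print(decode_mov_imm(code))                # MOV RAX, 0x1234567812345678
```

Changing the prefix from 48 to 49 sets the extra register bit, and the same opcode then
targets R8 instead of RAX.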

There exist five addressing modes:

• Absolute Address: given as a displacement from the segment base (in 64-bit mode just 0)

• Instruction-Relative Address: given relative to the instruction pointer (IP), also
called the program counter (PC)

• Stack Address: using the stack pointer

• String Addresses

• Mod R/M Address

And again one realizes that there are no real differences in the structure compared to
non-64-bit ISAs. The PC, the stack and absolute addressing just carry over with more bits.
The RIP (the 64-bit instruction pointer / program counter) keeps its function, but in
64-bit mode data and code can also be addressed relative to it, which provides a more
efficient way to reach locations near the current code. This is one reason why the AMD
64-bit architecture can gain speed: direct access relative to the program code.
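
How an instruction-relative address is resolved can be sketched as follows. We assume the
convention documented by AMD, which this article does not state explicitly: the signed
32-bit displacement is added to the address of the next instruction. The addresses and the
instruction length in the usage example are made up.

```python
MASK64 = (1 << 64) - 1


def rip_relative_target(instr_addr: int, instr_len: int, disp32: int) -> int:
    """Effective address = address of the next instruction + signed 32-bit displacement."""
    if not -2**31 <= disp32 < 2**31:
        raise ValueError("displacement must fit in a signed 32-bit field")
    next_rip = instr_addr + instr_len
    return (next_rip + disp32) & MASK64


if __name__ == "__main__":
    # Hypothetical 7-byte instruction at 0x401000 referring to data 0x2000 bytes ahead.
    print(hex(rip_relative_target(0x401000, 7, 0x2000)))        # 0x403007
    # Loading the same code elsewhere moves the target by the same amount,
    # so the encoded displacement does not have to be patched.
    print(hex(rip_relative_target(0x7F0000401000, 7, 0x2000)))  # 0x7f0000403007
```

The second call shows why this addressing mode suits position-independent code: only the
distance between instruction and data is encoded, not an absolute 64-bit address.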

For absolute addressing it gets even easier due to the common standard base 0. The same
holds for pointers in general. As the segment registers are no longer used to form
addresses, the concept of far pointers, which store a segment selector together with the
usual address, is no longer needed: the memory is just one linear chunk. Near pointers are
enough, and for 64-bit applications on the AMD architecture one can return to the general
term "pointer", since it is obvious that it can only point into one address space.
Immediates and displacements generally remain 32 bits in size and are sign-extended to
64 bits where needed.
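
The extension of a 32-bit immediate or displacement to 64 bits is, on AMD64, a sign
extension; we state this from AMD's documentation rather than from this article. A minimal
sketch, with a helper name of our own choosing:

```python
def sign_extend_32_to_64(value32: int) -> int:
    """Return the 64-bit two's-complement bit pattern of a 32-bit field."""
    value32 &= 0xFFFFFFFF
    if value32 & 0x80000000:
        # Negative value: copy the sign bit into the upper 32 bits.
        return value32 | 0xFFFFFFFF00000000
    return value32


if __name__ == "__main__":
    print(hex(sign_extend_32_to_64(0x00001234)))  # 0x1234
    print(hex(sign_extend_32_to_64(0xFFFFFFF0)))  # 0xfffffffffffffff0 (still -16)
```

Small positive and negative constants therefore keep their value in 64-bit arithmetic
while the instruction encoding stays as compact as in 32-bit code.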

This finishes the broad outline of the instruction set architecture for AMD based on the
document mentioned above. Their philosophy to keep it simple and easy becomes apparent,
but this is only true for AMD, not for 64-bit processors in general. Other designs might
demand more sophisticated instruction sets and might not build upon established concepts
to the same degree. Some further technical details, which are not emphasized here, still
have to be known: the new registers must be taken into account, so the number of possible
combinations for addressing and declaring operands correctly rises. The level of
complexity, however, does not rise significantly for AMD. Outlining the new instruction
forms for every new register would be tedious and cumbersome work and would only be valid
for the ISA of AMD.

Memory Controllers and Hypertransport

Both the Opteron and the Athlon 64 contain 8 extra registers useable only in 64-bit mode,
which should increase application performance significantly.

One of the largest problems in modern computer design is the presence of bottlenecks, or
areas of low performance which slow an otherwise fast system down.

In most modern computers, data intended for the video and main memory needs to be
passed to and through the Northbridge chip on the motherboard, and data from other
sources like USB connections, PCI slots or hard-drives must pass through the Southbridge
chip, then the Northbridge.

With the amount of information that needs to be squeezed through the various data buses
into the processor to be operated on, bottlenecks inevitably develop, where the processor
is waiting for the necessary bits to be delivered by the I/O subsystem feeding it.

As processors get consistently faster every few months, while data bus breakthroughs are
irregular, the issue perpetuates itself.

AMD has attempted to get around this constant problem by equipping its 64-bit processors
with two advantages: internal DDR memory controllers and Hypertransport links. AMD has
built the memory controller (normally a part of the motherboard to which the processor is
attached) directly into their Opteron and Athlon 64 CPUs.

As you can imagine, this considerably reduces the time it takes the processor to access
memory: while data still needs to travel between the processor and the physical memory,
communication with the controller that arranges the data flow no longer leaves the
processor, which reduces the number of clock cycles lost while waiting for the memory.

Another benefit is the fact that memory traffic no longer needs to run between the
processor and the Northbridge chip on the motherboard, which traditionally provides the
memory controller, reducing bottlenecks. The second part of the package is support for
Hypertransport input/output technology.

HyperTransport™ technology

HyperTransport™ technology is a high-speed, low latency, point-to-point link designed to
increase the communication speed between integrated circuits in computers, servers,
embedded systems, and networking and telecommunications equipment up to 48 times
faster than some existing technologies.

HyperTransport™ technology helps reduce the number of buses in a system, which can
reduce system bottlenecks and enable today's faster microprocessors to use system
memory more efficiently in high-end multiprocessor systems.

HyperTransport™ technology is designed to:

• Provide significantly more bandwidth than current technologies

• Use low-latency responses and low pin counts

• Maintain compatibility with legacy PC buses while being extensible to new SNA
(Systems Network Architecture) buses.

• Appear transparent to operating systems and offer little impact on peripheral drivers.

With this article and the previous one, which cover the 64-bit architectures by Intel and
AMD, we have finished our discussion of the processors for the beginning of the
millennium. In addition, it is important to mention that there already are computers
running 64-bit versions of Windows and Linux. Now, more than performance, our biggest
concern is compatibility with our present programs. We really have to verify how
compatible those 64-bit architectures are with our 32- or 16-bit programs. We hope to have
the answer to this question in less than a year. To finish this part on 64-bit CPUs, it is
very good to see how the two companies compete in the market of high performance
processors. This grants us access to even cheaper and better computers.

To conclude, we would like to comment on the great room that still remains for the
evolution of electronics and consequently of computers. More important than the creation
of supercomputers, this new age will see the pervasiveness of computers. It will be the
time of invisible computers. They will be present in nearly all modern devices. At the
moment they inhabit our TV sets, microwave ovens, cars, watches, stereos, DVD players,
etc. In the near future, they will invade the refrigerator, the toaster, the
air-conditioner and all everyday appliances. We have gone beyond the cheap electronics age
and we are entering the cheap intelligence age.


References for this part are mostly cited at the appropriate positions in the text - this
list gives an overview:

- Search390.com:

- Hammer Review A1- Electronics: http://www.a1-

- Article X86-64 Hardware site: http://www.hardwaresite.net/x86-64.html

- AMD Developer's Manual X86-64: http://www.amd.com/us-

- Article IA-64 Hardware site: http://www.hardwaresite.net/ia64.html

- Presentation IA-64:

- Software Developer's Manual Itanium:

- Hardware Developer's Manual Itanium:

- AMD Opteron video: http://www.amd.com/us-

- Article 64-bit computing: c't 12/99 page 28

- Basic notations, definitions and concepts are taken from "Computer Organization and
Design", Hennessy and Patterson