
CS6303 - Computer Architecture Notes

Unit 1
Design Principles of Computer Architecture

CISC vs. RISC


Instructions directly executed by hardware
Maximize instruction issue rate (ILP)
Simple instructions (easy to decode)
Access to memory only via load/store
Plenty of registers
Pipelining

Basic Computer Organization

Bus-Based Computer Organization

Data path:

Memory
Dynamic Random Access Memory (DRAM)
The choice for main memory
Volatile (contents go away when power is lost)
Fast
Relatively small
DRAM capacity: 2x / 2 years (since '96);
64x size improvement in the last decade
Static Random Access Memory (SRAM)
The choice for cache
Much faster than DRAM, but less dense and more costly
Magnetic disks
The choice for secondary memory
Non-volatile
Slower
Relatively large
Capacity: 2x / 1 year (since '97);
250x size improvement in the last decade
Solid state (Flash) memory

The choice for embedded computers


Non-volatile
Optical disks
Removable, therefore very large
Slower than disks
Magnetic tape
Even slower
Sequential (non-random) access
The choice for archival
System Software

Operating system: supervising program that interfaces the user's program
with the hardware (e.g., Linux, MacOS, Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Provides for protected sharing among multiple applications
Compiler: translates programs written in a high-level language (e.g., C, Java)
into instructions that the hardware can execute

TECHNOLOGY

A transistor is simply an on/off switch controlled by electricity. The


integrated circuit (IC) combined dozens to hundreds of transistors into
a single chip. To describe the tremendous increase in the number of
transistors from hundreds to millions, the adjective very large scale is
added to the term, creating the abbreviation VLSI, for very large-scale integrated circuit.
CLASSES OF COMPUTERS
Desktop computers

Designed to deliver good performance to a single user at low cost, usually
executing third-party software, and usually incorporating a graphics
display, a keyboard, and a mouse

Servers
Used to run larger programs for multiple, simultaneous users; typically
accessed only via a network, with a greater emphasis on dependability
and (often) security
Modern version of what used to be called mainframes, minicomputers and
supercomputers
Large workloads
Built using the same technology as desktops, but with higher capacity
- Gigabytes to terabytes to petabytes of storage
Expandable
Scalable
Reliable
Large spectrum: from low-end (file storage, small businesses) to
supercomputers (high end scientific and engineering applications)
Examples: file servers, web servers, database servers
Supercomputers

A high-performance, high-cost class of servers with hundreds to thousands
of processors, terabytes of memory, and petabytes of storage, used for
high-end scientific and engineering applications
Embedded computers (processors)

A computer inside another device used for running one


predetermined application
- Microprocessors everywhere! (washing machines,
cell phones, automobiles, video games)
- Run one or a few applications
- Specialized hardware integrated with the application
(not your common processor)
- Usually stringent limitations (battery power)
- Low tolerance for failure (don't want your airplane
avionics to fail!)
- Becoming ubiquitous
- Engineered using processor cores
- The core allows the engineer to integrate other
functions into the processor for fabrication on the
same chip
- Using hardware description languages: Verilog, VHDL

Embedded Processor Characteristics


The largest class of computers spanning the widest range of
applications and performance
Often have minimum performance requirements.
Often have stringent limitations on cost.
Often have stringent limitations on power consumption.
Often have low tolerance for failure.
PERFORMANCE

Defining Performance
Let's suppose we define performance in terms of speed. This still leaves two
possible definitions. You could define the fastest plane as the one with the
highest cruising speed, taking a single passenger from one point to another in
the least time. If you were interested in transporting 450 passengers from one
point to another, however, the 747 would clearly be the fastest, as the last
column of the figure shows. Similarly, we can define computer performance in
several different ways.

Throughput and Response Time

Do the following changes to a computer system increase throughput, decrease
response time, or both?

1. Replacing the processor in a computer with a faster version
2. Adding additional processors to a system that uses multiple processors
for separate tasks, for example, searching the World Wide Web

Example
Time taken to run a program = 10s on A, 15s on B
Relative performance = Execution TimeB / Execution TimeA
                     = 15s / 10s
                     = 1.5
So A is 1.5 times faster than B
Measuring Execution Time
Elapsed time
  Total response time, including all aspects: processing, I/O, OS overhead, idle time
  Determines system performance
CPU time
  Time spent processing a given job; discounts I/O time and other jobs' shares
  Comprises user CPU time and system CPU time
  Different programs are affected differently by CPU and system performance

CPU Clocking
Operation of digital hardware governed by a constant-rate clock

Clock period: duration of a clock cycle
  e.g., 250 ps = 0.25 ns = 250 × 10^-12 s
Clock frequency (rate): cycles per second
  e.g., 4.0 GHz = 4000 MHz = 4.0 × 10^9 Hz

CPU TIME

CPU Time = CPU Clock Cycles × Clock Cycle Time = CPU Clock Cycles / Clock Rate

Example
Computer A: 2 GHz clock, 10 s CPU time
Designing Computer B:
  Aim for 6 s CPU time
  Can use a faster clock, but this causes 1.2 × the clock cycles
How fast must Computer B's clock be?

Clock RateB = Clock CyclesB / CPU TimeB = (1.2 × Clock CyclesA) / 6 s
Clock CyclesA = CPU TimeA × Clock RateA = 10 s × 2 GHz = 20 × 10^9
Clock RateB = (1.2 × 20 × 10^9) / 6 s = (24 × 10^9) / 6 s = 4 GHz
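The same calculation can be written out programmatically. A minimal C sketch, assuming the numbers from the example above (variable names are illustrative):

#include <stdio.h>

/* Worked version of the example above: computer A runs a program in 10 s
 * at 2 GHz; computer B must run it in 6 s but needs 1.2x the clock cycles. */
int main(void) {
    double cpu_time_a = 10.0;          /* seconds */
    double clock_rate_a = 2.0e9;       /* Hz (2 GHz) */
    double cpu_time_b = 6.0;           /* target seconds for B */

    double clock_cycles_a = cpu_time_a * clock_rate_a;   /* 20e9 cycles */
    double clock_cycles_b = 1.2 * clock_cycles_a;        /* 24e9 cycles */
    double clock_rate_b = clock_cycles_b / cpu_time_b;   /* 4e9 Hz */

    printf("Clock rate B = %.1f GHz\n", clock_rate_b / 1e9);
    return 0;
}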

Instruction Count and CPI


Instruction count for a program
  Determined by program, ISA and compiler
Average cycles per instruction (CPI)
  Determined by CPU hardware
  If different instructions have different CPI
  - Average CPI affected by instruction mix

Clock Cycles = Instruction Count × Cycles per Instruction
CPU Time = Instruction Count × CPI × Clock Cycle Time
         = Instruction Count × CPI / Clock Rate

CPI Example

Computer A: Cycle Time = 250 ps, CPI = 2.0
Computer B: Cycle Time = 500 ps, CPI = 1.2
Same ISA
Which is faster, and by how much?

CPU Time = Instruction Count × CPI × Cycle Time
CPU TimeA = I × 2.0 × 250 ps = I × 500 ps
CPU TimeB = I × 1.2 × 500 ps = I × 600 ps
CPU TimeB / CPU TimeA = (I × 600 ps) / (I × 500 ps) = 1.2

A is faster than B by a factor of 1.2.

CPI in More Detail


If different instruction classes take different numbers of cycles:

Clock Cycles = Σ (i = 1 to n) (CPIi × Instruction Counti)

Weighted average CPI:

CPI = Clock Cycles / Instruction Count
    = Σ (i = 1 to n) (CPIi × Instruction Counti / Instruction Count)

CPI Example
Alternative compiled code sequences using instructions in classes A, B, and C:

Class               A    B    C
CPI for class       1    2    3
IC in sequence 1    2    1    2
IC in sequence 2    4    1    1

Sequence 1: IC = 5
  Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  Avg. CPI = 10/5 = 2.0
Sequence 2: IC = 6
  Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  Avg. CPI = 9/6 = 1.5
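A small C sketch of the weighted-CPI calculation for the two sequences above, using the class CPIs and instruction counts from the table (function and variable names are illustrative):

#include <stdio.h>

/* Weighted-CPI calculation: cpi[] holds the CPI of classes A, B, C;
 * ic[] holds the instruction counts of each class in one sequence. */
static void report(const char *name, const int cpi[3], const int ic[3]) {
    int cycles = 0, count = 0;
    for (int i = 0; i < 3; i++) {
        cycles += cpi[i] * ic[i];
        count  += ic[i];
    }
    printf("%s: IC = %d, clock cycles = %d, avg CPI = %.1f\n",
           name, count, cycles, (double)cycles / count);
}

int main(void) {
    int cpi[3]  = {1, 2, 3};    /* CPI for classes A, B, C */
    int seq1[3] = {2, 1, 2};    /* sequence 1 instruction mix */
    int seq2[3] = {4, 1, 1};    /* sequence 2 instruction mix */
    report("Sequence 1", cpi, seq1);   /* 10 cycles, avg CPI 2.0 */
    report("Sequence 2", cpi, seq2);   /* 9 cycles, avg CPI 1.5 */
    return 0;
}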

Performance depends on:
  Algorithm: affects IC, possibly CPI
  Programming language: affects IC, CPI
  Compiler: affects IC, CPI
  Instruction set architecture: affects IC, CPI, Tc

CPU Time = (Instructions / Program) × (Clock cycles / Instruction) × (Seconds / Clock cycle)

Computers are constructed using a clock that determines when events take place
in the hardware. These discrete time intervals are called clock cycles (or ticks,
clock ticks, clock periods, clocks, cycles). Designers refer to the length of a
clock period both as the time for a complete clock cycle (e.g., 250 picoseconds,
or 250 ps) and as the clock rate (e.g., 4 gigahertz, or 4 GHz), which is the inverse
of the clock period. In the next subsection, we will formalize the relationship
between the clock cycles of the hardware designer and the seconds of the
computer user.
Computer Performance and its Factors

Instruction Performance

The term clock cycles per instruction, which is the average number of clock
cycles each instruction takes to execute, is often abbreviated as CPI.

The Classic CPU Performance Equation

CPU Time = Instruction Count × CPI × Clock Cycle Time = (Instruction Count × CPI) / Clock Rate

POWERWALL

UNIPROCESSORS AND MULTIPROCESSORS


The power limit has forced a dramatic change in the design of microprocessors.
Figure shows the improvement in response time of programs for desktop
microprocessors over time. Since 2002, the rate has slowed from a factor of 1.5
per year to less than a factor of 1.2 per year.

As an analogy, suppose the task was to write a newspaper story. Eight reporters
working on the same story could potentially write a story eight times faster. To
achieve this increased speed, one would need to break up the task so that each
reporter had something to do at the same time. Thus, we must schedule the
subtasks. If anything went wrong and just one reporter took longer than the
seven
others did, then the benefits of having eight writers would be diminished. Thus,
we must balance the load evenly to get the desired speedup. Another danger
would be if reporters had to spend a lot of time talking to each other to write

their sections. You would also fall short if one part of the story, such as the
conclusion, couldn't be written until all of the other parts were completed. Thus,
care must be taken to reduce communication and synchronization overhead. For
both this analogy and parallel programming, the challenges include scheduling,
load balancing, time for synchronization, and overhead for communication
between the parties. As you might guess, the challenge is stiffer with more
reporters for a newspaper story and more processors for parallel programming

OPERATIONS OF COMPUTER HARDWARE

OPERANDS OF COMPUTER HARDWARE


A very large number of registers may increase the clock cycle time simply
because it takes electronic signals longer when they must travel farther.
Guidelines such as smaller is faster are not absolutes; 31 registers may not be
faster than 32. Yet, the truth behind such observations causes computer
designers to take them seriously. In this case, the designer must balance the
craving of programs
for more registers with the designer's desire to keep the clock cycle fast. Another
reason for not using more than 32 is the number of bits it would take in the
instruction format.
Memory Operands
As explained above, arithmetic operations occur only on registers in MIPS
instructions; thus, MIPS must include instructions that transfer data between
memory and registers. Such instructions are called data transfer instructions.
To access a word in memory, the instruction must supply the memory address.

Memory is just a large, single-dimensional array, with the address acting as the
index to that array, starting at 0. For example, in Figure the address of the third
data element is 2, and the value of Memory[2] is 10.

Given the importance of registers, what is the rate of increase in the number of
registers in a chip over time?
1. Very fast: They increase as fast as Moores law, which predicts doubling the
number of transistors on a chip every 18 months.
2. Very slow: Since programs are usually distributed in the language of the
computer, there is inertia in instruction set architecture, and so the number
of registers increases only as fast as new instruction sets become viable.

REPRESENTING INSTRUCTIONS

LOGICAL INSTRUCTIONS

Case/Switch Statement
Most programming languages have a case or switch statement that allows the
programmer to select one of many alternatives depending on a single value. The
simplest way to implement switch is via a sequence of conditional tests, turning
the switch statement into a chain of if-then-else statements. Sometimes the
alternatives may be more efficiently encoded as a table of addresses of
alternative instruction sequences, called a jump address table or jump table,
and the program needs only to index into the table and then jump to the
appropriate sequence. The jump table is then just an array of words containing
addresses that correspond to labels in the code. The program loads the
appropriate entry from the jump table into a register. It then needs to jump using
the address in the register. To support such situations, computers like MIPS
include a jump register instruction (jr), meaning an unconditional jump to the
address specified in a register. Then it jumps to the proper address using this
instruction
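A rough C analogue of the jump-table idea described above, using an array of function pointers in place of an array of label addresses plus the jr instruction (the function names are hypothetical, purely for illustration):

#include <stdio.h>

/* Illustrative jump table: an array of "addresses" (here, function
 * pointers) indexed by the switch value, standing in for the array of
 * label addresses plus the jump-register (jr) instruction in MIPS. */
static void case0(void) { printf("case 0\n"); }
static void case1(void) { printf("case 1\n"); }
static void case2(void) { printf("case 2\n"); }

static void (*const jump_table[])(void) = { case0, case1, case2 };

int main(void) {
    int k = 2;                      /* the switch value */
    if (k >= 0 && k < 3)
        jump_table[k]();            /* load the table entry, then jump through it */
    return 0;
}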
Nested Procedures
Procedures that do not call others are called leaf procedures. Life would be
simple if all procedures were leaf procedures, but they aren't. Just as a spy might
employ other spies as part of a mission, who in turn might use even more spies,
so do procedures invoke other procedures. Moreover, recursive procedures even
invoke clones of themselves. Just as we need to be careful when using registers
in procedures, more care must also be taken when invoking nonleaf procedures.
ADDRESSING MODES
1. Immediate addressing, where the operand is a constant within the instruction
itself
2. Register addressing, where the operand is a register
3. Base or displacement addressing, where the operand is at the memory location
whose address is the sum of a register and a constant in the instruction
4. PC-relative addressing, where the branch address is the sum of the PC and a
constant in the instruction
5. Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC

UNIT 2
ADDITION AND SUBTRACTION
Digits are added bit by bit from right to left, with carries passed to the next digit to the
left, just as you would do by hand. Subtraction uses addition: the appropriate operand is
simply negated before being added.

Binary Addition and Subtraction


Example
1. Add 6ten and 7ten in binary.
2. Subtract 6ten from 7ten in binary.

Adding 6ten and 7ten can be done as follows:
  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 1101two = 13ten
The 4 bits to the right have all the action.
Subtracting 6ten from 7ten can be done directly:
  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
- 0000 0000 0000 0000 0000 0000 0000 0110two = 6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten
or subtraction can be done via addition, using the two's complement representation of
-6:
  0000 0000 0000 0000 0000 0000 0000 0111two = 7ten
+ 1111 1111 1111 1111 1111 1111 1111 1010two = -6ten
= 0000 0000 0000 0000 0000 0000 0000 0001two = 1ten

The above figure shows the sums and carries. The carries are shown in parentheses, with
the arrows showing how they are passed.
Binary addition, showing carries from right to left. The rightmost bit adds 1 to 0,
resulting in the sum of this bit being 1 and the carry out from this bit being 0. Hence, the
operation for the second digit to the right is 0 + 1 + 1. This generates a 0 for this sum bit
and a carry out of 1. The third digit is the sum of 1 + 1 + 1, resulting in a carry out of 1
and a sum bit of 1. The fourth bit is 1 + 0 + 0, yielding a 1 sum and no carry.

Overflow occurs when the result from an operation cannot be represented with the
available hardware, in this case a 32-bit word. When can overflow occur in addition?
When adding operands with different signs, overflow cannot occur. The reason is the sum must be no larger than
one of the operands. For example, -10 + 4 = -6. Since the operands fit in 32 bits and the
sum is no larger than an operand, the sum must fit in 32 bits as well. Therefore, no
overflow can occur when adding positive and negative operands. There are similar
restrictions to the occurrence of overflow during subtract, but it's just the opposite
principle: when the signs of the operands are the same, overflow cannot occur. To see
this, remember that x - y = x + (-y) because we subtract by negating the second operand
and then add. Therefore, when we subtract operands of the same sign we end up by
adding operands of different signs. From the prior paragraph, we know that overflow
cannot occur in this case either.
Adding or subtracting two 32-bit numbers can yield a result that needs 33 bits to be fully
expressed. The lack of a 33rd bit means that when overflow occurs, the sign bit is set with
the value of the result instead of the proper sign of the result. Since we need just one
extra bit, only the sign bit can be wrong. Hence, overflow occurs when adding two
positive numbers and the sum is negative, or vice versa. This means a carry out occurred
into the sign bit. Overflow occurs in subtraction when we subtract a negative number
from a positive number and get a negative result, or when we subtract a positive number
from a negative number and get a positive result. This means a borrow occurred from the
sign bit.
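The sign rule above can be turned into a small check. A minimal C sketch, assuming 32-bit two's complement integers (the helper name is illustrative; the addition is performed in unsigned arithmetic so the wraparound is well defined in C):

#include <stdint.h>
#include <stdio.h>

/* Overflow rule from the text: adding two numbers of the same sign
 * overflows exactly when the sum has the opposite sign. */
static int add_overflows(int32_t a, int32_t b) {
    int32_t sum = (int32_t)((uint32_t)a + (uint32_t)b);
    return ((a >= 0) == (b >= 0)) && ((sum >= 0) != (a >= 0));
}

int main(void) {
    printf("%d\n", add_overflows(2147483647, 1));  /* 1: positive + positive gives negative */
    printf("%d\n", add_overflows(-10, 4));         /* 0: operands of different signs never overflow */
    return 0;
}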
Two kinds of Overflow Conditions

MIPS detects overflow with an exception, also called an interrupt on many computers.
An exception or interrupt is essentially an unscheduled procedure call. The address of the
instruction that overflowed is saved in a register, and the computer jumps to a predefined
address to invoke the appropriate routine for that exception. The interrupted address is
saved so that in some situations the program can continue after corrective code is
executed.

MIPS includes a register called the exception program counter (EPC) to contain the
address of the instruction that caused the exception. The instruction move from system
control (mfc0) is used to copy EPC into a general-purpose register so that MIPS software
has the option of returning to the offending instruction via a jump register instruction.

MULTIPLICATION
Let's start with the multiplication of decimal numbers in longhand to remind ourselves of the
steps of multiplication and the names of the operands. For reasons that will become clear
shortly, we limit this decimal example to using only the digits 0 and 1.

Multiplying 1000ten by 1001ten:

Multiplicand      1000ten
Multiplier      × 1001ten
                ---------
                     1000
                    0000
                   0000
                  1000
                ---------
Product         1001000ten
The first operand is called the multiplicand and the second the multiplier. The final result
is called the product.
The algorithm takes the digits of the multiplier one at a time from right to
left, multiplies the multiplicand by that single digit of the multiplier, and shifts the
intermediate product one digit to the left of the earlier intermediate products.

The number of digits in the product is considerably larger than the number in
either the multiplicand or the multiplier.
Ignoring the sign bits, the product of an n-bit multiplicand and an m-bit
multiplier is n + m bits long; n + m bits are required to represent all possible
products.
Multiply must cope with overflow because we frequently want a 32-bit product as
the result of multiplying two 32-bit numbers.

In the above example


1. Just place a copy of the multiplicand (1 × multiplicand) in the proper place if the
multiplier digit is a 1, or

2. Place 0 (0 × multiplicand) in the proper place if the digit is 0.


SEQUENTIAL VERSION OF THE MULTIPLICATION ALGORITHM AND
HARDWARE
This design mimics the algorithm we learned in school. The following figure shows
the hardware.

We have drawn the hardware so that data flows from top to bottom to resemble more
closely the paper-and-pencil method. Let's assume that the multiplier is in the 32-bit
Multiplier register and that the 64-bit Product register is initialized to 0. From the
paper-and-pencil example above, it's clear that we will need to move the multiplicand left one
digit each step, as it may be added to the intermediate products. Over 32 steps, a 32-bit
multiplicand would move 32 bits to the left. Hence, we need a 64-bit Multiplicand
register, initialized with the 32-bit multiplicand in the right half and zero in the left half.
This register is then shifted left 1 bit each step to align the multiplicand with the sum
being accumulated in the 64-bit Product register.
THREE BASIC STEPS NEEDED FOR EACH BIT

1. Test Multiplier0: if it is 1, add the multiplicand to the product and place the result in the Product register.
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.

The least significant bit of the multiplier (Multiplier0) determines whether the
multiplicand is added to the Product register. The left shift in step 2 has the effect of
moving the intermediate operands to the left, just as when multiplying with paper and
pencil. The shift right in step 3 gives us the next bit of the multiplier to examine in the
following iteration. These three steps are repeated 32 times to obtain the product. If each
step took a clock cycle, this algorithm would require almost 100 clock cycles to multiply
two 32-bit numbers.
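A software sketch of the three-step algorithm just described, written in C with a 64-bit multiplicand/product as in the hardware description (variable names are illustrative, not the actual register names):

#include <stdint.h>
#include <stdio.h>

/* First version of the sequential algorithm: repeat 32 times --
 * (1) if the low multiplier bit is 1, add the multiplicand to the product,
 * (2) shift the multiplicand left, (3) shift the multiplier right. */
static uint64_t multiply(uint32_t multiplicand, uint32_t multiplier) {
    uint64_t product = 0;
    uint64_t mcand = multiplicand;        /* 64-bit Multiplicand register */
    for (int step = 0; step < 32; step++) {
        if (multiplier & 1)               /* step 1: test Multiplier0 */
            product += mcand;
        mcand <<= 1;                      /* step 2: shift multiplicand left */
        multiplier >>= 1;                 /* step 3: shift multiplier right */
    }
    return product;
}

int main(void) {
    printf("%llu\n", (unsigned long long)multiply(2, 3));   /* prints 6 */
    return 0;
}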
This algorithm and hardware are easily refined to take 1 clock cycle per step. The speedup comes from performing the operations in parallel: the multiplier and multiplicand are
shifted while the multiplicand is added to the product if the multiplier bit is a 1. The
hardware just has to ensure that it tests the right bit of the multiplier and gets the
preshifted version of the multiplicand. The hardware is usually further optimized to halve the
width of the adder and registers by noticing where there are unused portions of registers
and adders.

Replacing arithmetic by shifts can also occur when multiplying by constants. Some
compilers replace multiplies by short constants with a series of shifts and adds. Because
one bit to the left represents a number twice as large in base 2, shifting the bits left has
the same effect as multiplying by a power of 2. Almost every compiler will perform the
strength reduction optimization of substituting a left shift for a multiply by a power of 2.
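For instance, a hypothetical compiler could replace x * 8 with x << 3; a tiny C check of the equivalence:

#include <stdio.h>

int main(void) {
    int x = 5;
    /* Strength reduction: multiplying by a power of 2 equals a left shift. */
    printf("%d %d\n", x * 8, x << 3);   /* both print 40 */
    return 0;
}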

EXAMPLE FOR A MULTIPLY ALGORITHM

Using 4-bit numbers to save space, multiply 2ten × 3ten, or 0010two × 0011two.

The value of each register for each of the steps is given in the above figure. The final
value is 0000 0110two, or 6ten.

SIGNED MULTIPLICATION

In signed multiplication, convert the multiplier and multiplicand to positive
numbers and remember the original signs.
The algorithm should then be run for 31 iterations, leaving the signs out of the
calculation.
The shifting steps would need to extend the sign of the product for signed
numbers. When the algorithm completes, the lower word would have the 32-bit
product.

FASTER MULTIPLICATION

Faster multiplications are possible by essentially providing one 32-bit adder for
each bit of the multiplier: one input is the multiplicand ANDed with a multiplier
bit, and the other is the output of a prior adder.

Connect the outputs of adders on the right to the inputs of adders on the left,
making a stack of adders 32 high.

The above figure shows an alternative way to organize the 32 additions in a parallel
tree. Instead of waiting for 32 add times, we wait just log2(32), or five, 32-bit
add times.
Multiply can go even faster than five add times because of the use of carry save
adders.
It is easy to pipeline such a design to be able to support many multiplies
simultaneously.

MULTIPLY IN MIPS
MIPS provides a separate pair of 32-bit registers to contain the 64-bit product, called Hi
and Lo. To produce a properly signed or unsigned product, MIPS has two instructions:
multiply (mult) and multiply unsigned (multu). To fetch the integer 32-bit product, the
programmer uses move from lo (mflo). The MIPS assembler generates a
pseudoinstruction for multiply that specifies three general-purpose registers, generating
mflo and mfhi instructions to place the product into registers.

DIVISION
The reciprocal operation of multiply is divide, an operation that is even less frequent and
even more quirky. It even offers the opportunity to perform a mathematically invalid
operation: dividing by 0.
Let's start with an example of long division using decimal numbers to recall the names of
the operands and the grammar school division algorithm. For reasons similar to those in
the previous section, we limit the decimal digits to just 0 or 1.

EXAMPLE
The example is dividing 1,001,010ten by 1000ten:

Divide's two operands, called the dividend and divisor, and the result, called the
quotient, are accompanied by a second result, called the remainder. Here is another way
to express the relationship between the components:
Dividend = Quotient × Divisor + Remainder
where the remainder is smaller than the divisor. Infrequently, programs use the divide
instruction just to get the remainder, ignoring the quotient. The basic grammar school
division algorithm tries to see how big a number can be subtracted, creating a digit of the
quotient on each attempt. Our carefully selected decimal example uses only the numbers
0 and 1, so it's easy to figure out how many times the divisor goes into the portion of the
dividend: it's either 0 times or 1 time. Binary numbers contain only 0 or 1, so binary
division is restricted to these two choices, thereby simplifying binary division.

Let's assume that both the dividend and the divisor are positive and hence the quotient
and the remainder are nonnegative. The division operands and both results are 32-bit
values, and we will ignore the sign for now.

A DIVISION ALGORITHM AND HARDWARE

The above figure shows hardware to mimic our grammar school algorithm. We start with
the 32-bit Quotient register set to 0. Each iteration of the algorithm needs to move the
divisor to the right one digit, so we start with the divisor placed in the left half of the
64-bit Divisor register and shift it right 1 bit each step to align it with the dividend. The
Remainder register is initialized with the dividend.

THREE STEPS OF THE FIRST DIVISION ALGORITHM.

The above figure shows three steps of the first division algorithm. Unlike a human, the
computer isn't smart enough to know in advance whether the divisor is smaller than the
dividend. It must first subtract the divisor in step 1; remember that this is how we
performed the comparison in the set on less than instruction. If the result is positive, the
divisor was smaller or equal to the dividend, so we generate a 1 in the quotient (step 2a).
If the result is negative, the next step is to restore the original value by adding the divisor
back to the remainder and generate a 0 in the quotient (step 2b). The divisor is shifted
right and then we iterate again. The remainder and quotient will be found in their
namesake registers after the iterations are complete.
EXAMPLE
A DIVIDE ALGORITHM
Using a 4-bit version of the algorithm to save pages, let's try dividing 7ten by 2ten, or
0000 0111two by 0010two.

The above figure shows the value of each register for each of the steps, with the quotient
being 3ten and the remainder 1ten. Notice that the test in step 2 of whether the remainder
is positive or negative simply tests whether the sign bit of the Remainder register is a 0 or

1. The surprising requirement of this algorithm is that it takes n + 1 steps to get the proper
quotient and remainder.
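A C sketch of this first (restoring) division algorithm, following the register layout described above; the compare-then-subtract used here is the software equivalent of subtracting and then restoring on a negative result (names are illustrative):

#include <stdint.h>
#include <stdio.h>

/* The 64-bit Divisor register starts with the divisor in its left half and
 * shifts right each step; the Remainder register starts with the dividend.
 * The quotient bit is 1 whenever the subtraction would not go negative. */
static void divide(uint32_t dividend, uint32_t divisor,
                   uint32_t *quotient, uint32_t *remainder) {
    uint64_t rem = dividend;                 /* Remainder register */
    uint64_t div = (uint64_t)divisor << 32;  /* Divisor register, left half */
    uint32_t quo = 0;

    for (int step = 0; step < 33; step++) {  /* n + 1 iterations */
        quo <<= 1;
        if (rem >= div) {                    /* subtract succeeded: quotient bit 1 */
            rem -= div;
            quo |= 1;
        }                                    /* otherwise "restore": quotient bit 0 */
        div >>= 1;                           /* shift the divisor right */
    }
    *quotient = quo;
    *remainder = (uint32_t)rem;
}

int main(void) {
    uint32_t q, r;
    divide(7, 2, &q, &r);
    printf("quotient = %u, remainder = %u\n", q, r);   /* 3 and 1 */
    return 0;
}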
This algorithm and hardware can be refined to be faster and cheaper. The speedup comes
from shifting the operands and the quotient simultaneously with the subtraction. This
refinement halves the width of the adder and registers by noticing where there are unused
portions of registers and adders.
The following figure shows the revised hardware.

SIGNED DIVISION
The one complication of signed division is that we must also set the sign of the
remainder. Remember that the following equation must always hold:
Dividend = Quotient × Divisor + Remainder
To understand how to set the sign of the remainder, let's look at the example of dividing
all the combinations of ±7ten by ±2ten.
The first case is easy:
+7 ÷ +2: Quotient = +3, Remainder = +1

Checking the results:

7 = 3 × 2 + (+1) = 6 + 1
If we change the sign of the dividend, the quotient must change as well:
-7 ÷ +2: Quotient = -3
Rewriting our basic formula to calculate the remainder:
Remainder = Dividend - Quotient × Divisor = -7 - (-3 × +2) = -7 - (-6) = -1
So,
-7 ÷ +2: Quotient = -3, Remainder = -1
Checking the results again:
-7 = -3 × 2 + (-1) = -6 - 1
The reason the answer isn't a quotient of -4 and a remainder of +1, which would also fit
this formula, is that the absolute value of the quotient would then change depending on
the sign of the dividend and the divisor! Clearly, if
-(x ÷ y) ≠ (-x) ÷ y
programming would be an even greater challenge. This anomalous behavior is avoided by
following the rule that the dividend and remainder must have the same signs, no matter
what the signs of the divisor and quotient. We calculate the other combinations by
following the same rule:
+7 ÷ -2: Quotient = -3, Remainder = +1
-7 ÷ -2: Quotient = +3, Remainder = -1
Thus the correctly signed division algorithm negates the quotient if the signs of the
operands are opposite and makes the sign of the nonzero remainder match the dividend.
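C's integer division and remainder operators (since C99) follow the same convention, truncating the quotient toward zero and giving the remainder the sign of the dividend, so the four cases can be checked directly:

#include <stdio.h>

/* The sign rule above: the quotient is negated if the operand signs differ,
 * and a nonzero remainder takes the sign of the dividend. */
int main(void) {
    int cases[4][2] = { {7, 2}, {-7, 2}, {7, -2}, {-7, -2} };
    for (int i = 0; i < 4; i++) {
        int x = cases[i][0], y = cases[i][1];
        printf("%+d / %+d: quotient = %+d, remainder = %+d\n",
               x, y, x / y, x % y);
    }
    return 0;
}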
FASTER DIVISION
We used many adders to speed up multiply, but we cannot do the same trick for divide.
The reason is that we need to know the sign of the difference before we can perform the
next step of the algorithm, whereas with multiply we could calculate the 32 partial
products immediately.
There are techniques to produce more than one bit of the quotient per step. The SRT
division technique tries to guess several quotient bits per step, using a table lookup based

on the upper bits of the dividend and remainder. It relies on subsequent steps to correct
wrong guesses. A typical value today is 4 bits. The key is guessing the value to subtract.
With binary division, there is only a single choice.
These algorithms use 6 bits from the remainder and 4 bits from the divisor to index a
table that determines the guess for each step. The accuracy of this fast method depends on
having proper values in the lookup table.

SUBWORD PARALLELLISM
A subword is a lower-precision unit of data contained within a word. In subword
parallelism, multiple subwords are packed into a word and the whole word is then
processed. With appropriate subword boundaries, this technique results in parallel
processing of subwords. Since the same instruction is applied to all subwords within the
word, this is a form of SIMD (Single Instruction, Multiple Data) processing.
It is possible to apply subword parallelism to noncontiguous subwords of different sizes
within a word. In practice the implementation is simplest if the subwords are the same
size and contiguous within a word. The data-parallel programs that benefit from subword
parallelism tend to process data that are of the same size.
For example, if the word size is 64 bits, the subword sizes can be 8, 16, and 32 bits.
An instruction then operates on eight 8-bit subwords, four 16-bit subwords, two 32-bit
subwords, or one 64-bit word in parallel.

Subword parallelism is an efficient and flexible solution for media processing
because the algorithms exhibit a great deal of data parallelism on lower-precision data.
It is also useful for computations unrelated to multimedia that exhibit data
parallelism on lower-precision data.
Graphics and audio applications can take advantage of performing simultaneous
operations on short vectors.
Example: a 128-bit adder can perform:
  Sixteen 8-bit adds
  Eight 16-bit adds
  Four 32-bit adds

This is also called data-level parallelism, vector parallelism, or Single Instruction, Multiple Data
(SIMD).
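A software sketch of the idea in C: eight 8-bit subwords packed into one 64-bit word are added with a single 64-bit operation, with masking to keep carries from crossing subword boundaries (this emulates in software what SIMD hardware does directly; the helper name is illustrative):

#include <stdint.h>
#include <stdio.h>

/* Add eight 8-bit subwords packed in a 64-bit word with one 64-bit add.
 * Each lane wraps modulo 256 and carries never cross lane boundaries. */
static uint64_t add8x8(uint64_t a, uint64_t b) {
    const uint64_t H = 0x8080808080808080ULL;      /* high bit of each byte */
    return ((a & ~H) + (b & ~H)) ^ ((a ^ b) & H);
}

int main(void) {
    uint64_t a = 0x0102030405060708ULL;
    uint64_t b = 0x10FF10FF10FF10FFULL;
    /* each byte of a is added to the corresponding byte of b independently */
    printf("%016llx\n", (unsigned long long)add8x8(a, b));
    return 0;
}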
