
Chapter 1

2. How does a high-level language influence the processor architecture?

A high-level language influences the processor by requiring the architecture to translate the language's
constructs efficiently into instructions that the computer can execute.

3. Answer true or false, with justification: “The compiler writer is intimately aware of the details of the
process implementation.”

False. The compiler writer must know the details of the ISA, but most of the processor implementation
is of no use to the compiler writer.

4. Explain the levels of abstractions found inside the computer, from the silicon substrate to a complex
multiplayer video game.

_The semiconductor materials possess electrical properties that enable them to be used to make
transistors which are switches.

_The transistors can be connected to implement logic gates, which are simple circuits that allow logic
functions such as AND, OR, and NOT to be realized.

_Logic gates can be connected to form functional units, which execute operations such as decoding an
n-bit binary number to select one of n outputs, or adding two n-bit binary numbers together.

_The logic elements and logic gates can be used to build devices such as memories or state machines
that can be used to control logic circuits. The elements are joined to create a data path and control
system which is the processor.

_The instruction set tells the processor what to execute, within its capability limits (e.g., add two
numbers together, fetch something from memory, etc.)

_The compiler is developed to translate a computer program written in the high-level language into
instructions drawn from the instruction set.

_Computer programs written in a high-level language are connected using networking and
communication technology, enabling people to interact and play games.

5. Answer true or false, with justification: “The internal hardware organization of a computer system
varies dramatically, depending on the specifics of the system.”
Both true and false, depending on interpretation. Details of the transistors can change based on
speed vs. power consumption tradeoffs; yet every computer needs the ability to add, and the essential
circuitry for that will be similar. At the same time, the organization of logic elements and the design of
the datapath and control system might be entirely different between a graphics processor and one used
to control a hearing aid.

Chapter 2

2. Distinguish between the frame pointer and the stack pointer.

A frame pointer acts as a reference point: the addresses of local variables are determined as offsets
from the value in the frame pointer register. A stack pointer, on the other hand, is a value in a
processor register that holds the memory address of the top of the stack. In most cases the stack grows
from top to bottom, so any time a push is done, the stack pointer value decreases.
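A push can be sketched in LC-2200-style assembly (a minimal illustration; the register names $sp and $t0, and a word-addressed, downward-growing stack, are assumptions):

```
ADDI $sp, $sp, -1     ; grow the stack: move the stack pointer down one word
SW   $t0, 0($sp)      ; store the pushed value at the new top of the stack
```

A pop reverses the two steps: load from 0($sp), then add 1 back to $sp.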

3. In the LC-2200 architecture, where are operands normally found for an add instruction?

The operands are located in the registers.

5. An ISA may support different flavors of conditional branch instructions, such as BZ (branch on zero),
BN (branch on negative), and BEQ (branch on equal). Figure out the predicate expressions in an if
statement that may be best served by these different flavors of conditional branch instructions. Give
examples of such predicates in an if statement and how you would compile them, using these different
flavors of branch instructions.

BZ: if (a == 0)

BEQ: if (a == b)

BN: if (a < 0) or if (a < b) which becomes if (a-b < 0)

BP: if (a > 0) or if (a>b) which becomes if (a-b > 0)
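As an illustrative sketch of how such a predicate might compile (assuming a hypothetical ISA that provides SUB and BN, with a in $s0 and b in $s1 — these are not actual LC-2200 instructions):

```
      SUB $t0, $s0, $s1      ; $t0 = a - b
      BN  $t0, then          ; a - b < 0 means a < b: enter the if-body
      BEQ $zero, $zero, done ; otherwise skip the if-body
then: ...                    ; body of if (a < b)
done: ...
```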


8. Procedure A has important data in both S and T registers and is about to call procedure B. Which
registers should B store on the stack?

Procedure A saves the T registers before calling procedure B. B saves any S registers it uses OR saves any
T registers it needs before calling another function.

14. What is an ISA and why is it important?

The ISA, or Instruction Set Architecture, defines the machine code that a processor reads and acts
upon, including the word size, memory addressing modes, processor registers, and data types.

15. What are the influences on instruction set design?

Instruction sets are influenced by ease of implementation, efficiency, and ease of programming. More
extensive instruction sets make writing compilers easier, but at the detriment of speed or ease of
implementation.

22. Convert the statement g = h + A[i]

Into an LC-2200 assembler, with the assumption that address of A is located in $t0, g is in $s1, h is in $s2,
and i is in $t1

ADD $t2, $t0, $t1; Calculates the address of A[i]

LW $t3, 0($t2); Load the contents of A[i] into a register

ADD $s1, $s2, $t3; Assign the sum to g

Chapter 3
1. What is the difference between level triggered logic and edge triggered logic? Which do we use?
Why?

With level-triggered logic, register contents change from their current state to the new state while the
clock signal is high, whereas with edge-triggered logic the change to the register contents happens on
the rising or falling clock edge. The rising clock edge gives positive edge-triggered logic and the falling
edge gives negative edge-triggered logic.

We use positive edge triggered logic.

Edge triggered logic is a method which avoids certain instability problems which are found in level
triggered circuits.

2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12-a and Table
3.1) implement the sequential logic circuit for the garage door opener.

(Hint: The sequential logic circuit has two states and produces three outputs, namely, next state, up
motor control and down motor control).

6. What are the advantages and disadvantages of a bus-based data path design?

The advantage of a bus-based datapath design is that data signals are available to every piece of
hardware in the circuit, so there is no worry about routing signals to multiple devices. The disadvantage
of this design is the limit on how many signals can be sent per clock cycle. For instance, in a
single-bus design only one signal can be sent out each clock cycle, which makes the datapath function
less efficiently. Also, there are cost and space problems that arise from having so many wires.

9. The Instruction Fetch is implemented in the text with first four states and then three. What would
have to be done to the datapath to make it two states long?

To implement the Instruction Fetch in two states, a second bus would have to be incorporated into the
datapath. It could either carry MEM[MAR] to the IR in the first state or carry A+1 to the PC in the
second state, thereby combining two of the present states into one and eliminating one state overall.

10. How many words of memory will this code snippet require when assembled? Is space allocated for
“L1”?

    beq $s3, $s4, L1
    add $s0, $s1, $s2
L1: sub $s0, $s0, $s3

This code snippet, when assembled, would require 3 words, 1 per instruction. No space is allocated for
L1. Instead, when the code is assembled, the L1 reference made in the first beq is replaced with the line
number of the instruction it refers to, which in this case would be line 2, rendering the L1 label no longer
useful.

11. What is the advantage of fixed length instructions?

One advantage is that fixed-length instructions ease and simplify instruction pipelining, which enables
single-clock throughput at high frequencies. Variable-length instructions make it difficult to decouple
memory fetches: the processor must fetch part of the instruction, then decide whether to fetch more,
possibly missing in the cache before the instruction is complete. Fixed-length instructions allow the full
instruction to be fetched in one access, increasing speed and efficiency.

12. What is a leaf procedure?

A leaf procedure refers to a procedure that never calls any other procedure.

14. Suppose you are writing an LC-2200 program and you want to jump a distance that is farther than
allowed by a BEQ instruction. Is there a way to jump to the address?

If you are writing an LC-2200 program and you want to jump a distance that is farther than allowed by a
BEQ instruction, you can jump using a JALR instruction. This instruction stores PC+1 into Reg Y, where
PC is the address of the current JALR instruction. Then it branches to the address currently in Reg X,
which can be any address you want to go to, at any distance. You can also use JALR to perform an
unconditional jump by discarding the value stored in Reg Y after the JALR instruction executes.

15. Could the LC series processors be made to run faster if it had a second bus? If your answer was no,
what else would you need?

Another ALU

An additional mux

A DPRF (dual-ported register file)

The LC-series processors could not be made to run faster with a second bus alone. However, if the
processor also had a DPRF (dual-ported register file) instead of the temporary A and B registers, it could
run faster. Using a DPRF to supply operands to the ALU lets both buses drive values into the DPRF at the
same time, speeding up ALU operations and therefore the overall performance of the processor.

Chapter 4

1. Upon an interrupt what has to happen implicitly in hardware before control is transferred to the
interrupt handler?

The hardware must implicitly save the program counter value before control is transferred to the
handler. The hardware must also determine the address of the handler in order to transfer control from
the currently executing program to the handler. Depending on the architecture, interrupts may also be
disabled.

3. Put the following steps in the correct order

Save $k0 on stack

Enable interrupt

Save state

Actual work of the handler

Restore state

Disable interrupt

Restore $k0 from stack

Return from interrupt

4. How does the processor know which device has requested an interrupt?

Initially the processor has no knowledge. However, once the processor acknowledges the interrupt, the
information will be supplied to the processor as to its identity by the interrupting device. For instance,
the device might supply the address of its interrupt handler or it might supply a vector into a table of
interrupt handler addresses.

6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.

___X___ disable interrupts;

___X___ save PC;

_______ save $k0;

_______ enable interrupts;

_______ save processor registers;

_______ execute device code;

_______ restore processor registers;

_______ disable interrupts;

_______ restore $k0;

___X___ disable interrupts;

___X___ restore PC;

___X___ enable interrupts;

_______ return from interrupt;

7. In the following actions in the INT macro state select the ONES THAT DO NOT BELONG.

___X__ save PC;

___X__ save SP;

______ $k0←PC;

___X__ enable interrupts;

___X__ save processor registers;

______ ACK INT by asserting INTA;

______ Receive interrupt vector from device on the data bus;

______ Retrieve PC address from the interrupt vector table;


___X__ Retrieve SP value from the interrupt vector table;

___X__ disable interrupts

___X__ PC←PC retrieved from the vector table;

______ SP←SP value retrieved from the vector table;

Homework chapter 5

1. True or false: For a given workload and a given instruction-set architecture, reducing the CPI (clocks
per instruction) of all the instructions will always improve the performance of the processor?

False. Processor execution time depends on the instruction count, the average CPI, and the clock cycle
time. If decreasing the average CPI requires lengthening the clock cycle time, performance may not
improve and may even decrease.

3. What would be the execution time for a program containing 2,000,000 instructions if the processor
clock was running at 8 MHz and each instruction takes 4 clock cycles?

Exec. time = n * average CPI * clock cycle time
Exec. time = 2,000,000 * 4 * (1/8,000,000)
Exec. time = 1 sec
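The calculation above can be sketched in Python (variable names are illustrative):

```python
# Execution time = instruction count x average CPI x clock cycle time.
# Values from the problem: 2,000,000 instructions, CPI = 4, clock = 8 MHz.
n_instructions = 2_000_000
cpi = 4
clock_hz = 8_000_000          # 8 MHz; cycle time is 1 / clock_hz seconds

# Multiplying by the cycle time is the same as dividing by the frequency.
exec_time = n_instructions * cpi / clock_hz
print(exec_time)              # 1.0 second
```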

6. Given the CPI of various instruction classes

Class CPI

R-type 2

I-type 10

J-type 3

S-type 4
And instruction frequency as shown:

Class Program 1 Program 2

R 3 10

I 3 1

J 5 2

S 2 3

Which code will execute faster and why?

Program 1 Total CPI = 3 * 2 + 3 * 10 + 5 * 3 + 2 * 4

=6 + 30 + 15 + 8

= 59

Program 2 Total CPI = 10 * 2 + 1 * 10 + 2 * 3 + 3 * 4

= 20 + 10 +6 + 12

= 48

Program 2 will have a faster execution since the total number of cycles it will take to execute is less than
the total number of cycles of Program 1.
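The weighted cycle counts can be checked with a short Python sketch using the per-class CPIs and frequencies from the tables above:

```python
# Cycles per instruction for each instruction class (from the CPI table).
cpi = {"R": 2, "I": 10, "J": 3, "S": 4}

# Instruction frequencies for the two programs (from the frequency table).
program1 = {"R": 3, "I": 3, "J": 5, "S": 2}
program2 = {"R": 10, "I": 1, "J": 2, "S": 3}

def total_cycles(freq):
    """Sum of (frequency x CPI) over all instruction classes."""
    return sum(freq[c] * cpi[c] for c in cpi)

print(total_cycles(program1))  # 59
print(total_cycles(program2))  # 48 -> Program 2 executes faster
```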

7. What is the difference between static and dynamic instruction frequency?

Static instruction frequency measures how often instructions appear in the compiled code. Dynamic
instruction frequency measures how often instructions are executed while a program is running.
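A toy illustration of the distinction (the listing and trace here are hypothetical, not from the book):

```python
# Static frequency: count occurrences of an opcode in the compiled listing.
listing = ["lw", "add", "beq", "add"]          # hypothetical compiled code
static_add = listing.count("add")              # "add" appears twice

# Dynamic frequency: count executions in the run-time trace. A loop body
# appears once statically but executes many times dynamically.
trace = ["lw"] + ["add", "beq", "add"] * 10    # hypothetical 10-iteration run
dynamic_add = trace.count("add")

print(static_add, dynamic_add)  # 2 20
```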

9. Compare and contrast structural, data and control hazards. How are the potential negative effects on
pipeline performance mitigated?

Type of hazard — reason for potential hazard — stalls — fix:

Structural: hardware limitations; 0, 1, 2, or 3 stalls; fix: feedback lines, or live with it.

Data (RAW): an instruction reads a register value before it has been updated by the pipeline; fix:
data forwarding, or a NOP (for the LW instruction).

Data (WAR): an instruction writes a new value to a register while another instruction is still reading
the old value, before the pipeline updates it; no problem, since the data was already copied into the
pipeline buffer; fix: N/A.

Data (WAW): instructions write a new value into a register that was previously written to; no problem,
since the old value was probably useless anyway; fix: N/A.

Control: breaks in the sequential execution of a program because of branch instructions; 1 or 2 stalls;
fix: branch prediction or delayed branch.

Notes:

a) Feedback lines are useful in the hardware implementation of the pipeline. They tell the previous
stage of the pipeline that the current instruction is still being processed, so that stage should not send
any more work. They also send a NOP (a dummy operation) to the next stage of the pipeline until the
current instruction is ready to proceed to the next stage.

b) Data forwarding: the EX stage forwards the new value of the written register to the ID/RR stage so
that it can use the right value of the updated register.

c) Branch Prediction: the idea here is to predict the outcome of the branch and let the instructions
flow along the pipeline based on this prediction. If the prediction is correct, it completely gets rid of any
stalls. If the prediction is not correct, it will create two stalls.
d) Delayed branch: the idea here is to find a useful instruction that we can feed the pipeline with,
while we test the branch instruction.

10. How can a read after write (RAW) hazard be minimized or eliminated?

RAW hazards are eliminated by adding data forwarding to the pipeline. This involves adding
mechanisms to send data being written to a register back to earlier stages of the pipeline that want to
read the same register.

14. Regardless of whether we use a conservative approach or branch prediction (branch not taken),
explain why there is always a 2-cycle delay if the branch is taken (i.e., 2 NOPs injected into the pipeline)
before normal execution can resume in the 5-stage pipeline used in Section 5.13.3.

The processor does not recognize whether the branch is taken until the BEQ instruction is in the
execution stage. Since there is no definite mechanism to load the instruction from the new PC after the
branch is taken, new instructions can only get loaded after the BEQ instruction has exited the execution
stage, which will occur 2 cycles after it has been loaded.

18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:

a. Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ instruction of LC-
2200.

IF stage (cycle 1):

I-MEM [PC] -> FBUF;

PC + 1 -> PC;

ID/RR stage (cycle 2):

DPRF [FBUF [Rx]] -> DBUF [A];

DPRF [FBUF [Ry]] -> DBUF [B];

FBUF [OPCODE] -> DBUF [OPCODE];

EXT [FBUF [OFFSET]] -> DBUF [OFFSET];

EX stage (cycle 3):

DBUF [OPCODE] -> EBUF [OPCODE];

IF LDZ [DBUF [A] - DBUF [B]] == 1 : PC + DBUF [OFFSET] -> PC;

MEM stage (cycle 4):

EBUF -> MBUF;

WB stage (cycle 5):

no action (BEQ writes no register, so this stage is a NOP).

22. Consider

I1: R1 <- R2 + R3

I2: R4 <- R1 + R5

If I2 immediately follows I1 in the pipeline with no forwarding, how many bubbles (i.e., NOPs) will
result in the execution? Explain your answer.

Three bubbles will appear during execution because when I2 is in the ID/RR stage, it must wait until
I1 has left the WB stage since I2 needs to decode R1, but I1 is writing to R1.
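The stall pattern can be visualized cycle by cycle (a sketch assuming the write to R1 must complete in WB before I2's ID/RR stage can read it):

```
cycle:  1    2    3    4    5    6    7    8    9
I1:     IF   ID   EX   MEM  WB
I2:          IF   NOP  NOP  NOP  ID   EX   MEM  WB
```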

Homework Chapter 6

1. Compare and contrast process and program.

A process is a program in execution. A program is static, has no state, and has a fixed size on disk,
whereas a process is dynamic, exists in memory, may grow or shrink, and has associated with it “state”
that represents the information associated with the execution of the program.

7. Which scheduling algorithm was noted as having a high variance in turnaround time?
First-Come First-Served (FCFS)

8.Which scheduling algorithm is provably optimal (minimum average wait time)?

Shortest Job First (SJF)

9.Which scheduling algorithm must be preemptive?

Round Robin

10.Given the following processes that arrived in the order shown

CPU Burst Time IO Burst Time

P1 3 2

P2 4 3

P3 8 4

Show the processor activities and the I/O area using the FCFS, SJF, and Round Robin algorithms.

Assuming each process requires a CPU burst followed by an I/O burst followed by a final CPU burst (as in
Example 1 in Section 6.6):

FCFS

SJF
Round Robin (assumed timeslice = 2)

Chapter 1 Solutions

1. Consider the Google Earth application. You launch the application, move the mouse on the earth's
surface, and click on Mount Everest to see an up-close view of the mountain range. Identify the
interactions, in layman's terms, between the operating system and the hardware during the above
sequence of actions.

Launching the application causes the operating system to start a process running Google Earth. Each
action the user performs either alters the state of the program and requests that the operating system
perform output, or alters the state of the program and requests that the operating system send a
message to Google Earth asking for additional information.

2. How does a high-level language influence the processor architecture?

The processor architecture is designed to enable efficient, cost-effective conversion of the high-level
language constructs into instructions that the machine can execute.

3. Answer True or False with justification: “The compiler writer is intimately aware of the details of the
processor implementation.”

False: The compiler writer must know some details, namely the instruction set architecture, but many
details of the processor implementation are of no use or interest to the compiler writer.

4. Explain the levels of abstractions found inside the computer from the silicon substrate to a complex
multi-player video game.
The semiconductor materials possess electrical properties that enable them to be used to make
transistors which are switches.

The transistors can be connected to implement logic gates, which are simple circuits that allow logic
functions such as AND, OR, and NOT to be realized.

Logic gates can be connected to form functional units, which execute operations such as decoding an n-
bit binary number to select one of n outputs, or adding two n-bit binary numbers together.

The logic elements and logic gates can be used to build devices such as memories or state machines that
can be used to control logic circuits. The elements are joined to create a data path and control system
which is the processor.

The instruction set tells the processor what to execute, within its capability limits (e.g., add two
numbers together, fetch something from memory, etc.)

The compiler is developed to translate a computer program written in a high-level language into
instructions drawn from the instruction set.

Computer programs written in a high-level language are connected using networking and
communication technology, enabling people to interact and play games.

5. Answer True or False with justification: “The internal hardware organization of a computer system
varies dramatically depending on the specifics of the system.”

Both true and false, depending on interpretation. Details of the transistors can change based on
speed vs. power consumption tradeoffs; yet every computer needs the ability to add, and the essential
circuitry for that will be similar. At the same time, the organization of logic elements and the design of
the datapath and control system might be quite different between a graphics processor and one used to
control a hearing aid.

6. What is the role of a “bridge” between computer buses as shown in Figure 1.8?

It acts as a kind of translator/communications path between the two buses, which may not share the
same operational protocols.

7. What is the role of a “controller” in Figure 1.8?


The controller's registers appear to the computer to be memory locations, but in reality they are
control registers for the particular I/O device being controlled. The controller takes information
supplied by the processor and converts it into the appropriate control signals for the I/O device, and it
retrieves information from the device and sets bits in control registers so the processor can receive it.

8. Using the Internet, research and explain five major milestones in the evolution of computer
hardware.

Examples:

Vacuum tubes to transistors

Integrated circuits

Disk drives

Display technology (from paper to glass)

Networking

9. Using the Internet, research and explain five major milestones in the evolution of the operating
system.

Examples:

Multiprogramming

Scheduling

Time-sharing

GUI Interface

Parallel operating systems

Error recovery
10. Compare and contrast grid computing and the power grid. Explain how the analogy makes sense.
Also, explain how the analogy breaks down.

In both cases there is an interconnected network of devices serving some useful purpose. The
generating systems can be thought of as powerful resources supplying power to the industrial and
residential consumers of electricity. In grid computing, information flow is more two-way (or
even n-way). There are differences in how things are paid for: in the electric grid the consumers pay
the producers, whereas in grid computing additional revenue streams may be provided by advertisers
or others wishing to use information generated by the grid. In the power grid there are relatively few
suppliers compared to a vast number of consumers. In grid computing there would perhaps still be
more consumers than producers, but far more producers than in the case of the power grid.

11. Match the left and right-hand sides.

Unix operating system Thompson and Ritchie

Microchip Kilby and Noyce

FORTRAN programming language Backus

C programming language Ritchie

Transistor Bardeen, Brattain, and Shockley

World’s first programmer Lovelace

World’s first computing machine Babbage

Vacuum Tube DeForest

ENIAC Mauchly and Eckert

Linux operating system Torvalds

Chapter 2 Processor Architecture…Solutions


1. Availability of large register-file is detrimental to the performance of a processor since it results in a
substantial overhead for procedure call/return in high-level languages. Do you agree or disagree? Give
supporting arguments.

Disagree.

By judicious use of calling conventions defining saved and temporary registers, the call/return overhead
can be managed to any preferred level of performance.

2. Distinguish between the frame pointer and the stack pointer.

A frame pointer acts as a reference point: the addresses of local variables are determined as offsets
from the value in the frame pointer register. A stack pointer, on the other hand, is a value in a
processor register that holds the memory address of the top of the stack. In most cases the stack grows
from top to bottom, so any time a push is done, the stack pointer value decreases.

3. In the LC-2200 architecture, where are operands normally found for an add instruction?

The operands are in registers.

4. Endianness: Let’s say you want to write a program for comparing two strings. You have a choice of
using a 32-bit byte-addressable Big-endian or Little-endian architecture to do this. In either case, you
can pack 4 characters in each word of 32-bits. Which one would you choose and how will you write such
a program? [Hint: Normally, you would do string comparison one character at a time. If you can do it a
word at a time instead of a character at a time, that implementation will be faster.]

The choice of endianness does not matter for equality comparison, so long as it is consistent: subtract
each word of string A from the corresponding word of string B; if every result is zero, the strings are
identical.
5. ISA may support different flavors of conditional branch instructions such as BZ (branch on Zero), BN
(branch on negative), and BEQ (branch on equal). Figure out the predicate expressions in an “if”
statement that may be best served by these different flavors of conditional branch instructions. Give
examples of such predicates in “if” statement and how you will compile them using these different
flavors of branch instructions.

BZ: if (a == 0)

BEQ: if (a == b)

BN: if (a < 0) or if (a < b) which changes to if (a-b < 0)

BP: if (a > 0) or if (a>b) which changes to if (a-b > 0)

6. We said that endianness would not affect your program performance or correctness so long as the
use of a (high level) data structure is commensurate with its declaration. Are there situations where
even if your program does not violate the above rule, you could be bitten by the endianness of the
architecture? [Hint: Think of programs that cross network boundaries.]

Yes. If data from a big-endian computer were transferred over a network to a little-endian computer,
data corruption could be experienced. However, this problem has been solved on the Internet and on
networks using similar technology: the solution is for the network to use a standard endianness. If the
endianness of the network differs from that of the host computer, the host's network interface
performs the necessary conversion.

7. Work out the details of the implementing the switch statement using jump tables in assembly using
any flavor of conditional branch instruction. [Hint: After ensuring that the value of the switch variable is
within the bounds of valid case values, jump to the start of the appropriate code segment corresponding
to the current switch value, execute the code and finally jump to exit.]

The first step is to check the switch variable against the valid bounds.

If the variable is within the intended bounds, index into the jump table based on the switch value. For
example, if the value selects the second case, you would jump to the second location listed in the
jump table.

Execute the selected case's code.

Jump to the exit address (e.g., via JALR).

Continue with the original code.
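A hedged sketch of the dispatch in LC-2200-style assembly (the register assignments, table layout, and bounds-check details are illustrative assumptions, not the book's code):

```
      ; assume: switch value in $t0, jump table base address in $t1
      ; bounds check (sketch): branch to the default case if $t0 is out of range
      ADD  $t2, $t0, $t1   ; $t2 = address of this case's entry in the table
      LW   $t2, 0($t2)     ; load the target address stored in that entry
      JALR $t2, $zero      ; jump to the case's code, discarding the link
      ; each case body ends with an unconditional jump to the exit label
```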

8. Procedure A has important data in both S and T registers and is about to call procedure B. Which
registers should A store on the stack? Which registers should B store on the stack?

Procedure A should save the T registers before it calls procedure B, while B should save any S registers
it uses, and any T registers it needs, before calling another function.

9. Consider the usage of the stack abstraction in executing procedure calls. Do all actions on the stack
happen only via pushes and pops onto and from the top of the stack? Explain circumstances that
warrant reaching into other parts of the stack during program execution. How is this accomplished?

No. The amount of memory included in the stack at any given time is controlled simply by changing the
stack pointer value. Values may then be read or written at locations specified as offsets from the
address stored in the stack pointer (or frame pointer).

10. Answer True/False with justification: Procedure call/return cannot be implemented without a
frame pointer.

False. The frame pointer is not strictly necessary to implement procedure call/return, but it can make the code simpler.

11. DEC VAX had a single instruction for loading and storing all the program visible registers from/to
memory. Can you see a reason for such an instruction pair? Consider both the pros and cons.

Pros: If you are a caller and need to save all of the saved and temporary registers, you can perform the
operation in one instruction. If you want to save the current state of execution, it can be done in one
instruction.
Cons: In most cases you do not need all available registers, so this instruction uses more memory (and
possibly time) than required.

12. Show how you can simulate a subtract instruction using the existing LC-2200 ISA?

Since our system uses 2's complement, the negative value of a number X is NOT X plus 1. The LC-2200
does not have a NOT instruction, but NAND of a value with itself serves the same function.

B <- B NAND B ; NOT B

B <- B + 1 ; B now holds -B

A <- A + B ; A + (-B), net result is A - B
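The sequence can be verified with a Python sketch that mirrors the three instructions above (8-bit registers are an assumption for the illustration):

```python
# Two's-complement subtraction built from NAND, add, and add-immediate,
# mirroring the instruction sequence above, with 8-bit registers assumed.
MASK = 0xFF

def nand(x, y):
    return ~(x & y) & MASK

def subtract(a, b):
    not_b = nand(b, b)            # NAND(b, b) == NOT b
    neg_b = (not_b + 1) & MASK    # NOT b + 1 == -b in two's complement
    return (a + neg_b) & MASK     # a + (-b) == a - b

print(subtract(7, 3))   # 4
print(subtract(3, 7))   # 252 (== -4 modulo 256)
```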

13. The BEQ instruction restricts the distance you can branch to from the current position of the PC. If
your program warrants jumping to a distance larger than that allowed by the offset field of the BEQ
instruction, show how you can accomplish such “long” branches using the existing LC-2200 ISA.

      BEQ $s0, $s1, near     ; if equal, go take the long jump

      BEQ $zero, $zero, skip ; otherwise skip over it

near: JALR $s2, $zero        ; jump to the faraway address in $s2

skip: ...

Note: Assume the address of the location that is a "long" way away is in $s2

14. What is an ISA and why is it important?


The ISA (Instruction Set Architecture) serves as a kind of contractual document that lets all parties
concerned with the design, implementation, and use of a processor know what is expected of them and
what resources the processor will provide.

As soon as an ISA is finalized:

• Implementation engineers can work out the details that will allow the processor to meet the ISA
specification.

• Assembler and compiler writers create appropriate assemblers and compilers for use with the
processor long before a working model even exists.

• Operating system designers/maintainers determine what needs to be done to enable their
operating system to run on this processor.

• I/O Device engineers design controllers and driver software that will be used with the processor.

• Box (or equivalent) engineers can determine how to use the processor in their designs.

• etc.

15. What are the influences on instruction set design?

Instruction sets are influenced by ease of implementation, efficiency, and ease of programming. Larger
instruction sets make writing compilers easier, but at the detriment of speed or ease of implementation.
See also CISC and RISC.

16. What are conditional statements and how are they handled in the ISA?

Conditional statements compare 2 values in order to determine some sort of relation (equality, equal to
zero, positive, or negative). The ISA specifies which conditional statements are available. The LC-2200
features BEQ, but the LC-2110 had BR(N/Z/P).
17. Define the term addressing mode.

An addressing mode specifies how the bits of an instruction encode the locations of its operands. For
instance, some of the bits might be a register number or an offset to be added to the PC, etc.

18. In Section 2.8, we mentioned that local variables in a procedure are allocated on the stack. While
this description is convenient for keeping the exposition simple, modern compilers work quite
differently. This exercise is for you to search the Internet and find out how exactly modern compilers
allocate space for local variables in a procedure call. [Hint: Recall that registers are faster than memory.
So, the objective should be to keep as many of the variables in registers as possible.]

Generally, many local variables that would nominally be located in memory on the stack are instead
maintained in registers because of the significant speed advantage registers enjoy. We have already
noted the saved and temporary register conventions, the argument registers, and the return value and
return address registers. All of these are aimed at increasing speed and efficiency. In addition, modern
optimizing compilers employ sophisticated register allocation strategies designed to maximize the use
of registers. However, arrays and structures are maintained on the stack and not in registers.

We use the term abstraction to refer to the stack. What is meant by this term? Does the term
abstraction imply how it is implemented? For example, is a stack used in a procedure call/return a
hardware device or a software device?

An abstraction is a method of simplifying and eliminating unnecessary details while defining the
desired behavior of a system. Normally abstraction implies hiding implementation details. For example,
a queue is an abstraction that has to support enqueuing and dequeuing. The queue abstraction may be
implemented with a linked list, an array, etc. The stack facilitating a procedure call/return is a software
abstraction implemented with memory and register hardware but it could also be implemented as a
separate device.

19. Given the following instructions


BEQ Rx, Ry, offset; if (Rx == Ry) PC=PC+offset

SUB Rx, Ry, Rz ; Rx <- Ry - Rz

ADDI Rx, Ry, Imm ; Rx <- Ry + Immediate value

AND Rx, Ry, Rz ; Rx <- Ry AND Rz

Show how you can realize the effect of the following instruction:

BGT Rx, Ry, offset; if (Rx > Ry) PC=PC+offset

Assume that the registers and the Imm field are 8-bits wide. You can ignore the case that the SUB
instruction causes an overflow.

Solution:

If Rx > Ry then Ry - Rx < 0, i.e., the sign bit of Ry - Rx is set.

SUB $at, Ry, Rx ; $at <- Ry - Rx

ADDI $t3, $zero, x80 ; Create the mask 1000 0000

AND $at, $at, $t3 ; Isolate the sign bit

BEQ $at, $t3, offset ; Branch if the sign bit is set, i.e., Rx > Ry

Note: $t3 would need to be saved if already in use.
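As a sanity check, the effect of this three-instruction sequence can be simulated in a few lines of Python (a sketch assuming 8-bit registers and ignoring overflow, as the problem allows):

```python
def bgt_taken(rx, ry):
    """Mimic the SUB/ADDI/AND sequence that synthesizes BGT Rx, Ry."""
    at = (ry - rx) & 0xFF      # SUB $at, Ry, Rx (8-bit two's complement)
    t3 = 0x80                  # ADDI $t3, $zero, x80 (sign-bit mask)
    at = at & t3               # AND $at, $at, $t3 (isolate the sign bit)
    return at == t3            # branch taken when Ry - Rx is negative

# The synthesized branch agrees with a direct comparison
# over the overflow-free range:
for rx in range(-64, 64):
    for ry in range(-64, 64):
        assert bgt_taken(rx, ry) == (rx > ry)
```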

21. Given the following load instruction

LW Rx, Ry, OFFSET; Rx <- MEM [Ry + OFFSET]


Show how to realize a new addressing mode, called indirect, for use with the load instruction that is
represented in assembly language as:

LW Rx, @ (Ry);

The semantics of this instruction is that the contents of register Ry is the address of a pointer to the
memory operand that must be loaded in Rx.

Solution:

LW Rx, Ry, 0

LW Rx, Rx, 0
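The double-load realization can be illustrated with a small Python sketch (the memory contents and addresses here are made up for illustration):

```python
# A word-addressed memory: mem[100] holds a pointer (200), and
# mem[200] holds the actual operand.
mem = {100: 200, 200: 42}

ry = 100           # Ry holds the address of the pointer
rx = mem[ry + 0]   # LW Rx, Ry, 0  -> Rx = MEM[Ry] (the pointer)
rx = mem[rx + 0]   # LW Rx, Rx, 0  -> Rx = MEM[pointer] (the operand)
assert rx == 42    # Rx now holds the memory operand
```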

22. Convert this statement:

g = h + A[i];

into LC-2200 assembly, with the assumption that the address of A is located in $t0, g is in $s1, h is in
$s2, and i is in $t1.

ADD $t2, $t0, $t1 ; Calculate address of A[i]

LW $t3, 0($t2) ; Load the contents of A[i] into a register

ADD $s1, $s2, $t3 ; Assign sum to g

23. Suppose you design a computer called the Big Looper 2000 that will never be used to call
procedures and that will automatically jump back to the beginning of memory when it reaches the end.
Do you need a program counter? Justify your answer.
The Big Looper 2000 needs a PC to identify which address to fetch an instruction from on each cycle. The
PC is also used in calculating relative addresses, such as those used with branch instructions.

24. Consider the following program and assume that for this processor:

• All arguments are transmitted on the stack.

• Register V0 is for return values.

• The S registers are expected to be saved, that is, a calling routine can leave values in the S registers
and expect them to be there after a call.

• The T registers are expected to be temporary, that is, a calling routine must not expect values in the T
registers to be preserved after a call.

int bar(int a, int b)
{
    /* Code that uses registers T5, T6, S11-S13; */
    return (1);
}

int foo(int a, int b, int c, int d, int e)
{
    int x, y;
    /* Code that uses registers T5-T10, S11-S13; */
    bar(x, y); /* call bar */
    /* Code that reuses register T6 & arguments a, b, & c; */
    return (0);
}

int main(int argc, char **argv)
{
    int p, q, r, s, t, u;
    /* Code that uses registers T5-T10, S11-S15; */
    foo(p, q, r, s, t); /* Call foo */
    /* Code that reuses registers T9, T10; */
}

Here is the stack when bar is executing, clearly indicate in the spaces provided which procedure (main,
foo, bar) saved specific entries on the stack.

Main foo bar

_X__ ____ ____ p

_X__ ____ ____ q

_X__ ____ ____ r

_X__ ____ ____ s

_X__ ____ ____ t

_X__ ____ ____ u

_X__ ____ ____ T9

_X__ ____ ____ T10

____ _X__ ____ p

____ _X__ ____ q

____ _X__ ____ r

____ _X__ ____ s

____ _X__ ____ t

____ _X__ ____ x

____ _X__ ____ y

____ _X__ ____ S11

____ _X__ ____ S12

____ _X__ ____ S13

____ ____ ____ S14

____ ____ ____ S15

____ _X__ ____ T6


____ ____ _X__ x

____ ____ _X__ y

____ ____ _X__ S11

____ ____ _X__ S12

____ ____ _X__ S13 <----------- Top of Stack

Chapter 3 Processor Implementation…Solutions

1. What is the difference between level triggered logic and edge triggered logic? Which do we use?
Why?

In level triggered logic, register contents change state from current to new when the clock signal is high.
In edge triggered logic, the register contents change state on the rising or falling edge of the clock. In
edge triggered logic, if the change happens on the rising edge it is referred to as positive edge triggered
logic; as opposed to change happening on the falling edge, which is referred to as negative edge
triggered logic.

We use positive edge triggered logic.

Edge triggered logic avoids certain instability problems that are found in level triggered circuits.

2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12-a and Table
3.1) implement the sequential logic circuit for the garage door opener.

(Hint: The sequential logic circuit has 2 states and produces three outputs, namely, next state, up
motor control and down motor control).
3. Re-implement the above logic circuit using the ROM plus state register approach detailed in this
chapter.

4. Compare and contrast the various approaches to control logic design.

Microprogrammed:

Pros: Maintainable, flexible

Cons: Space/Time inefficiency

Uses: For complex instructions or quick non-pipelined prototyping of architectures

Hardwired:

Pros: Capable of pipelined implementation, potential for higher performance

Cons: Harder to change the design, longer design time

Uses: High performance pipelined implementation of architectures

5. One of the optimizations to reduce the space requirement of the control ROM based design is to
club together independent control signals and represent them using an encoded field in control ROM.
What are the pros and cons of this approach? What control signals can be clubbed together and what
cannot be? Justify your answer.
The main pro of this approach is that it reduces the size of the control ROM. The con is that it adds a
decoding step, which delays the generation of the control signals. Drive signals for a shared bus can be
clubbed together, since only one element may drive the bus in a given clock cycle; load signals cannot
all be clubbed together, because multiple storage elements may need to be clocked in the same clock
cycle.
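The space saving can be estimated with a short Python sketch. The signal names below are illustrative, not the full LC-2200 set: the drive signals are mutually exclusive (one bus driver per cycle), so they can share one encoded field, while the load signals stay as individual bits.

```python
import math

# Hypothetical control signals for a single-bus datapath
drive_signals = ["DrPC", "DrALU", "DrReg", "DrMEM", "DrOFF"]  # mutually exclusive
load_signals = ["LdPC", "LdA", "LdB", "LdMAR", "LdIR"]        # may co-occur

one_hot_bits = len(drive_signals) + len(load_signals)          # 10 bits per word
# Encoded: drives share ceil(log2(n+1)) bits (+1 for "drive nothing"),
# loads stay one bit each since several may assert in the same cycle.
encoded_bits = math.ceil(math.log2(len(drive_signals) + 1)) + len(load_signals)

print(one_hot_bits, encoded_bits)   # 10 vs 8 bits per control ROM word
```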

6. What are the advantages and disadvantages of a bus-based datapath design?

The advantage of a bus-based datapath design is that data signals are available to every piece of
hardware in the circuit, so you do not have to worry about sending signals to multiple devices because
they all have access. The disadvantage of this design is that you are limited to how many signals you can
send on each clock cycle. For example, in a single bus design, only one signal can be sent out each clock
cycle, which makes the datapath function less efficiently. Also, there are cost and space problems that
arise out of having so many wires.

7. Consider a three-bus design. How would you use it for organizing the above datapath elements?
How does this help compared to the two-bus design?

For organizing the above datapath elements using a 3-bus design (datapath figure not reproduced here):

This design would function more efficiently than the 2-bus design, allowing two operands to be read
from the register file, driven to the ALU, and the result stored back into the register file all in one step,
as shown by the ADD instruction. A 2-bus design requires 2 clock cycles to complete the ADD
instruction, whereas the 3-bus design can do it in 1 clock cycle, using the additional bus to drive the
ALU result to the register file, where the IR supplies the destination register number to complete the
write and therefore the instruction.
8. To save time would it be possible to store intermediate values in registers on the datapath such as
the Program Counter or Instruction Register? Explain why or why not.

It would be helpful to store intermediate values in registers for use from one instruction to the next,
but that essentially precludes using the PC or IR for this purpose, since they are vital to fetching and
decoding each instruction.

9. The Instruction Fetch is implemented in the text with first 4 states and then three. What would have
to be done to the datapath to make it two states long?

In order to implement the Instruction Fetch in 2 states, a second bus would have to be incorporated
into the datapath, so that MEM[MAR] could be driven to the IR at the same time that A+1 (the
incremented PC) is driven to the PC, thereby combining 2 of the present states into 1 and eliminating 1
state overall.

10. How many words of memory will this code snippet require when assembled? Is space allocated for
“L1”?

beq $s3, $s4, L1

add $s0, $s1, $s2

L1: sub $s0, $s0, $s3

This code snippet, when assembled, would require 3 words, 1 per instruction. No space is allocated for
L1. Instead, when the code is assembled, the L1 reference in the beq is replaced with the PC-relative
offset of the instruction it labels (here, the sub), after which the L1 label is no longer needed.

11. What is the advantage of fixed length instructions?


The advantage of fixed length instructions is that they ease and simplify instruction pipelining, allowing
single-instruction throughput per clock at high frequencies. Variable length instructions make it
difficult to decouple memory fetches, requiring the processor to fetch part of an instruction, then decide
whether to fetch more, possibly missing in a cache before the instruction is complete, whereas a fixed
length instruction can be fetched in one access, increasing speed and efficiency.

12. What is a leaf procedure?

A leaf procedure is a procedure that does not call any other procedure.

13. For this portion of a datapath (assuming that all lines are 16 bits wide). Fill in the table below.

Time A B C D E F

1 0x42 0xFE 0 0 0 0

2 0 0 0x42 0xFE 0 0

3 0xCAFE 0x1 0 0 0x140 0

4 0 0 0xCAFE 0x1 0 0x140

5 0 0 0 0 0xCAFE 0

6 0 0 0 0 0 0xCAFE

14. Suppose you are writing an LC-2200 program and you want to jump a distance that is farther than
allowed by a BEQ instruction. Is there a way to jump to that address?

If you want to jump a distance that is farther than allowed by a BEQ instruction, you can use a JALR
instruction. JALR stores PC+1 into Reg Y, where PC is the address of the current JALR instruction, and
then branches to the address currently in Reg X, which can be any address at any distance. You can also
use JALR to perform an unconditional jump by simply discarding the value stored in Reg Y afterwards.

15. Could the LC series processors be made to run faster if it had a second bus? If your answer was no,
what else would you need?

The LC series processors could not be made to run faster with a second bus alone. You would also need:

• another ALU

• an additional mux

• a DPRF (Dual Ported Register File)

With a DPRF in place of the temporary A and B registers, both operand values could be read from the
register file and driven over the two buses to the ALU at the same time, allowing ALU operations, and
therefore the overall performance of the processor, to be faster.

16. Convert this statement:

g = h + A[i];

into an LC-2200 implementation, assuming that the address of A is located in $t0, g is in $s1, h is in
$s2, and i is in $t1. The solution below is expressed as the sequence of control signals for each step.

Instruction0

RegSelLo DrReg LdA

goto Instruction1
Instruction1

DrOFF LdB

goto Instruction2

Instruction2

ALU_ADD DrALU LdMAR

goto Instruction3

Instruction3

DrMem LdA

goto Instruction4

Instruction4

RegSelLo DrReg LdB

goto Instruction5

Instruction5

ALU_ADD DrALU WrReg

halt

17. Suppose you design a computer called the Big Looper 2000 that will never be used to call
procedures and that will automatically jump back to the beginning of memory when it reaches the end.
Do you need a program counter? Justify your answer.
Even though the Big Looper 2000 never calls procedures, it still needs a program counter: the PC is
what identifies which address to fetch the next instruction from on each cycle, and it is needed to
compute the targets of branch instructions. Eliminating procedure calls removes the need for a return
address, not for the PC itself.

18. In the LC-2200 processor, why is there not a register after the ALU?

In the LC-2200 processor, there is no register after the ALU because the result of an ALU operation is
written directly into the destination register specified by the IR; the result is driven from the ALU onto
the bus and into the proper register in the same cycle.

19. In the datapath diagram shown in Figure 3.15, why do we need the A and B registers in front of the
ALU? Why do we need MAR? Under what conditions would you be able to do without with any of these
registers? [Hint: Think of additional ports in the register file and/or buses.]

We need the A and B registers in front of the ALU because ALU operations (ADD, NAND, A-B, A+1)
require 2 operands, so we need temporary registers to hold at least one of them, since we can only get 1
register value out of the register file since there is only 1 output port (Dout). Also, with only 1 bus, there
is only 1 channel of communication between any pair of datapath elements. Similarly, we need the
MAR so there is a place to hold the address sent by the ALU to the memory. The only conditions where
we would be able to do without some of these registers would be if there were multiple buses and/or
multiple output ports on the register file, thereby allowing multiple values to be communicated
simultaneously, so that the ALU could carry out its operations and the memory could look up data at a
specified address.

20. Core memory used to cost $0.01 per bit. Consider your computer. What would be a rough estimate
of the cost if memory cost is $0.01/bit? If memory were still at that price what would be the effect on
the computer industry?
My laptop has 4 GB memory or 32 Gbits

Roughly: 32,000,000,000 bits

@ $0.01/bit

$320,000,000.00

My laptop's memory alone would cost 320 million dollars.

This would dramatically shrink the industry!
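The arithmetic behind that estimate, as a quick sketch (assuming 4 GB of memory and using 1 GB = 10^9 bytes for round numbers):

```python
bytes_of_memory = 4 * 10**9          # 4 GB laptop memory
bits = bytes_of_memory * 8           # 32,000,000,000 bits
cost = bits * 0.01                   # at $0.01 per bit
print(f"${cost:,.2f}")               # $320,000,000.00
```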

21. If computer designers focused entirely on speed and ignored cost implications, what would the
computer industry look like today? Who would the customers be? Now consider the same question
reversed: If the only consideration was cost what would the industry be like?

If computers were designed focused entirely on speed and ignored cost implications, the computer
industry would only produce supercomputers with infinite processing power, and the consumers would
only be large organizations with lots of money that have a lot of data to process, like the government
and large universities and companies. However, if computers were designed focusing entirely on cost,
computers would be slow and clunky, unhelpful and inefficient, with a very limited set of operations
and the least amount of hardware possible, and the consumers would be people doing only very
simple operations, such as students.

22. Consider a CPU with a stack-based instruction set. Operands and results for arithmetic instructions
are stored on the stack; the architecture contains no general purpose registers.
The data path shown on the next page uses two separate memories, a 65,536 (216) byte memory to
hold instructions and (non-stack) data, and a 256 byte memory to hold the stack. The stack is
implemented with a conventional memory and a stack pointer register. The stack starts at address 0,
and grows upward (to higher addresses) as data are pushed onto the stack. The stack pointer points to
the element on top of the stack (or is -1 if the stack is empty). You may ignore issues such as stack
overflow and underflow.

Memory addresses referring to locations in program/data memory are 16 bits. All data are 8 bits.
Assume the program/data memory is byte addressable, i.e., each address refers to an 8-bit byte. Each
instruction includes an 8-bit opcode. Many instructions also include a 16-bit address field. The
instruction set is shown below. Below, "memory" refers to the program/data memory (as opposed to
the stack memory).

OPCODE INSTRUCTION OPERATION

00000000 PUSH <addr> push the contents of memory at address

<addr> onto the stack

00000001 POP <addr>

pop the element on top of the stack into memory at location <addr>

00000010 ADD

Pop the top two elements from the stack, add them, and push the result onto the stack

00000100 BEQ <addr> Pop top two elements from stack; if they're equal, branch to memory location
<addr>

Note that the ADD instruction is only 8 bits, but the others are 24 bits. Instructions are packed into
successive byte locations of memory (i.e., do NOT assume all instructions use 24 bits).

Assume memory is 8 bits wide, i.e., each read or write operation to main memory accesses 8 bits of
instruction or data. This means the instruction fetch for multi-byte instructions requires multiple
memory accesses.
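Before designing the datapath, it can help to pin down the instruction semantics. The Python sketch below interprets the four instructions at the ISA level (instruction addresses here are list indices rather than byte addresses, a simplification of the packed byte encoding described above):

```python
def run(program, memory):
    """Interpret the 4-instruction stack ISA. program is a list of
    (mnemonic, addr) pairs; ADD carries addr=None."""
    stack, pc = [], 0
    while pc < len(program):
        op, addr = program[pc]
        pc += 1
        if op == "PUSH":                     # push memory[addr]
            stack.append(memory[addr])
        elif op == "POP":                    # pop top of stack to memory[addr]
            memory[addr] = stack.pop()
        elif op == "ADD":                    # pop two, push their 8-bit sum
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFF)
        elif op == "BEQ":                    # pop two, branch if equal
            b, a = stack.pop(), stack.pop()
            if a == b:
                pc = addr
    return memory

# memory[2] = memory[0] + memory[1]
mem = run([("PUSH", 0), ("PUSH", 1), ("ADD", None), ("POP", 2)],
          {0: 3, 1: 4, 2: 0})
assert mem[2] == 7
```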
Datapath:

Complete the partial design shown on the next page.

Assume reading or writing the program/data memory or the stack memory requires a single clock cycle
to complete (actually, slightly less to allow time to read/write registers). Similarly, assume each ALU
requires slightly less than one clock cycle to complete an arithmetic operation, and the zero detection
circuit requires negligible time.

Control Unit:

Show a state diagram for the control unit indicating the control signals that must be asserted in each
state of the state diagram.

Solution
CH04 Interrupts, Traps and Exceptions…Solutions

1. Upon an interrupt what has to happen implicitly in hardware before control is transferred to the
interrupt handler?

The hardware has to save the program counter value implicitly before the control goes to the handler.
The hardware has to determine the address of the handler to transfer control from the currently
executing program to the handler. Depending on the architecture interrupts may also be disabled.

2. Why not use JALR to return from the interrupt handler?

We check for interrupts at the end of each instruction execution. Therefore, between enabling
interrupts and JALR $k0, we may get a new interrupt that will trash $k0. JALR only saves to the return
address register; if there are multiple interrupts then our $ra register will be overwritten, and we'll lose
our link.

3. Put the following steps in the correct order

Save $k0 on stack

Enable interrupt

Save state

Actual work of the handler

Restore state

Disable interrupt

Restore $k0 from stack


Return from interrupt

4. How does the processor know which device has requested an interrupt?

Initially, the processor does not know. However, once the processor acknowledges the interrupt, the
interrupting device will supply information to the processor as to its identity. For example, the device
might supply the address of its interrupt handler or it might supply a vector into a table of interrupt
handler addresses.

5. What instructions are needed to implement interruptible interrupts? Explain the function and
purpose of each along with an explanation of what would happen if you didn't have them.

interrupt handler:

! Assume interrupts are disabled when we enter

SW $k0, OFFSET($sp) ! store $k0 on stack to be able to return to

! original program

ADDI $sp, $sp, OFFSET ! reserves space on stack to save registers

EI ! enables interrupt

SW $registers($sp) ! save registers on stack to retrieve these

! registers when the interrupt finishes later

! Actual work of the handler

LW $registers($sp) ! restore registers from the stack to retrieve the

! original values

DI ! Disable interrupt to make sure we don't mess up

! with return address

LW $k0, OFFSET($sp) ! restore the original value for (return address)


RETI ! since we restore the return address, we can go

! back to the original program now

So the additional instructions are EI, DI and RETI.

Without these instructions an interrupt occurring while processing an interrupt could cause us to lose
our original interrupt return address.

6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.

___X___ disable interrupts;

___X___ save PC;

_______ save $k0;

_______ enable interrupts;

_______ save processor registers;

_______ execute device code;

_______ restore processor registers;

_______ disable interrupts;

_______ restore $k0;

___X___ disable interrupts;

___X___ restore PC;

___X___ enable interrupts;

_______ return from interrupt;

7. In the following actions in the INT macro state select the ONES THAT DO NOT BELONG.

___X__ save PC;

___X__ save SP;


______ $k0←PC;

___X__ enable interrupts;

___X__ save processor registers;

______ ACK INT by asserting INTA;

______ Receive interrupt vector from device on the data bus;

______ Retrieve PC address from the interrupt vector table;

___X__ Retrieve SP value from the interrupt vector table;

___X__ disable interrupts

___X__ PC←PC retrieved from the vector table;

______ SP←SP value retrieved from the vector table;

CH05 Processor Performance and Rudiments of Pipelined Processor Design…Solutions

1. True or false: For a given workload and a given instruction-set architecture, reducing the CPI (clocks
per instruction) of all the instructions will always improve the performance of the processor.

False. The execution time for the processor depends on the number of instructions, the average CPI, and
the clock cycle time. If we decrease the average CPI but this requires us to lengthen the instruction cycle
time we might see no improvement or even a decrease in performance.

2. An architecture has three types of instructions that have the following CPI:

Type CPI

A 2

B 5

C 3

An architect determines that she can reduce the CPI for B by some clever architectural trick, with no
change to the CPIs of the other two instruction types. However, she determines that this change will
increase the clock cycle time by 15%. What is the maximum permissible CPI of B (round it up to the
nearest integer) that will make this change still worthwhile? Assume that all the workloads that execute
on this processor use 40% of A, 10% of B, and 50% of C types of instructions.

Old Clock Cycle Time = 1

New Clock Cycle Time = 1.15

Instruction Old Time New Time

A 2 2.30

B 5 x

C 3 3.45

Old Total Time = 2 * 40 + 5 * 10 + 3 * 50

New Total Time = 2.30 * 40 + x * 10 + 3.45 * 50

Speedup = Old Total Time / New Total Time > 1

1 < (2 * 40 + 5 * 10 + 3 * 50) / (2.30 * 40 + x * 10 + 3.45 * 50)

1 < 280 / (10x + 264.5)

10x + 264.5 < 280

10x < 15.5

x < 1.55

The maximum new time for B is 1.55, which is equivalent to about 1.35 clock cycles (1.55 / 1.15).
Therefore, the maximum permissible CPI for B is 1, since any integer CPI greater than one would push
the speedup below 1, meaning the new architecture would be slower than the old one and the change
not worthwhile.
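The inequality can be checked exhaustively in Python (frequencies and CPIs from the problem statement):

```python
freq = {"A": 0.40, "B": 0.10, "C": 0.50}
old_cpi = {"A": 2, "B": 5, "C": 3}

old_time = sum(freq[t] * old_cpi[t] for t in freq)   # 2.8 cycles per instruction

def new_time(cpi_b):
    # clock cycle is 15% longer in the new design
    return 1.15 * (freq["A"] * 2 + freq["B"] * cpi_b + freq["C"] * 3)

# largest integer CPI for B that still yields a speedup > 1
ok = [b for b in range(1, 6) if old_time / new_time(b) > 1]
assert max(ok) == 1
```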

3. What would be the execution time for a program containing 2,000,000 instructions if the processor
clock was running at 8 MHz and each instruction takes 4 clock cycles?

exec. time = n * CPI_avg * clock cycle time

exec. time = 2,000,000 * 4 * (1 / 8,000,000)

exec. time = 1 sec
4. A smart architect re-implements a given instruction-set architecture, halving the CPI for 50% of the
instructions while increasing the clock cycle time of the processor by 10%. How much faster is the new
implementation compared to the original? Assume that the usage of all instructions is equally likely in
determining the execution time of any program for this problem.

Old CPI = 1 (normalized)

Old Clock Cycle Time = 1

Old Total Time = 1 * 1 = 1

New CPI = 0.5 * 1 + 0.5 * 0.5 = 0.75 (only half of the instructions have their CPI halved)

New Clock Cycle Time = 1.1

New Total Time = 0.75 * 1.1 = 0.825

Speedup = Old Total Time / New Total Time = 1 / 0.825 = 1.21

The new implementation has a speedup of about 1.21, meaning that it is about 21% faster than the
original implementation.
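A quick check in Python. Note that only 50% of the instructions have their CPI halved, so with a normalized old CPI of 1 the new average CPI is 0.5 * 1 + 0.5 * 0.5 = 0.75, not 0.5:

```python
old_cpi, old_clock = 1.0, 1.0
new_cpi = 0.5 * old_cpi + 0.5 * (old_cpi / 2)   # only half the instructions halved
new_clock = 1.1 * old_clock                     # clock cycle 10% longer

speedup = (old_cpi * old_clock) / (new_cpi * new_clock)
print(round(speedup, 2))                        # about 1.21
```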

5. A certain change is being considered in the non-pipelined (multi-cycle) MIPS CPU regarding the
implementation of the ALU instructions. This change will enable one to perform an arithmetic operation
and write the result into the register file all in one clock cycle. However, doing so will increase the clock
cycle time of the CPU. Specifically, the original CPU operates on a 500 MHz clock, but the new design will
only execute on a 400 MHz clock. Will this change improve, or degrade performance? How many times
faster (or slower) will the new design be compared to the original design? Assume instructions are
executed with the following frequency:

Instruction Frequency

LW 25%

SW 15%

ALU 45%

BEQ 10%

JMP 5%
Compute the CPI of both the original and the new CPU. Show your work in coming up with your answer.

Cycles per Instruction

Instruction CPI

LW 5

SW 4

ALU 4

BEQ 3

JMP 3

CPI old = (0.25)(5) + (0.15)(4) + (0.45)(4) + (0.10)(3) + (0.05)(3)

CPI old = (1.25) + (0.6) + (1.8) + (0.3) + (0.15)

CPI old = 4.1

With the change, an ALU instruction saves one cycle (execute and write-back now take a single cycle),
so its CPI drops from 4 to 3:

CPI new = (0.25)(5) + (0.15)(4) + (0.45)(3) + (0.10)(3) + (0.05)(3)

CPI new = (1.25) + (0.6) + (1.35) + (0.3) + (0.15)

CPI new = 3.65

Time old = 4.1 / 500 MHz = 8.2 ns per instruction

Time new = 3.65 / 400 MHz = 9.125 ns per instruction

This change degrades the overall speed of the architecture: the reduction in CPI is not enough to make
up for the slower clock, so the original design is about 1.11 times faster than the new one.
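The weighted-CPI arithmetic can be checked in Python. The sketch assumes the change merges execute and write-back for ALU instructions, reducing their CPI from 4 to 3:

```python
freq = {"LW": 0.25, "SW": 0.15, "ALU": 0.45, "BEQ": 0.10, "JMP": 0.05}
cpi_old = {"LW": 5, "SW": 4, "ALU": 4, "BEQ": 3, "JMP": 3}
cpi_new = dict(cpi_old, ALU=3)        # execute + write-back in one cycle

avg_old = sum(freq[i] * cpi_old[i] for i in freq)      # 4.1
avg_new = sum(freq[i] * cpi_new[i] for i in freq)      # 3.65

time_old = avg_old / 500e6             # seconds per instruction at 500 MHz
time_new = avg_new / 400e6             # seconds per instruction at 400 MHz
assert time_new > time_old             # the change degrades performance
```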

6. Given the CPI of various instruction classes

Class CPI

R-type 2

I-type 10

J-type 3

S-type 4

And instruction frequency as shown:

Class Program 1 Program 2

R 3 10

I 3 1

J 5 2

S 2 3

Which code will execute faster and why?

Program 1 Total CPI = 3 * 2 + 3 * 10 + 5 * 3 + 2 * 4

= 6 + 30 + 15 + 8

= 59

Program 2 Total CPI = 10 * 2 + 1 * 10 + 2 * 3 + 3 * 4

= 20 + 10 + 6 + 12
= 48

Program 2 will execute faster since the total number of cycles it will take to execute is less than the total
number of cycles of Program 1.
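The two totals, checked in Python with the class CPIs and instruction counts above:

```python
cpi = {"R": 2, "I": 10, "J": 3, "S": 4}
program1 = {"R": 3, "I": 3, "J": 5, "S": 2}
program2 = {"R": 10, "I": 1, "J": 2, "S": 3}

cycles1 = sum(program1[c] * cpi[c] for c in cpi)   # 59
cycles2 = sum(program2[c] * cpi[c] for c in cpi)   # 48
assert cycles2 < cycles1                           # Program 2 executes faster
```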

7. What is the difference between static and dynamic instruction frequency?

Static instruction frequency measures how often instructions appear in the compiled code. Dynamic
instruction frequency measures how often instructions are executed while a program is running.

8. Given

Instruction CPI

Add 2

Shift 3

Others 2 (average for all instructions including Add and Shift)

Add/Shift 3

If the sequence ADD followed by SHIFT appears in 20% of the dynamic frequency of a program, what is
the percentage improvement in the execution time of the program with all {ADD, SHIFT} replaced by the
new instruction?

Old Total CPI = 80 * 2 + 20 * (2 + 3)

= 260

New Total CPI = 80 * 2 + 20 * 3

= 220

Speedup = Old Total CPI / New Total CPI

= 260 / 220 = 1.18

The speedup is 1.18, i.e., the program runs about 18% faster; equivalently, execution time drops by
(260 - 220) / 260, about 15%.
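A quick Python check, distinguishing speedup from the fraction of execution time saved:

```python
old_cycles = 80 * 2 + 20 * (2 + 3)    # ADD then SHIFT as separate instructions
new_cycles = 80 * 2 + 20 * 3          # fused ADD/SHIFT instruction

speedup = old_cycles / new_cycles                      # 260 / 220, about 1.18
time_saved = (old_cycles - new_cycles) / old_cycles    # about 15.4% less time
print(round(speedup, 2), round(time_saved, 3))
```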
9. Compare and contrast structural, data and control hazards. How are the potential negative effects
on pipeline performance mitigated?

Structural hazard
Reason: hardware limitations
Potential stalls: 0, 1, 2, 3
Fix: feedback lines, or live with it

Data hazard (RAW)
Reason: an instruction reads the value of a register before it has been updated by the pipeline
Fix: data forwarding; a NOP (for the LW instruction)

Data hazard (WAR)
Reason: an instruction writes a new value to a register while another instruction is still reading the old
value, before the pipeline updates it
Fix: N/A; no problem, since the data was already copied into the pipeline buffer

Data hazard (WAW)
Reason: instructions write a new value into a register that was previously written to
Fix: N/A; no problem, since the old value was probably useless anyway

Control hazard
Reason: breaks in the sequential execution of a program because of branch instructions
Potential stalls: 1, 2
Fix: branch prediction or delayed branch

Notes:

a) Feedback Lines: are used in the hardware implementation of the pipeline. These tell the previous
stage of the pipeline that the current instruction is still being processed, hence do not send any more to
process. They also send a NOP, dummy operation, to the next stage of the pipeline until the current
instruction is ready to proceed to the next stage.

b) Data forwarding: the EX stage forwards the new value of the written register to the ID/RR stage so
that it can use the correct, updated value of that register.

c) Branch Prediction: the idea here is to predict the outcome of the branch and let the instructions
flow along the pipeline based on this prediction. If the prediction is correct, it completely gets rid of any
stalls. If the prediction is not correct, it will create two stalls.
d) Delayed branch: the idea here is to find a useful instruction with which to feed the pipeline while
we test the branch instruction.

10. How can a read after write (RAW) hazard be minimized or eliminated?

RAW hazards can be eliminated by adding data forwarding to the pipeline. Doing this involves adding
mechanisms to send data being written to a register back to earlier stages in the pipeline whose
instructions want to read from the same register.

11. What is a branch target buffer and how is it used?

The branch target buffer (BTB) is a hardware structure used to save time by remembering the address of
each branch's target, that is, the address of the instruction to which the branch jumps if taken. Upon
encountering a branch for the first time, the target address is calculated and stored in the BTB. After
that, if the branch is taken, the address is simply retrieved from the BTB.

12. Why is a second ALU needed in the Execute stage of the pipeline?

This second ALU is needed for instructions that change the PC of the processor, since the first ALU is
dedicated to performing arithmetic on the contents of registers.

13. In a processor with a five-stage pipeline as discussed in the class and shown in the picture below
(with buffers between the stages), explain the problem posed by a branch instruction. Present a
solution.

The problem presented by a branch instruction is that as it moves into the decode stage the fetch
stage does not yet know whether to fetch the next instruction or the branch target instruction.

Without other changes, this requires the pipeline to stall until it has the result of the branch
comparison.

A possible solution to ameliorate this problem:

a.) Perform the comparison in the decode stage.


b.) Change policy to unconditionally execute the instruction following the branch regardless of the
outcome of the branch.

c.) Install a branch predictor which will perhaps record the previous results of a branch and predict the
same outcome. Due to the nature of loops, this should have a very high success rate.

d.) Install a BTB as discussed in question 11.

14. Regardless of whether we use a conservative approach or branch prediction (branch not taken),
explain why there is always a 2-cycle delay if the branch is taken (i.e., 2 NOPs injected into the pipeline)
before normal execution can resume in the 5-stage pipeline used in Section 5.13.3.

The processor doesn't know whether the branch is taken until the BEQ instruction is in the execution
stage. Since there is no way to load the instruction from the new PC until after the branch outcome is
known, new instructions can only be loaded after the BEQ instruction has left the execution stage,
which occurs 2 cycles after it was fetched.

15. With reference to Figure 5.6a, identify and explain the role of the datapath elements that deal with
the BEQ instruction. Explain in detail what exactly happens cycle by cycle with respect to this datapath
during the passage of a BEQ instruction. Assume a conservative approach to handling the control
hazard. Your answer should include both the cases of branch taken and branch not taken.

Cycle 1: The BEQ instruction is fetched from memory into the FBUF in the IF stage of the pipeline.

Cycle 2: The BEQ instruction is decoded and its fields latched into the DBUF in the ID/RR stage of the pipeline. Also in this
cycle, the next sequential instruction is fetched.

Cycle 3: The BEQ instruction is executed (the register comparison is performed) in the EX stage of the pipeline. Also
in this cycle, the instruction after the BEQ is stalled and a NOP is fed into the ID/RR stage.

Cycle 4a-1: If the BEQ is not taken, the pipeline continues processing normally.

Cycle 4a-2: The BEQ is fed into the WB stage, followed by 1 NOP and then the sequential (fall-through) instruction.

Cycle 4b-1: If the BEQ is taken, the previously stalled instruction (the one after the BEQ) is
replaced by the branch-target instruction, which is fetched in the IF stage, and a
NOP is fed into the ID/RR stage. The BEQ is fed into the MEM stage.

Cycle 4b-2: The BEQ is fed into the WB stage, followed by 2 NOPs and then the branch-target instruction.

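The taken-branch case above can be summarized as a cycle-by-cycle occupancy table. This is an illustrative sketch: "I+1" stands for the sequential instruction after the BEQ, "TARGET" for the branch-target instruction, and "T+1" for the instruction after it.

```python
# Each row: (cycle, IF, ID/RR, EX, MEM, WB) under the conservative scheme.
trace = [
    (1, "BEQ",    "-",      "-",    "-",    "-"),
    (2, "I+1",    "BEQ",    "-",    "-",    "-"),
    (3, "I+1",    "NOP",    "BEQ",  "-",    "-"),    # I+1 held in IF, NOP injected
    (4, "TARGET", "NOP",    "NOP",  "BEQ",  "-"),    # branch resolved taken in cycle 3
    (5, "T+1",    "TARGET", "NOP",  "NOP",  "BEQ"),  # 2 NOPs trail the BEQ
]
for row in trace:
    print(*row, sep="\t")
```

The two NOP rows behind the BEQ in cycle 5 are the 2-cycle branch-taken penalty discussed in question 14.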
16. A smart engineer decides to reduce the 2-cycle "branch-taken" penalty in the 5-stage pipeline
down to 1 cycle. Her idea is to directly use the branch target address computed in the EX cycle to fetch
the instruction (note that the approach presented in Section 5.13.3 requires the target address to be
saved in PC first.)

a) Show the modification to the datapath in Figure 5.6a to implement this idea [hint: you have to
simultaneously feed the target address to the PC and the Instruction memory if the branch is taken].

b) While this reduces the bubbles in the pipeline to 1 for branch taken, it may not be a good idea.
Why? [hint: consider cycle time effects.]

a) The output of the second ALU in the EX stage should be made available as an input to the Instruction
Memory as well as to the PC in the IF stage. Two extra switches (a mux) should be added to determine which
address, the contents of the PC or the target address computed in the EX stage, is driven into the Instruction
Memory. Another adder would also need to be added to increment the computed target address before it is
saved to the PC, since in the next cycle the address of the following instruction needs to be in the PC.

b) This adds extra circuitry to the processor, which may require an increase in cycle time to work
correctly. This increase in cycle time may increase the run time of processes more than the reduced CPI
of the BEQ instruction would decrease the run time.

17. In a pipelined processor where each instruction could be broken up into 5 stages and where each
stage takes 1 ns what is the best we could hope to do in terms of average time to execute 1,000,000,000
instructions?

Since the pipeline has 5 stages, each taking 1 ns, the first instruction takes 5 ns to complete, but each
subsequent instruction completes only 1 ns after the one before it.

Then:

Best time = 5 ns + 999,999,999 × 1 ns = 1,000,000,004 ns

i.e., an average of just over 1 ns per instruction.
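As a sanity check, this is the standard pipeline formula (k + n − 1 cycles for n instructions on a k-stage pipeline), sketched here:

```python
def pipeline_time_ns(n_instructions, n_stages=5, stage_ns=1):
    # The first instruction takes n_stages cycles to fill the pipeline;
    # each subsequent instruction completes one cycle later.
    return (n_stages + n_instructions - 1) * stage_ns

print(pipeline_time_ns(1_000_000_000))  # 1000000004
```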

18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:
a) Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ instruction of LC-
2200.

b) Considering only the BEQ instruction, compute the sizes of the FBUF, DBUF, EBUF, and MBUF.

a) IF stage (cycle 1):

I-MEM[PC] -> FBUF // instruction at PC placed in FBUF

PC + 1 -> PC // increment PC

ID/RR stage (cycle 2):

DPRF[ FBUF[Rx] ] -> DBUF [A]; // load contents of Rx into DBUF

DPRF[ FBUF[Ry] ] -> DBUF [B]; // load contents of Ry into DBUF

FBUF[OPCODE] -> DBUF[OPCODE]; // copy opcode into DBUF

EXT [ FBUF[OFFSET] ]-> DBUF[OFFSET]; // copy sign extended offset into DBUF

EX stage (cycle 3):

DBUF[OPCODE] -> EBUF[OPCODE]; // copy opcode into EBUF

IF LDZ[ DBUF[A] - DBUF[B] ] == 1 // check if A - B is 0
    PC + DBUF[OFFSET] -> PC; // branch taken: load PC + offset into PC

MEM stage (cycle 4):

EBUF -> MBUF; // copy the EBUF into MBUF

WB stage (cycle 5):

(no action needed for BEQ)

b) Size of FBUF: 32 bits (Instruction)

Size of DBUF: (Rx) 32 bits + (Ry) 32 bits + opcode 4 bits + sign-extended offset 32 bits = 100 bits

Size of EBUF: opcode (4 bits) = 4 bits

Size of MBUF: opcode (4 bits) = 4 bits

19. Repeat problem 18 for the SW instruction of LC-2200.

SW instruction
FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) offset (16 bits)

DBUF: Opcode ( 8 bits) (rA) (32 bits) (rB) (32 bits) sign extended offset (32 bits)

EBUF: Opcode (8 bits) (rA) (32 bits) (rB)+offset (32 bits)

MBUF: Opcode (8 bits)

20. Repeat problem 18 for JALR instruction of LC-2200.

FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) PC + 1 (32 bits)

DBUF: Opcode (8 bits) (rA) (32 bits) rB (4 bits) PC + 1 (32 bits)

EBUF: Opcode (8 bits) rB (4 bits) PC + 1 (32 bits)

MBUF: Opcode (8 bits) rB (4 bits) PC + 1 (32 bits)

21. You are given the pipelined datapath for a processor as shown below.

LW instruction

FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) offset (16 bits)

DBUF: Opcode (8 bits) (rA) (32 bits) (rB) (32 bits) sign-extended offset (32 bits)

EBUF: Opcode (8 bits) (rA)+offset (32 bits) (rB) (32 bits)

MBUF: Opcode (8 bits) data@mem[(rA)+offset] (32 bits) (rB) (32 bits)

22. Consider

I1: R1 <- R2 + R3

I2: R4 <- R1 + R5

If I2 immediately follows I1 in the pipeline with no forwarding, how many bubbles (i.e., NOPs) will result in the above execution? Explain your answer.

Three bubbles will appear during execution, because when I2 reaches the ID/RR stage, it must wait until
I1 has left the WB stage, since I2 needs to read R1 while I1 is still waiting to write it.
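The bubble count in question 22 can be checked with a small helper (a sketch; `distance` is the number of instructions between producer and consumer, 1 when they are adjacent):

```python
def raw_stall_bubbles(distance=1, stages=("IF", "ID/RR", "EX", "MEM", "WB")):
    # With no forwarding, the consumer may not read its registers in ID/RR
    # until the producer's WB has completed.
    producer_wb_done = stages.index("WB") + 1              # cycle I1 finishes WB
    natural_decode = stages.index("ID/RR") + 1 + distance  # cycle I2 would decode unstalled
    return max(0, producer_wb_done + 1 - natural_decode)

print(raw_stall_bubbles())   # 3: I2 decodes in cycle 6 instead of cycle 3
print(raw_stall_bubbles(2))  # 2 bubbles if one instruction separates I1 and I2
```

This also shows why separating the dependent instructions by independent work shrinks the penalty, and why forwarding (which lets the consumer pick up the value at EX) removes it entirely.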

23. Consider the following program fragment:

Address instruction

1000 ADD

1001 NAND

1002 LW

1003 ADD

1004 NAND

1005 ADD

1006 SW

1007 LW

1008 ADD

Assume that there are no hazards in the above set of instructions. Currently the IF stage is about to
fetch instruction at 1004.

(a) Show the state of the 5-stage pipeline

(b) Assuming we use the “drain” approach to dealing with interrupts, how many Cycles will elapse
before we enter the INT macro state? What is the value of PC that will be stored in the INT macro state
into $k0?

(c) Assuming the “flush” approach to dealing with interrupts, how many cycles will elapse before we
enter the INT macro state? What is the value of PC that will be stored in the INT macro state into $k0?
a.) IF: NAND (1004)

ID: ADD (1003)

EX: LW (1002)

MEM: NAND (1001)

WB: ADD (1000)

b.) 5 cycles would elapse, since all the instructions already in the pipeline must complete their
execution before the INT macro state is entered. The value of PC stored in $k0 is 1005, since every
instruction up to and including 1004 will have completed.

c.) 1 cycle would elapse, since all the pipeline registers are cleared at once, so we enter the INT
macro state on the cycle after the interrupt. The value of PC stored in $k0 is 1000, since instructions
1000 through 1004 were flushed without completing and must be re-executed after the interrupt is serviced.

CH06 Processor Scheduling…Solutions

1. Compare and contrast process and program.

A process is a program in execution. A program is static, has no state, and has a fixed size on disk,
whereas a process is dynamic, exists in memory, may grow or shrink, and has associated with it “state”
that represents the information associated with the execution of the program.

2. What items are considered to comprise the state of a process?

The contents of the address space and the register values in use, including the PC and stack pointer,
constitute the state of a process; the state can also include scheduling properties such as priority and
arrival time.

3. Which metric is the most user-centric in a timesharing environment?


The response time of a job is the most user centric metric in a timesharing environment.

4. Consider a pre-emptive priority processor scheduler. There are three processes P1, P2, and P3 in
the job mix that have the following characteristics:

Process   Arrival Time   Priority   Activity

P1        0 sec          1          8 sec CPU burst, followed by 4 sec I/O burst, followed by 6 sec CPU burst, and quit

P2        2 sec          3          64 sec CPU burst, and quit

P3        4 sec          2          2 sec CPU burst, 2 sec I/O burst, 2 sec CPU burst, 2 sec I/O burst, 2 sec CPU burst, 2 sec I/O burst, 2 sec CPU burst, and quit

Diagram showing process execution:

What is the turnaround time for each of P1, P2, and P3?

P1-turnaround-time = 88 seconds
P2-turnaround-time = 64 seconds

P3-turnaround-time = 76 seconds

What is the average waiting time for this job mix?

Wait-time (P1) = 70 seconds

Wait-time (P2) = 0 seconds

Wait-time (P3) = 62 seconds

Average-wait-time = 44 seconds
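These numbers can be verified with a small tick-by-tick simulation. This is a sketch under assumptions implied by the solution rather than stated in it: a larger priority number means higher priority (P2's zero wait time requires this), each process has its own I/O device, and preemption can occur at every one-second tick.

```python
def simulate(procs):
    """procs: name -> dict(arrival, prio, bursts); bursts alternate CPU, I/O, CPU, ..."""
    state = {n: ["new", 0, p["bursts"][0]] for n, p in procs.items()}  # phase, burst index, remaining
    finish, t = {}, 0
    while any(s[0] != "done" for s in state.values()):
        for n, p in procs.items():                       # admit arrivals
            if state[n][0] == "new" and p["arrival"] <= t:
                state[n][0] = "ready"
        ready = [n for n in procs if state[n][0] == "ready"]
        running = max(ready, key=lambda n: procs[n]["prio"]) if ready else None
        for n in procs:                                  # I/O proceeds in parallel
            if state[n][0] == "io":
                state[n][2] -= 1
        if running:                                      # highest-priority ready process gets the CPU
            state[running][2] -= 1
        t += 1
        for n in procs:                                  # burst transitions
            s = state[n]
            if s[0] in ("ready", "io") and s[2] == 0:
                s[1] += 1
                if s[1] == len(procs[n]["bursts"]):
                    s[0], finish[n] = "done", t
                else:
                    s[0] = "io" if s[0] == "ready" else "ready"
                    s[2] = procs[n]["bursts"][s[1]]
    return finish

jobs = {
    "P1": {"arrival": 0, "prio": 1, "bursts": [8, 4, 6]},
    "P2": {"arrival": 2, "prio": 3, "bursts": [64]},
    "P3": {"arrival": 4, "prio": 2, "bursts": [2, 2, 2, 2, 2, 2, 2]},
}
done = simulate(jobs)
for n in jobs:
    turnaround = done[n] - jobs[n]["arrival"]
    wait = turnaround - sum(jobs[n]["bursts"])
    print(n, turnaround, wait)  # P1 88 70, P2 64 0, P3 76 62
```

The printed turnaround and wait times match the answers above, which also confirms the priority convention being assumed.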

5. What are the important deficiencies of FCFS CPU scheduling?

Its deficiencies are a huge potential variation in response time, and poor device and CPU utilization due to the convoy effect.

6. Name a scheduling algorithm where starvation is impossible?

First-Come First-Served (FCFS)

7. Which scheduling algorithm was noted as having a high variance in turnaround time?

First-Come First-Served (FCFS)

8. Which scheduling algorithm is provably optimal (minimum average wait time)?

Shortest Job First (SJF)


9. Which scheduling algorithm must be preemptive?

Round Robin

10. Given the following processes that arrived in the order shown

CPU Burst Time IO Burst Time

P1 3 2

P2 4 3

P3 8 4

Show the activity in the processor and the I/O area using the FCFS, SJF, and Round Robin algorithms.

Assuming each process requires a CPU burst followed by an I/O burst followed by a final CPU burst (as in
Example 1 in Section 6.6):

FCFS

SJF

Round Robin (assumed timeslice = 2)

11. Redo Example 1 in Section 6.6 using SJF and round robin (timeslice = 2)

SJF

a)

b) Response time (P1) = 28 Response time (P2) = 20

Response time (P3) = 7

c) Wait-time (P1) = 10

Wait-time (P2) = 5

Wait-time (P3) = 0

Round Robin

a)

b) Response time (P1) = 30

Response time (P2) = 26

Response time (P3) = 13

c) Wait-time (P1) = 12 Wait-time (P2) = 11

Wait-time (P3) = 6

12. Redo Example 3 in Section 6.6 using FCFS and round robin (timeslice = 2)

FCFS
a)

b) Wait-time (P1) = 6 Wait-time (P2) = 6

Wait-time (P3) = 22

c) Total time = 32

Throughput = 3/32 processes per unit time


Round Robin

a)

b) Wait-time (P1) = 11 Wait-time (P2) = 16

Wait-time (P3) = 7

c) Total time = 28

Throughput = 3/28 processes per unit time
