Surname 1

Surname:

Instructor:

Course:

Date:

Chapter 1

2. How does a high-level language influence the processor architecture?

A high-level language influences the processor architecture because the architecture is designed
to enable efficient translation of language constructs into instructions that the computer can execute.

3. Answer true or false, with justification: “The compiler writer is intimately aware of the
details of the processor implementation.”
False. The compiler writer must know the details of the ISA, but most of the processor
implementation is of no use to the compiler writer.
4. Explain the levels of abstraction found inside the computer, from the silicon substrate
to a complex multiplayer video game.
• Semiconductor materials possess electrical properties that enable them to be used to make
transistors, which act as switches.
• Transistors can be connected to implement logic gates, which are simple circuits that
realize logic functions such as AND, OR, and NOT.
• Logic gates can be connected to form functional units, which perform operations such as
decoding an n-bit binary number to select one of 2^n outputs or adding two n-bit binary
numbers together.
• Logic elements and logic gates can be used to build devices such as memories and state
machines that control logic circuits. These elements are joined to create a data path
and control system, which together form the processor.
• The instruction set tells the processor what to execute within the limits of its capabilities (e.g.,
add two numbers together, fetch something from memory, etc.).
• The compiler is then developed to translate a computer program written in a high-level language
into instructions drawn from the instruction set.
• Computer programs written in a high-level language are connected using networking and
communication technology, enabling people to interact and play games.

5. Answer true or false, with justification: “The internal hardware organization of a
computer system varies dramatically, depending on the specifics of the system.”

Both true and false, depending on interpretation. Details at the transistor level can change
based on speed versus power-consumption trade-offs. All computers need to be able to add,
and the essential circuitry for doing so will be similar. However, the organization of logic elements
and the design of the data path and control system might be entirely different between a graphics
processor and a processor used to control a hearing aid.

Chapter 2

2. Distinguish between the frame pointer and the stack pointer.

A frame pointer acts as a fixed reference point: the value in the frame pointer register is the
address from which the locations of a procedure's local variables are determined. A stack pointer,
on the other hand, is a processor register that holds the memory address of the top of the stack.
In most cases the stack grows from high addresses toward low addresses, so the stack pointer
value decreases each time a push is done.
3. In the LC-2200 architecture, where are operands normally found for an add
instruction?

The operands are located in registers.

5. An ISA may support different flavors of conditional branch instructions, such as BZ
(branch on zero), BN (branch on negative), and BEQ (branch on equal). Figure out the
predicate expressions in an if statement that may be best served by these different flavors of
conditional branch instructions. Give examples of such predicates in an if statement and
how you would compile them, using these different flavors of branch instructions.

BZ: if (a == 0)
BEQ: if (a == b)
BN: if (a < 0), or if (a < b), which becomes if (a - b < 0)
BP (branch on positive): if (a > 0), or if (a > b), which becomes if (a - b > 0)

8. Procedure A has important data in both S and T registers and is about to call procedure
B. Which registers should A store on the stack? Which registers should B store on the stack?

Procedure A should save the T registers whose values it still needs before calling procedure B.
B should save any S registers it uses, and save any T registers it needs before calling another procedure.

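14. What is an ISA and why is it important?

The ISA, or Instruction Set Architecture, defines the machine code that a processor reads and acts upon,
as well as the word size, memory addressing modes, processor registers, and data types. It is important
because it tells everyone who builds or programs the processor what to expect from it.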
15. What are the influences on instruction set design?

Instruction sets are influenced by ease of implementation, efficiency, and ease of programming.
Larger instruction sets make writing compilers easier, but at the expense of speed or ease of
implementation.

22. Convert the statement g = h + A[i] into LC-2200 assembler, with the assumption that the
address of A is located in $t0, g is in $s1, h is in $s2, and i is in $t1.

ADD $t2, $t0, $t1 ; Calculate address of A[i]
LW $t3, 0($t2) ; Load the contents of A[i] into a register
ADD $s1, $s2, $t3 ; Assign sum to g
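
The three instructions can be traced with a small Python model of a register file and memory. This is only a sketch: it assumes a word-addressed memory, and the base address of A, the array contents, and the register values are made-up illustration values that do not come from the exercise.

```python
# Hypothetical word-addressed memory: array A is assumed to start at address 100.
memory = {100: 7, 101: 11, 102: 13}

# Register values chosen only for illustration: $t0 = address of A, $t1 = i, $s2 = h.
regs = {"$t0": 100, "$t1": 2, "$s2": 5}

regs["$t2"] = regs["$t0"] + regs["$t1"]   # ADD $t2, $t0, $t1 -> address of A[i]
regs["$t3"] = memory[regs["$t2"] + 0]     # LW  $t3, 0($t2)   -> contents of A[i]
regs["$s1"] = regs["$s2"] + regs["$t3"]   # ADD $s1, $s2, $t3 -> g = h + A[i]

print(regs["$s1"])
```

With these illustration values, $s1 ends up holding h + A[2].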

Chapter 3

1. What is the difference between level triggered logic and edge triggered logic? Which do
we use? Why?

With level-triggered logic, register contents change state from current to new while the clock
signal is high, whereas with edge-triggered logic the change to the register contents happens
on the rising or falling clock edge. Triggering on the rising clock edge gives positive-edge-triggered
logic, and triggering on the falling edge gives negative-edge-triggered logic.

We use positive-edge-triggered logic.

Edge-triggered logic avoids certain instability problems that are found
in level-triggered circuits.

2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12-a and
Table 3.1) implement the sequential logic circuit for the garage door opener.
(Hint: The sequential logic circuit has 2 states and produces three outputs, namely, next
state, up motor control and down motor control).

6. What are the advantages and disadvantages of a bus-based data path design?

The advantage of a bus-based data path design is that data signals are available to every
piece of hardware in the circuit, so sending a signal to multiple devices needs no extra wiring.
The disadvantage of this design is that the number of signals that can be sent in each clock
cycle is limited. For instance, in a single-bus design, only one signal can be sent out each
clock cycle, which makes the data path less efficient. In addition, there are cost
and space problems that arise from having so many wires.

9. The Instruction Fetch is implemented in the text with first 4 states and then three.
What would have to be done to the datapath to make it two states long?

To implement the Instruction Fetch in 2 states, a second bus would have to be
incorporated into the datapath, one that could carry either MEM[MAR] to the IR or A+1 to
the PC in parallel with the existing bus, thereby combining 2 of the present
states into 1 and eliminating 1 state overall.

10. How many words of memory will this code snippet require when assembled? Is space
allocated for “L1”?

beq $s3, $s4, L1
add $s0, $s1, $s2
L1: sub $s0, $s0, $s3

This code snippet, when assembled, requires 3 words, 1 per instruction. No space is
allocated for L1. Instead, when the code is assembled, the L1 reference in the
beq is replaced with the location of the instruction it refers to, which in this case is
the sub instruction at word 2 (counting from 0), after which the L1 label is no longer needed.
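
A toy two-pass label resolver in Python illustrates why the label takes no space. It is a sketch and not the real LC-2200 assembler: the function name `assemble_labels` is hypothetical, and it assumes branch targets are encoded as an offset from PC + 1.

```python
def assemble_labels(lines):
    """Two-pass label resolution for a toy assembler (illustrative only)."""
    labels = {}
    instructions = []
    # Pass 1: record the word address of each label; a label itself takes no space.
    for line in lines:
        line = line.strip()
        if ":" in line:
            name, line = line.split(":", 1)
            labels[name.strip()] = len(instructions)
            line = line.strip()
        if line:
            instructions.append(line)
    # Pass 2: replace a label operand with an offset (assumed: target - (pc + 1)).
    resolved = []
    for pc, text in enumerate(instructions):
        op, _, rest = text.partition(" ")
        operands = [o.strip() for o in rest.split(",")]
        if operands[-1] in labels:
            operands[-1] = str(labels[operands[-1]] - (pc + 1))
        resolved.append(f"{op} {', '.join(operands)}")
    return resolved, labels


program = ["beq $s3, $s4, L1", "add $s0, $s1, $s2", "L1: sub $s0, $s0, $s3"]
words, labels = assemble_labels(program)
print(len(words), labels, words[0])
```

The output has three words, L1 maps to word 2, and the beq carries a number instead of the label name.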

11. What is the advantage of fixed length instructions?

One of the advantages is that fixed-length instructions simplify instruction pipelining, which
enables single-clock throughput at high frequencies. Variable-length
instructions, by contrast, make it difficult to decouple memory fetches: the processor must fetch
part of the instruction, then decide whether to fetch more, possibly missing in the cache before
the instruction is complete. A fixed length allows the full instruction to be fetched in one access,
increasing speed and efficiency.

12. What is a leaf procedure?

A leaf procedure refers to a procedure that never calls any other procedure.

14. Suppose you are writing an LC-2200 program and you want to jump a distance that is
farther than allowed by a BEQ instruction. Is there a way to jump to an address?

If you are writing an LC-2200 program and you want to jump a distance that is farther than
allowed by a BEQ instruction, you can jump using a JALR instruction. This instruction
stores PC+1 into Reg Y, where PC is the address of the current JALR instruction. Then it
branches to the address currently in Reg X, which can be any address you want to go to,
at any distance. You can also use JALR to perform an unconditional jump: after using
the JALR instruction, you simply discard the value stored in Reg Y.

15. Could the LC series processors be made to run faster if it had a second bus? If your
answer was no, what else would you need?

Another ALU
An additional mux
A TPRF

The LC series processors could not be made to run faster by adding a second bus alone. However,
if they also had a DPRF (Dual-Ported Register File) instead of the temporary A and B
registers, they could run faster. Using a DPRF to supply the operands for the ALU
allows two register values to be driven onto the two buses at the same time, which speeds up ALU
operations and therefore the overall performance of the processor.

Chapter 4

1. Upon an interrupt what has to happen implicitly in hardware before control is transferred to
the interrupt handler?
The hardware must implicitly save the program counter value before control is transferred
to the handler. The hardware must also determine the address of the handler in order to transfer
control from the currently executing program to the handler. Depending on the architecture,
interrupts may also be disabled.

3. Put the following steps in the correct order

Save $k0 on stack
Enable interrupts
Save state
Actual work of the handler
Restore state
Disable interrupts
Restore $k0 from stack
Return from interrupt

4. How does the processor know which device has requested an interrupt?

Initially the processor does not know. However, once the processor acknowledges the
interrupt, the interrupting device supplies its identity to the processor.
For instance, the device might supply the address of its interrupt handler, or it might
supply a vector into a table of interrupt handler addresses.
6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.
___X___ disable interrupts;
___X___ save PC;
_______ save $k0;
_______ enable interrupts;
_______ save processor registers;
_______ execute device code;
_______ restore processor registers;
_______ disable interrupts;
_______ restore $k0;
___X___ disable interrupts;
___X___ restore PC;
___X___ enable interrupts;
_______ return from interrupt;

7. In the following actions in the INT macro state select the ONES THAT DO NOT
BELONG.
___X__ save PC;
___X__ save SP;
______ $k0←PC;
___X__ enable interrupts;
___X__ save processor registers;
______ ACK INT by asserting INTA;
______ Receive interrupt vector from device on the data bus;
______ Retrieve PC address from the interrupt vector table;
___X__ Retrieve SP value from the interrupt vector table;
______ disable interrupts;
______ PC←PC retrieved from the vector table;
___X__ SP←SP value retrieved from the vector table;

Homework Chapter 5

1. True or false: For a given workload and a given instruction-set architecture, reducing the
CPI (clocks per instruction) of all the instructions will always improve the performance
of the processor?
False. Execution time depends on the number of instructions, the average CPI, and the clock
cycle time. If the average CPI is reduced but doing so requires lengthening the clock cycle time,
performance may not improve and may even get worse.

3. What would be the execution time for a program containing 2,000,000 instructions if the
processor clock was running at 8 MHz and each instruction takes 4 clock cycles?

Exec. time = n * CPI_avg * clock cycle time
Exec. time = 2,000,000 * 4 * (1/8,000,000)
Exec. time = 1 sec
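
The same arithmetic can be checked with a short Python sketch. The helper name `execution_time` is hypothetical, and the second call uses made-up numbers (CPI of 3 at 5 MHz) only to illustrate the point from question 1 that a lower CPI does not help if the clock has to slow down.

```python
def execution_time(instruction_count, avg_cpi, clock_hz):
    """Execution time in seconds: n * CPI_avg * clock cycle time (= n * CPI / f)."""
    return instruction_count * avg_cpi / clock_hz


# Values from the question: 2,000,000 instructions, CPI of 4, 8 MHz clock.
t = execution_time(2_000_000, 4, 8_000_000)

# Hypothetical alternative design: lower CPI but a slower 5 MHz clock.
t_alt = execution_time(2_000_000, 3, 5_000_000)

print(t, t_alt)
```

The alternative design has the lower CPI yet the longer execution time, because its cycle time grew by more than its CPI shrank.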

6. Given the CPI of various instruction classes

Class     CPI
R-type    2
I-type    10
J-type    3
S-type    4

And instruction frequency as shown:

Class    Program 1    Program 2
R        3            10
I        3            1
J        5            2
S        2            3

Which code will execute faster and why?

Program 1 total cycles = 3 * 2 + 3 * 10 + 5 * 3 + 2 * 4
= 6 + 30 + 15 + 8
= 59

Program 2 total cycles = 10 * 2 + 1 * 10 + 2 * 3 + 3 * 4
= 20 + 10 + 6 + 12
= 48

Program 2 will execute faster, since the total number of cycles it takes to execute is
less than the total number of cycles of Program 1.
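
The totals can be reproduced with a few lines of Python. This sketch assumes both programs run on the same processor at the same clock rate, so fewer cycles means less time; the variable and function names are illustrative.

```python
# CPI per instruction class and instruction counts, taken from the tables above.
cpi = {"R": 2, "I": 10, "J": 3, "S": 4}
program_1 = {"R": 3, "I": 3, "J": 5, "S": 2}
program_2 = {"R": 10, "I": 1, "J": 2, "S": 3}


def total_cycles(counts):
    """Sum of (instruction count * CPI) over all instruction classes."""
    return sum(counts[cls] * cpi[cls] for cls in counts)


cycles_1 = total_cycles(program_1)
cycles_2 = total_cycles(program_2)
print(cycles_1, cycles_2)
```

Program 2 executes more instructions (16 versus 13) but needs fewer cycles, because it uses fewer of the expensive I-type instructions.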
7. What is the difference between static and dynamic instruction frequency?
Static instruction frequency measures how often instructions appear in compiled code.
Dynamic instruction frequency measures how often instructions are executed while a program
is running.
9. Compare and contrast structural, data and control hazards. How are the potential
negative effects on pipeline performance mitigated?

Type of hazard | Reason for hazard | Potential stalls | Fix
Structural | Hardware limitations | 0, 1, 2, 3 | Feedback lines, or live with it
Data (RAW) | An instruction reads the value of a register before it has been updated by the pipeline | | Data forwarding; NOP (for LW instruction)
Data (WAR) | An instruction writes a new value to a register while other instructions are still reading the old value, before the pipeline updates it | No problem, since the data was already copied into the pipeline buffer | N/A
Data (WAW) | An instruction writes a new value into a register that was previously written to | No problem, since the old value was probably useless anyway | N/A
Control | Breaks in the sequential execution of a program because of a branch instruction | 1, 2 | Branch prediction or delayed branch

Notes:
a) Feedback lines: these are useful in the hardware implementation of the pipeline. They tell
the previous stage of the pipeline that the current instruction is still being processed, so it
should not send any more instructions. A NOP (dummy operation) is also sent to the next stage of
the pipeline until the current instruction is ready to proceed to the next stage.

b) Data forwarding: the EX stage forwards the new value of the written register to the ID/RR
stage, so that it can process the correct value of the updated register.

c) Branch prediction: the idea here is to predict the outcome of the branch and let instructions
flow along the pipeline based on this prediction. If the prediction is correct, it completely
eliminates the stalls. If the prediction is not correct, it creates 2 stalls.

d) Delayed branch: the idea here is to find a useful instruction that we can feed the pipeline with,
while we test the branch instruction.

10. How can a read after write (RAW) hazard be minimized or eliminated?
RAW hazards can be minimized or eliminated by adding data forwarding to a pipelined processor.
This involves adding mechanisms that send data being written to a register back to earlier stages
of the pipeline that want to read from the same register.

14. Regardless of whether we use a conservative approach or branch prediction (branch not
taken), explain why there is always a 2-cycle delay if the branch is taken (i.e., 2 NOPs
injected into the pipeline) before normal execution can resume in the 5-stage pipeline used
in Section 5.13.3.
The processor does not know whether the branch is taken until the BEQ instruction is in the
execution stage. Since there is no mechanism to load the instruction from the new PC
any earlier, new instructions can only be loaded after the BEQ instruction has been through
the execution stage, which occurs 2 cycles after it was fetched.

18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:
a. Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ
instruction of LC-2200.
a) IF stage (cycle 1):
I-MEM[PC] -> FBUF
PC + 1 -> PC

ID/RR stage (cycle 2):
DPRF[FBUF[Rx]] -> DBUF[A];
DPRF[FBUF[Ry]] -> DBUF[B];
FBUF[OPCODE] -> DBUF[OPCODE];
EXT[FBUF[OFFSET]] -> DBUF[OFFSET];

EX stage (cycle 3):
DBUF[OPCODE] -> EBUF[OPCODE];
IF LDZ[DBUF[A] - DBUF[B]] == 1
    PC + DBUF[OFFSET] -> PC;

MEM stage (cycle 4):
EBUF -> MBUF;

WB stage (cycle 5):

22. Consider
I1: R1 <- R2 + R3
I2: R4 <- R1 + R5
If I2 is following I1 immediately in the pipeline with no forwarding, how many bubbles (i.e.
NOPs) will result in the above execution? Explain your answer.
Three bubbles will appear during execution, because when I2 is in the ID/RR stage, it must
wait until I1 has left the WB stage: I2 needs to read R1, but I1 has not yet written R1.
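
The count can be reproduced with a small Python sketch. It assumes the 5-stage pipeline named in the question, no forwarding, and that a register can be read in ID/RR only in a cycle after the writing instruction has completed WB; the function name `raw_bubbles` is hypothetical.

```python
STAGES = ["IF", "ID/RR", "EX", "MEM", "WB"]


def raw_bubbles(distance):
    """Bubbles needed for a RAW dependence when there is no forwarding.

    distance = 1 means the reading instruction immediately follows the writer.
    """
    read_stage = STAGES.index("ID/RR")
    write_stage = STAGES.index("WB")
    # Without stalls the reader reaches ID/RR `distance` cycles after the writer did.
    # It must reach ID/RR one cycle after the writer's WB stage.
    needed_gap = write_stage - read_stage + 1
    return max(0, needed_gap - distance)


print(raw_bubbles(1))
```

Under these assumptions, placing more independent instructions between I1 and I2 reduces the bubble count by one per instruction until it reaches zero.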
Homework Chapter 6
1. Compare and contrast process and program.

A process is a program in execution. A program is static, has no state, and has a fixed
size on disk, whereas a process is dynamic, exists in memory, may grow or shrink, and
has associated with it “state” that represents the information associated with the
execution of the program.

7. Which scheduling algorithm was noted as having a high variance in turnaround time?

First-Come First-Served (FCFS)

8. Which scheduling algorithm is provably optimal (minimum average wait time)?

Shortest Job First (SJF)
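
A brute-force check on one small example illustrates the claim (it is not a proof). The sketch assumes all jobs arrive at time 0 and run to completion without preemption; the burst times are hypothetical.

```python
from itertools import permutations


def average_wait(bursts):
    """Average waiting time when jobs run to completion in the given order."""
    waited, elapsed = 0, 0
    for burst in bursts:
        waited += elapsed      # this job waited for everything scheduled before it
        elapsed += burst
    return waited / len(bursts)


bursts = [8, 3, 4]             # hypothetical CPU burst times, in arrival order
sjf_order = sorted(bursts)     # shortest job first
best = min(permutations(bursts), key=average_wait)

print(sjf_order, average_wait(sjf_order), average_wait(bursts))
```

For this example the ordering with the smallest average wait is exactly the shortest-first ordering, and running the jobs in arrival order (as FCFS would) waits longer on average.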

9. Which scheduling algorithm must be preemptive?

Round Robin

10. Given the following processes that arrived in the order shown

Process    CPU Burst Time    IO Burst Time
P1         3                 2
P2         4                 3
P3         8                 4
Show the processor activities and the I/O area using the FCFS, SJF, and Round Robin
algorithms.
Assuming each process requires a CPU burst followed by an I/O burst followed by a final
CPU burst (as in Example 1 in Section 6.6):

FCFS

SJF

Round Robin (assumed timeslice = 2)

1. Consider Google Earth application. You launch the application, move the mouse on the
earth’s surface, and click on Mount Everest to see an up-close view of the mountain
range. Identify the interactions in layman’s terms between the operating system and the
hardware during the above sequence of actions.

Launching the application causes the operating system to start a process, which requests from
the operating system a connection to Google Earth. Each action performed by the user
either changes the state of the program and requests that the operating system perform output,
or changes the state of the program and requests that the operating system send a message to
Google Earth asking for additional information.

2. How does a high-level language influence the processor architecture?


The processor architecture is designed to enable efficient, cost-effective conversion of
high-level language constructs into instructions that the machine can execute.

3. Answer True or False with justification: “The compiler writer is intimately aware of
the details of the processor implementation.”

False: The compiler writer must know the details that make up the instruction set
architecture, but many details of the processor implementation are of no use or interest to the
compiler writer.
4. Explain the levels of abstraction found inside the computer from the silicon substrate
to a complex multi-player video game.

Semiconductor materials possess electrical properties that enable them to be used to make
transistors, which act as switches.
Transistors can be connected to implement logic gates, which are simple circuits that
realize logic functions such as AND, OR, and NOT.
Logic gates can be connected together to form functional units, which perform operations such as
decoding an n-bit binary number to select one of 2^n outputs or adding two n-bit binary
numbers together.
Logic elements and logic gates can be used to build devices such as memories and state
machines that control logic circuits. These elements are joined to create a data path
and control system, which together form the processor.
The instruction set tells the processor what to execute within the limits of its capabilities (e.g., add
two numbers together, fetch something from memory, etc.).
The compiler is then developed to translate a computer program written in a high-level language
into instructions drawn from the instruction set.
Computer programs written in a high-level language are connected together using networking
and communication technology, enabling people to interact and play games.

5. Answer True or False with justification: “The internal hardware organization of a
computer system varies dramatically depending on the specifics of the system.”

Both true and false, depending on interpretation. Details at the transistor level can change
based on speed versus power-consumption trade-offs. All computers need to be able to add,
and the essential circuitry for doing so will be similar. However, the organization of logic elements
and the design of the data path and control system might be quite different between a graphics
processor and a processor used to control a hearing aid.

6. What is the role of a “bridge” between computer buses as shown in Figure 1.8?

It acts as a kind of translator/communication path between two buses, which
may not share the same operating protocols.
7. What is the role of a “controller” in Figure 1.8?

Controllers appear to the computer as memory locations that are in reality control registers for the
particular I/O devices to be controlled. The controller takes the information supplied by
the processor and converts it into the appropriate control signals for the I/O device, and/or
retrieves information from the device and sets bits in control registers so that the processor can
receive the information.

8. Using the Internet, research and explain 5 major milestones in the evolution of computer
hardware.

Examples:
Vacuum tubes to transistors
Integrated circuits
Disk drives
Display technology (from paper to glass)
Networking

9. Using the Internet, research and explain 5 major milestones in the evolution of the
operating system.

Examples:
Multiprogramming
Scheduling
Time sharing
GUI Interface
Parallel operating systems
Error recovery

10. Compare and contrast grid computing and the power grid. Explain how the analogy makes
sense. Also, explain how the analogy breaks down.

In both cases there is an interconnected network of devices serving some useful purpose.
The generating systems can be thought of as powerful resources supplying
power to the industrial and residential users of electricity. The analogy breaks down in
several ways. In grid computing, information flow is more two-way (or even n-way). Things are
also paid for differently: in the electric grid the consumers pay the producers, whereas in grid
computing additional revenue streams may be provided by advertisers or others wishing to use
information generated by the grid. In the power grid there is a relatively small number of
suppliers compared to a vast number of consumers. In grid computing there would perhaps still be
more consumers than producers, but many more producers than in the case of the power grid.
11. Match the left and right hand sides.
Unix operating system: Ritchie
Microchip: Kilby and Noyce
FORTRAN programming language: Backus
C programming language: Thompson and Ritchie
Transistor: Bardeen, Brattain, and Shockley
World’s first programmer: Lovelace
World’s first computing machine: Babbage
Vacuum tube: De Forest
ENIAC: Mauchly and Eckert
Linux operating system: Torvalds

Chapter 2 Processor Architecture…Solutions

1. Having a large register-file is detrimental to the performance of a processor since it
results in a large overhead for procedure call/return in high-level languages. Do you
agree or disagree? Give supporting arguments.

Disagree.
By judicious use of calling conventions that define saved and temporary registers, call/return
overhead can be managed to any desired level of performance.

2. Distinguish between the frame pointer and the stack pointer.

A frame pointer acts as a fixed reference point: the value in the frame pointer register is the
address from which the locations of a procedure's local variables are determined. A stack pointer,
on the other hand, is a processor register that holds the memory address of the top of the stack.
In most cases the stack grows from high addresses toward low addresses, so the stack pointer
value decreases each time a push is done. Note: This is not to say that when a function calls
another function the frame pointer remains fixed. It does not. Rather, it is changed on the call
and re-established upon return, so that it is fixed throughout the execution of a given function's
own code.

3. In the LC-2200 architecture, where are operands normally found for an add
instruction?

The operands are in registers.

4. Endianness: Let’s say you want to write a program for comparing two strings. You
have a choice of using a 32-bit byte-addressable Big-endian or Little-endian
architecture to do this. In either case, you can pack 4 characters in each word of 32 bits.
Which one would you choose and how will you write such a program? [Hint:
Normally, you would do string comparison one character at a time. If you can do it a
word at a time instead of a character at a time, that implementation will be faster.]

The choice of endianness does not matter, so long as it is consistent across the comparison.
Subtract each word of string A from the corresponding word of string B. If every result is zero,
the strings are identical. Note that you can only perform this operation on the same system. The
operation is more complex if you are trying to compare a string from a Big-endian system with one
from a Little-endian system: you would have to compare character by character.
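
A word-at-a-time equality test can be sketched in Python as follows. The function name `strings_equal_by_words` and the NUL padding of the final partial word are assumptions made for the illustration; the point is that the result is the same for either byte order as long as both strings are read the same way.

```python
def strings_equal_by_words(a: bytes, b: bytes, byteorder: str = "big") -> bool:
    """Compare two byte strings 4 bytes (one 32-bit word) at a time."""
    if len(a) != len(b):
        return False
    # Pad with NUL bytes so that every chunk is a full 4-byte word.
    pad = (-len(a)) % 4
    a += b"\x00" * pad
    b += b"\x00" * pad
    for i in range(0, len(a), 4):
        word_a = int.from_bytes(a[i:i + 4], byteorder)
        word_b = int.from_bytes(b[i:i + 4], byteorder)
        if word_a - word_b != 0:   # a nonzero difference means the words differ
            return False
    return True


same = strings_equal_by_words(b"computer", b"computer")
different = strings_equal_by_words(b"computer", b"computes", byteorder="little")
print(same, different)
```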

5. ISA may support different flavors of conditional branch instructions such as BZ
(branch on Zero), BN (branch on negative), and BEQ (branch on equal). Figure out
the predicate expressions in an “if” statement that may be best served by these
different flavors of conditional branch instructions. Give examples of such predicates
in “if” statement and how you will compile them using these different flavors of
branch instructions.

BZ: if (a == 0)
BEQ: if (a == b)
BN: if (a < 0) or if (a < b) which becomes if (a-b < 0)
BP: if (a > 0) or if (a>b) which becomes if (a-b > 0)

6. We said that endianness will not affect your program performance or correctness so
long as the use of a (high level) data structure is commensurate with its declaration.
Are there situations where even if your program does not violate the above rule, you
could be bitten by the endianness of the architecture? [Hint: Think of programs that
cross network boundaries.]

Yes. If data from a big-endian computer is transferred over a network to a little-endian
computer, data corruption can occur. However, this problem has been
solved on the Internet and on networks using similar technology. The solution
is for networks to use a standard endianness. If the endianness of the network
differs from that of the host computer, the host’s network interface applies the appropriate
conversion.
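
The conversion can be illustrated with Python's standard `struct` module, where the `!` format prefix means network (big-endian) byte order and `<` means little-endian. The value 0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678

wire = struct.pack("!I", value)       # bytes as sent in network byte order
little = struct.pack("<I", value)     # bytes as a little-endian host stores them

# A receiver that unpacks with the network format recovers the original value.
(received,) = struct.unpack("!I", wire)

# A receiver that wrongly treats the wire bytes as little-endian gets a different number.
(misread,) = struct.unpack("<I", wire)

print(wire.hex(), little.hex(), hex(received), hex(misread))
```

Both hosts agree on the value as long as each converts between its own byte order and the network's standard order.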

7. Work out the details of implementing the switch statement using jump tables in
assembly using any flavor of conditional branch instruction. [Hint: After ensuring
that the value of the switch variable is within the bounds of valid case values, jump to
the start of the appropriate code segment corresponding to the current switch value,
execute the code and finally jump to exit.]

• First, check the bounds of the switch variable.
• If the variable is within the bounds, index into the jump table based on
the value of the switch variable. For example, for the second case,
you would jump to the second location listed in the jump table.
• Execute the code for the selected case.
• JALR or JMP to the return address.
• Continue with your original code.
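
The steps above can be sketched in Python, with a list of functions standing in for the table of code addresses. The case bodies and the names `JUMP_TABLE` and `switch` are hypothetical; in assembly the table would hold addresses and the call would be an indirect jump.

```python
def case_zero():
    return "zero"


def case_one():
    return "one"


def case_two():
    return "two"


# The jump table: entry i holds the "address" of the code for case i.
JUMP_TABLE = [case_zero, case_one, case_two]


def switch(value):
    # Step 1: check the bounds of the switch variable.
    if not 0 <= value < len(JUMP_TABLE):
        return "default"
    # Step 2: index into the jump table with the switch value.
    handler = JUMP_TABLE[value]
    # Steps 3 and 4: execute the selected case, then return to the common exit.
    return handler()


print(switch(1), switch(7))
```

Dispatch cost is the same for every case, since it takes one bounds check and one indexed jump instead of a chain of comparisons.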

8. Procedure A has important data in both S and T registers and is about to call
procedure B. Which registers should A store on the stack? Which registers should B
store on the stack?

Procedure A should save the T registers whose values it still needs before calling procedure B.
B should save any S registers it uses, and save any T registers it needs before calling another procedure.

9. Consider the usage of the stack abstraction in executing procedure calls. Do all
actions on the stack happen only via pushes and pops on to and from the top of the
stack? Explain circumstances that warrant reaching into other parts of the stack
during program execution. How is this accomplished?

No, the amount of memory included in the stack at any given time is controlled by simply
changing the stack pointer value. Then values may be read or written in locations defined
as offsets from the address stored in the stack pointer (or frame pointer).

10. Answer True/False with justification: Procedure call/return cannot be implemented
without a frame pointer.

False. A frame pointer is not necessary to implement procedure calls, but it can make
code simpler.

11. DEC VAX had a single instruction for loading and storing all the program visible
registers from/to memory. Can you see a reason for such an instruction pair?
Consider both the pros and cons.

Pros: If you are a caller and need to use all of the saved and temporary registers, you can
save them with one instruction. Likewise, if you want to save the current state of execution,
it can be done with one instruction.

Cons: In most cases you do not need all of the available registers, so this instruction uses more
memory (and possibly time) than required.

12. Show how you can simulate a subtract instruction using the existing LC-2200 ISA?
Since our system uses 2’s complement, the negative of a number X is NOT X plus 1.
The LC-2200 does not have a NOT instruction, but NAND of a value with itself serves the same function.

B ← B NAND B ; NOT B
B ← B + 1 ; -B
A ← A + B ; A + (-B), net result is A - B
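
The three steps can be checked in Python by masking every result to a fixed word size. The 32-bit width and the helper names (`nand`, `subtract`, `to_signed`) are assumptions made for this sketch.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word


def nand(x, y):
    """Bitwise NAND, truncated to one word."""
    return ~(x & y) & MASK


def subtract(a, b):
    b = nand(b, b)           # B <- B NAND B   (NOT B)
    b = (b + 1) & MASK       # B <- B + 1      (-B in 2's complement)
    return (a + b) & MASK    # A <- A + B      (A - B)


def to_signed(x):
    """Interpret a 32-bit word as a signed 2's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x


print(subtract(10, 3), to_signed(subtract(3, 10)))
```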

13. The BEQ instruction restricts the distance you can branch to from the current
position of the PC. If your program warrants jumping to a distance larger than that
allowed by the offset field of the BEQ instruction, show how you can accomplish such
“long” branches using the existing LC-2200 ISA.

BEQ $s0, $s1, Near
BEQ $zero, $zero, Skip
Near: JALR $s2, $zero
Skip: …

Note: Assume the address of the location that is a "long" way away is in $s2

14. What is an ISA and why is it important?

The ISA (Instruction Set Architecture) serves as a kind of contract that
enables all parties concerned with the design, implementation, and use of the
processor to know what is expected of them and what resources will be made available
by that processor.

As soon as an ISA is finalized:

• Implementation engineers can work out the details that will allow the processor
to meet the ISA specification.
• Assembler and compiler writers can create appropriate assemblers and compilers for use
with the processor long before a working model even exists.
• Operating system designers/maintainers can determine what needs to be done to
enable their operating system to run on this processor.
• I/O device engineers can design controllers and driver software that will be used with
the processor.
• Box (or equivalent) engineers can determine how to use the processor in their
designs, etc.

15. What are the influences on instruction set design?


Instruction sets are influenced by ease of implementation, efficiency, and ease of programming.
Larger instruction sets make writing compilers easier, but at the expense of speed or ease of
implementation. See also CISC and RISC.

16. What are conditional statements and how are they handled in the ISA?

Conditional statements compare 2 values in order to determine some sort of relation
(equality, equal to zero, positive, or negative). The ISA specifies which conditional
instructions are available. The LC-2200 features BEQ, but the LC-2110 had BR(N/Z/P).

17. Define the term addressing mode.

An addressing mode specifies how the bits of the instruction identify the locations of the
operands. For instance, some of the bits might be a register number or an offset to be added
to the PC, etc.

18. In Section 2.8, we mentioned that local variables in a procedure are allocated on the
stack. While this description is convenient for keeping the exposition simple, modern
compilers work quite differently. This exercise is for you to search the Internet and
find out how exactly modern compilers allocate space for local variables in a
procedure call. [Hint: Recall that registers are faster than memory. So, the objective
should be to keep as many of the variables in registers as possible.]

In practice, many local variables that would nominally live in memory on the stack are
kept in registers, because registers are significantly faster than memory. We have
already noted the conventions for saved and temporary registers, argument registers, and
return value and return address registers, all of which aim at increasing speed and
efficiency. In addition, modern optimizing compilers employ sophisticated register
allocation strategies designed to maximize the use of registers. However, arrays and
structures are generally kept on the stack and not in registers.

19. We use the term abstraction to refer to the stack. What is meant by this term? Does
the term abstraction imply how it is implemented? For example, is a stack used in a
procedure call/return a hardware device or a software device?

An abstraction is a way of simplifying a system by omitting unnecessary details while
defining its desired behavior. Normally, abstraction implies hiding
implementation details. For example, a queue is an abstraction that has to support
enqueuing and dequeuing, and it may be implemented with a linked list, an
array, etc. The stack that facilitates procedure call/return is a software abstraction
implemented with memory and register hardware, but it could also be implemented as a
separate device.

20. Given the following instructions



BEQ Rx, Ry, offset ; if (Rx == Ry) PC=PC+offset
SUB Rx, Ry, Rz ; Rx <- Ry - Rz
ADDI Rx, Ry, Imm ; Rx <- Ry + Immediate value
AND Rx, Ry, Rz ; Rx <- Ry AND Rz

Show how you can realize the effect of the following instruction:

BGT Rx, Ry, offset; if (Rx > Ry) PC=PC+offset

Assume that the registers and the Imm field are 8-bits wide. You can ignore the case that
the SUB instruction causes an overflow.

Solution:

If Rx > Ry then Ry - Rx < 0, so the sign bit of Ry - Rx is set.

SUB $at, Ry, Rx ; $at <- Ry - Rx
ADDI $t3, $zero, x80 ; Create the mask 1000 0000
AND $at, $at, $t3 ; Keep only the sign bit
BEQ $at, $t3, offset ; Branch if the sign bit is set (Rx > Ry)

Note: $t3 would need to be saved if in use already
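
The sign-bit test behind this sequence can be checked with a short Python sketch. It models 8-bit registers with a hypothetical helper, sign_bit_after_sub (not part of any LC-2200 tool), and, like the exercise, it skips operand pairs for which the subtraction would overflow.

```python
MASK_8 = 0xFF      # registers are assumed to be 8 bits wide
SIGN_BIT = 0x80    # the mask 1000 0000 built by the ADDI


def sign_bit_after_sub(rx: int, ry: int) -> int:
    """Return (Ry - Rx) AND 0x80 for signed 8-bit register values."""
    diff = (ry - rx) & MASK_8   # SUB $at, Ry, Rx, wrapped to 8 bits
    return diff & SIGN_BIT      # AND $at, $at, $t3


def fits_in_8_bits(value: int) -> bool:
    """True when a signed result needs no more than 8 bits."""
    return -128 <= value <= 127


# The masked result is 0x80 exactly when Rx > Ry (overflow cases skipped).
for rx in range(-128, 128):
    for ry in range(-128, 128):
        if fits_in_8_bits(ry - rx):
            assert (sign_bit_after_sub(rx, ry) == SIGN_BIT) == (rx > ry)
```

A masked result of 0x80 means the difference is negative, which is the Rx > Ry case; a result of 0 means Rx <= Ry.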

21. Given the following load instruction

LW Rx, Ry, OFFSET; Rx <- MEM [Ry + OFFSET]

Show how to realize a new addressing mode, called indirect, for use with the load
instruction that is represented in assembly language as:

LW Rx, @ (Ry);

The semantics of this instruction is that the contents of register Ry is the address of a
pointer to the memory operand that must be loaded in Rx.

Solution:

LW Rx, Ry, 0 ; Rx <- MEM[Ry], the pointer
LW Rx, Rx, 0 ; Rx <- MEM[pointer], the operand
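
As a rough illustration of the two-load idea, the Python sketch below models memory as a dictionary. The function name load_indirect and the sample addresses are assumptions made for this example only.

```python
def load_indirect(mem: dict[int, int], ry: int) -> int:
    """Model LW Rx, @(Ry) as two ordinary loads with offset 0."""
    rx = mem[ry + 0]   # LW Rx, Ry, 0 : Rx now holds the pointer
    rx = mem[rx + 0]   # LW Rx, Rx, 0 : Rx now holds the operand
    return rx


# Hypothetical memory image: address 10 holds a pointer to address 20.
memory = {10: 20, 20: 99}
print(load_indirect(memory, 10))  # prints 99
```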
22. Convert this statement:

g = h + A[i];

into LC-2200 assembly, assuming that the address of A is located in $t0, g is
in $s1, h is in $s2, and i is in $t1.

ADD $t2, $t0, $t1 ; Calculate address of A[i]
LW $t3, 0($t2) ; Load the contents of A[i] into a register
ADD $s1, $s2, $t3 ; Assign sum to g

23. Suppose you design a computer called the Big Looper 2000 that will never be used to
call procedures and that will automatically jump back to the beginning of memory
when it reaches the end. Do you need a program counter? Justify your answer.

The Big Looper 2000 still needs a PC to identify the address from which to fetch an
instruction on each cycle. The PC is also needed to calculate relative addresses such as
those used by branch instructions.

24. Consider the following program and assume that for this processor:

• All arguments are transmitted on the stack.


• Register V0 is for return values.
• The S registers are expected to be saved, that is, a calling routine can leave values in
the S registers and expect them to be there after a call.
• The T registers are expected to be temporary, that is a calling routine must not expect
values in the T registers to be preserved after a call.

int bar(int a, int b)
{
    /* Code that uses registers T5, T6, S11-S13; */
    return (1);
}
int foo(int a, int b, int c, int d, int e)
{
    int x, y;
    /* Code that uses registers T5-T10, S11-S13; */
    bar(x, y); /* call bar */
    /* Code that reuses register T6 & arguments a, b, & c; */
    return (0);
}
main(int argc, char **argv)
{
    int p, q, r, s, t, u;
    /* Code that uses registers T5-T10, S11-S15; */
    foo(p, q, r, s, t); /* call foo */
    /* Code that reuses registers T9, T10; */
}
Here is the stack when bar is executing. Clearly indicate in the spaces provided which
procedure (main, foo, bar) saved specific entries on the stack.

Main foo bar


_X__ ____ ____ p
_X__ ____ ____ q
_X__ ____ ____ r
_X__ ____ ____ s
_X__ ____ ____ t
_X__ ____ ____ u
_X__ ____ ____ T9
_X__ ____ ____ T10
____ _X__ ____ p
____ _X__ ____ q
____ _X__ ____ r
____ _X__ ____ s
____ _X__ ____ t
____ _X__ ____ x
____ _X__ ____ y
____ _X__ ____ S11
____ _X__ ____ S12
____ _X__ ____ S13
____ ____ ____ S14
____ ____ ____ S15
____ _X__ ____ T6
____ ____ _X__ x
____ ____ _X__ y
____ ____ _X__ S11
____ ____ _X__ S12
____ ____ _X__ S13 <----------- Top of Stack

Chapter 3 Processor Implementation…Solutions

1. What is the difference between level triggered logic and edge triggered logic? Which
do we use? Why?

In level triggered logic, register contents change state from current to new when the clock
signal is high. In edge triggered logic, the register contents change state on the rising or
falling edge of the clock. In edge triggered logic, if the change happens on the rising edge
it is referred to as positive edge triggered logic; as opposed to change happening on the
falling edge, which is referred to as negative edge triggered logic.
We use positive edge triggered logic.

Edge triggered logic avoids certain instability problems that are found
in level triggered circuits.

2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12
and Table 3.1), implement the sequential logic circuit for the garage door opener.
(Hint: The sequential logic circuit has 2 states and produces three outputs, namely,
next state, up motor control and down motor control).

3. Re-implement the above logic circuit using the ROM plus state register approach
detailed in this chapter.

4. Compare and contrast the various approaches to control logic design.

Microprogrammed:

Pros: Maintainable and flexible
Cons: Time and space inefficiency
Uses: Mainly complex instructions and quick non-pipelined prototyping of
architectures

Hardwired:

Pros: Amenable to pipelined implementation, potential for higher performance
Cons: Harder to change the design, longer design time
Uses: High performance pipelined implementation of architectures

5. One of the optimizations to reduce the space requirement of the control ROM based
design is to club together independent control signals and represent them using an
encoded field in the control ROM. What are the pros and cons of this approach?
What control signals can be clubbed together and what cannot be? Justify your
answer.

The main pro is that this approach can reduce the size of the control signal table (ROM).
The main con is that it adds a decoding step, which introduces delay in the datapath
when generating the control signals. Signals can share an encoded field only if they are
never asserted in the same cycle. Drive signals can be grouped together, since only one
element may drive the bus at a time. Load signals are harder to group, because multiple
storage elements may need to be clocked in the same clock cycle, and drive signals cannot
be grouped with load signals for the same reason.
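
The space saving can be sketched in a few lines of Python. This is an illustration under assumptions rather than the textbook's design: the signal list is hypothetical (DrPC is an assumed name, the others appear in the microcode later in this chapter), and code 0 is reserved to mean "drive nothing".

```python
# Drive signal names; DrPC is an assumed name, the rest appear in this document.
DRIVE_SIGNALS = ["DrPC", "DrALU", "DrReg", "DrMem", "DrOFF"]


def encoded_width(n_signals: int) -> int:
    """Bits needed for n mutually exclusive signals plus a 'none' code."""
    return n_signals.bit_length()


def decode(field: int, signals: list[str]) -> dict[str, int]:
    """Expand an encoded field into one-hot signals; code 0 asserts none."""
    return {name: int(field == code) for code, name in enumerate(signals, start=1)}


print(encoded_width(len(DRIVE_SIGNALS)))   # prints 3 (instead of 5 ROM bits)
print(decode(2, DRIVE_SIGNALS))            # only DrALU is 1
```

The decode function stands in for the extra decoder hardware, which is where the added delay comes from. Because the field holds a single code, it can never assert two signals at once, which is why only mutually exclusive signals can share it.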

6. What are the advantages and disadvantages of a bus-based datapath design?

The advantage of a bus-based datapath design is that data signals are available to every
piece of hardware in the circuit, so you do not have to worry about sending signals to
multiple devices because they all have access. The disadvantage of this design is that you
are limited to how many signals you can send on each clock cycle. For example, in a
single bus design, only one signal can be sent out each clock cycle, which makes the
datapath function less efficiently. In addition, there are cost and space problems that arise
out of having so many wires.

7. Consider a three-bus design. How would you use it for organizing the above datapath
elements? How does this help compared to the two-bus design?

For organizing the above datapath using a 3-bus design, it would look as follows:

This design would function more efficiently than the 2-bus design because it can pull the
operand values, transmit them to the ALU, and store the result in the register file all in
one step, as shown by the ADD instruction. A 2-bus design needs 2 clock cycles to
complete the ADD instruction, whereas the 3-bus design can do it in 1 clock cycle, using
the additional bus to drive the ALU result to the register file, where the IR can then supply
the destination register number to the register file to complete the write and therefore the
instruction.

8. To save time would it be possible to store intermediate values in registers on the
datapath such as the Program Counter or Instruction Register? Explain why or why
not.

No. Intermediate values are stored in registers so that they can be carried from one
instruction to another, which rules out the PC and the IR for this task: both are in use
fetching and decoding every instruction.

9. The Instruction Fetch is implemented in the text with first 4 states and then three.
What would have to be done to the datapath to make it two states long?

To implement the Instruction Fetch in 2 states, a second bus must be
incorporated into the datapath. One bus can then convey MEM[MAR] to the IR while the
other takes A+1 to the PC in the same state. Combining 2 of the present states into 1 in
this way eliminates 1 state overall.

10. How many words of memory will this code snippet require when assembled? Is space
allocated for “L1”?

    beq $s3, $s4, L1
    add $s0, $s1, $s2
L1: sub $s0, $s0, $s3

When the code snippet is assembled it requires 3 words, 1 per instruction. L1 is not
allocated space. Instead, during assembly the reference to L1 in the beq is
replaced with a number that identifies the instruction it refers to. In this case that is
the instruction at word 2 (counting from 0), after which the L1 label is no longer needed.
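
A minimal Python sketch of the first pass of an assembler makes the point concrete. The function assemble_layout is hypothetical and only counts words and records label addresses; it does not encode instructions.

```python
def assemble_layout(lines: list[str]) -> tuple[int, dict[str, int]]:
    """Count instruction words and record the word address of each label."""
    labels: dict[str, int] = {}
    address = 0
    for line in lines:
        text = line.strip()
        if ":" in text:                      # a label names the next word
            label, text = text.split(":", 1)
            labels[label.strip()] = address
            text = text.strip()
        if text:                             # each instruction takes 1 word
            address += 1
    return address, labels


program = [
    "beq $s3, $s4, L1",
    "add $s0, $s1, $s2",
    "L1: sub $s0, $s0, $s3",
]
words, labels = assemble_layout(program)
print(words, labels)  # prints: 3 {'L1': 2}
```

The label only exists in the assembler's symbol table; the word counter advances for instructions alone.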

11. What is the advantage of fixed length instructions?

The advantage of fixed length instructions is that they ease and simplify instruction
pipelining, allowing single-clock throughput at high frequencies. Variable length
instructions make it difficult to decouple memory fetches: the processor must fetch part of
an instruction, then decide whether to fetch more, and may miss in the cache before
the instruction is complete. Fixed length instructions allow the full instruction to be
fetched in one access, increasing speed and efficiency.

12. What is a leaf procedure?

A Leaf procedure refers to a procedure that does not call any other procedure.

13. For this portion of a datapath (assuming that all lines are 16 bits wide), fill in the
table below.

Time A B C D E F
1 0x42 0xFE 0 0 0 0
2 0 0 0x42 0xFE 0 0
3 0xCAFE 0x1 0 0 0x140 0
4 0 0 0xCAFE 0x1 0 0x140
5 0 0 0 0 0xCAFE 0
6 0 0 0 0 0 0xCAFE

14. Suppose you are writing an LC-2200 program and you want to jump a distance that is
farther than allowed by a BEQ instruction. Is there a way to jump to an address?

If you are writing an LC-2200 program and you want to jump a distance that is farther than
allowed by a BEQ instruction, you can jump using a JALR instruction. This instruction
stores PC+1 into Reg Y, where PC is the address of the current JALR instruction. Then it
branches to the address currently in Reg X, which can be any address you want to go to,
at any distance. You can also use JALR to perform an unconditional jump: after using
the JALR instruction, you simply discard the value stored in Reg Y.

15. Could the LC series processors be made to run faster if it had a second bus? If your
answer was no, what else would you need?

Another ALU
An additional mux
A TPRF

The LC series processor would not run faster with only a second bus, but it could run
faster if it also had a DPRF (Dual Ported Register File) instead of the temporary A and B
registers. Using a DPRF to supply the ALU operands lets the register file drive a value
onto each bus at the same time, so ALU operations, and therefore the processor overall,
run faster.

16. Convert this statement:

g = h + A[i];

into LC-2200 assembly, assuming that the address of A is located in $t0, g is
in $s1, h is in $s2, and i is in $t1.

Instruction0
RegSelLo DrReg LdA
goto Instruction1

Instruction1
DrOFF LdB
goto Instruction2

Instruction2
ALU_ADD DrALU LdMAR
goto Instruction3

Instruction3
DrMem LdA
goto Instruction4

Instruction4
RegSelLo DrReg LdB
goto Instruction5

Instruction5
ALU_ADD DrALU WrReg
halt

17. Suppose you design a computer called the Big Looper 2000 that will never be used to
call procedures and that will automatically jump back to the beginning of memory
when it reaches the end. Do you need a program counter? Justify your answer.

Yes, a program counter is still needed. Even in a computer that is never used to call
procedures and that automatically jumps back to the beginning of memory when it reaches
the end, the processor must know which instruction to fetch next, and the PC's purpose
includes pointing to the current instruction and implementing the branch and jump
instructions. Having no procedures does not mean having no instructions; the wrap-around
simply means that the PC returns to the first address after the last one.

18. In the LC-2200 processor, why is there not a register after the ALU?

In the LC-2200 processor, there is no register after the ALU because the results of
the ALU operation are written into the destination register that is pointed to by the IR, so
the results are driven directly from the ALU onto the bus to the proper register.

19. In the datapath diagram shown in Figure 3.15, why do we need the A and B registers
in front of the ALU? Why do we need MAR? Under what conditions would you be
able to do without any of these registers? [Hint: Think of additional ports in the
register file and/or buses.]

We need the A and B registers in front of the ALU because ALU operations (ADD,
NAND, A-B, A+1) require 2 operands, so we need temporary registers to hold at least one
of them: we can only get 1 register value at a time out of the register file, since there is
only 1 output port (Dout). Also, with only 1 bus, there is only 1 channel of communication
between any pair of datapath elements. Similarly, we need the MAR so there is a place to
hold the address sent by the ALU to the memory. We would be able to do without some of
these registers only if there were multiple buses and/or multiple output ports from the
register file, allowing multiple values to be communicated simultaneously, so that the
ALU could carry out its operations and the memory could look up data at a specified
address.

20. Core memory used to cost $0.01 per bit. Consider your own computer. What would be
a rough estimate of the cost if memory cost is $0.01/bit? If memory were still at that
price what would be the effect on the computer industry?

My laptop has 4 GB memory or 32 Gbits


Roughly: 32,000,000,000 bits

@ $0.01/bit

$320,000,000.00

My laptop would cost over 320 million dollars.

This would dramatically shrink the industry!
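
The estimate is easy to reproduce. The Python sketch below is only a worked calculation; the helper name memory_cost_dollars is made up for this example, and it shows both the rounded figure used above and the figure obtained if 4 GB is taken as 4 * 2^30 bytes.

```python
def memory_cost_dollars(num_bytes: int, cents_per_bit: int = 1) -> float:
    """Cost of a memory at the given price per bit ($0.01 = 1 cent)."""
    bits = num_bytes * 8
    return bits * cents_per_bit / 100


rough = memory_cost_dollars(4 * 10**9)   # 4 GB taken as 4 * 10^9 bytes
exact = memory_cost_dollars(4 * 2**30)   # 4 GB taken as 4 * 2^30 bytes
print(f"${rough:,.2f}")  # prints $320,000,000.00
print(f"${exact:,.2f}")  # prints $343,597,383.68
```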

21. If computer designers focused entirely on speed and ignored cost implications, what
would the computer industry look like today? Who would the customers be? Now
consider the same question reversed: If the only consideration was cost what would
the industry be like?

If computers were designed with a focus entirely on speed, ignoring cost implications, the
computer industry would only produce supercomputers with enormous processing power, and
the customers would only be large organizations with lots of money and a lot of data
to process, like governments, large universities, and large companies. However, if
computers were designed with a focus entirely on cost, they would be slow and clunky,
inefficient and of limited use, with a very limited set of operations and as little
hardware as possible, and the customers would be people doing only really simple
operations, like students.

22. Consider a CPU with a stack-based instruction set. Operands and results for
arithmetic instructions are stored on the stack; the architecture contains no general
purpose registers.

The data path shown on the next page uses two separate memories, a 65,536 (2^16) byte
memory to hold instructions and (non-stack) data, and a 256 byte memory to hold the
stack. The stack is implemented with a conventional memory and a stack pointer register.
The stack starts at address 0, and grows upward (to higher addresses) as data are pushed
onto the stack. The stack pointer points to the element on top of the stack (or is -1 if the
stack is empty). You may ignore issues such as stack overflow and underflow.

Memory addresses referring to locations in program/data memory are 16 bits. All data are 8
bits. Assume the program/data memory is byte addressable, i.e., each address refers to an
8-bit byte. Each instruction includes an 8-bit opcode. Many instructions also include a 16-
bit address field. The instruction set is shown below. Below, "memory" refers to the
program/data memory (as opposed to the stack memory).

OPCODE     INSTRUCTION   OPERATION

00000000   PUSH <addr>   Push the contents of memory at address <addr> onto the
                         stack

00000001   POP <addr>    Pop the element on top of the stack into memory at
                         location <addr>

00000010   ADD           Pop the top two elements from the stack, add them, and
                         push the result onto the stack

00000100   BEQ <addr>    Pop the top two elements from the stack; if they're equal,
                         branch to memory location <addr>

Note that the ADD instruction is only 8 bits, but the others are 24 bits. Instructions are
packed into successive byte locations of memory (i.e., do NOT assume all instructions use
24 bits).

Assume memory is 8 bits wide, i.e., each read or write operation to main memory accesses
8 bits of instruction or data. This means the instruction fetch for multi-byte instructions
requires multiple memory accesses.
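
To make the instruction semantics concrete, here is a small software model in Python. It is a sketch of the instruction set only, not the requested datapath or control unit. Several details are assumptions that the exercise does not specify: the 16-bit address is taken to be stored high byte first, the run stops at any unknown opcode (the exercise defines no halt instruction), and a Python list stands in for the stack memory and stack pointer.

```python
PUSH, POP, ADD, BEQ = 0b00000000, 0b00000001, 0b00000010, 0b00000100


def run(memory: list[int], pc: int = 0, max_steps: int = 100) -> list[int]:
    """Interpret the four instructions and return the final stack contents."""
    stack: list[int] = []                       # models the stack memory
    for _ in range(max_steps):
        opcode = memory[pc]
        if opcode == ADD:                       # 1-byte instruction
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & 0xFF)        # all data are 8 bits
            pc += 1
        elif opcode in (PUSH, POP, BEQ):        # 3-byte instructions
            # Assumed byte order: high byte of the address first.
            addr = (memory[pc + 1] << 8) | memory[pc + 2]
            pc += 3
            if opcode == PUSH:
                stack.append(memory[addr])
            elif opcode == POP:
                memory[addr] = stack.pop()
            elif stack.pop() == stack.pop():    # BEQ pops both operands
                pc = addr
        else:
            break                               # unknown opcode: stop (assumed)
    return stack


# Hypothetical program: memory[0x22] = memory[0x20] + memory[0x21]
memory = [0] * 0x30
program = [PUSH, 0x00, 0x20, PUSH, 0x00, 0x21, ADD, POP, 0x00, 0x22, 0xFF]
memory[:len(program)] = program
memory[0x20], memory[0x21] = 5, 7
run(memory)
print(memory[0x22])  # prints 12
```

Note how the program counter advances by 1 for ADD and by 3 for the other instructions, which is the software counterpart of the multiple memory accesses needed to fetch a multi-byte instruction.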

Datapath:
Complete the partial design shown on the next page.

Assume reading or writing the program/data memory or the stack memory requires a single
clock cycle to complete (actually, slightly less to allow time to read/write registers).
Similarly, assume each ALU requires slightly less than one clock cycle to complete an
arithmetic operation, and the zero detection circuit requires negligible time.

Control Unit:
Show a state diagram for the control unit indicating the control signals that must be
asserted in each state of the state diagram.

Solution

CH04 Interrupts, Traps and Exceptions…Solutions

1. Upon an interrupt what has to happen implicitly in hardware before control is
transferred to the interrupt handler?

The hardware has to save the program counter value implicitly before the control goes to the
handler. The hardware has to determine the address of the handler to transfer control from the
currently executing program to the handler. Depending on the architecture interrupts may also
be disabled.

2. Why not use JALR to return from the interrupt handler?


We check for interrupts at the end of each instruction execution. Therefore, between enable
interrupts and JALR $k0, we may get a new interrupt that will trash $k0. JALR only saves to
the return address register; if there are multiple interrupts then our $ra register will be
overwritten and we'll lose our link.

3. Put the following steps in the correct order


Save $k0 on stack
Enable interrupt
Save state
Actual work of the handler
Restore state
Disable interrupt
Restore $k0 from stack
Return from interrupt

4. How does the processor know which device has requested an interrupt?

Initially the processor does not know. However, once the processor acknowledges the
interrupt, the interrupting device will supply information to the processor as to its identity. For
example, the device might supply the address of its interrupt handler or it might supply a
vector into a table of interrupt handler addresses.

5. What instructions are needed to implement interruptible interrupts? Explain the
function and purpose of each along with an explanation of what would happen if you
didn't have them.

Interrupt handler:
! Assume interrupts are disabled when we enter
SW $k0, OFFSET($sp) ! store $k0 on stack to be able to return to
! original program
ADDI $sp, $sp, OFFSET ! reserves space on stack to save registers
EI ! enables interrupt
SW $registers($sp) ! save registers on stack to retrieve these
! registers when the interrupt finishes later

! Actual work of the handler


LW $registers($sp) ! restore registers from stack to retrieve the
! original values
DI ! disable interrupt to make sure we don't mess up
! with return address
LW $k0, OFFSET($sp) ! restore the original value for (return address)
RETI ! since we restore the return address, we can go
! back to the original program now
So the additional instructions are EI, DI and RETI.
Without these instructions an interrupt occurring while processing an interrupt could cause us
to lose our original interrupt return address.

6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.
___X___ disable interrupts;
___X___ save PC;
_______ save $k0;
_______ enable interrupts;
_______ save processor registers;
_______ execute device code;
_______ restore processor registers;
_______ disable interrupts;
_______ restore $k0;
___X___ disable interrupts;
___X___ restore PC;
___X___ enable interrupts;
_______ return from interrupt;

7. In the following actions in the INT macro state select the ONES THAT DO NOT
BELONG.
___X__ save PC;
___X__ save SP;
______ $k0←PC;
___X__ enable interrupts;
___X__ save processor registers;
______ ACK INT by asserting INTA;
______ Receive interrupt vector from device on the data bus;
______ Retrieve PC address from the interrupt vector table;
___X__ Retrieve SP value from the interrupt vector table;
______ disable interrupts
______ PC←PC retrieved from the vector table;
___X__ SP←SP value retrieved from the vector table;

CH05 Processor Performance and Rudiments of Pipelined Processor Design…Solutions

1. True or false: For a given workload and a given instruction-set architecture, reducing
the CPI (clocks per instruction) of all the instructions will always improve the
performance of the processor.
False. The execution time of a program depends on the number of instructions, the
average CPI, and the clock cycle time. If decreasing the average CPI requires us to
lengthen the clock cycle time, we might see no improvement or even a decrease in
performance.

2. An architecture has three types of instructions that have the following CPI:
Type CPI
A 2
B 5
C 3

An architect determines that she can reduce the CPI for B by some clever architectural trick,
with no change to the CPIs of the other two instruction types. However, she determines that
this change will increase the clock cycle time by 15%. What is the maximum permissible CPI
of B (round it up to the nearest integer) that will make this change still worthwhile? Assume
that all the workloads that execute on this processor use 40% of A, 10% of B, and 50% of C
types of instructions.

Old Clock Cycle Time = 1
New Clock Cycle Time = 1.15

Instruction   Old Time   New Time
A             2          2.30
B             5          x
C             3          3.45

Old Total Time = 2 * 40 + 5 * 10 + 3 * 50 = 280
New Total Time = 2.30 * 40 + x * 10 + 3.45 * 50 = 10x + 264.5
Speedup = Old Total Time / New Total Time > 1
1 < (2 * 40 + 5 * 10 + 3 * 50) / (2.30 * 40 + x * 10 + 3.45 * 50)
1 < 280 / (10x + 264.5)
10x + 264.5 < 280
10x < 15.5
x < 1.55

The maximum new time for B is 1.55 old clock cycles, which is equivalent to about 1.35 new
clock cycles (1.55 / 1.15). Therefore, the maximum integer CPI for B is 1, since any larger
CPI will push the speedup below 1, which means that the new architecture would be slower
than the old one and the change would not be worthwhile.
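
The break-even point can be checked numerically. The Python sketch below restates the calculation with the mix and CPI values from the question; the names MIX, OLD_CPI, and time_per_instruction are chosen for this example only.

```python
import math

MIX = {"A": 0.40, "B": 0.10, "C": 0.50}    # instruction mix from the question
OLD_CPI = {"A": 2, "B": 5, "C": 3}
CLOCK_STRETCH = 1.15                        # new cycle time / old cycle time


def time_per_instruction(cpi: dict[str, float], cycle_time: float) -> float:
    """Average time per instruction, in units of the old clock cycle."""
    return sum(MIX[t] * cpi[t] for t in MIX) * cycle_time


old_time = time_per_instruction(OLD_CPI, 1.0)

# Largest CPI for B at which the new design merely breaks even.
others = (MIX["A"] * OLD_CPI["A"] + MIX["C"] * OLD_CPI["C"]) * CLOCK_STRETCH
break_even_cpi_b = (old_time - others) / (MIX["B"] * CLOCK_STRETCH)

print(round(break_even_cpi_b, 2))      # about 1.35
print(math.floor(break_even_cpi_b))    # 1, the largest whole CPI that helps
```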

3. What would be the execution time for a program containing 2,000,000 instructions if the
processor clock was running at 8 MHz and each instruction takes 4 clock cycles?

exec. time = n * CPI ave * clock cycle time
exec. time = 2000000(4)(1/8000000)
exec. time = 1 sec
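
The same formula as a small Python function, given only as a convenience for checking the arithmetic (the name execution_time is arbitrary):

```python
def execution_time(instructions: int, cpi: float, clock_hz: float) -> float:
    """Execution time in seconds: instructions * CPI * clock cycle time."""
    return instructions * cpi / clock_hz


print(execution_time(2_000_000, 4, 8_000_000))  # prints 1.0
```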

4. A smart architect re-implements a given instruction-set architecture, halving the CPI for
50% of the instructions, while increasing the clock cycle time of the processor by 10%.
How much faster is the new implementation compared to the original? Assume that the
usage of all instructions is equally likely in determining the execution time of any
program for the purposes of this problem.

Old CPI = 1
Old Clock Cycle Time = 1
Old Total Time = 1 * 1 = 1
New CPI = 0.5 * 0.5 + 0.5 * 1 = 0.75 (half of the instructions at half the CPI)
New Clock Cycle Time = 1.1
New Total Time = 0.75 * 1.1 = 0.825
Speedup = Old Total Time / New Total Time = 1 / 0.825 = 1.21

The new implementation has a speedup of 1.21, meaning that the new implementation is about
21% faster than the old implementation.

5. A certain change is being considered in the non-pipelined (multi-cycle) MIPS CPU
regarding the implementation of the ALU instructions. This change will enable one to
perform an arithmetic operation and write the result into the register file all in one clock
cycle. However, doing so will increase the clock cycle time of the CPU. Specifically, the
original CPU operates on a 500 MHz clock, but the new design will only execute on a 400
MHz clock. Will this change improve, or degrade performance? How many times faster
(or slower) will the new design be compared to the original design? Assume instructions
are executed with the following frequency:
Instruction Frequency
LW 25%
SW 15%
ALU 45%
BEQ 10%
JMP 5%

Compute the CPI of both the original and the new CPU. Show your work in coming up with your
answer.
Cycles per Instruction
Instruction CPI
LW 5
SW 4
ALU 4
BEQ 3
JMP 3

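CPI old = (0.25)(5)+(0.15)(4)+(0.45)(4)+(0.10)(3)+(0.05)(3)
CPI old = (1.25)+(0.6)+(1.8)+(0.3)+(0.15)
CPI old = 4.1

With the arithmetic operation and the register write done in one clock cycle, an ALU
instruction takes 3 cycles instead of 4:

CPI new = (0.25)(5)+(0.15)(4)+(0.45)(3)+(0.10)(3)+(0.05)(3)
CPI new = (1.25)+(0.6)+(1.35)+(0.3)+(0.15)
CPI new = 3.65

Time old = 4.1 / 500 MHz
Time old = 0.0082 microseconds per instruction

Time new = 3.65 / 400 MHz
Time new = 0.009125 microseconds per instruction

This change degrades performance: the new design is 0.009125 / 0.0082 = 1.11 times slower
than the original design.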

6. Given the CPI of various instruction classes


Class CPI
R-type 2
I-type 10
J-type 3
S-type 4

And instruction frequency as shown:

Class   Program 1   Program 2
R       3           10
I       3           1
J       5           2
S       2           3

Which code will execute faster and why?

Program 1 Total Cycles = 3 * 2 + 3 * 10 + 5 * 3 + 2 * 4
= 6 + 30 + 15 + 8
= 59

Program 2 Total Cycles = 10 * 2 + 1 * 10 + 2 * 3 + 3 * 4
= 20 + 10 + 6 + 12
= 48

Program 2 will execute faster since the total number of cycles it will take to execute is less
than the total number of cycles of Program 1.
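
The comparison amounts to a weighted sum per program, which a short Python sketch can reproduce. The dictionary names are chosen for this example; the numbers are those from the two tables above.

```python
CPI = {"R": 2, "I": 10, "J": 3, "S": 4}
PROGRAM_1 = {"R": 3, "I": 3, "J": 5, "S": 2}    # instruction counts
PROGRAM_2 = {"R": 10, "I": 1, "J": 2, "S": 3}


def total_cycles(counts: dict[str, int]) -> int:
    """Sum of (instruction count * CPI) over all instruction classes."""
    return sum(counts[cls] * CPI[cls] for cls in counts)


print(total_cycles(PROGRAM_1), total_cycles(PROGRAM_2))  # prints 59 48
```

Program 2 executes more instructions (16 versus 13) yet needs fewer cycles, because it uses the expensive I-type class only once.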

7. What is the difference between static and dynamic instruction frequency?


Static instruction frequency measures how often instructions appear in compiled code.
Dynamic instruction frequency measures how often instructions are executed while a program
is running.

8. Given
Instruction CPI
Add 2
Shift 3
Others 2 (average for all instructions including Add and Shift)
Add/Shift 3

If the sequence ADD followed by SHIFT appears in 20% of the dynamic frequency of a
program, what is the percentage improvement in the execution time of the program with all
{ADD, SHIFT} replaced by the new instruction?

Old Total CPI = 80 * 2 + 20 * (2 + 3)
= 260
New Total CPI = 80 * 2 + 20 * 3
= 220
Speedup = Old Total CPI / New Total CPI
= 260 / 220
= 1.18
The speedup is 1.18, so the new program runs about 18% faster; equivalently, its execution
time falls by (260 - 220) / 260, or about 15%.
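
The two ways of quoting the improvement (speedup versus reduction in execution time) can be computed side by side. This Python sketch follows the same 80/20 weighting as the calculation above; the function name total_cycles is chosen for this example.

```python
def total_cycles(pair_share: int, pair_cpi: int, other_cpi: int = 2) -> int:
    """Cycles per 100 units of work: the rest at other_cpi, pairs at pair_cpi."""
    return (100 - pair_share) * other_cpi + pair_share * pair_cpi


old = total_cycles(20, 2 + 3)   # ADD (2 cycles) followed by SHIFT (3 cycles)
new = total_cycles(20, 3)       # fused Add/Shift instruction (3 cycles)

speedup = old / new
time_saved = (old - new) / old

print(round(speedup, 2))           # prints 1.18
print(round(time_saved * 100, 1))  # prints 15.4
```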

9. Compare and contrast structural, data and control hazards. How are the potential
negative effects on pipeline performance mitigated?

Type of hazard: Structural
    Reason for hazard: Hardware limitations
    Potential stalls: 0, 1, 2, 3
    Fix: Feedback lines, or live with it

Type of hazard: Data (RAW)
    Reason for hazard: An instruction reads the value of a register before it has been
    updated by the pipeline
    Fix: Data forwarding, NOP (for the LW instruction)

Type of hazard: Data (WAR)
    Reason for hazard: An instruction writes a new value to a register while another
    instruction is still reading the old value, before the pipeline updates it
    Potential stalls: No problem, since the data was already copied into the pipeline
    buffer
    Fix: N/A

Type of hazard: Data (WAW)
    Reason for hazard: An instruction writes a new value into a register that was
    previously written to
    Potential stalls: No problem, since the old value was probably useless anyway
    Fix: N/A

Type of hazard: Control
    Reason for hazard: Breaks in the sequential execution of a program because of a
    branch instruction
    Potential stalls: 1, 2
    Fix: Branch prediction or delayed branch

Notes:
a) Feedback Lines: are used in the hardware implementation of the pipeline. These tell the
previous stage of the pipeline that the current instruction is still being processed, hence do not
send any more to process. They also send a NOP, dummy operation, to the next stage of the
pipeline until the current instruction is ready to proceed to the next stage.

b) Data forwarding: the EX stage forwards the new value of the written register to the ID/RR
stage, so that it can process the right values of the updated register.

c) Branch Prediction: the idea here is to predict the outcome of the branch and let the
instructions flow along the pipeline based on this prediction. If the prediction is correct, it
completely gets rid of any stalls. If the prediction is not correct, it will create 2 stalls.

d) Delayed branch: the idea here is to find a useful instruction that we can feed the pipeline with,
while we test the branch instruction.

10. How can a read after write (RAW) hazard be minimized or eliminated?

RAW hazards can be minimized, and in most cases eliminated, by adding data forwarding to a
pipeline. Doing this involves adding mechanisms that send data being written to a register
back to earlier stages of the pipeline that want to read from the same register.

11. What is a branch target buffer and how is it used?


The branch target buffer (BTB) is a hardware device used to save time by saving the address
of each branch's target, that is, the address of the instruction to which the branch jumps if
taken. Upon encountering the branch for the first time, the target address is calculated and
stored in the BTB. Thereafter, if the branch is taken the address is simply retrieved from
the BTB.
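
The behaviour described above can be modelled as a small lookup table. The Python class below is a software sketch under assumptions, not a hardware design: it is keyed by the branch's own address, it has unlimited capacity, and it computes the target as PC + 1 + offset, which is assumed here as the LC-2200 convention.

```python
class BranchTargetBuffer:
    """Remembers the target address of each branch seen so far."""

    def __init__(self) -> None:
        self._targets: dict[int, int] = {}
        self.computed = 0          # how many times a target had to be calculated

    def target(self, branch_pc: int, offset: int) -> int:
        """Return the branch target, calculating it only on the first encounter."""
        if branch_pc not in self._targets:
            self._targets[branch_pc] = branch_pc + 1 + offset  # assumed formula
            self.computed += 1
        return self._targets[branch_pc]


btb = BranchTargetBuffer()
print(btb.target(100, 5))   # first encounter: calculated, prints 106
print(btb.target(100, 5))   # later encounters: retrieved, prints 106
print(btb.computed)         # prints 1
```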

12. Why is a second ALU needed in the Execute stage of the pipeline?
This second ALU is needed for instructions that change the PC of the processor, since the first
ALU is dedicated to performing arithmetic on the contents of registers.

13. In a processor with a five stage pipeline as discussed in the class and shown in the
picture below (with buffers between the stages), explain the problem posed by branch
instruction. Present a solution.
The problem presented by a branch instruction is that as it moves into the decode stage the
fetch stage does not yet know whether to fetch the next instruction or the branch target
instruction.
Without other changes this requires the pipeline to stall until it has the result of the branch
comparison.
Possible solutions to ameliorate this problem:
a.) Perform the comparison in the decode stage.
b.) Change policy to unconditionally execute the instruction following the branch
regardless of the outcome of the branch.
c.) Install a branch predictor which will perhaps record the previous results of a branch
and predict the same outcome. Due to the nature of loops this should have a very high
success rate.
d.) Install a BTB as discussed in question 11.

14. Regardless of whether we use a conservative approach or branch prediction (branch
not taken), explain why there is always a 2-cycle delay if the branch is taken (i.e., 2
NOPs injected into the pipeline) before normal execution can resume in the 5-stage
pipeline used in Section 5.13.3.
The processor doesn't know whether the branch is taken until the BEQ instruction is in the
execution stage. Since there is no way to load the instruction from the new PC before the branch
is resolved, new instructions can only be loaded after the BEQ instruction has left the execution
stage, which will occur 2 cycles after it has been loaded.

15. With reference to Figure 5.6a, identify and explain the role of the datapath elements that
deal with the BEQ instruction. Explain in detail what exactly happens cycle by cycle
with respect to this datapath during the passage of a BEQ instruction. Assume a
conservative approach to handling the control hazard. Your answer should include
both the cases of branch taken and branch not taken.
Cycle 1: The BEQ instruction is fetched from memory into the FBUF in the IF stage of
the pipeline.
Cycle 2: The BEQ instruction is decoded into the DBUF in the ID/RR stage of the
pipeline. Also in this cycle, the next instruction is fetched.
Cycle 3: The BEQ instruction is executed (or tested) in the EX stage of the pipeline. Also
in this cycle, the instruction after the BEQ is stalled and a NOP instruction is
fed to the ID/RR stage.
Cycle 4a-1: If the BEQ is not taken, the pipeline continues its process normally.
Cycle 4a-2: The BEQ is fed into the WB stage, followed by 1 NOP and the non-conditional
instruction.
Cycle 4b-1: If the BEQ is taken, the previously stalled instruction (the one coming after the
BEQ) is replaced by the new instruction (the one dictated by the BEQ), which is
fetched into the IF stage, and a NOP is fed into the ID/RR stage. The BEQ is fed
into the MEM stage.
Cycle 4b-2: The BEQ is fed into the WB stage, followed by 2 NOPs and the conditional
instruction.

16. A smart engineer decides to reduce the 2-cycle "branch-taken" penalty in the 5-stage
pipeline down to 1 cycle. Her idea is to directly use the branch target address computed
in the EX cycle to fetch the instruction (note that the approach presented in Section
5.13.3 requires the target address to be saved in PC first.)
a) Show the modification to the datapath in Figure 5.6a to implement this idea [hint:
you have to simultaneously feed the target address to the PC and the Instruction
memory if the branch is taken].
b) While this reduces the bubbles in the pipeline to 1 for branch taken, it may not be a
good idea. Why? [hint: consider cycle time effects.]

a) The output of the second ALU in the EX stage should be available as the input to the
Instruction Memory as well as to the PC in the IF stage. Two extra switches should be added
to determine which address, the contents of the PC or the address computed in the EX
stage, is driven into the Instruction Memory. Another ALU would also need to be added to
increment the computed address before it is saved to the PC, since in the next cycle the
address of the next instruction needs to be in the PC.
b) This adds extra circuitry to the processor, which may require an increase in cycle time to
work correctly. This increase in cycle time may increase the run time of processes more
than the reduced CPI of the BEQ instruction would decrease the run time.
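
The trade-off in (b) follows from execution time = instruction count x CPI x cycle time. The sketch below uses purely hypothetical numbers (a CPI drop from 1.20 to 1.15 paid for with a 10% longer cycle); none of these values come from the text, and they are chosen only to show that the longer cycle can outweigh the lower CPI.

```python
def time_per_instruction_ns(cpi, cycle_time_ns):
    """Average time per instruction = CPI x cycle time."""
    return cpi * cycle_time_ns


# Hypothetical values, for illustration only.
baseline = time_per_instruction_ns(cpi=1.20, cycle_time_ns=1.00)
modified = time_per_instruction_ns(cpi=1.15, cycle_time_ns=1.10)

# True when the "faster branch" design is actually slower overall.
slower = modified > baseline
```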

17. In a pipelined processor where each instruction could be broken up into 5 stages and
where each stage takes 1 ns what is the best we could hope to do in terms of average
time to execute 1,000,000,000 instructions?

Since the pipeline has 5 stages and each stage takes 1 ns, the first instruction takes 5 ns to
complete, but every subsequent instruction completes only 1 ns after the one before it.
Then:
Best total time = 5 ns + 999,999,999 ns = 1,000,000,004 ns
Best average time per instruction = 1,000,000,004 ns / 1,000,000,000, which is approximately 1 ns
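
The arithmetic can be checked with a short sketch, assuming an ideal pipeline with no stalls or hazards. The function name and parameter names are assumptions introduced for this example.

```python
def pipeline_total_time_ns(instructions, stages=5, stage_time_ns=1):
    """Ideal pipeline: the first instruction needs `stages` cycles,
    and each later instruction completes one cycle after the previous one."""
    return (stages + instructions - 1) * stage_time_ns


total = pipeline_total_time_ns(1_000_000_000)   # total time in ns
average = total / 1_000_000_000                 # ns per instruction
```

The average approaches 1 ns per instruction because the 4 ns fill time of the pipeline is spread over one billion instructions.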

18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:
a) Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ
instruction of LC-2200.
b) Considering only the BEQ instruction, compute the sizes of the FBUF, DBUF, EBUF,
and MBUF.
a) IF stage (cycle 1):
I-MEM [PC] -> FBUF; // instruction at PC placed in FBUF
PC + 1 -> PC; // increment PC

ID/RR stage (cycle 2):
DPRF [FBUF [Rx] ] -> DBUF [A]; // load contents of Rx into DBUF
DPRF [FBUF [Ry] ] -> DBUF [B]; // load contents of Ry into DBUF
FBUF [OPCODE] -> DBUF [OPCODE]; // copy opcode into DBUF
EXT [FBUF [OFFSET] ] -> DBUF [OFFSET]; // copy sign-extended offset into DBUF

EX stage (cycle 3):
DBUF [OPCODE] -> EBUF [OPCODE]; // copy opcode into EBUF
IF LDZ [DBUF [A] - DBUF [B] ] == 1 // check if A - B is 0
PC + DBUF [OFFSET] -> PC; // load PC + offset into PC

MEM stage (cycle 4):
EBUF -> MBUF; // copy the EBUF into MBUF

WB stage (cycle 5):

b) Size of FBUF: instruction (32 bits) = 32 bits
Size of DBUF: (Rx) (32 bits) + (Ry) (32 bits) + opcode (4 bits) + sign-extended offset (32 bits) = 100 bits
Size of EBUF: opcode (4 bits) = 4 bits
Size of MBUF: opcode (4 bits) = 4 bits
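
The buffer sizes above are just sums of field widths, which the following sketch tallies. The dictionary name and field labels are assumptions made for the example; the widths are the ones given in the answer.

```python
# Field widths (in bits) carried by each pipeline buffer for BEQ,
# as listed in the answer above.
BEQ_BUFFER_FIELDS = {
    "FBUF": {"instruction": 32},
    "DBUF": {"Rx contents": 32, "Ry contents": 32,
             "opcode": 4, "sign-extended offset": 32},
    "EBUF": {"opcode": 4},
    "MBUF": {"opcode": 4},
}

# Size of each buffer = sum of the widths of its fields.
sizes = {name: sum(fields.values())
         for name, fields in BEQ_BUFFER_FIELDS.items()}
```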
19. Repeat problem 18 for the SW instruction of LC-2200.
SW instruction
FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) offset (16 bits)
DBUF: Opcode (8 bits) (rA) (32 bits) (rB) (32 bits) sign extended offset (32 bits)

EBUF: Opcode (8 bits) (rA) (32 bits) (rB)+offset (32 bits)

MBUF: Opcode (8 bits)

20. Repeat problem 18 for JALR instruction of LC-2200.


FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) PC + 1 (32 bits)
DBUF: Opcode (8 bits) (rA) (32 bits) rB (4 bits) PC + 1 (32 bits)

EBUF: Opcode (8 bits) rB (4 bits) PC + 1 (32 bits)

MBUF: Opcode (8 bits) rB (4 bits) PC + 1 (32 bits)

22. Consider
I1: R1 <- R2 + R3
I2: R4 <- R1 + R5
If I2 is immediately following I1 in the pipeline with no forwarding, how many bubbles (i.e.
NOPs) will result in the above execution? Explain your answer.
Three bubbles will appear during execution, because when I2 is in the ID/RR stage, it must
wait until I1 has left the WB stage, since I2 needs to read R1, but I1 is still writing to R1.
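
The count of three bubbles can be reproduced with a small cycle-counting sketch. It assumes, as the answer does, that the consumer cannot read the register in ID/RR until the cycle after the producer's WB stage, and that there is no forwarding. The function name and the `distance` parameter are assumptions introduced for illustration.

```python
STAGES = ["IF", "ID/RR", "EX", "MEM", "WB"]


def raw_bubbles(distance):
    """Bubbles for a RAW hazard with no forwarding.

    `distance` is how many instructions after the producer the consumer
    is issued (1 means immediately following).
    """
    # Producer enters IF in cycle 1 and WB in cycle 5, so it has left WB in cycle 6.
    producer_leaves_wb = len(STAGES) + 1
    # Consumer enters IF in cycle 1 + distance and would read registers one cycle later.
    consumer_normal_read = 1 + distance + 1
    return max(0, producer_leaves_wb - consumer_normal_read)


bubbles_i2 = raw_bubbles(1)  # I2 immediately follows I1
```

With this model, placing independent instructions between I1 and I2 reduces the bubbles one for one, reaching zero at a distance of four.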
21. You are given the pipelined datapath for a processor as shown below.

LW instruction
FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) offset (16 bits)

DBUF: Opcode (8 bits) (rA) (32 bits) (rB) (32 bits) offset (32 bits)

EBUF: Opcode (8 bits) rA+offset (32 bits) (rB) (32 bits)

MBUF: Opcode (8 bits) data@mem[A+offset] (32 bits) (rB) (32 bits)

23. Consider the following program fragment:


Address instruction
1000 ADD
1001 NAND
1002 LW
1003 ADD
1004 NAND
1005 ADD
1006 SW
1007 LW
1008 ADD

Assume that there are no hazards in the above set of instructions. Currently the IF stage is
about to fetch instruction at 1004.

(a) Show the state of the 5-stage pipeline


(b) Assuming we use the “drain” approach to dealing with interrupts, how many cycles will
elapse before we enter the INT macro state? What is the value of PC that will be stored
in the INT macro state into $k0?
(c) Assuming the “flush” approach to dealing with interrupts, how many cycles will elapse
before we enter the INT macro state? What is the value of PC that will be stored in the
INT macro state into $k0?

(a) State of the pipeline:
IF: NAND
ID: ADD
EX: LW
MEM: NAND
WB: ADD

(b) 5 cycles would elapse, since all the instructions in the pipeline would need to
complete their execution.

(c) 1 cycle would elapse, since all registers in the pipeline would clear their contents at
once, so we would enter the macro state on the next cycle after the interrupt. The
value of PC that will be stored in $k0 is 1000, since that is the last instruction to
have completed its execution.

CH06 Processor Scheduling…Solutions

1. Compare and contrast process and program.

A process is a program in execution. A program is static, has no state, and has a fixed
size on disk, whereas a process is dynamic, exists in memory, may grow or shrink, and
has associated with it “state” that represents the information associated with the
execution of the program.

2. What items are considered to comprise the state of a process?


The address space contents and the register values in use constitute the state of the process,
including the PC and stack pointer values. The state can also include scheduling properties
such as priority and arrival time.
3. Which metric is the most user-centric in a timesharing environment?

The response time of a job is the most user-centric metric in a timesharing environment.

4. Consider a pre-emptive priority processor scheduler. There are three processes P1,
P2, and P3 in the job mix that have the following characteristics:

Process   Arrival Time   Priority   Activity
P1        0 sec          1          8 sec CPU burst followed by
                                    4 sec I/O burst followed by
                                    6 sec CPU burst and quit
P2        2 sec          3          64 sec CPU burst and quit
P3        4 sec          2          2 sec CPU burst followed by
                                    2 sec I/O burst followed by
                                    2 sec CPU burst followed by
                                    2 sec I/O burst followed by
                                    2 sec CPU burst followed by
                                    2 sec I/O burst followed by
                                    2 sec CPU burst and quit

Diagram showing process execution:

What is the turnaround time for each of P1, P2, and P3?

P1-turnaround-time = 88 seconds
P2-turnaround-time = 64 seconds
P3-turnaround-time = 76 seconds

What is the average waiting time for this job mix?



Wait-time (P1) = 70 seconds
Wait-time (P2) = 0 seconds
Wait-time (P3) = 62 seconds
Average-wait-time = 44 seconds
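
The turnaround and waiting times above can be reproduced with a small tick-by-tick simulation. This is a sketch under stated assumptions, not the textbook's method: a larger priority number is taken to mean higher priority (which is consistent with P2 never waiting), I/O bursts are assumed to proceed in parallel without contention for a device, and waiting time is taken as turnaround time minus total CPU and I/O time. The function name `simulate` and the `JOBS` layout are introduced only for this example.

```python
def simulate(processes):
    """Preemptive priority scheduling in 1-second ticks.

    processes maps name -> (arrival, priority, bursts), where bursts
    alternates CPU, I/O, CPU, ... and ends with a CPU burst.
    Returns name -> finish time.
    """
    state = {
        name: {"prio": prio, "bursts": bursts, "idx": 0,
               "left": bursts[0], "ready_at": arrival}
        for name, (arrival, prio, bursts) in processes.items()
    }
    finish, t = {}, 0
    while len(finish) < len(state):
        ready = [n for n, s in state.items()
                 if n not in finish and s["ready_at"] <= t]
        if ready:
            # Assumption: a larger number means a higher priority.
            name = max(ready, key=lambda n: state[n]["prio"])
            s = state[name]
            s["left"] -= 1                      # run for one tick
            if s["left"] == 0:                  # CPU burst finished
                s["idx"] += 1
                if s["idx"] == len(s["bursts"]):
                    finish[name] = t + 1        # no bursts left: process quits
                else:
                    io_time = s["bursts"][s["idx"]]
                    s["ready_at"] = t + 1 + io_time   # back after the I/O burst
                    s["idx"] += 1
                    s["left"] = s["bursts"][s["idx"]]
        t += 1
    return finish


JOBS = {
    "P1": (0, 1, [8, 4, 6]),
    "P2": (2, 3, [64]),
    "P3": (4, 2, [2, 2, 2, 2, 2, 2, 2]),
}
finish = simulate(JOBS)
turnaround = {n: finish[n] - JOBS[n][0] for n in JOBS}
wait = {n: turnaround[n] - sum(JOBS[n][2]) for n in JOBS}
average_wait = sum(wait.values()) / len(wait)
```

Under these assumptions P2 preempts P1 at time 2 and runs to completion, after which P3 and P1 interleave: P1 uses the CPU only while P3 is doing I/O.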

5. What are the important deficiencies of FCFS CPU scheduling?

FCFS has a potentially huge variation in response time and poor processor utilization, both
caused by the convoy effect.

6. Name a scheduling algorithm where starvation is impossible?

First-Come First-Served (FCFS)

7. Which scheduling algorithm was noted as having a high variance in turnaround
time?

First-Come First-Served (FCFS)

8. Which scheduling algorithm is provably optimal (minimum average wait time)?

Shortest Job First (SJF)

9. Which scheduling algorithm must be preemptive?

Round Robin

10. Given the following processes that arrived in the order shown

Process   CPU Burst Time   IO Burst Time
P1        3                2
P2        4                3
P3        8                4

Show the activity in the processor and the I/O area using the FCFS, SJF, and Round
Robin algorithms.
Assuming each process requires a CPU burst followed by an I/O burst followed by a final
CPU burst (as in Example 1 in Section 6.6):

FCFS

SJF

Round Robin (assumed timeslice = 2)

11. Redo Example 1 in Section 6.6 using SJF and round robin (timeslice = 2)

SJF
a)
b) Response time (P1) = 28
Response time (P2) = 20
Response time (P3) = 7

c) Wait-time (P1) = 10
Wait-time (P2) = 5
Wait-time (P3) = 0

Round Robin
a)

b) Response time (P1) = 30
Response time (P2) = 26
Response time (P3) = 13

c) Wait-time (P1) = 12
Wait-time (P2) = 11
Wait-time (P3) = 6

12. Redo Example 3 in Section 6.6 using FCFS and round robin (timeslice = 2)

FCFS
a)

b) Wait-time (P1) = 6
Wait-time (P2) = 6
Wait-time (P3) = 22
c) Total time = 32
Throughput = 3/32 processes per unit time

Round Robin
a)

b) Wait-time (P1) = 11
Wait-time (P2) = 16
Wait-time (P3) = 7

c) Total time = 28
Throughput = 3/28 processes per unit time