

1. The suboperations performed in each segment of the pipeline are as follows:

R1 <- Ai, R2 <- Bi          Input Ai and Bi
R3 <- Ci, R4 <- Di          Input Ci and Di
R5 <- R1+R2, R6 <- R3+R4    Add the inputs
R7 <- R5*R6                 Multiply

Each segment contains one or two registers and a combinational circuit, as shown in the configuration below:

[Fig. 1: Pipeline Configuration for (Ai+Bi)*(Ci+Di), with inputs Ai, Bi, Ci, Di and output register R7]

The following table shows the contents of all the registers for i = 1 through 6:

Clock    Segment 1            Segment 2        Segment 3
Pulse    R1   R2   R3   R4    R5      R6       R7
1        A1   B1   C1   D1    -       -        -
2        A2   B2   C2   D2    A1+B1   C1+D1    -
3        A3   B3   C3   D3    A2+B2   C2+D2    (A1+B1)*(C1+D1)
4        A4   B4   C4   D4    A3+B3   C3+D3    (A2+B2)*(C2+D2)
5        A5   B5   C5   D5    A4+B4   C4+D4    (A3+B3)*(C3+D3)
6        A6   B6   C6   D6    A5+B5   C5+D5    (A4+B4)*(C4+D4)
7        -    -    -    -     A6+B6   C6+D6    (A5+B5)*(C5+D5)
8        -    -    -    -     -       -        (A6+B6)*(C6+D6)
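The register contents can be reproduced with a small simulation. This is a minimal sketch, assuming all seven registers are loaded on the same clock pulse and that an empty register can be modelled as `None`; the function name `simulate_pipeline` and the sample input values are illustrative and not part of the original solution.

```python
def simulate_pipeline(a, b, c, d):
    """Return one dict per clock pulse with the contents of R1..R7 after that pulse."""
    n = len(a)
    regs = dict.fromkeys(["R1", "R2", "R3", "R4", "R5", "R6", "R7"])  # all None
    history = []
    # A 3-segment pipeline needs n + 2 pulses to finish n tasks.
    for pulse in range(n + 2):
        old = regs  # every register is clocked from the previous contents
        regs = {
            # Segment 1: load the next set of inputs, if any remain.
            "R1": a[pulse] if pulse < n else None,
            "R2": b[pulse] if pulse < n else None,
            "R3": c[pulse] if pulse < n else None,
            "R4": d[pulse] if pulse < n else None,
            # Segment 2: add the values held by segment 1.
            "R5": old["R1"] + old["R2"] if old["R1"] is not None else None,
            "R6": old["R3"] + old["R4"] if old["R3"] is not None else None,
            # Segment 3: multiply the two sums.
            "R7": old["R5"] * old["R6"] if old["R5"] is not None else None,
        }
        history.append(regs)
    return history


# Illustrative inputs for i = 1 through 6.
rows = simulate_pipeline([1, 2, 3, 4, 5, 6], [10] * 6, [1] * 6, [2] * 6)
for pulse, regs in enumerate(rows, start=1):
    print(pulse, regs)
```

The first product appears in R7 on the third pulse and the last one on the eighth, which is the pattern the table describes.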

QUESTION 9.2
First, we determine the number of clock cycles.
Number of segments, k = 6
Number of tasks, n = 8
Number of clock cycles = k + (n-1) = 6 + (8-1) = 13 clock cycles

The space-time diagram for a 6-segment pipeline is shown below:

Clock cycle  1   2   3   4   5   6   7   8   9   10  11  12  13
Segment 1    T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -   -
Segment 2    -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -   -
Segment 3    -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -   -
Segment 4    -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -   -
Segment 5    -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8  -
Segment 6    -   -   -   -   -   T1  T2  T3  T4  T5  T6  T7  T8

It takes 13 clock cycles to process 8 tasks in a 6-segment pipeline.

QUESTION 9.3
Number of segments, k = 6
Number of tasks, n = 200
Number of clock cycles = k + (n-1)
Therefore, 6 + (200-1) = 205 clock cycles

QUESTION 9.4
i. Speedup, S = (n * tn) / ((k + n - 1) * tp)

Number of tasks, n = 100; tn = 50ns; tp = 10ns; number of segments, k = 6

S = (100 * 50) / ((6 + 100 - 1) * 10) = 5000/1050 = 100/21 = 4.76

Hence, the speedup ratio = 100/21 = 4.76

ii. For the maximum speedup, we assume tn = k*tp, i.e.
Smax = tn/tp = (k * tp)/tp = (6 * 10)/10 = 6

Hence, the maximum speedup that can be achieved is 6. Note that with the given tn = 50ns, the ratio tn/tp that S approaches for large n is 50/10 = 5; the value 6 holds only under the assumption tn = k*tp.
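The cycle counts in Questions 9.2 and 9.3 and the speedup in Question 9.4 can be checked with a few lines of code. This is a minimal sketch, assuming one task enters the pipeline per clock cycle; the function names `pipeline_cycles`, `space_time_diagram` and `speedup` are illustrative.

```python
def pipeline_cycles(k, n):
    """Clock cycles needed for n tasks in a k-segment pipeline: k + (n - 1)."""
    return k + (n - 1)


def space_time_diagram(k, n):
    """Return a k-row grid; grid[s][c] is the task in segment s+1 at cycle c+1."""
    cycles = pipeline_cycles(k, n)
    grid = [["" for _ in range(cycles)] for _ in range(k)]
    for task in range(n):
        for seg in range(k):
            # Task i enters segment 1 at cycle i and moves one segment per cycle.
            grid[seg][task + seg] = f"T{task + 1}"
    return grid


def speedup(n, tn, tp, k):
    """Speedup of a pipeline over a non-pipelined unit: n*tn / ((k + n - 1)*tp)."""
    return (n * tn) / ((k + n - 1) * tp)


for seg, row in enumerate(space_time_diagram(6, 8), start=1):
    print(f"Segment {seg}:", " ".join(f"{cell:>3}" for cell in row))

print(pipeline_cycles(6, 8))               # Question 9.2
print(pipeline_cycles(6, 200))             # Question 9.3
print(round(speedup(100, 50, 10, 6), 2))   # Question 9.4 (i)
```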


[Figure: pipeline configuration; recoverable labels: Ai, 40ns, R1, Multiplier 45ns, R3, R4]



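Number of tasks, n = 7; number of segments, k = 3

a. Minimum clock cycle time: tp = (45+5)ns = 50ns

b. For the non-pipeline system, we have tn = (40+45+15)ns = 100ns, so n*tn = 7 * 100 = 700ns

c. Speedup for 10 tasks: n = 10, tn = 100ns, tp = 50ns
S = (n * tn) / ((k + n - 1) * tp) = (10 * 100) / ((3 + 10 - 1) * 50) = 1000/(12 * 50) = 1000/600 = 1.67

d. Speedup of the pipeline for 100 tasks:
S = (100 * 100) / ((3 + 100 - 1) * 50) = 10000/(102 * 50) = 10000/5100 = 1.96

e. The maximum speedup that a pipeline can achieve is bounded by the number of segments in the pipeline, i.e. S = k = 3. That bound is reached only when tn = k*tp; with tn = 100ns and tp = 50ns here, S approaches tn/tp = 2 as n grows.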
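How the speedup of this 3-segment pipeline behaves as the number of tasks grows can be tabulated with a short script. This is a minimal sketch, assuming the formula S = n*tn / ((k + n - 1)*tp) and the values k = 3, tn = 100ns, tp = 50ns from the solution above; the names are illustrative. It also prints tn/tp, the value the ratio approaches for large n with these particular delays.

```python
def speedup(n, tn, tp, k):
    """Pipeline speedup for n tasks: n*tn / ((k + n - 1)*tp)."""
    return (n * tn) / ((k + n - 1) * tp)


K, TN, TP = 3, 100, 50  # segments, non-pipelined time (ns), clock cycle (ns)

# Speedup for an increasing number of tasks.
values = {n: speedup(n, TN, TP, K) for n in (10, 100, 10_000)}
for n, s in values.items():
    print(f"n = {n:>6}: S = {s:.2f}")

# Ratio that the speedup approaches as n becomes very large.
limit = TN / TP
print("tn/tp =", limit)
```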

Question 9.7
The time delays of the four segments in the pipeline in figure 3 are as follows: t1 = 50ns, t2 = 30ns, t3 = 95ns, t4 = 45ns. The interface registers delay time tr = 5ns.
a. How long would it take to add 100 pairs of numbers in the pipeline?
b. How can we reduce the total time to about one half of the time calculated in part (a)?

Solution
Time delays for each of the four segments are: t1 = 50ns, t2 = 30ns, t3 = 95ns, t4 = 45ns.

Interface register delay time tr = 5ns
Maximum segment delay = t3 = 95ns
Clock cycle tp = t3 + tr = 95 + 5 = 100ns
Hence, the minimum clock cycle is 100ns.
For 100 tasks (100 pairs of numbers) in a pipeline of k = 4 segments, the number of clock cycles is k + (n-1), so we have:
(4 + 100 - 1) * 100ns = 10300ns
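The timing for part (a) can be checked in code using the cycle count k + (n-1) from Question 9.3, which includes the k - 1 cycles needed to fill the pipeline. This is a minimal sketch with illustrative function names. The second calculation is an assumption that does not appear in the solution above: it supposes the 95ns segment could be split into two segments of 50ns and 45ns, as one possible approach to part (b).

```python
def clock_period(delays, tr):
    """Clock cycle = slowest segment delay plus the interface register delay."""
    return max(delays) + tr


def total_time(delays, tr, n):
    """Time for n tasks: (k + n - 1) clock cycles, where k = number of segments."""
    k = len(delays)
    return (k + n - 1) * clock_period(delays, tr)


TR = 5  # interface register delay in ns

# Part (a): the four segments as given.
original = [50, 30, 95, 45]
t_a = total_time(original, TR, 100)
print("tp =", clock_period(original, TR), "ns, total =", t_a, "ns")

# Part (b), hypothetical: split the 95ns segment into 50ns + 45ns (assumption).
split = [50, 30, 50, 45, 45]
t_b = total_time(split, TR, 100)
print("tp =", clock_period(split, TR), "ns, total =", t_b, "ns")
```

Under that assumed split the clock period drops to 55ns at the cost of one extra segment, which brings the total to a little over half of the part (a) figure.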

Question 9.8
How would you use the floating-point pipeline adder of fig 9.6 to add 100 floating-point numbers X1 + X2 + X3 + . . . + X100?

Solution
Let the floating-point numbers X1, X2, X3, . . . , X100 be represented in the form below:
X1 = A x 2^a
X2 = B x 2^b
X3 = C x 2^c
...
X100 = M x 2^m
where A, B, C, . . . , M are the mantissas and a, b, c, . . . , m are the exponents, with the assumption that the floating-point numbers are binary numbers.

Question 9.10 Pipeline Processing
Four possible hardware schemes that can be used in an instruction pipeline in order to minimize the performance degradation caused by instruction branching

What is a Branch Instruction?
A branch (or jump on some computer architectures, such as the PDP-8 and Intel x86) is a point in a computer program where the flow of control is altered. The term branch is usually used when referring to a program written in machine code or assembly language; in a high-level programming language, branches usually take the form of conditional statements, subroutine calls or GOTO statements.

A branch instruction, i.e. an instruction that causes a branch, can be taken or not taken: if a branch is not taken, the flow of control is unchanged and the next instruction to be executed is the instruction immediately following the current instruction in memory; if taken, the next instruction to be executed is an instruction at some other place in memory.

There are usually two forms of branch instruction: a conditional branch, which can be either taken or not taken depending on a condition such as a CPU flag, and an unconditional branch, which is always taken and alters the sequential program flow by loading the program counter with the target address. Pipelined computers employ various hardware techniques to minimize the performance degradation caused by instruction branching.

PREFETCH TARGET INSTRUCTION
One way of handling a conditional branch is to pre-fetch the target instruction in addition to the instruction following the branch. Both are saved until the branch is executed. If the branch condition is successful, the pipeline continues from the branch target instruction. An extension of this procedure is to continue fetching instructions from both places until the branch decision is made. At that time control chooses the instruction stream of the correct program flow.

BRANCH TARGET BUFFER (BTB)
The branch target buffer is an associative memory included in the fetch segment of the pipeline. Each entry in the branch target buffer consists of the address of a previously executed branch instruction and the target instruction for that branch. It also stores the next few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it searches the branch target buffer for the address of the instruction. If it is in the branch target buffer, the instruction is available directly and pre-fetch continues from the new path. If the instruction is not in the branch target buffer, the pipeline shifts to a new instruction stream and stores the target instruction in the branch target buffer. The advantage of this scheme is that branch instructions that have occurred previously are readily available in the pipeline without interruption.

LOOP BUFFER
A variation of the branch target buffer is the loop buffer. This is a very high speed register file maintained by the instruction fetch segment of the pipeline. When a program loop is detected in the program, it is stored in the loop buffer in its entirety, including all branches. The program can be executed directly without having to access memory until the loop mode is removed by the final branching out.

BRANCH PREDICTION
A pipeline with branch prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is executed. The pipeline then begins pre-fetching the instruction stream from the predicted path. A correct prediction eliminates the wasted time caused by branch penalties.
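The lookup behaviour of a branch target buffer can be modelled in software as a small associative table. This is a minimal sketch, not a description of any real hardware: the class name `BranchTargetBuffer`, the fixed `capacity`, and the oldest-entry-first replacement rule are all assumptions made for illustration, and instructions are represented as plain strings.

```python
class BranchTargetBuffer:
    """Toy associative table: branch instruction address -> target instruction."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # assumed size limit (hypothetical)
        self.entries = {}         # dicts keep insertion order in Python 3.7+

    def lookup(self, branch_addr):
        """Return the stored target instruction, or None on a miss."""
        return self.entries.get(branch_addr)

    def record(self, branch_addr, target_instr):
        """Store the target of an executed branch, evicting the oldest entry if full."""
        if branch_addr not in self.entries and len(self.entries) >= self.capacity:
            oldest = next(iter(self.entries))  # assumed replacement policy
            del self.entries[oldest]
        self.entries[branch_addr] = target_instr


btb = BranchTargetBuffer(capacity=2)
print(btb.lookup(0x100))        # miss: this branch has not been executed yet
btb.record(0x100, "LOAD R1")    # branch executes, its target is stored
print(btb.lookup(0x100))        # hit: the target is available directly
```

A hit corresponds to the case where pre-fetch continues from the new path without interruption; a miss corresponds to the case where the pipeline must shift to the new stream and then store the target.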

Question 9.16
Consider the multiplication of two 40 x 40 matrices using a vector processor.
a. How many product terms are there in each inner product and how many inner products must be evaluated?
b. How many multiply-add operations are needed to calculate the product matrix?

Solutions
a. There are 40 product terms in each inner product. The number of inner products to be evaluated = 40 * 40 = 1600
b. Multiply-add operations = inner products * 40 = 1600 * 40 = 64000
There are 64000 multiply-add operations needed to calculate the product matrix.
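The counts can be confirmed by counting the iterations of the usual triple loop for matrix multiplication. This is a minimal sketch; the function names are illustrative, and the loop only counts operations rather than multiplying real matrices.

```python
def matmul_counts(n):
    """Return (inner products, product terms per inner product, multiply-adds)."""
    inner_products = n * n   # one inner product per element of the result
    terms = n                # each inner product sums n product terms
    return inner_products, terms, inner_products * terms


def count_by_loops(n):
    """Count multiply-add steps by walking the standard triple loop."""
    count = 0
    for i in range(n):          # row of the result
        for j in range(n):      # column of the result
            for k in range(n):  # one multiply-add per product term
                count += 1
    return count


print(matmul_counts(40))
print(count_by_loops(40))
```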

Question 9.19
Flops is the number of floating-point operations performed per second by a computer system. A megaflop is one million floating-point operations per second and a gigaflop is one billion floating-point operations per second. A typical supercomputer has a basic cycle time of 4 to 20ns. If the processor of this supercomputer can complete a floating-point operation through a pipeline each cycle time, it will have the ability to perform 50 to 250 megaflops.

The number of operations is 250 billion floating-point operations, i.e. 250 x 10^9 operations. The supercomputer can perform 100 million floating-point operations per second, i.e. 100 megaflops. Hence, the time that it will take this computer to carry out the operations is:
(250 x 10^9) / (100 x 10^6) = (1000 x 250)/100 = 2500 seconds
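Both calculations in this answer are simple unit conversions and can be verified in code. This is a minimal sketch, assuming one floating-point result per clock cycle; the function names are illustrative.

```python
def megaflops_from_cycle_time(cycle_ns):
    """Peak rate in megaflops, assuming one floating-point result per cycle."""
    # 1 / (cycle_ns * 1e-9 s) operations per second, divided by 1e6.
    return 1000 / cycle_ns


def run_time_seconds(operations, rate_flops):
    """Time needed for a number of operations at a given rate in flops."""
    return operations / rate_flops


print(megaflops_from_cycle_time(20))   # slowest cycle time in the range
print(megaflops_from_cycle_time(4))    # fastest cycle time in the range

# 250 billion operations on a machine rated at 100 megaflops.
print(run_time_seconds(250 * 10**9, 100 * 10**6), "seconds")
```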
QUESTION 9.20
i) To perform 400 floating-point operations using four processors with a cycle time of 40ns each,
the total time required = (400/4) * 40 = 4000ns

ii) When using a single processor with a clock cycle of 10ns to perform the same task,
the total time required = (400/1) * 10 = 4000ns

Therefore, there is no difference in the time taken to perform the jobs between the two cases.
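The comparison can be expressed as one small function. This is a minimal sketch, assuming each processor completes one operation per cycle and that the work is shared evenly among the processors; the function name is illustrative.

```python
import math


def total_time_ns(operations, processors, cycle_ns):
    """Time for the busiest processor to finish its share of the operations."""
    per_processor = math.ceil(operations / processors)  # even split assumed
    return per_processor * cycle_ns


four_slow = total_time_ns(400, 4, 40)   # four processors, 40ns cycle
one_fast = total_time_ns(400, 1, 10)    # one processor, 10ns cycle
print(four_slow, one_fast, four_slow == one_fast)
```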