Problem 1: ____ (of 20)
Problem 2: ____ (of 15)
Problem 3: ____ (of 13)
Problem 4: ____ (of 25)
Problem 5: ____ (of 27)
Extra Credit: ____ (of 13)
Final Score: _____ (of 100)
November 1, 2001
c. Three of the six principles of the RISC movement are 1) few addressing modes, 2) single-cycle instruction execution, and 3) reliance on compiler optimizations.
d. Given a five-stage DLX pipeline (IF, ID, EX, MEM, WB) in which branches complete in the EX stage and are handled by stalling, converting all branches to delayed branches reduces the average penalty (number of cycles added to execution) per branch from two cycles to one.
e. If processor A has a higher clock rate than processor B and a higher MIPS rating than processor B, then it will execute a given program as fast as or faster than processor B.
f. Processors don't implement Belady's cache replacement algorithm because it slows down the hit time of the cache.
g. One of the advantages of an accumulator architecture over a load/store architecture is smaller code size for programs.
h. It is impossible to have WAR hazards in an in-order DLX pipeline with single-cycle operations. The possibility for WAR hazards arises only when multi-cycle operations (like floating-point operations and cache misses) are introduced.
j. Increased transistor counts have made CISC architectures more attractive than RISC architectures.
b. Pipelining is used because it improves instruction throughput. Increasing the level of pipelining cuts the amount of work performed at each pipeline stage, allowing more instructions to exist in the processor at the same time and instructions to complete at a more rapid rate. However, throughput will not improve as pipelining is increased indefinitely. Give two reasons for this.
c. In benchmarking, it is sometimes useful to summarize the performance of a group of benchmarks into a single number. There are three potential functions that can perform this summary: arithmetic mean, harmonic mean, and geometric mean. Which should be used?
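For reference, the three summary functions named in this question can be sketched as follows. The function names and the sample values are illustrative assumptions and do not come from the exam. The sketch shows only how each mean is computed, not which one is appropriate.

```python
import math

def arithmetic_mean(values):
    # Sum of the values divided by their count.
    return sum(values) / len(values)

def harmonic_mean(values):
    # Count divided by the sum of reciprocals (values must be non-zero).
    return len(values) / sum(1.0 / v for v in values)

def geometric_mean(values):
    # n-th root of the product, computed through logarithms
    # (values must be strictly positive).
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical per-benchmark numbers, chosen only for illustration.
sample = [1.0, 2.0, 4.0]
print(arithmetic_mean(sample), harmonic_mean(sample), geometric_mean(sample))
```

For any set of positive values the three results are ordered harmonic <= geometric <= arithmetic, so the choice of function changes the summary whenever the benchmarks differ.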
d. Why is miss rate not a good metric for evaluating cache performance? What is the appropriate metric? Give its definition. What is the reason for using a combination of first- and second-level caches rather than using the same chip area for a larger first-level cache?
e. The original motivation for using virtual memory was compatibility. What does that mean in this context? What are two other motivations for using virtual memory?
branch outcome T T T N T N T T T N T N T T T N T N
misprediction?
c. (5 points) Now, assume a two-level branch predictor that uses one bit of branch history, i.e., a one-bit BHR. Since there is only one branch in the program, it does not matter how the BHR is concatenated with the branch PC to index the BHT. Assume that the BHT uses one-bit counters and that, again, all entries are initialized to N. Which of the branches in this sequence would be mis-predicted? Again, use this table.
branch outcome T T T N T N T T T N T N T T T N T N
misprediction?
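The predictor described in part c can be modeled with a short simulation. This is a minimal sketch under stated assumptions: the BHR is assumed to start as not-taken (the question specifies only the BHT initialization), the BHT is indexed by the BHR alone because there is a single branch, and the names `simulate_two_level` and `parse` are hypothetical. The usage lines run a short made-up sequence, not the one in the table.

```python
def parse(text):
    # Turn a string such as "T T N" into a list of booleans (True = taken).
    return [symbol == "T" for symbol in text.split()]

def simulate_two_level(outcomes, history_bits=1):
    # One-bit counters, all initialized to not-taken (False).
    bht = [False] * (1 << history_bits)
    bhr = 0  # ASSUMPTION: history register starts as all not-taken
    mask = (1 << history_bits) - 1
    mispredicted = []
    for taken in outcomes:
        prediction = bht[bhr]
        mispredicted.append(prediction != taken)
        bht[bhr] = taken                        # one-bit counter: remember last outcome
        bhr = ((bhr << 1) | int(taken)) & mask  # shift the outcome into the history
    return mispredicted

# Hypothetical four-branch sequence, for illustration only.
print(simulate_two_level(parse("T T T N")))
# With history_bits=0 the same code degenerates to a plain one-bit predictor.
print(simulate_two_level(parse("T T T N"), history_bits=0))
```

A different initial BHR value would change only the first few predictions, so that assumption should be stated alongside any answer.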
b. (3 points) Why are the first-level caches usually split (instructions and data are in different caches) while the L2 is usually unified (instructions and data are both in the same cache)?
For the rest of the question, consider a 64-byte cache with 8-byte blocks, an associativity of 2, and LRU block replacement. Virtual addresses are 16 bits. The cache is physically tagged. The processor has 16KB of physical memory.
c. (3 points) What is the total number of tag bits?
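The arithmetic behind a tag-bit count can be sketched as below. The function name `tag_bits` is hypothetical, and the sketch assumes that all sizes are powers of two and that only address tag bits are counted (valid and LRU bits are excluded).

```python
def tag_bits(cache_bytes, block_bytes, ways, phys_mem_bytes):
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    offset_bits = (block_bytes - 1).bit_length()
    index_bits = (sets - 1).bit_length()
    address_bits = (phys_mem_bytes - 1).bit_length()
    per_block = address_bits - index_bits - offset_bits
    # ASSUMPTION: "total" means one tag per block, with no valid or LRU bits.
    return per_block, per_block * blocks

# Parameters taken from the problem statement above.
print(tag_bits(cache_bytes=64, block_bytes=8, ways=2, phys_mem_bytes=16 * 1024))
```

Because the cache is physically tagged, the address width comes from the physical memory size rather than from the 16-bit virtual address.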
b. (3 points) Assuming there are no special provisions for avoiding synonyms, what is the minimum page size?
c. (3 points) Assume each page is 64 bytes. How large would a single-level page table be? Each page requires 4 protection bits, and entries must be an integral number of bytes.
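The sizing arithmetic for a single-level page table can be sketched as follows. The function name `page_table_bytes` is hypothetical. The sketch assumes the 16-bit virtual addresses and 16KB physical memory given earlier in this problem, and it assumes that each entry holds only a physical frame number plus the protection bits.

```python
def page_table_bytes(virt_addr_bits, phys_addr_bits, page_bytes, protection_bits):
    # For a power of two n, (n - 1).bit_length() equals log2(n).
    offset_bits = (page_bytes - 1).bit_length()
    entries = 1 << (virt_addr_bits - offset_bits)   # one entry per virtual page
    frame_bits = phys_addr_bits - offset_bits       # physical frame number width
    # ASSUMPTION: no extra valid or dirty bits beyond the protection bits.
    entry_bits = frame_bits + protection_bits
    entry_bytes = (entry_bits + 7) // 8             # round up to whole bytes
    return entries, entry_bytes, entries * entry_bytes

# 16-bit virtual addresses, 14-bit physical addresses (16KB), 64-byte pages.
print(page_table_bytes(virt_addr_bits=16, phys_addr_bits=14,
                       page_bytes=64, protection_bits=4))
```

Rounding each entry up to a whole number of bytes leaves spare bits, so adding a valid bit to the entry would not change the result for these parameters.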
d. (10 points) For the following sequence of references, label the cache misses. Using Mark Hill's 3C model, label each miss as being either a compulsory miss, a capacity miss, or a conflict miss. The addresses are given in octal (each digit represents 3 bits). Assume the cache initially contains block addresses 000, 010, 020, 030, 040, 050, 060, and 070, which were accessed in that order.
reference address   cache state prior to access   miss?   which?
024
100
270
570
074
272
004
044
640
000
410
710
550
570
410
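One common way to apply the 3C model is to run the real cache next to a fully associative LRU cache of the same capacity. A miss to a never-seen block is compulsory, a miss that also misses in the fully associative cache is a capacity miss, and any remaining miss is a conflict miss. The sketch below follows that scheme for the 64-byte, 2-way, 8-byte-block cache of this problem. The function names are hypothetical, address translation is not modeled (addresses are assumed to map to themselves), and the initially resident blocks are treated as already seen. The usage lines classify a short made-up sequence, not the one in the table.

```python
from collections import OrderedDict

BLOCK_BYTES = 8
NUM_SETS = 4
WAYS = 2

def touch(lru, block, limit):
    # Access `block` in an LRU structure holding at most `limit` blocks.
    # Returns True on a hit; on a miss the least recently used block is evicted.
    if block in lru:
        lru.move_to_end(block)
        return True
    if len(lru) >= limit:
        lru.popitem(last=False)
    lru[block] = True
    return False

def classify(addresses, initial_addresses):
    sets = [OrderedDict() for _ in range(NUM_SETS)]  # the set-associative cache
    full = OrderedDict()                             # fully associative reference
    seen = set()
    labels = []
    # Warm-up accesses are replayed first but do not produce labels.
    for position, address in enumerate(list(initial_addresses) + list(addresses)):
        block = address // BLOCK_BYTES
        hit = touch(sets[block % NUM_SETS], block, WAYS)
        full_hit = touch(full, block, NUM_SETS * WAYS)
        if position >= len(initial_addresses):
            if hit:
                labels.append("hit")
            elif block not in seen:
                labels.append("compulsory")
            elif full_hit:
                labels.append("conflict")
            else:
                labels.append("capacity")
        seen.add(block)
    return labels

initial = [0o000, 0o010, 0o020, 0o030, 0o040, 0o050, 0o060, 0o070]
# Hypothetical reference sequence, for illustration only.
print(classify([0o004, 0o100, 0o000, 0o040], initial))
```

Python's `0o` prefix reads the addresses directly as octal, which matches the notation used in the question.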
// N is in r5
Using the instruction numbers, label the data and control dependences. For 3 extra credit points, account for cross-iteration hazards (hazards between instructions from different iterations) and hazards through memory if any.
d. (10 points) Fill in the pipeline diagram for the code of the new SAXPY loop. Label the stalls as d* for data-hazard stalls and s* for structural stalls. What is the latency of a single iteration (the number of cycles between the completion of two successive #0 instructions)? For this question, assume that FP addition takes 2 cycles, FP multiplication takes 3 cycles, and all other operations take a single cycle. The functional units are not pipelined. The FP adder, FP multiplier, and integer ALU are all separate functional units, such that there are no structural hazards between them. As in DLX, the register file is written by the WB stage in the first half of a clock cycle and is read by the ID stage in the second half of a clock cycle. In addition, the processor has full forwarding. The processor stalls on branches until the outcome is available, which is at the end of the EX stage. The processor has no provisions for maintaining precise state.
instruction               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
0: slli r2,r1,#3
1: addi r3,r2,#X
2: mulf_m f2,f0,(r3)
3: addi r4,r2,#Y
4: addf_m f4,f2,(r4)
5: addi r4,r2,#Z
6: sf f4,(r4)
7: addi r1,r1,#1
8: slei r6,r1,r5
9: bnez r6, #0
0: slli r2,r1,#3
e. (3 points) In DLX, what is the reason for forcing non-memory operations to go through the MEM stage rather than proceeding directly to the WB stage?
f. (3 points) Aside from the direct loss of register displacement addressing and the subsequent instructions required to explicitly compute addresses, what are two other disadvantages of this sort of pipeline?
h. (10 extra credit points) Reduce the stalls by pipeline scheduling a single loop iteration. Show the resulting code and fill in the pipeline diagram. You do not need to show the optimal schedule for a correct response.
instruction               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24