
ECE4680 Computer Organization and Architecture Lecture 2: Performance Evaluation

ECE4680 Lec2 Performance.1

January 23, 2003

Review: What is "Computer Architecture"


Co-ordination of levels of abstraction:

  Application
  Operating System
  Compiler
  Instruction Set Architecture
  Instr. Set Proc., I/O system
  Digital Design
  Circuit Design

Under a set of rapidly changing forces

CA = IS + CO

ECE4680 Lec2 Performance.2


Review: Levels of Representation


High Level Language Program

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

(Compiler)

Assembly Language Program

    lw $15, 0($2)
    lw $16, 4($2)
    sw $16, 0($2)
    sw $15, 4($2)

(Assembler)

Machine Language Program

    0000 1010 1100 0101 1001 1111 0110 1000
    1100 0101 1010 0000 0110 1000 1111 1001
    1010 0000 0101 1100 1111 1001 1000 0110
    0101 1100 0000 1010 1000 0110 1001 1111

(Machine Interpretation)

Control Signal Specification

ECE4680 Lec2 Performance.3


Review: Levels of Organization

SPARCstation 20

Computer
  SPARC Processor
    Control
    Datapath
  Memory
  Devices
    Input
    Output

ECE4680 Lec2 Performance.4


Summary: Computer System Components


Proc
Caches
Busses
adapters
Memory
Controllers
I/O Devices: Disks, Displays, Keyboards, Networks

All have interfaces & organizations

ECE4680 Lec2 Performance.5


Review: Summary from Last Lecture


All computers consist of five components
  Processor: (1) datapath and (2) control
  (3) Memory
  (4) Input devices and (5) Output devices
Not all memory is created equal
  Cache: fast (expensive) memory is placed closer to the processor
  Main memory: less expensive memory, so we can have more of it
Input and output (I/O) devices have the messiest organization
  Wide range of speed: graphics vs. keyboard
  Wide range of requirements: speed, standard, cost, etc.
  Least amount of research (so far)

ECE4680 Lec2 Performance.6


Integrated Circuits Costs --- manufacturing process (p24)


Click "how chips are made" on the website.

ECE4680 Lec2 Performance.7


Integrated Circuits Costs --- formula


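$$\text{Die cost} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$

$$\text{Dies per wafer} = \frac{\text{Wafer area}}{\text{Die area}}$$

$$\text{Die yield} = \frac{1}{\left(1 + \text{Defects per area} \times \text{Die area}\right)^{2}}$$

Die cost goes roughly with the cube of the die area.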
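
The three cost formulas on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample inputs (a $1000 wafer with 300 cm^2 of usable area and 1 defect per cm^2) are assumptions, not values from the lecture, and the dies-per-wafer estimate ignores edge losses just as the slide's formula does.

```python
def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Dies per wafer = wafer area / die area (same area units for both)."""
    return wafer_area / die_area


def die_yield(defects_per_area: float, die_area: float) -> float:
    """Die yield = 1 / (1 + defects per area * die area)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area) ** 2


def die_cost(cost_per_wafer: float, wafer_area: float,
             die_area: float, defects_per_area: float) -> float:
    """Die cost = cost per wafer / (dies per wafer * yield)."""
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return cost_per_wafer / good_dies


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show how quickly cost grows with area.
    for area_cm2 in (0.5, 1.0, 2.0):
        cost = die_cost(1000.0, 300.0, area_cm2, 1.0)
        print(f"die area {area_cm2} cm^2 -> die cost ${cost:.2f}")
```

Doubling the die area both halves the number of dies per wafer and lowers the yield, which is why cost grows much faster than linearly with area.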

ECE4680 Lec2 Performance.8


Real World Examples

Chip          Metal   Line   Wafer   Defects  Area   Dies/  Yield  Die
              layers  width  cost    /cm2     (mm2)  wafer         cost
386DX         2       0.90   $900    1.0      43     360    71%    $4
486DX2        3       0.80   $1200   1.0      81     181    54%    $12
PowerPC 601   4       0.80   $1700   1.3      121    115    28%    $53
HP PA 7100    3       0.80   $1300   1.0      196    66     27%    $73
DEC Alpha     3       0.70   $1500   1.2      234    53     19%    $149
SuperSPARC    3       0.70   $1700   1.6      256    48     13%    $272
Pentium       3       0.80   $1500   1.5      296    40     9%     $417

From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15

ECE4680 Lec2 Performance.9


Other Costs
$$\text{IC cost} = \frac{\text{Die cost} + \text{Testing cost} + \text{Packaging cost}}{\text{Final test yield}}$$

Packaging cost: depends on pins, heat dissipation

Chip          Die     Package               Test &     Total
              cost    pins   type   cost    Assembly
386DX         $4      132    QFP    $1      $4         $9
486DX2        $12     168    PGA    $11     $12        $35
PowerPC 601   $53     304    QFP    $3      $21        $77
HP PA 7100    $73     504    PGA    $35     $16        $124
DEC Alpha     $149    431    PGA    $30     $23        $202
SuperSPARC    $272    293    PGA    $20     $34        $326
Pentium       $417    273    PGA    $19     $37        $473

ECE4680 Lec2 Performance.10


CMOS improvements
Die size 2X / 3 years; line widths halve / 7 years

[Figure: transistors per unit area and die size, 1980 to 1992]

          Capacity         Speed
Logic     2x in 3 years    2x in 3 years
DRAM      4x in 3 years    1.4x in 10 years
Disk      4x in 3 years    1.4x in 10 years

ECE4680 Lec2 Performance.11


Processor Performance

[Figure: performance vs. year, 1987 to 1993, with growth trends of 1.35X/yr and 1.54X/yr. Machines plotted: Sun-4/260, MIPS M/120, MIPS M2000, IBM RS6000/540, HP 9000/750, DEC AXP 300, IBM Power 2/590]

ECE4680 Lec2 Performance.12


The bottom line: Performance (and cost)


Airplane           DC to Paris   Range   Speed      Passengers   Throughput      Cost
                                         (m.p.h.)                (p x m.p.h.)
Boeing 747         6.5 hours     4150    610        470          286,700         ???
BAC/Sud Concorde   3 hours       4000    1350       132          178,200         ???
Douglas DC-8       7.3 hours     8720    544        146          79,424          ???

Different measurements lead to different results
  Time to do the task (Execution Time): execution time, response time, latency
  Tasks per day, hour, week, sec, ns, ... (Performance): throughput, bandwidth
Cost renders the measurement more complex
The bottom-line performance measurement is CPU execution time.

ECE4680 Lec2 Performance.13


The bottom line: Performance (and cost)

"X is n times faster than Y" means

$$\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$$

Time of Concorde vs. Boeing 747?
Throughput of Boeing 747 vs. Concorde?

ECE4680 Lec2 Performance.14
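
The two questions on the slide above can be worked out directly from the airplane table. The sketch below is a minimal illustration; the function name `times_faster` is an assumption, and the numbers are the table's own (flight time in hours, throughput in passengers x m.p.h.).

```python
def times_faster(slower_value: float, faster_value: float) -> float:
    """Return n such that the faster item is n times faster: ExTime(Y) / ExTime(X)."""
    return slower_value / faster_value


# Execution-time view: DC to Paris, 6.5 hours (Boeing 747) vs. 3 hours (Concorde).
concorde_vs_747_time = times_faster(6.5, 3.0)

# Throughput view: passengers x m.p.h., where a higher value means better performance,
# so the ratio is Performance(747) / Performance(Concorde).
throughput_747 = 470 * 610
throughput_concorde = 132 * 1350
b747_vs_concorde_throughput = throughput_747 / throughput_concorde

if __name__ == "__main__":
    print(f"Concorde is {concorde_vs_747_time:.2f} times faster by flight time")
    print(f"Boeing 747 is {b747_vs_concorde_throughput:.2f} times faster by throughput")
```

The Concorde wins on latency (about 2.2 times) while the 747 wins on throughput (about 1.6 times), which is the point of the slide: the answer depends on the metric.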


Metrics of performance

Levels, from top to bottom: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins

Metrics, from top to bottom:
  Answers per month
  Operations per second
  (millions) of Instructions per second: MIPS
  (millions) of (F.P.) operations per second: MFLOP/s
  Megabytes per second
  Cycles per second (clock rate)

ECE4680 Lec2 Performance.15


Relating Processor Metrics (pp60-66)


$$\text{CPU execution time} = \frac{\text{CPU clock cycles/pgm}}{\text{clock rate}}$$

or

$$\text{CPU execution time} = \text{CPU clock cycles/pgm} \times \text{clock cycle time}$$

$$\text{CPU clock cycles/pgm} = \text{Instructions/pgm} \times \text{avg. clock cycles per instr.}$$

or

$$\text{CPI} = \frac{\text{CPU clock cycles/pgm}}{\text{Instructions/pgm}}$$

CPI tells us something about the Instruction Set Architecture, the implementation of that architecture, and the program measured.

ECE4680 Lec2 Performance.16


Aspects of CPU Performance


$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                   instr. count   CPI   clock rate
Program
Compiler
Instr. Set Arch.
Organization
Technology

ECE4680 Lec2 Performance.17

Aspects of CPU Performance


$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                   instr. count   CPI   clock rate
Program                 X          X
Compiler                X         (x)
Instr. Set Arch.        X          X
Organization                       X        X
Technology                                  X

ECE4680 Lec2 Performance.18

Organizational Trade-offs

Levels: Application, Programming Language, Compiler, ISA, Datapath, Control, Function Units, Transistors, Wires, Pins

3 factors: Where are they? How are they related?
  Instruction Mix
  CPI
  Cycle Time

ECE4680 Lec2 Performance.19


CPI How to compute?


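Average cycles per instruction:

$$\text{CPI} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}} = \frac{\text{Clock Cycles}}{\text{Instruction Count}}$$

CPU clock cycles summed up over the $n$ instruction classes, where $C_i$ is the number of executed instructions of class $i$:

$$\text{CPU time} = \text{ClockCycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times C_i$$

In terms of "instruction frequency":

$$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i \quad \text{where} \quad F_i = \frac{C_i}{\text{Instruction Count}}$$

Invest resources where time is spent!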

ECE4680 Lec2 Performance.20


Example (page 60)

Our favorite program runs in 10 sec on machine A, which has a 400 MHz clock. We are trying to design a machine B with a faster clock rate so as to reduce the execution time to 6 sec. The increase in clock rate will affect the rest of the CPU design, causing B to require 1.2 times as many clock cycles as machine A for this program. What clock rate should machine B have?

ECE4680 Lec2 Performance.21


Answer:
CPU time A = CPU clock cycles A / clock rate A
==> CPU clock cycles A = 10 sec x 400 x 10^6 cycles/sec = 4000 x 10^6 cycles
Clock rate B = CPU clock cycles B / CPU time B = 1.2 x 4000 x 10^6 cycles / 6 sec = 800 MHz
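
The arithmetic in this answer can be checked with a short sketch. The variable and function names are illustrative assumptions; the inputs are the ones given in the example (10 sec and 400 MHz on A, a 6 sec target and 1.2 times the cycles on B).

```python
def clock_cycles(cpu_time_s: float, clock_rate_hz: float) -> float:
    """CPU clock cycles = CPU time x clock rate."""
    return cpu_time_s * clock_rate_hz


# Machine A: 10 seconds at 400 MHz.
cycles_a = clock_cycles(10.0, 400e6)

# Machine B needs 1.2 times as many cycles and must finish in 6 seconds.
cycles_b = 1.2 * cycles_a
clock_rate_b_hz = cycles_b / 6.0

if __name__ == "__main__":
    print(f"cycles on A: {cycles_a:.3e}")
    print(f"required clock rate for B: {clock_rate_b_hz / 1e6:.0f} MHz")
```

Machine A executes 4 x 10^9 cycles, so B must run 4.8 x 10^9 cycles in 6 seconds, which requires 800 MHz, twice A's clock rate.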

ECE4680 Lec2 Performance.22


Example
Base Machine (Reg / Reg) and instruction frequencies in the execution of a program:

Op       Freq   Cycles
ALU      43%    1
Load     21%    2
Store    12%    2
Branch   24%    2

Question: What is the average CPI of the machine?
CPI = 1 x 43% + 2 x 21% + 2 x 12% + 2 x 24% = 1.57
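
The average CPI is a frequency-weighted sum of the per-class cycle counts. A minimal sketch using the instruction mix from this example; the dictionary layout and the function name `average_cpi` are assumptions made for illustration.

```python
# Instruction mix from the example: op -> (frequency, cycles per instruction).
MIX = {
    "ALU": (0.43, 1),
    "Load": (0.21, 2),
    "Store": (0.12, 2),
    "Branch": (0.24, 2),
}


def average_cpi(mix: dict) -> float:
    """CPI = sum over instruction classes of CPI_i x F_i."""
    return sum(freq * cycles for freq, cycles in mix.values())


if __name__ == "__main__":
    print(f"average CPI = {average_cpi(MIX):.2f}")
```

Here the ALU class contributes 0.43 cycles and the three 2-cycle classes contribute 1.14 cycles, for a total of 1.57.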

ECE4680 Lec2 Performance.23


Example: (page 62)
Suppose we have two implementations of the same instruction set. Machine A has a clock cycle time of 10 ns and an average CPI of 2.0 for some program. Machine B has a clock cycle time of 20 ns and an average CPI of 1.2 for the same program. Which is faster? And by how much?
Let I denote the number of instructions of the program.
CPU time A = I x 2.0 x 10 ns = 20 I ns
CPU time B = I x 1.2 x 20 ns = 24 I ns
Machine A is 24/20 = 1.2 times faster than B.
Again we see that the 3 factors are related!
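
Because both machines run the same program on the same instruction set, the instruction count I cancels and only CPI x cycle time matters. A small sketch of that comparison, with an assumed helper name:

```python
def time_per_instruction_ns(cpi: float, cycle_time_ns: float) -> float:
    """Average time per instruction = CPI x clock cycle time."""
    return cpi * cycle_time_ns


time_a = time_per_instruction_ns(2.0, 10.0)  # machine A
time_b = time_per_instruction_ns(1.2, 20.0)  # machine B

# ExTime(B) / ExTime(A): how many times faster A is than B.
a_times_faster_than_b = time_b / time_a

if __name__ == "__main__":
    print(f"A: {time_a} ns/instr, B: {time_b} ns/instr")
    print(f"A is {a_times_faster_than_b:.1f} times faster than B")
```

Machine B has the better CPI, but its slower clock more than cancels that advantage.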

ECE4680 Lec2 Performance.24


Example (pp65-66)
The ISA has 3 kinds of instructions:

Instruction class   CPI for this instruction class
A                   1
B                   2
C                   3

One program has 2 code sequences:

Code Sequence   Instruction counts for instruction class
                A    B    C
1               2    1    2
2               4    1    1

Which code sequence has more instructions? Which will be faster? What is the CPI for each sequence?
S.1 has 5 instructions; S.2 has 6.
S.1 needs 2x1 + 1x2 + 2x3 = 10 cycles; S.2 needs 4x1 + 1x2 + 1x3 = 9 cycles, so S.2 is faster.
S.1 has CPI = 10/5 = 2; S.2 has CPI = 9/6 = 1.5.
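
The same bookkeeping can be written as a short sketch. The names `CLASS_CPI`, `SEQ1`, `SEQ2`, and `summarize` are assumptions for illustration; the counts and per-class CPIs come from the two tables in this example.

```python
# Cycles per instruction for each instruction class.
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

# Instruction counts per class for the two code sequences.
SEQ1 = {"A": 2, "B": 1, "C": 2}
SEQ2 = {"A": 4, "B": 1, "C": 1}


def summarize(counts: dict) -> tuple:
    """Return (instruction count, clock cycles, CPI) for one code sequence."""
    instructions = sum(counts.values())
    cycles = sum(n * CLASS_CPI[cls] for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


if __name__ == "__main__":
    for name, seq in (("S.1", SEQ1), ("S.2", SEQ2)):
        instructions, cycles, cpi = summarize(seq)
        print(f"{name}: {instructions} instructions, {cycles} cycles, CPI = {cpi}")
```

Sequence 2 executes more instructions but fewer cycles, so instruction count alone does not predict which sequence is faster.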
ECE4680 Lec2 Performance.25

Marketing Metrics
MIPS: Million Instructions Per Second

$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Time} \times 10^{6}} = \frac{\text{Clock Rate}}{\text{CPI} \times 10^{6}}$$

  Machines with different instruction sets?
  Programs with different instruction mixes? (dynamic frequency of instructions)
  Peak MIPS: impractical
  Uncorrelated with performance (see the next example)
  Many pitfalls

MFLOP/S: Million Floating-point Operations Per Second

$$\text{MFLOP/S} = \frac{\text{FP Operations}}{\text{Time} \times 10^{6}}$$

  Machine dependent
  Often not where time is spent

ECE4680 Lec2 Performance.26

Example: (similar to the example at pp78-79, but not the same)
Referring to the example at slide 23, assume we build an optimizing compiler for the load/store machine. The compiler discards 50% of the ALU instructions.
1) What is the CPI_opt?
2) Ignoring system issues and assuming a 20 ns clock cycle time (50 MHz clock rate), what is the MIPS rating for optimized code versus unoptimized code? Does the MIPS rating agree with the ranking by execution time?

                         Optimizing compiler
Op       Freq    Cycle   New Freq
ALU      43%     1       27%
Load     21%     2       27%
Store    12%     2       15%
Branch   24%     2       31%
CPI      1.57            1.73
MIPS     31.8            28.9

ECE4680 Lec2 Performance.27
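
A sketch of the calculation behind this example, assuming the mix is expressed as counts per 100 original instructions (an illustrative choice; all names below are assumptions). It shows why MIPS and execution time disagree: the optimized code has a lower MIPS rating yet finishes sooner. With the unrounded CPI the optimized MIPS comes out near 29.0; the slide's 28.9 corresponds to dividing 50 by the rounded CPI of 1.73.

```python
CYCLES = {"ALU": 1, "Load": 2, "Store": 2, "Branch": 2}

# Counts per 100 instructions of the unoptimized program (the original frequencies).
BASE_COUNTS = {"ALU": 43.0, "Load": 21.0, "Store": 12.0, "Branch": 24.0}

# The optimizing compiler discards 50% of the ALU instructions.
OPT_COUNTS = dict(BASE_COUNTS, ALU=BASE_COUNTS["ALU"] / 2)


def total_cycles(counts: dict) -> float:
    """Clock cycles = sum of (instruction count x cycles) over all classes."""
    return sum(n * CYCLES[op] for op, n in counts.items())


def cpi(counts: dict) -> float:
    """Average CPI = clock cycles / instruction count."""
    return total_cycles(counts) / sum(counts.values())


def mips(clock_rate_mhz: float, cpi_value: float) -> float:
    """MIPS = clock rate / (CPI x 10^6), with the clock rate given in MHz."""
    return clock_rate_mhz / cpi_value


def exec_time_ns(counts: dict, cycle_time_ns: float) -> float:
    """Execution time = clock cycles x clock cycle time."""
    return total_cycles(counts) * cycle_time_ns


if __name__ == "__main__":
    for name, counts in (("unoptimized", BASE_COUNTS), ("optimized", OPT_COUNTS)):
        c = cpi(counts)
        print(f"{name}: CPI = {c:.2f}, MIPS = {mips(50.0, c):.1f}, "
              f"time = {exec_time_ns(counts, 20.0):.0f} ns per 100 original instructions")
```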

Why Do Benchmarks? Or How to evaluate an athlete?

Triathlon (3 sports): swimming, bicycling, running
Pentathlon (5 sports): sprinting, hurdling, long jumping, discus, javelin
Heptathlon (7 sports)
Decathlon (10 sports): 100-meter, 400-meter, 1,500-meter runs; 110-meter high hurdle; discus, javelin throws; shot-put; pole vault; high jump; long jump

ECE4680 Lec2 Performance.28

Why Do Benchmarks?
How we evaluate differences
  Different systems
  Changes to a single system
Provide a target
  Benchmarks should represent a large class of important programs
  Improving benchmark performance should help many programs
For better or worse, benchmarks shape a field
Good ones accelerate progress
  Good target for development
Bad benchmarks hurt progress
  Help real programs vs. sell machines/papers?
  Inventions that help real programs don't help the benchmark
ECE4680 Lec2 Performance.29

Programs to Evaluate Processor Performance (pp86-87)


(Toy) Benchmarks
  Small but easy to compile and run on simulators; convenient in the early design stage of a new machine
  No compiler for novel machines
  10-100 lines
  e.g., sieve, puzzle, quicksort
Synthetic Benchmarks
  Attempt to use a single benchmark to match average frequencies of real workloads or a set of benchmarks
  e.g., Whetstone (Algol60, Fortran), Dhrystone (Ada, C)
Kernels
  Time-critical excerpts of real programs
  Popular in scientific computing to illuminate performance of individual features of a machine
  e.g., Livermore loops (21 loops), Linpack (linear algebra)
Real programs
  e.g., gcc, spice
ECE4680 Lec2 Performance.30

Successful Benchmark: SPEC (pp87-89)


1987: RISC industry mired in "bench marketing" ("That is an 8 MIPS machine, but they claim 10 MIPS!")
EETimes (http://www.eet.com/) + 5 companies band together to form the Systems Performance Evaluation Committee (SPEC) in 1988: Sun, MIPS, HP, Apollo, DEC
Create a standard list of programs, inputs, reporting: some real programs, includes OS calls, some I/O

ECE4680 Lec2 Performance.31


SPEC first round


First round 1989; 10 programs = 4 for integer + 6 for FP, single number to summarize performance
One program (matrix300): 99% of time in a single line of code
New front-end compiler could improve it dramatically (Fig. 2.3)

[Figure: bar chart, vertical axis 0 to 800, one bar per benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

ECE4680 Lec2 Performance.32


SPEC Evolution
Second round: SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs)
  Compiler flags unlimited. March 93 of DEC 4000 Model 610:
    spice: unix.c:/def=(sysv,has_bcopy,bcopy(a,b,c)=memcpy(b,a,c )
    wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200
    nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas
  Add SPECbase: don't allow program-specific optimization flags.
Third round: 1995; new set of programs (8 int + 10 fp) (fig. 2.6, p72)
  Benchmarks useful for 3 years
  Base machine is changed from VAX-11/780 to Sun SPARC 10/40
Newer rounds include SPEC HPC96, SPEC JVM98, SPEC WEB99, SPEC OMP2001. See http://www.spec.org for more details.

ECE4680 Lec2 Performance.33


How to Summarize Results?

             Machine A   Machine B
Program 1    1 sec       10 sec
Program 2    1000 sec    100 sec

What are your conclusions?

A is 10 times faster than B for Program 1.
B is 10 times faster than A for Program 2.
Total execution time: a consistent summary measure
  B is 1001/110 = 9.1 times faster than A.
Workload: need to consider the percentage/frequency of each program in the total job.
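
A short sketch of the two summaries mentioned here: total execution time, and a workload-weighted time. The function names and any weights passed to `weighted_time` are assumptions; the slide only says that the frequency of each program in the workload should be considered.

```python
# Execution times in seconds from the slide: [program 1, program 2].
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0]}


def total_time(times: list) -> float:
    """Total execution time when each program runs once."""
    return sum(times)


def weighted_time(times: list, weights: list) -> float:
    """Workload-weighted execution time; the weights must sum to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))


# ExTime(A) / ExTime(B): how many times faster B is than A on total time.
speedup_b_over_a = total_time(TIMES["A"]) / total_time(TIMES["B"])

if __name__ == "__main__":
    print(f"B is {speedup_b_over_a:.1f} times faster than A on total time")
    # Hypothetical workload in which program 1 makes up 99% of the jobs.
    w = [0.99, 0.01]
    print(f"weighted time A = {weighted_time(TIMES['A'], w):.2f} s, "
          f"B = {weighted_time(TIMES['B'], w):.2f} s")
```

With the hypothetical 99%/1% workload the two machines come out almost even, which illustrates how strongly the conclusion depends on the workload.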

ECE4680 Lec2 Performance.34


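How to Summarize Performance

Suppose n programs have execution times $t_i$, where $i = 1, 2, \ldots, n$.
Suppose the workload is $(w_1, w_2, \ldots, w_n)$, where $\sum_{i=1}^{n} w_i = 1$.

Arithmetic Mean

$$AM(t_i) = \frac{1}{n}\sum_{i=1}^{n} t_i = \frac{1}{n}(t_1 + t_2 + \cdots + t_n)$$

Weighted Arithmetic Mean

$$WAM(t_i) = \sum_{i=1}^{n} t_i w_i = t_1 w_1 + t_2 w_2 + \cdots + t_n w_n$$

Geometric Mean

$$GM(t_i) = \sqrt[n]{\prod_{i=1}^{n} t_i} = \sqrt[n]{t_1 t_2 \cdots t_n} = t_1^{1/n}\, t_2^{1/n} \cdots t_n^{1/n}$$

Weighted Geometric Mean

$$WGM(t_i) = \prod_{i=1}^{n} t_i^{w_i} = t_1^{w_1}\, t_2^{w_2} \cdots t_n^{w_n}$$

Harmonic Mean

$$HM(t_i) = \frac{n}{\sum_{i=1}^{n} \frac{1}{t_i}} = \frac{n}{\frac{1}{t_1} + \frac{1}{t_2} + \cdots + \frac{1}{t_n}}$$

Weighted Harmonic Mean

$$WHM(t_i) = \frac{1}{\sum_{i=1}^{n} \frac{w_i}{t_i}} = \frac{1}{\frac{w_1}{t_1} + \frac{w_2}{t_2} + \cdots + \frac{w_n}{t_n}}$$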
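
The six means can be sketched with the standard library alone. This is an illustrative implementation under the slide's assumption that the weights sum to 1; the function names are not from the lecture, and the weighted harmonic mean uses the form 1 / sum(w_i / t_i), which reduces to the plain harmonic mean when every weight is 1/n.

```python
import math


def arithmetic_mean(t: list) -> float:
    return sum(t) / len(t)


def weighted_arithmetic_mean(t: list, w: list) -> float:
    return sum(ti * wi for ti, wi in zip(t, w))


def geometric_mean(t: list) -> float:
    # Computed through logarithms to avoid overflow in the product.
    return math.exp(sum(math.log(ti) for ti in t) / len(t))


def weighted_geometric_mean(t: list, w: list) -> float:
    return math.exp(sum(wi * math.log(ti) for ti, wi in zip(t, w)))


def harmonic_mean(t: list) -> float:
    return len(t) / sum(1.0 / ti for ti in t)


def weighted_harmonic_mean(t: list, w: list) -> float:
    return 1.0 / sum(wi / ti for ti, wi in zip(t, w))


if __name__ == "__main__":
    times = [1.0, 1000.0]   # sample execution times in seconds
    uniform = [0.5, 0.5]    # uniform workload: weights sum to 1
    print("AM ", arithmetic_mean(times), weighted_arithmetic_mean(times, uniform))
    print("GM ", geometric_mean(times), weighted_geometric_mean(times, uniform))
    print("HM ", harmonic_mean(times), weighted_harmonic_mean(times, uniform))
```

With uniform weights each weighted mean matches its unweighted counterpart, which is a quick sanity check on the definitions.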

ECE4680 Lec2 Performance.35

How to Summarize Performance


3 means: arithmetic, geometric and harmonic. Which one is correct?
Arithmetic mean (or weighted arithmetic mean) tracks execution time.
Harmonic mean (or weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time.
Normalized execution time is handy for scaling performance (e.g., time on reference machine / time on measured machine).
But do not take the arithmetic mean of normalized execution times; use the geometric mean.
The geometric mean rewards all improvements equally: program A going from 2 seconds to 1 second is as important as program B going from 2000 seconds to 1000 seconds.
See example p81.

ECE4680 Lec2 Performance.36


Impact of Means on SPECmark89 for IBM 550


             Ratio to VAX       Time               Weighted Time
Program      Before   After     Before   After     Before   After
gcc          30       29        49       51        8.91     9.22
espresso     35       34        65       67        7.64     7.86
spice        47       47        510      510       5.69     5.69
doduc        46       49        41       38        5.81     5.45
nasa7        78       144       258      140       3.43     1.86
li           34       34        183      183       7.86     7.86
eqntott      40       40        28       28        6.68     6.68
matrix300    78       730       58       6         3.43     0.37
fpppp        90       87        34       35        2.97     3.07
tomcatv      133      138       20       19        2.01     1.94
Mean         54       72        124      108       54.42    49.99
             (Geometric)        (Arithmetic)       (Weighted Arith.)
Ratio        1.33               1.16               1.09

ECE4680 Lec2 Performance.37

Example (p81)
                                         Normalized to A     Normalized to B
                  Time on A   Time on B   A       B           A       B
program1          1           10          1       10          0.1     1
program2          1000        100         1       0.1         10      1
Arithmetic mean   500.5       55          1       5.05        5.05    1
Geometric mean    31.6        31.6        1       1           1       1

The difficulty arises from the use of the arithmetic mean of normalized time. The geometric mean is independent of which data series we use for normalized time because

$$\frac{GM(x_i)}{GM(y_i)} = GM\left(\frac{x_i}{y_i}\right)$$
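
The table can be reproduced with a few lines of Python. This is a sketch with assumed helper names; it normalizes the times both ways and shows that the arithmetic mean of normalized times makes each machine look about 5 times slower than the other, while the geometric mean gives the same answer either way.

```python
import math

TIME_A = [1.0, 1000.0]   # program1, program2 on machine A
TIME_B = [10.0, 100.0]   # program1, program2 on machine B


def arithmetic_mean(values: list) -> float:
    return sum(values) / len(values)


def geometric_mean(values: list) -> float:
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize(times: list, reference: list) -> list:
    """Divide each time by the reference machine's time for the same program."""
    return [t / r for t, r in zip(times, reference)]


b_normalized_to_a = normalize(TIME_B, TIME_A)   # [10, 0.1]
a_normalized_to_b = normalize(TIME_A, TIME_B)   # [0.1, 10]

if __name__ == "__main__":
    print("AM of B normalized to A:", arithmetic_mean(b_normalized_to_a))
    print("AM of A normalized to B:", arithmetic_mean(a_normalized_to_b))
    print("GM of B normalized to A:", geometric_mean(b_normalized_to_a))
    print("GM of A normalized to B:", geometric_mean(a_normalized_to_b))
    print("GM(B) / GM(A):", geometric_mean(TIME_B) / geometric_mean(TIME_A))
```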
ECE4680 Lec2 Performance.38

Amdahl's Law
Speedup due to enhancement E:

$$\text{Speedup}(E) = \frac{\text{ExTime(without E)}}{\text{ExTime(with E)}} = \frac{\text{Performance(with E)}}{\text{Performance(without E)}}$$

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. Then

$$\text{ExTime(with E)} = \left((1 - F) + \frac{F}{S}\right) \times \text{ExTime(without E)}$$

$$\text{Speedup}(E) = \frac{\text{ExTime(without E)}}{\left((1 - F) + \frac{F}{S}\right) \times \text{ExTime(without E)}} = \frac{1}{(1 - F) + \frac{F}{S}}$$

ECE4680 Lec2 Performance.39

Example: Suppose a person wants to travel from city A to city B via city C. The routes from A to C are in mountains and the routes from C to B are in desert. The distances from A to C and from C to B are 80 miles and 200 miles, respectively.
  From A to C: walk at a speed of 4 mph
  From C to B: walk or drive (at a speed of 100 mph)
Question:
  How long will the entire trip take?
  How much faster is the trip from A to B when driving the desert leg as opposed to walking it?

ECE4680 Lec2 Performance.40


Example: Suppose an enhancement runs 10 times faster than the original machine, but is only usable 40% of the time.
Question: What is the overall speedup?
Answer:
  Fraction_enhanced = 0.4
  Speedup_enhanced = 10
  Speedup_overall = 1/(0.6 + 0.4/10) = 1.56
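
Amdahl's Law is a one-line function, sketched below with assumed names. The first call reproduces this example. The second applies the same law to the travel example from the previous slide, under the assumption (not stated on the slide) that walking the desert leg would also be at 4 mph, so the whole trip on foot takes 80/4 + 200/4 = 70 hours and the desert leg is the fraction 50/70 of that time.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = 1 / ((1 - F) + F / S)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# This slide's example: F = 0.4, S = 10.
machine_speedup = amdahl_speedup(0.4, 10.0)

# Travel example (assumption: the desert leg would be walked at 4 mph).
walk_mountain_h = 80.0 / 4.0      # 20 hours, unaffected by the car
walk_desert_h = 200.0 / 4.0       # 50 hours on foot
drive_desert_h = 200.0 / 100.0    # 2 hours by car
trip_walking_h = walk_mountain_h + walk_desert_h
trip_driving_h = walk_mountain_h + drive_desert_h
travel_speedup = amdahl_speedup(walk_desert_h / trip_walking_h,
                                walk_desert_h / drive_desert_h)

if __name__ == "__main__":
    print(f"machine example: overall speedup = {machine_speedup:.2f}")
    print(f"trip: {trip_walking_h:.0f} h walking, {trip_driving_h:.0f} h with the car, "
          f"speedup = {travel_speedup:.2f}")
```

Even though the car is 25 times faster than walking, the trip as a whole speeds up only about 3.2 times, because the 20 hours in the mountains are unimproved.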

ECE4680 Lec2 Performance.41


Cost Summary
Integrated circuits are driving the computer industry
Die cost goes up roughly with the cube of the die area

ECE4680 Lec2 Performance.42


Performance Evaluation Summary


$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Time is the measure of computer performance!
Good products are created when we have:
  Good benchmarks
  Good ways to summarize performance
Without good benchmarks and summaries, the choice is between improving the product for real programs vs. improving the product to get more sales => sales almost always wins.
Remember Amdahl's Law: speedup is limited by the unimproved part of the program.

ECE4680 Lec2 Performance.43


Homework, due Feb. 3 at class time (Monday)
  Questions 1.1 through 1.26
  Questions 2.1 through 2.4
  Questions 2.10 through 2.13
  Questions 2.18 through 2.20
  Question 2.41

ECE4680 Lec2 Performance.44

