
Computer Architecture 13-14

Chapter 1: Measuring & Understanding Performance

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2009, MK]

Chapter 1. Measuring and understanding performance

Dept. of Computer Architecture, UMA, Sept 2013

What you Should Know from Last Year


Basic logic design & machine organization
logical minimization, component design
processor, memory, I/O

Create, assemble, run, debug programs in an assembly language
MIPS preferred

Create, simulate, and debug hardware structures in a simulator
Xilinx

Create, compile, and run C (C++, Java) programs


Review: Some Basic Definitions


Kilobyte: $2^{10}$ or 1,024 bytes

Megabyte: $2^{20}$ or 1,048,576 bytes
sometimes rounded to $10^{6}$ or 1,000,000 bytes

Gigabyte: $2^{30}$ or 1,073,741,824 bytes
sometimes rounded to $10^{9}$ or 1,000,000,000 bytes

Terabyte: $2^{40}$ or 1,099,511,627,776 bytes
sometimes rounded to $10^{12}$ or 1,000,000,000,000 bytes

Petabyte: $2^{50}$ or 1024 terabytes
sometimes rounded to $10^{15}$ or 1,000,000,000,000,000 bytes

Exabyte: $2^{60}$ or 1024 petabytes
sometimes rounded to $10^{18}$ or 1,000,000,000,000,000,000 bytes

Classes of Computers
1. Desktop computers (e.g., PC, laptop)
   1. good performance
   2. single user
   3. low cost
   4. graphics display, a keyboard, and a mouse

2. Servers (e.g., workstations, web servers, file storage)
   1. larger programs
   2. simultaneous users
   3. network
   4. security


Classes of Computers
3. Supercomputers (weather, oil exploration, protein structure)
   1. hundreds to thousands of processors
   2. terabytes of memory and petabytes of storage
   3. high-end scientific and engineering applications

4. Embedded computers (processors) (e.g., cell phones, cars, video games, TV, digital cameras)
   - A computer inside another device used for running one predetermined application


The Computer Revolution


Progress in computer technology
Underpinned by Moore's Law

Makes novel applications feasible


Computers in automobiles
Cell phones
Human genome project
World Wide Web
Search Engines

Computers are pervasive


The Processor Market


embedded growth >> desktop growth

Where else are embedded processors found?



What You Will Learn


How programs are translated into the machine language
And how the hardware executes them

About the hardware/software interface


What determines program performance
And how it can be improved

How hardware designers improve performance


What parallel processing is


Understanding Performance
Algorithm
Determines number of operations executed

Programming language, compiler, architecture


Determine number of machine instructions executed
per operation

Processor and memory system


Determine how fast instructions are executed

I/O system (including OS)


Determines how fast I/O operations are executed


Underneath the Program


Applications software
Written in high-level language

Systems software

Hardware

System software
Operating system: supervising program that interfaces the user's program with the hardware (e.g., Linux, MacOS, Windows)
- Handles basic input and output operations
- Allocates storage and memory
- Schedules tasks and provides for protected sharing among multiple applications

Compiler: translates programs written in a high-level language (e.g., C, Java) into instructions that the hardware can execute

Below the Program, Cont


High-level language program (in C)
void swap(int v[], int k)
{
    int temp;          /* holds v[k] while the two elements are exchanged */
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
}

one-to-many
C compiler

Assembly language program (for MIPS)


swap:
    sll  $2, $5, 2
    add  $2, $4, $2
    lw   $15, 0($2)
    lw   $16, 4($2)
    sw   $16, 0($2)
    sw   $15, 4($2)
    jr   $31

one-to-one
assembler

Machine (object, binary) code (for MIPS)


000000 00000 00101 0001000010000000
000000 00100 00010 0001000000100000
. . .

Advantages of Higher-Level Languages ?


Higher-level languages

Allow the programmer to think in a more natural language


Improve programmer productivity
Improve program maintainability
Allow programs to be independent of the computer on which
they are developed
Emergence of optimizing compilers that produce very efficient
assembly code optimized for the target machine

As a result, very little programming is done today at the assembler level


Under the Covers


Same components for all kinds of computer
Five classic components of a computer: input, output, memory, datapath, and control
datapath + control = processor (CPU)

Input/output includes
User-interface devices
- Display, keyboard, mouse
Storage devices
- Hard disk, CD/DVD, flash
Network adapters
- For communicating with other computers

Opening the Box


AMD's Barcelona Multicore Chip

[Figure: die photo labeling Core 1 to Core 4, a 512KB L2 cache per core, the 2MB shared L3 cache, and the Northbridge]

- Four out-of-order cores on one chip
- 1.9 GHz clock rate
- 65nm technology
- Three levels of caches (L1, L2, L3) on chip
- Integrated Northbridge

http://www.techwarelabs.com/reviews/processors/barcelona/

A Safe Place for Data


Volatile main memory
Loses instructions and data when power off

Non-volatile secondary memory


Magnetic disk
Flash memory
Optical disk (CDROM, DVD)


Networks
Communication and resource sharing
Local area network (LAN): Ethernet
Within a building

Wide area network (WAN): the Internet


Wireless network: WiFi, Bluetooth


The BIG Picture


Abstraction helps us deal with complexity
Hide lower-level detail

Instruction set architecture (ISA)


The hardware/software interface

Application binary interface


The ISA plus system software interface

Implementation
The details underlying an interface


Technology Trends
Electronics technology continues to evolve
Increased capacity and performance
Reduced cost

Year   Technology                   Relative performance/cost
1951   Vacuum tube                  1
1965   Transistor                   35
1975   Integrated circuit (IC)      900
1995   Very large scale IC (VLSI)   2,400,000
2005   Ultra large scale IC         6,200,000,000

[Figure: DRAM capacity]

Moore's Law
In 1965, Intel's Gordon Moore predicted that the number of transistors that can be integrated on a single chip would double about every two years

[Figure: Dual Core Itanium with 1.7B transistors; feature size & die size]

Courtesy, Intel

Technology Scaling Road Map (ITRS)

Year                2004   2006   2008   2010   2012
Feature size (nm)     90     65     45     32     22

Intg. Capacity (BT): 16, 32

Fun facts about 45nm transistors


30 million can fit on the head of a pin
You could fit more than 2,000 across the width of a human
hair
If car prices had fallen at the same rate as the price of a
single transistor has since 1968, a new car today would cost
about 1 cent


Another Example of Moore's Law Impact


DRAM capacity growth over 3 decades
[Figure: DRAM chip capacity by generation: 16K, 64K, 256K, 1M, 4M, 16M, 64M, 128M, 256M, 512M, 1G]


But What Happened to Clock Rates and Why?

Power (Watts)

Clock rates hit a "power wall"


Tendency

"For the P6, success criteria included performance above a certain level and failure criteria included power dissipation above some threshold."
Bob Colwell, Pentium Chronicles


A Sea Change is at Hand


The power challenge has forced a change in the design
of microprocessors
Since 2002 the rate of improvement in the response time of
programs on desktop computers has slowed from a factor of 1.5
per year to less than a factor of 1.2 per year

As of 2006 all desktop and server companies are shipping microprocessors with multiple processors (cores) per chip

Product          Cores per chip   Clock rate   Power
AMD Barcelona                     2.5 GHz      120 W
Intel Nehalem                     ~2.5 GHz     ~100 W?
IBM Power 6                       4.7 GHz      ~100 W
Sun Niagara 2                     1.4 GHz      94 W

Plan of record is to double the number of cores per chip per generation (about every two years)

What Are Performance Metrics?


Purchasing perspective
given a collection of machines, which has the
- best performance ?
- least cost ?
- best cost/performance?

Design perspective
faced with design options, which has the
- best performance improvement ?
- least cost ?
- best cost/performance?

Both require
basis for comparison
metric for evaluation

Our goal is to understand which factors affect performance and their relative weight

Comparing Throughput and Response Time


Response time (execution time): the time between the start and the completion of a task
Important to individual users

Throughput (bandwidth): the total amount of work done in a given unit time
Important to data center managers

Comparison
We will need different performance metrics as well as
a different set of applications to benchmark
embedded and desktop computers, which are more
focused on response time, versus servers, which are
more focused on throughput
How are response time and throughput affected by
o Replacing the processor with a faster version?
o Adding more processors?

We'll focus on response time for now



What is Speed
To maximize performance, need to minimize execution
time
$$\mathrm{performance}_X = \frac{1}{\mathrm{execution\_time}_X}$$
If X is n times faster than Y, then
$$\frac{\mathrm{performance}_X}{\mathrm{performance}_Y} = \frac{\mathrm{execution\_time}_Y}{\mathrm{execution\_time}_X} = n$$
Decreasing response time frequently improves throughput


Relative Performance Example


If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
We know that A is n times faster than B if
$$\frac{\mathrm{performance}_A}{\mathrm{performance}_B} = \frac{\mathrm{execution\_time}_B}{\mathrm{execution\_time}_A} = n$$
The performance ratio is
$$\frac{15}{10} = 1.5$$

So A is 1.5 times faster than B
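
The ratio above can be checked with a few lines of Python. This is only a sketch: the function name `relative_performance` is an illustrative choice, not something defined in the slides.

```python
def relative_performance(time_x: float, time_y: float) -> float:
    """Return n such that X is n times faster than Y (n = time_y / time_x)."""
    if time_x <= 0 or time_y <= 0:
        raise ValueError("execution times must be positive")
    return time_y / time_x


# Values from the example: A runs the program in 10 s, B in 15 s.
n = relative_performance(10.0, 15.0)
print(f"A is {n} times faster than B")
```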


Measuring Execution Time


Elapsed time
Total response time = wall-clock time = elapsed time
- it includes all aspects to complete a task
- Processing, I/O operations, OS overhead, idle time

Determines system performance

Productivity
Throughput: the total amount of work done in a given unit time


Measuring Execution Time


CPU time
Time spent processing a given job
- Subtract I/O time, other jobs' shares

Comprises user CPU time and system CPU time
Different programs are affected differently by CPU and system performance

Example: time in Unix:

90.7u 12.9s 2:39 65%

- 90.7u: user CPU time (seconds)
- 12.9s: system CPU time (seconds)
- 2:39: elapsed time (minutes:seconds)
- 65%: CPU utilization = (90.7 + 12.9) / (2*60 + 39)
- The rest of the elapsed time goes to I/O and other processes

Our goal: user CPU time + system CPU time
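
As a quick check of the utilization figure, the sketch below recomputes it from the three numbers in the `time` output. The helper name `cpu_utilization` is hypothetical.

```python
def cpu_utilization(user_s: float, system_s: float, elapsed_s: float) -> float:
    """Fraction of elapsed time spent as CPU time (user + system)."""
    return (user_s + system_s) / elapsed_s


elapsed = 2 * 60 + 39                      # 2:39 expressed in seconds
util = cpu_utilization(90.7, 12.9, elapsed)
print(f"CPU time = {90.7 + 12.9:.1f} s of {elapsed} s elapsed, utilization = {util:.0%}")
```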



Review: Machine Clock Rate


Clock rate (clock cycles per second in MHz or GHz) is the inverse of clock cycle time (clock period)
$$CC = \frac{1}{CR}$$

[Figure: one clock period]

10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec ($10^{-9}$ s) clock cycle => 1 GHz ($10^{9}$ Hz) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
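
The conversions in this list all follow from CC = 1 / CR. A minimal sketch, with `clock_rate_hz` as an illustrative helper name:

```python
def clock_rate_hz(cycle_time_s: float) -> float:
    """Clock rate in Hz is the inverse of the clock cycle time in seconds."""
    return 1.0 / cycle_time_s


# Cycle times taken from the list above.
for label, seconds in (("10 nsec", 10e-9), ("1 nsec", 1e-9), ("250 psec", 250e-12)):
    print(f"{label} clock cycle => {clock_rate_hz(seconds) / 1e6:,.0f} MHz")
```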

Performance Factors
CPU execution time (CPU time): time the CPU spends working on a task
Does not include time waiting for I/O or running other programs

$$\text{CPU execution time for a program} = \text{\# CPU clock cycles for a program} \times \text{clock cycle time}$$
or
$$\text{CPU execution time for a program} = \frac{\text{\# CPU clock cycles for a program}}{\text{clock rate}}$$

Can improve performance by increasing the clock rate or by reducing the number of clock cycles required for a program
Hardware designer must often trade off clock rate against cycle count

Improving Performance Example


A program runs on computer A with a 2 GHz clock in 10
seconds. What clock rate must computer B run at to run
this program in 6 seconds? Unfortunately, to accomplish
this, computer B will require 1.2 times as many clock
cycles as computer A to run the program.
$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{clock rate}_A}$$
$$\text{CPU clock cycles}_A = 10\ \text{sec} \times 2 \times 10^{9}\ \text{cycles/sec} = 20 \times 10^{9}\ \text{cycles}$$
$$\text{CPU time}_B = \frac{1.2 \times 20 \times 10^{9}\ \text{cycles}}{\text{clock rate}_B}$$
$$\text{clock rate}_B = \frac{1.2 \times 20 \times 10^{9}\ \text{cycles}}{6\ \text{seconds}} = 4\ \text{GHz}$$
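
The same arithmetic as a short Python sketch. The helper name `required_clock_rate` is an assumption made for illustration.

```python
def required_clock_rate(clock_cycles: float, target_seconds: float) -> float:
    """Clock rate (Hz) needed to finish clock_cycles in target_seconds."""
    return clock_cycles / target_seconds


cycles_a = 10 * 2e9          # computer A: 10 s at 2 GHz
cycles_b = 1.2 * cycles_a    # computer B needs 1.2 times as many cycles
rate_b = required_clock_rate(cycles_b, 6)
print(f"clock rate B = {rate_b / 1e9:.1f} GHz")
```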

Clock Cycles per Instruction


Not all instructions take the same amount of time to execute
One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction

$$\text{\# CPU clock cycles for a program} = \text{\# instructions for a program} \times \text{average clock cycles per instruction}$$

Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute
A way to compare two different implementations of the same ISA

        CPI for this instruction class
        A    B    C
CPI     1    2    3


Using the Performance Equation


Computers A and B implement the same ISA. Computer
A has a clock cycle time of 250 ps and an effective CPI of
2.0 for some program and computer B has a clock cycle
time of 500 ps and an effective CPI of 1.2 for the same
program. Which computer is faster and by how much?
Each computer executes the same number of instructions, I,
so
CPU timeA = I x 2.0 x 250 ps = 500 x I ps
CPU timeB = I x 1.2 x 500 ps = 600 x I ps
Clearly, A is faster by the ratio of execution times
$$\frac{\mathrm{performance}_A}{\mathrm{performance}_B} = \frac{\mathrm{execution\_time}_B}{\mathrm{execution\_time}_A} = \frac{600 \times I\ \text{ps}}{500 \times I\ \text{ps}} = 1.2$$
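
A small sketch of this comparison in Python. The instruction count `I` cancels in the ratio, so the value used here (one billion) is an arbitrary assumption. The function name `cpu_time` is illustrative.

```python
def cpu_time(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """CPU time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s


I = 1e9                              # assumed instruction count; it cancels out
time_a = cpu_time(I, 2.0, 250e-12)   # computer A: CPI 2.0, 250 ps cycle
time_b = cpu_time(I, 1.2, 500e-12)   # computer B: CPI 1.2, 500 ps cycle
ratio = time_b / time_a
print(f"A is {ratio:.1f} times faster than B")
```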

Effective (Average) CPI


Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging

$$\text{Overall effective CPI} = \sum_{i=1}^{n} (CPI_i \times IC_i)$$

Where $IC_i$ is the count (percentage) of the number of instructions of class i executed
$CPI_i$ is the (average) number of clock cycles per instruction for that instruction class
n is the number of instruction classes

The overall effective CPI varies by instruction mix: a measure of the dynamic frequency of instructions across one or many programs

THE Performance Equation


Our basic performance equation is then
$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$
or
$$\text{CPU time} = \frac{\text{Instruction\_count} \times \text{CPI}}{\text{clock\_rate}}$$

These equations separate the three key factors that affect performance
Can measure the CPU execution time by running the program
The clock rate is usually given
Can measure overall instruction count by using profilers/simulators without knowing all of the implementation details
CPI varies by instruction type and ISA implementation for which we must know the implementation details

Determinants of CPU Performance

$$\text{CPU time} = \text{Instruction\_count} \times \text{CPI} \times \text{clock\_cycle}$$

                       Instruction_count   CPI   clock_cycle
Algorithm
Programming language
Compiler
ISA
Core organization
Technology                                       X

A Simple Example
Op       Freq   CPI_i   Freq x CPI_i   Load = 2 cycles   Branch = 1 cycle   Two ALU at once
ALU      50%    1       .5             .5                .5                 .25
Load     20%    5       1.0            .4                1.0                1.0
Store    10%    3       .3             .3                .3                 .3
Branch   20%    2       .4             .4                .2                 .4
Sum                     2.2            1.6               2.0                1.95

How much faster would the machine be if a better data cache reduced the average load time to 2 cycles?
CPU time new = 1.6 x IC x CC so 2.2/1.6 means 37.5% faster

How does this compare with using branch prediction to shave a cycle off the branch time?
CPU time new = 2.0 x IC x CC so 2.2/2.0 means 10% faster

What if two ALU instructions could be executed at once?
CPU time new = 1.95 x IC x CC so 2.2/1.95 means 12.8% faster
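
The three what-if cases can be reproduced with a short sketch. The per-class CPIs used here (ALU 1, Load 5, Store 3, Branch 2) are inferred from the Freq and Freq x CPI_i values in the example, and the names `effective_cpi` and `base` are illustrative.

```python
def effective_cpi(mix: dict) -> float:
    """Sum of frequency x CPI over all instruction classes."""
    return sum(freq * cpi for freq, cpi in mix.values())


# (frequency, CPI) per class; the CPIs are inferred from Freq x CPI_i / Freq.
base = {"ALU": (0.5, 1), "Load": (0.2, 5), "Store": (0.1, 3), "Branch": (0.2, 2)}
better_cache = {**base, "Load": (0.2, 2)}      # loads take 2 cycles
branch_pred = {**base, "Branch": (0.2, 1)}     # one cycle off each branch
dual_alu = {**base, "ALU": (0.5, 0.5)}         # two ALU instructions at once

for name, mix in (("better cache", better_cache),
                  ("branch prediction", branch_pred),
                  ("dual ALU", dual_alu)):
    speedup = effective_cpi(base) / effective_cpi(mix)
    print(f"{name}: CPI {effective_cpi(mix):.2f}, {100 * (speedup - 1):.1f}% faster")
```

Instruction count and clock cycle time are unchanged in all three cases, so the CPU time ratio reduces to the CPI ratio.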

Summary: Evaluating ISAs


Design-time metrics:
Can it be implemented, in how long, at what cost?
Can it be programmed? Ease of compilation?

Static Metrics:
How many bytes does the program occupy in memory?

Dynamic Metrics:
How many instructions are executed? How many bytes does the processor fetch to execute the program?
How many clocks are required per instruction?
How "lean" a clock is practical?

Best Metric: Time to execute the program!
Depends on the instruction set, the processor organization, and compilation techniques.

[Figure: Inst. Count, CPI, Cycle Time]

Warning: MIPS as a Performance Metric

MIPS: Millions of Instructions Per Second
Doesn't account for
- Differences in ISAs between computers
- Differences in complexity between instructions

$$\text{MIPS} = \frac{\text{Instruction count}}{\text{Execution time} \times 10^{6}} = \frac{\text{Instruction count}}{\dfrac{\text{Instruction count} \times \text{CPI}}{\text{Clock rate}} \times 10^{6}} = \frac{\text{Clock rate}}{\text{CPI} \times 10^{6}}$$

CPI varies between programs on a given CPU
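
To see why this matters, the sketch below evaluates both forms of the MIPS formula. The 4 GHz clock and the two CPIs are made-up values for illustration, and the function names are hypothetical.

```python
def mips_from_counts(instruction_count: float, execution_time_s: float) -> float:
    """MIPS = instruction count / (execution time x 10^6)."""
    return instruction_count / (execution_time_s * 1e6)


def mips_from_rate(clock_rate_hz: float, cpi: float) -> float:
    """Equivalent form: MIPS = clock rate / (CPI x 10^6)."""
    return clock_rate_hz / (cpi * 1e6)


clock_rate = 4e9                                 # assumed 4 GHz CPU
for program, cpi in (("program 1", 1.0), ("program 2", 2.0)):
    print(f"{program}: CPI {cpi} => {mips_from_rate(clock_rate, cpi):.0f} MIPS")
```

The same CPU reports a different MIPS rating for each program, which is why a single MIPS figure says little about a machine.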



Workloads and Benchmarks


Benchmarks: a set of programs that form a "workload" specifically chosen to measure performance
SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks starting with SPEC89. The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).
www.spec.org

There are also benchmark collections for power workloads (SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), ...

SPEC CPU Benchmark


Programs used to measure performance
Supposedly typical of actual workload

Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, ...

SPEC CPU2006
Elapsed time to execute a selection of programs
- Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarize as geometric mean of performance ratios
- CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$


Comparing and Summarizing Performance


How do we summarize the performance for a benchmark set with a single number?
First the execution times are normalized to a reference machine, giving the "SPEC ratio" (bigger is faster, i.e., the SPEC ratio is inversely proportional to execution time)
The SPEC ratios are then "averaged" using the geometric mean (GM)

$$GM = \sqrt[n]{\prod_{i=1}^{n} \text{SPEC ratio}_i}$$

Guiding principle in reporting performance measurements is reproducibility: list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

CINT2006 for Opteron X4 2356


Name         Description                     IC x 10^9   CPI     Tc (ns)   Exec time   Ref time   SPECratio
perl         Interpreted string processing   2,118       0.75    0.40      637         9,777      15.3
bzip2        Block-sorting compression       2,389       0.85    0.40      817         9,650      11.8
gcc          GNU C Compiler                  1,050       1.72    0.40      724         8,050      11.1
mcf          Combinatorial optimization      336         10.00   0.40      1,345       9,120      6.8
go           Go game (AI)                    1,658       1.09    0.40      721         10,490     14.6
hmmer        Search gene sequence            2,783       0.80    0.40      890         9,330      10.5
sjeng        Chess game (AI)                 2,176       0.96    0.40      837         12,100     14.5
libquantum   Quantum computer simulation     1,623       1.61    0.40      1,047       20,720     19.8
h264avc      Video compression               3,102       0.80    0.40      993         22,130     22.3
omnetpp      Discrete event simulation       587         2.94    0.40      690         6,250      9.1
astar        Games/path finding              1,082       1.79    0.40      773         7,020      9.1
xalancbmk    XML parsing                     1,058       2.70    0.40      1,143       6,900      6.0
Geometric mean                                                                                    11.7

High cache miss rates
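
The geometric mean in the last row can be recomputed from the SPECratio column. This is a sketch, and `geometric_mean` and `spec_ratio` are illustrative helper names rather than anything from the SPEC tools.

```python
import math


def spec_ratio(ref_time_s: float, exec_time_s: float) -> float:
    """SPEC ratio = reference time / measured execution time."""
    return ref_time_s / exec_time_s


def geometric_mean(values: list) -> float:
    """n-th root of the product, computed via logs to avoid overflow."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# SPECratio column of the CINT2006 table, perl through xalancbmk.
ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]
gm = geometric_mean(ratios)
print(f"perl SPEC ratio = {spec_ratio(9777, 637):.1f}, geometric mean = {gm:.1f}")
```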


Other Performance Metrics


Power consumption: especially in the embedded market where battery life is important
For power-limited applications, the most important metric is energy efficiency


Power Trends

In CMOS IC technology

$$\text{Power} = \text{Capacitive load} \times \text{Voltage}^{2} \times \text{Frequency}$$

[Figure annotations: power x30; voltage 5V -> 1V; frequency x1000]

Reducing Power
Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction

$$\frac{P_{new}}{P_{old}} = \frac{C_{old} \times 0.85 \times (V_{old} \times 0.85)^{2} \times F_{old} \times 0.85}{C_{old} \times V_{old}^{2} \times F_{old}} = 0.85^{4} = 0.52$$

The power wall
We can't reduce voltage further
We can't remove more heat

How else can we improve performance?
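
The 0.52 figure follows directly from the CMOS power relation. A minimal sketch, assuming normalized old values of 1.0 for load, voltage, and frequency (the absolute values cancel in the ratio):

```python
def dynamic_power(capacitive_load: float, voltage: float, frequency: float) -> float:
    """Power = capacitive load x voltage^2 x frequency."""
    return capacitive_load * voltage ** 2 * frequency


# Old CPU normalized to 1.0; new CPU scales each factor by 0.85.
p_old = dynamic_power(1.0, 1.0, 1.0)
p_new = dynamic_power(0.85, 0.85, 0.85)
power_ratio = p_new / p_old
print(f"P_new / P_old = {power_ratio:.2f}")
```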


Warning: Low Power at Idle


X4 power benchmark
At 100% load: 295W
At 50% load: 246W (83%)
At 10% load: 180W (61%)

Google data center


Mostly operates at 10% to 50% load
At 100% load less than 1% of the time

Consider designing processors to make power proportional to load


Amdahl's Law
Pitfall: Improving an aspect of a computer and expecting a proportional improvement in overall performance

$$S = \frac{T_{CPU\_org}}{T_{CPU\_imp}} = \frac{1}{\dfrac{F_m}{S_m} + (1 - F_m)}$$

$F_m$ = fraction of the original execution time affected by the improvement
$S_m$ = factor of improvement (speedup of the improved part)

The opportunity for improvement is affected by how much time the modified event consumes
Corollary 1: make the common case fast

The performance enhancement possible with a given improvement is limited by the amount the improved feature is used
Corollary 2: the bottleneck will limit the improvement

Amdahl's Law
Example: multiply accounts for 80s/100s
How much improvement in multiply performance to get 5x overall?
$$20 = \frac{80}{n} + 20$$
Can't be done!

Study the limit cases in the Amdahl law

$F_m = 0$: $S = \dfrac{1}{0 + (1)} = 1$

$S_m = 1$: $S = \dfrac{1}{F_m + (1 - F_m)} = 1$

$F_m = 1$: $S = \dfrac{1}{1/S_m + (1 - 1)} = S_m$

$S_m = \infty$: $S = \dfrac{1}{0 + (1 - F_m)}$
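
The multiply example and the limit cases can be explored numerically. This is a sketch of the speedup formula, with `amdahl_speedup` as an illustrative function name.

```python
import math


def amdahl_speedup(fm: float, sm: float) -> float:
    """Overall speedup when a fraction fm of the time is sped up by factor sm."""
    return 1.0 / (fm / sm + (1.0 - fm))


# Multiply takes 80 s of 100 s, so fm = 0.8. The speedup approaches 5 but
# only reaches it in the limit of an infinitely fast multiplier.
for sm in (2, 10, 100, 1e6):
    print(f"Sm = {sm:>9}: S = {amdahl_speedup(0.8, sm):.4f}")
print(f"Sm = inf: S = {amdahl_speedup(0.8, math.inf):.4f}")
```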

Tendency: the switch to multiprocessors


[Figure: Uniprocessor Performance]

Constrained by power, instruction-level parallelism, memory latency

Why Multicore improves performance


Multicore microprocessors
More than one processor per chip

Requires explicitly parallel programming


Compare with instruction level parallelism
- Hardware executes multiple instructions at once
- Hidden from the programmer

Hard to do
- Programming for performance
- Load balancing
- Optimizing communication and synchronization


Manufacturing ICs

Yield: proportion of working dies per wafer


AMD Opteron X2 Wafer

X2: 300mm wafer, 117 chips, 90nm technology


X4: 45nm technology

Integrated Circuit Cost


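$$\text{Cost per die} = \frac{\text{Cost per wafer}}{\text{Dies per wafer} \times \text{Yield}}$$
$$\text{Dies per wafer} \approx \frac{\text{Wafer area}}{\text{Die area}}$$
$$\text{Yield} = \frac{1}{(1 + (\text{Defects per area} \times \text{Die area}/2))^{2}}$$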

Nonlinear relation to area and defect rate


Wafer cost and area are fixed
Defect rate determined by manufacturing process
Die area determined by architecture and circuit design
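
A small sketch of these cost relations shows the nonlinearity. All numbers below (wafer cost, areas, defect density) are made-up values for illustration, and the function names are hypothetical.

```python
def dies_per_wafer(wafer_area: float, die_area: float) -> float:
    """Approximation: wafer area / die area (ignores edge losses)."""
    return wafer_area / die_area


def die_yield(defects_per_area: float, die_area: float) -> float:
    """Yield = 1 / (1 + defects per area x die area / 2)^2."""
    return 1.0 / (1.0 + defects_per_area * die_area / 2.0) ** 2


def cost_per_die(wafer_cost: float, wafer_area: float,
                 die_area: float, defects_per_area: float) -> float:
    """Cost per die = cost per wafer / (dies per wafer x yield)."""
    good_dies = dies_per_wafer(wafer_area, die_area) * die_yield(defects_per_area, die_area)
    return wafer_cost / good_dies


# Hypothetical wafer: cost 1000, area 100 units, 0.5 defects per unit area.
small = cost_per_die(1000.0, 100.0, 1.0, 0.5)
large = cost_per_die(1000.0, 100.0, 2.0, 0.5)
print(f"die area 1: {small:.2f} per die, die area 2: {large:.2f} per die")
```

With these assumed numbers, doubling the die area more than doubles the cost per die, because fewer dies fit on the wafer and a smaller fraction of them work.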


Concluding Remarks
Cost/performance is improving
Due to underlying technology development

Hierarchical layers of abstraction


In both hardware and software

Instruction set architecture


The hardware/software interface

Execution time: the best performance measure


Power is a limiting factor
Use parallelism to improve performance
