Ca 2012 03 12 Intro

Computer Architecture 2012 Introduction (lec1)
1
Computer Architecture
(MAMAS, 234267)
Spring 2012

Lecturer: Dan Tsafrir
Reception: Mon 18:30, Taub 611

12/3/2012
Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz
2
General Info
Grade
20% Exercise (mandatory)
80% Final exam

Textbook
Computer Architecture: A Quantitative Approach (4th Edition)
by: Hennessy & Patterson

Other course information
Course web site:
http://webcourse.cs.technion.ac.il/234267/Spring2012
Lectures will be uploaded to the web a day before the class
3
Computer System Structure
4
Classical Motherboard Diagram
[Figure: block diagram of a classical motherboard. Labeled components: CPU and cache; CPU bus; North Bridge with memory controller, on-board graphics, and IOMMU; memory bus to DDR2 or DDR3 channels 1 and 2; PCI Express 2.0 and external graphics card; South Bridge with I/O controller, USB controller, and SATA controller; PCI; PCI Express 1; LAN adapter; sound card and speakers; hard disk; DVD drive; floppy drive; parallel port; serial port; keyboard; mouse.]
More to the north
= closer to the CPU
= faster
5
Intel Core 2
Northbridge = MCH = memory controller hub
Southbridge = ICH = I/O controller hub
Notice bandwidths
65 to 45 nm
6
Intel Nehalem Core i3 i5 i7
For high-end i-Series chips, Northbridge functionality moved onto the processor (=> made faster)
45 to 32 nm
7
Intel Sandy Bridge Core i3 i5 i7
The trend
continues
32 to 22 nm
8
9
Course Focus
Start from CPU (=processor)
Instruction set, performance
Pipeline, hazards
Branch prediction
Out-of-order execution
Move on to Memory Hierarchy
Caching
Main memory
Virtual Memory
Move on to PC Architecture
Motherboard & chipset, DRAM, I/O, Disk, peripherals
End with some Advanced Topics
10
The Processor
11
Architecture vs. Microarchitecture
Architecture:
= The processor features as seen by its user
= Interface
Instruction set, number of registers, addressing modes, etc.

Microarchitecture:
= Manner by which the processor is implemented
= Implementation details
Cache size and structure, number of execution units, etc.

Note: different processors with different u-archs
can support the same arch
Example: Intel Pentium-IV vs. Intel Core2 Duo

We will address both

12
Why Should We Care?

Abstractions enhance productivity, so:
If we know the arch (=interface),
Why should we care about the u-arch (=internals)?

Same goes for arch
Just details for a programmer of a high-level language

Abstractions only work so long as what's below works
The taxi story: http://vimeo.com/11478146 (4:50-6:00)

13
Recent Processor Trends
Source: http://www.scidacreview.org/0904/html/multicore.html
14
Well-Known Moore's Law
Graph taken from: http://www.intel.com/technology/mooreslaw/index.htm
15
16
The Story in a Nutshell
[Figure: trends over time in transistors (1000s), clock speed (MHz), power (W), and instructions/cycle (ILP)]
17
Took the Industry by Surprise
18
Dire Implications: Performance
19
Dire Implications: Sales
20
Dire Implications: Sales
21
Dire Implications: Programmers
22
Supercomputing: Top 500 list
23
Dire Implications: Supercomputing
24
Processor Performance
25
Metrics: IC, CPI, IPC
CPUs work according to a clock signal
Clock cycle: measured in nanoseconds ($10^{-9}$ of a second)
Clock frequency = 1/|clock cycle|: in GHz ($10^{9}$ cycles/sec)

Instruction Count (IC)
Total number of instructions executed in the program

Cycles Per Instruction (CPI)
Average #cycles per Instruction (in a given program)

IPC (= 1/CPI): Instructions per cycle
Can be > 1; see "The Story in a Nutshell" slide
$CPI = \frac{\text{\#cycles required to execute the program}}{IC}$
26
Minimizing Execution Time
CPU Time - time required to execute a program

CPU Time = IC × CPI × clock cycle

Our goal:
minimize CPU Time (any of above components)
Minimize clock cycle: increase GHz (processor design)
Minimize CPI: u-arch (e.g.: more execution units)
Minimize IC: arch + u-arch (e.g.: SSE™)

SSE = streaming SIMD extension (Intel)
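
The CPU Time relation above can be sketched in a few lines of Python. The function name and the program figures (2 billion instructions, CPI of 1.5, 2 GHz clock) are hypothetical and chosen only for illustration; the sketch shows that halving any one of the three factors halves the run time.

```python
def cpu_time_seconds(instruction_count, cpi, clock_hz):
    """CPU Time = IC x CPI x clock cycle, where clock cycle = 1 / frequency."""
    clock_cycle = 1.0 / clock_hz  # seconds per cycle
    return instruction_count * cpi * clock_cycle


# Hypothetical program: 2e9 instructions, CPI = 1.5, 2 GHz clock.
baseline = cpu_time_seconds(2e9, 1.5, 2e9)

# Halve one factor at a time (IC, then CPI, then clock cycle).
fewer_instructions = cpu_time_seconds(1e9, 1.5, 2e9)
lower_cpi = cpu_time_seconds(2e9, 0.75, 2e9)
faster_clock = cpu_time_seconds(2e9, 1.5, 4e9)

print(baseline, fewer_instructions, lower_cpi, faster_clock)
```

In practice the three factors are not independent: a change that lowers IC may raise CPI, so the product is what matters.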
27
Alternative Way to Calculate CPI
$IC_i$ = #times instruction of type-i is executed in program
IC = #instructions executed in program = $\sum_{i=1}^{n} IC_i$

$F_i$ = relative frequency of type-i instruction = $IC_i / IC$
$CPI_i$ = #cycles to execute type-i instruction
e.g.: $CPI_{add}$ = 1, $CPI_{mul}$ = 3
#cycles required to execute the program:
$\#cyc = \sum_{i=1}^{n} CPI_i \cdot IC_i$

CPI:
$CPI = \frac{\#cyc}{IC} = \frac{\sum_{i=1}^{n} CPI_i \cdot IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot \frac{IC_i}{IC} = \sum_{i=1}^{n} CPI_i \cdot F_i$
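
A minimal Python sketch of the two equivalent ways to compute CPI described on this slide: total cycles divided by IC, and the frequency-weighted sum of per-type CPIs. The per-type costs (add = 1, mul = 3) come from the slide; the instruction mix (800 adds, 200 muls) and the function name are assumptions for illustration.

```python
def average_cpi(instruction_counts, cycles_per_type):
    """Return CPI computed two ways: #cyc / IC and sum of CPI_i * F_i."""
    total_ic = sum(instruction_counts.values())

    # #cyc = sum over types of CPI_i * IC_i
    total_cycles = sum(cycles_per_type[t] * n for t, n in instruction_counts.items())
    cpi_direct = total_cycles / total_ic

    # CPI = sum over types of CPI_i * F_i, with F_i = IC_i / IC
    cpi_weighted = sum(
        cycles_per_type[t] * (n / total_ic) for t, n in instruction_counts.items()
    )
    return cpi_direct, cpi_weighted


counts = {"add": 800, "mul": 200}  # hypothetical instruction mix
cycles = {"add": 1, "mul": 3}      # per-type CPI values from the slide

cpi_direct, cpi_weighted = average_cpi(counts, cycles)
print(cpi_direct, cpi_weighted)
```

Both values agree up to floating-point rounding, which is what the derivation states.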
28
Performance Evaluation: How?

No simple answer

Performance depends on
Application
Input

Mathematical analysis
Typically impossible

What to do?
29
Benchmarks

Use benchmarks & measure how long it takes
Use real applications (=> no absolute answers)

Preferably standardized benchmarks (+input), e.g.,
SPEC INT: integer apps
Compression, C compiler, Perl, text-processing, etc.
SPEC FP: floating point apps (mostly scientific)
TPC benchmarks: measure transaction throughput (DB)
SPEC JBB: models wholesale company (Java server, DB)

Sometimes you see FLOPS (peak or sustained)
Supercomputers (top500 list), against LINPACK
30
Evaluating Performance
Use a performance simulator to evaluate the
performance of a new feature / algorithm
Models the u-arch in great detail
Run 100s of representative applications
Produce the performance s-curve
Sort the applications according to the IPC increase
Baseline (0%) is the processor without the new feature

[Figure: two example S-curves of per-application IPC change (y-axis in percent, baseline at 0%). Bad S-curve: negative outliers alongside positive outliers. Good S-curve: only small negative outliers, with positive outliers.]
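
The S-curve procedure above can be sketched as follows. This assumes the simulator has already produced one IPC value per application with and without the new feature; the application names and IPC numbers are hypothetical, and a real study would use hundreds of applications.

```python
def s_curve(baseline_ipc, feature_ipc):
    """Return (app, percent IPC change) pairs sorted from worst to best."""
    changes = []
    for app, base in baseline_ipc.items():
        # Baseline (0%) is the processor without the new feature.
        change_pct = (feature_ipc[app] - base) / base * 100.0
        changes.append((app, change_pct))
    return sorted(changes, key=lambda pair: pair[1])


# Hypothetical simulator output (IPC per application).
baseline_ipc = {"app_a": 1.00, "app_b": 2.00, "app_c": 0.50, "app_d": 1.25}
feature_ipc = {"app_a": 1.02, "app_b": 1.96, "app_c": 0.53, "app_d": 1.25}

curve = s_curve(baseline_ipc, feature_ipc)
for app, change in curve:
    print(f"{app}: {change:+.1f}%")
```

Plotting the sorted changes gives the S shape; the left end of the list holds the negative outliers and the right end the positive ones.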
31
Amdahl's Law

Suppose we accelerate the computation such that
P = proportion of computation we make faster
S = speedup experienced by the proportion we improved

For example
If an improvement can speed up 40% of the computation
=> P = 0.4
If the improvement makes the portion run twice as fast
=> S = 2

Then overall speedup = $\frac{1}{(1-P) + \frac{P}{S}}$
32
Amdahl's Law - Example

FP operations improved to run 2x faster
S = 2, but
P = 0.1 (FP operations are only 10% of the program)
Speedup:
$\frac{1}{(1-P) + \frac{P}{S}} = \frac{1}{(1-0.1) + \frac{0.1}{2}} = \frac{1}{0.95} \approx 1.053$

Conclusion
Better to make common case fast
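
A short Python sketch of Amdahl's Law, reproducing the two cases from the slides (P = 0.1 with S = 2, and P = 0.4 with S = 2). The function name is an assumption; P is the improved proportion and S the speedup of that proportion.

```python
def amdahl_speedup(p, s):
    """Overall speedup = 1 / ((1 - P) + P / S)."""
    return 1.0 / ((1.0 - p) + p / s)


# FP operations run 2x faster but make up only 10% of the program.
fp_example = amdahl_speedup(0.1, 2)

# An improvement that doubles the speed of 40% of the computation.
forty_percent = amdahl_speedup(0.4, 2)

print(round(fp_example, 3), round(forty_percent, 3))
```

The same S applied to a larger P gives a noticeably larger overall speedup, which is the "make the common case fast" conclusion.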
33
Amdahl's Law - Parallelism

When parallelizing a program
P = proportion of program that can be made parallel
1 - P = inherently serial
N = number of processing elements (say, cores)
Speedup: $\frac{1}{(1-P) + \frac{P}{N}}$

Serial component imposes a hard limit:
$\lim_{N \to \infty} \left( \frac{1}{(1-P) + \frac{P}{N}} \right) = \frac{1}{1-P}$
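
To see the hard limit numerically, the sketch below evaluates the parallel form of Amdahl's Law for a growing number of cores. The value P = 0.9 and the core counts are hypothetical; the function names are assumptions.

```python
def parallel_speedup(p, n):
    """Speedup with N processing elements = 1 / ((1 - P) + P / N)."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_limit(p):
    """Limit of the speedup as N grows without bound = 1 / (1 - P)."""
    return 1.0 / (1.0 - p)


p = 0.9  # hypothetical: 90% of the program can be made parallel
speedups = {n: parallel_speedup(p, n) for n in (1, 2, 8, 64, 1024)}

for n, speedup in speedups.items():
    print(f"N = {n:4d}: speedup = {speedup:.2f}")
print(f"limit: {speedup_limit(p):.2f}")
```

Even with 1024 cores the speedup stays below 10, because the 10% serial part is never accelerated.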
34
Instruction Set Design
The ISA is what the user & compiler see

The HW implements the ISA
[Figure: the instruction set drawn as the interface layer between software and hardware]
35
Considerations in ISA Design
Instruction size
Long instructions take more time to fetch from memory
Longer instructions require a larger memory
Important for small (embedded) devices, e.g., cell phones

Number of instructions (IC)
Reduce IC => reduce runtime (at a given CPI & frequency)

Virtues of instruction simplicity
Simpler HW allows for: higher frequency & lower power
Optimization can be applied better to simpler code
Cheaper HW
36
Basing Design Decisions on Workload
Immediate argument size in bits (histogram)

1% of data values > 16-bits
Having 16 bits is likely good enough

[Figure: histogram of immediate data bits (x-axis: 0 to 15 bits; y-axis: 0% to 30%), with two series: Int. Avg. and FP Avg.]
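
A workload-driven decision like this one could be sketched as below: collect the immediates a workload uses, histogram how many bits each needs, and check what fraction fits a candidate field width. The sample values are hypothetical, and counting only magnitude bits (ignoring the sign bit) is a simplifying assumption, not the method behind the slide's histogram.

```python
from collections import Counter


def magnitude_bits(value):
    """Bits needed for the magnitude of an immediate (sign bit ignored)."""
    return abs(value).bit_length()


def immediate_histogram(immediates):
    """Map bit width -> fraction of immediates needing exactly that width."""
    counts = Counter(magnitude_bits(v) for v in immediates)
    total = len(immediates)
    return {bits: counts[bits] / total for bits in sorted(counts)}


def fraction_fitting(immediates, width):
    """Fraction of immediates whose magnitude fits in `width` bits."""
    fitting = sum(1 for v in immediates if magnitude_bits(v) <= width)
    return fitting / len(immediates)


# Hypothetical immediates gathered from a workload trace.
sample = [0, 1, 1, 4, -8, 255, 1024, 70000]

histogram = immediate_histogram(sample)
coverage = fraction_fitting(sample, 16)
print(histogram)
print(coverage)
```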
37
CISC Processors
CISC - Complex Instruction Set Computer
Example: x86
The idea: a high level machine language
Back when people programmed in assembly, CISC was supposedly easier

Characteristics
Many instruction types, with many addressing modes
Some of the instructions are complex
Execute complex tasks
Require many cycles
ALU operations directly on memory (e.g., arr[j] = arr[i]+n)
Registers not used (and, accordingly, only a few registers exist)
Variable length instructions
Common instructions get short codes => save code length
38
But it Turns Out
Rank  Instruction              % of total executed
1     load                     22%
2     conditional branch       20%
3     compare                  16%
4     store                    12%
5     add                      8%
6     and                      6%
7     sub                      5%
8     move register-register   4%
9     call                     1%
10    return                   1%
      Total                    96%
Simple instructions dominate instruction frequency
39
CISC Drawbacks
Complex instructions and complex addressing modes
complicate the processor
slow down the simple, common instructions
contradict "Make The Common Case Fast"
Compilers don't use complex instructions / indexing methods
Variable length instructions are a real pain in the neck
Difficult to decode a few instructions in parallel
As long as an instruction is not decoded, its length is unknown
It is unknown where the instruction ends
It is unknown where the next instruction starts
An instruction may span more than a single cache line
An instruction may span more than a single page
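
The decoding problem can be illustrated with a toy model. The encoding here is entirely hypothetical (the first byte of each instruction holds its total length in bytes) and is not x86; the point is only that with variable lengths each instruction start depends on decoding the previous instruction, whereas with fixed lengths every start is known up front.

```python
def variable_length_starts(code, length_of):
    """Find instruction starts one at a time; each depends on the previous decode."""
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        # The next start is unknown until this instruction's length is decoded.
        # (Toy model: assumes every decoded length is at least 1.)
        pc += length_of(code[pc])
    return starts


def fixed_length_starts(code_size, width=4):
    """With fixed-width instructions, all starts are computable independently."""
    return list(range(0, code_size, width))


# Hypothetical toy encoding: first byte of each instruction = its length in bytes.
code = bytes([1, 3, 0, 0, 2, 0, 4, 0, 0, 0])

variable_starts = variable_length_starts(code, lambda first_byte: first_byte)
fixed_starts = fixed_length_starts(12)
print(variable_starts, fixed_starts)
```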
40
RISC Processors
RISC - Reduced Instruction Set Computer
The idea: simple instructions enable fast hardware
Characteristics
A small instruction set, with only a few instruction formats
Simple instructions
execute simple tasks
Most of them require a single cycle (with pipeline)
A few indexing methods
ALU operations on registers only
Memory is accessed using Load and Store instructions only
Many orthogonal registers
Three address machine: Add dst, src1, src2
Fixed length instructions

Examples: MIPS™, Sparc™, Alpha™, Power™
41
RISC Processors (Cont.)
Simple arch => simple u-arch
Room for larger on-die caches
Smaller => faster
Easier to design & validate (=> cheaper to manufacture)
Shorten time-to-market
More general-purpose registers (=> fewer memory refs)

Compiler can be smarter
Better pipeline usage
Better register allocation

Existing RISC processors are not pure RISC
e.g., support division which takes many cycles
42
Compilers and ISA
Ease of compilation
Orthogonality:
no special registers
few special cases
all operand modes available with any data type or instruction type
Regularity:
no overloading for the meanings of instruction fields
streamlined
resource needs easily determined

Register assignment is critical too
Easier if lots of registers

43
Still, CISC Is Dominant
x86 (CISC) dominates the processor market

Legacy
A vast amount of existing software
Intel, AMD, Microsoft benefit
But they spend a lot of money to compensate for the disadvantages

CISC arch internally emulates RISC
Starting at Pentium II and K6, x86 processors translate
CISC instructions into RISC-like operations internally
Inside, the core looks much like that of a RISC processor
44
Software Specific Extensions
Extend arch to accelerate exec of specific apps

Example: SSE™ - Streaming SIMD Extensions
128-bit packed (vector) / scalar single precision FP (4×32)
Introduced on Pentium III in 1999
8 new 128-bit registers (XMM0 - XMM7)
Accelerates graphics, video, scientific calculations, etc.

Packed (128 bits):
  x0     x1     x2     x3
+ y0     y1     y2     y3
= x0+y0  x1+y1  x2+y2  x3+y3

Scalar (128 bits):
  x0     x1     x2     x3
+ y0     y1     y2     y3
= x0+y0  y1     y2     y3
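
The difference between the packed and scalar forms can be modeled with plain Python lists standing in for the four 32-bit lanes of a 128-bit register. This is only a behavioral sketch (no real SIMD is involved), the function names are assumptions, and the scalar result follows the slide's drawing, where lanes 1 to 3 come from y.

```python
def add_packed(x, y):
    """Packed add: all four lanes are added in one operation."""
    return [a + b for a, b in zip(x, y)]


def add_scalar(x, y):
    """Scalar add: only lane 0 is added; lanes 1-3 are carried over from y."""
    return [x[0] + y[0]] + list(y[1:])


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

packed = add_packed(x, y)
scalar = add_scalar(x, y)
print(packed)
print(scalar)
```

One packed instruction does the work of four scalar additions, which is how such extensions reduce IC.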
45
BACKUP
46
Compatibility
Backward compatibility (HW responsibility)
When buying new hardware, it can run existing software:
i5 can run SW written for Core2 Duo, Pentium4, PentiumM,
Pentium III, Pentium II, Pentium, 486, 386, 286

BTW:

Forward compatibility (SW responsibility)
For example: MS Word 2003 can open MS Word 2010 doc
Commonly supports one or two generations behind

Architecture-independent SW
Run SW on top of VM that does JIT (just in time compiler):
JVM for Java and CLR for .NET
Interpreted languages: Perl, Python
