Computer System Architecture
Chapter 1: Introduction to Computer Architecture
Engr. Carla May C. Ceribo, Instructor
CPE 513/
Architecture and Organization
Architecture is those attributes visible to the programmer
  Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques
  e.g. Is there a multiply instruction?
Organization is how features are implemented
  Control signals, interfaces, memory technology
  e.g. Is there a hardware multiply unit or is it done by repeated addition?
Architecture and Organization
All Intel x86 family share the same basic architecture
The IBM System/370 family share the same basic architecture
This gives code compatibility
  At least backwards
Organization differs between different versions
Design Goals
• Functional
  • Needs to be correct
  • And unlike software, difficult to update once deployed
  • What functions should it support (Turing completeness aside)
• Reliable
  • Does it continue to perform correctly?
  • Hard fault vs transient fault
  • Google story - memory errors and sun spots
  • Space satellites vs desktop vs server reliability
• High performance
  • “Fast” is only meaningful in the context of a set of important tasks
  • Not just “Gigahertz” – truck vs sports car analogy
  • Impossible goal: fastest possible design for all programs
Design Goals
• Low cost
  • Per unit manufacturing cost (wafer cost)
  • Cost of making first chip after design (mask cost)
  • Design cost (huge design teams, why? Two reasons…)
  • (Dime/dollar joke)
• Low power/energy
  • Energy in (battery life, cost of electricity)
  • Energy out (cooling and related costs)
  • Cyclic problem, very much a problem today
• Challenge: balancing the importance of these goals
• And th
Shaping Force: Applications/Domains
• Domain: group of applications with similar character
• Different requirements lead to different designs
• Scientific: weather prediction, genome sequencing
  • First computing application domain: naval ballistics firing tables
  • Need: large memory, heavy-duty floating point
  • Examples: CRAY T3E, IBM BlueGene
• Commercial: database/web serving, e-commerce, Google
  • Need: data movement, high memory + I/O bandwidth
  • Examples: Sun Enterprise Server, AMD Opteron, Intel Xeon
Shaping Force: Applications/Domains
• Desktop: home office, multimedia, games
  • Need: integer, memory bandwidth, integrated graphics/network?
  • Examples: Intel Core 2, Core i7, AMD Athlon
• Mobile: laptops, mobile phones
  • Need: low power, integer performance, integrated wireless
  • Laptops: Intel
• Embedded: microcontrollers in automobiles
  • Need: low power, low cost
  • Examples: ARM chips, dedicated digital signal processors (DSPs)
  • Over one billion ARM cores sold in 2006 (at least one per phone)
• Deeply Embedded: disposable “smart dust” sensors
  • Need: extremely low power, extremely low cost
Functional View
Operations: Data Movement
Operations: Data Storage
Operations
[Figure: top-level structure of the computer: Central Processing Unit, Main Memory, Systems Interconnection, Input/Output, communication lines]
Structure - The CPU
[Figure: the CPU within the computer (I/O, Memory, System Bus), containing Registers, the Arithmetic and Logic Unit, the Control Unit, and the Internal CPU Interconnection]
Structure - The Control Unit
[Figure: the Control Unit within the CPU (ALU, Registers, Internal Bus), containing Sequencing Logic, Control Unit Registers and Decoders, and Control Memory]
CHALLENGE: Basic Operational Concept
...r in location B,
Load Ri, LOC
and
Store Ri, LOC
are the only instructions available to transfer data between the memory and the general purpose registers. Do not change the contents of either location A or B.
CHALLENGE
Suppose that Move and Add instructions are available with the formats
Move Location1, Location2
and
Add Location1, Location2
These instructions move or add a copy of the operand at the second location to the first location, overwriting the original operand at the first location. Either or both of the operands can be in the memory or the general-purpose registers.
Is it possible to use fewer instructions of these types to accomplish the task in part (a)? If yes, give the sequence.
Character Representation
Parallelism
Instruction-level parallelism
  The simplest way to execute a sequence of instructions in a processor is to complete all steps of the current instruction before starting the steps of the next instruction.
Multicore processors
  Multiple processing units can be fabricated on a single chip. The term core is used for each of the processors. The term processor is then used for the complete chip.
Multiprocessors
  Shared-memory multiprocessors
  Message-passing multicomputers
ENIAC - background
Electronic Numerical Integrator And Computer
Eckert and Mauchly
University of Pennsylvania
Trajectory tables for weapons
Started 1943
Finished 1946
  Too late for war effort
Used until 1955
ENIAC - details
Decimal (not binary)
20 accumulators of 10 digits
Programmed manually by switches
18,000 vacuum tubes
30 tons
15,000 square feet
140 kW power consumption
5,000 additions per second
von Neumann/Turing
Stored Program concept
Main memory storing programs and data
ALU operating on binary data
Control unit interpreting instructions from memory and executing
Input and output equipment operated by control unit
Completed 1952
Structure of von Neumann
machine
IAS - details
1000 x 40 bit words
  Binary number
  2 x 20 bit instructions
Set of registers (storage in CPU)
  Memory Buffer Register
  Memory Address Register
  Instruction Register
  Instruction Buffer Register
  Program Counter
  Accumulator
  Multiplier Quotient
Structure of IAS – detail
Commercial Computers
1947 - Eckert-Mauchly Computer Corporation
UNIVAC I (Universal Automatic Computer)
US Bureau of Census 1950 calculations
IBM
1953 - the 701
  IBM’s first stored program computer
  Scientific calculations
1955 - the 702
  Business applications
Led to 700/7000 series
Transistors
Replaced vacuum tubes
Smaller
Cheaper
Less heat dissipation
Solid State device
Transistor Based Computers
NCR & RCA produced small transistor machines
IBM 7000
DEC - 1957
  Produced PDP-1
Microelectronics
Literally - “small electronics”
A computer is made up of gates, memory cells and interconnections
These can be manufactured on a semiconductor
e.g. silicon wafer
Generations of Computer
Vacuum tube - 1946-1957
Transistor - 1958-1964
Small scale integration - 1965 on
  Up to 100 devices on a chip
Medium scale integration - to 1971
  100-3,000 devices on a chip
Large scale integration - 1971-1977
  3,000 - 100,000 devices on a chip
Very large scale integration - 1978-1991
  100,000 - 100,000,000 devices on a chip
Ultra large scale integration – 1991 -
  Over 100,000,000 devices on a chip
Moore’s Law
Increased density of components on chip
Gordon Moore – co-founder of Intel
Number of transistors on a chip will double every year
Since 1970’s development has slowed a little
  Number of transistors doubles every 18 months
Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability
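The doubling rule on the Moore's Law slide can be turned into a quick projection. The following is a minimal sketch, assuming a constant doubling period; the function name `projected_transistors` and the starting count of 1,000 transistors are hypothetical and chosen only for illustration.

```python
def projected_transistors(start_count, years, doubling_period_years=1.5):
    """Project a transistor count, assuming it doubles every doubling_period_years."""
    doublings = years / doubling_period_years
    return start_count * 2 ** doublings


# Hypothetical chip with 1,000 transistors, projected 15 years ahead.
yearly = projected_transistors(1_000, 15, doubling_period_years=1.0)  # doubling every year
slower = projected_transistors(1_000, 15, doubling_period_years=1.5)  # doubling every 18 months

print(f"doubling every year:      {yearly:,.0f}")
print(f"doubling every 18 months: {slower:,.0f}")
```

Over the same 15 years the yearly rule gives 15 doublings and the 18-month rule gives 10, which is why the slower rate still produces very large growth.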
Growth in CPU Transistor
Count
IBM 360 series
1964
Replaced (& not compatible with) 7000 series
First planned “family” of computers
  Similar or identical instruction sets
  Similar or identical O/S
  Increasing speed
  Increasing number of I/O ports (i.e. more terminals)
DEC PDP-8
First minicomputer (after miniskirt!)
Small enough to sit on a lab bench
$16,000
  $100k+ for IBM 360
Embedded applications & OEM
BUS STRUCTURE
DEC - PDP-8 Bus Structure
Semiconductor Memory
1970
Fairchild
Size of a single core
  i.e. 1 bit of magnetic core storage
Intel
1971 - 4004
  All CPU components on a single chip
  4 bit
Followed in 1972 by 8008
  8 bit
  Both designed for specific applications
1974 - 8080
  Intel’s first general purpose microprocessor
Speeding it up
Pipelining
On board cache
On board L1 & L2 cache
Branch prediction
Performance Balance
Memory capacity increased
Memory speed lags
Solutions
Make DRAM “wider” rather than “deeper”
I/O Devices
Large data throughput demands
Processors can handle this
Problem moving data
Solutions:
  Caching
  Buffering
  Higher-speed interconnection buses
  More elaborate bus structures
  Multiple-processor configurations
Typical I/O Device Data Rates
Key is Balance
Processor components
Main memory
I/O devices
Interconnection structures
Improvements in Chip Organization and Architecture
Increase hardware speed of processor
  Fundamentally due to shrinking logic gate size
Parallelism
Problems with Clock Speed and Logic Density
Power
  Power density increases with density of logic and clock speed
  Dissipating heat
RC delay
  Speed at which electrons flow limited by resistance and capacitance of metal wires connecting them
Memory latency
  Memory speeds lag processor speeds
Solution:
  More emphasis on organizational and architectural approaches
Intel Microprocessor
Performance
Increased Cache Capacity
Typically two or three levels of cache between processor and main memory
Chip density increased
  More cache memory on chip
  Faster cache access
Pentium 4 devotes about 50% of chip area to cache
More Complex Execution Logic
Enable parallel execution of instructions
Pipeline works like assembly line
  Different stages of execution of different instructions at same time along pipeline
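The assembly-line picture can be made concrete by counting clock cycles. This is a minimal sketch under idealized assumptions that are not stated on the slide: every stage takes exactly one clock cycle and the pipeline never stalls. The function names and the workload of 100 instructions on 5 stages are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 100, 5  # hypothetical workload: 100 instructions, 5 pipeline stages
serial = unpipelined_cycles(n, k)
overlapped = pipelined_cycles(n, k)
print(f"unpipelined: {serial} cycles, pipelined: {overlapped} cycles, "
      f"speedup: {serial / overlapped:.2f}x")
```

For long instruction sequences the ideal speedup approaches the number of stages; real pipelines fall short of this because of hazards and stalls, which this sketch deliberately ignores.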
Diminishing Returns
Can get a great deal of parallelism
  Further significant increases likely to be relatively modest
Benefits from cache are reaching limit
Increasing clock rate runs into power dissipation problem
  Some fundamental physical limits are being reached
New Approach – Multiple Cores
Multiple processors on single chip
  Large shared cache
Within a processor, increase in performance proportional to square root of increase in complexity
If software can use multiple processors, doubling number of processors almost doubles performance
With two processors, larger caches are justified
  Power consumption of memory logic less than processing logic
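The trade-off on this slide can be compared numerically. This is a rough sketch, assuming the square-root rule holds exactly for a single core; the `parallel_efficiency` parameter (0.95 here) is a hypothetical stand-in for "almost doubles" and is not a figure from the slides.

```python
import math


def single_core_speedup(complexity_factor: float) -> float:
    """Performance gain from making one core complexity_factor times more complex."""
    return math.sqrt(complexity_factor)


def multicore_speedup(n_cores: int, parallel_efficiency: float = 1.0) -> float:
    """Performance gain from n_cores simple cores, scaled by an assumed efficiency."""
    return n_cores * parallel_efficiency


# Spend a doubled transistor budget on one bigger core or on two simple cores.
bigger_core = single_core_speedup(2.0)
two_cores = multicore_speedup(2, parallel_efficiency=0.95)  # 0.95 is an assumption
print(f"one core, 2x complexity: {bigger_core:.2f}x")
print(f"two simple cores:        {two_cores:.2f}x")
```

The comparison only favors multiple cores when the software can actually use them, which is the condition the slide attaches to the claim.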
x86 Evolution (1)
8080
  first general purpose microprocessor
  8 bit data path
  Used in first personal computer – Altair
8086 – 5MHz – 29,000 transistors
  much more powerful
  16 bit
  instruction cache, prefetch few instructions
  8088 (8 bit external bus) used in first IBM PC
80286
  16 Mbyte memory addressable
  up from 1 Mbyte
80386
  32 bit
  Support for multitasking
80486
  sophisticated powerful cache and instruction pipelining
x86 Evolution (2)
Pentium
  Multiple instructions executed in parallel
Pentium Pro
  speculative execution
Pentium II
  MMX technology
  graphics, video & audio processing
Pentium III
x86 Evolution (3)
Core
  First x86 with dual core
Core 2
  64 bit architecture
Core 2 Quad – 3GHz – 820 million transistors
  Four processors on chip
x86 architecture dominant outside embedded systems
Organization and technology changed dramatically
Instruction set architecture evolved with backwards compatibility
~1 instruction per month added
500 instructions available
See Intel web pages for detailed information on processors
Multicore Processors
(12x)
chip cache
• 296 mm² (3x)
• 3.2 GHz to 3.6 GHz (~1x)
• 0.7 to 1.4 Volts
• data-parallel vector (SIMD) instructions, hyperthreading
• Four-core
Embedded Systems - ARM
ARM evolved from RISC design
Used mainly in embedded systems
  Not general purpose computer
  Dedicated function
  E.g. Anti-lock brakes in car
Embedded Systems Requirements
Different sizes
  Different constraints, optimization, reuse
Different requirements
  Safety, reliability, real-time, flexibility, legislation
  Lifespan
  Environmental conditions
  Static vs dynamic loads
  Slow to fast speeds
  Computation vs I/O intensive
  Discrete event vs continuous dynamics
Possible Organization of an Embedded System
ARM Evolution
Designed by ARM Inc., Cambridge, England
Licensed to manufacturers
High speed, small die, low power consumption
PDAs, hand held games, phones
Secure applications
Performance Assessment - Clock Speed
Key parameters
  Performance, cost, size, security, reliability, power consumption
System clock speed
  In Hz or multiples of
  Clock rate, clock cycle, clock tick, cycle time
Signals in CPU take time to settle down to 1 or 0
Signals may change at different speeds
Operations need to be synchronised
Instruction execution in discrete steps
  Fetch, decode, load and store, arithmetic or logical
  Usually require multiple clock cycles per instruction
Pipelining gives simultaneous execution of instructions
So, clock speed is not the whole story
System Clock
Instruction Execution Rate
Millions of instructions per second (MIPS)
Millions of floating point instructions per second (MFLOPS)
Heavily dependent on instruction set, compiler design, processor implementation, cache & memory hierarchy
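The link between clock rate and instruction execution rate can be sketched with the usual textbook relations: cycle time is 1/f, execution time is instruction count × CPI / f, and the MIPS rate is f / (CPI × 10^6). The slides do not spell these formulas out, so treat the code as an illustrative assumption; the function names and the 400 MHz, CPI = 2 example are hypothetical.

```python
def cycle_time_seconds(clock_hz: float) -> float:
    """Duration of one clock cycle for a clock running at clock_hz."""
    return 1.0 / clock_hz


def execution_time_seconds(instruction_count: float, cpi: float, clock_hz: float) -> float:
    """Run time = instructions x average cycles per instruction x cycle time."""
    return instruction_count * cpi / clock_hz


def mips_rate(clock_hz: float, cpi: float) -> float:
    """Millions of instructions per second for a given clock rate and average CPI."""
    return clock_hz / (cpi * 1e6)


# Hypothetical processor: 400 MHz clock, 2 cycles per instruction on average.
clock_hz, cpi = 400e6, 2.0
print(f"cycle time: {cycle_time_seconds(clock_hz) * 1e9:.2f} ns")
print(f"MIPS rate:  {mips_rate(clock_hz, cpi):.0f}")
print(f"time for 2 million instructions: {execution_time_seconds(2e6, cpi, clock_hz):.3f} s")
```

Because CPI depends on the instruction set, compiler, and memory hierarchy, two machines with the same clock rate can have very different MIPS rates.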
Benchmarks
Programs designed to test performance
Written in high level language
  Portable
Represents style of task
  Systems, numerical, commercial
Easily measured
Widely distributed
E.g. System Performance Evaluation Corporation (SPEC)
  CPU2006 for computation bound
    17 floating point programs in C, C++, Fortran
    12 integer programs in C, C++
    3 million lines of code
  Speed and rate metrics
    Single task and throughput
SPEC Speed Metric
Single task
Base runtime defined for each benchmark using reference machine
Results are reported as ratio of reference time to system run time
  Tref_i execution time for benchmark i on reference machine
  Tsut_i execution time of benchmark i on test system
SPEC Rate Metric
Number of copies same as number of processors
Ratio is calculated as follows:
  Tref_i reference execution time for benchmark i
  N number of copies run simultaneously
  Tsut_i elapsed time from start of execution of program on all N processors until completion of all copies of program
Again, a geometric mean is calculated
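The speed and rate metrics can be sketched together. The code assumes the speed ratio is Tref_i / Tsut_i, the rate ratio is N × Tref_i / Tsut_i, and the overall figure is the geometric mean of the per-benchmark ratios; the function names and all timing values are hypothetical and are not SPEC results.

```python
import math


def speed_ratios(tref, tsut):
    """Per-benchmark speed ratio: reference time divided by time on the system under test."""
    return [ref / sut for ref, sut in zip(tref, tsut)]


def rate_ratios(tref, tsut, n_copies):
    """Per-benchmark rate ratio: n_copies times reference time over elapsed time for all copies."""
    return [n_copies * ref / sut for ref, sut in zip(tref, tsut)]


def geometric_mean(values):
    """Geometric mean, computed via logarithms to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical timings in seconds for three benchmarks.
tref = [1000.0, 2000.0, 4000.0]  # reference machine
tsut = [100.0, 250.0, 1000.0]    # system under test

print("speed ratios:", speed_ratios(tref, tsut))
print(f"speed metric: {geometric_mean(speed_ratios(tref, tsut)):.2f}")
print("rate ratios (N=4):", rate_ratios(tref, tsut, n_copies=4))
```

A geometric mean is used so that no single benchmark with a very large ratio dominates the overall figure.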
Amdahl’s Law
Gene Amdahl [AMDA67]
Potential speed up of program using multiple processors
Concluded that:
  Code needs to be parallelizable
  Speed up is bound, giving diminishing returns for more processors
Task dependent
Amdahl’s Law Formula
For program running on single processor
  Fraction f of code infinitely parallelizable with no scheduling overhead
  Fraction (1-f) of code inherently serial
  T is total execution time for program on single processor
  N is number of processors that fully exploit parallel portions of code
Conclusions
f small, parallel processors have little effect
N -> ∞, speedup bound by 1/(1 – f)
  Diminishing returns for using more processors
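These conclusions follow from the standard form of Amdahl's law, Speedup = 1 / ((1 - f) + f/N), using f and N as defined on the formula slide. A minimal sketch follows; the function name `amdahl_speedup` and the sample value f = 0.9 are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n processors when a fraction f of the code is perfectly parallelizable."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - f) + f / n)


f = 0.9  # hypothetical: 90% of the code can run in parallel
for n in (1, 2, 10, 100, 1000):
    print(f"N = {n:4d}: speedup = {amdahl_speedup(f, n):.2f}")
print(f"upper bound as N grows: {1.0 / (1.0 - f):.2f}")
```

Each tenfold increase in N buys less than the one before, and the speedup never exceeds 1/(1 – f), which is the diminishing-returns behaviour the slide describes.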