
Veljko Milutinović

MTP:
Understanding the Essence
emilutiv@etf.bg.ac.yu
UNDERSTANDING THE MTP
Basic classification:
coarse grain (task level; switching to a new thread on a context switch)
versus
fine grain (instruction level; switching to a new thread every cycle)
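The two switching policies can be contrasted with a toy scheduler that only records which thread owns the issue stage in each cycle. This is a minimal sketch under stated assumptions: threads are numbered 0..n-1 and picked in round-robin order, and the hypothetical `switch_cycles` set stands in for the cycles at which a coarse-grain machine takes a context switch (for example, on a long-latency miss). No real pipeline is modeled.

```python
from typing import List, Set


def fine_grain_schedule(num_threads: int, cycles: int) -> List[int]:
    """Fine grain: a new thread (round robin) owns the issue stage every cycle."""
    return [c % num_threads for c in range(cycles)]


def coarse_grain_schedule(num_threads: int, cycles: int,
                          switch_cycles: Set[int]) -> List[int]:
    """Coarse grain: stay with the current thread and move to the next one
    (round robin) only in the cycles listed in switch_cycles, which stand in
    for hypothetical context-switch events such as a long-latency miss."""
    schedule = []
    current = 0
    for c in range(cycles):
        if c in switch_cycles:
            current = (current + 1) % num_threads
        schedule.append(current)
    return schedule


if __name__ == "__main__":
    print(fine_grain_schedule(3, 6))            # [0, 1, 2, 0, 1, 2]
    print(coarse_grain_schedule(3, 6, {2, 5}))  # [0, 0, 1, 1, 1, 2]
```

The fine-grain schedule changes owner every cycle regardless of events, while the coarse-grain schedule changes owner only when a switch event occurs.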
Principal components:
Multiple activity specifiers (program counters, stack pointers, etc.)
Multiple register contexts
Thread synchronization mechanism
(memory tags, HEP; 2-way joins, Monsoon; futures, MASA; etc.)
Fast context switch (Iannucci; Culler; etc.)
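Memory tags, the first synchronization mechanism listed above, are commonly described as a full/empty bit attached to each memory word. The sketch below assumes those usual semantics (a synchronized read waits until the word is full and then marks it empty; a synchronized write waits until the word is empty and then marks it full). The class and method names are hypothetical, and instead of blocking, a failed access simply reports that it would have to wait, leaving the retry policy to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaggedWord:
    """One memory word plus a full/empty tag bit (hypothetical sketch)."""
    value: int = 0
    full: bool = False

    def try_write(self, v: int) -> bool:
        """Synchronized write: allowed only when the word is empty.
        Returns False if the writer would have to wait."""
        if self.full:
            return False
        self.value = v
        self.full = True   # the produced value is now available
        return True

    def try_read(self) -> Optional[int]:
        """Synchronized read: allowed only when the word is full.
        Consumes the value (marks the word empty); returns None if the
        reader would have to wait."""
        if not self.full:
            return None
        self.full = False
        return self.value


if __name__ == "__main__":
    w = TaggedWord()
    print(w.try_read())     # None: nothing produced yet
    print(w.try_write(42))  # True
    print(w.try_write(7))   # False: previous value not consumed yet
    print(w.try_read())     # 42
```

In a multithreaded processor, a thread whose tagged access fails can be suspended or retried while other threads keep the pipeline busy, which is why cheap synchronization and fast context switching appear together among the principal components.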
Differences between a thread and a process:
Thread may be directly supported at the architecture level
(start/suspension/continuation may be implemented in the ISA)
Process is implemented in the operating system layer
(start/suspension/continuation implemented in software)
COARSE-GRAINED MULTITHREADING
THE FIRST MULTITHREADING PROJECT: HEP
The first commercial MIMD based on multithreading (Denelcor, Inc., 1978)
Up to 16 PEMs (each one running up to 8 user and 8 supervisory threads)
and a number of DMMs (with dataflow heritage)
on a multistage ICN (any memory location accessible to any processor)
OTHER MULTITHREADING PROJECTS
Tera (Smith, Tera)
Monsoon (Arvind, MIT and Motorola)
*T (Arvind, MIT and Motorola)
Super Actor Machine (Gao, McGill)
EM-4 (Sakai, Yamaguchi, Kodama, ETL)
MASA (Halstead, Fujita, Multilisp)
J-Machine (Dally, MIT)
Alewife (Agarwal, MIT)
[Figure: HEP system with PEMs, DMMs, PSUs, an IOC, LAPs, and the multistage ICN]
Figure MTPU1: Structure of the HEP multiprocessor system (source: [Iannucci94])
Legend:
PEM = Processing Element Module
DMM = Data Memory Module
PSU = Packet Switch Unit
LAP = Local Access Path
ICN = Interconnection Network
[Figure: HEP PEM with instruction scheduler, process queue (PSWs), program memory, instruction fetch, operand fetch, task register file, effective address, register and constant memory, execute operation, functional units (thread control FU, FP add FU, FP multiply FU, integer FU), data memory controller, result store, local data memory, and the interprocessor network]
Figure MTPU2: Structure of the HEP processing element module (source: [Iannucci94])
Legend:
FU = Functional Unit
FP = Floating-Point
FINE-GRAINED MULTITHREADING
TRADITIONAL FINE-GRAINED MULTITHREADING: TFM
In traditional fine-grain multithreading,
only one thread issues instructions in each cycle.
SIMULTANEOUS MULTITHREADING: SMT
Several independent threads issue instructions simultaneously
to multiple functional units of a superscalar (in a single cycle).
Higher potential for utilization of resources in a wide-issue processor.
Throughput on an 8-issue processor:
4 times that of a superscalar and 2 times that of a fine-grained multithreaded processor,
because both horizontal and vertical waste are attacked simultaneously.
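Why SMT attacks both kinds of waste can be illustrated with a toy slot-filling model. It is a sketch under strong simplifying assumptions, not the simulator of the original SMT study: `ready[t][c]` is a hypothetical count of instructions thread t could issue in cycle c, instructions that do not issue are simply dropped, the superscalar runs thread 0 only, the fine-grained machine gives each cycle to one thread in round-robin order, and the SMT machine lets all threads compete for the same `width` issue slots.

```python
from typing import List


def issued_per_cycle(ready: List[List[int]], width: int, policy: str) -> List[int]:
    """Return how many issue slots are filled in each cycle under a policy."""
    num_threads = len(ready)
    issued = []
    for c in range(len(ready[0])):
        if policy == "superscalar":    # a single thread (thread 0)
            n = ready[0][c]
        elif policy == "fine_grain":   # one thread per cycle, round robin
            n = ready[c % num_threads][c]
        elif policy == "smt":          # all threads share the slots of the same cycle
            n = sum(thread[c] for thread in ready)
        else:
            raise ValueError(f"unknown policy: {policy}")
        issued.append(min(width, n))
    return issued


if __name__ == "__main__":
    ready = [[2, 0, 1, 0],
             [1, 3, 0, 2],
             [2, 1, 2, 0]]
    for policy in ("superscalar", "fine_grain", "smt"):
        print(policy, issued_per_cycle(ready, width=4, policy=policy))
    # superscalar [2, 0, 1, 0]
    # fine_grain [2, 3, 2, 0]
    # smt [4, 4, 3, 2]
```

In this made-up trace the superscalar leaves whole cycles empty (vertical waste) as well as partly empty ones (horizontal waste); round-robin fine-grain multithreading fills some of the empty cycles but not the partial ones; SMT fills the most slots. The numbers illustrate the mechanism only, not the throughput ratios quoted above.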

[Figure: issue slots (horizontal) versus cycles (vertical), each slot marked as a full or an empty issue slot; horizontal waste = 9 slots, vertical waste = 12 slots]
Figure MTPU3: Empty issue slots: horizontal waste and vertical waste (source: [Tullsen95])
Legend: Self-explanatory
Superscaling alone is not efficient against vertical waste;
multithreading alone is not efficient against horizontal waste;
SMT is efficient against both!
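Using the definitions behind the issue-slot figure above (vertical waste: all slots of a cycle in which nothing issues; horizontal waste: the unused slots of a cycle in which at least one instruction issues), both quantities can be computed from per-cycle issue counts. The 4-wide machine and the issue trace below are hypothetical, chosen only so that the totals match the 9 and 12 slots quoted in the figure; they are not read from the figure itself.

```python
from typing import List, Tuple


def issue_waste(issued: List[int], width: int) -> Tuple[int, int]:
    """Return (horizontal_waste, vertical_waste) in issue slots.

    issued[c] is the number of instructions issued in cycle c (0..width)."""
    horizontal = sum(width - n for n in issued if n > 0)  # partly filled cycles
    vertical = sum(width for n in issued if n == 0)       # completely empty cycles
    return horizontal, vertical


if __name__ == "__main__":
    trace = [1, 0, 3, 0, 2, 0, 4, 1]        # hypothetical 4-wide issue trace
    print(issue_waste(trace, width=4))      # (9, 12)
```

A wider superscalar mainly changes the first term, while switching threads on empty cycles mainly changes the second; SMT can reduce both.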
Source of Wasted Issue Slots | Possible Latency-Hiding or Latency-Reducing Techniques
instruction TLB miss, data TLB miss | decrease the TLB miss rates (e.g., increase the TLB sizes); hardware instruction prefetching; hardware or software data prefetching; faster servicing of TLB misses
I cache miss | larger, more associative, or faster instruction cache hierarchy; hardware instruction prefetching
D cache miss | larger, more associative, or faster data cache hierarchy; hardware or software prefetching; improved instruction scheduling; more sophisticated dynamic execution
branch misprediction | improved branch prediction scheme; lower branch misprediction penalty
control hazard | speculative execution; more aggressive if-conversion
load delays (first-level cache hits) | shorter load latency; improved instruction scheduling; dynamic scheduling
short integer delay | improved instruction scheduling
long integer, short fp, long fp delays (multiply is the only long integer operation, divide is the only long floating-point operation) | shorter latencies; improved instruction scheduling
memory conflict (accesses to the same memory location in a single cycle) | improved instruction scheduling
Figure MTPU4: Causes of wasted issue slots and related prevention techniques (source: [Tullsen95])
Legend:
TLB = Translation Lookaside Buffer
All these techniques have to be utilized properly
before the effects of SMT can be studied.
Test | Purpose of Test | Common Elements | SM Configuration | T | MP Configuration | T
A | Unlimited FUs: equal total issue bandwidth, equal number of register sets (processors or threads) | FUs = 32, IssueBw = 8, RegSets = 8 | 8 thread, 8-issue | 6.64 | 8 1-issue | 5.13
B | (same purpose as Test A) | FUs = 16, IssueBw = 4, RegSets = 4 | 4 thread, 4-issue | 3.40 | 4 1-issue | 2.77
C | (same purpose as Test A) | FUs = 16, IssueBw = 8, RegSets = 4 | 4 thread, 8-issue | 4.15 | 4 2-issue | 3.44
D | Unlimited FUs: Test A, but limit SM to 10 FUs | IssueBw = 8, RegSets = 8 | 8 thread, 8-issue, 10 FU | 6.36 | 8 1-issue procs, 32 FU | 5.13
E | Unequal issue BW: MP has up to four times the total issue bandwidth | FUs = 32, RegSets = 8 | 8 thread, 8-issue | 6.64 | 8 4-issue | 6.35
F | (same purpose as Test E) | FUs = 16, RegSets = 4 | 4 thread, 8-issue | 4.15 | 4 4-issue | 3.72
G | FU utilization: equal FUs, equal issue bw, unequal reg sets | FUs = 8, IssueBw = 8 | 8 thread, 8-issue | 5.30 | 2 4-issue | 1.94
Figure MTPU5: Comparison of various (multithreading) multiprocessors and an SMT processor (source: [Tullsen95])
Legend:
T = Throughput (instructions/cycle)
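The relative advantage of the SMT processor (SM) over the multiprocessor (MP) in each test is the ratio of the two throughputs in the comparison above. The short script below only performs that division on the values transcribed from the table; the dictionary layout is an illustrative choice.

```python
# Throughput T (instructions/cycle) transcribed from the SM/MP comparison: (SM, MP).
RESULTS = {
    "A": (6.64, 5.13),
    "B": (3.40, 2.77),
    "C": (4.15, 3.44),
    "D": (6.36, 5.13),
    "E": (6.64, 6.35),
    "F": (4.15, 3.72),
    "G": (5.30, 1.94),
}


def sm_over_mp(results: dict) -> dict:
    """Ratio of SMT throughput to multiprocessor throughput for each test."""
    return {test: sm / mp for test, (sm, mp) in results.items()}


if __name__ == "__main__":
    for test, ratio in sm_over_mp(RESULTS).items():
        print(f"Test {test}: SM/MP = {ratio:.2f}")
```

The ratios range from about 1.05 (Test E, where the MP has four times the total issue bandwidth) to about 2.73 (Test G, equal FUs and issue bandwidth), so the SM configuration comes out ahead in every test.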
Current microprocessors are mostly 4-issue superscalars;
potentially, SMT leads to 8-issue and 16-issue next-generation superscalars.
REFERENCES
[Iannucci94] Iannucci, R. A., Gao, G. R., Halstead, R. H. Jr., Smith, B.,
Multithreaded Computer Architecture:
A Summary of the State of the Art,
Kluwer Academic Publishers, Boston, Massachusetts, USA, 1994.
[Tullsen95] Tullsen, D. M., Eggers, S. J., Levy, H. M.,
"Simultaneous Multithreading: Maximizing On-Chip Parallelism,"
Proceedings of the ISCA-95, Santa Margherita Ligure, Italy, 1995,
pp. 392-403.
MTP:
State of the Art
AN INDUSTRIAL MTP PROCESSOR
Problem:
Memory accesses are starting to dominate the execution time of uniprocessors
Solution:
Coarse-grained uniprocessor multithreading in the IBM environment
Conditions:
Object-oriented programming for on-line transaction processing
Reference:
[Eickmeyer96] Eickmeyer, R. J., Johnson, R. E., Kunkel, S. R., Liu, S.,
Squillante, M. S.,
"Evaluation of Multithreaded Uniprocessors
for Commercial Application Environments,"
Proceedings of the ISCA-96, Philadelphia, Pennsylvania,
May 1996, pp. 203-212.
AN ACADEMIC SMT PROCESSOR
Problem:
Extending a conventional wide-issue superscalar to SMT
Solution:
Combining the following principles:
(a) minimizing the changes to the conventional superscalar architecture
(b) keeping the single-thread case only slightly suboptimal (2%)
(c) achieving maximal throughput when running multiple threads
Throughput: 5.4 instructions per cycle (SMT), a 2.5-fold improvement over an equivalent superscalar
Conditions:
Eight threads with a modified Multiflow compiler
Reference:
[Tullsen96] Tullsen, D. M., Eggers, S. J., Emer, J. S., Levy, H. M., Lo, J. L.,
Stamm, R. L.,
"Exploiting Choice: Instruction Fetch and Issue
on an Implementable Simultaneous Multithreading Processor,"
Proceedings of the ISCA-96, Philadelphia, Pennsylvania,
May 1996, pp. 191-202.
MTP:
IFACT
Combining Catalytic Migration
and Catalytic Reincarnation
Essence:
Research in progress at the University of Belgrade;
supported by FNRS
Combining the best of the two most promising approaches:
Catalytic Migration and Catalytic Reincarnation
Two research activities working in parallel:
numerical domain (Davidovic) and symbolic domain (Janicijevic)
Speedup over traditional SMT
is application-dependent
References:
[Milutinovic96a] Milutinovic, V.,
"Some Solutions for Critical Problems
in Distributed Shared Memory,"
IEEE TCCA Newsletter, September 1996.
[Milutinovic96b] Milutinovic, V.,
"The Best Method for Presentation of Research Results,"
IEEE TCCA Newsletter, September 1996.
