Gonzalez 2016

Ricardo E. Gonzalez, Jeffrey Kuskin, and Ron Ho

Mark Horowitz and His Impact on Computer Architecture
His work in energy efficiency and parallel multiprocessing

IEEE SOLID-STATE CIRCUITS MAGAZINE, Summer 2016
Digital Object Identifier 10.1109/MSSC.2016.2580259
Date of publication: 2 September 2016
1943-0582/16©2016IEEE

In the early 1990s, computer architecture research, both academic and industrial, was flush with big problems worthy of attack. Designers were translating workloads from minicomputers to the microprocessor, and so many classic performance problems were still paramount: systems designers were pushing on instruction-level parallelism using reorder buffers and speculation, multilevel branch prediction, and of course multiscalar machines. They were also working to tear down the memory wall, leveraging inclusion as a key cache design concept, and extending caches to hold both eject and inject data.

Among the many microprocessor architecture problem areas attracting attention were two in particular: energy
efficiency and parallel multiprocessing. Although interrelated, they were distinctly separate windows into the design of large-scale computers. Energy efficiency was a new focus for many architects and laden with semantics that often confused power with work. More importantly, designers were pushing full throttle on the gigahertz race and needed to understand what limits were rapidly approaching. By contrast, parallel machines really implied scalable coherency, and the big question was how to enable that in a productive and efficient manner. In these two areas, Prof. Mark Horowitz, along with his graduate students at Stanford who focused on architecture, built groundbreaking systems, demonstrated new design paradigms, and articulated a number of fundamental observations that laid the groundwork for much of computer architecture to come.

Energy Efficiency
Many architects up to the late 1980s relied on the concept of millions of instructions per second (MIPS) per watt to articulate notions of efficiency. But this ratio of instruction throughput to power is nothing more than the inverse of the power-delay product and, hence, simply the inverse of energy per instruction.

In the era of Intel's Pentium processor, as CMOS gained dominance over bipolar devices, the problem with optimizing systems for MIPS per watt (energy) was thrown into sharp relief. That is because CMOS circuits can gracefully handle a dialed-down supply voltage, unlike bipolar systems that lose levels of cascaded logic with reduced supplies. And then, because power depends quadratically on supply, the solution to a power-delay product optimization is to simply turn down the voltage. Delay degrades approximately linearly with supply, but power falls even faster! Needless to say, this solution produces very slow, and hence relatively uninteresting, systems.

In a pair of early papers, Mark and Ricardo Gonzalez [1], [2] focused on the simple semantics of measuring energy and performance and articulated the benefits of optimizing CMOS circuits for energy-delay product. They first focused on a generic model of CMOS circuits; later, with the commercial adoption of back-biasing allowing designers to easily modify threshold voltage, they also looked at trading off leakage and switching power to optimize efficiency. See, for example, Figure 1, which indicates what choices of threshold voltage and supply voltage offer the best energy efficiency, assuming variability in environmental conditions. The interesting observation from this work arose from sensitivity studies around manufacturing variability. For designs operating close to the minimum energy point, these transistor variations would have an enormous effect on performance, making the energy-delay product of that operating point much worse than it would appear if variations were not considered. This work was based on looking at the delay of simple inverters, with the rationale that not only do the power and performance of most CMOS gates scale similarly to inverters but also the sad but true fact that most of the gates in a chip actually are inverters.
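The difference between optimizing for energy alone and optimizing for energy-delay product can be seen in a few lines of code. The sketch below is only illustrative: the alpha-power-law delay model, the exponent of 2, the 0.4-V threshold, the unit constants, and every function name are assumptions made for this example, not the model used in [1] or [2].

```python
# Illustrative only: unit constants and an alpha-power-law delay model
# (ALPHA = 2) are assumptions for this sketch, not the model used in [1].

VT = 0.4      # assumed threshold voltage, in volts
ALPHA = 2.0   # assumed exponent of the delay model


def energy(v):
    """Switching energy per operation, proportional to C * V^2 (with C = 1)."""
    return v ** 2


def delay(v):
    """Gate delay, proportional to V / (V - VT)^ALPHA."""
    return v / (v - VT) ** ALPHA


def energy_delay_product(v):
    """Energy per operation multiplied by delay per operation."""
    return energy(v) * delay(v)


# Candidate supply voltages from 0.50 V to 3.00 V in 10-mV steps.
supplies = [i / 100 for i in range(50, 301)]

v_min_energy = min(supplies, key=energy)
v_min_edp = min(supplies, key=energy_delay_product)

print(f"minimum energy at V = {v_min_energy:.2f} V")
print(f"minimum energy-delay product at V = {v_min_edp:.2f} V")
```

Under these assumptions, minimizing energy alone simply picks the lowest supply allowed, which is the slow and uninteresting answer described above, whereas the energy-delay product has an interior optimum. For this particular delay model the optimum can also be derived by hand: it falls at a supply of three times the threshold voltage.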
FIGURE 1: Contours of constant energy-delay product for CMOS circuits under variability in voltage, threshold, and temperature (from [1]). Axes: supply voltage (V) versus threshold voltage (V).

In later work on efficient designs, Mark and his students realized that using a single metric like energy-delay product might still lead to suboptimal designs, since it assumes that the designer's goal is to evenly trade off between energy and delay. Clearly not all designs share this assumption. Papers by Dinesh Patil and Sameh Galal leveraged the concepts of Pareto optimality curves in the broad design space of energy-per-operation and performance. This allowed them to articulate families of machines that are highly energy efficient for single-thread performance,

as well as machines focused on energy-per-operation and area efficiency (mm²-per-operation-per-second) for throughput performance [3], [4]. Figure 2 illustrates the tradeoffs between area and energy and how maximally efficient systems could span the solution space along the blue curve, and how real-world constraints like power density can motivate a particular solution as optimal.

FIGURE 2: Energy and throughput Pareto tradeoffs under different constraints (from [4]). Axes: P (W/GFlops) versus A (mm²/GFlops), with maximum power-density constraint lines at 30 and 50 W/cm² and the optimal design marked on the curve.

From his work on modeling efficient processors, it became evident that caches and local memories play an enormous role in making the system efficient. Fetching a bit from an off-chip DRAM utterly dwarfs the energy cost of any on-chip computation or transfer, so any workload relying on sustained DRAM usage has already lost the energetics war. On-chip storage is thus a key component of system efficiency, and so Mark looked at building large and efficient SRAMs. In a series of papers, Bhadaraj Amrutur and Ken Mai developed simple techniques to leverage the regularity of memory arrays to cut energy dissipation per access, for example, by switching high-capacitive lines at a reduced voltage but doing so on both gate and source nodes of an access transistor. In later work, Mai explored the generalization of efficient memory arrays: how one would build a flexible array fabric that could be used either as a traditional cache (with tags and read-modify-write mechanics for least-recently used markers), as a scratch pad, as storage for a gather-scatter engine, or as an atomic/synchronization buffer. This work, to minimize the overhead of specialized hardware from generalized building blocks, has proven remarkably prescient with today's emergent work on workload-specific languages, algorithms, and hardware [5]–[7] and provides the hardware bridge to the realization of those application-targeted systems.

Interconnects, both off and on chip, act as another key consumer of energy. Mark's impact on the field of serial links will be explored in other articles in this issue of IEEE Solid-State Circuits Magazine, but Guyeon Wei focused on optimizing off-chip interconnect energy using adaptive supply regulation [8]. Of course, on-chip interconnect, especially in a future of on-chip mesh networks and tiled multiprocessors, was not to be overlooked, and Ron Ho wrote papers enumerating wire scaling challenges and low-energy usage models [9], [10]. One of the motivating observations Mark made was that, until then, architects and logic designers had the luxury of ignoring wires: they were relatively fast compared to logic gates, and hardware description languages like Verilog inferred wires by simply reusing variables across lines of code. But this was changing. Wires were scaling in the opposite direction from gates, and to understand the system implications, architects and architecture students needed a "how to use wires" guide. So Mark produced one [9].

Along with memories and interconnect, compute circuits form the third leg of the microprocessor architecture tripod, and they can be viewed from the lens of energy efficiency as well. Patil and Galal employed convex methods to optimize integer adders and floating point units for energy and delay. Some interesting observations arose, for example, in the case of adders; many already understood that different parallel-prefix adder structures could minimize any two of logic stages, wire tracks, and fanout. However, Patil pointed out that the energy-efficient choice was, surprisingly, to minimize logic stages and tracks and give up on fanout for the simple reason that logical fanout and electrical fanout are not created equal. A gate that must drive six others can actually drive one critical gate and five super-small noncritical side loads.

Certainly a higher architectural perspective helps to tie together the individual components. Gonzalez, and later Omid Azizi, focused on these system-level explorations [11]. The latter work, in particular, ambitiously modeled an entire microprocessor to extract incremental cost and benefit from a host of architectural features, again leveraging economic concepts of Pareto optimality. Figure 3 shows a broad survey of commercial processors, demonstrating empirically a boundary of energy efficiency that has emerged over decades of industry practice.

Parallelism and Scalable Shared Memory
Multiprocessing and the attempt to bring massive parallelism to computer architecture had its roots in the minicomputer era. By the early 1990s, many had turned toward cache coherency as a foundational problem to solve. Designers envisioned distributing memory among each of many so-called node cards, giving each processor fast local memory, but the primary debate in the computer architecture community was how to keep all those memories
Six observations on Mark Horowitz

Observation 1
Anyone who volunteers for IEEE or works on a conference committee will recognize the syndrome. You find yourself in a room full of people you don't know very well, but you know their names and their reputations, and you think "I sure hope I don't make a fool of myself." Then the saying "better be silent and be thought a fool, than to speak up and remove all doubt" pops into your head. The discussion begins, and you spend half your time thinking "what the heck did they just say?" and the other half thinking "wait, that can't be right." And then Mark Horowitz speaks up, cuts right to the heart of what's being discussed, and in a reasonable, nonoffensive way, gets two key points across: 1) you're all on the wrong track and 2) here's the right one. It's a beautiful thing to watch.

Observation 2
Mark doesn't disdain philosophical and abstract discussions per se. Instead, he deftly substitutes a quantitative argument that illuminates the actual issue and effectively indirectly quashes the "angels on a pin" argument that had been holding sway. In so doing, he single-handedly lowers the frustration level in the room, even on the part of the people who were so intently estimating angel populations.

Observation 3
Mark's not afraid to say the words that, in certain tech circles, are almost never uttered: "I don't know." Nobody can possibly know everything, but the percentage of people willing to fake it is awfully high sometimes. The problem seems not quite so chronic these days, when all the silent observers (see Observation 1) can instantly Google-check any particularly sketchy assertions made by inexplicably confident patriarchs of the field. (I can't think of any cases where women pulled this stunt. It's usually the men.) Not so long ago, however, meetings were routinely derailed by people who actually were bona fide experts in a set of certain things but who were equally willing (and equally likely) to opine on topics not in that set. By contrast, if Mark asserts something, you had better have a really, really good reason for disputing it.

Observation 4
There's a related issue, and again, Mark exemplifies the right way to do it: he's a good listener. When others are making legitimate points, he listens intently, and then he often does a remarkable thing (remarkable for how seldom it happens): he asks follow-up questions intended to help the other person with their exposition and, also presumably, to help Mark clearly understand what they're proposing. Too often, tech folks in a similar position are simply waiting for whoever has the floor to shut up so that they can start talking. We should all be more like Mark.

Observation 5
Mark is world class at knowing what level of abstraction is the right one for helping to resolve or clarify whatever is under discussion. Sometimes that level is so low that atomic physics and thermodynamics are the entrance fee; other times the big picture is required so that the details don't overwhelm. No matter what level is the right one, Mark will think carefully, consider what others are saying, and then reliably offer up some trenchant observation that is universally on target, interesting, insightful, and influential.

Observation 6
Finally, you know that guy in the ads, where they say he's "the most interesting man in the world"? I really dislike that guy and that ad. But in thinking about why I hate the ad, I realized that I have come to really value people who are, in fact, interesting. They notice things about the world, and then they think deeply about what their observations might mean about the way the physical world is and the way it works. I think Mark is one of these people: he's interesting!

Bob Colwell

About the Author
Bob Colwell most recently served as director of the Microsystems Technology Office at DARPA. He was an Intel Fellow and the chief IA-32 architect at Intel for the Pentium Pro, II, III, and 4 microprocessors, retiring in 2000. He received the ACM Eckert-Mauchly Award in 2005.

Digital Object Identifier 10.1109/MSSC.2016.2580260
Date of publication: 2 September 2016

coherent, and whether a scalable cache-coherency system could be reasonably implemented in hardware and run fast enough. Some argued it could be done and that the system could maintain a coherent shared memory image without the programmer's help. Others pushed for simpler distributed machines that could only share coherency information through explicit message passing.

FIGURE 3: A survey of microprocessors on a Pareto curve for the energy-per-operation and performance (from [11]). Axes: energy per operation versus performance, both normalized for technology, with the energy-efficient frontier marked. The survey covers Intel, Alpha, MIPS, HP PA, PowerPC, IBM Power, AMD, and Sun SPARC processors.

At Stanford, Mark worked on an early machine, the DASH multiprocessor, that implemented a ground-breaking 64-node machine with distributed cache-coherent shared memory and could support several scientific applications in the SPLASH suite of parallel applications [12]. Two of the faculty sponsors of DASH, Mark and Stanford President John Hennessy, not only guided the hardware development but also helped to facilitate an early and highly successful technology transfer to industry. The Silicon Graphics Origin 2000 was one of the first commercial hardware-based cache-coherent systems, and it was architected and built by many of the same students who had built DASH and were later hired by SGI. This was an example of the general principle articulated by computer pioneer Bert Sutherland that "technology transfer is a contact sport."

Almost immediately after completing DASH, Mark began the development of the next Stanford parallel system, called FLASH, to explore questions around the parallel programming model and how it related to distributed shared memory versus message passing and the data structures and implementations on which those programming models are built [13]. One of Mark's observations motivating FLASH was that ASIC development had sufficiently advanced to the point that a reasonably complex system could be built by a small team of motivated graduate students. That opened the door to build a programmable implementation of a so-called node controller (the engine providing the cache coherency model primitives), instead of the hardwired implementation that DASH used.

By creating a custom and flexible substrate for multiprocessing, FLASH could offer an integrated and streamlined set of hardware primitives for both global cache coherence and also user-level message passing. It was, in effect, the ideal research vehicle to test and compare both types of systems.

FLASH was an enormous research success, spanning over five years of development and graduating more than 20 Ph.D. and M.S. students, and Mark was the primary faculty member behind it. He helped to guide major architectural and design decisions of the node controller (nicknamed the MAGIC chip), and he steered the team not only in creating a useful simulation framework to estimate its performance but also in using those results to capture the key research questions FLASH needed to answer. Of course, with any group of bright-eyed graduate students, there is always the danger of over-scoping the problem, and Mark played a critical role in containing the untrained appetites of the team to ensure that the machine would be buildable. But perhaps most importantly, when needed, he gave critical autonomy to the graduate students who made up the design team. Led primarily by Jeff Kuskin and Dave Ofelt, along with a talented team of other doctoral candidates, the team exemplified learning by doing, all under Mark's mentoring.

As with any large-scale system, there were efforts down in the silicon, with Mark teaching the team to build a custom six-ported memory whose area and frequency were critical to the node controller's performance. He helped tackle computer-aided design (CAD) issues in finding out what new tools were being developed, either in the commercial space or internal to companies, and how to get access to those tools for his students. He worked to get large

ASICs fabricated with the sponsorship of companies like Silicon Graphics, Inc. and LSI and then taught the team how to package, test, and sort the chips that resulted. He was equally at home at a dry erase board drawing stick figures of layout to floor plan critical circuits, wielding a soldering iron to swap out a bad connector, showing the students proper oscilloscope techniques (hint: never use the auto-scale button around Mark), or grabbing a screwdriver to assemble a chassis.

Finally, Mark embodied one of the most under-appreciated virtues of a faculty advisor: he pushed students hard to finish their doctoral degrees. He was an exacting mentor, but he worked to ensure that students were driving toward an actual conclusion to their Ph.D. tenures. If he was one of your advisors, you knew the weekly grilling, painful as it might be at times, was bringing you closer to completion.

Final Thoughts
Computer architecture, much like the rest of the silicon revolution, has undergone enormous transformations over the past few decades, and the hand-held computer systems we casually store in our pockets today are profoundly more powerful than the mainframes of the past. One of the hallmarks of this progress has been a virtuous feedback loop in which advances in all aspects of computer technology become fundamental enablers for building newer, faster, and more powerful machines. Mark and his students have been an integral part of this cycle, making significant and sometimes paradigm-shifting contributions across the space. Here we discussed advances in energy efficiency and parallel distributed memory, and we simply lacked the space to continue the narrative into CAD tools, circuits, microarchitectures, compilers, and so on. But partly due to these contributions, each generation of computers has been just powerful enough (in compute performance, or storage, or interconnect bandwidth, or richness of display) to build the next generation of computers. This feedback loop has allowed the engineering community to repeatedly bootstrap design after design and, thus, put incredibly powerful systems in large data farms, in our cars, in our pockets, and inextricably woven into the fabric of everyday life.

References
[1] R. Gonzalez and M. Horowitz, "Energy dissipation in general purpose microprocessors," IEEE J. Solid-State Circuits, vol. 31, no. 9, Sept. 1996.
[2] R. Gonzalez and M. Horowitz, "Supply and threshold voltage scaling for low-power CMOS," IEEE J. Solid-State Circuits, vol. 32, no. 8, Aug. 1997.
[3] D. Patil, O. Azizi, R. Anantharaman, R. Ho, and M. Horowitz, "Robust energy-efficient adder topologies," in Proc. IEEE Symp. Computer Arithmetic, 2007.
[4] S. Galal and M. Horowitz, "Energy-efficient floating-point unit design," IEEE Trans. Computers, vol. 60, no. 7, June 2010.
[5] B. Amrutur and M. Horowitz, "Speed and power scaling of SRAMs," IEEE J. Solid-State Circuits, vol. 35, no. 2, Feb. 2000.
[6] K. Mai, T. Mori, B. Amrutur, R. Ho, and M. Horowitz, "Low-power SRAM design using half-swing pulse-mode techniques," IEEE J. Solid-State Circuits, vol. 33, no. 11, Nov. 1998.
[7] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz, "Smart memories: A modular reconfigurable architecture," in Proc. Int. Symp. Computer Architecture, 2000.
[8] G. Wei and M. Horowitz, "A fully digital, energy-efficient adaptive power supply regulator," IEEE J. Solid-State Circuits, vol. 34, no. 4, Apr. 1999.
[9] R. Ho, K. Mai, and M. Horowitz, "The future of wires," Proc. IEEE, vol. 89, no. 4, Apr. 2001.
[10] R. Ho, K. Mai, and M. Horowitz, "Efficient on-chip global interconnects," in Proc. IEEE Symp. VLSI Circuits, 2003.
[11] O. Azizi, "Design and optimization of processors for energy efficiency: A joint architecture-circuit approach," Ph.D. dissertation, Stanford University, CA, 2010.
[12] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH multiprocessor," IEEE Computer, vol. 25, no. 3, 1992.
[13] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy, "The Stanford FLASH multiprocessor," in Proc. Int. Symp. Computer Architecture, 1994.

About the Authors
Ricardo E. Gonzalez received the B.S., M.S., and Ph.D. degrees in electrical engineering from Stanford University, California. He was a member of founding teams at Tensilica and Stretch, where he led the development of configurable and extensible processors. He has also worked at Intel, VMware, and Pure Storage. In the fall, he plans to pursue an M.S. degree in ecosystems and climate change from Imperial College London.

Jeffrey Kuskin received an undergraduate degree from Dartmouth College and M.S. and Ph.D. degrees in electrical engineering from Stanford University, California. From 1997 to 2000, he designed large-scale distributed shared memory multiprocessors at Silicon Graphics. From 2000 to 2004, he developed wireless networking chip sets at Atheros Communications. Since 2004, he has been at D.E. Shaw Research, designing special-purpose hardware to accelerate molecular dynamics simulations of proteins and other biological macromolecules.

Ron Ho (ronho@ieee.org) received his undergraduate, master's, and doctoral degrees from Stanford University, California. He is currently director of Interconnect IP at Altera, now part of Intel Corporation. From 2003 to 2014, he was with Sun Microsystems (later Oracle Corporation), and from 1993 to 2003 he was with Intel Corporation.
