
Dataflow Architectures and Multithreading

Ben Lee, Oregon State University

A.R. Hurson, Pennsylvania State University

The dataflow model of execution offers attractive properties for parallel processing. First, it is asynchronous: Because it bases instruction execution on operand availability, synchronization of parallel activities is implicit in the dataflow model. Second, it is self-scheduling: Except for data dependencies in the program, dataflow instructions do not constrain sequencing; hence, the dataflow graph representation of a program exposes all forms of parallelism, eliminating the need to explicitly manage parallel execution. For high-speed computations, the advantage of the dataflow approach over the control-flow method stems from the inherent parallelism embedded at the instruction level. This allows efficient exploitation of fine-grain parallelism in application programs.
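To make the self-scheduling property concrete, here is a minimal sketch (plain Python with invented names; it models no particular machine) of a dataflow graph in which each node fires as soon as all of its operands have arrived, with no program counter involved:

```python
# Hypothetical sketch of data-driven firing: a node executes as soon as
# all of its input operands are available; ordering is constrained only
# by data dependencies, so independent nodes may fire in any order.

class Node:
    def __init__(self, name, op, arity, dests):
        self.name, self.op, self.arity = name, op, arity
        self.dests = dests          # downstream (node name, input port) pairs
        self.operands = {}          # input port -> arrived value

    def receive(self, port, value):
        self.operands[port] = value
        return len(self.operands) == self.arity   # enabled?

def run(nodes, initial_tokens):
    # A token is (node name, input port, value).
    ready = list(initial_tokens)
    results = {}
    while ready:
        name, port, value = ready.pop()
        node = nodes[name]
        if node.receive(port, value):             # all operands present: fire
            result = node.op(*[node.operands[p] for p in sorted(node.operands)])
            results[name] = result
            for dest, dport in node.dests:        # emit result tokens
                ready.append((dest, dport, result))
    return results

# Graph for (a + b) * (a - b), with a = 5, b = 3.
nodes = {
    "add": Node("add", lambda x, y: x + y, 2, [("mul", 0)]),
    "sub": Node("sub", lambda x, y: x - y, 2, [("mul", 1)]),
    "mul": Node("mul", lambda x, y: x * y, 2, []),
}
tokens = [("add", 0, 5), ("add", 1, 3), ("sub", 0, 5), ("sub", 1, 3)]
print(run(nodes, tokens)["mul"])   # (5+3) * (5-3) -> 16
```

Nothing in the graph orders "add" relative to "sub"; both are enabled by their operands alone, which is exactly the parallelism the dataflow representation exposes.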
Due to its simplicity and elegance in describing parallelism and data dependencies, the dataflow execution model has been the subject of many research efforts. Since the early 1970s, a number of hardware prototypes have been built and evaluated, and different designs and compiling techniques have been simulated. The experience gained from these efforts has led to progressive development in dataflow computing. However, a direct implementation of computers based on the dataflow model has been found to be an arduous task.
Contrary to initial expectations, implementing dataflow computers has presented a monumental challenge. Now, however, multithreading offers a viable alternative for building hybrid architectures that exploit parallelism.

Studies from past dataflow projects revealed some inefficiencies in dataflow computing. For example, compared to its control-flow counterpart, the dataflow model's fine-grained approach to parallelism incurs more overhead in instruction cycle execution. The overhead involved in detecting enabled instructions and constructing result tokens generally results in poor performance in applications with low degrees of parallelism. Another problem with the dataflow model is its inefficiency in handling data structures (for example, arrays of data). The execution of an instruction involves consuming tokens at the input arcs and generating result tokens at the output arcs. Since tokens represent scalar values, the representation of data structures (collections of many tokens) poses serious problems.

In spite of these shortcomings, we're seeing renewed interest in dataflow computing. This revival is facilitated by a lack of developments in the conventional parallel-processing arena, as well as by changes in the actual implementation of the dataflow model. One important development, the emergence of an efficient mechanism to detect enabled nodes, replaces the expensive and complex process of matching tags used in past projects. Another change is the convergence of the control-flow and dataflow models of execution. This shift in viewpoint allows incorporation of conventional control-flow thread execution into the dataflow approach, thus alleviating the inefficiencies associated with the pure-dataflow method. Finally, some researchers suggest supporting the dataflow concept with appropriate compiling technology and program representation instead of specific hardware. This view allows implementation on existing control-flow processors.

August 1994 · 0018-9162/94/$4.00 © 1994 IEEE

These developments, coupled with experimental evidence of the dataflow approach's success in exposing substantial parallelism in application programs, have motivated new research in dataflow computing. However, issues such as program partitioning and scheduling and resource requirements must still be resolved before the dataflow model can provide the necessary computing power to meet today's and future demands.

Definitions

Dataflow execution model. In the dataflow execution model, an instruction execution is triggered by the availability of data rather than by a program counter.

Static dataflow execution model. The static approach allows at most one instance of a node to be enabled for firing. A dataflow actor can be executed only when all of the tokens are available on its input arcs and no tokens exist on any of its output arcs.

Dynamic dataflow execution model. The dynamic approach permits activation of several instances of a node at the same time during runtime. To distinguish between instances of a node, a tag associated with each token identifies the context in which a particular token was generated. An actor is considered executable when its input arcs contain a set of tokens with identical tags.

Direct matching scheme. In a direct matching scheme, any computation is completely described by a pointer to an instruction (IP) and a pointer to an activation frame (FP). A typical instruction pointed to by an IP specifies an opcode, an offset in the activation frame where the match will take place, and one or more displacements that define the destination instructions that will receive the result token(s). The actual matching process is achieved by checking the disposition of the slot in the frame. If the slot is empty, the value of the token is written in the slot and its presence bit is set to indicate that the slot is full. If the slot is already full, the value is extracted, leaving the slot empty, and the corresponding instruction is executed.

Thread. Within the scope of a dataflow environment, a thread is a sequence of statically ordered instructions such that once the first instruction in the thread is executed, the remaining instructions execute without interruption (synchronization occurs only at the beginning of the thread). Multithreading implies the interleaving of these threads. Interleaving can be done in many ways, that is, on every cycle, on remote reads, and so on.

Past dataflow architectures

Three classic dataflow machines were developed at MIT and the University of Manchester: the Static Dataflow Machine, the Tagged-Token Dataflow Architecture (TTDA), and the Manchester Machine. These machines - including the problems encountered in their design and their major shortcomings - provided the foundation that has inspired many current dataflow projects.

Static versus dynamic architectures. In the abstract dataflow model, data values are carried by tokens. These tokens travel along the arcs connecting various instructions in the program graph. The arcs are assumed to be FIFO queues of unbounded capacity. However, a direct implementation of this model is impossible. Instead, the dataflow execution model has been traditionally classified as either static or dynamic.

Static. The static dataflow model was proposed by Dennis and his research group at MIT. Figure 1 shows the general organization of their Static Dataflow Machine. The activity store contains instruction templates that represent the nodes in a dataflow graph. Each instruction template contains an operation code, slots for the operands, and destination addresses (Figure 2). To determine the availability of the operands, slots contain presence bits. The update unit is responsible for detecting instructions that are available to execute. When this condition is verified, the unit sends the address of the enabled instruction to the fetch unit via the instruction queue. The fetch unit fetches and sends a complete operation packet containing the corresponding opcode, data, and destination list to one of the operation units and clears the presence bits. The operation unit performs the operation, forms result tokens, and sends them to the update unit. The update unit stores each result in the appropriate operand slot and checks the presence bits to determine whether the activity is enabled.

Figure 1. The basic organization of the static dataflow model.
Figure 2. An instruction template for the static dataflow model.

Dynamic. The dynamic dataflow model was proposed by Arvind at MIT and by Gurd and Watson at the University of Manchester. Figure 3 shows the general organization of the dynamic dataflow model. Tokens are received by the matching unit, which is a memory containing a pool of waiting tokens. The unit's basic operation brings together tokens with identical tags. If a match exists, the corresponding token is extracted from the matching unit, and the matched token set is passed to the fetch unit. If no match is found, the token is stored in the matching unit to await a partner. In the fetch unit, the tags of the token pair uniquely identify an instruction to be fetched from the program memory. Figure 4 shows a typical instruction format for the dynamic dataflow model. It consists of an operation code, a literal/constant field, and destination fields. The instruction and the token pair form the enabled instruction, which is sent to the processing unit. The processing unit executes the enabled instructions and produces result tokens to be sent to the matching unit via the token queue.

Figure 3. The general organization of the dynamic dataflow model.
Figure 4. An instruction format for the dynamic dataflow model.

Centralized or distributed. Dataflow architectures can also be classified as centralized or distributed, based on the organization of their instruction memories. The Static Dataflow Machine and the Manchester Machine both have centralized memory organizations. MIT's dynamic dataflow organization is a multiprocessor system in which the instruction memory is distributed among the processing elements. The choice between centralized or distributed memory organization has a direct effect on program allocation.

Comparison. The major advantage of the static dataflow model is its simplified mechanism for detecting enabled nodes. Presence bits (or a counter) determine the availability of all required operands. However, the static dataflow model has a performance drawback when dealing with iterative constructs and reentrancy. The attractiveness of the dataflow concept stems from the possibility of concurrent execution of all independent nodes if sufficient resources are available, but reentrant graphs require strict enforcement of the static firing rule to avoid nondeterminate behavior. (The static firing rule states that a node is enabled for firing when a token exists on each of its input arcs and no token exists on its output arc.) To guard against nondeterminacy, extra arcs carry acknowledge signals from consuming nodes to producing nodes. These acknowledge signals ensure that no arc will contain more than one token.

The acknowledge scheme can transform a reentrant code into an equivalent graph that allows pipelined execution of consecutive iterations, but this transformation increases the number of arcs and tokens. More important, it exploits only a limited amount of parallelism, since the execution of consecutive iterations can never fully overlap - even if no loop-carried dependencies exist. Although this inefficiency can be alleviated in the case of loops by providing multiple copies of the program graph, the static dataflow model lacks the general support for programming constructs essential for any modern programming environment (for example, procedure calls and recursion).

The major advantage of the dynamic dataflow model is the higher performance it obtains by allowing multiple tokens on an arc. For example, a loop can be dynamically unfolded at runtime by creating multiple instances of the loop body and allowing concurrent execution of the instances. For this reason, current dataflow research efforts indicate a trend toward adopting the dynamic dataflow model. However, as we'll see in the next section, its implementation presents a number of difficult problems.

Lessons learned. Despite the dynamic dataflow model's potential for large-scale parallel computer systems, earlier experiences have identified a number of shortcomings:

• the overhead involved in matching tokens is heavy,
• resource allocation is a complicated process,
• the dataflow instruction cycle is inefficient, and
• handling data structures is not trivial.

Detection of matching tokens is one of the most important aspects of the dynamic dataflow computation model. Previous experiences have shown that performance depends directly on the rate at which the matching mechanism processes tokens. To facilitate matching while considering the cost and the availability of a large-capacity associative memory, Arvind proposed a pseudoassociative matching mechanism that typically requires several memory accesses, but this increase in memory accesses severely degrades the performance and the efficiency of the underlying dataflow machines.

A more subtle problem with token matching is the complexity of allocating resources (memory cells). A failure to find a match implicitly allocates memory within the matching hardware. In other words, mapping a code-block to a processor places an unspecified commitment on the processor's matching unit. If this resource becomes overcommitted, the program may deadlock. In addition, due to hardware complexity and cost, one cannot assume this resource is so plentiful it can be wasted. (See Culler and Arvind for a good discussion of the resource requirements issue.)

A more general criticism leveled at dataflow computing is instruction cycle inefficiency. A typical dataflow instruction cycle involves (1) detecting enabled nodes, (2) determining the operation to be performed, (3) computing the results, and (4) generating and communicating result tokens to appropriate target nodes. Even though the characteristics of this instruction cycle may be consistent with the pure-dataflow model, it incurs more overhead than its control-flow counterpart. For example, matching tokens is more complex than simply incrementing a program counter. The generation and communication of result tokens imposes inordinate overhead compared with simply writing the result to a memory location or a register. These inefficiencies in the pure-dataflow model tend to degrade performance under a low degree of parallelism.

Another formidable problem is the management of data structures. The dataflow functionality principle implies that all operations are side-effect free; that is, when a scalar operation is performed, new tokens are generated after the input tokens have been consumed. However, absence of side effects implies that if tokens are allowed to carry vectors, arrays, or other complex structures, an operation on a structure element must result in an entirely new structure. Although this solution is theoretically acceptable, in practice it creates excessive overhead at the level of system performance. A number of schemes have been proposed in the literature, but the problem of efficiently representing and manipulating data structures remains a difficult challenge.

Table 1. Architectural features of current dataflow systems.

Pure-dataflow
  General characteristics: Implements the traditional dataflow instruction cycle; direct matching of tokens.
  Monsoon (references 1, 2):
    • Direct matching of tokens using rendezvous slots in the frame memory (ETS model).
    • Associates three temporary registers with each thread of computation.
    • Sequential scheduling implemented by a recirculating scheduling paradigm using a direct recirculation path (instructions are annotated with special marks to indicate that the successor instruction is IP+1).
    • Multiple threads supported by FORK and implicit JOIN.
  Epsilon-2 (reference 3):
    • A separate match memory maintains match counts for the rendezvous slots, and each operand is stored separately in the frame memory.
    • Repeat unit is used to reduce the overhead of copying tokens and to represent a thread of computation (macroactor) as a linked list.
    • Use of a register file to temporarily store values within a thread.

Macro-dataflow
  General characteristics: Integration of a token-based dataflow circular pipeline and an advanced control pipeline; direct matching of tokens.
  EM-4 (reference 4):
    • Use of macroactors based on the strongly connected arc model, and the execution of macroactors using an advanced control pipeline.
    • Use of registers to reduce the instruction cycle and the communication overhead of transferring tokens within a macroactor.
    • Special thread library functions, FORK and NULL, to spawn and synchronize multiple threads.

Hybrid
  General characteristics: Based on a conventional control-flow processor, i.e., sequential scheduling is implicit in the RISC-based architecture; tokens do not carry data, only continuations; provides limited token-matching capability through special synchronization primitives (i.e., JOIN); message handlers implement interprocessor communication; can use both conventional and dataflow compiling technologies.
  P-RISC (reference 5):
    • Support for multithreading using a token queue and circulating continuations.
    • Context switching can occur on every cycle or when a thread dies due to LOADs or JOINs.
  *T (reference 6):
    • Overhead is reduced by off-loading the burden of message handling and synchronization to separate coprocessors.
  Threaded Abstract Machine (reference 7):
    • Placing all synchronization, scheduling, and storage management responsibility under compiler control, e.g., exposing the token queue (i.e., continuation vector) for scheduling threads.
    • Having the compiler produce specialized message handlers as inlets to each code-block.

1. D.E. Culler and G.M. Papadopoulos, "The Explicit Token Store," J. Parallel & Distributed Computing, Vol. 10, 1990, pp. 289-308.
2. G.M. Papadopoulos and K.R. Traub, "Multithreading: A Revisionist View of Dataflow Architectures," Proc. 18th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2146, 1991, pp. 342-351.
3. V.G. Grafe and J.E. Hoch, "The Epsilon-2 Multiprocessor System," J. Parallel & Distributed Computing, Vol. 10, 1990, pp. 309-318.
4. S. Sakai et al., "An Architecture of a Dataflow Single-Chip Processor," Proc. 16th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 46-53.
5. R.S. Nikhil and Arvind, "Can Dataflow Subsume von Neumann Computing?" Proc. 16th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 262-272.
6. R.S. Nikhil, G.M. Papadopoulos, and Arvind, "*T: A Multithreaded Massively Parallel Architecture," Proc. 19th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2941 (microfiche only), 1992, pp. 156-167.
7. D.E. Culler et al., "Fine-Grain Parallelism with Minimal Hardware Support: A Compiler-Controlled Threaded Abstract Machine," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 1991.

Figure 5. Organization of a pure-dataflow processing element.
Figure 6. Organization of a hybrid processing element.

Recent architectural developments

Table 1 lists the features of six machines that represent the current trend in dataflow architecture: MIT's Monsoon, Sandia National Labs' Epsilon-2, Electrotechnical Labs' EM-4, MIT's P-RISC, MIT's *T, and UC Berkeley's TAM (Threaded Abstract Machine). Each proposal provides a different perspective on how parallel processing based on the dataflow concept can be realized; however, they share the goal of alleviating the inefficiencies associated with the dataflow model of computation. From these efforts, a number of innovative solutions have emerged.

Three categories. The dataflow machines currently advanced in the literature can be classified into three categories: pure-dataflow, macro-dataflow, and hybrid.

Figure 5 shows a typical processing element (PE) based on the pure-dataflow organization. It consists of an execution pipeline connected by a token queue. (The processing unit contains an ALU and a target address calculation unit for computing the destination address(es). It may also contain a set of registers to temporarily store operands between instructions.) The pure-dataflow organization is a slight modification of an architecture that implements the traditional dataflow instruction cycle. The major differences between the current organizations and the classic dynamic architectures are (1) the reversal of the instruction fetch unit and the matching unit and (2) the introduction of frames to represent contexts. These changes are mainly due to the implementation of a new matching scheme. Monsoon and Epsilon-2 are examples of machines based on this organization.

The hybrid organization departs more radically from the classic dynamic architectures. Tokens carry only tags, and the architecture is based on conventional control-flow sequencing (see Figure 6). Therefore, architectures based on this organization can be viewed as von Neumann machines that have been extended to support fine-grained dataflow capability. Moreover, unlike the pure-dataflow organization, where token matching is implicit in the architecture, machines based on the hybrid organization provide a limited token-matching capability through special synchronization primitives. P-RISC, *T, and TAM can be categorized as hybrid organizations.

The macro-dataflow organization, shown in Figure 7, is a compromise between the other two approaches. It uses a token-based circular pipeline and an advanced control pipeline (a look-ahead control that implements instruction prefetching and token prematching to reduce idle times caused by unsuccessful matches). The basic idea is to shift to a coarser grain of parallelism by incorporating control-flow sequencing into the dataflow approach. EM-4 is an example of a macro-dataflow organization.

Token matching. One of the most important developments to emerge from current dataflow proposals is a novel and simplified process of matching tags - direct matching. The idea is to eliminate the expensive and complex process of associative search used in previous dynamic dataflow architectures. In a direct matching scheme, storage (called an activation frame) is dynamically allocated for all the tokens generated by a code-block. The actual location used within a code-block is determined at compile time; however, the allocation of activation frames is determined during runtime. For example, "unfolding" a loop body is achieved by allocating an activation frame for each loop iteration. The matching tokens generated within an iteration have a unique

slot in the activation frame in which they converge. The actual matching process simply checks the disposition of the slot in the frame memory.

Figure 7. Organization of a macro-dataflow processing element.
Figure 8. Explicit-token-store representation of a dataflow program execution.

In a direct matching scheme, any computation is completely described by a pointer to an instruction (IP) and a pointer to an activation frame (FP). The pair of pointers, <FP.IP>, is called a continuation and corresponds to the tag part of a token. A typical instruction pointed to by an IP specifies an opcode, an offset in the activation frame where the match will take place, and one or more displacements that define the destination instructions that will receive the result token(s). Each destination is also accompanied by an input port (left/right) indicator that specifies the appropriate input arc for a destination actor.

To illustrate the operations of direct matching in more detail, consider the token-matching scheme used in Monsoon. Direct matching of tokens in Monsoon is based on the Explicit Token Store (ETS) model. Figure 8 shows an ETS code-block invocation and its corresponding instruction and frame memory. When a token arrives at an actor (for example, ADD), the IP part of the continuation points to the instruction that contains an offset r as well as displacement(s) for the destination instruction(s). Matching is achieved by checking the disposition of the slot in the frame memory pointed to by FP + r. If the slot is empty, the value of the token is written in the slot, and its presence bit is set to indicate that the slot is full. If the slot is already full, the value is extracted, leaving the slot empty, and the corresponding instruction is executed. The result token(s) generated from the operation is communicated to the destination instruction(s) by updating the IP according to the displacement(s) encoded in the instruction. For example, execution of the ADD operation produces two result tokens, <FP.IP + 1, 3.55> and <FP.IP + 2, 3.55>.

In one variation to the matching scheme, EM-4 maintains a simple one-to-one correspondence between the address of an instruction and the address of its rendezvous slot (Figure 9). This is achieved by allocating an operand segment - analogous to an activation frame - that contains the same number of memory words as the template segment that contains the code-block. In addition, the operand segment is bound to a template segment by storing the corresponding segment number in the first word of the operand segment. The token's continuation contains only an FP and an offset. These values are used to determine the unique location in the operand segment (called an entry point) to match tokens and then to fetch the corresponding instruction word.

In the Epsilon-2 dataflow multiprocessor, a separate storage (match memory) contains rendezvous slots for incoming tokens (see Table 1, reference 3). Similar to Monsoon, an offset encoded in the instruction word is used to determine the match-memory location to match tokens. However, unlike Monsoon, each slot is associated with a match count that is initialized to zero. As tokens arrive, the match count is compared with the value encoded in the opcode. If the match count is less than the value encoded in the opcode, the match count is incremented and stored back in the match memory. Otherwise, the node is considered enabled, and the match count is reinitialized to zero. The instruction word also specifies offsets in the frame memory where the actual operands reside. Therefore, in contrast to the scheme used in Monsoon or EM-4, the opcode specifies two separate frame locations for the operands.

Direct matching schemes used in the pure-dataflow and macro-dataflow organizations are implicit in the architecture. In other words, the token-matching mechanism provides the full generality of the dataflow model of execution and therefore is supported by the hardware. Architectures based on the hybrid organization, on the other hand, provide a limited direct matching capability through software implementation using special JOIN instructions.
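The ETS-style slot protocol can be sketched as follows. This is a simplified model, not Monsoon's actual pipeline; the frame size, opcode, offset, and operand values are illustrative only. A continuation <FP.IP> plus a compile-time offset r replaces associative search with a single presence-bit check at frame location FP + r:

```python
# Simplified sketch of ETS-style direct matching: the first operand to
# arrive is parked in frame slot FP + r; the second extracts it, fires
# the instruction, and emits one result token per displacement.

EMPTY, FULL = 0, 1

class PE:
    def __init__(self, frame_size, instructions):
        self.frame = [None] * frame_size
        self.presence = [EMPTY] * frame_size
        self.imem = instructions   # IP -> (opcode, offset r, displacements)

    def token(self, fp, ip, value):
        """Process token <FP.IP, value>; return result tokens, or [] if parked."""
        opcode, r, disps = self.imem[ip]
        slot = fp + r
        if self.presence[slot] == EMPTY:       # first operand: park it
            self.frame[slot] = value
            self.presence[slot] = FULL
            return []
        partner = self.frame[slot]             # second operand: fire
        self.presence[slot] = EMPTY
        result = {"ADD": partner + value}[opcode]
        # One result token per displacement, e.g. <FP.IP+1> and <FP.IP+2>.
        return [(fp, ip + d, result) for d in disps]

# An ADD at IP 4 matches in slot FP+2 and feeds the next two instructions.
pe = PE(frame_size=8, instructions={4: ("ADD", 2, [1, 2])})
assert pe.token(fp=0, ip=4, value=1.25) == []   # slot FP+2 was empty: parked
print(pe.token(fp=0, ip=4, value=2.30))         # [(0, 5, 3.55), (0, 6, 3.55)]
```

The second call reproduces the article's ADD example: two result tokens carrying 3.55 to <FP.IP + 1> and <FP.IP + 2>. A P-RISC JOIN can be read as the same slot toggle exposed as an explicit instruction.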

ample of such an architecture dataflow organization).For the
(Table 1, reference 5). P- sake of presentation, we side-
RISC is based on a conven- step discussion of the first ap-
Operandsegment proach, since sequential
tional RISC-type architec-
ture, which is extended to FP schedulingis implicit in hybrid
support fine-grain dataflow architectures,and focus on the
capability. To synchronize second.
two threads of computations, Control-flowsequencingcan
a JOIN x instruction toggles be incorporated into the
the contents of the frame lo- dataflow model in a couple
cation FP + x. If the frame lo- ways. The first method is a sim-
cation FP + x is empty, no ple recirculate scheduling
continuation is produced, and paradigm. Since a continuation
the thread dies. If the frame is completely described by a
location is full, it produces a pair of pointers <FP.IP>,
continuation <FP.IP + 1>. where IP represents a pointer
Therefore, a JOIN instruction to the current instruction, the
implements the direct match- successor instruction is simply
ing scheme and provides a Figure 9. The EM-4’sdirect matchingscheme. <FP.IP + 1>.In addition to this
general mechanism for syn- simple manipulation, the hard-
chronizing two threads of ware must support the immedi-
computations.Note that the ate reinsertionof continuations
JOIN operation can be generalizedto sup- ming medium to encode imperative op- into the execution pipeline. This is
port n-way synchronization;that is, the erations essential for execution of oper- achieved by using a direct recirculation
frame locationx is initialized to n - 1and ating system functions (for example, re- path (see Figure 5) to bypass the token
different JOIN operationsdecrementit un- source management).4 In addition, the queue.
til it reaches zero. JOIN instructions are instruction cycle can be further reduced One potential problem with the recir-
used in *T and TAM to provide explicit by using a register file to temporarily culation method is that successor contin-
synchronization. store operands in a grain, which elimi- uations are not generated until the end
nates the overhead involved in con- of the pipeline (“Form token unit” in Fig-
Convergence of dataflow and von structing and transferring result tokens ure 5). Therefore, execution of the next
Neumann models. Dataflow architec- within a grain. instruction in a computationalthread will
tures based on the original model pro- experience a delay equal to the number
vide well-integratedsynchronization at a of stages in the pipe. This means that the
very basic level - the instruction level. total number of cycles required to exe-
The combined dataflowlvon Neumann cute a single thread of computation in a
model groups instructions into larger The combined model k-stage pipeline will be k times the num-
ber of instructions in the thread. On the
grains so that instructionswithin a grain
can be scheduled in a control-flow fash-
groups instructions other hand, the recirculation method al-
ion and the grains themselves in a dataflow fashion. This convergence combines the power of the dataflow model for exposing parallelism with the execution efficiency of the control-flow model. Although the spectrum of dataflow/von Neumann hybrids is very broad, two key features supporting this shift are sequential scheduling and the use of registers to temporarily buffer the results between instructions.

Sequential scheduling. Exploiting a coarser grain of parallelism (compared with instruction-level parallelism) allows use of a simple control-flow sequencing within the grain. This is in recognition of the fact that data-driven sequencing is unnecessarily general and such flexible instruction scheduling comes at a cost of overhead required to match tokens. Moreover, the self-scheduling paradigm fails to provide an adequate program[...]

There are contrasting views on merging the two conceptually different execution models. The first is to extend existing conventional multiprocessors to provide dataflow capability (hybrid organization). The idea here is to avoid a radical departure from existing programming methodology and architecture in favor of a smoother transition that provides incremental improvement as well as software compatibility with conventional machines. The second approach is to incorporate control-flow sequencing into existing dataflow architectures (pure- or macro-dataflow organization). [...] allows interleaving up to k independent threads in the pipeline, effectively masking long and unpredictable latency due to remote loads.

The second technique for implementing sequential scheduling uses the macroactor concept. This scheme groups the nodes in a dataflow graph into macroactors. The nodes within a macroactor are executed sequentially; the macroactors themselves are scheduled according to the dataflow model. The EM-4 dataflow multiprocessor implements macroactors based on the strongly connected arc model (Table 1, reference 4). This model categorizes the arcs of a dataflow graph as normal or strongly connected. A subgraph whose nodes are connected by strongly connected arcs is called a strongly connected block (SCB). An SCB is enabled (fired) when all its input tokens are available.
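The SCB firing rule - dataflow scheduling between blocks, sequential execution within a block - can be sketched as follows. This is an illustrative software model only; the class, the names, and the token encoding are invented for the example and do not describe EM-4's hardware.

```python
# Sketch of macroactor (SCB) scheduling: dataflow firing between blocks,
# sequential control flow inside a block. The class and names are invented
# for illustration; this is not the EM-4 hardware design.

class SCB:
    def __init__(self, name, n_inputs, nodes):
        self.name = name
        self.n_inputs = n_inputs   # tokens required before the block can fire
        self.tokens = []           # input tokens that have arrived so far
        self.nodes = nodes         # instructions run sequentially once fired

    def receive(self, token):
        """Deliver one input token; report whether the SCB is now enabled."""
        self.tokens.append(token)
        return self.enabled()

    def enabled(self):
        # An SCB is enabled (fired) only when all its input tokens are available.
        return len(self.tokens) >= self.n_inputs

    def fire(self):
        """Execute every node in the block, in order, without interruption."""
        assert self.enabled()
        self.tokens = self.tokens[self.n_inputs:]
        return [f"{self.name}.{node}" for node in self.nodes]

block = SCB("scb0", n_inputs=2, nodes=["load", "add", "store"])
assert not block.receive("t1")      # one token: not yet enabled
assert block.receive("t2")          # second token arrives: enabled
print(block.fire())                 # → ['scb0.load', 'scb0.add', 'scb0.store']
```

The point of the sketch is the two-level discipline: token matching happens once per block rather than once per instruction.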

August 1994 33
Figure 10. Linked-list representation of a dataflow graph using the repeat unit (d and r represent the repeat offsets).

Once an SCB fires, all the nodes in the block are executed exclusively by means of the advanced control pipeline.

Epsilon-2 uses a unique architectural feature - the repeat unit - to implement sequential scheduling (Table 1, reference 3). The unit generates repeat tokens, which efficiently implement data fanouts in dataflow graphs and significantly reduce the overhead of copying tokens. The unit represents a thread of computation as a linked list and uses registers to buffer the results between instructions (see Figure 10). To generate a repeat token, it adds the repeat offset encoded in the instruction word to the current token's instruction pointer. Thus, the repeat unit can prioritize instruction execution within a grain.
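The repeat-token walk over a linked-list fanout (as in Figure 10) can be sketched as follows. The instruction encoding and names here are assumptions made for illustration, not Epsilon-2's actual formats.

```python
# Sketch of Epsilon-2-style repeat tokens: a fanout is encoded as a linked
# list, where each instruction carries a repeat offset pointing at the next
# consumer. Encodings and names here are assumptions for illustration.

# instruction memory: address -> (opcode, repeat offset); offset 0 ends the list
program = {
    10: ("consume_a", 3),   # after delivery, repeat token goes to 10 + 3 = 13
    13: ("consume_b", 2),   # next consumer at 13 + 2 = 15
    15: ("consume_c", 0),   # end of the fanout chain
}

def fanout(ip, value):
    """Deliver one result value to every consumer on the repeat chain."""
    deliveries = []
    while True:
        opcode, r = program[ip]
        deliveries.append((ip, opcode, value))  # one token per hop, no per-arc copies
        if r == 0:
            return deliveries
        ip += r                                 # repeat unit: IP + repeat offset

print(fanout(10, 42))
```

The datum is carried once along the chain instead of being copied into a separate result token for every successor arc.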
Register use. Inefficient communication of tokens among nodes was a major problem in past dataflow architectures. The execution model requires combining result values with target addresses to form result tokens for communication to successor nodes. Sequential scheduling avoids this overhead: whenever locality can be exploited, registers can be used to temporarily buffer results.

The general method of incorporating registers in the dataflow execution pipeline is illustrated in Figure 11. A set of registers is associated with each computational grain, and the instructions in the grain are allowed to refer to these registers as operands. For example, Monsoon employs three temporary registers (called T registers) with each computational thread. The T registers, combined with a continuation and a value, are called computation descriptors (CDs). A CD completely describes the state of a computational thread. Since the number of active CDs can be large, only threads occupying the execution pipeline are associated with T registers. As long as a thread does not die, its instructions can freely use its registers. Once a thread dies, its registers may be committed to a new thread entering the pipeline; therefore, the register values are not necessarily preserved across grain boundaries. All current architectures use registers but, unlike Monsoon, do not fix the number of registers associated with each computational grain. Therefore, proper use of resources requires a compile-time analysis.

Figure 11. Use of a register file.

Multithreading

Architectures based on the dataflow model offer the advantage of instruction-level context switching. Since each datum carries context-identifying information in a continuation or tag, context switching can occur on a per-instruction basis. Thus, these architectures tolerate long, unpredictable latency resulting from split transactions. (This refers to message-based communication due to remote loads. A split transaction involves a read request from processor A to processor B containing the address of the location to be read and a return address, followed by a response from processor B to processor A containing the requested value.) Combining instruction-level context switching with sequential scheduling leads to another perspective on dataflow architectures - multithreading. In the context of multithreading, a thread is a sequence of statically ordered instructions where once the first instruction is executed, the remaining instructions execute without interruption. A thread defines a basic unit of work consistent with the dataflow model, and current dataflow projects are adopting multithreading as a viable method for combining the features of the dataflow and von Neumann execution models.

Supporting multiple threads. Basically, hybrid dataflow architectures can be viewed as von Neumann machines extended to support fine-grained interleaving of multiple threads. As an illustration of how this is accomplished, consider P-RISC, which is strongly influenced by Iannucci's dataflow/von Neumann hybrid architecture.7 P-RISC is a RISC-based architecture in the sense that, except for load/store instructions, instructions are frame-to-frame (that is, register-to-register) operations that operate within the PE (Figure 6). The local memory contains instructions and frames. Notice that P-RISC executes three-address instructions on data stored in the frames; hence, "tokens" carry only continuations. The instruction-fetch and operand-fetch units fetch appropriate instructions and operands, respectively, pointed to by the continuation. The functional unit performs general RISC operations, and the operand store is responsible for storing results in the appropriate slots of the frame.
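Since a P-RISC "token" is just a continuation <FP.IP>, one pipeline step can be sketched as below. The frame layout and the three-address instruction format are invented for illustration and are not the real P-RISC encoding.

```python
# Sketch of a P-RISC-style step: a "token" is only a continuation (FP, IP);
# operands live in the activation frame, not in the token. The instruction
# format and frame layout here are assumptions for illustration.

frames = {0: {"x": 3, "y": 4, "z": None}}    # FP 0 -> activation frame slots

# three-address instruction: (dst, op, src1, src2)
code = {100: ("z", "add", "x", "y")}

def step(continuation):
    fp, ip = continuation
    dst, op, a, b = code[ip]                 # instruction fetch via IP
    frame = frames[fp]                       # operand fetch via FP
    if op == "add":
        frame[dst] = frame[a] + frame[b]     # operand store writes the frame slot
    return (fp, ip + 1)                      # next continuation: <FP.IP + 1>

next_cont = step((0, 100))
print(next_cont, frames[0]["z"])             # → (0, 101) 7
```

Note that nothing but the two small continuation fields flows between pipeline stages; all data stays in the frame.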

Figure 12. Application of FORK and JOIN constructs.

Figure 13. Organization of a *T processor node.
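The FORK and JOIN behavior that Figure 12 illustrates can be modeled as a small sketch. The token queue, the frame slot used for the rendezvous, and the two-way counter are invented details; this captures the semantics, not the P-RISC pipeline.

```python
# Sketch of FORK/JOIN continuation semantics. FORK label = a JUMP plus a
# fall-through; JOIN = a two-way rendezvous at a frame slot. A software
# model with invented names, not the P-RISC hardware encoding.

from collections import deque

token_queue = deque()
frame = {"join_slot": 0}     # rendezvous counter used by JOIN

def fork(fp, ip, label):
    token_queue.append((fp, ip + 1))    # current thread falls through
    token_queue.append((fp, label))     # new thread starts at <FP.label>

def join(slot, fp, ip):
    frame[slot] += 1
    if frame[slot] < 2:                 # first arrival: this thread dies
        return None
    frame[slot] = 0                     # second arrival: continue past the JOIN
    return (fp, ip + 1)

fork(0, 100, 200)                       # spawn a second thread at label 200
t1, t2 = token_queue.popleft(), token_queue.popleft()
assert join("join_slot", *t1) is None   # thread 1 reaches the JOIN and dies
print(join("join_slot", *t2))           # thread 2 arrives: execution continues
```

Both operations run inside an ordinary instruction stream, which is why the text stresses that they are not operating-system calls.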

For the most part, P-RISC executes instructions in the same manner as a conventional RISC. An arithmetic/logic instruction generates a continuation that is simply <FP.IP + 1>. However, unlike architectures based on the pure-dataflow organization, IP represents a program counter in the conventional sense and is incremented in the instruction-fetch stage. For a JUMP x instruction, the continuation is simply <FP.x>. To provide fine-grained dataflow capability, the instruction set is extended with two special instructions - FORK and JOIN - used to spawn and synchronize independent threads. These are simple operations executed within the normal processor pipeline, not operating system calls. A FORK instruction is a combination of a JUMP and a fall-through to the next instruction. Executing a FORK label has two effects. First, the current thread continues to the next instruction by generating a continuation of the form <FP.IP + 1>. Second, a new thread is created with the same continuation as the current continuation except that the "IP" is replaced by "label" (that is, <FP.label>). JOIN, on the other hand, is an explicit synchronization primitive that provides the limited direct-matching capability used in other dataflow machines. FORK and JOIN operations are illustrated in Figure 12.

In addition to FORK and JOIN, a special START message initiates new threads and implements interprocessor communication. The message of the form

<value, FP.IP, d>

writes the value in the location FP + d and initiates a thread described by FP.IP. START messages are generated by LOAD and STORE instructions used to implement I-structure reads and procedure calls, respectively. For example, the general form of a synchronizing memory read, such as I-structure, is LOAD x, d, where an address a in frame location FP + x is used to load data onto frame location FP + d. Executing a LOAD instruction generates an outgoing message (via the load/store unit) of the form

<I-READ, a, FP.IP, d>

where a represents the location of the value to be read and d is the offset relative to FP. This causes the current thread to die. Therefore, a new continuation is extracted from the token queue, and the new thread will be initiated. On its return trip from the I-structure memory, an incoming START message (via the start unit) writes the value at location FP + d and continues execution of the thread. A procedure is invoked in a similar fashion when the caller writes the arguments into the callee's activation frame and initiates the threads. This is achieved by the instruction

START dv, dFP.IP, dd

which reads a value v from FP + dv, a continuation <FP.IP> from FP + dFP.IP, and an offset d from FP + dd. Executing this instruction sends a START message to an appropriate processing element and initiates a thread.

P-RISC supports multithreading in one of two ways. In the first method, as long as a thread does not die due to LOADs or JOINs, it is executed using the von Neumann scheduling IP + 1. When a thread dies, a context switch is performed by extracting a token from the token queue. The second method is to extract a token from the token queue every pipeline cycle.

*T, a successor to P-RISC, provides similar extensions to the conventional instruction set (Table 1, reference 6). However, to improve overall performance, the thread-execution and the message-handling responsibilities are distributed among three asynchronous processors. The general organization of *T is shown in Figure 13. A *T node consists of the data processor, the remote-memory request coprocessor, and the synchronization coprocessor, which all share a local memory. The data processor executes threads by extracting continuations from the token queue using the NEXT instruction. It is optimized to provide excellent single-thread performance. The remote-memory request coprocessor handles incoming remote LOAD/STORE requests. For example, a LOAD request is processed by a message handler, which (1) accesses the local memory and (2) sends a START message as a direct response without disturbing the data processor. On the other hand, the synchronization coprocessor handles returning LOAD responses and JOIN operations. Its responsibility is to (1) continually queue messages from the network interface; (2) complete the unfinished remote LOADs by placing message values in destination locations and, if necessary, performing JOIN operations; and (3) place continuations in the token queue for later pickup and execution by the data processor.

In contrast to the two hybrid projects discussed thus far, TAM provides a conceptually different implementation of the dataflow execution model and multithreading (Table 1, reference 7).
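The split-transaction path through a *T-style node - remote LOAD request, message-handler response, JOIN-side completion - can be sketched as below. The message formats and handler names are invented; this models the division of labor among the three processors, not the real *T microarchitecture.

```python
# Sketch of a split transaction in a *T-style node: the data processor issues
# a remote LOAD and moves on; coprocessors handle the request and the reply.
# Message formats and names are assumptions for illustration.

from collections import deque

memory = {0x40: 99}          # "remote" memory on node B
frame = {"d": None}          # activation frame on node A
token_queue = deque()        # continuations ready for node A's NEXT instruction

def remote_load(addr, ret_cont, dest):
    """Data processor side: emit the request; the issuing thread then dies."""
    return ("I-READ", addr, ret_cont, dest)          # outgoing request message

def rmem_handler(msg):
    """Node B's remote-memory request coprocessor: reply directly,
    without disturbing node B's data processor."""
    _, addr, ret_cont, dest = msg
    return ("START", memory[addr], ret_cont, dest)   # direct response message

def sync_handler(msg):
    """Node A's synchronization coprocessor: finish the LOAD and
    reschedule the waiting thread."""
    _, value, ret_cont, dest = msg
    frame[dest] = value                              # place value in destination slot
    token_queue.append(ret_cont)                     # continuation queued for pickup

req = remote_load(0x40, ret_cont=("FP", 12), dest="d")
sync_handler(rmem_handler(req))
print(frame["d"], token_queue.popleft())             # → 99 ('FP', 12)
```

While the request is in flight, the data processor is free to pull other continuations from the token queue, which is exactly how the latency is masked.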

Figure 14. A Threaded Abstract Machine (TAM) activation.
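An activation of the kind Figure 14 depicts can be modeled roughly as below. The slot names and the little scheduling loop are invented, and TAM realizes all of this through compilation rather than through a runtime object like this one.

```python
# Rough model of a TAM activation frame: locals, synchronization counters,
# and a continuation vector of enabled threads. Names and structure are
# assumptions; TAM implements this via compilation, not a runtime class.

class Activation:
    def __init__(self):
        self.locals = {}
        self.entry_count = {}    # thread label -> SYNCs still outstanding
        self.cv = []             # continuation vector: enabled thread labels

    def fork(self, label):
        self.cv.append(label)    # FORK: schedule an additional thread

    def sync(self, label):
        # SYNC (JOIN): decrement the entry count; enable the thread at zero.
        self.entry_count[label] -= 1
        if self.entry_count[label] == 0:
            self.cv.append(label)

    def run(self):
        order = []
        while self.cv:                    # threads run from the continuation
            order.append(self.cv.pop())   # vector, implemented as a stack
        return order

act = Activation()
act.entry_count["t3"] = 2
act.fork("t1")
act.sync("t3")               # first arrival: t3 stays disabled
act.sync("t3")               # second arrival enables t3
print(act.run())             # → ['t3', 't1']
```

Popping the continuation vector as a stack mirrors the description later in the text of how TAM's compiler-managed scheduling is organized.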


In TAM, the execution model for fine-grain interleaving of multiple threads is supported by an appropriate compilation strategy and program representation, not by elaborate hardware. In other words, rather than viewing the execution model for fine-grain parallelism as a property of the machine, all synchronization, scheduling, and storage management is explicit and under compiler control.

Figure 14 shows an example of a TAM activation. Whenever a code-block is invoked, an activation frame is allocated. The frame provides storage for local variables, synchronization counters, and a continuation vector that contains addresses of enabled threads within the called code-block. Thus, when a frame is scheduled, threads are executed from its continuation vector, and the last thread schedules the next frame.

TAM supports the usual FORK operations that cause additional threads to be scheduled for execution. A thread can be synchronized using SYNC (same as JOIN) operations that decrement the entry count for the thread. A conditional flow of execution is supported by a SWITCH operation that forks one of two threads based on a Boolean input value. A STOP (same as NEXT) operation terminates the current thread and causes initiation of another thread. TAM also supports interframe messages, which arise in passing arguments to an activation, returning results, or split-phase transactions, by associating a set of inlets with each code-block. Inlets are basically message handlers that provide an external interface.

As can be seen, TAM's support for multithreading is similar to that of the other hybrid machines discussed. The major difference is that thread scheduling in P-RISC and *T is local and implicit through the token queue. In TAM, thread scheduling is explicit and under compiler control.

In contrast to P-RISC, *T, and TAM, Monsoon's multithreading capability is based on the pure-dataflow model in the sense that tokens not only schedule instructions but also carry data. Monsoon incorporates sequential scheduling, but with a shift in viewpoint about how the architecture works. It uses presence bits to synchronize predecessor/successor instructions through rendezvous points in the activation frame. To implement the recirculate scheduling paradigm, instructions that depend on such scheduling are annotated with a special mark to indicate that the successor instruction is IP + 1. This allows each instruction within a computation thread to enter the execution pipeline every k cycles, where k is the number of stages in the pipe. The actual thread interleaving is accomplished by extracting a token from the token queue and inserting it into the execution pipeline every clock cycle. Thus, up to k active threads can be interleaved through the execution pipeline.

New threads and synchronizing threads are introduced by primitives similar to those used in hybrid machines - FORK and JOIN. For example, a FORK label1 generates two continuations, <FP.label1> and <FP.IP + 1>. An implicit JOIN instruction of the form

label2 [FP + offset]: instruction

indicates the threads' rendezvous slot is at frame location [FP + offset]; therefore, the instruction will not execute until both continuations arrive. FORK and JOIN operations are illustrated in Figure 15.

Figure 15. Application of FORK and JOIN constructs and equivalent dataflow actor in Monsoon.

Although FORK and JOIN can be viewed as thread-spawning and thread-synchronizing primitives, these operations can also be recognized as instructions in the dataflow execution model. For example, a FORK instruction is similar to but less general than a COPY operation used in the pure-dataflow model. JOIN operations are implemented through the direct matching scheme. For example, consider the dataflow actor shown in Figure 15, which receives two input tokens, evaluates the sum, and generates two output tokens. The operation of this actor can be realized by a combination of instructions of the form

label1 [FP + offset]: ADD vl, vr || FORK label2,

where || represents the combination of an ADD and a FORK. In a pure-dataflow organization (including Monsoon), the execution pipeline handles token matching, arithmetic operation, and token forming. Therefore, instructions that appear to be distinctively multithreaded in fact can be viewed as part of the more traditional dataflow operations.

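The direct matching behind a Monsoon-style JOIN - a presence bit at a rendezvous slot, with the second arrival firing the actor - can be sketched as below. The encoding is invented purely to illustrate the ADD || FORK actor just described.

```python
# Sketch of direct matching at a rendezvous slot, Monsoon-style: the first
# token to arrive parks its value under a presence bit; the second finds a
# partner and fires the actor (e.g. ADD || FORK). Encoding is illustrative.

frame = {}   # rendezvous slot -> waiting value (presence = key is present)

def arrive(slot, value):
    """Return None if this token must wait, else the matched pair."""
    if slot not in frame:
        frame[slot] = value          # set presence bit, store the value
        return None
    partner = frame.pop(slot)        # match found: clear presence bit
    return (partner, value)

def add_fork(slot, value, dests=("label1", "label2")):
    pair = arrive(slot, value)
    if pair is None:
        return []                    # actor not enabled yet
    total = pair[0] + pair[1]        # ADD
    return [(d, total) for d in dests]   # FORK: two output tokens, one sum

assert add_fork("r0", 3) == []       # left operand waits in the frame
print(add_fork("r0", 4))             # → [('label1', 7), ('label2', 7)]
```

Seen this way, the "multithreaded" FORK/JOIN pair is just the familiar match-execute-form-token cycle of a tagged-token pipeline.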
In EM-4, an SCB is thought of as a sequential, uninterruptable thread of control. Therefore, the execution of multiple threads can be implemented by passing tokens between SCBs. This is achieved by having thread library functions that allow parallelism to be expressed explicitly.8 For example, a master thread spawns and terminates a slave thread using the functions FORK and NULL, respectively. The general format for the FORK function is given as

FORK(PE, func, n, arg1, . . . , argn),

where PE specifies the processing element where the thread was created, and n represents the number of arguments in the thread. This function causes the following operations: (1) allocate a new operand segment on the processing element and link it to the template segment specified by the token's address portion, (2) write the arguments into the operand segment, (3) send the NULL routine's address as a return address for the newly created thread, and (4) continue the execution of the current thread. Once execution completes, the new thread terminates by executing a NULL function.

Using the FORK and NULL functions, a master thread can distribute the work over a number of slave threads to compute partial results. The final result is collected when each slave thread sends its partial result to the master thread. Multithreading is achieved by switching among threads whenever a remote LOAD operation occurs. However, since each thread is an SCB (that is, an uninterruptable sequence of instructions), interleaving of threads on each cycle is not allowed.

Partitioning programs to threads. An important issue in multithreading is the partitioning of programs to multiple sequential threads. A thread defines the basic unit of work for scheduling - and thus a computation's granularity. Since each thread has an associated cost, it directly affects the amount of overhead required for synchronization and context switching. Therefore, the main goal in partitioning is maximizing parallelism while minimizing the overhead required to support the threads.

A number of proposals based on the control-flow model use multithreading as a means of tolerating high-latency memory operations, but thread definitions vary according to language characteristics and context-switching criteria. For example, the multiple-context schemes used in Weber and Gupta9 obtain threads by subdividing a parallel loop into a number of sequential processes, and context switching occurs when a main-memory access is required (due to a cache miss). As a consequence, the thread granularity in these models tends to be coarse, thereby limiting the amount of parallelism that can be exposed. On the other hand, non-strict functional languages for dataflow architectures, such as Id, complicate partitioning due to feedback dependencies that may only be resolved dynamically. These situations arise because of the possibility of functions or arbitrary expressions returning results before all operands are computed (for example, I-structure semantics). Therefore, a more restrictive constraint is placed on partitioning programs written in non-strict languages.

Iannucci7 has outlined several important issues to consider in partitioning programs: First, a partitioning method should maximize the exploitable parallelism. In other words, the attempt to aggregate instructions does not imply restricting or limiting parallelism. Instructions that can be grouped into a thread should be the parts of a code where little or no exploitable parallelism exists. Second, the longer the thread length, the longer the interval between context switches. This also increases the locality for better utilization of the processor's resources. Third, any arc (that is, data dependency) crossing thread boundaries implies dynamic synchronization. Since synchronization operations introduce hardware overhead and/or increase processor cycles, they should be minimized. Finally, the self-scheduling paradigm of program graphs implies that execution ordering cannot be independent of program inputs. In other words, this dynamic ordering behavior in the code must be understood and considered as a constraint on partitioning.

A number of thread partitioning algorithms convert dataflow graph representations of programs into threads based on the criteria outlined above. Schauser et al.6 proposed a partitioning scheme based on dual graphs. A dual graph is a directed graph with three types of arcs: data, control, and dependence. A data arc represents the data dependency between producer and consumer nodes. A control arc represents the scheduling order between two nodes, and a dependence arc specifies long-latency operation from message handlers (that is, inlets and outlets) sending/receiving messages across code-block boundaries.

The actual partitioning uses only the control and dependence edges. First, the nodes are grouped as dependence sets that guarantee a safe partition with no cyclic dependencies. A safe partition has the following characteristics6: (1) no output of the partition needs to be produced before all inputs are available; (2) when the inputs to the partition are available, all the nodes in the partition are executed; and (3) no arc connects a node in the partition to an input node of the same partition. The partitions are merged into larger partitions based on rules that generate safe partitions. Once the general partitioning has been completed, a number of optimizations are performed in an attempt to reduce the synchronization cost. Finally, the output of the partitioner is a set of threads wherein the nodes in each thread are executed sequentially and the synchronization requirement is determined statically and only occurs at the beginning of a thread.

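The safe-partition conditions can be turned into a simple check. The graph encoding below is invented, and this only verifies condition (3) for a given partition; it does not reproduce Schauser et al.'s full dependence-set algorithm.

```python
# Sketch: check condition (3) of a safe partition - no arc from a node in
# the partition back to an input node of the same partition. Graph encoding
# is invented; this is a checker, not the full dependence-set algorithm.

def inputs_of(partition, arcs):
    """Nodes of the partition that receive an arc from outside it."""
    return {dst for src, dst in arcs if dst in partition and src not in partition}

def violates_condition3(partition, arcs):
    ins = inputs_of(partition, arcs)
    return any(src in partition and dst in ins for src, dst in arcs)

arcs = [("a", "b"), ("b", "c"), ("x", "a")]       # x feeds the partition from outside
assert not violates_condition3({"a", "b", "c"}, arcs)

arcs_bad = arcs + [("c", "a")]                    # internal arc back to input node a
print(violates_condition3({"a", "b", "c"}, arcs_bad))   # → True
```

An internal arc back to an input node would mean the partition could not simply wait for all inputs and then run to completion, which is exactly what the safety conditions rule out.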
Future prospects

To predict whether dataflow computing will have a legitimate place in a world dominated by massively parallel von Neumann machines, consider the suitability of current commercial parallel machines for fine-grain parallel programming. Studies on TAM show that it is possible to implement the dataflow execution model on conventional architectures and obtain reasonable performance (Table 1, reference 7). This has been demonstrated by compiling Id90 programs to TAM and translating them, first to TL0, the TAM assembly language, and finally to native machine code for a variety of platforms, mainly the CM-5.10 However, TAM's translation of dataflow graph program representation to control-flow execution also shows a basic mismatch between the requirements for fine-grain parallelism and the underlying architecture. Fine-grain parallel programming models dynamically create parallel threads of control that execute on data structures distributed among processors. Therefore, efficient support for synchronization, communication, and dynamic scheduling becomes crucial to overall performance.

One major problem of supporting fine-grain parallelism on commercial parallel machines is communication overhead. In recent years, the communication performance of commercial parallel machines has improved significantly.11 One of the first commercial message-passing multicomputers, Intel's iPSC, incurs a communication overhead of several milliseconds, but current parallel machines, such as the KSR1, Paragon, and CM-5, incur a one-way communication overhead of just 25 to 86 microseconds.11 Despite this improvement, the communication overhead in these machines is too high to efficiently support dataflow execution. This can be attributed to the lack of integration of the network interface as part of hardware functionality. That is, individual processors offer high execution performance on sequential streams of computations, but communication and synchronization among processors have substantial overhead.

In comparison, processors in current dataflow machines communicate by executing message handlers that directly move data in and out of preallocated storage. Message handlers are short threads that handle messages entirely in user mode, with no transfer of control to the operating system. For example, in Monsoon, message handlers are supported by hardware: A sender node can format and send a message in exactly one cycle, and a receiver node can process an incoming message by storing its value and performing a JOIN operation in one or two cycles. In hybrid-class architectures, message handlers are implemented either by specialized hardware, as in P-RISC and *T, or through software (for example, interrupt or polling) as part of the processor pipeline execution. The most recent hybrid-class prototype under development at MIT and Motorola, inspired by the conceptual *T design at MIT, uses processor-integrated networking that directly integrates communication into the MC88110 superscalar RISC microprocessor.11 Processor-integrated networking efficiently implements the message-passing mechanism. It provides a low communication overhead of 100 nanoseconds, and overlapping computation with communication - for example, by off-loading message handling to coprocessors, as in *T - may reduce this even more.

With the advent of multithreading, future dataflow machines will no doubt adopt a hybrid flavor, with emphasis on improving the execution efficiency of multiple threads of computations. Experiments on TAM have already shown how implementation of the dataflow execution model can be approached as a compilation issue. These studies also indicate that considerable improvement is possible through hardware support, which was the original goal of dataflow computer designers - to build a specialized architecture that allows direct mapping of dataflow graph programs onto the hardware. Therefore, the next generation of dataflow machines will rely less on very specialized processors and emphasize incorporating general mechanisms to support fine-grain parallelism into existing sequential processors. The major challenge, however, will be to strike a balance between hardware complexity and performance.

Despite recent advances toward developing effective architectures that support fine-grain parallelism and tolerate latency, some challenges remain. One of these challenges is dynamic scheduling. The success of multithreading depends on rapid support of context switching. This is possible only if threads are resident at fast but small memories (that is, at the top level of the storage hierarchy), which limits the number of active threads and thus the amount of latency that can be tolerated. All the architectures described - Monsoon, Epsilon-2, EM-4, P-RISC, and *T - rely on a simple dataflow scheduling strategy based on hardware token queues. The generality of the dataflow scheduling makes it difficult to execute a logically related set of threads through the processor pipeline, thereby removing any opportunity to utilize registers across thread boundaries. TAM alleviates this problem to some extent by relegating the responsibilities of scheduling and storage management to the compiler. For example, continuation vectors that hold active threads are implemented as stacks, and all frames holding enabled threads are linked in a ready queue. However, both hardware and software methods discussed are based on a naive, local scheduling discipline with no global strategy. Therefore, appropriate means of directing scheduling based on some global-level understanding of program execution will be crucial to the success of future dataflow architectures.12

Another related problem is the allocation of frames and data structures. Whenever a function is invoked, a frame must be allocated; therefore, how frames are allocated among processors is a major issue. Two extreme examples are a random allocation on any processor or local allocation on the processor invoking the function. The proper selection of an allocation strategy will greatly affect the balance of the computational load. The distribution of data structures among the processors is closely linked to the allocation of frames. For example, an allocation policy that distributes a large data structure among the processors may experience a large number of remote messages, which will require more threads to mask the delays. On the other hand, allocating data structures locally may cause one processor to serve a large number of accesses, limiting the entire system's performance. Therefore, the computation and data must be studied in unison to develop an effective allocation methodology.

The eventual success of dataflow computers will depend on their programmability. Traditionally, they've been programmed in languages such as Id and SISAL (Streams and Iterations in a Single Assignment Language) that use functional semantics. These languages reveal high levels of concurrency and translate onto dataflow machines and conventional parallel machines via TAM. However, because their syntax and semantics differ from imperative counterparts such as Fortran and C, they have been slow to gain acceptance in the programming community. An alternative is to explore the use of established imperative languages to program dataflow machines.

However, the difficulty will be analyzing data dependencies and extracting parallelism from source code that contains side effects. Therefore, more research is still needed to develop compilers for conventional languages that can produce parallel code comparable to that of parallel functional languages.

References

1. Arvind and D.E. Culler, "Dataflow Architectures," Ann. Review in Computer Science, Vol. 1, 1986, pp. 225-253.

2. J.-L. Gaudiot and L. Bic, Advanced Topics in Dataflow Computing, Prentice Hall, Englewood Cliffs, N.J., 1991.

3. Arvind, D.E. Culler, and K. Ekanadham, "The Price of Fine-Grain Asynchronous Parallelism: An Analysis of Dataflow Methods," Proc. CONPAR 88, Sept. 1988, pp. 541-555.

4. D.E. Culler and Arvind, "Resource Requirements of Dataflow Programs," Proc. 15th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 861 (microfiche only), 1988, pp. 141-150.

5. B. Lee, A.R. Hurson, and B. Shirazi, "A Hybrid Scheme for Processing Data Structures in a Dataflow Environment," IEEE Trans. Parallel and Distributed Systems, Vol. 3, No. 1, Jan. 1992, pp. 83-96.

6. K.E. Schauser et al., "Compiler-Controlled Multithreading for Lenient Parallel Languages," Proc. Fifth ACM Conf. Functional Programming Languages and Computer Architecture, ACM, New York, 1991, pp. 50-72.

7. R.A. Iannucci, "Towards a Dataflow/von Neumann Hybrid Architecture," Proc. 15th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 861 (microfiche only), 1988, pp. 131-140.

8. M. Sato et al., "Thread-Based Programming for EM-4 Hybrid Dataflow Machine," Proc. 19th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 2941 (microfiche only), 1992, pp. 146-155.

9. W.D. Weber and A. Gupta, "Exploring the Benefits of Multiple Hardware Contexts in a Multiprocessor Architecture: Preliminary Results," Proc. 16th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 1948 (microfiche only), 1989, pp. 273-280.

10. E. Spertus et al., "Evaluation of Mechanisms for Fine-Grained Parallel Programs in the J-Machine and the CM-5," Proc. 20th Ann. Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, Calif., Order No. 3811 (microfiche only), 1993.

11. G.M. Papadopoulos et al., "*T: Integrated Building Blocks for Parallel Computing," Proc. Supercomputing 93, IEEE CS Press, Los Alamitos, Calif., Order No. 4340, 1993, pp. 624-635.

12. D.E. Culler, K.E. Schauser, and T. von Eicken, "Two Fundamental Limits on Dataflow Multiprocessing," Proc. IFIP Working Group 10.3 (Concurrent Systems) Working Conf. on Architecture and Compilation Techniques for Fine- and Medium-Grain Parallelism, Jan. 1993.

Ben Lee is an assistant professor in the Department of Electrical and Computer Engineering at Oregon State University. His interests include computer architectures, parallel and distributed systems, program allocation, and dataflow computing. Lee received the BEEE degree from the State University of New York at Stony Brook in 1984 and the PhD degree in computer engineering from Pennsylvania State University in 1991. He is a member of ACM and the IEEE Computer Society.

A.R. Hurson is on the computer engineering faculty at Pennsylvania State University. His research for the past 12 years has been directed toward the design and analysis of general as well as special-purpose computers. He has published over 120 technical papers and has served as guest coeditor of special issues of the IEEE Proceedings, the Journal of Parallel and Distributed Computing, and the Journal of Integrated Computer-Aided Engineering. He is the coauthor of Parallel Architectures for Data and Knowledge Base Systems (IEEE CS Press, to be published in October 1994), a member of the IEEE Computer Society Press Editorial Board, and a speaker for the Distinguished Visitors Program.

Readers can contact Lee at Oregon State University, Dept. of Electrical and Computer Engineering, Corvallis, OR 97331-3211. His Internet address is benl@ece.orst.edu.