
International Engineering Journal For Research & Development
E-ISSN No: 2349-0721
Volume 1, Issue 1
www.iejrd.in

AN ANALYSIS OF PARALLEL PROCESSING AT MICROLEVEL
Vina S. Borkar
Dept. of Computer Science and Engineering
St. Vincent Pallotti College of Engineering and Technology, Nagpur, India
vinaborkar@gmail.com

------------------------------------------------------------------------------------------------------------------------
Abstract:-
To achieve high performance, processors rely on two forms of parallelism: instruction-level parallelism
(ILP) and thread-level parallelism (TLP). ILP and TLP are fundamentally identical: they both identify
independent instructions that can execute in parallel and can therefore utilize parallel hardware.
In this paper we begin by examining the issues that program structure imposes on ILP (dependencies,
branch prediction, window size, latency), and then consider the use of thread-level parallelism as an
alternative or addition to instruction-level parallelism.
This paper explores parallel processing on an alternative architecture, simultaneous multithreading
(SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle.
The most compelling reason for running parallel applications on an SMT processor is its ability to use
thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to
share the processor's functional units simultaneously, the processor can use both ILP and TLP to
accommodate variations in parallelism.
Keywords- TLP, ILP, branch prediction, coarse-grain, SMT.
I. Introduction
Instruction-level Parallelism (ILP) is a family of processor and compiler design techniques that speed
up execution by causing individual machine operations, such as memory loads and stores, integer additions and
floating-point multiplications, to execute in parallel [1]. Like circuit-speed improvements, but unlike
traditional multiprocessor parallelism and massively parallel processing, these techniques are largely
transparent to users.
One basic technique for exploiting ILP is pipelining. Pipelining breaks down a processor into multiple
stages and creates a pipeline that instructions pass through. This pipeline functions much like an assembly line.
An instruction enters at one end, passes through the different stages of the pipe, and exits at the other end.
VLIWs and superscalars are examples of processors that derive their benefit from instruction-level parallelism,
and software pipelining and trace scheduling are examples of software techniques that expose the parallelism
that these processors can use.
A superscalar machine is one that can issue multiple independent instructions in the same cycle. A
superpipelined machine issues one instruction per cycle, but the cycle time is set much less than the typical
instruction latency. A VLIW machine [8] is like a superscalar machine, except the parallel instructions must be
explicitly packed by the compiler into very long instruction words. A multithreaded processor aims to increase
processor utilization by sharing resources at a finer granularity than a conventional processor [3]. SMT is a
technique that permits multiple independent threads to issue multiple instructions each cycle to a superscalar
processor's functional units.
This paper is organized as follows. Section 2 discusses how a processor executes code with ILP. Section 3
discusses issues that arise with ILP. Section 4 discusses how ILP support can be used to exploit TLP on a
multithreaded processor. Section 5 describes SMT, how it works, and how it exploits both TLP and ILP.
II. Execution with ILP
A typical ILP processor has the same type of execution hardware as a normal RISC machine. The
difference between a machine with ILP and one without is that there may be more of that hardware, for
example several integer adders instead of just one, and that the control will allow, and possibly arrange,
simultaneous access to whatever execution hardware is present.
The execution hardware of a simplified ILP processor consists of more than one functional unit.
Typically, ILP execution hardware allows multiple-cycle operations to be pipelined, so we may assume that
operations can be initiated every cycle. Instruction-level parallel execution means that multiple operations are
simultaneously in execution, either as a result of having been issued simultaneously or because the time to
execute an operation is greater than the interval between the issuance of successive operations.
A superscalar that has two data paths can fetch two instructions simultaneously from memory. This
means that the processor must also have double the logic to fetch and decode two instructions at the same
time [2]. For example, if in each cycle the longest-latency operation is issued, this hardware could have 10
operations "in flight" at once, which would give it a maximum possible speed-up of a factor of 10 over a
sequential processor with similar execution hardware.
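As a hedged illustration of these ideas (the function and variable names are hypothetical), the following C fragment contrasts a chain of dependent additions, which must issue in successive cycles, with independent additions that a machine with several integer adders could initiate in the same cycle:

    /* Hypothetical illustration: serial vs. parallel issue. */
    int ilp_demo(int x, int y, int z, int w, int u, int v)
    {
        /* Dependent chain: each addition consumes the previous result,
           so the three additions must issue in successive cycles. */
        int a = x + y;
        int b = a + z;   /* must wait for a */
        int c = b + w;   /* must wait for b */

        /* Independent additions: no value flows between them, so a
           machine with three integer adders could initiate all three
           in the same cycle. */
        int p = x + y;
        int q = z + w;
        int r = u + v;

        return c + p + q + r;
    }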
III. Different issues with ILP
To exploit instruction-level parallelism, we must determine which instructions can be executed in parallel. If two
instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls, assuming the
pipeline has sufficient resources. If two instructions are dependent they are not parallel and must be executed in
order, though they may often be partially overlapped. If two instructions are data dependent they cannot execute
simultaneously or be completely overlapped.
A. Dependencies and hazards
Determining how one instruction relates to another is critical to determining how much parallelism is
available to exploit in an instruction stream. If two instructions are not dependent then they can execute
simultaneously, assuming sufficient resources (that is, no structural hazards). Obviously, if one instruction
depends on another, they must execute in order though they may still partially overlap. It is imperative then, to
determine exactly how much and what kind of dependency exists between instructions. The following sections
will describe the different kinds of non-structural dependency that can exist in an instruction stream. There are
three different types of dependencies: data dependencies (also called true dependencies), name dependencies and
control dependencies.
1. Data dependencies
An instruction j can be considered data dependent on instruction i as follows: directly, where instruction i
produces a result that may be used by instruction j or indirectly, where instruction j is data dependent on
instruction k, and k is data dependent on i, and so on. Indirect data dependence means that one instruction is
dependent on another if there exists a chain of dependencies between them. This dependence chain can be as
long as the entire program! If two instructions are data dependent, they cannot execute simultaneously nor be
completely overlapped.
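A minimal C sketch of such a chain (the statement labels i, k and j are hypothetical annotations, not part of the language):

    int chain(int x)
    {
        int a = x * 2;   /* i: produces a                          */
        int b = a + 1;   /* k: directly data dependent on i        */
        int c = b - 3;   /* j: directly dependent on k, and hence  */
                         /*    indirectly data dependent on i      */
        return c;        /* none of these can issue simultaneously */
    }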
A data dependency can be overcome in two ways: maintaining the dependency but avoiding the hazard
or eliminating a dependency by transforming the code. Code scheduling is the primary method used to avoid a
hazard without altering the dependency. Scheduling can be done in hardware or by software; in this paper, in
the interests of brevity, only hardware-based solutions are discussed. Software-based code scheduling is central
to VLIW/EPIC machines and is not covered here.
A data value may flow between instructions through registers or memory locations. When registers are
used, detecting the dependence is reasonably straightforward as register names are encoded in the instruction
stream. Dependencies that flow through memory locations are much more difficult to detect, as the effective
address (EA) of the memory location needs to be computed, and the EA cannot be determined during the
instruction decode (ID) phase.
Compilers can be of great help in detecting and scheduling around these sorts of hazards; hardware can only
resolve these dependencies with severe limitations.
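As a hedged sketch of a memory-carried dependence (the function and parameter names are hypothetical): whether the second statement depends on the first cannot be decided until the effective addresses are known, because the two pointers may refer to the same location:

    /* If a == b, the second statement reads the value the first one
       stored, a true dependence through memory; neither the compiler
       nor the hardware may reorder them without proving the pointers
       never alias. */
    void update(int *a, int *b)
    {
        *a = *a + 1;
        *b = *b * 2;
    }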
2. Name Dependencies
The second type of dependence is a name dependency. A name dependency occurs when two
instructions use the same register or memory location, called a name, but there is no flow of data between them.
There are two types of name dependencies between an instruction i that precedes instruction j: an anti-
dependence occurs when j writes a register or memory location that i reads (the original value must be preserved
until i can use it), and an output dependence occurs when i and j write to the same register or memory location
(in this case instruction order must be preserved). Both anti-dependencies and output dependencies are name
dependencies, as opposed to true data dependencies, since there is no information flow between the two
instructions.
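A minimal C sketch of both cases (names hypothetical); because no data flows between the statements, renaming the later writes to a fresh register removes both dependencies:

    int name_deps(int x, int y)
    {
        int r = x + y;   /* i: reads x                               */
        x = 10;          /* j: writes x after i reads it, an anti-   */
                         /* dependence; i's read must happen first   */
        x = 20;          /* writes the same name as j, an output     */
                         /* dependence; the write order must be kept */
        return r + x;
    }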
3. Data Hazards
A data hazard is created whenever there is a data dependency between instructions and they are close
enough that the overlap introduced by pipelining, or some other reordering of instructions, would change the
order of access to the operands involved. Because of the dependency, we
must preserve program order, that is, the order in which the instructions would execute in a non-pipelined
sequential processor. A requirement of ILP is to maintain the correctness of a program and to reorder or
overlap instructions only where correctness is not at risk.
There are three types of data hazards: read after write (RAW), where j tries to read a source before i writes
it (this is the most common type and corresponds to a true data dependence); write after write (WAW), where j
tries to write an operand before it is written by i (this corresponds to an output dependence); and write after
read (WAR), where j tries to write a destination before i has read it (this corresponds to an anti-dependency).
Self-evidently, the read after read (RAR) case is not a hazard.
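As a hedged sketch (the instruction encoding is a hypothetical simplification with one source and one destination register), hazard detection between an earlier instruction i and a later instruction j reduces to comparing their register fields:

    #include <stdio.h>

    struct instr {
        int dest;   /* register written */
        int src;    /* register read    */
    };

    /* Classify the hazard j (later in program order) creates against i
       (earlier); real hardware checks every source operand this way. */
    const char *hazard(struct instr i, struct instr j)
    {
        if (j.src == i.dest)  return "RAW";   /* j reads what i writes    */
        if (j.dest == i.dest) return "WAW";   /* both write the same name */
        if (j.dest == i.src)  return "WAR";   /* j overwrites i's input   */
        return "none";                        /* RAR or independent       */
    }

    int main(void)
    {
        struct instr i = { 1, 2 };      /* writes r1, reads r2 */
        struct instr j = { 3, 1 };      /* writes r3, reads r1 */
        printf("%s\n", hazard(i, j));   /* prints RAW */
        return 0;
    }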
4. Control Dependencies
A control dependency determines the order of an instruction i with respect to a branch, so that i is
executed in correct program order only if it should be. The first basic block in a program is the only block
without some control dependency. Consider the statements:
    if (p1) S1;
    if (p2) S2;
S1 is control dependent on p1 and S2 is control dependent on p2 but is not dependent on p1. In general there are
two constraints imposed by control dependencies: an instruction that is control dependent on a branch cannot be
moved before the branch and, conversely, an instruction that is not control dependent on a branch must not be
moved after the branch in such a way that its execution would be controlled by the branch.
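A minimal C sketch of the first constraint (names hypothetical): the division below is control dependent on the test and cannot be moved before the branch without risking a fault the original program would never raise:

    int safe_div(int a, int b)
    {
        int q = 0;
        if (b != 0) {
            q = a / b;   /* control dependent on (b != 0); hoisting this
                            division above the branch could trap when
                            b == 0, changing the program's behaviour   */
        }
        return q;
    }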
IV. Using ILP Support to Exploit Thread-Level Parallelism
Increasing performance by using ILP has the great advantage that it is reasonably transparent to the
programmer; however, ILP can be quite limited or hard to exploit in some applications. For example, an online
transaction-processing system has natural parallelism among the multiple queries and updates that are presented
by requests. These queries and updates can be processed mostly in parallel, since they are largely independent of
one another. This higher-level parallelism is called thread-level parallelism because it is logically structured as
separate threads of execution.
A thread is a separate process with its own instructions and data. A thread may represent a process that
is part of a parallel program consisting of multiple processes, or it may represent an independent program on its
own. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to
execute. Unlike instruction-level parallelism, which exploits implicit parallel operations within a loop or
straight-line code segment, thread-level parallelism is explicitly represented by the use of multiple threads of
execution that are inherently parallel. Thread-level parallelism is an important alternative to instruction-level
parallelism primarily because it could be more cost-effective to exploit than instruction-level parallelism.
Thread-level and instruction-level parallelism exploit two different kinds of parallel structure in a program. A
data path designed to exploit higher amounts of ILP will find that functional units are often idle because of
either stalls or dependences in the code.
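As a hedged sketch of this kind of workload (the query handler is a hypothetical stand-in; POSIX threads assumed), each request runs as an explicitly parallel thread with its own program counter, registers and stack:

    #include <pthread.h>
    #include <stdio.h>

    /* Hypothetical stand-in for one independent query or update. */
    static void *handle_query(void *arg)
    {
        int id = *(int *)arg;
        printf("query %d processed\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        int id[4];
        /* The queries are largely independent of one another, so the
           threads expose thread-level parallelism to the processor. */
        for (int i = 0; i < 4; i++) {
            id[i] = i;
            pthread_create(&t[i], NULL, handle_query, &id[i]);
        }
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }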
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping
fashion. To permit this sharing, the processor must duplicate the independent state of each thread. For example,
a separate copy of the register file, a separate PC, and a separate page table are required for each thread. The
memory itself can be shared through the virtual memory mechanisms, which already support
multiprogramming. In addition, the hardware must support the ability to change to a different thread relatively
quickly; in particular, a thread switch should be much more efficient than a process switch, which typically
requires hundreds to thousands of processor cycles.
There are two main approaches to multithreading.
Fine-grained multithreading:
This approach switches between threads on each instruction, causing the execution of multiple threads to be
interleaved. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that
time. To make fine-grained multithreading practical, the CPU must be able to switch threads on every clock
cycle. One key advantage of fine-grained multithreading is that it can hide the throughput losses that arise from
both short and long stalls, since instructions from other threads can be executed when one thread stalls. The
primary disadvantage of fine-grained multithreading is that it slows down the execution of the individual
threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
Sun's UltraSPARC T1 (Niagara) uses fine-grained multithreading.
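A minimal sketch of this selection policy (all structures hypothetical): each cycle, the next thread in round-robin order issues, and stalled threads are skipped:

    #define NTHREADS 4

    int stalled[NTHREADS];     /* 1 if the thread cannot issue now     */
    int last = NTHREADS - 1;   /* thread chosen in the previous cycle  */

    /* Pick the thread that issues this cycle; -1 means every thread is
       stalled and the cycle is lost. */
    int fine_grained_select(void)
    {
        for (int k = 1; k <= NTHREADS; k++) {
            int t = (last + k) % NTHREADS;   /* round-robin order */
            if (!stalled[t]) {
                last = t;
                return t;
            }
        }
        return -1;
    }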
Coarse-grained multithreading
It was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading
switches threads only on costly stalls, such as level 2 cache misses. This change relieves the need to have thread
switching be essentially free and is much less likely to slow the processor down, since instructions from other
threads will only be issued when a thread encounters a costly stall.
Coarse-grained multithreading suffers, however, from a major drawback: It is limited in its ability to
overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs
of coarse-grain multithreading. Because a CPU with coarse-grained multithreading issues instructions from a
single thread, when a stall occurs, the pipeline must be emptied or frozen. The new thread that begins executing
after the stall must fill the pipeline before instructions will be able to complete. Because of this start-up
overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where
pipeline refill is negligible compared to the stall time.
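By contrast, a coarse-grained policy (again a hypothetical sketch; the refill cost is an assumed constant) keeps issuing from one thread and switches only on a costly stall, paying the pipeline start-up penalty described above on every switch:

    #define NTHREADS      4
    #define REFILL_CYCLES 7    /* assumed pipeline drain/refill cost */

    int current = 0;           /* thread that owns the pipeline */

    /* Called when the running thread hits a long-latency event such as
       a level-2 cache miss; returns the start-up penalty paid before
       the new thread's instructions can complete. */
    int coarse_grained_switch(void)
    {
        current = (current + 1) % NTHREADS;   /* hand over the pipeline */
        return REFILL_CYCLES;
    }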
V. Simultaneous Multithreading
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-
issue, dynamically scheduled processor to exploit TLP at the same time it exploits ILP. The key insight that
motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available
than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple
instructions from independent threads can be issued without regard to the dependences among them; the
resolution of the dependences can be handled by the dynamic scheduling capability. Figure 2 conceptually
illustrates the differences in a processor's ability to exploit the resources of a superscalar for the following
processor configurations:
A superscalar with no multithreading support
A superscalar with coarse-grained multithreading
A superscalar with fine-grained multithreading
A superscalar with simultaneous multithreading
In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP, a
topic we discussed in earlier sections. In addition, a major stall, such as an instruction cache miss, can leave the
entire processor idle. In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor. Although this reduces the number of
completely idle clock cycles, within each clock cycle, the ILP limitations still lead to idle cycles. Furthermore,
in a coarse-grained multithreaded processor, since thread switching only occurs when there is a stall and the new
thread has a start-up period, there are likely to be some fully idle cycles remaining.
In the fine-grained case, the interleaving of threads eliminates fully empty slots. Because only one
thread issues instructions in a given clock cycle, however, ILP limitations still lead to a significant number of
idle slots within individual clock cycles. In the SMT case, TLP and ILP are exploited simultaneously, with
multiple threads using the issue slots in a single clock cycle. Ideally, the issue slot usage is limited by
imbalances in the resource needs and resource availability over multiple threads.
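As a hedged sketch of this behaviour (widths and structures are assumed, not taken from a real machine), each cycle the issue logic fills up to the machine's issue width from the ready instructions of all threads, so slots one thread cannot use are taken by another:

    #define NTHREADS    4
    #define ISSUE_WIDTH 4

    /* ready[t] is the number of independent instructions thread t could
       issue this cycle; issued[t] receives the slots it actually gets. */
    int smt_issue(const int ready[NTHREADS], int issued[NTHREADS])
    {
        int slots = ISSUE_WIDTH;
        for (int t = 0; t < NTHREADS; t++) {
            int n = ready[t] < slots ? ready[t] : slots;
            issued[t] = n;     /* thread t takes n of the free slots */
            slots -= n;
        }
        return ISSUE_WIDTH - slots;   /* slots used this cycle */
    }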
Simultaneous multithreading uses the insight that a dynamically scheduled processor already has many
of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading. In
particular, dynamically scheduled superscalars have a large set of virtual registers that can be used to hold the
register sets of independent threads (assuming separate renaming tables are kept for each thread). Because
register renaming provides unique register identifiers, instructions from multiple threads can be mixed in the
data path without confusing sources and destinations across the threads.
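A minimal sketch of this point (all sizes and names assumed): each thread has its own rename table mapping architectural to physical registers, so instructions from different threads can share one physical register file without interfering:

    #define NTHREADS 4
    #define NARCH    32     /* architectural registers per thread */
    #define NPHYS    256    /* shared physical register file      */

    int rename_table[NTHREADS][NARCH];   /* per-thread arch -> phys map */
    int next_free = 0;                   /* stand-in for a free list    */

    /* Give thread t's write to architectural register r a fresh
       physical register; real hardware recycles freed registers. */
    int rename_dest(int t, int r)
    {
        int p = next_free++ % NPHYS;
        rename_table[t][r] = p;
        return p;
    }

    /* A source operand reads the thread-private mapping, so two threads
       naming the same architectural register never collide. */
    int rename_src(int t, int r)
    {
        return rename_table[t][r];
    }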
Simultaneous multithreading has a dual effect on branch prediction, much as it has on caches.
Simultaneous multithreading is much less sensitive to the quality of the branch prediction than a single-threaded
processor. Still, better branch prediction is beneficial for both architectures. In the case of register renaming,
the larger SMT register file requires a longer access time; to avoid increasing the processor cycle time, the SMT
pipeline was extended by two stages to allow two-cycle register reads and two-cycle writes [8]. Threads on an SMT
processor share the same cache hierarchy, so their working sets may introduce inter-thread conflict misses.
When increasing the number of threads from 1 to 8, the cache miss component of average memory access time
increases by less than 1.5 cycles on average, indicating the small effect of inter-thread conflict misses.
Out-of-order execution, write buffering, and the use of multiple threads allow SMT to hide these small increases
in memory latency, and large speedups can be attained.

Figure 2: How four different approaches use the issue slots of a superscalar processor.
The horizontal dimension represents the instruction issue capability in each clock cycle. The vertical
dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding issue
slot is unused in that clock cycle. The shades of grey and black correspond to four different threads in the
multithreaded processors. Black is also used to indicate the occupied issue slots in the case of the superscalar
without multithreading support.
VI. Result Analysis
Several other architectures have been proposed that exhibit simultaneous multithreading in
some form. Tullsen et al. [6] demonstrated the potential for simultaneous multithreading, but did not simulate a
complete architecture, nor did that paper present a specific solution to register file access or instruction
scheduling. Yamamoto et al. [10] present an analytical model of multithreaded superscalar performance,
backed up by simulation. Their study models perfect branching, perfect caches and a homogeneous workload.
Hirata et al. [9] present an architecture for a multithreaded superscalar processor and simulate its performance
on a parallel ray-tracing application. They do not simulate caches or TLBs and their architecture has no branch
prediction mechanism.
Yamamoto and Nemirovsky [11] simulate an SMT architecture with separate instruction queues and up
to four threads. In addition to these, Beckmann and Polychronopoulos [14], Gunther [12], Li and Chu [13], and
Govindarajan et al. [15] all discuss architectures that feature simultaneous multithreading, none of which can
issue more than one instruction per cycle per thread. The M-Machine [16] and the Multiscalar project [17]
combine multiple issue with multithreading, but assign work to processors at a coarser level than individual
instructions.

VII. Conclusion
Simultaneous Multithreading is an extension of hardware multithreading that increases parallelism in
all forms. SMT combines the instruction level parallelism experienced by pipelined, superscalar processors with
the thread level parallelism of multithreading. This allows the processor to issue multiple instructions from
multiple threads in a single clock cycle, thus increasing the overall instruction throughput. SMT attacks multiple
sources of lost resource utilization in wide-issue processors.
References
[1] B. Ramakrishna Rau and Joseph A. Fisher. Instruction-level parallel processing: History, overview and
perspective. HPL-92-132, October 1992.
[2] D. M. Harris and S. L. Harris. Digital Design and Computer Architecture. Morgan Kaufmann, Amsterdam,
2007.
[3] M. Gulati and N. Bagherzadeh. Performance study of a multithreaded superscalar microprocessor. In Second
International Symposium on High-Performance Computer Architecture, pages 291-301, February 1996.
[4] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm.
Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In
International Symposium on Computer Architecture, May 1996.
[5] Jack L. Lo, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Rebecca L. Stamm, and Dean M. Tullsen.
Converting thread-level parallelism into instruction-level parallelism via simultaneous multithreading. ACM
Transactions on Computer Systems, pages 322-354, August 1997.
[6] Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy. Simultaneous multithreading: Maximizing on-chip
parallelism. In International Symposium on Computer Architecture, 1995.
[7] Norman P. Jouppi and David W. Wall. Available instruction-level parallelism for superscalar and
superpipelined machines. In Third International Symposium on Architectural Support for Programming
Languages and Operating Systems, pages 272-282, April 1989.
[8] S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, R. L. Stamm, and D. M. Tullsen. Simultaneous
multithreading: A platform for next-generation processors. IEEE Micro, 17, 1997.
[9] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. Elementary
processor architecture with simultaneous instruction issuing from multiple threads. In 19th Annual International
Symposium on Computer Architecture, pages 136-145, May 1992.
[10] W. Yamamoto, M. J. Serrano, A. R. Talcott, R. C. Wood, and M. Nemirovsky. Performance estimation of
multistreamed, superscalar processors. In Twenty-Seventh Hawaii International Conference on System
Sciences, pages I:195-204, January 1994.
[11] W. Yamamoto and M. Nemirovsky. Increasing superscalar performance through multistreaming. In
Conference on Parallel Architectures and Compilation Techniques, pages 49-58, June 1995.
[12] B. K. Gunther. Superscalar performance in a multithreaded microprocessor. PhD thesis, University of
Tasmania, December 1993.
[13] Y. Li and W. Chu. The effects of STEF in finely parallel multithreaded processors. In First IEEE
Symposium on High-Performance Computer Architecture, pages 318-325, January 1995.
[14] C. J. Beckmann and C. D. Polychronopoulos. Microarchitecture support for dynamic scheduling of acyclic
task graphs. In 25th Annual International Symposium on Microarchitecture, pages 140-148, December 1992.
[15] R. Govindarajan, S. S. Nemawarkar, and P. LeNir. Design and performance evaluation of a multithreaded
architecture. In First IEEE Symposium on High-Performance Computer Architecture, pages 298-307, January
1995.
[16] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee. The M-Machine
multicomputer. In 28th Annual International Symposium on Microarchitecture, November 1995.
[17] G. S. Sohi, S. E. Breach, and T. N. Vijaykumar. Multiscalar processors. In 22nd Annual International
Symposium on Computer Architecture, pages 414-425, June 1995.