
Unit-5

Parallel Processing:
*Multiple processor organization:
• Single instruction, single data (SISD) stream: A single
processor executes a single instruction stream to operate on
data stored in a single memory. Uniprocessors fall into this
category.
• Single instruction, multiple data (SIMD) stream: A single
machine instruction controls the simultaneous execution of a
number of processing elements on a lockstep basis. Each
processing element has an associated data memory, so that
instructions are executed on different sets of data by
different processors. Vector and array processors fall into this
category.
• Multiple instruction, single data (MISD) stream: A sequence
of data is transmitted to a set of processors, each of which
executes a different instruction sequence. This structure is
not commercially implemented.
• Multiple instruction, multiple data (MIMD) stream: A set of
processors simultaneously execute different instruction
sequences on different data sets. SMPs, clusters, and NUMA
systems fit into this category. With the MIMD organization,
the processors are general purpose; each is able to process
all of the instructions necessary to perform the appropriate
data transformation. If the processors share a common
memory, then each processor accesses programs and data
stored in the shared memory, and processors communicate
with each other via that memory. The most common form of
such a system is known as a symmetric multiprocessor (SMP):
multiple processors share a single memory or pool of
memory by means of a shared bus or other interconnection
mechanism; a distinguishing feature is that the memory
access time to any region of memory is approximately the
same for each processor. A more recent development is the
nonuniform memory access (NUMA) organization. As the
name suggests, the memory access time to different regions
of memory may differ for a NUMA processor. A collection of
independent uniprocessors or SMPs may be interconnected
to form a cluster. Communication among the computers is
either via fixed paths or via some network facility.
Processor Organization (taxonomy of parallel processor architectures):
• Single Instruction, Single Data (SISD): uniprocessor
• Single Instruction, Multiple Data (SIMD): vector processor, array processor
• Multiple Instruction, Single Data (MISD): no commercial implementation
• Multiple Instruction, Multiple Data (MIMD):
  - Shared memory (tightly coupled): symmetric multiprocessor (SMP), nonuniform memory access (NUMA)
  - Distributed memory (loosely coupled): clusters
*Symmetric Multiprocessor:
Until fairly recently, virtually all single-user personal
computers and most workstations contained a single general-
purpose microprocessor. As demands for performance
increase and as the cost of microprocessors continues to
drop, vendors have introduced systems with an SMP
organization. The term SMP refers to a computer hardware
architecture and also to the operating system behavior that
reflects that architecture. An SMP can be defined as a
standalone computer system with the following
characteristics:
1. There are two or more similar processors of
comparable capability.
2. These processors share the same main memory and
I/O facilities and are interconnected by a bus or other
internal connection scheme, such that memory access time is
approximately the same for each processor.
3. All processors share access to I/O devices, either
through the same channels or through different channels
that provide paths to the same device.
4. All processors can perform the same functions (hence
the term symmetric).
5. The system is controlled by an integrated operating
system that provides interaction between processors and
their programs at the job, task, file, and data element levels.
The physical unit of interaction is usually a message or
complete file. In an SMP, individual data elements can
constitute the level of interaction, and there can be a high
degree of cooperation between processes.
The operating system of an SMP schedules processes or
threads across all of the processors. An SMP organization has
a number of potential advantages over a uniprocessor
organization, including the following:
• Performance: If the work to be done by a computer
can be organized so that some portions of the work can be
done in parallel, then a system with multiple processors will
yield greater performance than one with a single processor
of the same type.
• Availability: In a symmetric multiprocessor, because all
processors can perform the same functions, the failure of a
single processor does not halt the machine. Instead, the
system can continue to function at reduced performance.
• Incremental growth: A user can enhance the
performance of a system by adding an additional processor.
• Scaling: Vendors can offer a range of products with
different price and performance characteristics based on the
number of processors configured in the system.
Organization:
The most common organization for an SMP is a time-shared
bus. The bus has the same structure and interface found in a
single-processor system and provides the following features:
• Addressing: It must be possible to distinguish modules on
the bus to determine the source and destination of data.
• Arbitration: Any I/O module can temporarily function as
“master.” A mechanism is provided to arbitrate competing
requests for bus control, using some sort of priority scheme.
• Time-sharing: When one module is controlling the bus,
other modules are locked out and must, if necessary,
suspend operation until bus access is achieved. These
uniprocessor features are directly usable in an SMP
organization. In this latter case, there are now multiple
processors as well as multiple I/O processors all attempting
to gain access to one or more memory modules via the bus.
The bus organization has several attractive features:
• Simplicity: This is the simplest approach to multiprocessor
organization. The physical interface and the addressing,
arbitration, and time-sharing logic of each processor remain
the same as in a single-processor system.
• Flexibility: It is generally easy to expand the system by
attaching more processors to the bus.
• Reliability: The bus is essentially a passive medium, and the
failure of any attached device should not cause failure of the
whole system.
The main drawback to the bus organization is
performance. All memory references pass through the
common bus. Thus, the bus cycle time limits the speed of the
system. To improve performance, it is desirable to equip each
processor with a cache memory. This should reduce the
number of bus accesses dramatically.
An SMP operating system manages processor and other
computer resources so that the user perceives a single
operating system controlling system resources. In fact, such a
configuration should appear as a single-processor
multiprogramming system. In both the SMP and uniprocessor
cases, multiple jobs or processes may be active at one time,
and it is the responsibility of the operating system to
schedule their execution and to allocate resources. A user
may construct applications that use multiple processes or
multiple threads within processes without regard to whether
a single processor or multiple processors will be available.
Thus, a multiprocessor operating system must provide all the
functionality of a multiprogramming system plus additional
features to accommodate multiple processors. Among the
key design issues:
• Simultaneous concurrent processes: OS routines need to be
reentrant to allow several processors to execute the same OS
code simultaneously. With multiple processors executing the
same or different parts of the OS, OS tables and management
structures must be managed properly to avoid deadlock or
invalid operations.
• Scheduling: Any processor may perform scheduling, so
conflicts must be avoided. The scheduler must assign ready
processes to available processors.
• Synchronization: With multiple active processes having
potential access to shared address spaces or shared I/O
resources, care must be taken to provide effective
synchronization. Synchronization is a facility that enforces
mutual exclusion and event ordering (a minimal sketch using
a mutex appears after this list).
• Memory management: Memory management on a
multiprocessor must deal with all of the issues found on
uniprocessor machines. In addition, the operating system
needs to exploit the available hardware parallelism, such as
multiported memories, to achieve the best performance. The
paging mechanisms on different processors must be
coordinated to enforce consistency when several processors
share a page or segment and to decide on page replacement.
• Reliability and fault tolerance: The operating system should
provide graceful degradation in the face of processor failure.
The scheduler and other portions of the operating system
must recognize the loss of a processor and restructure
management tables accordingly.
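To make the synchronization issue concrete, the following is a
minimal sketch in C using POSIX threads; the shared counter,
thread count, and iteration count are illustrative assumptions,
not part of the original text. Without the mutex, concurrent
increments would race and the final count would be unpredictable.

/* Minimal mutual-exclusion sketch with POSIX threads.
   The shared counter and thread count are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define INCREMENTS  100000

static long shared_counter = 0;                       /* shared data */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);    /* enforce mutual exclusion */
        shared_counter++;             /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    /* With the mutex, the result is always NUM_THREADS * INCREMENTS. */
    printf("counter = %ld\n", shared_counter);
    return 0;
}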
*Cache Coherence and MESI protocol:
Cache coherence:
The goal is to make sure that READ(X) returns the most
recent value of the shared variable X, i.e. all valid copies of a
shared variable are identical.
1. Software solutions
2. Hardware solutions
Software Solutions: The compiler tags data as cacheable and
non-cacheable. Only read-only data is considered cacheable
and may be put in a private cache. All other data are non-
cacheable and can be put in a global cache, if available.
Hardware Solution: Snooping Cache Widely used in bus-
based multiprocessors. The cache controller constantly
watches the bus.
Write Invalidate: When a processor writes into its cached copy
of a shared variable X, all copies of X in other caches are
invalidated. Those processors must then read a valid copy
either from main memory M or from the processor that
modified the variable.
Write Broadcast: Instead of invalidating, the updated value is
broadcast to the other processors sharing that copy. This acts
as write-through for shared data and write-back for private
data.
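The following toy sketch in C illustrates the write-invalidate
idea for a single shared variable X; the cache structure and
names are illustrative assumptions, and real hardware performs
this by snooping bus transactions rather than looping over caches.

/* Toy write-invalidate illustration: one writer invalidates all
   other cached copies of X. Names and sizes are illustrative. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_CACHES 4

typedef struct { int value; bool valid; } cache_line_t;

static int memory_x = 0;                  /* main memory copy of X */
static cache_line_t cache[NUM_CACHES];    /* one line per processor */

/* Processor p writes X: update its own copy, invalidate the rest. */
void write_invalidate(int p, int new_value)
{
    cache[p].value = new_value;
    cache[p].valid = true;
    for (int i = 0; i < NUM_CACHES; i++)
        if (i != p)
            cache[i].valid = false;       /* snooping caches invalidate */
}

/* Processor p reads X: on an invalid copy, obtain a valid one. */
int read_x(int p)
{
    if (!cache[p].valid) {
        /* take the value from the modifying cache, if any, else memory */
        for (int i = 0; i < NUM_CACHES; i++)
            if (cache[i].valid) { memory_x = cache[i].value; break; }
        cache[p].value = memory_x;
        cache[p].valid = true;
    }
    return cache[p].value;
}

int main(void)
{
    write_invalidate(0, 42);            /* processor 0 writes X = 42 */
    printf("P2 reads X = %d\n", read_x(2));  /* processor 2 re-reads */
    return 0;
}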
MESI Protocol:
To provide cache consistency on an SMP, the data cache
often supports a protocol known as MESI. For MESI, the data
cache includes two status bits per tag, so that each line can
be in one of four states:
• Modified: The line in the cache has been modified
(different from main memory) and is available only in this
cache.
• Exclusive: The line in the cache is the same as that in main
memory and is not present in any other cache.
• Shared: The line in the cache is the same as that in main
memory and may be present in another cache.
• Invalid: The line in the cache does not contain valid data.
Event        Local                Remote
Read hit     Use local copy       No action
Read miss    I → S (or) I → E     (S, E, M) → S
Write hit    (S, E) → M           (S, E, M) → I
Write miss   I → M                (S, E, M) → I
When a cache line changes its state from Modified, it first
writes the line back to main memory.
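The table's transitions can be summarized as a state machine. Below
is a minimal sketch in C of those transitions; bus signaling,
snooping, and data movement are abstracted away, so this illustrates
the state changes only, not the full protocol.

/* MESI line states and the transitions from the table above.
   Bus signaling and data transfer are abstracted away. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

/* Local read: a hit keeps the state; a miss loads the line as
   Shared if another cache holds it, Exclusive otherwise. */
mesi_state_t local_read(mesi_state_t s, bool other_cache_has_line)
{
    if (s != INVALID)
        return s;                                     /* read hit */
    return other_cache_has_line ? SHARED : EXCLUSIVE; /* read miss */
}

/* Local write: from I, S, or E the line ends up Modified; remote
   copies are invalidated by the bus protocol. */
mesi_state_t local_write(mesi_state_t s)
{
    (void)s;
    return MODIFIED;
}

/* Snooped remote read: a Modified line is written back first,
   then all valid copies drop to Shared. */
mesi_state_t remote_read(mesi_state_t s)
{
    return (s == INVALID) ? INVALID : SHARED;
}

/* Snooped remote write: any copy this cache holds becomes Invalid. */
mesi_state_t remote_write(mesi_state_t s)
{
    (void)s;
    return INVALID;
}

int main(void)
{
    mesi_state_t line = INVALID;
    line = local_read(line, false);  /* read miss, no sharers: I -> E */
    line = local_write(line);        /* write hit: E -> M */
    line = remote_read(line);        /* snooped read: M -> S */
    return 0;
}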
READ MISS: When a read miss occurs in the local cache, the
processor initiates a memory read to read the line of main
memory containing the missing address. The processor
inserts a signal on the bus that alerts all other
processor/cache units to snoop the transaction.
There are a number of possible outcomes:
• If one other cache has a clean (unmodified since read from
memory) copy of the line in the exclusive state, it returns a
signal indicating that it shares this line. The responding
processor then transitions the state of its copy from exclusive
to shared, and the initiating processor reads the line from
main memory and transitions the line in its cache from
invalid to shared.
• If one or more caches have a clean copy of the line in the
shared state, each of them signals that it shares the line. The
initiating processor reads the line and transitions the line in
its cache from invalid to shared.
• If one other cache has a modified copy of the line, then that
cache blocks the memory read and provides the line to the
requesting cache over the shared bus. The responding cache
then changes its line from modified to shared. The line sent
to the requesting cache is also received and processed by the
memory controller, which stores the block in memory.
• If no other cache has a copy of the line (clean or modified),
then no signals are returned. The initiating processor reads
the line and transitions the line in its cache from invalid to
exclusive.
READ HIT: When a read hit occurs on a line currently in the
local cache, the processor simply reads the required item.
There is no state change: The state remains modified, shared,
or exclusive.
WRITE MISS: When a write miss occurs in the local cache, the
processor initiates a memory read to read the line of main
memory containing the missing address. For this purpose, the
processor issues a signal on the bus that means read-with-
intent-to-modify (RWITM). When the line is loaded, it is
immediately marked modified. With respect to other caches,
two possible scenarios precede the loading of the line of data:
• If another cache has a modified copy of the line, it signals
the initiating processor, which surrenders the bus and waits.
The other processor writes the modified line back to main
memory and transitions its copy to invalid. The initiating
processor then reissues RWITM, reads the line from main
memory, modifies it, and marks it modified.
• If no other cache has a modified copy, no signal is returned;
the initiating processor reads in the line and modifies it.
Meanwhile, any cache holding a clean copy of the line (shared
or exclusive) invalidates that copy.
WRITE HIT: When a write hit occurs on a line currently in the
local cache, the effect depends on the current state of that
line in the local cache:
• Shared: Before performing the update, the processor must
gain exclusive ownership of the line. It signals its intent on
the bus; each processor that has a shared copy of the line
transitions it from shared to invalid. The initiating processor
then performs the update and transitions its copy from
shared to modified.
• Exclusive: The processor already has exclusive control of
the line, so it simply performs the update and transitions its
copy from exclusive to modified.
• Modified: The processor already has exclusive control of
the line and has it marked modified, so it simply performs
the update.
L1-L2 CACHE CONSISTENCY: We have so far described cache
coherency protocols in terms of the cooperative activity
among caches connected to the same bus or other SMP
interconnection facility. Typically, these caches are L2 caches,
and each processor also has an L1 cache that does not
connect directly to the bus and that therefore cannot engage
in a snoopy protocol. Thus, some scheme is needed to
maintain data integrity across both levels of cache and across
all caches in the SMP configuration.
*Multithreading and chip multiprocessor:
Multithreading:
Multithreading is the ability of a CPU (or a single core in
a multi-core processor) to execute
multiple processes or threads concurrently, supported by
the OS. This approach differs from multiprocessing. In a
multithreaded application, the processes and threads share
the resources of a single or multiple cores, which include the
computing units, the CPU caches, and the translation
lookaside buffer (TLB).
Where multiprocessing systems include multiple complete
processing units in one or more cores, multithreading aims to
increase utilization of a single core by using thread-level
parallelism, as well as instruction-level parallelism. As the
two techniques are complementary, they are sometimes
combined in systems with multiple multithreading CPUs and
with CPUs with multiple multithreading cores.
Advantages: If a thread gets a lot of cache misses, the other
threads can continue taking advantage of the unused
computing resources, which may lead to faster overall
execution, as these resources would have been idle if only a
single thread were executed. Also, if a thread cannot use all
the computing resources of the CPU (because instructions
depend on each other's result), running another thread may
prevent those resources from becoming idle.
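A minimal sketch in C of thread-level parallelism using POSIX
threads: two threads sum disjoint halves of an array, so execution
resources that one thread leaves idle can do useful work for the
other. The array size and thread count are illustrative assumptions.

/* Thread-level parallelism sketch: two threads sum disjoint
   halves of an array. Sizes are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000

static double data[N];

typedef struct { int lo, hi; double sum; } range_t;

static void *partial_sum(void *arg)
{
    range_t *r = (range_t *)arg;
    r->sum = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];                /* independent work per thread */
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0;

    range_t halves[2] = { { 0, N / 2, 0.0 }, { N / 2, N, 0.0 } };
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, partial_sum, &halves[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("total = %f\n", halves[0].sum + halves[1].sum);
    return 0;
}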
Disadvantages: Multiple threads can interfere with each
other when sharing hardware resources such as caches
or TLBs. As a result, execution times of a single thread are
not improved and can be degraded, even when only one
thread is executing, due to lower frequencies or additional
pipeline stages that are necessary to accommodate thread-
switching hardware.
Overall efficiency varies; Intel claims up to 30% improvement
with its Hyper-Threading Technology,[1] while a synthetic
program just performing a loop of non-optimized dependent
floating-point operations actually gains a 100% speed
improvement when run in parallel. On the other hand, hand-
tuned assembly language programs
using MMX or AltiVec extensions and performing data
prefetches (as a good video encoder might) do not suffer
from cache misses or idle computing resources. Such
programs therefore do not benefit from hardware
multithreading and can indeed see degraded performance
due to contention for shared resources.
From the software standpoint, hardware support for
multithreading is more visible to software, requiring more
changes to both application programs and operating systems
than multiprocessing. Hardware techniques used to
support multithreading often parallel the software
techniques used for computer multitasking. Thread
scheduling is also a major problem in multithreading.
Types of Multithreading:
• Interleaved multithreading: This is also known as fine-
grained multithreading. The processor deals with two or
more thread contexts at a time, switching from one thread to
another at each clock cycle. If a thread is blocked because of
data dependencies or memory latencies, that thread is
skipped and a ready thread is executed.
• Blocked multithreading: This is also known as coarse-
grained multithreading. The instructions of a thread are
executed successively until an event occurs that may cause
delay, such as a cache miss. This event induces a switch to
another thread. This approach is effective on an in-order
processor that would stall the pipeline for a delay event such
as a cache miss.
• Simultaneous multithreading (SMT): Instructions are
simultaneously issued from multiple threads to the execution
units of a superscalar processor. This combines the wide
superscalar instruction issue capability with the use of
multiple thread contexts.
• Chip multiprocessing: In this case, the entire processor is
replicated on a single chip and each processor handles
separate threads. The advantage of this approach is that the
available logic area on a chip is used effectively without
depending on ever-increasing complexity in pipeline design.
This is referred to as multi-core.
Chip multiprocessor:
A chip multiprocessor cannot achieve the same degree of
instruction-level parallelism as an SMT processor, because
the chip multiprocessor is not able to hide latencies by
issuing instructions from other threads. On the other hand,
the chip multiprocessor should
outperform a superscalar processor with the same
instruction issue capability, because the horizontal losses will
be greater for the superscalar processor. In addition, it is
possible to use multithreading within each of the processors
on a chip multiprocessor, and this is done on some
contemporary machines.
Example Systems:
PENTIUM 4: More recent models of the Pentium 4 use a
multithreading technique that the Intel literature refers to as
hyperthreading [MARR02]. In essence, the Pentium 4
approach is to use SMT with support for two threads. Thus,
the single multithreaded processor is logically two
processors.
IBM POWER5: The IBM Power5 chip, which is used in
high-end PowerPC products, combines chip multiprocessing
with SMT [KALL04]. The chip
has two separate processors, each of which is a
multithreaded processor capable of supporting two threads
concurrently using SMT. Interestingly, the designers
simulated various
alternatives and found that having two two-way SMT
processors on a single chip provided superior performance to
a single four-way SMT processor. The simulations showed
that additional multithreading beyond the support for two
threads might decrease performance because of cache
thrashing, as data from one thread displaces data needed by
another thread.
*Clusters:
Cluster is a set of loosely or tightly connected computers
working together as a unified computing resource that can
create the illusion of being one machine. Computer clusters
have each node set to perform the same task, controlled and
scheduled by software.
The components of a cluster are usually connected to each
other through fast local area networks, with each node
running its own instance of an operating system. In most
circumstances, all the nodes use the same hardware and the
same operating system, although in some setups different
hardware or different operating systems can be used.
Types of Clusters –
Computer clusters are arranged in such a way as to support
different purposes, from general-purpose business needs
such as web-service support to computation-intensive
scientific calculation. Basically there are three types of
clusters:
• Load-Balancing Cluster – A cluster requires an effective
capability for balancing the load among available computers.
In this type, cluster nodes share the computational workload
so as to enhance the overall performance. For example, a
high-performance cluster used for scientific calculation would
balance load using different algorithms from a web-server
cluster, which may just use a simple round-robin method,
assigning each new request to a different node (a minimal
round-robin sketch appears after this list). This type of
cluster is used on farms of Web servers (web farms).
• Fail-Over Clusters – The function of switching applications
and data resources over from a failed system to an
alternative system in the cluster is referred to as fail-over.
Clusters of this type are used for mission-critical databases,
mail, file, and application servers.
• High-Availability Clusters – These are also known as “HA
clusters”. They offer a high probability that all resources will
be in service. If a failure does occur, such as a system going
down or a disk volume being lost, then the queries in
progress are lost. Any lost query, if retried, will be serviced
by a different computer in the cluster. This type of cluster is
widely used in web, email, news, or FTP servers.
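As referenced in the load-balancing item above, here is a minimal
round-robin dispatcher sketch in C; the node names and node count
are illustrative assumptions, and a real web farm would dispatch at
a network load balancer rather than in application code.

/* Minimal round-robin dispatcher: each new request goes to the
   next node in turn. Node names are illustrative. */
#include <stdio.h>

#define NUM_NODES 3

static const char *nodes[NUM_NODES] = { "node-a", "node-b", "node-c" };
static int next_node = 0;

/* Return the node that should service the next request. */
const char *dispatch(void)
{
    const char *chosen = nodes[next_node];
    next_node = (next_node + 1) % NUM_NODES;
    return chosen;
}

int main(void)
{
    for (int request = 1; request <= 7; request++)
        printf("request %d -> %s\n", request, dispatch());
    return 0;
}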
Benefits –
• Absolute scalability – It is possible to create large clusters
that beat the power of even the largest standalone machines.
A cluster can have dozens of multiprocessor machines.
• Incremental scalability – A cluster is configured in such a
way that it is possible to add new systems to the cluster in
small increments. Clusters have the ability to add systems
horizontally: more computers may be added to the cluster to
improve its performance, redundancy, and fault tolerance
(the ability of a system to continue working when a node
malfunctions).
• High availability – Because each node in a cluster is a
standalone computer, the failure of one node does not mean
loss of service. A single node can be taken down for
maintenance while the rest of the cluster takes on the load of
that individual node.
• Preferable price/performance – Clusters are usually set up
to improve performance and availability over single
computers, while typically being much more cost effective
than single computers of comparable speed or availability.
*Non-uniform memory access (NUMA):
• Non-uniform memory access (NUMA): All processors have
access to all parts of main memory using loads and stores.
The memory access time of a processor differs depending on
which region of main memory is accessed. The last statement
is true for all processors; however, for different processors,
which memory regions are slower and which are faster differ.
• Cache-coherent NUMA (CC-NUMA): A NUMA system in
which cache coherence is maintained among the caches of
the various processors.
A NUMA system without cache coherence is more or less
equivalent to a cluster.
Motivation:
With an SMP system, there is a practical limit to the number
of processors that can be used. One approach to achieving
large-scale multiprocessing while retaining the flavour of
SMP is NUMA. The objective of NUMA is to maintain a
transparent system-wide memory while permitting multiple
multiprocessor nodes, each with its own bus or other
internal interconnection system.
Organization:
Each node in a CC-NUMA system includes some main
memory. Each node must maintain some sort of directory
that gives it an indication of the location of various portions
of memory and also cache status information.
NUMA Pros and Cons:
The main advantage of a CC-NUMA system is that it can
deliver effective performance at higher levels of parallelism
than SMP, without requiring major software changes. With
multiple NUMA nodes, the bus traffic on any individual node
is limited to a demand that the bus can handle. However, if
many of the memory accesses are to remote nodes,
performance begins to break down. Even if this performance
breakdown due to remote access is addressed, there are two
other disadvantages of the CC-NUMA approach. First, a
CC-NUMA does not transparently look like an SMP: software
changes will be required to move an operating system and
applications from an SMP to a CC-NUMA system. These
include page allocation, already mentioned, process
allocation, and load balancing by the operating system. A
second concern is that of availability.
*Vector computation:
There is a class of computational problems that is beyond
the capabilities of a conventional computer. These problems
are characterized by the fact that they require a vast number
of computations that would take a conventional computer
days or even weeks to complete. In many science and engineering
applications, the problems can be formulated in terms of
vectors and matrices that lend themselves to vector
processing.
Computers with vector processing capabilities are in demand
in specialized applications. The following are representative
application areas where vector processing is of the utmost
importance.
Long range weather forecasting
Petroleum explorations
Seismic data analysis
Medical diagnosis
Aerodynamics and space flight simulation
Artificial intelligence and expert systems
Mapping the human genome
Image processing
Without sophisticated computers, many of the required
computations cannot be completed within a reasonable
amount of time. To achieve the required level of high
performance it is necessary to utilize the fastest and most
reliable hardware and apply innovative procedures from
vector and parallel processing techniques.
Vector Operations:
Many scientific problems require arithmetic operations on
large arrays of numbers. These numbers are usually
formulated as vectors and matrices of floating-point
numbers.
A vector is an ordered set of data items in a one-dimensional
array. A vector V of length n is represented as a row vector
by V = [v1 v2 v3 … vn]. It may be represented as a column
vector if the data items are listed in a column.
A conventional sequential computer is capable of
processing operands one at a time. Consequently, operations
on vectors must be broken down into single computations
on subscripted variables.
The element vi of vector V is written as V(i), and the index i
refers to a memory address or a register where the number
is stored.
A computer capable of vector processing eliminates the
overhead associated with the time it takes to fetch and
execute the instructions in the program loop. For example,
the following single vector statement adds two 100-element
vectors:
C(1:100) = A(1:100) + B(1:100)
The vector instruction includes the initial addresses of the
operands, the length of the vectors, and the operation to be
performed, all in one composite instruction. It is essentially a
three-address instruction, with three fields specifying the
base addresses of the two source operands and the
destination, and an additional field that gives the vector
length (the number of data items in the vectors).
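For comparison, a minimal sketch in C of the scalar loop that the
single vector statement above replaces; on a conventional processor
each element costs a separate fetch-and-execute cycle, which the
vector instruction eliminates.

/* Scalar equivalent of C(1:100) = A(1:100) + B(1:100): one
   iteration per element; a vector processor performs the whole
   operation as a single instruction. */
#include <stdio.h>

#define N 100

void vector_add(const float a[N], const float b[N], float c[N])
{
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    /* one scalar add per iteration */
}

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 1.0f; }
    vector_add(a, b, c);
    printf("c[0] = %.1f, c[99] = %.1f\n", c[0], c[99]);
    return 0;
}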
Matrix Multiplication:
Matrix multiplication is one of the most computationally
intensive operations performed in computers with vector
processors. The multiplication of two n×n matrices consists
of n² inner products or n³ multiply-add operations. An M×N
matrix having M rows and N columns may be considered as
constituting a set of M row vectors and a set of N column
vectors.
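A minimal sketch in C of the operation count just described; the
triple loop makes the n³ multiply-add structure explicit. The matrix
size is an illustrative assumption.

/* Naive n x n matrix multiplication: n^2 inner products, each of
   n multiply-add operations, for n^3 multiply-adds in total. */
#include <stdio.h>

#define N 4   /* illustrative size */

void matmul(const double a[N][N], const double b[N][N], double c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;              /* inner product of row i, column j */
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];  /* one multiply-add */
            c[i][j] = sum;
        }
}

int main(void)
{
    double a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0;
            b[i][j] = (i == j) ? 1.0 : 0.0;  /* identity matrix */
        }
    matmul(a, b, c);
    printf("c[0][0] = %.1f\n", c[0][0]);     /* expect 1.0 */
    return 0;
}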
Operation Code | Base Address Source 1 | Base Address Source 2 | Base Address Destination | Vector Length

[Instruction format for a vector processor]
