DISTRIBUTED DATABASES
Authored by:
Sathyanarayana.S.V
Assistant Professor
Department of Information Science & Engineering
J.N.N.College of Engineering
SHIMOGA 577204
Email: sathya_sv@rediffmail.com
CONTENTS
PREFACE
- Sathyanarayana.S.V
UNIT - 1
INTRODUCTION TO COMPUTER NETWORKS
Structure
1.0 Objectives
1.1 Introduction
1.2 Network types
1.3 LAN Technologies
1.3.1 LAN Topologies
1.3.2 Medium Access Protocols
1.3.2.1 CSMA/CD Protocol
1.3.2.2 Token ring Protocol
1.4 WAN Technologies
1.4.1 Switching Techniques
1.4.1.1 Circuit Switching
1.4.1.2 Packet Switching
1.4.2 Routing Techniques
1.4.2.1 Static Routing
1.4.2.2 Dynamic Routing
1.5 Communication Protocols
1.5.1 The OSI Reference Model
1.6 Summary
1.0 Objectives: Computer networks provide the necessary means for communication
between the computing elements of a system. By the end of this unit, you will know
the basic concepts of computer networking. We outline the characteristics of
local and wide area networks and summarize the principles of protocols and protocol
layering. Overall, this unit will help you to understand the advanced concept of task
execution called distributed computing.
1.1 Introduction:
A computer network is a communication system that links end systems by
communication lines and software protocols to exchange data between two processes
running on different end systems of the network. The end systems are often referred
to as nodes, sites, hosts, computers, machines, and so on. The nodes may vary in size
and function. Size-wise, a node may be a small microprocessor, a workstation, a
In a simple multi-access bus network, all sites are directly connected to a single
transmission medium (called the bus) that spans the whole length of the network
(Fig 1.1). The bus is passive and is shared by all the sites for any message
transmission in the network. Each site is connected to the bus by a drop cable
using a T-connection or tap. Broadcast communication is used for message
transmission.
[Fig 1.1: Sites attached to a shared bus]
That is, a message is transmitted from one site to another by placing it on the
shared bus. An address designator is associated with the message. As the message
travels on the bus, each site checks whether the message is addressed to it, and the
addressed site picks up the message.
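The broadcast mechanism just described can be sketched in Python as follows; the Site class, the addresses, and the frame format are illustrative inventions, not part of any real protocol:

```python
# Toy model of broadcast on a shared bus: every site sees every frame,
# but only the site whose address matches the designator keeps it.

class Site:
    def __init__(self, address):
        self.address = address
        self.received = []

    def on_bus_signal(self, frame):
        # Each site inspects the address designator of every frame on the bus.
        if frame["dest"] == self.address:
            self.received.append(frame["data"])

def broadcast(sites, frame):
    # The bus is passive: a transmitted frame simply reaches every attached site.
    for site in sites:
        site.on_bus_signal(frame)

sites = [Site(a) for a in ("A", "B", "C")]
broadcast(sites, {"dest": "B", "data": "hello"})
# only site B keeps the message; the other sites ignore it
```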
A variant of the simple multi-access bus network topology is the multi-access
branching bus network topology. In such a network, two or more simple multi-
access bus networks are interconnected using repeaters (Fig 1.2). Repeaters are
hardware devices used to connect cable segments. They amplify and copy
electric signals from one segment of a network to its next segment.
[Fig 1.2: Two shared buses, each with its own set of sites, interconnected by a repeater]
1.3.2 Medium-Access Control Protocols: In both multi-access bus and ring
networks, all the sites of a network share a single channel, resulting in a
multi-access environment. In such an environment, it is possible that several sites try to
transmit information over the shared channel simultaneously. In this case, the transmitted
information may become scrambled and must be discarded. The concerned sites must be
notified about the discarded information, so that they can retransmit it. If
no special provisions are made, this situation may repeat indefinitely. Hence, special
schemes are needed in a multi-access environment to control access to a shared channel.
These schemes are known as medium-access control protocols.
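As a concrete instance of such a protocol, Ethernet's CSMA/CD resolves repeated collisions with truncated binary exponential backoff: after the n-th successive collision, a site waits a random number of slot times before retrying, and the range of the random choice doubles with each further collision. A minimal sketch (carrier sensing and slot timing are omitted):

```python
import random

def backoff_slots(attempt, max_exp=10):
    """Truncated binary exponential backoff, as used by classic Ethernet:
    after the n-th successive collision, wait a random number of slot
    times drawn uniformly from 0 .. 2**min(n, max_exp) - 1."""
    return random.randrange(2 ** min(attempt, max_exp))

# Two colliding sites choose independent backoffs; because the range doubles
# with each further collision, the chance of colliding again keeps shrinking.
first_retry = backoff_slots(1)     # 0 or 1 slot times
tenth_retry = backoff_slots(10)    # anywhere from 0 to 1023 slot times
```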
Clearly, in a multi-access environment, the use of a medium having a high raw data
rate alone is not sufficient. The medium-access control protocol used must also provide
for efficient use of the medium's bandwidth. Therefore, the medium-access control
protocol has a significant effect on the overall performance of a computer network, and
often it is by such protocols that networks differ the most. The three important performance
techniques are described next. When a packet reaches the PSE of its receiving
computer, it is delivered to the receiving computer.
[Fig 1.4: A WAN of six computers (A-F) connected through PSEs; the channels between the PSEs are numbered 1-4]
1.4.1.1 Circuit Switching: In circuit switching, once the circuit is established,
exclusive access to it is guaranteed until the call is terminated.
In this method, before data transmission starts, a physical circuit is constructed
between the sender and receiver computers during the circuit establishment phase. During
this phase, the channels constituting the circuit are reserved exclusively for the circuit;
hence there is no need for buffers at the intermediate PSEs. Once the circuit is
established, all packets of the data are transferred one after another through the dedicated
circuit without being buffered at intermediate sites; the packets appear to form a
continuous data stream. Finally, in the circuit termination phase, the circuit is torn down
as soon as the last packet of the data is transmitted. As soon as the circuit is torn down, the channels
that were reserved for the circuit become available for use by others. If a circuit cannot be
established because a desired channel is busy (being used), the circuit is said to be
blocked. Depending on the way blocked circuits are handled, the partial circuit may be
torn down, with establishment to be attempted later.
The main advantages of the circuit-switching technique are as follows:
Once the circuit is established, data is transmitted with no delay other than the
propagation delay, which is negligible.
Since the full capacity of the circuit is available for exclusive use by the
connected pair of computers, the transmission time required to send a message
can be known and guaranteed after the circuit has been successfully established.
However, the method requires additional overhead during circuit establishment and
circuit disconnection phases, and channel bandwidth may be wasted if the connected pair
of computers does not utilize the channel capacities of the path forming the circuit
efficiently. Therefore, the method is considered suitable only for long continuous
transmissions or for transmissions that require guaranteed maximum transmission delay.
It is the preferred method for transmission of voice and real-time data in distributed
applications. Circuit switching is used in the Public Switched Telephone Network
(PSTN).
1.4.1.2 Packet Switching: In this method, instead of establishing a dedicated path
between a sender and receiver pair (of computers), the channels are shared for
transmitting packets of different sender-receiver pairs. That is, a channel is occupied by a
sender-receiver pair only while transmitting a single packet of the message of that pair;
the channel may then be used for transmitting either another packet of the sender-receiver
pair or a packet of some other sender-receiver pair.
In this method, each packet of the message contains the address of the
destination computer, so that it can be sent to its destination independently of all other
packets. Notice that different packets of the same message may take different paths
through the network, and at the destination computer the receiver may get the
packets in an order different from the order in which they were sent. Therefore, at the
destination computer, the packets have to be properly reassembled into a message. When
a packet reaches a PSE (Refer Fig.1.4), the packet is temporarily stored there in a packet
buffer. The packet is then forwarded to a selected neighboring PSE when the next channel
becomes available and the neighboring PSE has an available packet buffer. Hence the
actual path taken by a packet to its destination is dynamic because the path is established
as the packet travels along. The packet-switching technique is also known as store-and-
forward communication because every packet is temporarily stored by each PSE along its
route before it is forwarded to another PSE.
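Because packets travel independently, the destination must put them back in order before handing the message to the receiver. A minimal sketch, assuming each packet carries a (sequence number, payload) pair; this packet format is invented for illustration:

```python
def reassemble(packets):
    """Sort received (seq, payload) pairs by sequence number and
    concatenate the payloads back into the original message."""
    return b"".join(payload for _, payload in sorted(packets))

# Packets of one message arriving out of order at the destination:
arrived = [(2, b"wor"), (0, b"hel"), (3, b"ld"), (1, b"lo ")]
message = reassemble(arrived)   # b"hello world"
```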
As compared to circuit switching, packet switching has the following characteristics:
It is suitable for transmitting small amounts of data that are bursty in nature.
The method allows efficient usage of channels because the communication
bandwidth of a channel is shared for transmitting several messages.
Furthermore, the dynamic selection of the actual path to be taken by a packet
gives the network considerable reliability because failed PSEs or channels can be
ignored and alternate paths may be used. For example, in the WAN of Figure 1.4,
if channel 2 fails, using the path 1-3 the message can still be sent from computer
A to D.
On the other hand, due to the need to buffer each packet at every PSE and to
reassemble the packets at the destination computer, the overhead incurred is large.
Therefore, the method is inefficient for transmitting large messages.
Another drawback of the method is that there is no guarantee of how long it takes
a message to go from its source computer to its destination computer because the
time taken for each packet depends on the route chosen for the packet, along with
the volume of data being transferred along this route.
Packet switching is used in the X.25 public packet network and the Internet.
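Ignoring propagation delay, queueing, and header overhead, the trade-off between the two switching techniques can be made concrete with a small calculation; all the figures used below are invented for illustration:

```python
def packet_switch_delay(msg_bits, pkt_bits, hops, rate_bps):
    """Store-and-forward delay over `hops` channels of equal rate: the first
    packet crosses every hop, and the remaining packets follow in pipeline."""
    n_packets = msg_bits / pkt_bits
    per_hop = pkt_bits / rate_bps        # time to push one packet onto a channel
    return (hops + n_packets - 1) * per_hop

def circuit_switch_delay(msg_bits, rate_bps, setup_s):
    """Once the circuit is up, the whole message streams at the full rate."""
    return setup_s + msg_bits / rate_bps

# A 1-Mbit message, 1-kbit packets, 4 hops, 1-Mbps channels, 0.5 s circuit setup:
ps = packet_switch_delay(1_000_000, 1_000, 4, 1_000_000)   # 1.003 s
cs = circuit_switch_delay(1_000_000, 1_000_000, 0.5)       # 1.5 s
```

For this (long, continuous) transfer, circuit switching pays its setup cost once and then wins as the message grows, while packet switching avoids any setup and so is cheaper for short, bursty traffic, matching the suitability claims above.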
1.4.2 Routing Techniques: In a WAN, when multiple paths exist between the source and
destination computers of a packet, any one of the paths may be used to transfer the
packet. For example, in the WAN of Figure 1.4, there are two paths between computers E
and F: 3-4 and 1-2-4-and any one of the two may be used to transmit a packet from
computer E to F. The selection of the actual path to be used for transmitting a packet is
determined by the routing technique used. An efficient routing technique is crucial to the
overall performance of the network. This requires that the routing decision process be as
fast as possible, to minimize delays in the network, and that it be easily implementable
in hardware. Furthermore, the decision process
usually should not require global state information of the network because such
information gathering is a difficult task and creates additional traffic in the network.
Routing algorithms are usually classified based on the following three attributes:
Place where routing decisions are made
Time constant of the information upon which the routing decisions are based
Control mechanism used for dynamic routing
Note that routing techniques are not needed in LANs because the sender of a message
simply puts the message on the communication channel and the receiver takes it off from
the channel. There is no need to decide the path to be used for transmitting the message
from the sender to the receiver.
Of the three attributes, let us consider only the second, as it is the most important
as far as our requirement is concerned. According to this attribute, routing algorithms are
classified as follows:
Static routing. In this method, routing tables (stored on PSEs) are set once and
do not change for very long periods of time. They are changed only when the
network undergoes major modifications. Static routing is also known as fixed
or deterministic routing. Static routing is simple and easy to implement.
However, it makes poor use of network bandwidth and causes blocking of a
packet even when alternative paths are available for its transmission. Hence,
static routing schemes are susceptible to component failures.
Dynamic routing. In this method, routing tables are updated relatively
frequently, reflecting shorter-term changes in the network environment.
Dynamic routing strategy is also known as adaptive routing because it has a
tendency to adapt to the dynamically changing state of the network, such as
the presence of faulty or congested channels. Dynamic routing schemes can
use alternative paths for packet transmissions, making more efficient use of
network bandwidth and providing resilience to failures. The latter property is
particularly important for large-scale architectures, since expanding network
size can increase the probability of encountering a faulty network component.
In dynamic routing, however, packets of a message may arrive out of order at
the destination computer. Appending a sequence number to each packet and
properly reassembling the packets at the destination computer can solve this
problem. The path selection policy for dynamic routing may either be minimal
or non-minimal. In the minimal policy, the selected path is one of the shortest
paths between the source and destination pair of computers. Therefore, every
channel visited will bring the packet closer to the destination. On the other
hand, in the non-minimal policy, a packet may follow a longer path, usually in
response to current network conditions. If the non-minimal policy is used,
care must be taken to avoid a situation in which the packet will continue to be
routed through the network but never reach the destination.
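Under the minimal policy, route selection reduces to a shortest-path computation over the current channel costs; Dijkstra's algorithm is the classic choice, and a dynamic routing scheme would simply recompute as channels fail or become congested. A sketch, with the topology and channel costs invented (loosely modeled on the PSEs of Fig 1.4):

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra's shortest-path algorithm over a channel-cost graph.
    `graph` maps each node to a dict of {neighbor: channel cost}."""
    dist, prev = {src: 0}, {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry, already improved
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst                  # walk predecessors back to the source
    while node != src:
        path.append(node)
        node = prev[node]
    return [src] + path[::-1]

# A dynamic scheme would delete a failed channel from `graph` and recompute,
# so the next packets automatically follow an alternative path.
wan = {"A": {"B": 1}, "B": {"A": 1, "D": 2}, "C": {"D": 1}, "D": {"B": 2, "C": 1}}
```

For example, `shortest_path(wan, "A", "D")` selects the route A-B-D; removing the B-D channel before the next computation would model a channel failure.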
1.5 Communication Protocols:
For transmission of message data comprising multiple packets, the sender and receiver
must also agree upon the method used for identifying the first packet and the last packet
of the message. Moreover, agreement is also needed for handling duplicate messages,
avoiding buffer overflows, and assuring proper message sequencing. Network designers
define all such agreements, needed for communication between the communicating
parties, in terms of rules and conventions. The term protocol is used to refer to a set of
such rules and conventions.
Computer networks are implemented using the concept of layered protocols.
According to this concept, the protocols of a network are organized into a series
of layers in such a way that each layer contains protocols for exchanging data and
providing functions in a logical sense with the peer entities at other sites in the
network.
Entities in the adjacent layers interact in a physical sense through the common
interface defined between the two layers by passing parameters such as headers,
trailers, and data parameters. The main reasons for using the concept of layered
protocols in network design are as follows:
The protocols of a network are fairly complex. Designing them in layers makes
their implementation more manageable.
Layering of protocols provides well-defined interfaces between the layers, so
that a change in one layer does not affect an adjacent layer. That is, the various
functionalities can be partitioned and implemented independently so that each
one can be changed as technology improves without the other ones being
affected. For example, a change to a routing algorithm in a network control
program should not affect the functions of message sequencing, which is located
in another layer of the network architecture.
Layering of protocols also allows interaction between functionally paired
layers in different locations. This concept aids in permitting the distribution of
functions to remote sites.
The terms protocol suite, protocol family, or protocol stack are used to refer to
the collection of protocols (of all layers) of a particular network system.
1.5.1 The OSI Reference Model: The basic goal of communication protocols for
network systems is to allow remote computers to communicate with each other and to
allow users to access remote resources. On the other hand, the basic goal of
communication protocols for distributed systems is not only to allow users to access
remote resources but also to do so in a transparent manner. Several standards and
protocols for network systems are already available.
The number of layers, the name of each layer, and the functions of each layer may
differ from one network to another.
To make the job of the network communication protocol designers easier, the
International Organization for Standardization (ISO) has developed a reference model
that identifies seven standard layers and defines the jobs to be performed at each
layer. This model is called the Open Systems Interconnection Reference Model (OSI
model).
It is a guide, not a specification. It provides a framework in which standards can
be developed for the services and protocols at each layer. To provide an
understanding of the structure and functioning of layered network protocols, a
brief description of the OSI model is given here (see Fig 1.5).
It is a seven-layer architecture in which a separate set of protocols is defined for
each layer. Thus each layer has an independent function and deals with one or
more specific aspects of the communication.
The seven layers are Physical, Data link, Network, Transport, Session,
Presentation, and Application. Let us deal with them one by one in detail.
o Physical Layer: This specifies the physical link interconnection, including
electrical/photonic characteristics.
o Data link layer: This specifies how data travels between two end points of
a communication link (e.g., a host and a packet switch). At this level,
data is delivered in a frame, which consists of a stream of binary 0s
and 1s to which checksum techniques are applied to detect errors. The
Ethernet protocol is one example of this.
o Network layer: This defines the basic unit of transfer across the network
and includes the concepts of multiplexing and routing. At this level the
software assembles a packet in the form the network expects and uses
layer 2 to transfer it over individual links.
o Transport layer: This provides end-to-end reliability by having the
destination host communicate with the source host, to compensate for the fact
that multiple networks with different qualities of service may have been
utilized.
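The error detection mentioned for the data link layer can be illustrated with Python's `zlib.crc32`, which implements the same CRC-32 polynomial used by Ethernet's frame check sequence. The framing below is a toy simplification (real Ethernet computes the FCS over the header fields as well as the payload):

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 trailer to the payload, as a sender would."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes) -> bool:
    """Recompute the CRC at the receiver and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer

frame = make_frame(b"some frame data")
assert check_frame(frame)                              # undamaged frame passes
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]       # flip a single bit
assert not check_frame(corrupted)                      # corruption is detected
```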
[Fig 1.5: Two communicating processes A and B, each atop a stack of layers; peer layers interact logically through their protocols, while data physically passes down one stack, across the network, and up the other through the interfaces between adjacent layers]
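The layered interaction depicted in Fig 1.5 can be sketched as header encapsulation: each layer treats what it receives from the layer above as opaque data and prepends its own header, which the peer layer at the receiver strips off. Only three of the seven layers are modeled here, and the header contents are invented:

```python
layers = ["transport", "network", "data-link"]

def send_down(app_data: str) -> str:
    """Pass application data down the stack; each layer prepends a header."""
    pdu = app_data
    for layer in layers:
        pdu = f"[{layer}-hdr]{pdu}"
    return pdu                           # what actually crosses the wire

def receive_up(wire_data: str) -> str:
    """Pass received data up the stack; each peer layer strips its header."""
    pdu = wire_data
    for layer in reversed(layers):       # outermost header was added last
        header = f"[{layer}-hdr]"
        assert pdu.startswith(header)
        pdu = pdu[len(header):]
    return pdu
```

For example, `send_down("hello")` yields `[data-link-hdr][network-hdr][transport-hdr]hello`, and `receive_up` recovers exactly the original application data, mirroring how a change inside one layer stays invisible to the others.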
UNIT-2
DISTRIBUTED COMPUTING SYSTEM: AN INTRODUCTION
Structure
2.0 Objectives
2.1 Introduction
2.2 Distributed Computing System: An Outline
2.3 Evolution of Distributed Computing System
2.4 Distributed Computing System Models
2.4.1 Minicomputer model
2.4.2 Workstation model
2.4.3 Workstation Server model
2.4.4 Processor pool model
2.4.5 Hybrid model
2.5 Uses of Distributed Computing System
2.6 Distributed Operating System
2.7 Issues in designing a Distributed Operating system
2.8 Introduction to Distributed Computing Environment
2.8.1 DCE
2.8.2 DCE Components
2.8.3 DCE Cells
2.9 Summary
2.0 Objectives: In this unit we will be learning a new task execution strategy called
Distributed Computing. By the end of this unit you will be able to understand this new
approach and the following related terminologies.
Distributed Computing System (DCS)
Distributed Computing models
Distributed Operating System
Distributed Computing Environment (DCE)
until the early 1970s that computers started to use the concept of time-sharing to
overcome this hurdle. Early time-sharing systems had several dumb terminals attached to
the main computer. These terminals were placed in a room different from the main computer
room. Using these terminals, multiple users could simultaneously execute interactive jobs
and share the resources of the computer system. In a time-sharing system, each user is
given the impression that he or she has his or her own computer because the system
switches rapidly from one user's job to the next, executing only a very small
part of each job at a time. Although the idea of time-sharing was demonstrated as early as
1960, time-sharing computer systems were not common until the early 1970s because
they were difficult and expensive to build. Parallel advancements in hardware technology
allowed reduction in the size and increase in the processing speed of computers, causing
large-sized computers to be gradually replaced by smaller and cheaper ones that had more
processing capability than their predecessors. These systems were called minicomputers.
The advent of time-sharing systems was the first step toward distributed
computing systems because it provided two important concepts used in
distributed computing systems:
The sharing of computer resources simultaneously by many users
The accessing of computers from a place different from the main computer room.
Initially the terminals of a time-sharing system were dumb terminals and all processing
was done by the main computer system. Advancements in microprocessor technology in
the 1970s allowed the dumb terminals to be replaced by intelligent terminals so that the
concepts of offline processing and time sharing could be combined to have the
advantages of both concepts in a single system. Microprocessor technology continued to
advance rapidly, making available in the early 1980s single-user computers called
workstations that had computing power almost equal to that of minicomputers but were
available for only a small fraction of the price of a minicomputer. For example, the first
workstation developed at Xerox PARC (called Alto) had a high-resolution monochrome
display, a mouse 128 kilobytes of main memory, a 2.5 megabyte hard disk, and a micro
programmed CPU that executed machine-level instruction at speeds of 2-6 s. These
workstations were then used as terminals in the time-sharing systems. In these time-
29
sharing systems, most of the processing of users job could be done at the users own
computer, allowing the main computer to be simultaneously shared by a larger number of
users. Shared resources such as files, databases, and software libraries were placed on the
main computer. Centralized time-sharing systems described above had a limitation in that
the terminals could not be placed very far from the main computer room since ordinary
cables were used to connect the terminals to the main computer. However, in parallel,
there were advancements in compute networking technology in the late 1960s and early
1970s that emerged as two key networking technologies-
LAN (Local Area Network): The LAN technology allowed several computers
located within a building or a campus to be interconnected in such a way that these
machines could exchange information with each other at data rates of about 10
megabits per second (Mbps). The first high-speed LAN was the Ethernet, developed
at Xerox PARC in 1973.
WAN (Wide Area Network): The WAN technology allowed computers located far
from each other (possibly in different cities, countries, or continents) to be
interconnected in such a way that these machines could exchange information with
each other at data rates of about 56 kilobits per second (Kbps). The first WAN was
the ARPANET (Advanced
Research Projects Agency Network) developed by the U.S. Department of Defense
in 1969.
The ATM technology: The data rates of networks continued to improve gradually in the
1980s providing data rates of up to 100 Mbps for LANs and data rates of up to 64 Kbps
for WANs. More recently (in the early 1990s), there has been another major advancement
in networking technology: the ATM (Asynchronous Transfer Mode) technology. The ATM
technology is an emerging technology that is still not very well established. It will make
very high speed networking possible, providing data transmission rates up to 1.2 gigabits
per second (Gbps) in both LAN and WAN environments. The availability of such high-
bandwidth networks will allow future distributed computing systems to support a
completely new class of distributed applications, called multimedia applications, that deal
with the handling of a mixture of information, including voice, video and ordinary data.
The merging of computer and networking technologies gave birth to Distributed
computing systems in the late 1970s.
[Figure: Minicomputer model - several minicomputers interconnected by a communication network]
The minicomputer model may be used when resource sharing (such as sharing of
information databases of different types, with each type of database located on a different
machine) with remote users is desired. The early ARPAnet is an example of a distributed
computing system based on the minicomputer model.
2.4.2 Workstation Model: As shown in Figure 2.3, a distributed computing system based
on the workstation model consists of several workstations interconnected by a
communication network. A company's office or a university department may have several
workstations scattered throughout a building or campus, each workstation equipped with
its own disk and serving as a single-user computer. It has often been found that in such an
environment, at any one time (especially at night), a significant proportion of the
workstations are idle (not being used), resulting in the waste of large amounts of CPU time.
Therefore, the idea of the workstation model is to interconnect all these workstations by
a high-speed LAN so that idle workstations may be used to process jobs of users who are
logged onto other workstations and do not have sufficient processing power at their own
workstations to get their jobs processed efficiently.
[Fig 2.3: Workstation model - several workstations interconnected by a communication network]
workstation does not have sufficient processing power for executing the processes of
the submitted jobs efficiently, it transfers one or more of the processes from the
user's workstation to some other workstation that is currently idle and gets the
processes executed there; finally, the result of execution is returned to the user's
workstation.
2.4.3 Workstation-Server Model: The workstation model is a network of personal
workstations, each with its own disk and a local file system. A workstation with its own
local disk is usually called a diskful workstation and a workstation without a local disk is
called a diskless workstation. With the invention of high-speed networks, diskless
workstations have become more popular in network environments than diskful workstations,
making the workstation-server model more popular than the workstation model for
building distributed computing systems.
As shown in Fig 2.4, a distributed computing system based on the workstation-
server model consists of a few minicomputers and several workstations (most of which
are diskless, but a few of which may be diskful) interconnected by a communication
network.
For a number of reasons, such as higher reliability and better scalability, multiple
servers are often used for managing the resources of a particular type in a distributed
computing system. For example, there may be multiple file servers, each running on a
separate minicomputer and cooperating via the network, for managing the files of all the
users in the system. For this reason, a distinction is often made between the services
that are provided to clients and the servers that provide them. That is, a service is an
abstract entity that is provided by one or more servers. For example, one or more file
servers may be used in a distributed computing system to provide file service to the users.
In this model, a user logs onto a workstation called his or her home workstation.
Normal computation activities required by the user's processes are performed at the user's
home workstation, but requests for services provided by special servers (such as a file
server or a database server) are sent to a server providing that type of service, which
performs the user's requested activity and returns the result of request processing to the
user's workstation. Therefore, in this model, the user's processes need not be migrated
to the server machines for getting the work done by those machines.
[Fig 2.4: Workstation-server model - workstations and minicomputer-based servers interconnected by a communication network]
that this is not true with the workstation model, in which each workstation has
its own local file system, because different mechanisms are needed to access
local and remote files.
4. In the workstation-server model, the request-response protocol is mainly used
to access the services of the server machines. Therefore, unlike the
workstation model, this model does not need a process migration facility,
which is difficult to implement. The request-response protocol is known as the
client-server model of communication. In this model, a client process (which
in this case resides on a workstation) sends a request to a server process (which
in this case resides on a minicomputer) for getting some service, such as
reading a unit of a file. The server executes the request and sends back a reply
to the client that contains the result of processing.
5. A user has guaranteed response time because workstations are not used for
executing remote processes. However, the model does not utilize the
processing capability of idle workstations.
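The request-response exchange of point 4 can be sketched with TCP sockets, the client process sending a request and blocking until the server's reply arrives. The "READ" request format and the file server's canned reply are invented for illustration:

```python
import socket
import threading

def file_server(listener):
    """A minimal server process: accept one connection, receive a request,
    execute it, and send back a reply containing the result."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024).decode()        # e.g. "READ chapter1"
        name = request.split()[1]
        conn.sendall(f"contents of {name}".encode())

listener = socket.socket()
listener.bind(("127.0.0.1", 0))                   # pick any free local port
listener.listen(1)
threading.Thread(target=file_server, args=(listener,), daemon=True).start()

# The client process sends its request and waits for the server's reply.
client = socket.socket()
client.connect(("127.0.0.1", listener.getsockname()[1]))
client.sendall(b"READ chapter1")
reply = client.recv(1024).decode()                # "contents of chapter1"
client.close()
```

Note that only the small request and reply cross the network; the client's process never migrates to the server machine, which is exactly the property claimed for this model.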
2.4.4 Processor-Pool Model: The processor-pool model is based on the observation that
most of the time a user does not need any computing power but once in a while he or she
may need a very large amount of computing power for a short time. Therefore, unlike the
workstation-server model in which a processor is allocated to each user, in the processor-
pool model the processors are pooled together to be shared by the users as needed. The
pool of processors consists of a large number of microcomputers and minicomputers
attached to the network. Each processor in the pool has its own memory to load and run a
system program or an application program of the distributed computing system.
As shown in Fig 2.5, in the pure processor-pool model, the processors in the pool
have no terminals attached directly to them, and users access the system from terminals
that are attached to the network via special devices. A special server (called a run server)
manages and allocates the processors in the pool to different users on a demand basis.
[Fig 2.5: Processor-pool model - terminals, a run server, a file server, and a pool of processors attached to the communication network]
When a user submits a job for computation, the run server temporarily assigns an
appropriate number of processors to his or her job. For example, if the user's computation
job is the compilation of a program having n segments, in which each of the segments
can be compiled independently to produce separate relocatable object files, n processors
from the pool can be allocated to this job to compile all the n segments in parallel. When
the computation is completed the processors are returned to the pool for use by other
users.
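The run server's allocate-compute-release cycle can be sketched with a worker pool. Here `compile_segment` is a stand-in for the real compiler, and a thread pool stands in for the pool of processors (a real run server would hand out whole machines, but the Executor interface models the same cycle):

```python
from concurrent.futures import ThreadPoolExecutor

def compile_segment(segment: str) -> str:
    """Stand-in for compiling one independently compilable program segment
    into a relocatable object file (the actual compiler work is elided)."""
    return segment.replace(".c", ".o")

segments = [f"seg{i}.c" for i in range(4)]   # a program of n = 4 segments

# A run server in miniature: temporarily assign a worker from the pool to
# each segment, compile all n segments in parallel, and release the workers
# back to the pool when the job completes (on leaving the `with` block).
with ThreadPoolExecutor(max_workers=4) as pool:
    objects = list(pool.map(compile_segment, segments))
# objects == ["seg0.o", "seg1.o", "seg2.o", "seg3.o"]
```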
In the processor-pool model there is no concept of a home machine. That is, a
user does not log onto a particular machine but onto the system as a whole. This is in
contrast to other models, in which each user has a home machine (e.g., a workstation or
minicomputer) onto which he or she logs and runs most of his or her programs there by
default.
Amoeba and the Cambridge Distributed Computing System are examples of
distributed computing systems based on the processor-pool model.
2.4.5 Hybrid Model: Out of the four models described above, the workstation-server model is the
most widely used model for building distributed computing systems. This is because a
large number of computer users only perform simple interactive tasks such as editing
jobs, sending electronic mail, and executing small programs. The workstation-server
model is ideal for such simple usage. However, in a working environment that has
groups of users who often perform jobs needing massive computation power, the processor-
pool model is more attractive and suitable.
To combine the advantages of both the workstation-server and processor-pool
models, a hybrid model may be used to build a distributed computing system. This hybrid
model is based on the workstation-server model but with the addition of a pool of
processors. The processors in the pool can be allocated dynamically for computations that
are too large for workstations or that require several computers concurrently for efficient
execution. In addition to efficient execution of computation-intensive jobs, the hybrid
model gives guaranteed response to interactive jobs by allowing them to be processed on
local workstations of the users. However, the hybrid model is more expensive to
implement than the workstation-server model or the processor-pool model.
From the models of distributed computing systems presented above, it is obvious
that distributed computing systems are much more complex and difficult to build than
traditional centralized systems (those consisting of a single CPU, its memory, peripherals,
and one or more terminals). The increased complexity is mainly due to the fact that in
addition to being capable of effectively using and managing a very large number of
distributed resources, the system software of a distributed computing system should also
be capable of handling the communication and security problems that are very different
from those of centralized systems. For example, the performance and reliability of a
distributed computing system depends to a great extent on the performance and reliability
of the underlying communication network. Special software is usually needed to handle
loss of messages during transmission across the network or to prevent overloading of the
network, which degrades the performance and responsiveness to the users. Similarly, special
software security measures are needed to protect the widely distributed shared resources
and services against intentional or accidental violation of access control and privacy
constraints.
Despite the increased complexity and the difficulty of building distributed
computing systems, the advantages of installing and using them outweigh
their disadvantages. The technical needs, the economic pressures, and the major
advantages that have led to the emergence and popularity of distributed computing
systems are described here.
Inherently Distributed Applications: Distributed computing systems come into
existence in some very natural ways. For example, several applications are inherently
distributed in nature and require a distributed computing system for their realization.
For instance, in an employee database of a nationwide organization, the data pertaining
to a particular employee are generated at the employee's branch office, and in addition
to the global need to view the entire database, there is a local need for frequent and
immediate access to locally generated data at each branch office. Such applications
require that some processing power be available at the many distributed locations for
collecting, preprocessing, and accessing data, resulting in the need for distributed
computing systems. Other inherently distributed applications are a computerized
worldwide airline reservation system, a computerized banking system in which a
customer can deposit or withdraw money from his or her account at any branch of the
bank, and a factory automation system controlling robots and machines all along an
assembly line.
Information Sharing among Distributed Users: Another advantage is an efficient
person-to-person communication facility for sharing information over great distances.
In a distributed computing system, the users working at other nodes of the system can
easily and efficiently share information generated by one of the users. This facility may
be useful in many ways. For example, two or more users who are geographically far
apart can work on a common project when their computers are parts of the same
distributed computing system.
Higher Reliability: In a distributed computing system, data can be replicated on
storage devices attached to several processors, and if one of the storage devices fails,
the information can still be used from the other storage device.
Availability: An important aspect of reliability is availability, which refers to the
fraction of time for which a system is available for use. In comparison to a centralized
system, a distributed computing system also enjoys the advantage of increased
availability.
Extensibility and Incremental Growth: Another major advantage of distributed
computing systems is that they are capable of incremental growth. That is, it is
possible to gradually extend the power and functionality of a distributed computing
system by simply adding additional resources (both hardware and software) to the
system as and when the need arises. For example, additional processors can be easily
added to the system to handle the increased workload of an organization that might
have resulted from its expansion. Extensibility is also easier on a distributed computing
system because addition of new resources to an existing system can be performed
without significant disruption of the normal functioning of the system. Properly
designed distributed computing systems that have the property of extensibility and
incremental growth are called open distributed systems.
Better Flexibility in Meeting Users' Needs: Different types of computers are usually
more suitable for performing different types of computations. For example, computers
with ordinary power are suitable for ordinary data-processing jobs, whereas high-
performance computers are more suitable for complex mathematical computations. In a
centralized system, the users have to perform all types of computations on the only
available computer.
2.5 Distributed Operating System (DOS):
Definition: It is defined as a program that controls the resources of a computer system
and provides its users with an interface or virtual machine that is more convenient to use
than the bare machine. According to this definition, the two primary tasks of an operating
system are as follows:
o To present users with a virtual machine that is easier to program than the
underlying hardware.
o To manage the various resources of the system. This involves performing such
tasks as keeping track of who is using which resource, granting resource requests,
accounting for resource usage, and mediating conflicting requests from different
programs and users.
The Classification: The operating systems commonly used for distributed computing
systems can be broadly classified into two types - network operating systems and
distributed operating systems.
The three most important features used to differentiate these two types of operating systems are:
System Image: The most important feature used to differentiate between the two
types of operating systems is the image of the distributed computing system from
the point of view of its users. In case of a network operating system, the users
view the distributed computing systems as a collection of distinct machines
connected by a communication subsystem. That is, the users are aware of the fact
that multiple computers are being used. On the other hand, a distributed
operating system hides the existence of multiple computers and provides a single-
system image to its users. That is, it makes a collection of networked machines act
as a virtual uniprocessor.
Autonomy: A network operating system is built on a set of existing centralized
operating systems and handles the interface and coordination of remote operations
and communications between these operating systems. That is, in the case of a
network operating system, each computer of the distributed computing system has
its own local operating system (the operating systems of different computers may
be the same or different), and there is essentially no coordination at all among the
computers except for the rule that when two processes of different computers
communicate with each other, they must use a mutually agreed on communication
protocol. Each computer functions independently of other computers in the sense
that each one makes independent decisions about the creation and termination of
its own processes and the management of local resources. Notice that due to the
possibility of difference in local operating systems, the system calls for different
computers of the same distributed computing system may be different in this case.
On the other hand, with a distributed operating system, there is a single
system-wide operating system and each computer of the distributed computing system
runs a part of this global operating system. The distributed operating system
tightly interweaves all the computers of the distributed computing system in the
sense that they work in close cooperation with each other for the efficient and
effective utilization of the various resources of the system. That is, processes and
several resources are managed globally (some resources are managed locally).
Moreover, there is a single set of globally valid system calls available on all
computers of the distributed computing system.
In short, it can be said that the degree of autonomy of each machine of a
distributed computing system that uses a network operating system is
considerably high as compared to that of machines of a distributed computing
system that uses a distributed operating system.
Fault tolerance capability: A network operating system provides little or no fault
tolerance capability in the sense that if 10% of the machines of the entire
distributed computing system are down at any moment, at least 10% of the users
are unable to continue with their work. On the other hand, with a distributed
operating system, most of the users are normally unaffected by the failed
machines and can continue to perform their work normally, with only a 10% loss
in performance of the entire distributed computing system. Therefore, the fault
tolerance capability of a distributed operating system is usually very high as
compared to that of a network operating system.
Some Important Points to be noted with respect to DOS:
A distributed operating system is one that looks to its users like an ordinary
centralized operating system but runs on multiple, independent central
processing units (CPUs). The key concept here is transparency. In other words,
the multiple processors should be invisible (transparent) to the user. Another
way of expressing the same idea is to say that the user views the system as a
virtual uniprocessor, not as a collection of distinct machines.
A distributed computing system that uses a network operating system is
usually referred to as a network system, whereas one that uses a distributed
operating system is usually referred to as a true distributed system.
A distributed operating system is expected to provide several forms of transparency:
o Access Transparency: Users should be able to
access remote resources in the same way as local resources. That is, the
user interface, which takes the form of a set of system calls, should not
distinguish between local and remote resources, and it should be the
responsibility of the distributed operating system to locate the resources
and to arrange for servicing user requests in a user-transparent manner.
o Location Transparency: The two main aspects of location transparency are
as follows:
Name transparency: This refers to the fact that the name of a
resource (hardware or software) should not reveal any hint as to
the physical location of the resource. That is, the name of a
resource should be independent of the physical connectivity or
topology of the system or the current location of the resource.
Furthermore, resources that are capable of being moved
from one node to another in a distributed system (such as a file) must
be allowed to move without having their names changed.
Therefore, resource names must be unique system-wide.
User mobility: This refers to the fact that no matter which
machine a user is logged onto, he or she should be able to access a
resource with the same name. That is, the user should not be
required to use different names to access the same resource from
two different nodes of the system.
o Replication Transparency: For better performance and reliability,
almost all distributed operating systems have the provision to create
replicas (additional copies) of files and other resources on different nodes
of the distributed system. In these systems, both the existence of multiple
copies of a replicated resource and the replication activity should be
transparent to the users. That is, the two important issues related to replication
transparency are naming of replicas and replication control. It is the
responsibility of the system to name the various copies of a resource and
to map a user-supplied name of the resource to an appropriate replica of
the resource.
Kernel design: The kernel of an operating system is its central controlling part
that provides basic system facilities. It operates in a separate address space that a
user cannot replace or modify. The two commonly used models for kernel design
in distributed operating systems are the monolithic kernel and the microkernel.
o In the monolithic kernel model, the kernel provides most operating system
services, such as process management and inter-process communication. As a
result, the kernel has a large, monolithic structure. Many distributed operating
systems that are extensions or imitations of the UNIX operating system use
the monolithic kernel model. This is mainly because UNIX itself has a large,
monolithic kernel.
o In the microkernel model, the main goal is to keep the kernel as small as
possible. Therefore, in this model, the kernel is a very small nucleus of
software that provides only the minimal facilities necessary for implementing
additional operating system services. The only services provided by the kernel
in this model are inter-process communication, low-level device management,
and some memory management. All other operating system services, such as
file management, name management, additional process and memory
management activities, and much system call handling, are implemented as
user-level server processes.
[Figure: nodes 1 to n connected by network hardware]
Security: In order that the users can trust the system and rely on it, the various
resources of a computer system must be protected against destruction and
unauthorized access. Enforcing security in a distributed system is more difficult
than in a centralized system because of the lack of a single point of control and
the use of insecure networks for data communication. In a centralized system, all
users are authenticated by the system at login time, and the system can easily
check whether a user is authorized to perform the requested operation on an
accessed resource. In a distributed system, however, since the client-server model
is often used for requesting and providing services, when a client sends a request
message to a server, the server must have some way of knowing who the client is.
This is not so simple as it might appear because any client identification field in
the message cannot be trusted. This is because an intruder (a person or program
trying to obtain unauthorized access to system resources) may pretend to be an
authorized client or may change the message contents during transmission.
Therefore, as compared to a centralized system, enforcement of security in a
distributed system has the following additional requirements:
1. It should be possible for the sender of a message to know that the
intended receiver received the message.
2. It should be possible for the receiver of a message to know that the
message was sent by the genuine sender.
3. It should be possible for both the sender and receiver of a message to
be guaranteed that the contents of the message were not changed while
it was in transfer.
Emulation of Existing Operating Systems: For commercial success, it is
important that a newly designed distributed operating system be able to emulate
existing popular operating systems such as UNIX. With this property, new software
can be written using the system call interface of the new operating system to take
full advantage of its special features of distribution, while a vast amount of already
existing old software can also be run on the same system without the need to
rewrite it. Therefore, moving to the new distributed operating system
allows both types of software to be run side by side.
2.8 Introduction to Distributed Computing Environment (DCE):
The Open Software Foundation (OSF), a consortium of computer manufacturers including
IBM, DEC, and Hewlett-Packard, defined a vendor-independent distributed computing
environment (DCE).
2.8.1 DCE: It is not an operating system, nor is it an application. Rather, it is
an integrated set of services and tools that can be installed as a coherent environment on
top of existing operating systems and serve as a platform for building and running
distributed applications.
A primary goal of DCE is vendor independence. It runs on many different kinds
of computers, operating systems, and networks produced by different vendors. For
example, some operating systems to which DCE can be easily ported include OSF/1, AIX,
DOMAIN OS, ULTRIX, HP-UX, SINIX, SunOS, UNIX System V, VMS, WINDOWS,
and OS/2. On the other hand, it can be used with any network hardware and transport
software, including TCP/IP and X.25, as well as other similar products.
As shown in the figure below, DCE is a middleware layer between the
DCE applications layer and the operating system and networking layer. The basic idea is
to take a collection of existing machines (possibly from different vendors), interconnect
them by a communication network, add the DCE software platform on top of the native
operating systems of the machines, and then be able to build and run distributed
applications. Each machine has its own local operating system, which may be different
from that of other machines. The DCE software layer on top of the operating system and
networking layer hides the differences between machines by automatically performing
data-type conversions when necessary.
2.8.2 DCE Components: It is a mix of various technologies developed independently
and nicely integrated by OSF. Each of these technologies forms a component of DCE.
The main components of DCE are as follows:
[Figure: the DCE layered architecture — DCE applications on top of the DCE software layer, which in turn runs on the operating system and networking layer of each machine]
o Threads package
o Remote Procedure Call (RPC) facility
o Distributed Time Service (DTS)
o Name Services
o Security Service
o Distributed File Service (DFS): provides a system-wide file service with high
performance and high availability. A unique feature of DCE DFS is that it can also
provide file services to clients of other file systems.
2.8.3 DCE Cells: The DCE system is highly scalable in the sense that a system running
DCE can have thousands of computers and millions of users spread over a worldwide
geographic area. To accommodate such large systems, DCE uses the concept of cells.
This concept helps break down a large system into smaller, manageable units called cells.
In a DCE system, a cell is a group of users, machines, or other resources that
typically have a common purpose and share common DCE services. The minimum cell
configuration requires a cell directory server, a security server, a distributed time server,
and one or more client machines. Each DCE client machine has client processes for the
security service, cell directory service, distributed time service, RPC facility, and threads
facility. A DCE client machine may also have a process for the distributed file service if the
cell configuration has a DCE distributed file server. Due to the use of the method of
intersection for clock synchronization, it is recommended that each cell in a DCE system
have at least three distributed time servers.
Check Your Progress 1: Answer the following questions in one or two sentences.
a) Define a distributed computing system.
b) List the different computing system models.
c) What is a DCE?
d) Define transparency in DCS.
e) List the various forms of transparencies expected by a DCS.
2.9 Summary:
Let us sum up the different concepts we have studied till here.
A distributed computing system is a collection of processors
interconnected by a communication network in which each processor has
its own local memory and other peripherals and communication between
any two processors of the system takes place by message passing over the
communication network.
The existing models for distributed computing systems can be broadly
classified into five categories: minicomputer, workstation, workstation-server,
processor-pool, and hybrid.
Distributed computing systems are much more complex and difficult to
build than traditional centralized systems. Despite the increased
complexity and the difficulty of building them, the installation and use of
distributed computing systems are rapidly increasing. This is mainly
because the advantages of distributed computing systems outweigh their
disadvantages.
The main advantages of distributed computing systems are (a) suitability
for inherently distributed applications, (b) sharing of information among
distributed users, (c) sharing of resources, (d) better price-performance
ratio, (e) shorter response times and higher throughput, (f) higher reliability,
(g) extensibility and incremental growth, and (h) better flexibility in
meeting users' needs.
The operating systems commonly used for distributed computing systems
can be broadly classified into two types: network operating systems and
distributed operating systems. As compared to a network operating
system, a distributed operating system has better transparency and fault
tolerance capability and provides the image of a virtual uniprocessor to the users.
The main issues involved in the design of a distributed operating system are
transparency, reliability, flexibility, performance, scalability, heterogeneity,
security, and emulation of existing operating systems.
UNIT 3
3.0 Objectives
3.1 Introduction
3.2 Review of Databases
3.2.1 The Relational Model
3.2.2 The Relational Operations
3.3 Distributed Processing- An Introduction
3.4 Features of Distributed Versus Centralized Databases
3.5 Uses of Distributed Databases
3.6 Distributed Database Management Systems
3.7 Summary
3.0 Objectives: The main objectives of this unit are:
Familiarization of
Basic Database Concepts
Distributed processing
Distributed Databases
Distributed Database Management Systems
3.1 Introduction:
In recent years databases have become an important area of information processing, and
it is easy to predict that this importance will rapidly grow. There are both organizational
and technological reasons for this trend. Distributed databases eliminate many of the
problems of centralized databases and fit more naturally into the decentralized structures of
many organizations. A loose definition of a distributed database is that it is a
collection of data which belong logically to the same system but are spread over the sites of a
computer network. To understand this new way of data processing, this unit first
introduces the basic concepts of databases. Later an overview of distributed data
processing is given. In the next section, we differentiate the conventional centralized
database system from the distributed database system. Finally, Distributed Data Base
Management Systems (DDBMS) are discussed.
3.2 Review of Databases:
In this section we discuss some basic concepts of databases, which are necessary for
understanding the remaining topics of this unit. A database model known as the
relational model is used. Basically, the relational model allows the use of powerful, set-
oriented, associative expressions instead of the one-record-at-a-time primitives of more
procedural models like the Codasyl model. We use in this course two different data
manipulation languages: relational algebra and SQL. SQL, which is a user-friendly
language, is used for dealing with the problems of writing application programs for
distributed databases. Relational algebra is instead used for describing and manipulating
access strategies to distributed databases.
3.2.1 The Relational Model: Here some basic terminology is defined:
o Relations: the tables that are used for storing data in relational
databases.
o Attributes: each relation has a (fixed) number of columns, called attributes.
o Tuples: the rows of a relation; their number is dynamic and varies over time.
o The number of attributes of a relation is called its grade.
o The number of tuples is called its cardinality.
o The set of possible values of a given attribute is called its domain.
An example relation, EMP, demonstrating the above terminology (grade 4, cardinality 5):
EMPNUM NAME AGE DEPTNUM
2 Vasu 25 1
7 Varma 34 2
11 Rama 18 1
12 Krishna 22 3
14 Hari 30 1
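The terminology above can be made concrete with a small sketch. Here the EMP relation from the example is encoded in Python as a schema (tuple of attribute names) plus a set of rows; this encoding is ours, chosen for illustration, not part of the relational model itself.

```python
# The relation EMP from the example above, modeled as a schema plus a set
# of tuples. A set captures the fact that a relation has no duplicate rows.
EMP_SCHEMA = ("EMPNUM", "NAME", "AGE", "DEPTNUM")
EMP = {
    (2, "Vasu", 25, 1),
    (7, "Varma", 34, 2),
    (11, "Rama", 18, 1),
    (12, "Krishna", 22, 3),
    (14, "Hari", 30, 1),
}

grade = len(EMP_SCHEMA)    # number of attributes
cardinality = len(EMP)     # number of tuples (varies over time)

# The domain is the set of *possible* values of an attribute; the best we
# can compute from the data is the active domain (values actually present).
dept_index = EMP_SCHEMA.index("DEPTNUM")
domain_deptnum = {row[dept_index] for row in EMP}

print(grade)                   # 4
print(cardinality)             # 5
print(sorted(domain_deptnum))  # [1, 2, 3]
```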
3.2.2 The Relational Operations: The operations of relational algebra can be composed
into arbitrarily complex expressions. Expressions allow specifying the manipulations
required on relations in order to retrieve information from them.
o The different operations that are possible on relations: Five basic
operations are defined: selection, projection, Cartesian product, union, and
difference. From these, other operations are derived, such as
intersection, division, join, and semi-join. We highlight those
operations which have an important application in distributed databases, such as
join and semi-join. The meaning of each operation is described with reference to an
example.
Unary operations take only one relation as operand; they include selection
and projection.
The selection SLF R, where R is the operand to which the selection is
applied and F is a formula that expresses a selection predicate, produces a
result relation with the same relation schema as the operand relation,
containing the subset of the tuples of the operand which satisfy the
predicate. The formula involves attribute names or constants as operands,
arithmetic comparison operators, and logical operators.
The projection PJAttr R, where Attr denotes a subset of the attributes of the
operand relation, produces a result having these attributes as its relation
schema.
Binary operations take two relations as operand; we review union,
difference, Cartesian product, join and semi-join.
The union R UN S is meaningful only between two relations R and S
with the same relation schema; it produces a relation with the same
relation schema as its operands and the union of the tuples of R and S (i.e
all tuples appearing either in R or in S or in both)
The difference R DF S is meaningful only between two relations R and S
with the same relation schema; it produces a relation with the same
relation schema as its operands and the difference between the tuples of R
and S (i.e all tuples appearing in R but not in S)
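As an illustrative sketch, the unary operations and the set-based binary operations above can be written directly over relations encoded as Python sets of tuples. The encoding, the function names (SL, PJ, UN, DF, mirroring the operators in the text), and the relation contents are all our own invention for demonstration.

```python
# Relations as sets of tuples over a fixed attribute list (a simplified
# encoding assumed for this sketch). R and S have the same schema here,
# as union and difference require.
R_ATTRS = ("A", "B", "C")
R = {("a", 1, "a"), ("a", 1, "d"), ("b", 2, "f"), ("b", 1, "b")}
S = {("a", 1, "a"), ("a", 3, "f")}

def SL(predicate, rel):
    """Selection: keep the tuples satisfying the predicate."""
    return {t for t in rel if predicate(t)}

def PJ(attrs, rel, rel_attrs):
    """Projection: keep only the named attributes (duplicates collapse)."""
    idx = [rel_attrs.index(a) for a in attrs]
    return {tuple(t[i] for i in idx) for t in rel}

def UN(r, s):
    """Union: all tuples appearing in r, in s, or in both."""
    return r | s

def DF(r, s):
    """Difference: all tuples appearing in r but not in s."""
    return r - s

print(sorted(SL(lambda t: t[1] == 1, R)))  # [('a', 1, 'a'), ('a', 1, 'd'), ('b', 1, 'b')]
print(sorted(PJ(("A",), R, R_ATTRS)))      # [('a',), ('b',)]
print(sorted(DF(R, S)))                    # [('a', 1, 'd'), ('b', 1, 'b'), ('b', 2, 'f')]
```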
[Figure 3.2 Operations of relational algebra: example relations and the results of (e) difference R DF S, (f) Cartesian product R CP S, (g) join R JN R.C=T.C T, (h) natural join R NJN T, (i) semi-join R SJ R.C=T.C T, and (j) natural semi-join R NSJ T]
The semi-join of two relations R and S is denoted R SJF S, where F is a
formula that specifies a join predicate. A semi-join is derived from
projection and join as follows:
R SJF S = PJAttr(R) (R JNF S)
where Attr(R) is the set of all attributes of R.
The natural semi-join of two relations R and S, denoted R NSJ S, is
obtained by considering a semi-join with the same join predicate as in the
natural join.
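The derivation R SJF S = PJAttr(R) (R JNF S) can be checked with a small sketch: the semi-join is computed exactly as the projection of the join back onto R's attributes. The relation contents and function names below are invented for illustration.

```python
# Relations as sets of tuples; T_ATTRS is listed only for reference.
R_ATTRS = ("A", "B", "C")
T_ATTRS = ("B", "C", "D")
R = {("a", 1, "a"), ("a", 1, "d"), ("b", 2, "f")}
T = {(1, "a", 1), (3, "f", 4)}

def join(r, s, pred):
    """Theta-join: concatenate every pair of tuples satisfying the predicate."""
    return {t + u for t in r for u in s if pred(t, u)}

def semi_join(r, s, pred, r_arity):
    """R SJF S = PJAttr(R)(R JNF S): the tuples of R that have a join partner in S."""
    return {t[:r_arity] for t in join(r, s, pred)}

# Join predicate R.C = T.C (attribute C is index 2 in R, index 1 in T).
result = semi_join(R, T, lambda t, u: t[2] == u[1], len(R_ATTRS))
print(sorted(result))   # [('a', 1, 'a'), ('b', 2, 'f')]
```

A useful property visible here is that the semi-join result is always a subset of R, which is why semi-joins are attractive in distributed query processing: only R's attributes (and only the matching tuples) ever need to be shipped.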
o SQL Statement: A simple statement in SQL has the following structure:
Select (attribute list)
From (relation name)
Where (predicates)
The interpretation of this statement is equivalent to performing a selection
operation using the predicates of the where clause on the relation specified in
the from clause and then projecting the result on the attributes of the select
clause.
Example 1: Consider the relation EMP of Figure 2.1a. The statement:
Select NAME, AGE
From EMP
Where AGE > 20 AND DEPTNUM = 1
returns the following relation:
NAME AGE
Vasu 25
Hari 30
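The statement can be checked mechanically. The sketch below evaluates Example 1 with SQLite standing in as the relational engine; using SQLite here is an assumption of this sketch, not something the text prescribes.

```python
import sqlite3

# Build the EMP relation in an in-memory SQLite database and run Example 1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMP (EMPNUM INT, NAME TEXT, AGE INT, DEPTNUM INT)")
con.executemany(
    "INSERT INTO EMP VALUES (?, ?, ?, ?)",
    [(2, "Vasu", 25, 1), (7, "Varma", 34, 2), (11, "Rama", 18, 1),
     (12, "Krishna", 22, 3), (14, "Hari", 30, 1)],
)

rows = con.execute(
    "SELECT NAME, AGE FROM EMP WHERE AGE > 20 AND DEPTNUM = 1"
).fetchall()
print(sorted(rows))   # [('Hari', 30), ('Vasu', 25)]
```

Only the tuples with AGE above 20 in department 1 survive the selection, and the projection then keeps the NAME and AGE columns, matching the relational-algebra reading of the statement.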
3.3 Distributed Processing - An Introduction:
Nowadays distributed processing is an important area of information processing, and
it is implemented using the concept of distributed databases. A distributed database is a
collection of data which belong logically to the same system but are spread over the sites
of a computer network. The two important aspects of a distributed database are:
Distribution: the data do not reside at a single centralized place.
Logical correlation: the data are tied together by some logical properties.
Let us understand the above terminology by taking an example. Consider a bank
transaction. Here the bank has three branches at different locations. At each branch a
computer controls the teller terminals of the branch and the account database of the
branch (Figure 2.1). Each local branch database is termed a SITE of the distributed
database, connected by a communication network. All the local transactions are managed
by these local computers and are therefore called local applications. An example
of a local application is a debit or a credit application performed on an account stored at the
same branch at which the application is requested.
There are also applications which access data at more than one branch,
known as global applications or distributed applications: for example, the transfer
of funds from an account of one branch to an account of another branch. This is not a
simple issue, as it involves updating the databases at two different places.
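The difficulty of a global application like a fund transfer can be made concrete with a sketch: two updates at two different sites must succeed or fail together. The branch data structures, account names, and function below are all invented for illustration.

```python
# Two branch databases at two different sites, each mapping account -> balance.
branch1 = {"acc100": 500}   # accounts database at site 1
branch2 = {"acc200": 300}   # accounts database at site 2

def transfer(src_db, src, dst_db, dst, amount):
    """Move `amount` between accounts held at two different sites.

    If the debit at one site succeeded but the credit at the other site
    failed (e.g. a crash between the two updates), money would be lost.
    Real DDBMSs therefore wrap the two updates in a distributed transaction
    committed atomically, e.g. via two-phase commit.
    """
    if src_db[src] < amount:
        raise ValueError("insufficient funds")
    src_db[src] -= amount   # update at site 1
    dst_db[dst] += amount   # update at site 2: must commit together

transfer(branch1, "acc100", branch2, "acc200", 200)
print(branch1["acc100"], branch2["acc200"])   # 300 500
```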
We can now summarize the above aspects and refine the definition of a
distributed database as a collection of data which are distributed over different
computers of a computer network. Each site of the network has autonomous processing
capability and can perform local applications; it also participates in at least one
global application, which requires accessing data at several sites using a communication
subsystem. The most important property expected from the system is cooperation
between the autonomous sites.
The main features distinguishing distributed from centralized databases include:
Performance aspects
Increased reliability and availability
3.6 Distributed Data Base Management Systems (DDBMS):
A Distributed Database Management System supports the creation and management of
distributed databases. The software components required for building a distributed database are:
The database management component (DB)
The data communication component (DC)
The data dictionary (DD), which describes the data distribution in the entire
network
The distributed database component (DDB)
Fig. 3.3 shows the interaction between the above components.
The services supported by the above components are:
Remote database access to an application program.
Some degree of distribution transparency
Support for database administration and control
Support for concurrency control and recovery of distributed transactions
The different types of distributed database access available are:
a. Remote access via DBMS primitives: Here the application issues a request which
refers to remote data (Fig. 3.4). The request is automatically routed by
the DDBMS to the site where the data are located; there it is executed, and the
corresponding result is returned.
b. Remote access via auxiliary program: Here the application requires an auxiliary
program to be executed at the remote site; this program accesses the remote database and
returns the result to the requesting application (Fig. 3.5).
[Fig. 3.3 Components of a commercial DDBMS: at each site, terminals (T) interact with the DB, DC, and DDB components and a local data dictionary (DD); the sites are connected by a network]
[Fig. 3.4 Remote access via DBMS primitives: the application program at site 1 issues a database access primitive that is routed to the DBMS at site 2, which accesses database 2 and returns the result]
[Fig. 3.5 Remote access via auxiliary program: the application at site 1 invokes an auxiliary program at site 2, which issues database access primitives against the local DBMS and returns a global result]
UNIT-4
Structure
4.0 Objectives
4.1 Introduction
4.2 Reference Architecture for distributed databases
4.3 Types of data fragmentation
4.3.1 Horizontal Fragmentation
4.3.2 Derived Horizontal Fragmentation
4.3.3 Vertical Fragmentation
4.3.4 Mixed Fragmentation
4.4 Integrity Constraints in Distributed Databases
4.5 Summary
fragmentation methods are discussed. Also, the integrity constraints in distributed
databases are explained.
4.2 Reference Architecture for Distributed Databases:
Here we have suggested reference architecture for the distributed databases as shown in
the fig.4.1.The different levels are conceptually helpful to understand the functioning of
the whole system. The various stages of the architecture are as follows.
Global Schema: defines all the data contained in the distributed
database as if it were a centralized system, in terms of a set of global relations.
Fragmentation Schema: each global relation is split into several non-
overlapping portions called fragments. The mapping between the
global relations and the fragments is defined in the fragmentation schema. It is a
one-to-many mapping: several fragments correspond to one global
relation, but only one global relation corresponds to each fragment. Fragments are
indicated as Ri, the ith fragment of the global relation R.
Allocation Schema: Fragments are logical portions of a
global relation that are physically placed at one or more sites of the
network. The allocation schema defines at which site(s) a fragment is located. Note
that, depending on the requirements, more than one fragment may be
allocated at a site, and a fragment may be allocated at several sites; this mapping
therefore determines whether the system is redundant or non-redundant.
Local Mapping Schema: We have already described the relationships
between the objects at the three top levels of this architecture. These three
levels are site independent; therefore, they do not depend on the data model of
the local DBMSs. At a lower level, it is necessary to map the physical images
to the objects that are manipulated by the local DBMSs. This mapping is
called a local mapping schema and depends on the type of local DBMS;
therefore, in a heterogeneous system we have different types of local mappings
at different sites.
[Fig. 4.1 A reference architecture for distributed databases: the site-independent schemas (Global Schema, Fragmentation Schema, Allocation Schema) sit above one Local Mapping Schema per site, each of which maps onto the DBMS and local database of that site.]
72
We can generalize from the above example that, in order to satisfy the completeness
condition, the set of qualifications of all fragments must be complete, at least with respect
to the set of allowed values. The reconstruction condition is always satisfied through the
union operation, and the disjointness condition requires that the qualifications be mutually
exclusive.
4.3.2 Derived Horizontal Fragmentation: This is a type of fragmentation that is
derived from the horizontal fragmentation of another relation.
Example: Consider a global relation
SUPPLY (SNUM, PNUM, DEPTNUM, QUAN)
where SNUM is a supplier number. If a fragment is required to contain the
tuples for suppliers that are in a given city, then we have to use derived
fragmentation. A semi-join with the fragments SUPPLIER1 and SUPPLIER2 is
needed in order to determine the tuples of SUPPLY that correspond to the suppliers in
a given city. The derived fragmentation of SUPPLY can therefore be defined as follows:
SUPPLY1 = SUPPLY SJ SNUM=SNUM SUPPLIER1
SUPPLY2 = SUPPLY SJ SNUM=SNUM SUPPLIER2
The reconstruction of the global relation SUPPLY can be performed through the union
operation as was shown for SUPPLIER.
The completeness of the above fragmentation requires that there be no supplier numbers
in the SUPPLY relation that are not also contained in the SUPPLIER relation. This is a
typical, and reasonable, integrity constraint for this database and is usually called a
referential integrity constraint.
The disjointness condition is satisfied if no tuple of the SUPPLY relation
corresponds to two tuples of the SUPPLIER relation that belong to two different
fragments. In this case the condition is easily verified, because the supplier numbers are
unique keys of the SUPPLIER relation.
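A minimal sketch of this derived fragmentation, representing relations as lists of Python dictionaries (the supplier and supply tuples are invented for illustration):

```python
# Fragments of SUPPLIER, horizontally fragmented by CITY.
SUPPLIER1 = [{"SNUM": "S1", "CITY": "SF"}, {"SNUM": "S2", "CITY": "SF"}]
SUPPLIER2 = [{"SNUM": "S3", "CITY": "LA"}]

SUPPLY = [
    {"SNUM": "S1", "PNUM": "P1", "DEPTNUM": 5,  "QUAN": 100},
    {"SNUM": "S3", "PNUM": "P2", "DEPTNUM": 12, "QUAN": 250},
]

def semijoin(r, s, attr):
    """r SJ attr=attr s: tuples of r whose attr value appears in s."""
    keys = {t[attr] for t in s}
    return [t for t in r if t[attr] in keys]

# Derived horizontal fragmentation of SUPPLY.
SUPPLY1 = semijoin(SUPPLY, SUPPLIER1, "SNUM")
SUPPLY2 = semijoin(SUPPLY, SUPPLIER2, "SNUM")

# Reconstruction: the union of the fragments recovers SUPPLY.
assert sorted(SUPPLY1 + SUPPLY2, key=lambda t: t["SNUM"]) == \
       sorted(SUPPLY, key=lambda t: t["SNUM"])
# Disjointness: holds because SNUM is a key of SUPPLIER, so each
# supplier number lies in exactly one SUPPLIER fragment.
assert not {t["SNUM"] for t in SUPPLY1} & {t["SNUM"] for t in SUPPLY2}
```

The assertions check exactly the two correctness conditions discussed above: reconstruction by union and disjointness of the derived fragments.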
4.3.3 Vertical Fragmentation:
The vertical fragmentation of a global relation is the subdivision of its attributes into
groups; fragments are obtained by projecting the global relation over each group. This
can be useful in distributed databases where each group of attributes can contain data that
have common geographical properties. The fragmentation is correct if each attribute is
mapped into at least one attribute of the fragments; moreover, it must be possible to
reconstruct the original relation by joining the fragments together.
Example: Consider a global relation
EMP (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
A vertical fragmentation of this relation can be defined as
EMP1 = PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP)
EMP2 = PJ EMPNUM, SAL, TAX (EMP)
The reconstruction of relation EMP can be obtained as
EMP = EMP1 JN EMPNUM=EMPNUM EMP2
This is because EMPNUM is a key of EMP.
Let us draw some important points to be noted from this example.
The purpose of including the key of the global relation into each
fragment is to ensure the reconstruction property.
An alternative way to provide the reconstruction property is to generate
tuple identifiers that are used as system-controlled keys. This can be
convenient in order to avoid the replication of large keys; moreover,
users cannot modify tuple identifiers.
Let us finally consider the problem of fragment disjointness. First, we have seen that at
least the key should be replicated in all fragments in order to allow reconstruction. In
fact, if we include the same attribute in two different vertical fragments, we know
that the column corresponding to this attribute is replicated.
For example, consider the following vertical fragmentation of relation EMP:
EMP1 = PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP)
EMP2 = PJ EMPNUM, NAME, SAL, TAX (EMP)
The attribute NAME is replicated in both fragments. We can remove this replicated
attribute when we reconstruct relation EMP through an additional projection operation:
EMP = EMP1 JN EMPNUM=EMPNUM PJ EMPNUM, SAL, TAX (EMP2)
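The projection and join-based reconstruction above can be sketched as follows (the sample EMP tuples are invented for illustration):

```python
EMP = [
    {"EMPNUM": 1, "NAME": "Rao",  "SAL": 900, "TAX": 90,
     "MGRNUM": 7, "DEPTNUM": 5},
    {"EMPNUM": 2, "NAME": "Iyer", "SAL": 700, "TAX": 70,
     "MGRNUM": 7, "DEPTNUM": 12},
]

def project(rel, attrs):
    """PJ attrs (rel): keep only the listed attributes of each tuple."""
    return [{a: t[a] for a in attrs} for t in rel]

# Vertical fragments; the key EMPNUM is replicated in both.
EMP1 = project(EMP, ["EMPNUM", "NAME", "MGRNUM", "DEPTNUM"])
EMP2 = project(EMP, ["EMPNUM", "SAL", "TAX"])

def join_on(r, s, attr):
    """Natural join restricted to equality on one attribute."""
    out = []
    for t in r:
        for u in s:
            if t[attr] == u[attr]:
                out.append({**t, **u})
    return out

# Because EMPNUM is a key of EMP, the join reconstructs EMP exactly.
assert join_on(EMP1, EMP2, "EMPNUM") == EMP
```

Dropping EMPNUM from either fragment would make the final assertion fail, which is precisely why the key (or a system-controlled tuple identifier) must appear in every vertical fragment.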
4.3.4 Mixed Fragmentation: The fragments obtained by the above
fragmentation operations are relations themselves, so it is possible to apply the
fragmentation operations recursively, provided that the correctness conditions are
satisfied each time. The reconstruction can be obtained by applying the reconstruction
rules in reverse order.
Example: Consider the same global relation
EMP (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
The following is a mixed fragmentation, which is obtained by applying the vertical
fragmentation of the previous example, followed by a horizontal fragmentation on
DEPTNUM:
EMP1 = SL DEPTNUM≤10 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP2 = SL 10<DEPTNUM≤20 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP3 = SL DEPTNUM>20 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP4 = PJ EMPNUM, NAME, SAL, TAX (EMP)
[Fragmentation tree of EMP: a vertical fragmentation (v) of EMP yields EMP4 and an intermediate fragment, which a horizontal fragmentation (h) splits into EMP1, EMP2, and EMP3.]
The EXAMPLE_DDB:
The following shows the global and fragmentation schemata of EXAMPLE_DDB.
Most of the global relations of EXAMPLE_DDB and their fragmentations have
already been introduced. A DEPT relation, horizontally fragmented into three fragments on the
value of the DEPTNUM attribute, is added.
Global schema
EMP (EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
DEPT (DEPTNUM, NAME, AREA, MGRNUM)
SUPPLIER (SNUM, NAME, CITY)
SUPPLY (SNUM, PNUM, DEPTNUM, QUAN)
Fragmentation schema
EMP1 = SL DEPTNUM≤10 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP2 = SL 10<DEPTNUM≤20 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP3 = SL DEPTNUM>20 (PJ EMPNUM, NAME, MGRNUM, DEPTNUM (EMP))
EMP4 = PJ EMPNUM, NAME, SAL, TAX (EMP)
DEPT1 = SL DEPTNUM≤10 (DEPT)
DEPT2 = SL 10<DEPTNUM≤20 (DEPT)
DEPT3 = SL DEPTNUM>20 (DEPT)
SUPPLIER1 = SL CITY="SF" (SUPPLIER)
SUPPLIER2 = SL CITY="LA" (SUPPLIER)
SUPPLY1 = SUPPLY SJ SNUM=SNUM SUPPLIER1
SUPPLY2 = SUPPLY SJ SNUM=SNUM SUPPLIER2
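The horizontal fragmentation of DEPT above can be sketched as follows (the department tuples are invented for illustration):

```python
DEPT = [
    {"DEPTNUM": 3,  "NAME": "Sales",    "AREA": "NORTH", "MGRNUM": 10},
    {"DEPTNUM": 15, "NAME": "Stores",   "AREA": "SOUTH", "MGRNUM": 11},
    {"DEPTNUM": 27, "NAME": "Research", "AREA": "SOUTH", "MGRNUM": 12},
]

def select(rel, pred):
    """SL pred (rel): keep the tuples satisfying the qualification."""
    return [t for t in rel if pred(t)]

DEPT1 = select(DEPT, lambda t: t["DEPTNUM"] <= 10)
DEPT2 = select(DEPT, lambda t: 10 < t["DEPTNUM"] <= 20)
DEPT3 = select(DEPT, lambda t: t["DEPTNUM"] > 20)

# Completeness: the three qualifications cover all DEPTNUM values, so
# every tuple lands in some fragment; reconstruction is the union.
assert DEPT1 + DEPT2 + DEPT3 == DEPT  # order matches as DEPT is sorted
```

The qualifications are mutually exclusive, so the disjointness condition also holds by construction.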
4.4 Integrity Constraints In Distributed Databases:
When an update performed by a database application violates an integrity constraint, the
update is rejected and thus the correctness of the data is preserved. A typical example of
an integrity constraint is referential integrity, which requires that all values of a given
attribute of a relation also exist in some other relation. This constraint is particularly
useful in distributed databases for ensuring the correctness of a derived fragmentation. For
example, since the SUPPLY relation has a fragmentation which is derived from that of the
SUPPLIER relation by means of a semi-join on the SNUM attribute, it is required that
all values of SNUM in SUPPLY also be present in SUPPLIER.
UNIT-5
DISTRIBUTED DATABASE DESIGN
Structure
5.0 Objectives
5.1 Introduction
5.2 A Framework for Distributed Database Design
5.2.1 Objectives of the Design of Data Distribution
5.2.2 Top-Down and Bottom-Up Approaches: Classical Design
Methodologies
5.3 The Design of Database Fragmentation
5.3.1 Horizontal Fragmentation
5.3.1.1 Primary Fragmentation
5.3.1.2 Derived Horizontal Fragmentation
5.3.2 Vertical Fragmentation
5.3.3 Mixed Fragmentation
5.4 The Allocation of Fragments
5.5 Summary
5.0 Objectives: In this unit you will come to know the different design aspects of
distributed databases. At the end of this unit you will be able to describe the topics like
A framework for distributed database design
The objectives of design of data distribution
Top Down and Bottom Up design approaches
The design of database fragmentation
Horizontal Fragmentation
Vertical Fragmentation
Mixed Fragmentation
The allocation of fragments
General Criteria for Fragment allocation
5.1 Introduction: The concept of data distribution is difficult to design and
implement because of various technical and organizational issues, so we need an
efficient design methodology. From the technical point of view, the problems are the
interconnection of sites and the appropriate distribution of data and applications to the
sites, depending on the requirements of the applications and on optimizing performance.
From the organizational point of view, the issue of decentralization is crucial, since
distributing an application has a great effect on the organization. In recent years a lot of
research work has taken place in this area, and its major outcomes are:
o Design criteria for effective data distribution
o Mathematical background of the design aids
In section 5.2 you will learn a framework for the design, including the Top-Down
and Bottom-Up design approaches. Section 5.3 explains the design of horizontal and
vertical fragmentation. In section 5.4 we give principles and concepts of fragment
allocation.
5.2 A Framework for Distributed Database Design: The design of a centralized
database concentrates on:
Designing the conceptual schema that describes the complete database
Designing the Physical database, which maps the conceptual schema to the
storage areas and determines the appropriate access methods.
In a distributed database these two steps correspond to the design of the global
schema and the design of the local databases. The added steps are:
Designing the Fragmentation: - The actual procedure of dividing the existing
global relations into horizontal, vertical or mixed fragments
Designing the allocation of fragments: -Allocation of different fragments
according to the site requirements
Before designing a distributed database, thorough knowledge of the applications is
a must. The designer is expected to know:
Site of Origin: The site from which the application is issued.
The frequency of invoking the request at each site
The number, type and the statistical distribution of accesses made by each
application to each required data.
In the coming section let us try to know the actual need of design of data distribution.
81
5.2.1 Objectives of the Design of Data Distribution: In the design of data distribution
the following objectives should be considered.
Processing locality: Reducing remote references, and thereby maximizing local
references, is the primary aim of data distribution. This can be achieved by
redundant fragment allocation that meets the site requirements. Complete
locality is an extended idea, which simplifies the execution of applications.
Availability and reliability of distributed data: Availability is achieved by
having multiple copies of the data for read only applications. Reliability is
achieved by storing the multiple copies of the information, as it will be helpful in
case of system crashes.
Workload distribution: distributing the workload over the sites is an important
objective, in order to achieve a high degree of parallelism.
Storage costs and availability: cost criteria and the availability of storage
should be handled intelligently for effective data distribution.
Using all the above criteria together may increase the design complexity, so the important
aspects are taken as objectives depending upon the requirements, and the others are
treated as constraints. In the next section we design a simple approach for maximizing
processing locality.
5.2.2 Top-Down and Bottom-Up Approaches: Classical Design Methodologies:
There are two classical approaches as far as distributed databases design is concerned.
They are:
1. Top-Down Approach: This may be quite useful when the system has to be
designed from scratch. Here we follow these steps:
Design of Global Schema.
Design of Fragmentation Schema.
Design of Allocation Schema.
Design of Local Schema (Design of Physical Databases).
2. Bottom-Up Approach: This can be used for an existing system. This approach
is based on the integration of existing schemata into a single global schema, but it
requires that the following aspects be fulfilled.
y = p1* AND p2* AND ... AND pn*, over all pi in P,
where (pi* = pi or pi* = NOT pi) and y ≠ false.
A fragment is the set of all tuples for which a min-term predicate holds.
A simple predicate pi is relevant with respect to a set P of simple predicates if
there exist at least two min-term predicates of P whose expressions differ
only in the predicate pi itself, such that the corresponding fragments are
referenced in a different way by at least one application.
Let us try to understand the above terminology with an example. Consider
the relations DEPT (DEPTNUM, NAME, AREA) and JOB (JOBID, JOBNAME), and
assume that only two departments are functioning, i.e. 1 and 2. Some examples of
simple predicates are:
DEPTNUM = 1 or DEPTNUM ≠ 2, DEPTNUM = 2 or DEPTNUM ≠ 1,
JOB = programmer or JOB ≠ programmer.
The corresponding min-term predicates are:
DEPTNUM = 1 AND JOB = programmer
DEPTNUM = 1 AND JOB ≠ programmer
DEPTNUM ≠ 1 AND JOB ≠ programmer
DEPTNUM ≠ 1 AND JOB = programmer
Now let us concentrate on some more supporting terminology. Let P = {p1, p2, ..., pn} be a
set of simple predicates. For a correct and efficient fragmentation, P must be complete and
minimal.
We say that a set P of predicates is complete if and only if any two tuples
belonging to the same fragment are referenced with the same probability
by any application.
The set P is minimal if all its predicates are relevant.
Example: In the above example, P1 = {DEPTNUM = 1} is not complete, since
applications are also interested in whether employees are programmers. In this case
P2 = {DEPTNUM = 1, JOB = programmer} is complete and minimal. The set P3 =
{DEPTNUM = 1, JOB = programmer, SAL > 50} is complete but not minimal, since
SAL > 50 is not relevant.
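The notion of min-term predicates can be made concrete with a small sketch (the tuples are invented; with n simple predicates there are 2^n min-terms, matching the four listed above for n = 2):

```python
from itertools import product

def minterms(predicates):
    """Build one test function per min-term y = p1* AND ... AND pn*,
    where each pi* is either pi or NOT pi."""
    terms = []
    for signs in product([True, False], repeat=len(predicates)):
        def y(t, signs=signs):
            return all(p(t) == s for p, s in zip(predicates, signs))
        terms.append((signs, y))
    return terms

# The two simple predicates of the example.
P = [lambda t: t["DEPTNUM"] == 1,
     lambda t: t["JOB"] == "programmer"]

rows = [{"DEPTNUM": 1, "JOB": "programmer"},
        {"DEPTNUM": 2, "JOB": "clerk"}]

terms = minterms(P)
assert len(terms) == 4          # 2**2 min-term predicates

# Each tuple satisfies exactly one min-term, so the induced fragments
# are disjoint and cover the whole relation.
for t in rows:
    assert sum(y(t) for _, y in terms) == 1
```

This "exactly one min-term per tuple" property is what makes min-term fragments a complete and disjoint horizontal fragmentation.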
This query may be issued at any one of the sites. Let us assume that we have three sites in
our purview: Site 1 is in Shimoga, Site 2 is in Mysore, and Site 3 is between Shimoga
and Mysore. If the query is issued at Site 1, it references SUPPLIERS whose CITY is
Shimoga with almost 90% probability; if it is issued at Site 2, it references SUPPLIERS
of Shimoga and Mysore with equal probability; if it is issued at Site 3, it references
SUPPLIERS whose CITY is Mysore with almost 90% probability. This reflects the
obvious fact that departments around a city tend to use suppliers that are close to
them.
We can write the predicates for the above application,
P1: CITY = SHIMOGA
P2: CITY = MYSORE
Since the set {P1, P2} is complete and minimal, the search is terminated.
Let us now consider the global relation DEPT: DEPT (DEPTNUM, NAME,
AREA, MGRNUM). Some example predicates that are suitable for administrative
applications are considered.
P1: DEPTNUM < = 10
P2: (10 < DEPTNUM < = 20)
P3: DEPTNUM > 20
P4: AREA = NORTH
P5: AREA = SOUTH
If we assume that in the northern area there will never be departments with
DEPTNUM > 20, then AREA = NORTH implies that DEPTNUM > 20 is false. Thus the
fragments are reduced to the following four:
Y1: DEPTNUM < = 10
Y2: (10 < DEPTNUM < = 20) AND (AREA = NORTH )
Y3: (10 < DEPTNUM < = 20) AND (AREA = SOUTH )
Y4: DEPTNUM > 20
Turning now to fragment allocation, we can easily allocate the
fragments corresponding to Y1 and Y4 at Sites 1 and 3. Depending upon the
requirements, fragments Y2 and Y3 will be allocated to either Site 1 or Site 3.
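The reduction above can be checked mechanically with a sketch (the department tuples are invented; the qualifications Y1 to Y4 are those listed above):

```python
# Illustrative departments; per the stated assumption, no department
# with DEPTNUM > 20 exists in the NORTH area.
DEPTS = [
    {"DEPTNUM": 4,  "AREA": "NORTH"},
    {"DEPTNUM": 8,  "AREA": "SOUTH"},
    {"DEPTNUM": 15, "AREA": "NORTH"},
    {"DEPTNUM": 17, "AREA": "SOUTH"},
    {"DEPTNUM": 25, "AREA": "SOUTH"},
]

# The four fragments remaining after contradictory min-terms
# (e.g. AREA = NORTH AND DEPTNUM > 20) are eliminated.
FRAGMENTS = {
    "Y1": lambda t: t["DEPTNUM"] <= 10,
    "Y2": lambda t: 10 < t["DEPTNUM"] <= 20 and t["AREA"] == "NORTH",
    "Y3": lambda t: 10 < t["DEPTNUM"] <= 20 and t["AREA"] == "SOUTH",
    "Y4": lambda t: t["DEPTNUM"] > 20,
}

frags = {name: [t for t in DEPTS if q(t)] for name, q in FRAGMENTS.items()}

# Every department falls into exactly one fragment: the qualifications
# are complete and mutually exclusive.
assert sum(len(f) for f in frags.values()) == len(DEPTS)
assert frags["Y1"] == [DEPTS[0], DEPTS[1]]
```

Eliminating contradictory min-terms shrinks the number of fragments without losing completeness, which is exactly the point of using the implication AREA = NORTH implies NOT (DEPTNUM > 20).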
5.3.1.2 Derived Horizontal Fragmentation: This fragmentation is not based on the
properties of a relation's own attributes; it is derived from the horizontal fragmentation of
another relation. It is used to facilitate the join between fragments. A distributed join is a
join between horizontally fragmented relations: to join two relations G and H, you
have to compare their fragments. Join graphs represent this efficiently; fig 5.1
shows the different possible join graphs.
o Total: The join graph is total when it contains all possible edges between
fragments of G and H.
o Reduced: The join graph is reduced when some of the edges between
fragments of G and H are missing. Here we have two types:
Partitioned: A reduced graph is partitioned if it is
composed of two or more subgraphs with no edges between them.
Simple: A reduced graph is simple if it is partitioned and each
subgraph has just one edge.
Example: Consider the relation SUPPLY (SNUM, PNUM, DEPTNUM, QUAN), and
suppose that some applications:
o Require information about supplies of given suppliers; they thus join
SUPPLY and SUPPLIER on the SNUM attribute.
o Require information about supplies at a given department; they then
join SUPPLY and DEPT on the DEPTNUM attribute.
Let us assume that the relation DEPT is horizontally fragmented on the attribute
DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM. A
derived horizontal fragmentation can be obtained for relation SUPPLY by
performing a semi-join either with SUPPLIER on SNUM or with DEPT on
DEPTNUM; both are correct.
5.3.2 Vertical Fragmentation: This requires grouping the attributes into sets that are
referenced in a similar manner by applications. This method has been discussed in
terms of two separate problems:
The Vertical Partitioning Problem: Here the sets must be disjoint,
except that one attribute (the key) must be common to all of them. For
example, assume that a relation S is vertically fragmented in this way
into S1 and S2. This can be useful when an application can be executed
using either S1 or S2 alone; otherwise, keeping the complete S at a
particular site may be an unnecessary burden.
Two possible design approaches:
1. The split approach: The global relations are progressively
split into fragments
87
R1 R1 S1 R1 S1
S1
R2 S2
R2 R2
S2
R3 S3
R3 R3 S1
R4 S4
S3
R4 R4 S2
R5
Figure 5.1 The different possible join graphs
The Vertical Clustering Problem: Here the sets can overlap; depending
upon the requirements, you may have more than one common attribute in
two different fragments of a global relation. This introduces replication
within fragments, as some common attributes are present in several fragments.
It is suitable mainly for read-only applications, because applications
that frequently update these common attributes must be
referred to the sites where all of these attributes are present. Therefore,
vertical clustering is suggested where the overlapping attributes are not
heavily updated.
Example: Consider the global relation EMPL (EMPNUM, NAME, SAL, TAX,
MGRNUM, DEPTNUM), and suppose that:
Administrative applications require NAME, SAL, and TAX of employees.
Departmental applications require NAME, MGRNUM, and DEPTNUM.
Here vertical clustering is suggested, as the attribute NAME is required in both
fragments. So the fragments may be:
EMPL1 (EMPNUM, NAME, SAL, TAX)
EMPL2 (EMPNUM, NAME, MGRNUM, DEPTNUM)
5.3.3 Mixed Fragmentation: The simple ways of performing this are:
Apply horizontal fragmentation to vertical fragments
Apply vertical fragmentation to horizontal fragments
Both approaches are illustrated in diagrams 5.2 and 5.3.
[Diagrams 5.2 and 5.3: mixed fragmentation of a relation with attributes A1 through A5.]
5.5 Summary: In this unit we have discussed the four phases of the design of distributed databases:
Global schema, Fragmentation schema, Allocation schema and Local schema. Some
important aspects of design of fragmentation and allocation schemas are described in
detail. Also some of the practical examples are chosen for familiarizing the new concepts.
UNIT 6
OVERVIEW OF QUERY PROCESSING
Structure
6.0 Objectives
6.1 Introduction
6.2 Query Processing Problem
6.3 Objectives of Query Processing
6.4 Characterization of Query Processors
6.5 Layers of Query Processing
6.5.1 Query Decomposition
6.5.2 Data Localization
6.5.3 Global Query Optimization
6.5.4 Local Query Optimization
6.6 Summary
and
E1 = SL ENO≤"E3" (E)
G1 = SL ENO≤"E3" (G)
Strategy A is better by a factor of 37, which is quite significant; it also provides a
better distribution of work among the sites. The difference would be even larger if we
assumed slower communication and/or a higher degree of fragmentation.
[Fig. 6.1 Equivalent distributed execution strategies. (a) Strategy A: G1' = SL RESP="Manager" (G1) and G2' = SL RESP="Manager" (G2) are produced at Sites 1 and 2 and sent to Sites 3 and 4, where E1' = E1 JN ENO G1' and E2' = E2 JN ENO G2' are computed; the result E1' UN E2' is assembled at Site 5. (b) Strategy B: the fragments G1, G2, E1, and E2 are all transferred to the site of the query.]
estimate the size of intermediate relations. The accuracy of the statistics can be
improved by periodical updating.
o Decision sites: Most systems use a centralized decision approach, in which a
single site generates the strategy. However, the decision process could be
distributed among the various sites participating in the elaboration of the best
strategy. The centralized approach is simpler but requires knowledge of the
complete distributed database, whereas the distributed approach requires only
local information. A hybrid approach, where the major decisions are taken
at one particular site and the other decisions are taken locally, is often better.
o Exploitation of the Network Topology: the distributed query processor exploits
the network topology. With wide area networks, the cost function to be
minimized can be restricted to the data communication cost, which is the dominant
factor. This reduces the work of distributed query optimization, which can be
dealt with as two separate problems: selection of the global execution strategy, based
on inter-site communication, and selection of each local execution strategy,
based on centralized query processing algorithms. With local area networks,
communication costs are comparable to I/O costs; therefore, it is reasonable for
the distributed query processor to increase parallel execution, even at the cost of
increased communication.
o Exploitation of Replicated fragments: For reliability purposes it is useful to have
fragments replicated at different sites. Query processors have to exploit this
information either statically or dynamically for processing the query efficiently.
o Use of semi-joins: The semi-join operation reduces the size of the data
exchanged between sites, so that the communication cost is reduced.
6.5 Layers Of Query Processing:
The problem of query processing can itself be decomposed into several subproblems,
corresponding to various layers. Figure 6.2 shows a generic layering scheme for query
processing, where each layer solves a well-defined sub-problem. The input is a
query on distributed data expressed in relational calculus. This distributed query is posed
on global (distributed) relations, meaning that data distribution is hidden. Four main
layers are involved in mapping the distributed query into an optimized sequence of local
operations, each acting on a local database. These layers perform the functions of query
decomposition, data localization, global query optimization, and local query
optimization. The first three layers are performed by a central site and use global
information; the fourth is done by the local sites.
[Fig. 6.2 Generic layering scheme for distributed query processing: a calculus query on distributed relations is decomposed (using the global schema) into an algebraic query; data localization (using the fragment schema) turns it into a fragment query; global optimization (using statistics on fragments) produces an optimized fragment query with communication operations; local optimization (using the local schemas) yields the optimized local queries.]
6.5.1 Query Decomposition: The first layer decomposes the distributed calculus query
into an algebraic query on global relations. The information needed for this
transformation is found in the global conceptual schema describing the global relations.
However, the information about data distribution is not used here but in the next layer.
Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for
subsequent manipulation. Normalization of a query generally involves the
manipulation of the query quantifiers and of the query qualification by applying
logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are
detected and rejected as early as possible. Techniques to detect incorrect queries
exist only for a subset of relational calculus. Typically, they use some sort of
graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way to
simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an
algebraic query is defined in terms of expected performance. The traditional way
to do this transformation toward a "better" algebraic specification is to start with
an initial algebraic query and transform it in order to find a "good" one. The initial
algebraic query is derived immediately from the calculus query by translating the
predicates and the target statement into relational operations as they appear in the
query. This directly translated algebraic query is then restructured through
transformation rules. The algebraic query generated by this layer is good in the
sense that the worst executions are avoided.
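The simplification step (eliminating redundant predicates) can be illustrated with a toy rewriter. This is a deliberately minimal sketch: a real decomposer works on full relational calculus, while this one only applies idempotency rules (p AND p == p, p OR p == p, p AND true == p) to nested tuples:

```python
def simplify(q):
    """q is ('and'|'or', left, right) or an atomic predicate string."""
    if isinstance(q, str):
        return q
    op, left, right = q
    left, right = simplify(left), simplify(right)
    if left == right:                 # p AND p -> p,  p OR p -> p
        return left
    if op == "and" and right == "true":
        return left                   # p AND true -> p
    if op == "and" and left == "true":
        return right
    return (op, left, right)

# (SAL>500 OR SAL>500) AND true  simplifies to just  SAL>500
q = ("and", ("or", "SAL>500", "SAL>500"), "true")
print(simplify(q))  # -> 'SAL>500'
```

A production simplifier would also use integrity constraints and transitivity of comparisons; the principle of rewriting the qualification into a cheaper equivalent form is the same.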
6.5.2 Data Localization: The input to the second layer is an algebraic query on
distributed relations. The main role of the second layer is to localize the query's data
using data distribution information. Relations are fragmented and stored in disjoint
subsets called fragments, each stored at a different site. This layer determines
which fragments are involved in the query and transforms the distributed query into a
fragment query. Fragmentation is defined through fragmentation rules that can be
6.5.3 Global Query Optimization: The input to the third layer is a fragment query, that
is, an algebraic query on fragments. The goal of query optimization is to find an
execution strategy for the query, which is close to optimal. An execution strategy for a
distributed query can be described with relational algebra operations and communication
primitives (send/receive operations) for transferring data between sites. The previous
layers have already optimized the query for example, by eliminating redundant
expressions. However, this optimization is independent of fragments characteristics such
as cardinalities. In addition, communication operations are not yet specified. By
permuting the ordering of operations within one fragment query, many equivalent queries
may be found. Query optimization consists of finding the best ordering of
operations in the fragments query, including communication operations, which minimize
a cost function. The cost function, often defined in terms of time units, refers to
computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost and so on. An important aspect of query optimization is join
ordering, since permutations of the joint within the query may lead to improvements of
orders of magnitude. One basic technique for optimizing a sequence of distributed join
operations is through the semi-join operator. The main value of the semi-join in a
distributed system is to reduce the size of the join operands and then the communication
103
cost. The output of the query optimization layer is an optimized algebraic query with
communication operation included on fragments.
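Why the semi-join reduces communication cost can be shown with a back-of-the-envelope sketch (the relation sizes and the one-cost-unit-per-shipped-value model are invented for illustration):

```python
# EMP lives at site 1 (large), ASG at site 2 (small).
EMP_SITE1 = [{"ENO": i, "NAME": f"e{i}"} for i in range(1000)]
ASG_SITE2 = [{"ENO": i, "PNO": "P1"} for i in range(0, 1000, 100)]  # 10 rows

# Direct approach: ship all of EMP to site 2, then join there.
cost_ship_emp = len(EMP_SITE1)

# Semi-join approach: ship only ASG's ENO values to site 1, reduce EMP
# there, then ship only the matching EMP tuples back to site 2.
eno_values = {t["ENO"] for t in ASG_SITE2}
emp_reduced = [t for t in EMP_SITE1 if t["ENO"] in eno_values]
cost_semijoin = len(eno_values) + len(emp_reduced)

print(cost_ship_emp, cost_semijoin)  # -> 1000 20
assert cost_semijoin < cost_ship_emp
```

The semi-join pays off whenever the join column plus the reduced operand is smaller than the full operand; when the join is not selective, shipping the whole relation can be cheaper, which is why optimizers decide this case by case.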
6.5.4 Local Query Optimization: The last layer is performed by all the sites having
fragments involved in the query. Each sub-query executing at one site, called a local query,
is then optimized using the local schema of the site. At this time, the algorithms to perform
the relational operations may be chosen. Local optimization uses the algorithms of
centralized systems.
Check Your Progress: Answer the following: -
UNIT - 7
TRANSACTION MANAGEMENT AND CONCURRENCY CONTROL
Structure
7.0 Objectives.
7.1 Introduction.
7.2 Transaction Management
7.2.1 A Framework for transaction management
7.2.1.1 Transactions properties
7.2.1.2 Transaction Management Goals
7.2.1.3 Distributed Transactions
7.2.2 Atomicity of Distributed Transactions
7.2.2.1 Recovery in Centralized Systems
7.2.2.2 Problems with respect to Communication in Distributed
databases
7.2.2.3 Recovery of Distributed Transactions
7.2.2.4 The Transaction control
7.3 Concurrency Control for Distributed Transactions
7.3.1 Concurrency Control Based on Locking in Centralized
Databases
7.3.2 Concurrency Control Based on Locking in Distributed
Databases
7.4 Summary
7.0 Objectives: At the end of this unit you will be able to:
Describe the different problems in managing distributed transactions.
Discuss the recovery of distributed transactions.
Explain a popular algorithm called the 2-Phase Commit Protocol.
Discuss the aspects of concurrency control in distributed transactions.
7.1 Introduction:
The management of distributed transactions means dealing with interrelated problems like
reliability, concurrency control, and the efficient utilization of the resources of the
complete system. In this unit we consider well-known protocols such as the 2-Phase
Commit protocol for recovery and 2-Phase Locking for concurrency control. All these
aspects are discussed in different sections, as given in the above structure.
Durability: Once a transaction is committed, the system must guarantee that the
results of operations will never be lost, independent of subsequent failures. The
activity of providing Durability of the transaction is called Database recovery.
Serializability: If many transactions execute concurrently, the result must be the
same as if they were executed serially in some order. The activity of providing
serializability of transactions is called Concurrency control.
Isolation: This property states that an incomplete transaction cannot disclose its
results to other transactions until it is committed. This property has to be strictly
enforced to avoid a problem called Cascading Aborts (the Domino Effect),
in which all the transactions that have observed the partial results of an
uncommitted transaction have to be aborted.
These properties have to be fulfilled for the efficient transaction to happen. In the next
section we will see why the transaction management is an important aspect.
7.2.1.2 Transaction Management Goals: After knowing the performance characteristics
of transactions let us see what are the real goals of transaction management? The goal of
the transaction management in a Distributed database is to control the execution of
transactions so that:
1. Transactions have atomicity, durability, serializability, and isolation properties.
2. Their cost in terms of main memory, CPU, and number of transmitted control
messages and their response time are minimized.
3. The availability of the system is maximized.
The second point concerns the efficiency of transactions. Let us discuss the second and
third points in detail, as the first point was dealt with in the previous
subsection.
CPU and main memory utilization: This is a common aspect of both centralized and
distributed databases. With concurrent transactions, the CPU and main memory
should be properly scheduled and managed by the operating system; otherwise they
become a bottleneck when the number of concurrent transactions is large.
Control messages and their response time: As control messages do not carry
any application data and are used only to control the execution of transactions, there
should be as few such messages exchanged between the sites as possible; otherwise
the communication cost increases unnecessarily.
Another important aspect is the response time of each individual transaction,
which should be as small as possible for better system performance. This is
especially crucial in a distributed system, where additional time is required for
communication between different sites.
Availability: This goal should be discussed keeping system failures in mind. The
algorithms implemented by the transaction manager must bypass any site that is not
operational and direct the request to an operational site, so that it can still be serviced.
After studying the goals of the transaction management in detail, in the coming
section we will suggest an appropriate model for Distributed transaction.
7.2.1.3 Distributed Transactions: A transaction is part of an application. Once an
application issues the begin_transaction primitive, all actions it performs, until a
commit or abort primitive is issued, are considered one complete transaction. Now let
us discuss a model for distributed transactions, starting with some related terminology.
Agents: An agent is a local process that performs several functions on behalf
of an application. In order to cooperate in the execution of the global operations
required by the application, the agents have to communicate. As they reside
at different sites, the communication between agents is performed through
messages. There are various methods for organizing the agents into a
structure of cooperating processes; in this model we assume one such method,
which is discussed in detail in the next section.
Root agent: There exists a root agent, which starts the whole transaction; when
the user requests the execution of an application, the root agent is started.
The site of the root agent is called the site of origin of the transaction.
The root agent has the responsibility of issuing the begin_transaction, commit,
and abort primitives.
Only the root agent can request the creation of a new agent.
Finally, to summarize, the distributed transaction model consists of a root agent, which
initiates the transaction, and a number of agents, depending on the application, which
work concurrently. All the primitives are executed by the root agent, and they are not
only local to the site of origin but also affect all the agents of the transaction.
The following distributed transaction executed by scott updates the local sales
database, the remote hq database, and the remote maint database:
UPDATE scott.dept@hq.us.acme.com
SET loc = 'REDWOOD SHORES'
WHERE deptno = 10;
UPDATE scott.emp
SET deptno = 11
WHERE deptno = 10;
UPDATE scott.bldg@maint.us.acme.com
SET room = 1225
WHERE room = 1163;
COMMIT;
Example: Let us consider an example of a distributed transaction: a fund transfer
operation between two accounts. A global relation
FUND_TRAN (ACC_NUM, AMOUNT) is used to manage this application. The
application starts by reading from the terminal the amount to be transferred, the
account number from which the amount must be taken, and the account number to
which it must be credited. The application then issues a begin_transaction primitive,
and the usual operations start from this point onwards, in a global environment. The
following transaction code (Fig. 7.2) narrates the whole process.
If we assume that the accounts are distributed at different sites of a network, like
the branches of a bank, at execution time several cooperating processes will perform
the transaction. For example, in the following transaction code (Fig. 7.3) two agents
are shown, one of which is the root agent. Here we assume that the $from_acc account
is located at the site of the root agent and that the $to_acc account is located at a
different site, where AGENT1 is executed. When the root agent wants to perform the
transaction, it executes the primitive Create AGENT1 and then sends the parameters
to AGENT1. The root agent also issues the begin_transaction, commit, and abort
primitives. All these operations are carried out preserving the properties of distributed
transactions discussed in the previous section.
FUND TRANSFER:
Read (terminal, $AMOUNT, $from_acc, $to_acc);
Begin_transaction;
Select AMOUNT into $FROM_AMOUNT
from ACC
where ACC_NUM = $from_acc;
if $FROM_AMOUNT - $AMOUNT < 0 then abort
else begin
Update ACC
Set AMOUNT = AMOUNT - $AMOUNT
Where ACC_NUM = $from_acc;
Update ACC
Set AMOUNT = AMOUNT + $AMOUNT
Where ACC_NUM = $to_acc;
Commit
End
ROOT AGENT:
Read (terminal, $AMOUNT, $from_acc, $to_acc);
Begin_transaction;
Select AMOUNT into $FROM_AMOUNT
from ACC
where ACC_NUM = $from_acc;
if $FROM_AMOUNT - $AMOUNT < 0 then abort
else begin
Update ACC
Set AMOUNT = AMOUNT - $AMOUNT
Where ACC_NUM = $from_acc;
Create AGENT1;
crashes. The probability of such failures is much smaller than that of the other
two types. However, it can be made smaller still by keeping the same
information on several disks; this idea is the basis for the concept of
stable storage. A strategy known as careful replacement is used for this
purpose: at every update operation, first one copy of the information is
updated, then the correctness of the update is verified, and finally the
remaining copies are updated.
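The careful replacement strategy described above can be sketched in a few lines of Python (a minimal illustration; modeling each disk copy as a one-element list, and verification as a read-back comparison, are assumptions, not the actual disk protocol):

```python
# Sketch of careful replacement: update one copy, verify the update,
# and only then update the remaining copies.

def careful_replacement(copies, new_value):
    copies[0][0] = new_value       # first, update a single copy
    if copies[0][0] != new_value:  # then verify the update is correct
        raise IOError("update of the first copy could not be verified")
    for copy in copies[1:]:        # finally, update the remaining copies
        copy[0] = new_value

disks = [["old"], ["old"], ["old"]]
careful_replacement(disks, "new")
# all three copies now hold "new"
```

The point of the ordering is that a crash between steps leaves at most one copy in an unverified state, so a consistent copy always survives.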
Failures with loss of stable storage: In this type, some information stored in
stable storage is lost because of multiple, simultaneous failures of the storage
disks. This probability cannot be reduced to zero.
o Step 3: Undo the transactions determined in Step 1 and redo the
transactions determined in Step 2.
The checkpoint requires the following operations to be performed:
o All log records and database updates that are still in volatile
storage are recorded in stable storage.
o A checkpoint record is written to stable storage. A checkpoint record in
the log contains an indication of the transactions that are active at the time
the checkpoint is taken (an active transaction is one whose begin_transaction
record belongs to the log without a corresponding commit or abort record).
The use of checkpoints modifies Step 1 and Step 2 of the recovery
procedure as follows:
o Find and read the last checkpoint record.
o Put all transactions written in the checkpoint record into the undo
set, which contains the transactions to be undone. The redo set, which
contains the transactions to be redone, is initially empty.
o Read the log file from the checkpoint record to the end. If a
begin_transaction record is found, put the corresponding transaction in the
undo set. If a commit record is found, move the corresponding
transaction from the undo set to the redo set.
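The modified Step 1 and Step 2 above can be sketched as follows (a minimal sketch in Python; the (record_type, transaction_id) tuple format of log records and the helper name are illustrative assumptions):

```python
# Starting from the last checkpoint record, build the undo and redo sets
# by scanning the remainder of the log, exactly as in the steps above.

def build_undo_redo_sets(checkpoint_active, log_after_checkpoint):
    """checkpoint_active: transactions listed active in the checkpoint record.
    log_after_checkpoint: log records after the checkpoint, in order."""
    undo = set(checkpoint_active)  # transactions active at the checkpoint
    redo = set()                   # initially empty, as described above
    for record_type, tid in log_after_checkpoint:
        if record_type == "begin_transaction":
            undo.add(tid)          # started after the checkpoint
        elif record_type == "commit":
            undo.discard(tid)      # committed: move from undo to redo
            redo.add(tid)
    return undo, redo

undo, redo = build_undo_redo_sets(
    checkpoint_active=["T1", "T2"],
    log_after_checkpoint=[
        ("begin_transaction", "T3"),
        ("commit", "T1"),
        ("begin_transaction", "T4"),
        ("commit", "T3"),
    ],
)
# T1 and T3 reached commit, so they are redone; T2 and T4 are undone
```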
From the above discussion we can say that only the latest portion of the log
must be kept online, whereas the remaining part can be kept in stable
storage. So far we have only considered the failure of volatile storage. Let us
now consider failures with the loss of stable storage, which can be studied by
considering two possibilities.
o Failures in which database information is lost, but the logs are
safe: In this case, recovery is performed by redoing all committed
transactions using the log. Before redoing the transactions, the
database is restored from a dump, an image of a previous state
stored on tape (this is, of course, a lengthy process).
Multiple Failures and K-resiliency: Failures do not necessarily occur one at a time. A
system that can tolerate K failures is called K-resilient. In distributed databases, this
concept is applied to site failures and/or partitions. With respect to site failures, an
algorithm is said to be K-resilient if it works properly even if K sites are down. The
extreme case in which all sites are down is called total failure.
7.2.2.3 Recovery of Distributed Transactions: Now let us consider recovery problems
in distributed databases. For this purpose, assume that a Local Transaction Manager
(LTM) is available at each site. Each agent can issue begin_transaction, commit, and
abort primitives to its LTM. After having issued a begin_transaction to its LTM, an
agent possesses the properties of a local transaction; we will call an agent that has
issued a begin_transaction primitive to its local transaction manager a sub-transaction.
Also, to distinguish the begin_transaction, commit, and abort primitives of the
distributed transaction from the local primitives issued by each agent to its LTM, we
will call the latter local_begin, local_commit, and local_abort.
For building a Distributed Transaction Manager (DTM), the following
properties are expected from the LTM:
Ensuring the atomicity of a sub-transaction
Writing some records on stable storage on behalf of the distributed
transaction manager
The second requirement is needed because some additional information must be
recorded in such a way that it can be recovered in case of failure. In order to
make sure that either all actions of a distributed transaction are performed or
none is performed, two conditions are necessary:
At each site, either all actions are performed or none is performed
All sites must take the same decision with respect to the commitment
or abort of sub-transactions.
Fig. 7.4 shows a reference model of Distributed transaction recovery.
[Fig. 7.4: Reference model of distributed transaction recovery. The root agent and the
other agents of the transaction exchange messages through DTM-agents (the
Distributed Transaction Manager), and the agent at each site issues its local
primitives to the Local Transaction Manager (LTM) at that site.]
[Fig. 7.5: Fund transfer. The root agent issues local_begin to its LTM, which writes a
begin_transaction record in the local log; the root agent then sends a create request to
the DTM-agent at the remote site, which performs a local create and issues
local_begin to its own LTM, which in turn writes begin_transaction in its local log.]
7.2.2.4 The Transaction Control Statements: The following list describes transaction
control statements supported:
COMMIT
ROLLBACK (ABORT)
SAVEPOINT
Role descriptions:
Database server: A node that receives a request for information from another node.
Local coordinator: A node that must reference data on other nodes to complete its part of the transaction.
Commit point site: The node that commits or rolls back the transaction as instructed by the global coordinator.
Clients: A node acts as a client when it references information from another node's
database. The referenced node is a database server. In Fig 7.6, the node sales is a client
of the nodes that host the warehouse and finance databases.
Database Servers: A database server is a node that hosts a database from which a client
requests data. In Fig 7.6, an application at the sales node initiates a distributed transaction
that accesses data from the warehouse and finance nodes. Therefore, sales.acme.com
has the role of client node, and warehouse and finance are both database servers. In this
example, sales is both a database server and a client, because the application also
modifies data in the sales database.
Local Coordinators: A node that must reference data on other nodes to complete its part
in the distributed transaction is called a local coordinator. In Fig 7.6, sales is a local
coordinator because it coordinates the nodes it directly references: warehouse and
finance. The node sales also happens to be the global coordinator because it coordinates
all the nodes involved in the transaction.
A local coordinator is responsible for coordinating the transaction among the nodes it
communicates directly with by:
Receiving and relaying transaction status information to and from those nodes
Passing queries to those nodes
Receiving queries from those nodes and passing them on to other nodes
Returning the results of queries to the nodes that initiated them
Global Coordinator: The node where the distributed transaction originates is called the
global coordinator. The database application issuing the distributed transaction is directly
connected to the node acting as the global coordinator. For example, in Fig 7.6, the
transaction issued at the node sales references information from the database servers
warehouse and finance. Therefore, sales.acme.com is the global coordinator of this
distributed transaction. The global coordinator becomes the parent or root of the session
tree. The global coordinator performs the following operations during a distributed
transaction:
Sends all of the distributed transaction's SQL statements, remote procedure calls, and so
forth to the directly referenced nodes, thus forming the session tree
Instructs all directly referenced nodes other than the commit point site to prepare the
transaction
Instructs the commit point site to initiate the global commit of the transaction if all nodes
prepare successfully
Instructs all nodes to initiate a global abort of the transaction if there is an abort response
Commit Point Site: The job of the commit point site is to initiate a commit or roll back
(abort) operation as instructed by the global coordinator. The system administrator
designates one node as the commit point site in the session tree by
assigning each node a commit point strength. The node selected as commit point
site should be the node that stores the most critical data. Fig 7.7 illustrates an
example of a distributed system, with sales serving as the commit point site:
The commit point site is distinct from all other nodes involved in a distributed transaction in
these ways:
The commit point site never enters the prepared state. Consequently, if the commit point
site stores the most critical data, this data never remains in-doubt, even if a failure occurs.
In failure situations, failed nodes remain in a prepared state, holding necessary locks on
data until in-doubt transactions are resolved.
The commit point site commits before the other nodes involved in the transaction. In
effect, the outcome of a distributed transaction at the commit point site determines
whether the transaction at all nodes is committed or rolled back: the other nodes follow
the lead of the commit point site. The global coordinator ensures that all nodes complete
the transaction in the same manner as the commit point site.
Specifically, the commit point strength determines whether a given node is the
commit point site in the distributed transaction and thus commits before all of the
other nodes. This value is specified using the initialization parameter
COMMIT_POINT_STRENGTH. This section explains how the system determines the
commit point site. The commit point site, which is determined at the beginning of the
prepare phase, is selected only from the nodes participating in the transaction. The
following sequence of events occurs:
Of the nodes directly referenced by the global coordinator, the software selects the node
with the highest commit point strength as the commit point site.
The initially selected node determines if any of the nodes from which it has to obtain
information for this transaction has a higher commit point strength.
Either the node with the highest commit point strength directly referenced in the
transaction or one of its servers with a higher commit point strength becomes the commit
point site.
After the final commit point site has been determined, the global coordinator sends
prepare messages to all nodes participating in the transaction.
Fig 7.6 shows, in a sample session tree, the commit point strength of each node (in
parentheses) and the node chosen as the commit point site. The following
conditions apply when determining the commit point site:
A read-only node cannot be the commit point site.
If multiple nodes directly referenced by the global coordinator have the same commit
point strength, then the software designates one of them as the commit point site.
If a distributed transaction ends with an abort, then the prepare and commit phases are not
needed. Consequently, the software never determines a commit point site. Instead, the
global coordinator sends an ABORT statement to all nodes and ends the processing of the
distributed transaction.
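The selection sequence above can be sketched as follows (a hypothetical illustration; the node names, strengths, and dictionary layout are assumptions, not an actual system's data structures):

```python
# Sketch of commit point site selection: pick the strongest directly
# referenced node, then let it check whether one of its own servers has
# a higher commit point strength. Read-only nodes are never chosen.

def choose_commit_point_site(strengths, read_only, direct_refs, servers_of):
    """strengths: node -> COMMIT_POINT_STRENGTH value.
    read_only: nodes that only queried data (cannot be the site).
    direct_refs: nodes directly referenced by the global coordinator.
    servers_of: node -> nodes it obtains information from."""
    candidates = [n for n in direct_refs if n not in read_only]
    site = max(candidates, key=lambda n: strengths[n])  # highest strength
    # The initially selected node checks whether a stronger server exists.
    for server in servers_of.get(site, []):
        if server not in read_only and strengths[server] > strengths[site]:
            site = server
    return site

site = choose_commit_point_site(
    strengths={"sales": 100, "warehouse": 75, "finance": 50, "hq": 120},
    read_only={"finance"},
    direct_refs=["warehouse", "finance"],
    servers_of={"warehouse": ["hq"]},
)
# warehouse wins among direct references, but its server hq is stronger
```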
As Fig 7.8 illustrates, the commit point site and the global coordinator can be
different nodes of the session tree. The commit point strength of each node is
communicated to the coordinators when the initial connections are made. The
coordinators retain the commit point strengths of each node they are in direct
communication with so those commit point sites can be efficiently selected during
two-phase commits. Therefore, it is not necessary for the commit point strength to be
exchanged between a coordinator and a node each time a commit occurs.
Fig 7.8 The commit point site and the global coordinator
The system maintains the integrity of the global database (the collection of databases participating
in the transaction) using the two-phase commit mechanism. This mechanism is
completely transparent, requiring no programming on the part of the user or application
developer. The commit mechanism has the following distinct phases, which the
software performs automatically whenever a user commits a distributed transaction:
Prepare phase: The initiating node, called the global coordinator, asks participating
nodes other than the commit point site to promise to commit or roll back the
transaction, even if there is a failure. If any node cannot prepare, the
transaction is rolled back.
Commit phase: If all participants respond to the coordinator that they are prepared, then
the coordinator asks the commit point site to commit. After it commits, the
coordinator asks all other nodes to commit the transaction.
When a node responds to the global coordinator that it is prepared to commit, the
prepared node promises to either commit or roll back the transaction later, but it does
not make a unilateral decision on whether to commit or roll back the transaction. The
promise means that if an instance failure occurs at this point, the node can use the redo
records in the online log to recover the database back to the prepare phase.
Types of Responses in the Prepare Phase: When a node is told to prepare, it can
respond in the following ways:
Prepared: Data on the node has been modified by a statement in the distributed
transaction, and the node has successfully prepared.
Read-only: No data on the node has been, or can be, modified (only queried), so no
preparation is necessary.
Note that if a distributed transaction is set to read-only, then it does not use abort
segments. If many users connect to the database and their transactions are not set to READ
ONLY, then they allocate abort space even if they are only performing queries.
Abort Response: When a node cannot successfully prepare, it performs the following
actions:
Releases resources currently held by the transaction and rolls back the local portion of
the transaction.
Responds to the node that referenced it in the distributed transaction with an abort
message.
These actions then propagate to the other nodes involved in the distributed transaction so
that they can roll back the transaction and guarantee the integrity of the data in the global
database. This response enforces the primary rule of a distributed transaction: all nodes
involved in the transaction either all commit or all roll back the transaction at the same
logical time.
Steps in the Prepare Phase: To complete the prepare phase, each node excluding the
commit point site performs the following steps:
The node requests that its descendants, that is, the nodes subsequently referenced,
prepare to commit.
The node checks to see whether the transaction changes data on itself or its descendants.
If there is no change to the data, then the node skips the remaining steps and returns a
read-only response
The node allocates the resources it needs to commit the transaction if data is changed.
The node saves redo records corresponding to changes made by the transaction to its
online redo log.
The node guarantees that locks held for the transaction are able to survive a failure.
The node responds to the initiating node with a prepared response or, if its attempt or the
attempt of one of its descendants to prepare was unsuccessful, with an abort response.
These actions guarantee that the node can subsequently commit or roll back the
transaction on the node. The prepared nodes then wait until a COMMIT or ABORT
request is received from the global coordinator.
After the nodes are prepared, the distributed transaction is said to be in-doubt. It retains
in-doubt status until all changes are either committed or aborted.
Commit Phase: The second phase in committing a distributed transaction is the commit
phase. Before this phase occurs, all nodes other than the commit point site
referenced in the distributed transaction have guaranteed that they are prepared, that
is, they have the necessary resources to commit the transaction.
Steps in the Commit Phase: The commit phase consists of the following steps:
The global coordinator instructs the commit point site to commit.
The commit point site commits.
The commit point site informs the global coordinator that it has committed.
The global and local coordinators send a message to all nodes instructing them to commit
the transaction.
At each node, the system commits the local portion of the distributed transaction and
releases locks.
At each node, the system records an additional redo entry in the local redo log, indicating
that the transaction has committed.
The participating nodes notify the global coordinator that they have committed.
During the prepare phase, the system determines the highest SCN at all nodes involved in
the transaction. The transaction then commits with the high SCN at the commit point site.
The commit SCN is then sent to all prepared nodes with the commit decision.
Forget Phase: After the participating nodes notify the commit point site that they have
committed, the commit point site can forget about the transaction. The following
steps occur:
After receiving notice from the global coordinator that all nodes have committed, the
commit point site erases status information about this transaction.
The commit point site informs the global coordinator that it has erased the status
information.
The global coordinator erases its own information about the transaction.
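The prepare and commit phases described above can be sketched from the global coordinator's point of view (a simplified illustration; real systems exchange messages and write log records, which are omitted here, and the participant callables and names are assumptions):

```python
# Sketch of two-phase commit as seen by the global coordinator.
# Participants are simulated as callables returning their prepare-phase
# vote; message passing and the forget phase bookkeeping are omitted.

def two_phase_commit(participants, commit_point_site):
    """participants: node name -> prepare() returning 'prepared',
    'read-only', or 'abort'. The commit point site is never prepared."""
    # Prepare phase: every node except the commit point site must vote.
    for node, prepare in participants.items():
        if node == commit_point_site:
            continue
        if prepare() == "abort":
            return "rolled back"   # one abort vote rolls everything back
    # Commit phase: the commit point site commits first; the coordinator
    # then asks all other nodes to commit the transaction.
    return "committed"

outcome = two_phase_commit(
    participants={
        "sales": lambda: "prepared",
        "warehouse": lambda: "prepared",
        "finance": lambda: "read-only",  # only queried, needs no preparation
    },
    commit_point_site="sales",
)
# every vote is prepared or read-only, so the transaction commits
```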
Response of the 2-Phase Commit Protocol to Failures: The protocol is resilient to all
failures in which no log information is lost. Its response in the presence of such
failures is now discussed.
1. Site Failure:
A participant fails before having written the ready record in the log. In this
case, the coordinator's timeout expires, and it takes the abort decision.
A participant fails after having written the ready record in the log. In this case,
the operational sites correctly terminate the transaction (abort or commit). When
the failed site recovers, the restart procedure has to ask the coordinator or
some other participant about the outcome of the transaction.
The coordinator fails after having written the prepare record in the log,
but before having written a global_commit or
global_abort record in the log. In this case, all participants that have already
answered with a READY message must wait for the recovery of the coordinator. The
restart procedure of the coordinator resumes the commit protocol from the
beginning, reading the identity of the participants from the prepare record in
the log and sending the PREPARE message to them again. Each ready participant
must recognize that the new PREPARE message is a repetition of the previous
one.
The coordinator fails after having written a global_commit or global_abort
record in the log, but before having written the complete record in the log. In
this case, the coordinator must send the decision again to all participants;
participants who have not received the command have to wait until the
coordinator recovers.
The coordinator fails after having written the complete record in the log. In
this case, the transaction was already concluded, and no action is required at
restart.
2. Lost Messages:
An answer message (READY or ABORT) from a participant is lost. In this
case, the coordinator's timeout expires, and the whole transaction is aborted.
A PREPARE message is lost. In this case, the participant remains in wait. The
global result is the same as in the previous case, as the coordinator does not
receive an answer.
A command message (COMMIT or ABORT) is lost. In this case, the
destination participant remains uncertain about the decision. Introducing a
timeout in the participant eliminates this problem: if no command has been
received within the timeout interval after the answer was sent, a request for
the repetition of the command is sent.
An ACK message is lost. In this case, the coordinator remains uncertain
whether the participant has received the command message.
Introducing a timeout in the coordinator eliminates this problem: if no
ACK message is received within the timeout interval after the transmission of
the command, the coordinator sends the command again.
3. Network Partitions: Suppose that a simple partition occurs, dividing the
sites into two groups: the group containing the coordinator is called the coordinator
group, and the other the participant group. From the viewpoint of the coordinator, the
partition is equivalent to the multiple failure of a set of participants, and the solution
has already been discussed. From the viewpoint of a participant, the failure is
equivalent to a coordinator failure, and the situation is similar to the case already
discussed.
It has been observed that the recovery procedure for a site that is involved in
processing a distributed transaction is more complex than that for a centralized
database.
7.3 Concurrency Control for Distributed Transactions:
In this section we discuss the fundamental problems that arise from the concurrent
execution of transactions. We deal with concurrency control based on locking. First,
the 2-phase locking protocol in centralized databases is presented; then, 2-phase
locking is extended to distributed databases.
7.3.1 Concurrency Control Based on Locking in Centralized Databases: The basic
idea of locking is that whenever a transaction accesses a data item, it locks it, and that a
transaction which wants to lock a data item which is already locked by another
transaction must wait until the other transaction has released the lock (unlock).
We will simply assume that all transactions are performed according to the following
scheme:
(Begin Application)
Begin transaction
Acquire locks before reading or writing
Commit
Release locks
(End application)
In this way the transactions are well formed, 2-Phase locked and isolated.
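The well-formed, 2-phase-locked scheme above can be sketched as follows (a minimal single-threaded illustration; the lock table dictionary and class are assumptions, and waiting for a held lock is modeled by raising an error rather than blocking):

```python
# Sketch of a well-formed, 2-phase-locked transaction: every lock is
# acquired before reading or writing (growing phase), and all locks are
# released only after commit (shrinking phase).

class TwoPhaseTransaction:
    def __init__(self, lock_table, tid):
        self.locks, self.tid, self.held = lock_table, tid, []

    def acquire(self, item):
        holder = self.locks.get(item)
        if holder == self.tid:
            return                 # already held by this transaction
        if holder is not None:
            raise RuntimeError(f"{self.tid} must wait: {item} held by {holder}")
        self.locks[item] = self.tid
        self.held.append(item)     # growing phase: lock before access

    def commit(self):
        for item in self.held:     # shrinking phase: release after commit
            del self.locks[item]
        self.held.clear()

table = {}
t1 = TwoPhaseTransaction(table, "T1")
t1.acquire("ACC_1")
t1.acquire("ACC_2")   # both accounts locked before any update
t1.commit()           # all locks released only now
# the lock table is empty again after commit
```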
Deadlock: A deadlock between two transactions arises if each transaction has locked a
data item and is waiting to lock a different data item that has already been locked by the
other transaction in a conflicting mode. Both transactions will wait forever in this
situation, and system intervention is required to unblock it. The system must
first discover the deadlock situation and force one transaction to release its locks so that
the other can proceed, i.e., one transaction is aborted. This method is called
deadlock detection.
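Deadlock detection as described above is commonly implemented by representing who waits for whom as a wait-for graph and searching it for a cycle; the following sketch illustrates the idea (the dictionary-of-sets graph representation is an illustrative assumption):

```python
# Sketch of deadlock detection on a wait-for graph: a deadlock exists
# exactly when some transaction can reach itself by following the
# "waits for" edges, i.e. the graph contains a cycle.

def find_deadlock(waits_for):
    """waits_for: transaction -> set of transactions it is waiting on.
    Returns True if some transactions would wait forever (a cycle)."""
    def reachable(start, target, seen):
        for nxt in waits_for.get(start, set()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                if reachable(nxt, target, seen):
                    return True
        return False
    return any(reachable(t, t, set()) for t in waits_for)

# T1 holds an item T2 wants, and T2 holds an item T1 wants:
deadlocked = find_deadlock({"T1": {"T2"}, "T2": {"T1"}})
# deadlocked is True; the system must abort one of the two transactions
```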
7.3.2 Concurrency Control Based on Locking in Distributed Databases:
Let us now concentrate on concurrency control for distributed transactions. We make
the following assumptions:
The local agents (LTMs) can lock and unlock local data items.
The LTMs interpret the local locking primitives: local-lock-shared, local-lock-
exclusive, and local-unlock.
The agents issue the global primitives: lock-shared, lock-exclusive, and
unlock.
The most important result for distributed databases is the following:
If distributed transactions are well-formed and 2-phase locked, then 2-phase locking
is a correct concurrency control mechanism for distributed databases, just as it is for
centralized databases.
We shall now discuss the important problems that have to be solved by the Distributed
Transaction Manager (DTM).
[Figure: locking by two distributed transactions. Agents T1A1 and T2A1 run at Site 1,
and agents T1A2 and T2A2 run at Site 2, each issuing lock requests to the LTM at its
site.]
Structure
8.0 Objective
8.1 Introduction
8.2 Clock Synchronization
8.2.1 Synchronizing physical clocks
8.2.1.1 Cristian's method for synchronizing clocks
8.2.1.2 The Berkeley algorithm
8.2.1.3 The Network Time Protocol
8.3 Logical Time and Logical clocks
8.3.1 Logical clocks
8.3.2 Totally Ordered Logical Clocks
8.4 Distributed Coordination
8.4.1 Distributed Mutual Exclusion
8.4.1.1 The central server algorithm
8.4.1.2 A Distributed algorithm using logical clocks
8.4.1.3 A Ring based algorithm
8.5 Elections
8.5.1 The bully algorithm
8.5.2 A Ring based election algorithm
8.6 Summary
8.0 Objective: In this unit we introduce some topics related to the issue of coordination
in distributed systems. To start with, the notion of time is dealt with. We go on to
explain the notion of logical clocks, which are a tool for ordering events without
knowing precisely when they occurred. The second half briefly examines some
algorithms for achieving distributed coordination. These include algorithms that
achieve mutual exclusion among a collection of processes, so as to coordinate their
accesses to shared resources. It goes on to examine how a group of processes can
agree upon a new coordinator of their activities after their previous coordinator
has failed or become unreachable. This process is called an election.
8.1 Introduction:
This unit introduces some concepts and algorithms related to the timing and coordination
of events occurring in distributed systems. In the first half we describe the notion
of time in a distributed system. We discuss the problem of how to synchronize
clocks in different computers, and so time events occurring at them consistently,
and we discuss the related problem of determining the order in which events
occurred. In the latter half of the unit we introduce some algorithms whose
goal is to confer a privilege upon some unique member of a collection of processes.
In the next section we explain methods whereby computer clocks can be
approximately synchronized using message passing. However, we shall not be able to
obtain sufficient accuracy to determine, in many cases, the relative ordering of events;
instead, we can order some events by appealing to the flow of data between processes
in a distributed system. We go on to introduce logical clocks, which are used to define
an order on events without measuring the physical time at which they occurred.
8.2 Clock Synchronization: Time is an important and interesting issue in distributed
systems, for several reasons.
o First, time is a quantity we often want to measure accurately. In order to know
at what time of day a particular event occurred at a particular computer (for
example, for accountancy purposes), it is necessary to synchronize its clock with
a trustworthy, external source of time. This is external synchronization. Also, if
computer clocks are synchronized with one another to a known degree of
accuracy, then we can, within the bounds of this accuracy, measure the interval
can occur. The rate at which events occur depends on such factors as the length of the
processor instruction cycle.
In order to compare timestamps generated by clocks of the same physical
construction at different computers, one might think that we need only know the relative
offset of one clock's counter from that of the other (for example, that one counter had the
value 1958 when the other was initialized to 0). Unfortunately, this supposition is based
on a false assumption: computer clocks in practice are extremely unlikely to tick at the
same rate, whether or not they are of the same physical construction.
Clock drift: The crystal-based clocks used in computers are, like any other clocks,
subject to clock drift (Figure 8.1), which means that they count time at different rates,
and so deviate. The underlying oscillators are subject to physical variations, with the
consequence that their frequencies of oscillation differ. Moreover, even the same
clock's frequency varies with the temperature. Designs exist that attempt to compensate
for this variation, but they cannot eliminate it. The difference in the oscillation period
Let the value of the software clock be Tskew when the hardware counter H = h, and let
the actual time at that point be Treal. We may have Tskew > Treal or Tskew < Treal. If S
is to give the actual time after N further ticks, we must have:
    Tskew = (1 + a)h + b, and Treal + N = (1 + a)(h + N) + b
By solving these equations, we find:
    a = (Treal - Tskew) / N and b = Tskew - (1 + a)h
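The algebra above can be checked directly. The following sketch (the function and variable names are ours, not from the text) solves the two equations for a and b and verifies that both defining equations hold:

```python
def skew_coefficients(t_skew, t_real, h, n):
    """Solve Tskew = (1+a)h + b and Treal + N = (1+a)(h+N) + b for a and b."""
    a = (t_real - t_skew) / n
    b = t_skew - (1 + a) * h
    return a, b

# A software clock that reads 100.0 when the true time is 98.0 (the hardware
# counter stands at h = 100), corrected over the next N = 50 ticks:
a, b = skew_coefficients(t_skew=100.0, t_real=98.0, h=100.0, n=50.0)
assert abs((1 + a) * 100.0 + b - 100.0) < 1e-9                    # Tskew equation
assert abs((1 + a) * (100.0 + 50.0) + b - (98.0 + 50.0)) < 1e-9   # after N ticks
```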
8.2.1.1 Cristian's method for synchronizing clocks: One way to achieve
synchronization between computers in a distributed system is for a central timeserver
process S to supply the time according to its clock upon request, as shown in Fig 8.2. The
timeserver computer can be fitted with a suitable receiver so as to be synchronized with
UTC. If a process P requests the time in a message mr, and receives the time value t in a
message mt, then in principle it could set its clock to the time t + Ttrans, where Ttrans
is the time taken to transmit mt from S to P.
Fig 8.2 Clock synchronization using a time server: P sends request mr to the
timeserver S and receives the time in reply mt (diagram not reproduced).
Synchronization between the timeserver and its UTC receiver can be achieved by a method
that is similar to the following procedure for synchronization between computers.
A process P wishing to learn the time from S can record the total round-trip time
Tround taken to send the request mr and receive the reply mt. It can measure this
time with reasonable accuracy if its rate of clock drift is small. For example, the
round-trip time should be in the order of 1-10 milliseconds on a LAN, over which
time a clock with a drift rate of 10^-6 varies by at most 10^-5 milliseconds.
Let the time returned in S's message mt be t. A simple estimate of the time to
which P should set its clock is t + Tround/2, which assumes that the elapsed time is
split equally before and after S placed t in mt. Let the times between sending and
receipt for mr and mt be min + x and min + y, respectively. If the value of min is
known or can be conservatively estimated, then we can determine the accuracy of
this result as follows.
The earliest point at which S could have placed the time in mt was min after P
dispatched mr. The latest point at which it could have done this was min before mt
arrived at P. The time by S's clock when the reply message arrives is therefore in
the range [t + min, t + Tround - min]. The width of this range is Tround - 2min, so the
accuracy is ±(Tround/2 - min).
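The estimate and its accuracy bound fit in a few lines. This is a sketch under the assumptions above (the function name is ours); it returns the clock setting t + Tround/2 together with the half-width of the error interval:

```python
def cristian_estimate(t, t_round, min_delay):
    """Return (estimated time, accuracy bound) per Cristian's method.

    t         -- time value returned by the server in mt
    t_round   -- measured round-trip time Tround
    min_delay -- conservative lower bound on one-way transmission time
    """
    estimate = t + t_round / 2           # assume the delay is split equally
    accuracy = t_round / 2 - min_delay   # half-width of [t+min, t+Tround-min]
    return estimate, accuracy

# Example: server reports t = 1000.0 ms, round trip took 8 ms, min = 1 ms.
est, acc = cristian_estimate(1000.0, 8.0, 1.0)
# est = 1004.0, accurate to within +/- 3.0 ms
```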
Discussion of Cristian's algorithm: As described, Cristian's method suffers from the
problem associated with all services implemented by a single server: the single
timeserver might fail and thus render synchronization temporarily impossible. Cristian
suggested, for this reason, that time should be provided by a group of synchronized
timeservers, each with a receiver for UTC time signals. For example, a client could
multicast its request to all servers, and use only the first reply obtained. Note that a
malfunctioning timeserver that replied with spurious time values, or an impostor
timeserver that replied with deliberately incorrect times, could wreak havoc in a computer
system.
8.2.1.2 The Berkeley algorithm: Gusella and Zatti describe an algorithm for internal
synchronization which they developed for collections of computers running Berkeley
UNIX. In it, a coordinator computer is chosen to act as the master. Unlike in Cristian's
protocol, this computer periodically polls the other computers whose clocks are to be
synchronized, called slaves. The slaves send back their clock values to it. The master
estimates their local clock times by observing the round-trip times (similarly to Cristian's
technique), and it averages the values obtained (including its own clock's reading). The
balance of probabilities is that this average cancels out the individual clocks' tendencies
to run fast or slow. The accuracy of the protocol depends upon a nominal maximum
round-trip time between the master and the slaves. The master eliminates any occasional
readings associated with larger times than this maximum.
Instead of sending the updated current time back to the other computers which
would introduce further uncertainty due to the message transmission time the master
sends the amount by which each individual slaves clock requires adjustment. This can be
a positive or negative value.
The algorithm eliminates readings from clocks that have drifted badly or that have
failed and provide spurious readings. Such clocks could have a significant adverse effect
if an ordinary average were taken. Instead, the master takes a fault-tolerant average. That is, a
subset of clocks is chosen that do not differ from one another by more than a specified
amount, and the average is taken only of readings from these clocks.
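The fault-tolerant average can be sketched as follows. This is a simplified reading of the idea, not Gusella and Zatti's implementation: sort the readings and average the largest run that agrees to within the specified amount.

```python
def fault_tolerant_average(readings, bound):
    """Average the largest subset of clock readings whose members differ
    pairwise by at most `bound`; badly drifted or failed clocks fall
    outside every such subset and are ignored."""
    rs = sorted(readings)
    best = []
    for i in range(len(rs)):
        j = i
        while j < len(rs) and rs[j] - rs[i] <= bound:
            j += 1                    # extend the run of agreeing readings
        if j - i > len(best):
            best = rs[i:j]
    return sum(best) / len(best)

# Two clocks have drifted badly (3.0 and 95.0) and are excluded.
avg = fault_tolerant_average([10.0, 10.2, 9.9, 3.0, 95.0, 10.1], bound=0.5)
# avg is the mean of 9.9, 10.0, 10.1 and 10.2, i.e. 10.05
```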
Gusella and Zatti describe an experiment involving 15 computers whose clocks
were synchronized to within about 20-25 milliseconds using their protocol. The local
clocks' drift rates were measured to be less than 2 × 10^-5, and the maximum round-trip
time was taken to be 10 milliseconds.
8.2.1.3 The Network Time Protocol: The Network Time Protocol (NTP) defines
architecture for a time service and a protocol to distribute time information over a wide
variety of interconnected networks. It has been adopted as a standard for clock
synchronization throughout the Internet.
NTP's chief design aims and features are:
To provide a service enabling clients across the Internet to be
synchronized accurately to UTC, despite the large and variable message
delays encountered in Internet communication. NTP employs statistical
techniques for the filtering of timing data, and it discriminates between the
quality of timing data from different servers.
To provide a reliable service that can survive lengthy losses of
connectivity. There are redundant servers and redundant paths between the
servers. The servers can reconfigure so as still to provide the service if one
of them becomes unreachable.
To enable clients to resynchronize sufficiently frequently to offset the
rates of drift found in most computers. The service is designed to scale to
large numbers of clients and servers.
To provide protection against interference with the time service, whether
malicious or accidental. The time service uses authentication techniques to
check that timing data originate from the claimed trusted sources. It also
validates the return addresses of messages sent to it.
Fig 8.3 An example synchronization subnet in an NTP implementation, with
servers at strata 1, 2 and 3 (diagram not reproduced).
A network of servers located across the Internet provides the NTP service. Primary
servers are directly connected to a time source such as a radio clock receiving UTC;
secondary servers are synchronized, ultimately, to primary servers. The servers are
connected in a logical hierarchy called a synchronization subnet (see Fig 8.3),
whose levels are called strata. Primary servers occupy stratum 1: they are at the
root. Stratum 2 servers are secondary servers that are synchronized directly to the
primary servers; stratum 3 servers are synchronized from stratum 2 servers, and so on.
The lowest-level (leaf) servers execute in users' workstations.
NTP servers synchronize with one another in one of three modes: multicast, procedure-call and symmetric mode.
Multicast mode is intended for use on a high-speed LAN. One or more servers
periodically multicasts the time to the servers running in other computers
connected to the LAN, which set their clocks assuming a small delay. This mode
can only achieve relatively low accuracy, but one which is nonetheless
considered sufficient for many purposes.
Procedure-call mode is similar to the operation of Cristian's algorithm, described
above. In this mode, one server accepts requests from other computers, which it
processes by replying with its timestamp (current clock reading). This mode is
suitable where higher accuracies are required than can be achieved with multicast,
or where multicast is not supported in hardware. For example, file servers on the
same or a neighboring LAN, which need to keep accurate timing information for
file accesses, could contact a local master server in procedure-call mode.
Symmetric mode is intended for use by the master servers that supply time
information in LANs and by the higher levels (lower strata) of the
synchronization subnet, where the highest accuracies are to be achieved. A pair of
servers operating in symmetric mode exchange messages bearing timing
information. Timing data are retained as part of an association between the servers
that is maintained in order to improve the accuracy of their synchronization over
time.
In all modes, messages are delivered unreliably, using the standard UDP Internet
transport protocol. In procedure-call mode and symmetric mode, messages are
exchanged in pairs. Each message bears timestamps of recent message events: the
local times when the previous NTP messages between the pair were sent and
received, and the local time when the current message was transmitted. The
recipient of the NTP message notes the local time when it receives the message.
The four times Ti-3, Ti-2, Ti-1 and Ti are shown in Fig 8.4 for messages m and m'
sent between servers A and B. Note that in symmetric mode, unlike in Cristian's
algorithm, there can be a non-negligible delay between the arrival of one message
and the dispatch of the next. Also, messages may be lost, but the three timestamps
carried by each message are nonetheless valid.
For each pair of messages sent between two servers, the NTP protocol calculates an
offset oi that is an estimate of the actual offset between the two clocks, and a delay
di that is the total transmission time for the two messages. If the true offset of the clock
at B relative to that at A is o, and if the actual transmission times for m and m' are t
and t' respectively, then we have:
    Ti-2 = Ti-3 + t + o and Ti = Ti-1 + t' - o
which leads to:
    di = t + t' = (Ti-2 - Ti-3) + (Ti - Ti-1), and
    oi = ((Ti-2 - Ti-3) + (Ti-1 - Ti))/2
(Fig 8.4, not reproduced, shows m travelling from server A to server B and m' from B
back to A, with the four timestamps Ti-3, Ti-2, Ti-1 and Ti.)
Using the fact that t, t' ≥ 0, it can be shown that oi - di/2 ≤ o ≤ oi + di/2. Thus oi is an
estimate of the offset, and di is a measure of the accuracy of this estimate.
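These formulas translate directly into code. In this sketch (names are ours), t3, t2, t1 and t0 stand for Ti-3, Ti-2, Ti-1 and Ti respectively:

```python
def ntp_offset_delay(t3, t2, t1, t0):
    """Compute (oi, di) from the four NTP timestamps:
    t3 = A sends m, t2 = B receives m, t1 = B sends m', t0 = A receives m'."""
    offset = ((t2 - t3) + (t1 - t0)) / 2   # estimate oi of B's offset from A
    delay = (t2 - t3) + (t0 - t1)          # total transmission time di = t + t'
    return offset, delay

# B's clock runs 5 units ahead of A's; each one-way trip takes 2 units:
# m sent at A-time 100 arrives at B-time 107; m' sent at B-time 108
# arrives at A-time 105.
off, d = ntp_offset_delay(t3=100, t2=107, t1=108, t0=105)
# off = 5.0 and d = 4; the true offset o lies within [off - d/2, off + d/2]
```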
8.3 Logical time and logical clocks:
From the point of view of any single process, events are ordered uniquely by times shown
on the local clock. However, as Lamport pointed out, since we cannot perfectly
synchronize clocks across a distributed system, we cannot in general use physical time to
find out the order of any arbitrary pair of events occurring within it.
The order of events occurring at different processes can be critical in a distributed
application. In general, we can use a scheme which is similar to physical causality, but
which applies in distributed systems, to order some of the events that occur at different
processes. This ordering is based on two simple and intuitively obvious points:
If two events occurred at the same process, then they occurred in the order in
which it observes them.
Whenever a message is sent between processes, the event of sending the message
occurred before the event of receiving the message.
Lamport called the ordering obtained by generalizing these two relationships the
happened-before relation. It is also sometimes known as the relation of causal ordering
or potential causal ordering.
More formally, we write x →p y if two events x and y occurred at a single process p,
and x occurred before y. Using this restricted order we can define the happened-before
relation, denoted by →, as follows:
HB1: If there is a process p such that x →p y, then x → y.
HB2: For any message m, send(m) → rcv(m)
- where send(m) is the event of sending the message, and rcv(m)
is the event of receiving it.
HB3: If x, y and z are events such that x → y and y → z, then x → z.
Fig 8.5 (not reproduced) shows events a and b occurring at process p1, c and d at p2,
and e and f at p3, against physical time; p1 sends message m1 at event b (received at c),
and p2 sends message m2 at event d (received at f).
Fig. 8.6 Logical timestamps for the events shown in Fig 8.5 (diagram not reproduced:
the events bear timestamps a = 1, b = 2, c = 3, d = 4, e = 1, f = 5).
We denote the timestamp of event a at p by Cp(a), and by C(b) we denote the timestamp
of event b at whatever process it occurred.
To capture the happened-before relation , processes update their logical
clocks and transmit the values of their logical clock in messages as follows:
LC1 : Cp is incremented before each event is issued at process p:
Cp := Cp + 1.
LC2 : a) When a process p sends a message m, it piggybacks on m the
value t = Cp.
b) On receiving (m, t), a process q computes Cq := max(Cq, t) and
then applies LC1 before timestamping the event rcv(m).
Although we increment clocks by one, we could have chosen any positive value. It can
easily be shown, by induction on the length of any sequence of events relating two events
a and b, that a → b => C(a) < C(b).
Note that the converse is not true: if C(a) < C(b), then we cannot infer that a → b.
In Fig 8.6 we illustrate the use of logical clocks for the example given in Fig 8.5. Each
of the processes p1, p2 and p3 has its logical clock initialized to 0. The clock values given
are those immediately after the event to which they are adjacent. Note that, for example,
C(b) > C(e) but b || e.
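Rules LC1 and LC2 can be condensed into a small class (a sketch; the class name is ours). Replaying the message pattern of Fig 8.5 reproduces the timestamps of Fig 8.6:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def event(self):        # LC1: increment before each local event
        self.time += 1
        return self.time

    def send(self):         # LC2(a): the returned value is piggybacked on m
        return self.event()

    def receive(self, t):   # LC2(b): Cq := max(Cq, t), then apply LC1
        self.time = max(self.time, t)
        return self.event()

# Replaying Fig 8.5: p1 issues a then sends m1 at b; p2 receives m1 at c
# then sends m2 at d; p3 issues e then receives m2 at f.
p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.event()      # 1
b = p1.send()       # 2, m1 carries t = 2
c = p2.receive(b)   # 3
d = p2.send()       # 4, m2 carries t = 4
e = p3.event()      # 1
f = p3.receive(d)   # 5
```

Note that C(b) = 2 > C(e) = 1 even though b and e are concurrent, illustrating that the converse of a → b => C(a) < C(b) does not hold.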
8.3.2 Totally ordered logical clocks: Logical clocks impose only a partial order on the
set of all events, since some pairs of distinct events, generated by different processes,
have numerically identical timestamps. However, we can extend this to a total order
(one for which all pairs of distinct events are ordered) by taking into account the
identifiers of the processes at which events occur. If a is an event occurring at pa with
local timestamp Ta, and b is an event occurring at pb with local timestamp Tb, we define
the global logical timestamps for these events to be (Ta, pa) and (Tb, pb) respectively, and
we define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb, or Ta = Tb and pa < pb.
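With numeric process identifiers, this definition coincides with lexicographic comparison of pairs, so in a language such as Python the built-in ordering on tuples implements the total order directly:

```python
# Global logical timestamps as (T, pid) pairs; Python compares tuples
# lexicographically, which is exactly the rule defined above.
assert (3, 1) < (4, 2)        # Ta < Tb decides, regardless of pids
assert (3, 1) < (3, 2)        # Ta == Tb, so pa < pb breaks the tie
assert not (3, 2) < (3, 1)    # distinct events are always ordered
```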
8.4 Distributed Coordination:
Distributed processes often need to coordinate their activities. For example, if a collection
of processes shares a resource or collection of resources managed by a server, then
mutual exclusion is often required when accessing them. For the sake of simplicity, we
shall assume that there is only one critical section managed by the server. Recall that the
protocol for executing a critical section is as follows.
enter()    (* enter critical section; block if necessary *)
           (* access shared resources in critical section *)
exit()     (* leave critical section; other processes may now enter *)
Fig 8.7 shows the use of this server. To enter a critical section, a process sends a request
message to the server and awaits a reply from it. Conceptually, the reply constitutes a
token signifying permission to enter the critical section. If no other process has the token
at the time of the request, then the server replies immediately, granting the token. If the
token is currently held by another process, then the server does not reply, but queues the
request.
Fig 8.7 A server managing a mutual exclusion token for a set of processes, holding a
queue of pending requests (diagram not reproduced).
requiring mutual exclusion could be picked to play a dual role as a server). Since the
server must be unique if it is to guarantee mutual exclusion, an election must be called to
choose one of the clients to create or act as the server, and to multicast its address to the
others. We shall describe some election algorithms below. When the new server has been
chosen, it needs to obtain the state of its clients, so that it can process their requests, as
the previous server would have done. The ordering of entry requests will be different in
the new server from that in the failed one unless precautions are taken, but we shall not
address this problem here.
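The server's grant/queue/release logic described above can be sketched in a few lines (the class and method names are ours; message passing is abstracted into direct method calls):

```python
from collections import deque

class TokenServer:
    """Single-token central server: grant immediately if free, else queue."""

    def __init__(self):
        self.holder = None       # process currently holding the token
        self.queue = deque()     # FIFO queue of waiting requesters

    def request(self, pid):
        """Return True if the token is granted at once; False if queued."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)   # no reply yet: the requester blocks
        return False

    def release(self, pid):
        """Release the token; return the process it is granted to, if any."""
        assert self.holder == pid
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder

s = TokenServer()
assert s.request("p1") is True    # token free: granted at once
assert s.request("p2") is False   # queued behind p1
assert s.release("p1") == "p2"    # token passes to the queue head
```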
8.4.1.2 A distributed algorithm using logical clocks: Ricart and Agrawala developed
an algorithm to implement mutual exclusion that is based upon distributed agreement,
instead of using a central server. The basic idea is that processes that require entry to a
critical section multicast a request message, and can enter it only when all the other
processes have replied to this message. The conditions under which a process replies to a
request are designed to ensure that the mutual exclusion conditions ME1-ME3 are met.
Again, we can associate permission to enter with a token; obtaining the token in this
case takes multiple messages. The assumptions of the algorithm are that the processes
p1, ..., pn know one another's addresses, that all messages sent are eventually delivered,
and that each process pi keeps a logical clock, updated according to the rules LC1 and
LC2 above. Messages requesting the token are of the form <T, pi>, where T is the
sender's timestamp and pi is the sender's identifier. For simplicity's sake, we assume
that only one critical section is at issue, and so it does not have to be identified. Each
process records its state of having released the token (RELEASED), wanting the token
(WANTED) or holding the token (HELD) in a variable. The protocol is given in Fig 8.8.
Fig.8.8 Ricart and Agrawala's algorithm
On initialization:
State := RELEASED;
To enter the critical section at pj:
State := WANTED;
Multicast request <T, pj> to all processes;  (* T is pj's request timestamp *)
Wait until (number of replies received = n - 1);
State := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j):
If (State = HELD or (State = WANTED and (T, pj) < (Ti, pi)))
Then
Queue request from pi without replying;
Else
Reply immediately to pi;
End if
To release the token:
State := RELEASED;
Reply to any queued requests;
If a process requests the token while the state is RELEASED everywhere else (that is, no
other process wants it), then all processes will reply immediately to the request and the
requester will obtain the token. If the token is HELD at some process, then that process
will not reply to requests until it has finished with the token, and so the requester cannot
obtain the token in the meantime. If two or more processes request the token at the same
time, then whichever process's request bears the lowest timestamp will be the first to
collect n - 1 replies, granting it the token next. If the requests bear equal timestamps, the
process identifiers are compared to order them. Note that, when a process requests the
token, it defers processing requests from other processes until its own request has been
sent and the timestamp T is known. This is so that processes make consistent decisions
when processing requests.
To illustrate the algorithm, consider a situation involving three processes, p1, p2
and p3, shown in Fig 8.9. Let us assume that p3 is not interested in the token, and that p1
and p2 request it concurrently. The timestamp of p1's request is 41, that of p2's is 34. When
p3 receives their requests, it replies immediately. When p2 receives p1's request, it finds that
its own request has the lower timestamp, and so does not reply, holding p1 off. However, p1
finds that p2's request has a lower timestamp than that of its own request, and so replies
immediately. On receiving this second reply, p2 possesses the token. When p2 releases the
token, it will reply to p1's request, and so grant it the token.
Obtaining the token takes 2(n - 1) messages in this algorithm: n - 1 to multicast the
request, followed by n - 1 replies. Or, if there is hardware support for multicast, only one
message is required for the request; the total is then n messages. It is thus a considerably
more expensive algorithm, in general, than the central server algorithm just described.
Note also that, while it is a fully distributed algorithm, the failure of any process involved
would make progress impossible. And in the distributed algorithm all the processes
involved receive and process every request, so no performance gain has been made over
the single server bottleneck, which does just the same. Finally, note that a process that
wishes to obtain the token and which was the last to possess it still goes through the
protocol as described, even though it could simply decide locally to reallocate it to itself.
Ricart and Agrawala refined this protocol so that it requires n messages to obtain the
token in the worst (and common) case, without hardware support for multicast.
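The heart of the algorithm is the decision of whether to defer a reply. A sketch of that decision (names are ours), using tuple comparison for the <T, p> ordering, reproduces the example above:

```python
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def should_defer(state, own_request, incoming):
    """Queue the incoming request <Ti, pi> iff we hold the token, or we
    want it and our own request <T, pj> is ordered first."""
    return state == HELD or (state == WANTED and own_request < incoming)

# p2 (timestamp 34) receives p1's request (timestamp 41): p2 defers.
assert should_defer(WANTED, own_request=(34, "p2"), incoming=(41, "p1"))
# p1 (timestamp 41) receives p2's request (timestamp 34): p1 replies at once.
assert not should_defer(WANTED, own_request=(41, "p1"), incoming=(34, "p2"))
# p3 is RELEASED, so it replies immediately to everyone.
assert not should_defer(RELEASED, None, (41, "p1"))
```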
8.4.1.3 A ring-based algorithm: One of the simplest ways to arrange mutual exclusion
between n processes p1, ..., pn is to arrange them in a logical ring. The idea is that exclusion
is conferred by obtaining a token in the form of a message passed from process to process
in a single direction (clockwise, say) round the ring. The ring topology, which is
unrelated to the physical interconnections between the underlying computers, is created
by giving each process the address of its neighbor. A process that requires the token waits
until it receives it, and then retains it while in the critical section. To exit the critical
section, the process sends the token on to its neighbor.
Fig 8.9 Multicast synchronization (diagram not reproduced): p1 and p2 multicast
requests with timestamps 41 and 34 respectively; p3 replies to both, and p1 replies
to p2's lower-timestamped request.
Fig. 8.10 A ring of processes transferring a mutual exclusion token (diagram not
reproduced).
If the process holding the token fails, then an election is required to pick a unique process
from the surviving members, which will regenerate the token and transmit it as before.
Care has to be taken in ensuring that the failed process really has failed, and
does not later unexpectedly inject the old token into the ring, so that there are two tokens.
This situation can arise since process failure can only be ascertained by repeated failure
of the process to acknowledge messages sent to it.
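The token's circulation can be illustrated with a toy single-threaded simulation (ours, not part of the text), in which a process wanting the critical section uses the token for one hop and then forwards it:

```python
def circulate(n, wants, steps):
    """Simulate `steps` token hops around a ring of n processes.
    `wants` maps a process index to True while it needs the critical
    section; returns the order in which processes entered it."""
    entered = []
    holder = 0                             # the token starts at process 0
    for _ in range(steps):
        if wants.get(holder):
            entered.append(holder)         # enter the critical section
            wants[holder] = False          # done: release the token
        holder = (holder + 1) % n          # token moves clockwise
    return entered

# Processes 0 and 2 want the critical section; the token visits them in
# ring order, so no two are ever inside at once.
assert circulate(4, {0: True, 2: True}, steps=8) == [0, 2]
```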
8.5 Elections: An election is a procedure carried out to choose a process from a group,
for example to take over the role of a process that has failed. The main requirement is for
the choice of elected process to be unique, even if several processes call elections
concurrently.
8.5.1 The bully algorithm: The bully algorithm can be used when the members of the
group know the identities and addresses of the other members. The algorithm selects the
surviving member with the largest identifier to function as the coordinator. We assume
that communication is reliable, but that processes may fail during an election. The
algorithm proceeds as follows.
There are three types of messages in this algorithm.
An election message is sent to announce an election
An answer message is sent in response to an election message
A coordinator message is sent to announce the identity of the new coordinator.
A process begins an election when it notices that the coordinator has failed. To begin an
election, a process sends an election message to those processes that have a higher
identifier. It then awaits an answer message in response. If none arrives within a certain
time, the process considers itself the coordinator, and sends a coordinator message to all
processes with lower identifiers announcing this fact. Otherwise, the process waits a
further limited period for a coordinator message to arrive from the new coordinator. If
none arrives, it begins another election.
If a process receives a coordinator message, it records the identifier of the
coordinator contained within it, and treats that process as the coordinator. If a process
receives an election message, it sends back an answer message, and begins another
election unless it has begun one already.
When a failed process is restarted, it begins an election. If it has the highest
process identifier, then it will decide that it is the coordinator, and announce this to the
other processes. Thus it will become the coordinator, even though the current coordinator
is functioning. It is for this reason that the algorithm is called the bully algorithm. The
operation of the algorithm is shown in Fig 8.11. There are four processes, p1-p4, and an
election is called when p1 detects the failure of the coordinator, p4, and announces an
election (stage 1 in the figure). On receiving an election message from p1, p2 and p3 send
answer messages to p1 and begin their own elections; p3 sends an answer message to p2,
but p3 receives no answer message from the failed process p4 (stage 2). It therefore decides
that it is the coordinator. But before it can send out the coordinator message, it too fails
(stage 3). When p1's timeout period expires (which we assume occurs before p2's timeout
expires), it notices the absence of a coordinator message and begins another election.
Eventually, p2 is elected coordinator (stage 4).
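The essence of the election step (silence from higher-numbered processes lets the caller win; any answer means a live higher process takes over) can be caricatured in one function. This is entirely our sketch: real message passing, timeouts and the restart case are elided, and liveness is modelled as a set:

```python
def bully_elect(pid, all_pids, alive):
    """Return the coordinator elected from pid's point of view: the
    highest identifier among the processes currently answering."""
    higher = [p for p in all_pids if p > pid and p in alive]
    if not higher:
        return pid                 # no answer arrives: pid declares itself
    # Some higher process answered and takes over the election; it (or a
    # still-higher live process) eventually announces itself coordinator.
    return max(p for p in alive)

pids = [1, 2, 3, 4]
assert bully_elect(1, pids, alive={1, 2, 3}) == 3   # p4 has failed: p3 wins
assert bully_elect(3, pids, alive={1, 2, 3}) == 3   # p3 hears no answer
assert bully_elect(4, pids, alive={1, 2, 3, 4}) == 4
```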
8.5.2 A ring-based election algorithm: We give the algorithm of Chang and Roberts,
suitable for a collection of processes that are arranged in a logical ring (see Fig 8.12).
We assume that the processes do not know the identities of the others a priori, and that
each process knows only how to communicate with its neighbor in, say, the clockwise
direction. The goal of this algorithm is to elect a single coordinator, which is the process
with the largest identifier. The algorithm assumes that all the processes remain functional
and reachable during its operation. Initially, every process is marked as a non-participant
in an election. Any process can begin an election. It proceeds by marking itself as a
participant, placing its identifier in an election message and sending it to its neighbor.
When a process receives an election message, it compares the identifier in the message
with its own. If the arrived identifier is the greater, then it forwards the message to its
neighbor. If the arrived identifier is smaller and the receiver is not a participant, then it
substitutes its own identifier in the message and forwards it; but it does not forward the
message if it is already a participant. On forwarding an election message in any case, the
process marks itself as a participant. If, however, the received identifier is that of the
receiver itself, then this process's identifier must be the greatest, and it becomes the
coordinator. The coordinator marks itself as a non-participant and sends an elected
message to its neighbor, announcing its election.
Fig. 8.11 The bully algorithm: The election of coordinator p2, after the failure of p4
and then p3.
Fig 8.12 A ring-based election in progress among processes with identifiers 3, 17, 4,
24, 15 and 28 (diagram not reproduced).
NOTE: The election was started by process 17. The highest process identifier
encountered so far is 24. Participant processes are shown darkened.
The point of marking processes as participant or non-participant is so that
messages arising when another process starts an election at the same time are extinguished
as soon as possible, and always before the winning election result has been announced.
If only a single process starts an election, then the worst case is when its
anticlockwise neighbor has the highest identifier. A total of n - 1 messages are then
required to reach this neighbor, which will not announce its election until its identifier has
completed another circuit, taking a further n messages. The elected message is then sent
n times, making 3n - 1 messages in all.
An example of a ring-based election in progress is shown in Fig 8.12. The
election message currently contains 24, but process 28 will replace this with its identifier
when the message reaches it.
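The forwarding rule can be checked with a small simulation (our sketch; participant flags and the final elected-message circuit are omitted, so only the winner is computed):

```python
def ring_elect(ids, starter):
    """Simulate a Chang-Roberts election. ids[i] is the identifier of the
    process whose clockwise neighbor is ids[(i + 1) % n]; `starter`
    begins the election. Returns the winning identifier."""
    n = len(ids)
    pos = ids.index(starter)
    msg = starter                           # starter places its id in the message
    while True:
        pos = (pos + 1) % n                 # message travels to the neighbor
        if msg == ids[pos]:
            return msg                      # own id came back: coordinator
        msg = max(msg, ids[pos])            # forward the greater identifier

# A ring like that of Fig 8.12, started by process 17: 28 wins.
assert ring_elect([3, 17, 4, 24, 15, 28], starter=17) == 28
```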
Check your Progress 1: Answer the following: -
1. Define Physical time and Logical time.
2. Define clock drift.
3. Define coordinated universal time.
4. List the design features of Network time Protocol.
5. State Happened-before relation.
6. Define Distributed mutual exclusion principle.
7. Name a distributed algorithm that uses the concept of logical clocks.
8. What is the ring-based algorithm used for?
9. What are the three types of the messages used in bully algorithm?
10. What do you mean by distributed coordination?
Check your Progress 2: Answer the following: -
1. Explain the necessity of clock synchronization.
2. Describe Cristian's method for synchronizing clocks.
3. Explain the Berkeley algorithm.
4. What is the need of the concept of logical clock? Explain Happened before
principle.
5. Discuss and compare the different algorithms used for maintaining mutual
exclusion in Distributed transaction.
6. Explain Election algorithms.
8.6 Summary:
In this unit first we have discussed the importance of accurate time keeping for
distributed systems. It then described algorithms for synchronizing clocks despite the
drift between them and the variability of message delays between computers.
The degree of synchronization accuracy that is practically obtainable fulfils many
requirements, but is nonetheless not sufficient to determine the ordering of an arbitrary
pair of events occurring at different computers. The happened-before relationship is a
partial order on events, which reflects a flow of information within a process, or via
messages between processes between them. Some algorithms require events to be
ordered in happened-before order, for example, successive updates made at separate
copies of data. Logical clocks are counters that are updated so as to reflect the happened-
before relationship between them.
The unit then described the need for processes to access shared resources under
conditions of mutual exclusion. Resource servers do not in all cases implement locks, and
a separate distributed mutual exclusion service is then required. Three algorithms that
achieve mutual exclusion were considered: the central server, a distributed algorithm
using logical clocks, and a ring-based algorithm. These are heavyweight mechanisms that
cannot withstand failure, although they can be modified to be fault-tolerant. On the
whole, it seems advisable to integrate locking with resource management.
Finally, the unit considered the bully algorithm and a ring-based algorithm whose
common aim is to elect a process uniquely from a given set, even if several elections
take place concurrently. These algorithms could be used, for example, to elect a new
master timeserver, or a new lock server, when the previous one fails.
REFERENCES