
Distributed Databases

Chapter 1
An Overview

Reference: Stefano Ceri and Giuseppe Pelagatti, Distributed Databases: Principles and Systems

Outline
Introduction.
Distributed database definition.
Centralized vs Distributed DB Features.
Why Distributed Databases?
Distributed Database Management Systems.

Centralized vs Distributed DB

Centralized
Located and maintained in one location (processor).
Pros: all data is located in one place.
Cons:
- A bottleneck may occur.
- Single point of failure.

Distributed
Collection of data that belongs to the same system but is spread over the sites of a computer network.
Emphasizes
1. Distribution: the data is not at the same site (processor).
2. Logical correlation: some properties tie the data together (a vague notion).
Belong to DDB or not?

A bank with geographically dispersed branches.
Local applications access only the DB of one branch.
DDB or not? Yes, the data is logically correlated.
Main characteristic of a DDB: global (distributed) applications.

Belong to DDB or not?

Same bank, but with the computers in the same building.
Locality is defined not with respect to geographical distribution but with respect to each computer having its own DB.
Global applications still exist.
The increased throughput and reliability of such a DDB change how some problems are handled.

Belong to DDB or not?

Data is distributed over 3 back-end computers that perform the DB management functions; applications are executed by a different computer.
Why not?
Database distribution is not relevant from the application viewpoint.
There are no local applications.
It is an integrated system in which no single computer can execute an application by itself.

Distributed Databases Definition

A DDB is a collection of data distributed over the different computers of a computer network.
Each site
Has autonomous processing capability.
Can perform local applications.
Participates in the execution of at least one global application, which requires accessing data at several sites using a communication subsystem.
Cooperation between autonomous sites is the main technological problem.

Centralized vs Distributed DB Features

Centralized control
Data independence
Reduction of redundancy
Complex physical structures for efficient access
Integrity, recovery & concurrency control
Privacy & security

Centralized Control
Provides centralized control over the information resources of a whole enterprise or organization.

In DDB
Depends on the architecture (Example 1.2 lends itself to centralized control more than Example 1.1).
A hierarchical control structure can be identified:
Global Database Administrator (DBA)
Has central responsibility for the whole DB.
Local DBAs
Are responsible for their local DBs.
Can have a high degree of autonomy (site autonomy).
Perform intersite coordination.

Site autonomy varies from complete autonomy, with no centralized DBA, to completely centralized control.

Data Independence
The actual organization of data is transparent to the application programmer.
Programs are written with a conceptual view of the data (the conceptual schema) and are unaffected by changes in the physical organization of the data.
In Traditional DB
Multilevel architecture with different descriptions of data and mappings between them.
Conceptual, storage and external schemata were developed for this purpose.

In DDB
Same importance as in traditional DB.
Introduces distribution transparency:
Programs can be written as if the database were not distributed.
Correctness of programs is unaffected by moving data from one site to another, while their speed of execution is affected.
Obtained by introducing new levels and schemata (Ch. 3).

Reduction of redundancy
In Traditional DB
Redundancy is reduced by data sharing (several applications access the same files and records) in order to
1. Avoid inconsistencies among several copies of the same logical data
2. Save storage space
In DDB
Data redundancy is desirable in order to
1. Increase the locality of applications, if data is replicated at all sites where applications need it
2. Increase the availability of the system, since a site failure does not stop the execution of applications at other sites
The reasons against redundancy given for the traditional DB still hold.
The convenience of replicating a data item increases with the ratio of retrieval accesses (which can use any copy) to update accesses (which must reach all copies) performed by applications on it (Ch. 4, DDB design).
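The retrieval/update trade-off can be illustrated with a toy cost model. This is only a sketch under assumptions that are not in the source: every site issues the same number of retrievals and updates, a retrieval is free when a copy is local and costs one remote access otherwise, and an update costs one remote access per non-local copy. The function name `remote_cost` is hypothetical.

```python
def remote_cost(retrievals, updates, copies, sites):
    """Remote accesses for one data item under the assumed toy model.

    retrievals, updates: accesses issued by each site.
    copies: number of sites that hold a copy (1 <= copies <= sites).
    """
    # Sites without a copy must go remote for every retrieval.
    retrieval_cost = retrievals * (sites - copies)
    # Each update must reach every copy; summed over all issuing sites this
    # gives copies * (sites - 1) remote accesses per update round.
    update_cost = updates * copies * (sites - 1)
    return retrieval_cost + update_cost


SITES = 4
for retrievals, updates in [(100, 1), (10, 10)]:
    single = remote_cost(retrievals, updates, copies=1, sites=SITES)
    full = remote_cost(retrievals, updates, copies=SITES, sites=SITES)
    print(f"retrievals={retrievals} updates={updates} "
          f"single copy={single} full replication={full}")
```

With a high retrieval-to-update ratio (100:1) full replication is far cheaper in this model; with a ratio of 1:1 the single copy wins.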

Complex physical structures & efficient access

In Traditional DB
Secondary indexes, interfile chains & others.
Supporting them is an important part of a DBMS.
Used to obtain efficient access to data.

In DDB
Not the right tool for efficient access.
Efficient access cannot be provided by such structures because
1. It is very difficult to build and maintain them across sites.
2. It is not convenient to navigate at record level in a DDB.

Navigation example
Find all PART records supplied by supplier S1.
The application runs from site 1.

Find SUPPLIER record with SUP# = S1;
Repeat until no more members in set
    Find next PART record in SUPPLIER-PART set;
    Output PART record;

(b) Codasyl-DBMS-like program

CODASYL: Committee on Data Systems Languages

Navigation example
More efficient implementation (grouping the remote accesses)

Distributed access plan: states how the data must be accessed, as a navigational program does in a centralized DB.
Steps
1. Execution of programs that are local to a single site
2. Transmission of files between sites
Can be written by the programmer or produced automatically by an optimizer.
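The gain from grouping can be sketched by counting messages. This is a minimal sketch under assumptions not stated in the source: the SUPPLIER and PART records are stored at a remote site, and every remote primitive costs one request message and one reply message. Both function names are hypothetical.

```python
def messages_record_at_a_time(num_parts):
    """Messages when each navigational primitive is shipped separately."""
    # One "find SUPPLIER", one "find next PART" per part, and a final
    # "find next PART" that reports the end of the set (assumption).
    remote_calls = 1 + num_parts + 1
    # Each remote call is a request plus a reply.
    return 2 * remote_calls


def messages_grouped(num_parts):
    """Messages when the whole retrieval runs at the remote site."""
    # One request carrying the query, one reply carrying all PART records,
    # regardless of how many parts there are.
    return 2


for n in (10, 100):
    print(n, messages_record_at_a_time(n), messages_grouped(n))
```

The grouped plan sends larger messages but a constant number of them, which matters whenever each message carries a fixed cost.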

Optimizer design problems

Categories

Global optimization
Which data must be accessed at which site & which data files must consequently be transmitted between sites.
Optimization parameters:
Communication cost
Cost of accessing the local DBs
The importance of these factors depends on the relation between communication cost & disk access cost, which depends on the communication network.
Research here aids in understanding how a DDB can be efficiently accessed even if access plans are not produced automatically.

Local optimization
How to perform the local DB access at each site.
Typical of traditional, non-distributed DB problems (not considered here).

Integrity, recovery and concurrency control

Strongly correlated issues.
Solution: providing transactions.
Transaction
Definition: an atomic unit of execution, i.e., a set of operations performed either entirely or not at all.
Example: funds transfer (debit & credit).
Problem: the debit is at an operational site & the credit at a non-operational site.
How to act? Abort the transaction, or find a smart way to execute the transfer even if the sites are not operating simultaneously?

Enemies of transaction atomicity
Failures
Concurrency
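The "abort" option for the funds transfer can be sketched as follows. This is a deliberately simplified illustration, not a recovery protocol: the `Site` class, the `SiteDown` exception and the undo step are hypothetical, and the sketch assumes the debit site stays operational long enough to undo the debit.

```python
class SiteDown(Exception):
    """Hypothetical signal that a site cannot be reached."""


class Site:
    def __init__(self, balance, operational=True):
        self.balance = balance
        self.operational = operational

    def add(self, amount):
        if not self.operational:
            raise SiteDown()
        self.balance += amount


def transfer(debit_site, credit_site, amount):
    """Debit one site and credit another, entirely or not at all."""
    try:
        debit_site.add(-amount)
    except SiteDown:
        return False  # nothing has been changed yet
    try:
        credit_site.add(amount)
    except SiteDown:
        # Abort: undo the debit so the initial consistent state is preserved.
        debit_site.add(amount)
        return False
    return True


a, b = Site(100), Site(50, operational=False)
print(transfer(a, b, 30), a.balance, b.balance)
```

Because the credit site is down, the transfer is aborted and both balances keep their initial values.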

Integrity, recovery and concurrency control

DB integrity
Transaction atomicity assures DB integrity: either all actions that transform the DB from one consistent state to another are performed, or the initial consistent state is preserved.

Recovery: deals with preserving transaction atomicity in the presence of failures.
Problems (Ch. 9)

Concurrency control: deals with ensuring transaction atomicity in the presence of concurrent execution of transactions.
Problems: synchronization is harder in a DDB than in a centralized DB (Ch. 8)

Privacy and Security

In Traditional centralized DB
The DBA has centralized control.
The DBA ensures that only authorized access is performed.
Without specialized control procedures, a centralized DB is more vulnerable to privacy & security violations than older approaches based on separate files.
In DDB
Local DBAs face the same problems as the DBA of a traditional DB.
In a DDB with a very high degree of site autonomy, local data is better protected, because local DBAs enforce their own protection instead of depending on a central DBA.
Communication networks represent a weak point with respect to protection.
Problems of privacy & security (Ch. 10)

Why Distributed Databases?

1. Organizational and economic reasons
Many organizations are decentralized, and a DDB fits their structure.
The economy-of-scale motivation for having large centralized computer centers is becoming questionable.
2. Interconnection of existing DBs
Needed when several DBs already exist in an organization and global applications must be performed.
Creating the DDB bottom-up from the existing local DBs requires less effort than creating a completely new centralized DB.
3. Incremental growth
Supports adding new, relatively autonomous branches to an organization.
With a centralized approach, either
the initial design must allow for future expansion, which is difficult & expensive, or
the growth has a major impact on existing applications.
4. Reduced communication overhead
With respect to a centralized DB, as in Example 1.1.
Maximizing the locality of applications is one of the primary objectives in DDB design.

Why Distributed Databases?

5. Performance considerations
Several autonomous processors are available.
A high degree of parallelism increases performance.
In a DDB the decomposition of data reflects application-dependent criteria that maximize application locality, so mutual interference between different processors is minimized.
Load is shared between the different processors.
Critical bottlenecks, such as the communication network itself or common services of the whole system, are avoided.
6. Reliability and availability
The autonomous processing capability of sites does not guarantee higher reliability, but it ensures
Graceful degradation: failures in a DDB can be more frequent than in a centralized DB because of the greater number of components, but each failure affects only the applications that use the failed site; a complete system crash is rare (techniques for building a reliable DDB: Ch. 9).

Why Distributed Databases?

Why did DDB development begin?
1. Small computers, rather than large mainframes, provide the necessary hardware.
2. DDB development depends on computer network & database technologies, which are now sufficiently developed.

Distributed Database Management Systems (DDBMSs)

Support the creation and maintenance of DDBs.
Commercially available distributed systems are developed by the vendors of centralized DBMSs.
A DDBMS extends a centralized DBMS by supporting communication & cooperation between several instances of the DBMS installed at the local sites of a computer network.
Software components of a DDBMS
1. DB management component (DB)
2. Data communication component (DC)
3. Data dictionary (DD), which includes information about the distribution of data in the network
4. Specialized distributed DB component (DDB)

- DBMS will refer to (DB, DC, DD)
- DDBMS will refer to (DB, DC, DD, DDB)

Distributed Database Management Systems (DDBMSs)

Services provided by this type of system:
Remote DB access by an application program
Some degree of distribution transparency
Support for database administration & control
Some support for concurrency control & recovery of distributed transactions

Distributed Database Management Systems (DDBMSs)

A DDBMS can give an application access to a remote DB through DB access primitives.
Units shipped between the systems:
1. The DB access primitive
2. The result obtained by executing it
Assures distribution transparency.

Distributed Database Management Systems (DDBMSs)

The application can instead require the execution of an auxiliary program at the remote site, which
1. Accesses the remote DB
2. Returns the result to the requesting application
Efficient if many DB accesses are required, because the auxiliary program performs all of them and sends only the result back.

Homogeneity and Heterogeneity of DDBMSs

Heterogeneity can concern
Hardware
Operating system
Local DBMSs
Differences in hardware and operating system are managed by the communication software.

Homogeneous DDBMS:
DDBMS with the same DBMS at each site.
Preferred for top-down development of a DDB without a preexisting system.
Heterogeneous DDBMS:
Uses at least two different DBMSs.
Adds the problem of translating between the different data models of the DBMSs (Ch. 15).
Used when integrating preexisting DBs.
Actual systems support some degree of heterogeneity, with no translation between different data models.
Some systems support communication between different DC components (mainly developed for compatibility reasons in centralized systems), as in DBMSs produced for running on IBM computers (Ch. 11).

Distributed Databases
Chapter 2
Review of Databases and
Computer Networks
Reference: Stefano Ceri and Giuseppe Pelagatti, Distributed Databases: Principles and Systems

Outline
Concepts and notations needed.
Review of Databases.
Review of computer Networks.

Review of Databases
Data model: relational
Allows powerful, set-oriented, associative expressions instead of one-record-at-a-time primitives.
Data manipulation languages:
Relational algebra: describes & manipulates access strategies to the DDB.
SQL: used for writing application programs for the DDB.
Review of
Relational model
Database applications, programs & transactions.

Relational model
Relations: tables.
Attributes: columns.
Tuples: rows.
Grade: number of attributes of a relation.
Cardinality: number of tuples.
Domain: set of possible values for a given attribute.
Rules:
No two identical tuples in the same relation.
No defined order of the tuples of a relation.
The position of each column in a relation is disregarded.
Some attributes form the key (single or composite).

Relational algebra
Collection of operations on relations; each takes one or more relations as operands and produces one relation as its result.
Operations can be composed into arbitrarily complex expressions.

Operations

Unary
Take only one relation as operand.
Operators:
Selection
Projection

Binary
Take two relations as operands.
Operators:
Union
Difference
Cartesian product
Join
Semi-join

Relational algebra
Unary Operations
Selection SL_F R
R: operand to which the selection is applied.
F: formula that expresses the selection condition.
Projection PJ_Attr R
Attr: denotes a subset of the attributes of the operand relation.
Duplicate tuples are eliminated.
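The two unary operations can be sketched in a few lines. This is a minimal illustration that represents a relation as a list of dictionaries; the function names, the SUPPLIER attributes and the sample tuples are assumptions made for the example.

```python
def select(relation, predicate):
    """SL_F R: keep the tuples of R that satisfy formula F."""
    return [t for t in relation if predicate(t)]


def project(relation, attrs):
    """PJ_Attr R: keep only the listed attributes; duplicates are eliminated."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attrs}
        if row not in result:
            result.append(row)
    return result


# Hypothetical sample relation.
SUPPLIER = [
    {"SNUM": "S1", "NAME": "A", "CITY": "SF"},
    {"SNUM": "S2", "NAME": "B", "CITY": "LA"},
    {"SNUM": "S3", "NAME": "C", "CITY": "SF"},
]

in_sf = select(SUPPLIER, lambda t: t["CITY"] == "SF")
cities = project(SUPPLIER, ["CITY"])
print(in_sf)
print(cities)
```

The projection on CITY returns two tuples, not three, because the duplicate "SF" is eliminated.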

Relational algebra
Binary Operations
Union R UN S
R, S: relations.
Union of the tuples of R and S (all tuples appearing in R, in S, or in both).
UN(R1, R2, R3, ..., Rn) = R1 UN R2 UN R3 ... UN Rn
Difference R DF S
The difference between the tuples of R and S (all tuples in R but not in S).

Relational algebra
Binary Operations
Cartesian Product R CP S
R, S: relations.
Every tuple of R is combined with every tuple of S to form one tuple of the result.
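Union, difference and Cartesian product map directly onto set operations. The sketch below represents a relation as a Python set of tuples, which is an assumption of the example (it also gives "no identical tuples" for free); the function names and sample data are hypothetical.

```python
from itertools import product


def union(r, s):
    """R UN S: tuples appearing in R, in S, or in both."""
    return r | s


def difference(r, s):
    """R DF S: tuples in R but not in S."""
    return r - s


def cartesian_product(r, s):
    """R CP S: every tuple of R concatenated with every tuple of S."""
    return {tr + ts for tr, ts in product(r, s)}


R = {("a", 1), ("b", 2)}
S = {("b", 2), ("c", 3)}

print(sorted(union(R, S)))
print(sorted(difference(R, S)))
print(sorted(cartesian_product(R, S)))
```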

Relational algebra
Binary Operations
Join R JN_F S
F: formula that specifies the join condition.
Equi-join: a join in which only equality appears in F.
Join is derived from selection and Cartesian product:
R JN_F S = SL_F (R CP S)
Natural join R NJN S
Equi-join in which all attributes with the same name in the two relations are compared.
Since the compared attributes have the same name and value, one of the two is omitted from the result.
Semi-join R SJ_F S
F: formula that specifies the join condition.
Derived from projection and join:
R SJ_F S = PJ_Attr(R) (R JN_F S)
where Attr(R) is the set of all attributes of R.

Relational algebra
Binary Operations
Natural semi-join R NSJ S
Semi-join that uses the same join condition as the natural join.
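The derivations of join (selection over a Cartesian product) and semi-join (projection of a join onto the attributes of R) can be followed literally in code. This is a minimal sketch, not an efficient implementation; qualifying attribute names as "relation.attribute", the function names and the sample relations are assumptions of the example.

```python
def cartesian_product(r, s, r_name, s_name):
    """R CP S, with attributes qualified as 'relation.attribute'."""
    return [
        {**{f"{r_name}.{a}": v for a, v in tr.items()},
         **{f"{s_name}.{a}": v for a, v in ts.items()}}
        for tr in r for ts in s
    ]


def join(r, s, r_name, s_name, predicate):
    """R JN_F S = SL_F (R CP S)."""
    return [t for t in cartesian_product(r, s, r_name, s_name) if predicate(t)]


def semi_join(r, s, r_name, s_name, predicate):
    """R SJ_F S = PJ_Attr(R) (R JN_F S), with duplicates eliminated."""
    result = []
    for t in join(r, s, r_name, s_name, predicate):
        row = {a: v for a, v in t.items() if a.startswith(r_name + ".")}
        if row not in result:
            result.append(row)
    return result


# Hypothetical sample relations.
SUPPLIER = [{"SNUM": "S1", "CITY": "SF"}, {"SNUM": "S2", "CITY": "LA"}]
SUPPLY = [{"SNUM": "S1", "PNUM": "P1"}, {"SNUM": "S1", "PNUM": "P2"}]


def same_snum(t):
    """Equi-join condition on the supplier number."""
    return t["SUPPLIER.SNUM"] == t["SUPPLY.SNUM"]


joined = join(SUPPLIER, SUPPLY, "SUPPLIER", "SUPPLY", same_snum)
suppliers_with_supply = semi_join(SUPPLIER, SUPPLY, "SUPPLIER", "SUPPLY", same_snum)
print(joined)
print(suppliers_with_supply)
```

The join yields two tuples for S1 (one per supplied part), while the semi-join yields S1 once: it keeps only the attributes of SUPPLIER and eliminates the duplicate.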

SQL
Simple Statement
Select [attribute list]
From [relation name]
Where [predicate]
Example
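A simple statement of this form can be tried with Python's built-in sqlite3 module. This is a minimal sketch; the SUPPLIER table, its columns and its rows are hypothetical sample data, not taken from the source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SUPPLIER (SNUM TEXT PRIMARY KEY, NAME TEXT, CITY TEXT)"
)
conn.executemany(
    "INSERT INTO SUPPLIER VALUES (?, ?, ?)",
    [("S1", "Smith", "SF"), ("S2", "Jones", "LA")],
)

# Select [attribute list] From [relation name] Where [predicate]
rows = conn.execute(
    "SELECT NAME FROM SUPPLIER WHERE CITY = ?", ("SF",)
).fetchall()
print(rows)
conn.close()
```

In relational algebra terms the statement is a selection on CITY followed by a projection on NAME.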

Database applications, programs & transactions

Application (denotes a function requested by a user)
Sequence of operations requested by an end user (not a programmer) with a single activation request.
Can be online (the user makes a request & gets a response in a short time) or batch (requested by an operator).
Online applications can be classified as
Simple application: receives one input message & produces one output message.
Conversational application: exchanges several messages with the user.
Read-only application: reads data from the DB without updating it.

Transaction
Atomic unit of DB access, either executed entirely or not at all.

Program (denotes the implementation of an application in a programming language)
Unit with its own address space that communicates with other programs through messages & synchronization primitives.

Query (denotes a DB request)
Expression in a suitable language that defines a portion of the data contained in the DB.

Review of Computer Network

Model
A communication network connecting hosts (several computers).
Communication links: coaxial cables, satellite links, telephone lines, etc.
Facility provided
A process running at any site can send a message to a process running at any other site of the network.

Parameters to be considered
Delay in delivering a message to its destination
- Heavy usage increases the delay
- Queuing analysis of the messages is required to evaluate the delay
Cost of transmitting a message
- A fixed cost associated with each message + a cost proportional to the message length
Reliability of the network
- Probability that a message is correctly delivered
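The cost parameter (a fixed part per message plus a part proportional to its length) can be written as a one-line function. This is a minimal sketch; the function name and the numeric values are hypothetical and only show why the fixed part favors fewer, larger messages.

```python
def transmission_cost(length, fixed_cost, cost_per_unit):
    """Cost of one message: fixed part + part proportional to its length."""
    return fixed_cost + cost_per_unit * length


# Assumed unit costs, chosen only for illustration.
FIXED, PER_UNIT = 50, 1

one_large = transmission_cost(1000, FIXED, PER_UNIT)
ten_small = 10 * transmission_cost(100, FIXED, PER_UNIT)
print(one_large, ten_small)
```

Sending the same 1000 units of data as ten messages pays the fixed cost ten times.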

Message Broadcasting

Review of Computer Network


Types of Communication network & Network Topologies

Review of Computer Network

Protocols & Sessions:
Protocols:
Rules followed by two or more processes in order to communicate.
Agreement between sender & receiver on how to exchange messages:
How they recognize and identify each other
How many messages are exchanged
Whether messages need an answer or not, etc.
Sessions:
Established between two processes that want to communicate & held until all necessary messages have been exchanged.
Closing a session is similar to hanging up a phone call.
A session is a high-level facility provided by the communication network & does not imply the existence of a direct physical connection.
For irregular patterns of message exchange, datagram messages are sent instead of establishing a session.

Review of Computer Network


ISO/OSI Reference architecture
(International Standards Organization / Open Systems Interconnection)

Each layer offers a virtual communication facility to the layer above it.
Application layer: a DDB is a particular application.
Presentation layer: conversion of information between different forms of data representation.
Session layer: establishing and maintaining sessions between processes.
Transport layer: true source-to-destination delivery of messages (which may be broken up or compacted).
Network, data-link & physical layers: used by the transport layer to perform its function efficiently.

Distributed Databases
Chapter 3
Levels of Distribution Transparency

Reference: Stefano Ceri and Giuseppe Pelagatti, Distributed Databases: Principles and Systems

Outline
Deals with the different levels at which an application programmer views the DDB, depending on the distribution transparency provided by the DDBMS.
Layered reference architecture for DDB.
Mapping between the different levels of distribution transparency (using the relational model & relational algebra).
How applications can be written at the different levels defined in the architecture (applications are written in a Pascal-like language with embedded SQL; the efficiency of their DB access strategies is not a concern here):
Read-only applications
Applications that access the DDB for just a single tuple
Update applications
Problems of distribution transparency when accessing sets of tuples.
Integrity constraints & their enforcement in DDB.

Reference architecture for Distributed Databases


Do not depend on data
model of local site DBMSs

Reference architecture for Distributed Databases

Global Schema:
Defines all the data contained in the DDB as if the DB were not distributed.
Using the relational model, it consists of the definitions of a set of global relations.
Each global relation can be split into several non-overlapping fragments.

Fragmentation Schema:
Defines the mapping between global relations and fragments (1:M mapping).
Fragments are logical portions of global relations that are physically located at one or several sites of the network.
Notation: R_i is the i-th fragment of global relation R.

Allocation Schema:
Defines at which site(s) each fragment is located.
The type of mapping defined here determines whether the DDB is redundant (1:M) or not (1:1).
R^j indicates the physical image of global relation R at site j.
A copy of a fragment at a given site is denoted using the global relation name & two indexes (fragment index and site index); e.g., R_2^3 indicates the copy of fragment R_2 located at site 3.

Local Mapping Schema:
Maps the physical images to the objects that are manipulated by the local DBMSs.
Depends on the type of local DBMS (there are different mappings in a heterogeneous system).
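The fragmentation and allocation schemata are just two mappings, so they can be sketched as dictionaries. This is an illustrative sketch only: the fragment names, the site numbers and the helper functions are hypothetical, chosen so that fragment R2 has a copy at site 3.

```python
# Fragmentation schema: global relation -> its fragments (1:M).
fragmentation = {"R": ["R1", "R2", "R3"]}

# Allocation schema: fragment -> sites that hold a copy (assumed layout).
allocation = {"R1": [1], "R2": [2, 3], "R3": [3]}


def is_redundant(allocation):
    """The DDB is redundant if some fragment is mapped to several sites."""
    return any(len(sites) > 1 for sites in allocation.values())


def physical_image(relation, site):
    """Fragments of `relation` stored at `site` (the physical image R^j)."""
    return [f for f in fragmentation[relation] if site in allocation[f]]


def copies(fragment):
    """Names of the copies of a fragment, e.g. 'R2^3' for R2 at site 3."""
    return [f"{fragment}^{site}" for site in allocation[fragment]]


print(is_redundant(allocation))
print(physical_image("R", 3))
print(copies("R2"))
```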

Reference architecture for Distributed Databases


Example:
Images can be copied.

Reference architecture for Distributed Databases


Objectives that motivate the features of the architecture:
1. Separating the concept of data fragmentation from the concept of data allocation.
Allows two levels of transparency to be distinguished:
Fragmentation transparency
Highest degree of transparency.
The user or application programmer works on global relations.
Location transparency
Lower degree of transparency.
Requires the user or application programmer to work on fragments instead of global relations.

2. Explicit control of redundancy at the fragment level
(in the example, R2 & R3 overlap, i.e., they contain common data)
3. Independence from local DBMSs (called local mapping transparency)
Allows DDB management problems to be studied without taking into account the specific data models of the local DBMSs.
Replication transparency:
Implied by location transparency (not distinguished in the book).
The user is unaware of the replication of fragments.

Types of Data Fragmentation

Horizontal Fragmentation
Vertical Fragmentation

A fragment:
Is defined by an expression in a relational language that takes global relations as operands and produces the fragment as its result.
Rules for defining fragments:
1. Completeness condition
- Every data item of a global relation must belong to some fragment.
- The set of qualifications (conditions) of all fragments must be complete.
2. Reconstruction condition
It must be possible to reconstruct the global relation from its fragments.
3. Disjointness condition
Fragments should be disjoint, so that replication of data can be controlled explicitly at the allocation level (horizontal fragmentation).

Types of Data Fragmentation


Horizontal Fragmentation:
Partitions the tuples of a global relation into subsets.
Example:

Applying the rules of fragmentation:
1. Completeness condition: satisfied if SF and LA are the only possible city values.
2. Reconstruction condition: the global relation is the union of its fragments.
3. Disjointness condition: verified.
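The three rules can be checked mechanically for a horizontal fragmentation. The sketch below assumes a global relation SUPPLIER(SNUM, NAME, CITY) fragmented by CITY into an "SF" and an "LA" fragment, which is what the mention of SF and LA suggests; the fragment names and sample tuples are hypothetical.

```python
SUPPLIER = [
    {"SNUM": "S1", "NAME": "A", "CITY": "SF"},
    {"SNUM": "S2", "NAME": "B", "CITY": "LA"},
    {"SNUM": "S3", "NAME": "C", "CITY": "SF"},
]

# Qualification (selection predicate) of each horizontal fragment.
qualifications = {
    "SUPPLIER1": lambda t: t["CITY"] == "SF",
    "SUPPLIER2": lambda t: t["CITY"] == "LA",
}
fragments = {
    name: [t for t in SUPPLIER if q(t)] for name, q in qualifications.items()
}

# Completeness: every tuple satisfies at least one qualification.
complete = all(any(q(t) for q in qualifications.values()) for t in SUPPLIER)

# Disjointness: no tuple satisfies more than one qualification.
disjoint = all(sum(q(t) for q in qualifications.values()) <= 1 for t in SUPPLIER)

# Reconstruction: the union of the fragments gives back the global relation.
# Plain concatenation is enough here because the fragments are disjoint.
rebuilt = [t for frag in fragments.values() for t in frag]
reconstructs = sorted(rebuilt, key=lambda t: t["SNUM"]) == SUPPLIER

print(complete, disjoint, reconstructs)
```

Adding a supplier located in a third city would make the completeness check fail, which is exactly the condition stated in rule 1.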

Types of Data Fragmentation


Derived Horizontal Fragmentation:
Example:

Applying the rules of fragmentation:
1. Completeness condition (referential integrity constraint): there is no supplier number in SUPPLY that is not also contained in SUPPLIER.
2. Reconstruction condition: the global relation is the union of its fragments.
3. Disjointness condition: verified if no tuple of SUPPLY corresponds to two tuples of SUPPLIER that belong to two different fragments.
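Derived horizontal fragmentation can be sketched as a semi-join of SUPPLY with each fragment of SUPPLIER. This is a minimal sketch under assumptions: SUPPLY is fragmented according to the city of its supplier, the join attribute is SNUM, and all sample tuples and helper names are hypothetical.

```python
SUPPLIER = [{"SNUM": "S1", "CITY": "SF"}, {"SNUM": "S2", "CITY": "LA"}]
SUPPLY = [
    {"SNUM": "S1", "PNUM": "P1"},
    {"SNUM": "S2", "PNUM": "P1"},
    {"SNUM": "S2", "PNUM": "P2"},
]

# Horizontal fragments of SUPPLIER (assumed to be defined on CITY).
SUPPLIER1 = [t for t in SUPPLIER if t["CITY"] == "SF"]
SUPPLIER2 = [t for t in SUPPLIER if t["CITY"] == "LA"]


def semi_join_on_snum(r, s):
    """Tuples of r whose SNUM also appears in s (semi-join on SNUM)."""
    keys = {t["SNUM"] for t in s}
    return [t for t in r if t["SNUM"] in keys]


# Each fragment of SUPPLY is derived from a fragment of SUPPLIER.
SUPPLY1 = semi_join_on_snum(SUPPLY, SUPPLIER1)
SUPPLY2 = semi_join_on_snum(SUPPLY, SUPPLIER2)

# Completeness relies on referential integrity: every SNUM in SUPPLY
# must also appear in SUPPLIER.
complete = all(t in SUPPLY1 or t in SUPPLY2 for t in SUPPLY)

# Disjointness holds because each SNUM belongs to one SUPPLIER fragment.
disjoint = not any(t in SUPPLY2 for t in SUPPLY1)

print(SUPPLY1, SUPPLY2, complete, disjoint)
```

A SUPPLY tuple whose SNUM is missing from SUPPLIER would fall into neither fragment, so the completeness check would fail.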

Types of Data Fragmentation


Vertical Fragmentation:
Example:

Types of Data Fragmentation

Mixed Fragmentation:
Example:

Types of Data Fragmentation

Distribution transparency for Read-only Applications

Language definitions:
All variables are strings (arrays of characters).
Input: read(filename, variable)
Output: write(filename, variable)
Filename is "terminal" if the I/O is performed at a terminal.
Pascal variables used in SQL statements are prefixed with the $ symbol.
Pascal variables that report the success or failure of a required DB operation are prefixed with the # symbol.
SQL I/O

Distribution transparency for Read-only Applications


Simple Application (SUPINQUIRY): retrieve supplier name for a given
supplier #

Distribution transparency for Read-only Applications


Continue (SUPINQUIRY)
In 3.5-b can be written as

Distribution transparency for Read-only Applications


Continue (SUPINQUIRY)

Distribution transparency for Read-only Applications


Complex Application(SUPOFPART) : retrieve name of the supplier who
supplies a given part.

Distribution transparency for Update Applications


Distributed Database Access Primitives

Language definitions:
A DB access query can return several values, not just one as before.
Variables with the suffix REL are treated as files by the Pascal-like language and as relations by SQL statements.


Integrity constraints in DDBs

Sample integrity constraints:
Which data values are allowed (age must be between 0 and 100).
Which updates are allowed (age cannot decrease).
Constraints can involve a single relation or multiple relations.
Referential integrity: all values of a given attribute of a relation also exist in some other relation; this is needed to ensure the correctness of derived fragmentation.
Example
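The two age constraints can be enforced with a small check before each update. This is a minimal single-site sketch; the record layout and function names are hypothetical, and enforcement across sites of a DDB is not modeled.

```python
def check_value(age):
    """Value constraint: age must be between 0 and 100."""
    return 0 <= age <= 100


def check_update(old_age, new_age):
    """Update constraint: age cannot decrease."""
    return new_age >= old_age


def update_age(record, new_age):
    """Apply the update only if both constraints hold."""
    if not (check_value(new_age) and check_update(record["age"], new_age)):
        raise ValueError("integrity constraint violated")
    record["age"] = new_age


person = {"name": "X", "age": 30}
update_age(person, 31)
print(person["age"])
```

An attempt to set the age to 20 or to 150 raises `ValueError` and leaves the record unchanged.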
