
UNIT-1

Query processing is the process by which a declarative query is translated into low-level data
manipulation operations. SQL is the standard query language that is supported in current
DBMSs.
Query Processing steps:

Parsing and Translating
o Translate the query into its internal form (parse tree).
o This is then translated into an expression of the relational algebra.
o The parser checks syntax and validates relations, attributes and access permissions.
Evaluation
o The query execution engine takes a physical query plan (also called an execution plan), executes the plan, and returns the result.
Optimization: Find the "cheapest" execution plan for a query.
A relational algebra expression may have many equivalent expressions, e.g.,

π_CName(σ_Price>5000((CUSTOMERS ⋈ ORDERS) ⋈ OFFERS))

π_CName((CUSTOMERS ⋈ ORDERS) ⋈ σ_Price>5000(OFFERS))
Representation as logical query plan (a tree):

Non-leaf nodes = operations of relational algebra (with parameters); Leaf nodes = relations

A relational algebra expression can be evaluated in many ways. An annotated expression specifying a detailed evaluation strategy is called an execution plan (it specifies, e.g., whether an index is used, which join algorithms are applied, and so on).
Among all semantically equivalent expressions, the one with the least costly evaluation
plan is chosen. Cost estimate of a plan is based on statistical information in the system
catalogs.

Query optimization refers to the process by which the best execution strategy for a given query is
found from among a set of alternatives.
The process typically involves two steps: query decomposition and query optimization.
Query decomposition takes an SQL query and translates it into one expressed in relational algebra. In the
process, the query is analyzed semantically so that incorrect queries are detected and rejected as early as
possible, and correct queries are simplified. Simplification involves the elimination of redundant
predicates which may be introduced as a result of query modification to deal with views, security
enforcement and semantic integrity control. The simplified query is then restructured as an algebraic
query.
For a given SQL query, there is more than one possible algebraic query. Some of these algebraic
queries are better than others. The quality of an algebraic query is defined in terms of expected
performance.
The traditional procedure is to obtain an initial algebraic query by translating the predicates and the target
statement into relational operations as they appear in the query. This initial algebraic query is then
transformed, using algebraic transformation rules, into other algebraic queries until the best one is
found.
The best algebraic query is determined according to a cost function which calculates the cost of
executing the query according to that algebraic specification. This is the process of query optimization.

Optimization typically takes one of two forms: Heuristic Optimization or Cost Based
Optimization
In Heuristic Optimization, the query execution is refined based on heuristic rules for reordering
the individual operations.

With Cost Based Optimization, the overall cost of executing the query is systematically
reduced by estimating the costs of executing several different execution plans.

Query Optimization
We divide the query optimization into two types: Heuristic (sometimes called Rule based) and
Systematic (Cost based).

Heuristic Query Optimization


In this method, relational algebra expressions are transformed into equivalent
expressions that take much less time and fewer resources to process. As we illustrated,
repositioning relational algebra operations in certain ways does not affect the
results. First we present an example to show the effect of this repositioning and
then present a list of heuristic rules for optimizing relational algebra
expressions. Once an expression is optimized, it can then be implemented
efficiently.

A query can be represented as a tree data structure. Operations are at the interior nodes
and data items (tables, columns) are at the leaves.

The query is evaluated in a depth-first pattern.

For Example:
SELECT PNUMBER, DNUM, LNAME
FROM
PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER and MGRSSN=SSN and
PLOCATION = 'Stafford';

Or, in relational algebra:

π_PNUMBER, DNUM, LNAME (σ_PLOCATION='Stafford' ∧ DNUM=DNUMBER ∧ MGRSSN=SSN (PROJECT × DEPARTMENT × EMPLOYEE))

on the following schema:


EMPLOYEE TABLE:
FNAME     MI  LNAME     SSN        BDATE      ADDRESS                    S  SALARY  SUPERSSN   DNO
JOHN      B   SMITH     123456789  09-JAN-55  731 FONDREN, HOUSTON, TX   M  30000   333445555  5
FRANKLIN  T   WONG      333445555  08-DEC-45  638 VOSS, HOUSTON, TX      M  40000   888665555  5
ALICIA    J   ZELAYA    999887777  19-JUL-58  3321 CASTLE, SPRING, TX    F  25000   987654321  4
JENNIFER  S   WALLACE   987654321  20-JUN-31  291 BERRY, BELLAIRE, TX    F  43000   888665555  4
RAMESH    K   NARAYAN   666884444  15-SEP-52  975 FIRE OAK, HUMBLE, TX   M  38000   333445555  5
JOYCE     A   ENGLISH   453453453  31-JUL-62  5631 RICE, HOUSTON, TX     F  25000   333445555  5
AHMAD     V   JABBAR    987987987  29-MAR-59  980 DALLAS, HOUSTON, TX    M  25000   987654321  4
JAMES     E   BORG      888665555  10-NOV-27  450 STONE, HOUSTON, TX     M  55000   null       1

DEPARTMENT TABLE:
DNAME           DNUMBER  MGRSSN     MGRSTARTDATE
HEADQUARTERS    1        888665555  19-JUN-71
ADMINISTRATION  4        987654321  01-JAN-85
RESEARCH        5        333445555  22-MAY-78

PROJECT TABLE:
PNAME            PNUMBER  PLOCATION  DNUM
ProductX         1        Bellaire   5
ProductY         2        Sugarland  5
ProductZ         3        Houston    5
Computerization  10       Stafford   4
Reorganization   20       Houston    1
NewBenefits      30       Stafford   4

WORKS_ON TABLE:
ESSN       PNO  HOURS
123456789  1    32.5
123456789  2    7.5
666884444  3    40.0
453453453  1    20.0
453453453  2    20.0
333445555  2    10.0
333445555  3    10.0
333445555  10   10.0
333445555  20   10.0
999887777  30   30.0
999887777  10   10.0
987987987  10   35.0
987987987  30   5.0
987654321  30   20.0
987654321  20   15.0
888665555  20   null
Which of the following query trees is more efficient ?

The left hand tree is evaluated in steps as follows:


The right hand tree is evaluated in steps as follows:

Note the two cross product operations. These require lots of space and time (nested loops)
to build.

After the two cross products, we have a temporary table with 144 records (6 projects * 3
departments * 8 employees).

An overall rule for heuristic query optimization is to perform as many select and project
operations as possible before doing any joins.

There are a number of transformation rules that can be used to transform a query:
1. Cascading selections. A list of conjunctive conditions can be broken up into
separate individual conditions.

σ_c1 ∧ c2(E) = σ_c1(σ_c2(E))
2. Commutativity of the selection operation.
3. Cascading projections. All but the last projection can be ignored.
Assume that attributes A1, . . . , An are among B1, . . . , Bm. Then
π_A1,...,An(π_B1,...,Bm(E)) = π_A1,...,An(E)
4. Commuting selection and projection. If a selection condition only involves
attributes contained in a projection clause, the two can be commuted.
5. Commutativity of Join and Cross Product.
6. Commuting selection with Join.
If c only involves attributes from E1,then

σ_c(E1 ⋈ E2) = σ_c(E1) ⋈ E2


7. Commuting projection with Join.
8. Commutativity of set operations. Union and Intersection are commutative.
9. Associativity of Union, Intersection, Join and Cross Product.
10. Commuting selection with set operations.

σ_c(E1 ∪ E2) = σ_c(E1) ∪ σ_c(E2)


11. Commuting projection with set operations.
π_A1,...,An(E1 ∪ E2) = π_A1,...,An(E1) ∪ π_A1,...,An(E2)
12. Logical transformation of selection conditions. For example, using DeMorgan's
law, etc.

13. Combine Selection and Cartesian product to form Joins.
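As a small illustration of the push-selections-down heuristic (rules 1, 6 and 13 above), the following Python sketch rewrites a toy logical plan so that a selection sitting above a join is applied to the single input relation that supplies its attributes. The plan representation (nested tuples) and the relation/attribute catalog are invented for this example and are not part of any particular DBMS:

# Minimal sketch of the "push selections down" heuristic.
# A logical plan is a nested tuple:
#   ("rel", name), ("join", left, right), ("select", predicate, used_attrs, child)

ATTRS = {"PROJECT": {"PNUMBER", "PNAME", "PLOCATION", "DNUM"},
         "DEPARTMENT": {"DNUMBER", "DNAME", "MGRSSN"}}

def attrs(node):
    # Set of attributes produced by a plan node.
    if node[0] == "rel":
        return ATTRS[node[1]]
    if node[0] == "select":
        return attrs(node[3])
    return attrs(node[1]) | attrs(node[2])            # join node

def push_selects(node):
    # Move each selection as close to its base relation as possible.
    if node[0] == "rel":
        return node
    if node[0] == "join":
        return ("join", push_selects(node[1]), push_selects(node[2]))
    pred, used, child = node[1], node[2], push_selects(node[3])
    if child[0] == "join":
        left, right = child[1], child[2]
        if used <= attrs(left):                        # predicate touches only the left input
            return ("join", push_selects(("select", pred, used, left)), right)
        if used <= attrs(right):
            return ("join", left, push_selects(("select", pred, used, right)))
    return ("select", pred, used, child)

plan = ("select", "PLOCATION = 'Stafford'", {"PLOCATION"},
        ("join", ("rel", "PROJECT"), ("rel", "DEPARTMENT")))
print(push_selects(plan))
# -> the selection is now applied to PROJECT before the join, not to the join result.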

Systematic (Cost based) Query Optimization

Looking only at the syntax of the query may not give the whole picture - the optimizer needs to look at
the data as well.

Several Cost components to consider:


1. Access cost to secondary storage (hard disk)
2. Storage Cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc. for performing in-memory
operations.
4. Communications Costs to ship data around a network. e.g., in a distributed or
client/server database.

Of these, Access cost is the most crucial in a centralized DBMS. The more work we can
do with data in cache or in memory, the better.

Access Routines are algorithms that are used to access and aggregate data in a database.

An RDBMS may have a collection of general purpose access routines that can be
combined to implement a query execution plan.

We are interested in access routines for selection, projection, join and set operations such
as union, intersection, set difference, cartesian product, etc.

As with heuristic optimization, there can be many different plans that lead to the same
result.

In general, if a query contains n operations, there will be n! possible plans.


However, not all plans will make sense. We should consider:
Perform all simple selections first
Perform joins next
Perform projection last

Overview of the Cost Based optimization process


1. Enumerate all of the legitimate plans (call these P1...Pn), where each plan contains
a set of operations O1...Ok.
2. Select a plan.
3. For each operation Oi in the plan, enumerate the access routines.
4. For each possible access routine for Oi, estimate the cost, and select the access
routine with the lowest cost.
5. Repeat the previous two steps until an access routine has been selected for each
operation, then sum up the costs of the access routines to determine a total cost for the plan.
6. Repeat steps 2 through 5 for each plan and choose the plan with the lowest total
cost.
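A compact Python sketch of this loop, under the simplifying assumption that each operation's candidate access routines and their estimated costs are already known; the plan names, operations and cost numbers below are made up for illustration:

# Cost-based plan selection: pick the cheapest access routine per operation,
# sum the costs per plan, and keep the cheapest plan overall.
plans = {
    "P1": {"scan EMPLOYEE": {"full scan": 200, "index on DNO": 60},
           "join with DEPARTMENT": {"nested loops": 500, "hash join": 120}},
    "P2": {"scan DEPARTMENT": {"full scan": 5},
           "join with EMPLOYEE": {"nested loops": 300, "sort-merge": 180}},
}

def plan_cost(operations):
    # cheapest access routine for each operation, summed over the plan
    return sum(min(routines.values()) for routines in operations.values())

best = min(plans, key=lambda p: plan_cost(plans[p]))
print(best, plan_cost(plans[best]))    # P1 180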

Catalog Information for Cost Estimation


Information about relations and attributes:
NR: number of tuples in the relation R.
BR: number of blocks that contain tuples of the relation R.
SR: size of a tuple of R.
FR: blocking factor; number of tuples from R that fit into one block
(FR = [NR/BR])
V(A,R): number of distinct values for attribute A in R.
SC(A, R): selectivity of attribute A
=average number of tuples of R that satisfy an equality condition on A.
SC(A, R) = NR/V(A, R).
Information about indexes:
HTI: number of levels in index I (B+-tree).
LBI: number of blocks occupied by leaf nodes in index I (first-level blocks).
ValI: number of distinct values for the search key.
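As a small worked example, with invented catalog values, these statistics combine into an estimate of the output size of an equality selection and a rough number of blocks that hold the matching tuples:

# Hypothetical catalog entries for a relation R; all numbers are invented.
NR = 10_000          # tuples in R
BR = 500             # blocks holding R
FR = NR // BR        # blocking factor: tuples per block (20)
V_A = 50             # distinct values of attribute A in R

SC_A = NR / V_A              # selectivity: average tuples matching A = value (200)
blocks_read = SC_A / FR      # rough block estimate if matching tuples are stored together (10)
print(f"SC(A,R) = {SC_A:.0f} tuples, about {blocks_read:.0f} blocks")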

Measures of Query Cost

There are many possible ways to estimate cost, e.g., based on disk accesses, CPU
time, or communication overhead.
Disk access is the predominant cost (in terms of time); relatively easy to estimate;
therefore, number of block transfers from/to disk is typically used as measure.
Simplifying assumption: each block transfer has the same cost.
Cost of algorithm (e.g., for join or selection) depends on database buffer size; more
memory for DB buffer reduces disk accesses. Thus DB buffer size is a parameter for
estimating cost.
We refer to the cost estimate of algorithm S as cost(S). We do not consider cost of
writing output to disk.

Relational Algebra Equivalences:


Equivalence Rules (for expressions E, E1, E2 and conditions Fi), applying distributivity and
commutativity of relational algebra operations:
1. σ_F1(σ_F2(E)) = σ_F1 ∧ F2(E)
2. σ_F(E1 [∪, ∩, -] E2) = σ_F(E1) [∪, ∩, -] σ_F(E2)
3. σ_F(E1 × E2) = σ_F0(σ_F1(E1) × σ_F2(E2));
F = F0 ∧ F1 ∧ F2, where Fi contains only attributes of Ei, i = 1, 2.
4. σ_A=B(E1 × E2) = E1 ⋈_A=B E2
5. π_A(E1 ∪ E2) = π_A(E1) ∪ π_A(E2)
6. π_A(E1 × E2) = π_A1(E1) × π_A2(E2), with Ai = A ∩ {attributes in Ei}, i = 1, 2.
7. E1 [∪, ∩] E2 = E2 [∪, ∩] E1
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3) (the analogous holds for ∩)
8. E1 × E2 = π_A1,A2(E2 × E1)
(E1 × E2) × E3 = E1 × (E2 × E3)
(E1 × E2) × E3 = (E1 × E3) × E2
9. E1 ⋈ E2 = E2 ⋈ E1
(E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)

UNIT-2
Disadvantages of RDBMS

RDBMSs are not suitable for applications with complex data structures or new data types
for large, unstructured objects, such as CAD/CAM, Geographic information systems,
multimedia databases, imaging and graphics.
The RDBMSs typically do not allow users to extend the type system by adding new data
types.
They also only support first-normal-form relations in which the type of every column
must be atomic, i.e., no sets, lists, or tables are allowed inside a column.
Recursive queries are difficult to write.

MOTIVATING EXAMPLE
As a specific example of the need for object-relational systems, we focus on a new business data
processing problem that is both harder and (in our view) more entertaining than the dollars and
cents bookkeeping of previous decades. Today, companies in industries such as entertainment are
in the business of selling bits; their basic corporate assets are not tangible products, but rather
software artifacts such as video and audio.
We consider the fictional Dinky Entertainment Company, a large Hollywood conglomerate
whose main assets are a collection of cartoon characters, especially the cuddly and
internationally beloved Herbert the Worm. Dinky has a number of Herbert the Worm films, many
of which are being shown in theaters around the world at any given time. Dinky also makes a
good deal of money licensing Herbert's image, voice, and video footage for various purposes:
action figures, video games, product endorsements, and so on. Dinky's database is used to
manage the sales and leasing records for the various Herbert-related products, as well as the
video and audio data that make up Herbert's many films.
Traditional database systems, such as RDBMS, have been quite successful in developing the
database technology required for many traditional business database applications. However, they
have certain shortcomings when more complex database applications must be designed and
implemented, for example, databases for engineering design and manufacturing (CAD/CAM),
scientific experiments, telecommunications, geographic information systems, and multimedia.
These newer applications have requirements and characteristics that differ from those of
traditional business applications, such as more complex structures for objects, longer-duration
transactions, new data types for storing images or large textual items, and the need to define
nonstandard application-specific operations.

Object-oriented databases were proposed to meet the needs of these more complex applications.
The object-oriented approach offers the flexibility to handle some of these requirements without
being limited by the data types and query languages available in traditional database systems. A
key feature of object-oriented databases is the power they give the designer to specify both the
structure of complex objects and the operations that can be applied to these objects.
Object database systems combine the classical capabilities of relational database management
systems (RDBMSs) with new functionality introduced by object orientation. The traditional
capabilities include:

Secondary storage management


Schema management
Concurrency control
Transaction management, recovery
Query processing
Access authorization and control, safety, security
New capabilities of object databases include:
Complex objects
Object identities
User-defined types
Encapsulation

Type/class hierarchy with inheritance


Overloading, overriding, late binding, polymorphism
Mandatory features of object-oriented systems
Support for complex objects
A complex object mechanism allows an object to contain attributes that can themselves be
objects. In other words, the schema of an object is not in first-normal-form. Examples of
attributes that can comprise a complex object include lists, bags, and embedded objects.
Object identity
Every instance in the database has a unique identifier (OID), which is a property of an object
that distinguishes it from all other objects and remains for the lifetime of the object. In
object-oriented systems, an object has an existence (identity) independent of its value.
Each database object has identity, i.e. a unique internal identifier (OID) (with no
meaning in the problem domain). Each object has one or more external names that
can be used to identify the object by the programmer.
Properties of OID:
It is unique.
It is system generated.
It is invisible to the user; that is, it cannot be modified by the user.
It is immutable; that is, once generated, it is never regenerated.
It is a long integer value.

Encapsulation
Object-oriented models enforce encapsulation and information hiding. This means, the state of
objects can be manipulated and read only by invoking operations that are specified within the
type definition and made visible through the public clause.
In an object-oriented database system encapsulation is achieved if only the operations are
visible to the programmer and both the data and the implementation are hidden.
Support for types or classes
Type: in an object-oriented system, summarizes the common features of a set of objects
with the same characteristics. In programming languages types can be used at
compilation time to check the correctness of programs.
Class: The concept is similar to type but associated with run-time execution. The term
class refers to a collection of all objects with the same internal structure (attributes) and
methods. These objects are called instances of the class.
Both of these two features can be used to group similar objects together, but it is normal
for a system to support either classes or types and not both.
Class or type hierarchies
Any subclass or subtype will inherit attributes and methods from its superclass or supertype.
Overriding, Overloading and Late Binding

Overloading: A class modifies an existing method, by using the same name, but with a
different list, or type, of parameters.
Overriding: The implementation of the operation will depend on the type of the object it is
applied to.
Late binding: The implementation code cannot be referenced until run-time.

Computational Completeness
SQL does not have the full power of a conventional programming language. Languages such as
Pascal or C are said to be computationally complete because they can exploit the full
capabilities of a computer. SQL is only relationally complete, that is, it has the full power of
relational algebra. Whilst any SQL code could be rewritten as a C++ program, not all C++
programs could be rewritten in SQL.
Mandatory features of database systems
A database is a collection of data that is organized so that its contents can easily be accessed,
managed, and updated. Thus, a database system contains the five following features:
Persistence
As in a conventional database, data must remain after the process that created it has
terminated. For this purpose data has to be stored permanently on secondary storage.
Secondary Storage Management
Traditional databases employ techniques, which manage secondary storage in order to improve
the performance of the system. These are usually invisible to the user of the system.
Concurrency
The system should provide a concurrency mechanism, which is similar to the concurrency
mechanisms in conventional databases.
Recovery
The system should provide a recovery mechanism similar to recovery mechanisms in
conventional databases.
Ad hoc query facility
The database should provide a high-level, efficient, application independent query facility.
This need not necessarily be a query language but could instead be some type of graphical
interface.

Structured Data types:


A structured data type is a form of user-defined data type that contains a sequence
of attributes, each of which has a data type. An attribute is a property that helps
describe an instance of the type. For example, if we were to define a structured type
called address_t, city might be one of the attributes of this structured type.
Structured types make it easy to use data, such as an address, either as a single
unit, or as separate data items, without having to store each of those items (or
attributes) in a separate column.

A structured data type can be used as the type for a column in a regular table, the
type for an entire table (or view), or as an attribute of another structured type.
When used as the type for a table, the table is known as a typed table.
Structured data types exhibit a behavior known as inheritance. A structured type
can have subtypes, other structured types that reuse all of its attributes and contain
their own specific attributes. The type from which a subtype inherits attributes is
known as its supertype.
For Example:
We have to create a table employee with the following structure:
Name (FName, LName), Age, Salary, Address (street, city, province, postal_code)

create type address_t as (street varchar(12), city varchar(12), province varchar(12), postal_code char(6));
create type Name_t as (FName varchar(12), LName varchar(20));

Create a new structured type that uses these two structured types as attribute types:

create type employee_t as (emp_id integer, ename Name_t, address address_t);

Now we can create a table of the above structured type:

create table employee of employee_t
REF is emp_id system generated;

We can also declare an array type to define multivalued attributes.

For Example:
create type phone_t as (phoneno char(10) array[3]);

Here the user can save three phone numbers for an employee.


Complex objects, object identity. The database should consist of objects

having arbitrary complexity and an arbitrary number of hierarchy levels. Objects


can be aggregates of (sub-) objects.
An object typically has two components: state (value) and behavior (operations).
Hence, it is somewhat similar to a program variable in a programming language,
except that it will typically have a complex data structure as well as specific
operations defined by the programmer.
Types of objects:
Transient objects: Objects in an OOPL exist only during program execution and
are hence called transient objects.

Persistent objects: An OO database can extend the existence of objects so that


they are stored permanently, and hence the objects persist beyond program
termination and can be retrieved later and shared by other programs. In other
words, OO databases store persistent objects permanently on secondary storage,
and allow the sharing of these objects among multiple programs and applications.
This requires the incorporation of other well-known features of database
management systems, such as indexing mechanisms, concurrency control, and
recovery. An OO database system interfaces with one or more OO programming
languages to provide persistent and shared object capabilities.
Relationships, associations, links. Objects are connected by conceptual links.

For instance, the Employee and Department objects can be connected by a link
worksFor. In the data structure links are implemented as logical pointers (bidirectional or uni-directional).
Encapsulation and information hiding. The internal properties of an object

are subdivided into two parts: public (visible from the outside) and private
(invisible from the outside). The user of an object can refer to public properties
only.
Classes, types, interfaces. Each object is an instance of one or more classes.

The class is understood as a blueprint for objects; i.e. objects are instantiated
according to information presented in the class and the class contains the
properties that are common for some collection of objects (objects invariants).
Each object is assigned a type. Objects are accessible through their interfaces,
which specify all the information that is necessary for using objects.
Abstract data types (ADTs): a kind of a class, which assumes that any access

to an object is limited to the predefined collection of operations.


Operations, methods and messages. An object is associated with a set of

operations (called methods). The object performs the operation after receiving a
message with the name of operation to be performed (and parameters of this
operation).
Inheritance. Classes are organized in a hierarchy reflecting the hierarchy of real

world concepts. For instance, the class Person is a super class of the classes
Employee and Student. Properties of more abstract classes are inherited by more
specific classes. Multi-inheritance means that a specific class inherits from
several independent classes.
Polymorphism, late binding, overriding. The operation to be executed on an

object is chosen dynamically, after the object receives the message with the
operation name. The same message sent to different objects can invoke different
operations.
Persistence. Database objects are persistent, i.e., they live as long as necessary.

They can outlive programs, which created these objects.

Object Database Management Group (ODMG).


Special interest group to develop standards that allow ODBMS customers to write portable
applications
Standards include:

Object Model
Object Specification Languages
Object Definition Language (ODL) for schema definition
Object Interchange Format (OIF) to exchange objects between databases
Object Query Language
declarative language to query and update database objects
Language Bindings (C++, Java, Smalltalk)
Object manipulation language
Mechanisms to invoke OQL from language
Procedures for operation on databases and transactions

CHALLENGES IN IMPLEMENTING AN ORDBMS


The enhanced functionality of ORDBMSs raises several implementation challenges.
Some of these are well understood and solutions have been implemented in
products, others are subjects of current research. In this section we examine a few
of the key challenges that arise in implementing an efficient, fully functional
ORDBMS. Many more issues are involved than those discussed here

Storage and Access Methods


Since object-relational databases store new types of data, ORDBMS implementers
need to revisit some of the storage and indexing issues. In particular, the system
must efficiently store ADT objects and structured objects and provide efficient
indexed access to both.

Storing Large ADT and Structured Type Objects


Large ADT objects and structured objects complicate the layout of data on disk. This
problem is well understood and has been solved in essentially all ORDBMSs and
OODBMSs. We present some of the main issues here. User-defined ADTs can be
quite large. In particular, they can be bigger than a single disk page. Large ADTs,
like BLOBs, require special storage, typically in a different location on disk from the
tuples that contain them. Disk-based pointers are maintained from the tuples to the
objects they contain.

Structured objects can also be large, but unlike ADT objects they often vary in size during the
lifetime of a database. For example, consider the stars attribute of the films table. As the years
pass, some of the bit actors in an old movie may become famous. When a bit actor becomes
famous, we might want to advertise his or her presence in the earlier films. This involves an
insertion into the stars attribute of an individual tuple in films. Because these bulk attributes can
grow arbitrarily, flexible disk layout mechanisms are required. An additional complication arises
with array types. Traditionally, array elements are stored sequentially on disk in a row-by-row
fashion, for example
A11, ..., A1n, A21, ..., A2n, ..., Am1, ..., Amn

However, queries may often request sub arrays that are not stored contiguously on disk (e.g.,
A11,A21,...,Am1). Such requests can result in a very high I/O cost for retrieving the sub array. In
order to reduce the number of I/Os required in general, arrays are often broken into contiguous
chunks, which are then stored in some order on disk. Although each chunk is some contiguous
region of the array, chunks need not be row-by-row or column-by-column. For example, a chunk
of size 4 might be A11,A12,A21,A22, which is a square region if we think of the array as being
arranged row-by-row in two dimensions.

Indexing New Types


One important reason for users to place their data in a database is to allow for efficient access via
indexes. Unfortunately, the standard RDBMS index structures support only equality conditions
(B+ trees and hash indexes) and range conditions (B+ trees). An important issue for ORDBMSs
is to provide efficient indexes for ADT methods and operators on structured objects. Many
specialized index structures have been proposed by researchers for particular applications such as
cartography, genome research, multimedia repositories, Web search, and so on. An ORDBMS
company cannot possibly implement every index that has been invented. Instead, the set of index
structures in an ORDBMS should be user-extensible. Extensibility would allow an expert in
cartography, for example, to not only register an ADT for points on a map (i.e., latitude/longitude
pairs), but also implement an index structure that supports natural map queries (e.g., the R-tree,
which matches conditions such as Find me all theaters within 100 miles of Andorra).
One way to make the set of index structures extensible is to publish an access method interface
that lets users implement an index structure outside of the DBMS. The index and data can be
stored in a file system, and the DBMS simply issues the open, next, and close iterator requests to
the user's external index code. Such functionality makes it possible for a user to connect a
DBMS to a Web search engine, for example. A main drawback of this approach is that data in an
external index is not protected by the DBMS's support for concurrency and recovery. An
alternative is for the ORDBMS to provide a generic template index structure that is sufficiently
general to encompass most index structures that users might invent. Because such a structure is
implemented within the DBMS, it can support high concurrency and recovery. The Generalized
Search Tree (GiST) is such a structure. It is a template index structure based on B+trees, which
allows most of the tree index structures invented so far to be implemented with only a few lines
of user-defined ADT code.

Query Processing
ADTs and structured types call for new functionality in processing queries in ORDBMSs. They
also change a number of assumptions that affect the efficiency of queries. In this section we look
at two functionality issues (user-defined aggregates and security) and two efficiency issues
(method caching and pointer swizzling).

User-Defined Aggregation Functions

Since users are allowed to define new methods for their ADTs, it is not unreasonable to expect
them to want to define new aggregation functions for their ADTs as well. For example, the usual
SQL aggregates COUNT, SUM, MIN, MAX, AVG are not particularly appropriate for the
Image type in the Dinky schema.
Most ORDBMSs allow users to register new aggregation functions with the system. To register
an aggregation function, a user must implement three methods, which we will call initialize,
iterate, and terminate. The initialize method initializes the internal state for the aggregation. The
iterate method updates that state for every tuple seen, while the terminate method computes the
aggregation result based on the final state and then cleans up. As an example, consider an
aggregation function to compute the second-highest value in a field. The initialize call would
allocate storage for the top two values, the iterate call would compare the current tuple's value
with the top two and update the top two as necessary, and the terminate call would delete the
storage for the top two values, returning a copy of the second-highest value.
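A minimal Python sketch of such an aggregate, mirroring the initialize/iterate/terminate protocol described above; the class and the way the query executor drives it are assumptions made for illustration, since each ORDBMS has its own registration API:

# User-defined aggregate computing the second-highest value,
# expressed as the initialize / iterate / terminate triple.
class SecondHighest:
    def initialize(self):
        self.top_two = []                  # internal state: the two largest values seen

    def iterate(self, value):
        self.top_two.append(value)         # fold the next tuple's value into the state
        self.top_two = sorted(self.top_two, reverse=True)[:2]

    def terminate(self):
        result = self.top_two[1] if len(self.top_two) == 2 else None
        self.top_two = []                  # clean up the internal state
        return result

# How a query executor might drive it over a column of values:
agg = SecondHighest()
agg.initialize()
for v in [30000, 40000, 25000, 55000]:
    agg.iterate(v)
print(agg.terminate())                     # 40000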

Method Security
ADTs give users the power to add code to the DBMS; this power can be abused. A buggy or
malicious ADT method can bring down the database server or even corrupt the database. The
DBMS must have mechanisms to prevent buggy or malicious user code from causing problems.
It may make sense to override these mechanisms for efficiency in production environments with
vendor-supplied methods. However, it is important for the mechanisms to exist, if only to
support debugging of ADT methods; otherwise method writers would have to write bug-free
code before registering their methods with the DBMS, not a very forgiving programming
environment. One mechanism to prevent problems is to have the user methods be interpreted
rather than compiled. The DBMS can check that the method is well behaved either by restricting
the power of the interpreted language or by ensuring that each step taken by a method is safe
before executing it. Typical interpreted languages for this purpose include Java and the
procedural portions of SQL:1999
An alternative mechanism is to allow user methods to be compiled from a general-purpose
programming language such as C++, but to run those methods in a different address space than
the DBMS. In this case the DBMS sends explicit interprocess communications (IPCs) to the user
method, which sends IPCs back in return. This approach prevents bugs in the user methods (e.g.,
stray pointers) from corrupting the state of the DBMS or database and prevents malicious
methods from reading or modifying the DBMS state or database as well. Note that the user
writing the method need not know that the DBMS is running the method in a separate process:
The user code can be linked with a wrapper that turns method invocations and return values
into IPCs

Method Caching
User-defined ADT methods can be very expensive to execute and can account for the bulk of the
time spent in processing a query. During query processing it may make sense to cache the results
of methods, in case they are invoked multiple times with the same argument. Within the scope of
a single query, one can avoid calling a method twice on duplicate values in a column by either
sorting the table on that column or using a hash-based scheme much like that used for
aggregation. An alternative is to maintain a cache of method inputs and matching outputs as a

table in the database. Then to find the value of a method on particular inputs, we essentially join
the input tuples with the cache table. These two approaches can also be combined.
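A tiny Python sketch of the caching idea, keyed on the method argument; the method itself and its cost are hypothetical, and a real ORDBMS would keep such a cache per query or as a table in the database rather than in a dictionary:

# Method caching: remember (input -> output) pairs so an expensive ADT method
# is evaluated only once per distinct argument value.
method_cache = {}

def expensive_method(x):
    return x * x          # stands in for a costly user-defined ADT method

def cached_call(x):
    if x not in method_cache:
        method_cache[x] = expensive_method(x)
    return method_cache[x]

column = [3, 7, 3, 7, 7, 9]                 # duplicate values in a column
results = [cached_call(v) for v in column]
print(results, len(method_cache))           # the method ran only 3 times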

Pointer Swizzling
In some applications, objects are retrieved into memory and accessed frequently through their
oids, so dereferencing must be implemented very efficiently. Some systems maintain a table of oids
of objects that are (currently) in memory. When an object O is brought into memory, they check
each oid contained in O and replace oids of in-memory objects by in-memory pointers to those
objects. This technique is called pointer swizzling and makes references to in-memory objects
very fast. The downside is that when an object is paged out, in-memory references to it must
somehow be invalidated and replaced with its oid.

Query Optimization
New indexes and query processing techniques widen the choices available to a query optimizer.
In order to handle the new query processing functionality, an optimizer must know about the new
functionality and use it appropriately. In this section we discuss two issues in exposing
information to the optimizer (new indexes and ADT method estimation) and an issue in query
planning that was ignored in relational systems (expensive selection optimization).

Registering Indexes with the Optimizer


As new index structures are added to a system, either via external interfaces or built-in template
structures like GiSTs, the optimizer must be informed of their existence and of their costs of
access. In particular, for a given index structure the optimizer must know (a) what WHERE
-clause conditions are matched by that index, and (b) what the cost of fetching a tuple is for that
index. Given this information, the optimizer can use any index structure in constructing a query
plan. Different ORDBMSs vary in the syntax for registering new index structures. Most systems
require users to state a number representing the cost of access, but an alternative is for the DBMS
to measure the structure as it is used and maintain running statistics on cost.

Expensive selection optimization


In relational systems, selection is expected to be essentially a zero-time operation. For example, it requires
no I/Os and few CPU cycles to test if emp.salary < 10. However, conditions such as
is_herbert(Frames.image)
can be quite expensive because they may fetch large objects off the disk and process them in
memory in complicated ways. ORDBMS optimizers must consider carefully how to order
selection conditions. For example, consider a selection query that tests tuples in the Frames table
with two conditions:
Frames.frameno < 100 AND is_herbert(Frames.image). It is probably preferable to check the frameno
condition before testing is_herbert. The first condition is quick and may often return false, saving
the trouble of checking the second condition. In general, the best ordering among selections is a
function of their costs and reduction factors. It can be shown that selections should be ordered by
increasing rank, where rank = (reduction factor - 1)/cost. If a selection with very high rank

appears in a multi-table query, it may even make sense to postpone the selection until after performing joins. Note that this approach is the opposite of the heuristic for pushing selections. The
details of optimally placing expensive selections among joins are somewhat complicated, adding
to the complexity of optimization in ORDBMSs.
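A short Python sketch of ordering selection conditions by increasing rank = (reduction factor - 1)/cost; the conditions, per-tuple costs and reduction factors below are illustrative guesses, not measured values:

# Order expensive selection conditions by rank = (reduction_factor - 1) / cost.
# More negative rank means "cheap and selective", so that condition runs first.
conditions = [
    # (condition,                  cost per tuple, reduction factor)
    ("Frames.frameno < 100",       0.001,          0.1),
    ("is_herbert(Frames.image)",   50.0,           0.5),
]

def rank(cond):
    _, cost, reduction = cond
    return (reduction - 1) / cost

for name, cost, reduction in sorted(conditions, key=rank):
    print(f"{name:28s} rank = {(reduction - 1) / cost:10.3f}")
# The cheap, selective frameno test sorts first, matching the discussion above.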

Comparison Between RDBMS, OODBMS, and ORDBMS


1. Object-oriented features: An RDBMS does not support object-oriented features. OODBMSs and ORDBMSs support user-defined ADTs, structured types, object identity and reference types, and inheritance.

2. Query language: RDBMSs support SQL; OODBMSs support ODL/OQL; ORDBMSs support an extended form of SQL.

3. Target applications and integration: An RDBMS is aimed at designing management and finance systems, e.g. hotel management, shop management, etc. OODBMSs aim to achieve seamless integration with a programming language such as C++, Java or Smalltalk; such integration is not an important goal for an ORDBMS. An OODBMS is aimed at applications where an object-centric viewpoint is appropriate; that is, typical user sessions consist of retrieving a few objects and working on them for long periods, with related objects (e.g., objects referenced by the original objects) fetched occasionally.

4. Transactions: In an RDBMS, transactions are short and ad hoc in nature. In an OODBMS, transactions are complex and of long duration. In an ORDBMS, transactions are assumed to be short, and the ordinary mechanisms of the RDBMS are used to manage them.

5. Identity: In an RDBMS, every record is uniquely identified by a primary key. In an OODBMS and an ORDBMS, every object is uniquely identified by a system-generated Object ID.

6. Suitability: An RDBMS is suitable for small database management systems like hotel management, university management, shop management, etc. An OODBMS is suitable for advanced applications like Computer Integrated Manufacturing (CIM), advanced office automation systems, hospital patient care tracking systems, etc.; all of these applications are characterized by having to manage complex, highly interrelated information, which is a strength of object-oriented database systems. An ORDBMS is suitable for applications like complex data analysis, digital asset management, geographic data, and bio-medical data.

7. Standard query language: An RDBMS has a standard query language, i.e. SQL. OODBMSs and ORDBMSs lack a standard query language.

8. Examples: Examples of RDBMS: Oracle, SQL Server, MySQL, etc. Examples of OODBMS: ObjectStore, Versant, Gemstone, etc. Examples of ORDBMS: Postgres, SQL 92.

What is ownership semantics?


Ownership semantics applies when the sub-objects of a complex object are encapsulated within the
complex object and are hence considered part of the complex object. This is also referred to as the
is-part-of or is-component-of relationship.
What is reference semantics?
Reference semantics are applied when the components of the complex objects are themselves
independent objects but may be referenced from the complex object.

UNIT- 3
Parallel and Distributed Databases
A parallel database system is one that seeks to improve performance through parallel
implementation of various operations such as loading data, building indexes, and evaluating
queries.

Parallel Database Systems


A parallel database system tries to improve performance through parallelization of various
operations such as loading data, building indexes, and evaluating queries; the main goal of such a
system is to improve performance. In a distributed database system, by contrast, the data
distribution is the governing factor, and the main goal is to increase availability and reliability.
Some terms that defines systems performance:
Throughput: Number of tasks (transactions) that can be completed in a given time interval.
Response Time: Amount of time taken to complete a single task from the time it is
submitted.
A system that processes large number of small transactions can improve throughput by
processing many transactions in parallel.
A system that processes large transactions can improve response time and throughput by
dividing each transaction into number of sub-transactions that can be executed in parallel.
Speed-Up: Running a given task in less time by increasing the degree of parallelism is called
speed-up.
Speed-Up = Ts/Tl
where Ts = time required on the small system
Tl = time required on the large system with more resources.
A parallel system is said to demonstrate linear speed-up if the speed-up is N when
resources are increased N times.
Scale-Up: Handling larger tasks in the same amount of time by increasing the degree of
parallelism is called scale-up.
Scale-Up = Ts/Tl
where Ts = time required to execute a task of size Q
Tl = time required to execute a task of size Q*N when resources are increased N times.
The parallel system is said to demonstrate linear scale-up on a task of size Q if Ts = Tl when
resources are increased N times.
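A tiny worked example of the two ratios, with invented timings, sketched in Python:

# Worked example of speed-up and scale-up (all timings are invented).
N = 8            # resources are increased N times

# Speed-up: same task, N times the resources.
Ts = 100.0       # seconds on the small system
Tl = 12.5        # seconds on the large system
print("speed-up =", Ts / Tl, "(linear speed-up would be", N, ")")     # 8.0 -> linear

# Scale-up: task grows N times while resources grow N times.
Ts_q = 100.0     # time for a task of size Q on the small system
Tl_qn = 110.0    # time for a task of size Q*N on the N-times-larger system
print("scale-up =", round(Ts_q / Tl_qn, 2), "(1.0 would be linear scale-up)")  # 0.91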
Parallel Database architectures:
Three main architectures are proposed for building parallel databases:
1. Shared-memory (all processors share common memory): multiple CPUs are
attached to an interconnection network and can access a common region of main
memory.
In the shared-memory architecture, the processors and disks have access to common
memory via a bus or through an interconnection network.
A processor can send messages to other processors using memory writes, which is a
much faster communication mechanism than sending them over a network.
Advantage: Shared memory provides extremely efficient communication between processors;
data in shared memory can be accessed by any processor without being moved by
software.
Disadvantage: The shared-memory architecture is not scalable beyond 32 or 64 processors,
since the bus or interconnection network becomes a bottleneck.

2. Shared-disk (all processors share a common disk and have private memories): each
CPU has a private memory and direct access to all disks through an interconnection
network.
Advantages: Each processor has its own local memory, so the memory bus is not a
bottleneck.
This architecture provides a higher degree of fault tolerance (if a processor fails, the other
processors can take over its tasks).
Disadvantage: The interconnection to the disk subsystem is now a bottleneck.
3. Shared-nothing (each node of the machine consists of a processor, memory and one or
more disks): each CPU has local main memory and disk space, but no two CPUs
can access the same storage area; all communication between CPUs is through a
network connection.
Advantages: Instead of passing all I/O through a single interconnection network, only
queries to non-local disks and result relations are passed over the network.
These architectures are more scalable and can easily support a large number of
processors.
Transmission capacity increases as more nodes are added.
Disadvantage: The costs of communication and of non-local disk access are higher than in the
other architectures because transmitting data involves software interaction at both ends.

PARALLEL QUERY EVALUATION


Parallel evaluation of a relational query in a DBMS with a shared-nothing architecture is
discussed. Parallel execution of a single query has been emphasized.
A relational query execution plan is a graph of relational algebra operators and the
operators in a graph can be executed in parallel. If an operator consumes the output of a
second operator, we have pipelined parallelism.

Each individual operator can also be executed in parallel by partitioning the input data
and then working on each partition in parallel and then combining the result of each
partition. This approach is called Data Partitioned parallel Evaluation.
Data Partitioning: Here large datasets are partitioned horizontally across several disks; this
enables us to exploit the I/O bandwidth of the disks by reading and writing them in parallel.
This can be done in the following ways (a small sketch of all three schemes follows this list):
a. Round Robin Partitioning
b. Hash Partitioning
c. Range Partitioning
a. Round Robin Partitioning :If there are n processors, the ith tuple is assigned to
processor i mod n
b. Hash Partitioning : A hash function is applied to (selected fields of) a tuple to determine
its processor.
Hash partitioning has the additional virtue that it keeps data evenly distributed even if the
data grows and shrinks over time.
c. Range Partitioning : Tuples are sorted (conceptually), and n ranges are chosen for the
sort key values so that each range contains roughly the same number of tuples; tuples in
range i are assigned to processor i.
Range partitioning can lead to data skew; that is, partitions with widely varying numbers of
tuples across partitions or disks. Skew causes processors dealing with large partitions to
become performance bottlenecks.
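A minimal Python sketch of the three partitioning schemes; the tuples, the choice of partitioning key and the number of processors are arbitrary illustrations:

# Round-robin, hash and range partitioning of (id, salary) tuples over n processors.
n = 3
tuples = [(101, 5600), (102, 9100), (103, 3000), (104, 7200), (105, 4800)]

# a. Round-robin: the i-th tuple goes to processor i mod n.
round_robin = {p: [] for p in range(n)}
for i, t in enumerate(tuples):
    round_robin[i % n].append(t)

# b. Hash partitioning: a hash of the chosen field picks the processor.
hash_part = {p: [] for p in range(n)}
for t in tuples:
    hash_part[hash(t[0]) % n].append(t)

# c. Range partitioning: ranges of the sort key (salary) map to processors.
bounds = [5000, 8000]                      # splitting vector: <5000, 5000-7999, >=8000
range_part = {p: [] for p in range(n)}
for t in tuples:
    range_part[sum(t[1] >= b for b in bounds)].append(t)

print(round_robin, hash_part, range_part, sep="\n")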

PARALLELIZING INDIVIDUAL OPERATIONS


Various operations can be implemented in parallel in a shared nothing architecture.
Bulk Loading and Scanning:
Pages can be read in parallel while scanning a relation and the retrieved tuples can
then be merged, if the relation is partitioned across several disks.
If a relation has associated indexes, any sorting of data entries required for building the
indexes during bulk loading can also be done in parallel.
Sorting:
Sorting could be done by redistributing all tuples in the relation using range partitioning.
Ex. Sorting a collection of employee tuples by salary whose values are in a certain
range.

For N processors each processor gets the tuples which lie in range assigned to it. Like
processor 1 contains all tuples in range 10 to 20 and so on.
Each processor has a sorted version of the tuples which can then be combined by
traversing and collecting the tuples in the order on the processors (according to the range
assigned)
The problem with range partitioning is data skew which limits the scalability of the
parallel sort. One good approach to range partitioning is to obtain a sample of the entire
relation by taking samples at each processor that initially contains part of the relation. The
(relatively small) sample is sorted and used to identify ranges with equal numbers of tuples.
This set of range values, called a splitting vector, is then distributed to all processors and
used to range partition the entire relation.
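A short Python sketch of deriving a splitting vector from a sample so that the resulting ranges hold roughly equal numbers of tuples; the sample values and the number of processors are invented:

# Build a splitting vector from a sample so each of n ranges gets roughly
# the same number of tuples (reduces data skew in a parallel sort).
n = 4
sample = [12, 55, 7, 91, 33, 64, 28, 80, 45, 19, 73, 60]   # sampled sort-key values

sample.sort()
step = len(sample) // n
splitting_vector = [sample[i * step] for i in range(1, n)]  # n - 1 split points
print(splitting_vector)                                     # [28, 55, 73]

def partition_of(key):
    # processor index for a key, using the splitting vector
    return sum(key >= split for split in splitting_vector)

print([partition_of(k) for k in [5, 30, 58, 99]])           # [0, 1, 2, 3]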
Joins:
Here we consider how the join operation can be parallelized
Consider 2 relations A and B to be joined using the age attribute. A and B are initially
distributed across several disks in a way that is not useful for join operation
So we have to decompose the join into a collection of k smaller joins by partitioning
both A and B into a collection of k logical partitions.
If the same partitioning function is used for both A and B, then the union of the k smaller joins
computes the join of A and B.

DISTRIBUTED DATABASES
The idea of a distributed database is that the data should be physically stored at different
locations but its distribution and access should be transparent to the user.
Introduction to DDBs:
A Distributed Database should exhibit the following properties:
1) Distributed Data Independence: - The user should be able to access the database
without having the need to know the location of the data.
2) Distributed Transaction Atomicity: - The concept of atomicity should be distributed for
the operation taking place at the distributed sites.
Types of Distributed Databases are:
a) Homogeneous Distributed Database, where the data stored across multiple sites is
managed by the same DBMS software at all the sites.

b) Heterogeneous Distributed Database is where multiple sites which may be autonomous


are under the control of different DBMS software.
Architecture of DDBs :
There are 3 architectures:
Client-Server:
A Client-Server system has one or more client processes and one or more server
processes, and a client process can send a query to any one server process. Clients are
responsible for user-interface issues, and servers manage data and execute transactions.
Thus, a client process could run on a personal computer and send queries to a server
running on a mainframe.
Advantages: 1. Simple to implement because of the centralized server and separation of functionality.
2. Expensive server machines are not underutilized by simple user interactions, which are
now pushed onto inexpensive client machines.
3. Users can have a familiar and friendly client-side user interface rather than an unfamiliar
and unfriendly server interface.
Collaborating Server:
In the client-server architecture a single query cannot be split and executed across
multiple servers, because the client process would have to be quite complex and intelligent
enough to break a query into subqueries to be executed at different sites and then put
their results together, making the client's capabilities overlap with the server's. This makes it
hard to distinguish between the client and the server.
In Collaborating Server system, we can have collection of database servers, each
capable of running transactions against local data, which cooperatively execute transactions
spanning multiple servers.
When a server receives a query that requires access to data at other servers, it
generates appropriate sub queries to be executed by other servers and puts the results
together to compute answers to the original query.
Middleware:
The Middleware system is a special server, a layer of software that coordinates the
execution of queries and transactions across one or more independent database servers.
The Middleware architecture is designed to allow a single query to span multiple
servers, without requiring all database servers to be capable of managing such multi site
execution strategies. It is especially attractive when trying to integrate several legacy
systems, whose basic capabilities cannot be extended.

We need just one database server that is capable of managing queries and
transactions spanning multiple servers; the remaining servers only need to handle local
queries and transactions.

STORING DATA IN DDBS


Data storage involves two concepts:
1. Fragmentation
2. Replication
Fragmentation:
It is the process in which a relation is broken into smaller relations called fragments and
possibly stored at different sites.
It is of 2 types
1. Horizontal Fragmentation where the original relation is broken into a number of
fragments, where each fragment is a subset of rows. The union of the horizontal fragments
should reproduce the original relation.
2. Vertical Fragmentation where the original relation is broken into a number of fragments,
where each fragment consists of a subset of columns.
The system often assigns a unique tuple id to each tuple in the original relation
so that the fragments, when joined again, form a lossless join. The
collection of all vertical fragments should reproduce the original relation.
Replication:
Replication occurs when we store more than one copy of a relation or its fragment at
multiple sites.
Advantages:
1. Increased availability of data: If a site that contains a replica goes down, we can find
the same data at other sites. Similarly, if local copies of remote relations are available, we
are less vulnerable to failure of communication links.
2. Faster query evaluation: Queries can execute faster by using a local copy of a relation
instead of going to a remote site.

Distributed catalog management :


Naming Objects
This is related to the unique identification of each fragment that has been either
partitioned or replicated.

This can be done by using a global name server that can assign globally unique
names.
This can be implemented by using the following two fields:
1. Local name field: the name assigned locally by the site where the relation is created. Two
objects at different sites can have the same local name.
2. Birth site field indicates the site at which the relation is created and where information
about its fragments and replicas is maintained.
Catalog Structure:
A centralized system catalog is used to maintain the information about all the
transactions in the distributed database but is vulnerable to the failure of the site containing
the catalog.
This could be avoided by maintaining a copy of the global system catalog but it involves
broadcast of every change done to a local catalog to all its replicas.
Another alternative is to maintain a local catalog at every site which keeps track of all
the replicas of the relation.
Distributed Data Independence:
It means that the user should be able to query the database without needing to specify
the location of the fragments or replicas of a relation which has to be done by the DBMS
Users can be enabled to access relations without considering how the relations are
distributed as follows:
The local name of a relation in the system catalog is a combination of a user name and a
user-defined relation name.
When a query is fired the DBMS adds the user name to the relation name to get a local
name, then adds the user's site-id as the (default) birth site to obtain a global relation name.
By looking up the global relation name in the local catalog if it is cached there or in the
catalog at the birth site the DBMS can locate replicas of the relation.
Distributed query processing:
In a distributed system several factors complicates the query processing.
One of the factors is cost of transferring the data over network.
This data includes the intermediate files that are transferred to other sites for further
processing or the final result files that may have to be transferred to the site where the
query result is needed.
Although these costs may not be very high if the sites are connected via a high-speed local network,
they can become quite significant in other types of networks.

Hence, DDBMS query optimization algorithms consider the goal of reducing the
amount of data transfer as an optimization criterion in choosing a distributed query
execution strategy.
Consider an EMPLOYEE relation stored at site 1 and a DEPARTMENT relation stored at site 2:
EMPLOYEE: 10,000 records, each record 100 bytes long; the Fname field is 15 bytes, the SSN field 9 bytes, the Lname field 15 bytes, and the Dnum field 4 bytes. The size of the EMPLOYEE relation is 100 * 10,000 = 10^6 bytes.
DEPARTMENT: 100 records, each record 35 bytes long; the Dnumber field is 4 bytes, the Dname field 10 bytes, and the MGRSSN field 9 bytes. The size of the DEPARTMENT relation is 35 * 100 = 3,500 bytes.
Now consider the following query:
For each employee, retrieve the employee name and the name of the department for which
the employee works.
Using relational algebra this query can be expressed as
π_FNAME, LNAME, DNAME (EMPLOYEE ⋈_DNO=DNUMBER DEPARTMENT)
If we assume that every employee is related to a department then the result of this
query will include 10,000 records.
Now suppose that each record in the query result is 40 bytes long and the query is
submitted at a distinct site which is the result site.
Then there are 3 strategies for executing this distributed query:

1. Transfer both the EMPLOYEE and the DEPARTMENT relations to the site 3 that is your
result site and perform the join at that site. In this case a total of 1,000,000 + 3500 =
1,003,500 bytes must be transferred.
2. Transfer the EMPLOYEE relation to site 2 (the site holding the DEPARTMENT relation) and
send the result to site 3. The size of the query result is 40 * 10,000 = 400,000 bytes, so
400,000 + 1,000,000 = 1,400,000 bytes must be transferred.
3. Transfer the DEPARTMENT relation to site 1 (the site holding the EMPLOYEE relation) and
send the result to site 3. In this case 400,000 + 3,500 = 403,500 bytes must be transferred.
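These three byte counts can be checked with a few lines of Python, using the statistics given above:

# Data-transfer cost of the three strategies for the distributed join query.
emp_size = 100 * 10_000       # EMPLOYEE relation at site 1: 1,000,000 bytes
dept_size = 35 * 100          # DEPARTMENT relation at site 2: 3,500 bytes
result_size = 40 * 10_000     # join result: 400,000 bytes

strategy_1 = emp_size + dept_size       # ship both relations to result site 3
strategy_2 = emp_size + result_size     # ship EMPLOYEE to site 2, then the result to site 3
strategy_3 = dept_size + result_size    # ship DEPARTMENT to site 1, then the result to site 3
print(strategy_1, strategy_2, strategy_3)   # 1003500 1400000 403500 -> strategy 3 is cheapest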
Nonjoin Queries in a Distributed DBMS:
Consider the following two relations:
Sailors (sid: integer, sname:string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Now consider the following query:
SELECT S.age FROM Sailors S WHERE S.rating > 3 AND S.rating < 7
Now suppose that the Sailors relation is horizontally fragmented, with all the tuples having a rating
less than 5 stored at Shanghai and all the tuples having a rating of 5 or higher stored at Tokyo.
The DBMS will answer this query by evaluating it at both sites and then taking the union of the
answers.
Joins in a Distributed DBMS:
Joins of relations stored at different sites can be very expensive, so we now consider the evaluation options that must be considered in a distributed environment.
Suppose that the Sailors relation is stored at London and the Reserves relation is stored at Paris. We will consider several strategies for computing the join of Sailors and Reserves.
In the examples that follow, the time taken to read one page from disk (or to write one page to disk) is denoted td, and the time taken to ship one page from one site to another is denoted ts.
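As a rough illustration (the page counts here are assumed purely for the sake of the arithmetic), suppose Sailors occupies 500 pages at London and Reserves occupies 1,000 pages at Paris. If we ship Sailors whole to Paris and join there, the shipping cost alone is 500 * ts, in addition to the local costs (in units of td) of reading Sailors at London, writing it at Paris, and performing the join. Whether this beats shipping Reserves to London instead, or shipping only the join result to the site where the answer is needed, depends on the relative sizes of the two relations and of the result, exactly as in the EMPLOYEE/DEPARTMENT example above.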

DISTRIBUTED CONCURRENCY CONTROL AND RECOVERY


The main issues with respect to distributed transactions are:
Distributed Concurrency Control
How can deadlocks be detected in a distributed database?
How can locks for objects stored across several sites be managed?
Distributed Recovery
When a transaction commits, all its actions, across all the sites at which it executes, must persist.
When a transaction aborts, none of its actions must be allowed to persist.
Concurrency Control and Recovery in Distributed Databases: For concurrency control and recovery purposes, numerous problems arise in a distributed DBMS environment that are not encountered in a centralized DBMS environment.
These include the following:
Dealing with multiple copies of the data items: The concurrency control method is responsible for maintaining consistency among these copies. The recovery method is responsible for making a copy consistent with the other copies if the site on which the copy is stored fails and recovers later.
Failure of individual sites: The DBMS should continue to operate with its running sites, if possible, when one or more individual sites fail. When a site recovers, its local database must be brought up to date with the rest of the sites before it rejoins the system.
Failure of communication links: The system must be able to deal with failure
of one or more of the communication links that connect the sites. An extreme
case of this problem is that network partitioning may occur. This breaks up the
sites into two or more partitions where the sites within each partition can
communicate only with one another and not with sites in other partitions.
Distributed Commit: Problems can arise with committing a transaction that is accessing databases stored on multiple sites if some sites fail during the commit process. The two-phase commit protocol is often used to deal with this problem.
Distributed Deadlock: Deadlocks may occur among several sites, so techniques for dealing with deadlocks must be extended to take this into account.
Lock management can be distributed across sites in many ways:
Centralized: A single site is in charge of handling lock and unlock requests for all
objects.
Primary copy: One copy of each object is designated as the primary copy. All requests to lock or unlock a copy of such an object are handled by the lock manager at the site where the primary copy is stored, regardless of where the copy itself is stored.
Fully Distributed: Requests to lock or unlock a copy of an object stored at a site are handled by the lock manager at the site where the copy is stored.
Distributed Deadlock
One issue that requires special attention when using either primary copy or fully distributed locking is deadlock detection.
Each site maintains a local waits-for graph, and a cycle in a local graph indicates a deadlock.
For example:
Suppose that we have two sites A and B, both containing copies of objects O1 and O2, and that the read-any write-all technique is used.
T1, which wants to read O1 and write O2, obtains an S lock on O1 and an X lock on O2 at site A, and then requests an X lock on O2 at site B.
T2, which wants to read O2 and write O1, meanwhile obtains an S lock on O2 and an X lock on O1 at site B, and then requests an X lock on O1 at site A.
As shown in the following figure, T2 is waiting for T1 at site A and T1 is waiting for T2 at site B; thus we have a deadlock.

To detect such deadlocks, a distributed deadlock detection algorithm must be used and we
have three types of algorithms:
1. Centralized Algorithm:
It consists of periodically sending all local waits-for graphs to a single site that is responsible for global deadlock detection.
At this site, the global waits-for graph is generated by combining all the local graphs; the set of nodes is the union of the nodes in the local graphs, and there is an edge from one node to another if such an edge appears in any of the local graphs.
2. Hierarchical Algorithm:
This algorithm organizes the sites into a hierarchy; for example, sites might be grouped by state, then by country, and finally into a single group that contains all sites.
Every node in this hierarchy constructs a waits-for graph that reveals deadlocks involving only the sites contained in (the subtree rooted at) that node.
Thus, all sites periodically (e.g., every 10 seconds) send their local waits-for graph to the site constructing the waits-for graph for their country.
The sites constructing waits-for graphs at the country level periodically (e.g., every 10 minutes) send the country waits-for graph to the site constructing the global waits-for graph.
3. Simple Algorithm:
If a transaction waits longer than some chosen time-out interval, it is aborted.
Although this algorithm causes many unnecessary restarts, the overhead of deadlock detection is low.
Distributed Recovery: Recovery in a distributed DBMS is more complicated than in a
centralized DBMS for the following reasons:
New kinds of failure can arise: failure of communication links and failure of a remote site at which a subtransaction is executing.
Either all subtransactions of a given transaction must commit, or none must commit, and this property must be guaranteed despite any combination of site and link failures. This guarantee is achieved using a commit protocol.
Normal execution and Commit Protocols:
During normal execution, each site maintains a log, and the actions of a subtransaction are logged at the site where it executes.
The regular logging activity is carried out and, in addition, a commit protocol is followed to ensure that all subtransactions of a given transaction either commit or abort uniformly.
The transaction manager at the site where the transaction originated is called the Coordinator for the transaction, and the transaction managers at the sites where its subtransactions execute are called Subordinates.
Two Phase Commit Protocol:
When the user decides to commit a transaction, the commit command is sent to the coordinator for that transaction.
This initiates the 2PC protocol:
The coordinator sends a Prepare message to each subordinate.
When a subordinate receives a Prepare message, it decides whether to abort or commit its subtransaction. It force-writes an abort or prepare log record and then sends a No or Yes message to the coordinator.
Here we can have two conditions:
o If the coordinator receives Yes messages from all subordinates, it force-writes a commit log record and then sends a commit message to all the subordinates.
o If it receives even one No message, or no response from some subordinate within a specified time-out period, it force-writes an abort log record and then sends an abort message to all subordinates.
Here again we can have two conditions:
o When a subordinate receives an abort message, it force-writes an abort log record, sends an ack message to the coordinator, and aborts the subtransaction.
o When a subordinate receives a commit message, it force-writes a commit log record, sends an ack message to the coordinator, and commits the subtransaction.
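For example, in the case where all subordinates vote Yes, the exchange of messages and forced log writes might look like this (a sketch of one possible run with two subordinates, S1 and S2):
1. The coordinator sends Prepare to S1 and S2.
2. S1 force-writes a prepare log record and replies Yes; S2 does the same.
3. The coordinator force-writes a commit log record and sends Commit to S1 and S2.
4. S1 and S2 each force-write a commit log record, send an ack to the coordinator, and commit their subtransactions.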

UNIT IV
INTRODUCTION TO DATABASE SECURITY
There are three main objectives to consider while designing a secure database application:
1. Secrecy: Information should not be disclosed to unauthorized users. For example, a
student should not be allowed to examine other students' grades.
2. Integrity: Only authorized users should be allowed to modify data. For example, students
may be allowed to see their grades, yet not allowed (obviously!) to modify them.
3. Availability: Authorized users should not be denied access. For example, an instructor
who wishes to change a grade should be allowed to do so.
A DBMS typically includes a database security and authorization subsystem that is
responsible for ensuring the security of portions of a database against unauthorized access.
It is now customary to refer to two types of database security mechanisms:
Discretionary Security mechanisms: These are used to grant privileges to users, including the capability to access specific data files, records, or fields in a specified mode (such as read, insert, delete, or update).
Mandatory security mechanisms: These are used to enforce multilevel security by classifying the data and users into various security classes (or levels) and then implementing the appropriate security policy of the organization. For example, a typical policy is to permit users at a certain classification level to see only data items classified at the user's own level.
An extension of this is role-based security, which enforces policies and privileges based on the concept of roles.

ACCESS CONTROL

A DBMS should provide mechanisms to control access to data. A DBMS offers two main
approaches to access control.
Discretionary access control
Mandatory access control
Discretionary access control: It is based on the concept of access rights, or privileges, and mechanisms for giving users such privileges. A privilege allows a user to access some data object in a certain manner (e.g., to read or to modify). A user who creates a database object such as a table or
a view automatically gets all applicable privileges on that object. SQL-92 supports
discretionary access control through the GRANT and REVOKE commands.
The GRANT command gives users privileges on base tables and views. The syntax of this command is as follows:
GRANT privileges ON object TO users [WITH GRANT OPTION]
Here object is either a base table or a view.
Several privileges can be specified including:
SELECT: The right to access (read) all columns of the table specified as object, including
columns added later through ALTER TABLE commands.
INSERT(column-name): The right to insert rows with (non-null or non default) values in the
named column of the table named as object. The privileges UPDATE(column-name) and
UPDATE are similar to INSERT.
DELETE: The right to delete rows from the table named as object.
REFERENCES(column-name): The right to define foreign keys (in other tables) that refer
to the specified column of the table object. REFERENCES without a column name specified
denotes this right with respect to all columns.
For Example:
Suppose that user joe has created the tables BOATS, RESERVES, and SAILORS. Some
examples of GRANT command that joe can now execute are:
GRANT INSERT, DELETE ON RESERVES TO Yuppy WITH GRANT OPTION
GRANT SELECT ON RESERVES TO Michael
GRANT SELECT ON SAILORS TO Michael WITH GRANT OPTION
GRANT UPDATE (rating) ON SAILORS TO Leah
GRANT REFERENCES (bid) ON BOATS TO Bill
Adding WITH GRANT OPTION at the end of a GRANT command allows the user who has been granted the privilege to pass that privilege on to other users.
In the above examples, Yuppy can insert or delete Reserves rows and can authorize someone else to do the same. Michael can execute SELECT queries on Sailors and Reserves, and he can pass this privilege on to others for Sailors, but not for Reserves.
The REVOKE command takes away privileges.
This is complementary command to GRANT that allows the withdrawal of privileges.
The syntax of REVOKE Command is as follows:
REVOKE [ GRANT OPTION FOR] Privileges
ON object FROM users {RESTRICT|CASCADE}

The command can be used to revoke either a privilege or just the grant option on a privilege (by using the optional GRANT OPTION FOR clause).
A user who has granted a privilege to other users may change his mind and want to withdraw the granted privilege. The intuition behind exactly what effect a REVOKE command has is complicated by the fact that a user may be granted the same privilege multiple times, possibly by different users.
When a user executes a REVOKE command with the CASCADE keyword, the effect is to withdraw the named privileges or grant option from all users who currently hold these privileges solely through a GRANT command that was previously executed by the user who is now executing the REVOKE command. If these users received the privileges with the grant option and passed them along, those recipients also lose their privileges as a consequence of the REVOKE command, unless they received these privileges independently.
For Example:
GRANT SELECT ON Sailors TO Art WITH GRANT OPTION (executed by Joe)
GRANT SELECT ON Sailors TO Bob WITH GRANT OPTION (executed by Art)
REVOKE SELECT ON Sailors FROM Art CASCADE (executed by Joe)
Art loses the SELECT privilege on Sailors, of course. Then Bob, who received this privilege
from Art, and only Art, also loses this privilege.
If the RESTRICT keyword is specified in the REVOKE command, the command is rejected if
revoking the privileges just from the users specified in the command would result in other
privileges becoming abandoned.
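For example, continuing the Sailors scenario above, either of the following could have been issued instead of the CASCADE revocation shown earlier (a sketch; the exact outcome depends on the privileges currently held):
REVOKE GRANT OPTION FOR SELECT ON Sailors FROM Art CASCADE (executed by Joe)
-- Art keeps SELECT on Sailors but can no longer pass it on; Bob's privilege,
-- which was derived solely through Art's grant option, is withdrawn.
REVOKE SELECT ON Sailors FROM Art RESTRICT (executed by Joe)
-- Rejected, because withdrawing Art's privilege would leave Bob's privilege abandoned.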
Mandatory access control: It is based on system wide policies that cannot be changed by
individual users. In this approach each database object is assigned a security class, each
user is assigned clearance for a security class, and rules are imposed on reading and writing
of database objects by users. The DBMS determines whether a given user can read or write
a given object based on certain rules that involve the security level of the object and the
clearance of the user.
The popular model for mandatory access control, called the Bell-LaPadula model, is
described in terms of objects (e.g., tables, views, rows, columns), subjects (e.g., users,
programs), security classes, and clearances. Each database object is assigned a security
class, and each subject is assigned clearance for a security class; we will denote the class of
an object or subject A as class(A). The security classes in a system are organized according
to a partial order, with a most secure class and a least secure class. For simplicity, we
will assume that there are four classes: top secret (TS), secret (S), confidential (C), and
unclassified (U). In this system, TS > S > C > U, where A > B means that class A data is
more sensitive than class B data.
The Bell-LaPadula model imposes two restrictions on all reads and writes of database
objects:
1. Simple Security Property: Subject S is allowed to read object O only if class(S) >= class(O). For example, a user with TS clearance can read a table with C classification, but a user with C clearance is not allowed to read a table with TS classification.
2. *-Property: Subject S is allowed to write object O only if class(S) <= class(O). For example, a user with S clearance can only write objects with S or TS classification.

Multilevel Relations and Polyinstantiation

To apply mandatory access control policies in a relational DBMS, a security class must be
assigned to each database object. The objects can be at the granularity of tables, rows, or
even individual column values. Let us assume that each row is assigned a security class.
This situation leads to the concept of a multilevel table, which is a table with the surprising property that users with different security clearances will see a different collection of rows when they access the same table.
Consider the instance of the Boats table shown in Figure below. Users with S and TS
clearance will get both rows in the answer when they ask to see all rows in Boats. A user
with C clearance will get only the second row, and a user with U clearance will get no rows.

bid     bname     color     Security Class
101     Salsa     Red       S
102     Pinto     Brown     C

The Boats table is defined to have bid as the primary key. Suppose that a user with clearance C wishes to enter the row <101, Picante, Scarlet, C>. We have a dilemma:
If the insertion is permitted, two distinct rows in the table will have key 101.
If the insertion is not permitted because the primary key constraint is violated, the user
trying to insert the new row, who has clearance C, can infer that there is a boat with
bid=101 whose security class is higher than C. This situation compromises the principle that
users should not be able to infer any information about objects that have a higher security
classification.

This dilemma is resolved by effectively treating the security classification as part of the key.
Thus, the insertion is allowed to continue, and the table instance is modified as shown in
Figure below.

bid     bname       color       Security Class
101     Salsa       Red         S
101     Picante     Scarlet     C
102     Pinto       Brown       C

Users with clearance C or U see just the rows for Picante and Pinto, but users with clearance
S or TS see all three rows. The two rows with bid=101 can be interpreted in one of two ways:
only the row with the higher classification (Salsa, with classification S) actually exists, or
both exist and their presence is revealed to users according to their clearance level. The
choice of interpretation is up to application developers and users.

Covert Channels, DoD Security Levels


Even if a DBMS enforces the mandatory access control scheme discussed above, information
can flow from a higher classification level to a lower classification level through indirect
means, called covert channels. For example, if a transaction accesses data at more than
one site in a distributed DBMS, the actions at the two sites must be coordinated. The process
at one site may have a lower clearance (say C) than the process at another site (say S), and

both processes have to agree to commit before the transaction can be committed. This
requirement can be exploited to pass information with an S classification to the process with
a C clearance: The transaction is repeatedly invoked, and the process with the C clearance
always agrees to commit, whereas the process with the S clearance agrees to commit if it
wants to transmit a 1 bit and does not agree if it wants to transmit a 0 bit.
In this manner, information with an S classification can be sent to a process with a C clearance as a stream of bits. This covert channel is an indirect violation of the intent behind the *-Property.

Role of the Database Administrator


The database administrator (DBA) plays an important role in enforcing the security related
aspects of a database design. In conjunction with the owners of the data, the DBA will
probably also contribute to developing a security policy. The DBA has a special account,
which we will call the system account, and is responsible for the overall security of the
system. In particular the DBA deals with the following:
1. Creating new accounts: Each new user or group of users must be assigned an
authorization id and a password. Note that application programs that access the database
have the same authorization id as the user executing the program.
2. Mandatory control issues: If the DBMS supports mandatory control (only some customized systems for applications with very high security requirements, such as military data, provide such support), the DBA must assign security classes to each database object and assign security clearances to each authorization id in accordance with the chosen security policy.
3. Audit trail: The DBA is also responsible for maintaining the audit trail, which is
essentially the log of updates with the authorization id (of the user who is executing the
transaction) added to each log entry. This log is just a minor extension of the log mechanism
used to recover from crashes. Additionally, the DBA may choose to maintain a log of all
actions, including reads, performed by a user. Analyzing such histories of how the DBMS was
accessed can help prevent security violations by identifying suspicious
patterns before an intruder finally succeeds in breaking in, or it can help track down an
intruder after a violation has been detected.

Encryption
A DBMS can use encryption to protect information in certain situations where the normal
security mechanism of the DBMS are not adequate. For example, an intruder may steal
tapes containing some data or tape a communication line. By storing and transmitting data
in an encrypted form, the DBMS ensures that such stolen data is not intelligible to the
intruder.
Encryption is done through an encryption algorithm, which takes the original data and an encryption key as input; the output of the algorithm is the encrypted version of the data. There is also a decryption algorithm, which takes the encrypted data and the encryption key as input and returns the original data. This approach, in which the same key is used to encrypt and decrypt, is taken in the Data Encryption Standard (DES). The main weakness of this approach is that authorized users must be told the encryption key, and the mechanism for communicating this information is vulnerable to clever intruders.
Another approach is called Public Key encryption. The encryption scheme proposed by
Rivest, Shamir, and Adleman, called RSA, is a well-known example of public-key encryption.

In this approach, each authorized user has a public encryption key, known to everyone, and a private decryption key, chosen by the user and known only to him or her.
For example, consider a user called Sam. Anyone can send Sam a secret message by encrypting the message using Sam's publicly known encryption key. Only Sam can decrypt this secret message, because the decryption algorithm requires Sam's private decryption key, known only to Sam. Since users choose their own decryption keys, the weakness of DES is avoided.

UNIT V

What is Postgres?
Traditional relational database management systems (DBMSs) support a data model
consisting of a collection of named relations, containing attributes of a specific type. In
current commercial systems, possible types include floating point numbers, integers,
character strings, money, and dates. It is commonly recognized that this model is
inadequate for future data processing applications. The relational model successfully
replaced previous models in part because of its "Spartan simplicity". However, as
mentioned, this simplicity often makes the implementation of certain applications very
difficult. Postgres offers substantial additional power by incorporating the following four
additional basic concepts in such a way that users can easily extend the system:
classes
inheritance
types
functions
Other features provide additional power and flexibility:
constraints
triggers
rules
transaction integrity
These features put Postgres into the category of databases referred to as object-relational.
Postgres is a client/server application. As a user, you only need access to the client portions of the installation.
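As one small illustration of these extensions, inheritance between tables (classes) can be expressed directly in Postgres. The sketch below uses the classic cities/capitals example from the Postgres documentation; the table and column names are illustrative:
CREATE TABLE cities (
    name       text,
    population real,
    altitude   int
);
CREATE TABLE capitals (
    state char(2)
) INHERITS (cities);
-- A query such as SELECT name, altitude FROM cities
-- also returns the rows stored in the child table capitals.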

POSTGRES ARCHITECTURE
Postgres uses a simple "process per-user" client/server model. A Postgres session consists of
the following cooperating UNIX processes (programs):
A supervisory daemon process (postmaster),
The user's frontend application (e.g., the psql program), and
One or more backend database servers (the postgres process itself).
A single postmaster manages a given collection of databases on a single host. Such a
collection of databases is called an installation or site. Frontend applications that wish to
access a given database within an installation make calls to the client library. The library sends user requests over the network to the postmaster, which
in turn starts a new backend server process and connects the frontend process to the new
server. From that point on, the frontend process and the backend server communicate
without intervention by the postmaster. Hence, the postmaster is always running, waiting for
requests, whereas frontend and backend processes come and go.

Transactions in POSTGRES

Transactions are a fundamental concept of all database systems. The essential point of a
transaction is that it bundles multiple steps into a single, all-or-nothing operation. The
intermediate states between the steps are not visible to other concurrent transactions, and if some
failure occurs that prevents the transaction from completing, then none of the steps affect the
database at all.
For example, consider a bank database that contains balances for various customer accounts, as
well as total deposit balances for branches. Suppose that we want to record a payment of $100.00
from Alice's account to Bob's account. Simplifying outrageously, the SQL commands for this
might look like
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
UPDATE branches SET balance = balance - 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Alice');
UPDATE accounts SET balance = balance + 100.00
WHERE name = 'Bob';
UPDATE branches SET balance = balance + 100.00
WHERE name = (SELECT branch_name FROM accounts WHERE name = 'Bob');

The details of these commands are not important here; the important point is that there are
several separate updates involved to accomplish this rather simple operation. Our bank's officers
will want to be assured that either all these updates happen, or none of them happen. It would
certainly not do for a system failure to result in Bob receiving $100.00 that was not debited from
Alice. Nor would Alice long remain a happy customer if she was debited without Bob being
credited. We need a guarantee that if something goes wrong partway through the operation, none
of the steps executed so far will take effect. Grouping the updates into a transaction gives us this
guarantee. A transaction is said to be atomic: from the point of view of other transactions, it
either happens completely or not at all.
In PostgreSQL, a transaction is set up by surrounding the SQL commands of the transaction with
BEGIN and COMMIT commands. So our banking transaction would actually look like
BEGIN;
UPDATE accounts SET balance = balance - 100.00
WHERE name = 'Alice';
-- etc etc
COMMIT;

If, partway through the transaction, we decide we do not want to commit (perhaps we just
noticed that Alice's balance went negative), we can issue the command ROLLBACK instead of
COMMIT, and all our updates so far will be canceled.
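A minimal sketch of such a cancelled transaction, using the same assumed accounts table, is:
BEGIN;
UPDATE accounts SET balance = balance - 100.00
    WHERE name = 'Alice';
-- Alice's balance would go negative, so undo the change
ROLLBACK;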

PostgreSQL actually treats every SQL statement as being executed within a transaction. If you
do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if
successful) COMMIT wrapped around it. A group of statements surrounded by BEGIN and COMMIT
is sometimes called a transaction block.

XML stands for the eXtensible Markup Language. It is a new markup language, developed by
the W3C (World Wide Web Consortium)
Some of the areas where XML will be useful in the near-term include:
large Web site maintenance. XML would work behind the scenes to simplify the creation of HTML documents
exchange of information between organizations
off loading and reloading of databases
syndicated content, where content is being made available to different Web sites
electronic commerce applications where different organizations collaborate to serve a customer
scientific applications with new markup languages for mathematical and chemical formulas
electronic books with new markup languages to express rights and ownership
handheld devices and smart phones with new markup languages optimized for these
alternative devices
XML makes essentially two changes to HTML:
It predefines no tags.
It is stricter.
No Predefined Tags
Because there are no predefined tags in XML, you, the author, can create the tags that you need.
Example:
<price currency="usd">499.00</price>
<toc xlink:href="/newsletter">Pineapplesoft Link</toc>
Stricter
HTML has a very forgiving syntax. This is great for authors who can be as lazy as they want, but
it also makes Web browsers more complex. According to some estimates, more than 50% of the
code in a browser handles errors or sloppiness on the author's part.
XML Example:
A List of Products in XML
<?xml version="1.0"?>
<products>
<product id="p1">
<name>XML Editor</name>
<price>499.00</price>
</product>
<product id="p2">
<name>DTD Editor</name>
<price>199.00</price>
</product>
<product id="p3">
<name>XML Book</name>
<price>19.99</price>
</product>
<product id="p4">
<name>XML Training</name>
<price>699.00</price>
</product>
</products>
In this context, XML is used to exchange information between organizations. The XML Web is a large database on which applications can tap.
(Figure: applications exchanging data over the Web)
XML Syntax
The syntax rules were described in the previous chapters:
XML documents must have a root element
XML elements must have a closing tag
XML tags are case sensitive
XML elements must be properly nested
XML attribute values must be quoted
XML Schemas
The DTD is the original modeling language or schema for XML.
The syntax for DTDs is different from the syntax for XML documents.
The purpose of a DTD is to define the structure of an XML document. It defines the structure
with a list of legal elements:
Example: an XML document that references an external DTD (Note.dtd):
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE note SYSTEM "Note.dtd">
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Written as an internal DTD, the same structure would be declared within the document as follows:
<!DOCTYPE note
[
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>
XML Schema

W3C supports an XML-based alternative to DTD, called XML Schema:

<xs:element name="note">
<xs:complexType>
<xs:sequence>
<xs:element name="to" type="xs:string"/>
<xs:element name="from" type="xs:string"/>
<xs:element name="heading" type="xs:string"/>
<xs:element name="body" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
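The fragment above is not a complete schema document on its own; to be usable it would typically be wrapped in an xs:schema root element that declares the xs prefix, roughly as follows (a sketch):
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- the xs:element declaration for note shown above goes here -->
</xs:schema>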
XML NAMESPACES

XML Namespaces provide a method to avoid element name conflicts.


An XML namespace is a collection of element and attribute names. XML namespaces provide a means for document authors to unambiguously refer to elements with the same name (i.e., prevent collisions). For example,
<subject>Geometry</subject>
and
<subject>Cardiology</subject>
both use the element subject to mark up data. In the first case, the subject is something one studies in school, whereas in the second case, the subject is a field of medicine. Namespaces can differentiate these two subject elements, for example:
<highschool:subject>Geometry</highschool:subject>
and
<medicalschool:subject>Cardiology</medicalschool:subject>
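Before such prefixes can be used, they must be bound to namespace URIs with xmlns attributes. A minimal sketch (the root element name and the URIs are illustrative placeholders) is:
<!-- the namespace URIs below are placeholders chosen for illustration -->
<document xmlns:highschool="http://example.com/highschool"
          xmlns:medicalschool="http://example.com/medicalschool">
  <highschool:subject>Geometry</highschool:subject>
  <medicalschool:subject>Cardiology</medicalschool:subject>
</document>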

Benefits of the DTD


The main benefits of using a DTD are:
The XML processor enforces the structure, as defined in the DTD.
The application accesses the document structure, such as to populate an element list.
The DTD gives hints to the XML processor; that is, it helps separate indenting from content.
The DTD can declare default or fixed values for attributes. This might result in a smaller document.

XSL
XSL stands for EXtensible Stylesheet Language.
The World Wide Web Consortium (W3C) started to develop XSL because there was a need for
an XML-based Stylesheet Language.

XSL = Style Sheets for XML


XML does not use predefined tags (we can use any tag-names we like), and therefore the
meaning of each tag is not well understood.
A <table> tag could mean an HTML table, a piece of furniture, or something else - and a browser
does not know how to display it.
XSL describes how the XML document should be displayed!
XSL consists of three parts:
XSLT - a language for transforming XML documents
XPath - a language for navigating in XML documents
XSL-FO - a language for formatting XML documents

What is XSLT?
XSLT is a language for transforming XML documents into XHTML documents or into other XML documents.
XSLT stands for XSL Transformations
XSLT is the most important part of XSL
XSLT transforms an XML document into another XML document
XSLT uses XPath to navigate in XML documents
XSLT is a W3C Recommendation

XPath is a language for navigating in XML documents. XSLT uses XPath to find
information in an XML document. XPath is used to navigate through elements and
attributes in XML documents.
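As an illustration (a minimal sketch, not an example taken from these notes), the following stylesheet transforms the products document shown earlier into a simple HTML list; the select attributes contain XPath expressions:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <html>
      <body>
        <h1>Product Prices</h1>
        <ul>
          <!-- the XPath expression products/product selects every product element -->
          <xsl:for-each select="products/product">
            <li><xsl:value-of select="name"/>: <xsl:value-of select="price"/></li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>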

What is XSL-FO?

XSL-FO is a language for formatting XML data
XSL-FO stands for Extensible Stylesheet Language Formatting Objects
XSL-FO is based on XML
XSL-FO is a W3C Recommendation
XSL-FO is now formally named XSL
