Department of CSE, SKITM
Unit-1
Introduction: Architecture, Advantages, Disadvantages, Data models, Relational algebra, SQL, Normal forms.
Introductory Concepts
data—a fact, something upon which an inference is based (information or knowledge has value,
data has cost)
data item—smallest named unit of data that has meaning in the real world (examples: last
name, address, ssn, political party)
data aggregate (or group) -- a collection of related data items that form a
whole concept; a simple group is a fixed collection, e.g. date (month, day, year); a repeating
group is a variable length collection, e.g. a set of aliases.
database—collection of interrelated stored data that serves the needs of multiple users within
one or more organizations; a collection of tables in the relational model.
database management system (DBMS) -- a generalized software system for storing and
manipulating databases. Includes logical view (schema, sub-schema), physical view (access
methods, clustering), data manipulation language, data definition language, utilities - security,
recovery, integrity, etc.
database administrator (DBA) -- person or group responsible for the effective use of database
technology in an organization or enterprise.
A DBMS is generally defined as a collection of logically related data and a set of programs to access the data. Strictly speaking, this is the definition of a "Database System", which comprises two components: (i) the Database and (ii) the DBMS.
Fig: A database system. User queries are handled by the DBMS, whose query-processing software and storage-management software mediate between the schema definition and the data stored in the database.
DATABASE
A Database is a collection of logically related data that can be recorded. The information stored in the database must have the following implicit properties:
(a) It must represent some real-world aspect; like a college or a company etc. The aspect
represented by the database is called its “Mini-world”.
(b) It must comprise a logically coherent collection of data, which should have well-
understood inherent meaning (semantics).
(c) The repository of data must be designed, developed and implemented for a specific
purpose. There must exist an intended group of users, who must have some pre-conceived
applications of the data.
For example, in the college database, sources of information will be students, faculty, labs etc.
The real-world events affecting the information in the database will be admissions, exams,
results & placements etc. The set of intended users will be faculty, students, admin staff etc.
Hierarchical database
A hierarchical data model is a data model in which the data is organized into a tree-like structure.
The structure allows repeating information using parent/child relationships: each parent can have
many children but each child only has one parent (also known as a 1:many ratio ). All attributes
of a specific record are listed under an entity type.
In a database, an entity type is the equivalent of a table; each individual record is represented as a
row and an attribute as a column. Entity types are related to each other using 1:N mappings, also known as one-to-many relationships. This model is recognized as the first database model; it was created by IBM in the 1960s.
Network database
The network model is a database model conceived as a flexible way of representing objects and
their relationships. Its distinguishing feature is that the schema, viewed as a graph in which
object types are nodes and relationship types are arcs, is not restricted to being a hierarchy or
lattice.
The network model's original inventor was Charles Bachman, and it was developed into a
standard specification published in 1969 by the CODASYL Consortium.
Relational database
A relational database matches data by using common characteristics found within the data set.
The resulting groups of data are organized and are much easier for many people to understand.
For example, a data set containing all the real-estate transactions in a town can be grouped by the
year the transaction occurred; or it can be grouped by the sale price of the transaction; or it can
be grouped by the buyer's last name; and so on.
Such a grouping uses the relational model (a technical term for this is schema). Hence, such a
database is called a "relational database."
The software used to do this grouping is called a relational database management system
(RDBMS). The term "relational database" often refers to this type of software.
Relational databases are currently the predominant choice in storing financial records, medical
records, manufacturing and logistical information, personnel data and much more.
Object-oriented database
An object database (also object-oriented database) is a database model in which information
is represented in the form of objects as used in object-oriented programming.
Object databases are a niche field within the broader DBMS market dominated by relational
database management systems (RDBMS). Object databases have been considered since the early 1980s, but they have made little impact on mainstream commercial data processing, though there is some usage in specialized areas.
Object-relational database
An object-relational database (ORD), or object-relational database management system
(ORDBMS), is a database management system (DBMS) similar to a relational database, but with
an object-oriented database model: objects, classes and inheritance are directly supported in
database schemas and in the query language. In addition, it supports extension of the data model
with custom data-types and methods.
A database is a structured collection of data that describes the activities of one or more related organizations, stored in a computer system [1]. This repository of data is tasked with maintaining and presenting the data in a consistent and efficient fashion to the applications that use it, and to the users of those applications.
1.2 DBMS:
Defining a database: Specify the data types, structures and constraints for the data.
Manipulating the database: Querying to retrieve specific data, update to reflect changes,
deletion, and generating reports.
Other features include protection or security measures to prevent unauthorized access, and
presentation and visualization of data.
All manipulations of the structure of the database, or of the information it holds, must be done through the DBMS.
Database Features:
Database Management Systems were developed to handle the following difficulties of typical
file-processing systems supported by conventional operating systems.
Database Administration: By providing a common umbrella for a large collection of data that
is stored by several users, a DBMS facilitates maintenance and data administration tasks. A good
DBA can efficiently shield end-users from the chores of fine-tuning the data representation,
periodic back-ups etc.
Data Abstraction:
The main purpose of a database system is to provide users with an abstract view of the data. The system hides certain details of how the data is stored and maintained, from level to level.
Physical Level: describes how and where data are actually stored; the lowest level of abstraction, expressed in terms of low-level data structures.
Conceptual Level: describes what data is stored, the relationships among the data, and the semantics of the data. Database administrators work at this level.
View Level: the highest level of abstraction; describes a partial view of the database for a particular group of users. There can be many different views of the same database at this level.
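As an illustrative sketch of these levels (the table, view, and column names are assumptions, not from the text), a relational DBMS exposes conceptual-level tables and view-level projections while keeping the physical level hidden:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Conceptual level: what data is stored and how it relates.
cur.execute("CREATE TABLE student (sid INTEGER PRIMARY KEY, name TEXT, gpa REAL)")
cur.execute("INSERT INTO student VALUES (1, 'Asha', 8.2), (2, 'Ravi', 6.9)")

# View level: a partial view for one group of users (gpa is not exposed).
cur.execute("CREATE VIEW student_public AS SELECT sid, name FROM student")

rows = cur.execute("SELECT * FROM student_public ORDER BY sid").fetchall()
# The physical level (how SQLite lays out pages on disk) stays invisible here.
```

Users of `student_public` never see how, or even whether, `gpa` is stored.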
DATA MODELS
SCHEMA:
A Schema can be defined as, a logical structure described in a formal language supported by the
DBMS [1]. In a relational database, the schema defines a table, fields, and relationships between
fields and tables. A Schema is analogous to type information of a variable in a program.
Instance:
An Instance is the actual content of the database at a particular point in time. An Instance is
analogous to the value of the variable.
Data Model:
A Data Model is a Collection of tools or concepts for describing data, the meaning of data, data
relationships, and data constraints. Data models fall into three groups: object-based logical models, record-based logical models, and physical data models.
Object-based Logical Models:
In these models, the data is described at the conceptual and view levels. They provide fairly flexible structuring capabilities and allow data constraints to be specified explicitly. They include:
Entity-relationship Model
Object-oriented Model
Entity-relationship Model:
The E-R model is based on a collection of basic objects, called entities, and the relationships among them. An entity is a thing or object in the real world that is distinguishable from other objects.
Object-Oriented Model:
The Object-oriented Model is based on a collection of objects, like the E-R Model. An object
contains values stored in instance variables within the object. Unlike record-based models, these
values are themselves objects. Objects contain objects to an arbitrarily deep level of nesting. An
object also contains bodies of code that operate on the object, which are called Methods. Objects
that contain the same types of values and the same methods are grouped into classes. A class can
be viewed as a type definition for Objects, compared to the concept of an abstract data type in a
programming language. The only way in which one object can access the data of another object
is by invoking the method of that other object, which is called sending a message to the object.
In record-based data models, changing, for example, how interest is computed entails changing code in application programs. In the object-oriented model, this only requires a change within the pay-interest method.
Unlike entities in the E-R Model, each object has its own unique identity, independent of the
values it contains. Two objects containing the same values are distinct. Distinction is maintained
in physical level by assigning distinct object identifiers.
Record-based Logical Models:
In these models, the data is described at the conceptual and view levels. These models specify the overall logical structure of the database. The database is structured in fixed-format records of several types. Each record type defines a fixed number of fields (attributes), each usually of fixed length. There are three such models:
1. Relational Model
2. Network Model
3. Hierarchical Model
Relational Model:
This data model is based on first-order predicate logic. Its core idea is to describe a database as a
collection of predicates over a finite set of predicate variables, describing constraints on the
possible values and combination of values. Data and relationships are represented by a collection
of tables. Each table has a number of columns with unique names, e.g., customer, account. A
relational database allows the definition of data structures, storage and retrieval operations and
integrity constraints.
Network Model:
This model organizes data using two fundamental constructs, called records and sets. Records
contain fields, and sets define one-to-many relationships between records. Data are represented
by collection of records. A set consists of an owner record type, a set name, and a member record
type. An owner record type can also be a member or owner in another set. Relationships among
data are represented by links.
Hierarchical Model:
In this model, data is organized into a tree-like structure, implying a single upward link in each record to describe the nesting, and a sort field to keep the records in a particular order in each same-level list [8]. A hierarchy of parent and child data segments exists, with repeating information generally held in child data segments. This model collects all the instances of a specific record together as a record type. Links are created between record types using parent-child relationships, with a 1:N mapping between record types.
The DBMS must provide appropriate languages once the design is completed. A conceptual and an internal schema, and the mappings between the two, must be specified for the database (DDL). Once the database schemas are compiled and the database is populated with data, users must have some means to manipulate the database (DML).
DDL is used to specify both conceptual and internal schemas as a set of definitions. DDL statements are compiled, resulting in a set of tables stored in a special file called the Data Dictionary. The Data Dictionary contains metadata (data about data). DDL hides the implementation details of the database schemas from the users.
DML (Data Manipulation Language): a language that facilitates retrieval of information from the database, insertion of new information, deletion of information, and modification of information in the database.
Low-Level or Procedural:
Typically retrieves individual records or objects from the database and processes each separately.
Needs to use programming language constructs, such as looping, to retrieve and process each
record from a set of records.
High-level or Nonprocedural:
The user specifies what data is needed. Easier to use, but may not generate code as efficient as that produced by procedural languages.
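A minimal sketch of the contrast, holding an illustrative employee relation both as plain records and as a SQLite table (all names and figures are made up for the example):

```python
import sqlite3

employees = [(1, "Asha", 52000), (2, "Ravi", 38000), (3, "Meena", 61000)]

# Low-level / procedural: loop over the records and test each one ourselves.
high_paid_proc = []
for eno, name, sal in employees:
    if sal > 50000:
        high_paid_proc.append(name)

# High-level / nonprocedural: state WHAT is wanted; the DBMS decides how.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (eno INTEGER, name TEXT, sal INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", employees)
high_paid_decl = [r[0] for r in
                  conn.execute("SELECT name FROM emp WHERE sal > 50000 ORDER BY eno")]
```

Both compute the same answer; only the SQL version leaves the access strategy to the system.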
Database Administrator:
A Database Administrator (DBA) is a person having central control over data and programs
accessing that data and is responsible for the following tasks:
1. Schema definition.
2. Storage structure and access-method definition.
3. Schema and physical-organization modification.
4. Granting authorization for data access.
5. Monitoring performance.
Database Users:
Application Programmers:
These people are computer professionals who interact with the system through DML calls embedded in a program written in a host language (e.g., C, Java, Pascal). The DML precompiler converts the DML calls into normal procedure calls in the host language; the host-language compiler then generates the object code. Such tools are sometimes called fourth-generation languages; they often include features to help generate forms and display data.
Sophisticated Users:
These users interact with the system without writing programs. They form requests by writing
queries in a database query language. These are submitted to a query processor that breaks a
DML statement down into instructions for the database manager module.
Specialized Users:
These users are sophisticated users writing special database application programs. These may be
knowledge-based, expert systems and complex data systems (audio/video) etc.
Naïve Users:
These users are unsophisticated users who interact with the system by using permanent
application programs (e.g. Automated Teller Machine).
Data independence is the capacity to change the schema at one level of the architecture without
having to change the schema at the next higher level. We distinguish between logical and
physical data independence according to which two adjacent levels are involved. The former
refers to the ability to change the conceptual schema without changing the external schema. The
latter refers to the ability to change the internal schema without having to change the conceptual.
Logical data independence: the capacity to change the conceptual schema without having to change the external schemas and their associated application programs.
Physical data independence: the capacity to change the internal schema without having to change the conceptual schema.
2. What are the advantages of views?
Views are virtual (not real, but effectively so) tables or relations, based on a user's view of a particular database.
3. What is relational schema
Representation of relational database's entities, attributes within those entities,
and relationships between those entities
Represented as DDL or Visually
Example: Employee(Ename, Eid, sal, bdate, hiredate, sex), where the primary key is underlined.
4. What is DDL
DDL means Data Definition Language
Used by the DBA and database designers to specify the conceptual schema of a
database.
In many DBMSs, the DDL is also used to define internal and external schemas
(views).
DDL commands are
CREATE
ALTER
TRUNCATE
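A hedged sketch of these DDL commands, run here against SQLite (table and column names are illustrative; note that SQLite has no TRUNCATE keyword, so an unqualified DELETE plays that role):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE: define a new table (part of the conceptual schema).
conn.execute("CREATE TABLE course (cid INTEGER PRIMARY KEY, title TEXT)")

# ALTER: change an existing table's definition.
conn.execute("ALTER TABLE course ADD COLUMN credits INTEGER")

conn.execute("INSERT INTO course VALUES (1, 'DBMS', 4)")

# TRUNCATE: remove all rows but keep the definition.
# (SQLite has no TRUNCATE; DELETE without WHERE is the equivalent.)
conn.execute("DELETE FROM course")

cols = [c[1] for c in conn.execute("PRAGMA table_info(course)")]
remaining = conn.execute("SELECT COUNT(*) FROM course").fetchone()[0]
```

After the three statements, the table definition (including the added column) survives while the data is gone.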
5. What is the Cartesian product?
This operation is used to combine tuples from two relations in a combinatorial
fashion.
Denoted by R(A1, A2, . . ., An) x S(B1, B2, . . ., Bm)
Result is a relation Q with degree n + m attributes:
i. Q(A1, A2, . . ., An, B1, B2, . . ., Bm), in that order.
The resulting relation state has one tuple for each combination of tuples—one
from R and one from S.
Hence, if R has nR tuples (denoted as |R| = nR ), and S has nS tuples, then R x S
will have nR * nS tuples.
The two operands do NOT have to be “type compatible”
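The degree and cardinality rules above can be checked directly with an illustrative pair of relations:

```python
from itertools import product

# R(A1, A2) with nR = 3 tuples; S(B1) with nS = 2 tuples.
R = [(1, "a"), (2, "b"), (3, "c")]
S = [("x",), ("y",)]

# R x S: one tuple for each combination of a tuple from R and a tuple from S.
RxS = [r + s for r, s in product(R, S)]

degree = len(RxS[0])     # n + m = 2 + 1 attributes
cardinality = len(RxS)   # nR * nS = 3 * 2 tuples
```

Note that R and S need not be type compatible: their attributes are simply concatenated.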
6. What is a data model?
A data model (a collection of concepts that can be used to describe the conceptual/logical structure of a database) provides the necessary means to achieve this abstraction.
By structure is meant the data types, relationships, and constraints that should hold for
the data. Most data models also include a set of basic operations for specifying
retrievals/updates.
7. What is data redundancy?
Repeating the same data again and again is redundancy. Data redundancy (such as tends to occur in the file-processing approach) leads to wasted storage space, duplication of effort (when multiple copies of a datum need to be updated), and a higher likelihood of the introduction of inconsistency.
8. Write about Naïve users
Naive/Parametric end users: Typically the biggest group of users; frequently
query/update the database using standard canned transactions that have been
carefully programmed and tested in advance. Examples:
i. Bank tellers check account balances and post withdrawals/deposits.
ii. Reservation clerks for airlines, hotels, etc., check availability of seats/rooms and make reservations.
iii. Shipping clerks (e.g., at UPS) who use buttons, bar-code scanners, etc., to update the status of in-transit packages.
9. What is a DBMS?
A database management system is software: a collection of programs that perform operations on data and manage the data.
The DBMS maintains a system catalog, which contains a description of the structure of each file, the type and storage format of each field, and the various constraints on the data (i.e., conditions that the data must satisfy).
The system catalog is used not only by users (e.g., who need to know the names of tables and
attributes, and sometimes data type information and other things), but also by the DBMS
software, which certainly needs to "know" how the data is structured/organized in order to
interpret it in a manner consistent with that structure.
The data dictionary is used to store schema descriptions and other information such as design decisions, application program descriptions, user information, and usage standards. It contains all the information stored in the catalog, but is accessed by users rather than by the DBMS.
The conceptual schema describes the (logical) structure of the whole database for a community of users. It hides physical storage details, concentrating on describing entities, data types, relationships, user operations, and constraints. It can be described using either a high-level or an implementation data model.
Applications:
1) Idea of Format of Data
2) Logical storage of Data
3) Structure of DBMS
Important Questions:
1. Describe the three-schema architecture. Why do we need mappings between schema levels?
2. List the cases in which null values are appropriate, with examples.
3. Differentiate between a file-processing system (FPS) and a DBMS.
4. Design a conceptual database design for a health-insurance system.
5. Compare and contrast the Relational model and the Hierarchical model.
6. Explain the basic operations of Relational Algebra with examples.
7. Draw and explain the DBMS component modules.
8. What are the advantages of a DBMS?
9. Explain the difference among entity, entity type, and relationship set.
10. What is an integrity constraint? Explain the different constraints in a DBMS.
11. What are the functions of the DBA?
12. Write about the architecture of a DBMS.
13. Explain the various database users.
14. What are the various capabilities of a DBMS?
15. What is the difference between logical data independence and physical data independence?
16. Discuss the main types of constraints on specialization and generalization.
17. What is the E-R model? Explain the components of the E-R model.
18. What is SQL? What are its various types of commands?
19. Explain the relational model and its advantages.
GUIDELINE 1: Informally, each tuple in a relation should represent one entity or relationship
instance. (Applies to individual relations and their attributes).
Insertion anomalies.
Deletion anomalies:
a. When a project is deleted, it will result in deleting all the employees who work on that project.
GUIDELINE 2:
Design a schema that does not suffer from the insertion, deletion and update
anomalies.
If there are any anomalies present, then note them so that applications can be
made to take them into account.
Null Values in Tuples
GUIDELINE 3:
Relations should be designed such that their tuples will have as few NULL values
as possible
Attributes that are NULL frequently could be placed in separate relations (with
the primary key)
Reasons for nulls:
Attribute not applicable or invalid
Attribute value unknown (may exist)
Value known to exist, but unavailable
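Guideline 3's suggestion of moving frequently-NULL attributes into a separate relation keyed by the same primary key can be sketched as follows (the employee and phone attributes are assumptions for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Instead of a mostly-NULL office_phone column on EMPLOYEE...
conn.execute("CREATE TABLE employee (eid INTEGER PRIMARY KEY, name TEXT)")

# ...store the rare attribute in its own relation, keyed by the same primary key.
conn.execute("""CREATE TABLE emp_phone (
    eid INTEGER PRIMARY KEY REFERENCES employee(eid),
    office_phone TEXT NOT NULL)""")

conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, 'Asha'), (2, 'Ravi'), (3, 'Meena')])
conn.execute("INSERT INTO emp_phone VALUES (1, '040-1234')")

# A LEFT JOIN reconstructs the nullable view only when it is actually needed.
rows = conn.execute("""SELECT e.eid, p.office_phone
                       FROM employee e LEFT JOIN emp_phone p ON e.eid = p.eid
                       ORDER BY e.eid""").fetchall()
```

Neither base table now stores any NULLs; the NULLs appear only in the derived join result.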
Applications:
1) Understanding the nature of DBMS
2) Relational Algebra and calculus
Normalization:
The process of decomposing unsatisfactory "bad" relations by breaking up their attributes into
smaller relations
• Normalization is used to design a set of relation schemas that is optimal from the point of
view of database updating
• Normalization starts from a universal relation schema
1NF
Unnormalized relation:

Name       PaperList
SWETHA     EENADU, HINDU, DC
PRASANNA   EENADU, VAARTHA, HINDU

Each name is associated with a varying number of papers, and the items in the PaperList column do not have a consistent form.
• Generally, an RDBMS cannot cope with relations like this: each entry in a table needs to hold a single data item.
• All RDBMSs require relations not to be like this, i.e. not to have multiple values in any column (no repeating groups).
Name PaperList
SWETHA EENADU
SWETHA HINDU
SWETHA DC
PRASANNA HINDU
PRASANNA EENADU
PRASANNA VAARTHA
• And it has the property that we sought. It is in First Normal Form (1NF).
• So this will be the first requirement in designing our databases: Obtaining 1NF
1NF is obtained by removing repeating groups. There are three approaches to removing repeating groups from unnormalized tables:
1. Remove the repeating groups by entering appropriate data in the empty columns of rows containing the repeating data.
2. Remove the repeating group by placing the repeating data, along with a copy of the original key attribute(s), in a separate relation. A primary key is identified for the new relation.
3. Find the maximum possible number of values for the multivalued attribute and add that many attributes to the relation.
Example:
The DEPARTMENT schema is not in 1NF because DLOCATION is not a single valued
attribute.
The relation should be split into two relations. A new relation DEPT_LOCATIONS is
created and the primary key of DEPARTMENT, DNUMBER, becomes an attribute of
the new relation. The primary key of this relation is {DNUMBER, DLOCATION}
Alternative solution: Leave the DLOCATION attribute as it is. Instead, we have one
tuple for each location of a DEPARTMENT. Then, the relation is in 1NF, but redundancy
exists.
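The first solution above can be sketched in SQL (run here through SQLite; the sample department and its locations are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# After the 1NF decomposition: DEPARTMENT keeps only single-valued attributes...
conn.execute("CREATE TABLE department (dnumber INTEGER PRIMARY KEY, dname TEXT)")

# ...and DEPT_LOCATIONS holds one row per location; key is {DNUMBER, DLOCATION}.
conn.execute("""CREATE TABLE dept_locations (
    dnumber   INTEGER REFERENCES department(dnumber),
    dlocation TEXT,
    PRIMARY KEY (dnumber, dlocation))""")

conn.execute("INSERT INTO department VALUES (5, 'Research')")
conn.executemany("INSERT INTO dept_locations VALUES (?, ?)",
                 [(5, 'Bellaire'), (5, 'Sugarland'), (5, 'Houston')])

locs = [r[0] for r in conn.execute(
    "SELECT dlocation FROM dept_locations WHERE dnumber = 5 ORDER BY dlocation")]
```

Every cell now holds a single value, so both relations are in 1NF.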
A super key of a relation schema R = {A1, A2, ...., An} is a set of attributes S subset-of
R with the property that no two tuples t1 and t2 in any legal relation state r of R will have
t1[S] = t2[S]
A key K is a super key with the additional property that removal of any attribute from K
will cause K not to be a super key any more.
If a relation schema has more than one key, each is called a candidate key.
One of the candidate keys is arbitrarily designated to be the primary key, and the
others are called secondary keys.
A Prime attribute must be a member of some candidate key
A Nonprime attribute is not a prime attribute—that is, it is not a member of any
candidate key
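Under these definitions, a candidate set of attributes can be tested against a relation state by checking that no two tuples agree on it. A sketch over an illustrative state (a real superkey must hold for every legal state, not just one sample):

```python
def is_superkey(rows, attrs):
    """True if no two tuples of this relation state agree on attrs."""
    seen = set()
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key in seen:
            return False
        seen.add(key)
    return True

# Illustrative state of Employee(eid, name, dept).
rows = [
    {"eid": 1, "name": "Asha", "dept": "CSE"},
    {"eid": 2, "name": "Ravi", "dept": "CSE"},
    {"eid": 3, "name": "Asha", "dept": "ECE"},
]
# {eid} distinguishes every tuple; {name} does not (two Ashas).
```

A key in the strict sense is a superkey from which no attribute can be removed.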
Definition of FD
Inference Rules for FDs
Equivalence of Sets of FDs
Minimal Sets of FDs
Trivial functional dependency means that the right-hand side is a subset ( not necessarily a
proper subset) of the left- hand side.
In normalization we are chiefly interested in functional dependencies that:
• have a one-to-one relationship between attribute(s) on the left- and right-hand side of the dependency;
• hold for all time;
• are nontrivial.
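The inference rules for FDs are usually applied through the attribute-closure algorithm: X+ is the set of attributes functionally determined by X, and X → Y is implied by a set F of FDs exactly when Y ⊆ X+. A sketch with illustrative FDs:

```python
def closure(X, fds):
    """Attribute closure X+ of attribute set X under FDs given as (lhs, rhs) pairs."""
    result = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, add the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Illustrative FDs on R(A, B, C, D): A -> B and B -> C.
fds = [("A", "B"), ("B", "C")]
# A+ = {A, B, C}: A -> C follows by transitivity, but A is not a key (D is missing).
```

The same routine underpins checks for equivalence and minimality of FD sets.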
A functional dependency A → B is a partial dependency if some attribute can be removed from A and the dependency still holds.
2NF
Second normal form (2NF): a relation that is in first normal form and in which every non-key attribute is fully functionally dependent on the key.
The normalization of 1NF relations to 2NF involves the removal of partial dependencies. If a partial dependency exists, we remove the functionally dependent attributes from the relation by placing them in a new relation along with a copy of their determinant.
Obtaining 2NF
• If a nonprime attribute is dependent only on a proper part of a key, then we take the given attribute as well as the key attributes that determine it and move them all to a new relation.
• We can bundle all attributes determined by the same subset of the key as a unit.
Transitive dependency:
A functional dependency X → Z is transitive if X → Y and Y → Z hold for some attribute set Y that is not a key. A relation is in third normal form if it is in first and second normal form and no non-primary-key attribute is transitively dependent on the primary key.
The normalization of 2NF relations to 3NF involves the removal of transitive dependencies by placing the attribute(s) in a new relation along with a copy of the determinant.
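The removal of a transitive dependency can be sketched with assumed attributes: in EMP_DEPT(eno, ename, dno, dname), eno → dno and dno → dname make dname transitively dependent on the key, so dname moves out with its determinant dno:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 3NF decomposition: dname now lives only with its determinant dno.
conn.execute("CREATE TABLE emp (eno INTEGER PRIMARY KEY, ename TEXT, dno INTEGER)")
conn.execute("CREATE TABLE dept (dno INTEGER PRIMARY KEY, dname TEXT)")

conn.execute("INSERT INTO dept VALUES (4, 'Research')")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [(1, 'Asha', 4), (2, 'Ravi', 4)])

# The original EMP_DEPT is recoverable by a join (the decomposition is lossless).
joined = conn.execute("""SELECT e.eno, e.ename, e.dno, d.dname
                         FROM emp e JOIN dept d ON e.dno = d.dno
                         ORDER BY e.eno""").fetchall()
```

The department name is now stored once per department rather than once per employee.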
3NF (general definition):
A relation schema R is in 3NF if, whenever a nontrivial functional dependency X → A holds in R, either:
• X is a superkey of R, or
• A is a key (prime) attribute of R.
Obtaining 3NF
Split off the attributes in the FD that causes trouble and move them, so that there are two relations for each such FD.
Fig: The normalization process. (a) Normalizing EMP_PROJ into 2NF relations. (b) Normalizing EMP_DEPT into 3NF relations.
whereas BCNF insists that, for a dependency X → A to remain in a relation, X must be a superkey.
Fig: Boyce-Codd Normal form. (a) BCNF normalization with the dependency of FD2 being
“lost” in the decomposition.(b) A relation R in 3NF but not in BCNF
BCNF:
A relation schema R is in BCNF if, whenever a nontrivial functional dependency X → A holds in R, X is a superkey of R.
Obtaining BCNF
As usual, split the schema to move the attributes of the troublesome FD to another relation.
Decomposition:
A decomposition D = {R1, R2, ..., Rm} of R is dependency-preserving with respect to a set F of FDs if the union of the projections of F onto each Ri is equivalent to F; that is,
((πR1(F)) ∪ ... ∪ (πRm(F)))+ = F+
A multivalued dependency (MVD) A →→ B specified on relation schema R is trivial if either:
• B is a subset of A, or
• A ∪ B = R.
An MVD is nontrivial if neither of the above two conditions is satisfied.
Definition:
A join dependency (JD), denoted by JD(R1, R2, ..., Rn), specified on relation schema R,
specifies a constraint on the states r of R.
The constraint states that every legal state r of R should have a non-additive (lossless) join decomposition into R1, R2, ..., Rn; that is, for every such r we have
*(πR1(r), πR2(r), ..., πRn(r)) = r, where * denotes the natural join.
Note: an MVD is a special case of a JD where n = 2.
A join dependency JD(R1, R2, ..., Rn), specified on relation schema R, is a trivial JD if
one of the relation schemas Ri in JD(R1, R2, ..., Rn) is equal to R.
Definition:
A relation schema R is in fifth normal form (5NF) (or Project-Join Normal Form
(PJNF)) with respect to a set F of functional, multivalued, and join dependencies if,
for every nontrivial join dependency JD(R1, R2, ..., Rn) in F+ (that is, implied by
F),
every Ri is a superkey of R.
Normalization
A technique for producing a set of relations with desirable properties, given the data
requirements of an enterprise
UNF: a table that contains one or more repeating groups.
1NF: a relation in which the intersection of each row and column contains one and only one value.
2NF: a relation that is in 1NF and in which every non-primary-key attribute is fully functionally dependent on the primary key.
3NF: a relation that is in 1NF and 2NF and in which no non-primary-key attribute is transitively dependent on the primary key.
BCNF: a relation in which every determinant is a candidate key.
4NF: a relation that is in BCNF and contains no nontrivial multivalued dependencies.
UNIT -2
Query Processing: General strategies for query processing, transformations, expected size, statistics
in estimation, query improvement, view processing, query processor
Structure
1) Objectives
2) Introduction
3) Query Processing Problem
4) Objectives of Query Processing
5) Characterization of Query Processors
6) Layers of Query Processing
7) Query Decomposition
8) Data Localization
9) Global Query Optimization
10) Local Query Optimization
Objectives: In this unit we present an overview of query processing in Distributed Database Management Systems (DDBMSs). This is explained with the help of relational calculus and relational algebra because of their generality and wide use in DDBMSs. We discuss the DBMS module called the Query Processor, which relieves the user of query optimization, a time-consuming task that is best handled by the query processor itself.
This issue is considerably important in both centralized and distributed processing systems. However, the query processing problem is much more difficult in distributed environments than in conventional systems. In particular, the relations involved in distributed queries may be fragmented and/or replicated, thereby inducing communication overhead costs.
So, in this unit let us discuss the different issues of query processing, about an ideal query
processor for distributed environment and finally, a layered software approach for distributed
query processing.
The main duty of a relational query processor is to transform a high-level query (in relational
calculus), into an equivalent lower level query (in relational algebra). The distributed database is
of major importance for query processing since the definition of fragments is based on the
objective of increasing reference locality, and sometimes parallel execution, for the most
important queries. The role of a distributed query processor is to map a high level query on a
distributed database (a set of global relations) into a sequence of database operations (of
relational algebra) on relational fragments. Several important functions characterize this
mapping:
The calculus query must be decomposed into a sequence of relational operations called
an algebraic query
The data accessed by the query must be localized so that the operations on relations are
translated to bear on local data (fragments)
The algebraic query on fragments must be extended with communication operations and
optimized with respect to a cost function to be minimized. This cost function refers to
computing resources such as disk I/Os, CPUs, and communication networks.
The low-level query actually implements the execution strategy for the query. The
transformation must achieve both correctness and efficiency. The well-defined mapping with the
above said functional characteristics makes the correctness issue easy. But producing an efficient
execution strategy is more complex. A relational calculus query may have many equivalent and
correct transformations into relational algebra. Since each equivalent execution strategy can lead
to different consumptions of computer resources, the main problem is to select the execution
strategy that minimizes the resource consumption.
Example: We consider the following subset of engineering database scheme given in fig.6.0: E
(ENO, ENAME, TITLE) G (ENO, JNO, RESP, DUR) and the simple user query: “ Find the
names of employees who are managing a project”.
E                              G
ENO  ENAME  TITLE              ENO  JNO  RESP        DUR
E4   D      Programmer         E3   J3   Consultant  10
                               E7   J3   Engineer    36
                               E8   J3   Manager     40

Two equivalent relational algebra queries that answer this request are:

ΠENAME(σRESP="Manager" ∧ E.ENO=G.ENO(E × G))

and

ΠENAME(E ⋈ENO (σRESP="Manager"(G)))
NOTE: The following observations are made from the above example:
It can be observed that the second query avoids the Cartesian product (CP) of E and G,
consumes much less computing resource than the first and thus should be retained. That
is, we have to avoid performing Cartesian product operation on a full table.
In a centralized environment, the role of the query processor is to choose the best
relational algebra query for a given query among all equivalent ones.
In a distributed environment, relational algebra is not enough to express execution
strategies. It must be supplemented with operations for exchanging data between sites. The
distributed query processor has to select the best sites at which to process the data and
decide how the data should be transferred, including the ordering of the relations involved.
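The gain from avoiding the Cartesian product can be seen by counting intermediate tuples. Below is a minimal Python sketch (the relation contents are illustrative, not the exact figures of the text) comparing a plan that filters over the full product E CP G against a plan that selects the managers in G first and then joins on ENO:

```python
# Sample instances, loosely following E(ENO, ENAME, TITLE) and
# G(ENO, JNO, RESP, DUR) from the example above.
E = [("E3", "A", "Consultant"), ("E7", "B", "Engineer"), ("E8", "C", "Manager")]
G = [("E3", "J3", "Consultant", 10), ("E7", "J3", "Engineer", 36),
     ("E8", "J3", "Manager", 40)]

# Plan 1: select and project over the full Cartesian product E x G.
product = [(e, g) for e in E for g in G]       # |E| * |G| intermediate tuples
plan1 = [e[1] for (e, g) in product if e[0] == g[0] and g[2] == "Manager"]

# Plan 2: restrict G first, then join on ENO -- far fewer intermediate tuples.
managers = [g for g in G if g[2] == "Manager"]
plan2 = [e[1] for e in E for g in managers if e[0] == g[0]]

assert plan1 == plan2                          # both plans give the same answer
```

With 3 tuples per relation the product plan materializes 9 intermediate tuples while the selective plan works with a single one; the ratio grows with relation size.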
Example: This example illustrates the importance of site selection and communication for a
chosen relational algebra query against a fragmented database. We consider the query of the
previous example, posed on its relations E and G, and assume that E and G are both
horizontally fragmented in the same way on the join attribute ENO, into fragments E1, E2
and G1, G2.
Fragments G1, G2, E1 and E2 are stored at sites 1, 2, 3 and 4, respectively, and the result is
expected at site 5, as shown in fig 6.1. For simplicity, we have ignored the project
operation here. Two equivalent strategies for the above query are shown in the figure.
An arrow from site i to site j labeled with R indicates that relation R is transferred from
site i to site j.
Strategy A exploits the fact that relations E and G are fragmented in the same way in
order to perform the select and join operations in parallel.
Strategy B centralizes all the operations and the data at the result site before processing
the query.
Resource consumption of these two strategies is compared under the following assumptions:
A tuple access, denoted tupacc, costs 1 unit.
A tuple transfer, denoted tuptrans, costs 10 units.
Relations E and G have 400 and 1000 tuples, respectively.
There are 20 managers in relation G.
The data are uniformly distributed among the sites.
Relations E and G are locally clustered on attributes ENO and RESP, respectively.
There is direct access to the tuples of G (respectively, E) based on the value of attribute
RESP (respectively, ENO).
Strategy A is better by a factor of 37, which is quite significant. It also provides a better
distribution of work among the sites. The difference would be even greater with slower
communication and/or a higher degree of fragmentation.
(a) Strategy A:
Site 1: G'1 = SL RESP='Manager' (G1); send G'1 to site 3
Site 2: G'2 = SL RESP='Manager' (G2); send G'2 to site 4
Site 3: E'1 = E1 JN ENO G'1; send E'1 to site 5
Site 4: E'2 = E2 JN ENO G'2; send E'2 to site 5
Site 5: Result = E'1 UN E'2
(b) Strategy B:
Sites 1-4 send G1, G2, E1 and E2 to site 5, which computes
Result = (E1 UN E2) JN ENO (SL RESP='Manager' (G1 UN G2))
Fig. : Equivalent Distributed Execution Strategies
In a distributed system, the communication cost largely dominates the local
processing cost, so the other cost factors are often ignored.
In centralized systems, only CPU and I/O costs have to be considered.
It is difficult to state precisely the characteristics that differentiate centralized from
distributed query processors, but some are listed here. The first four are common to both,
and the next four are particular to distributed query processors.
Languages: The input language to the query processor can be based on relational calculus
or relational algebra. The former requires an additional phase to decompose a query
expressed in relational calculus into relational algebra. In a distributed context, the output
language is generally some form of relational algebra augmented with communication
primitives. The processor must therefore perform an exact mapping from the input
language to the output language.
Types of optimization: Conceptually, query optimization means choosing the point of the
solution space that leads to the minimum cost. One approach is exhaustive search; because
its cost can be prohibitive, heuristic techniques are commonly used instead. In both
centralized and distributed systems, a common heuristic is to minimize the size of intermediate
relations. This can be done by performing unary operations first and ordering the binary
operations by the increasing size of their intermediate relations.
Optimization Timing: A query may be optimized at different times relative to the actual
time of query execution. Optimization can be done statically before executing the query
or dynamically as the query is executed. The main advantage of the latter method is that
the actual sizes of the intermediate relations are available to the query processor, thereby
reducing the probability of a bad choice. Its main drawback is that query optimization,
an expensive task, must be repeated for every execution of the query. Hybrid
optimization may therefore be better in some situations.
Statistics: The effectiveness of the query optimization is based on statistics on the database.
Dynamic query optimization requires statistics in order to choose the operation that has
to be done first. Static query optimization requires statistics to estimate the size of
intermediate relations. The accuracy of the statistics can be improved by periodical
updating.
Decision sites: Most of the systems use centralized decision approach, in which a single
site generates the strategy. However, the decision process could be distributed among
various sites participating in the elaboration of the best strategy. The centralized
approach is simpler but requires the knowledge of the complete distributed database
where as the distributed approach requires only local information. Hybrid approach is
better where the major decisions are taken at one particular site and other decisions are
taken locally.
Exploitation of the Network Topology: The distributed query processor exploits the network
topology. With wide area networks, the cost function to be minimized can be restricted
to the data communication cost, which is the dominant factor. This simplifies
distributed query optimization, which can then be dealt with as two separate problems: selection
of the global execution strategy, based on inter-site communication, and selection of
each local execution strategy, based on centralized query-processing algorithms. With
local area networks, communication costs are comparable to I/O costs. Therefore, it is
reasonable for the distributed query processor to increase parallel execution, even at the
cost of increased communication.
Exploitation of Replicated fragments: For reliability purposes it is useful to have fragments
replicated at different sites. Query processors have to exploit this information either
statically or dynamically for processing the query efficiently.
Use of semi- joins: The semi-join operation reduces the size of the data that are exchanged
between the sites so that the communication cost can be reduced.
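The semi-join idea can be sketched as follows: instead of shipping a whole relation to the site holding its join partner, ship only the relevant join-attribute values, reduce the relation with them, and ship the (usually much smaller) reduced relation. A minimal Python sketch with illustrative data:

```python
# Site A holds G, site B holds E; we want E joined with the managers of G.
G = [("E3", "J3", "Consultant", 10), ("E7", "J3", "Engineer", 36),
     ("E8", "J3", "Manager", 40)]
E = [("E3", "A", "Consultant"), ("E7", "B", "Engineer"), ("E8", "C", "Manager")]

# Step 1: ship only the join-attribute values (ENO) of the managers in G.
join_values = {g[0] for g in G if g[2] == "Manager"}

# Step 2: E semi-join G -- keep only E tuples that will participate in the join.
E_reduced = [e for e in E if e[0] in join_values]

# Only E_reduced (not all of E) now crosses the network for the final join.
assert len(E_reduced) <= len(E)
```

Shipping a set of ENO values plus the reduced relation is typically far cheaper than shipping the full operand, which is exactly the communication saving described above.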
The problem of query processing can itself be decomposed into several subproblems,
corresponding to various layers. In figure 6.2, a generic layering scheme for query processing is
shown, where each layer solves a well-defined subproblem. The input is a query on distributed
data expressed in relational calculus. This distributed query is posed on global (distributed)
relations, meaning that data distribution is hidden. Four main layers are involved in mapping the
distributed query into an optimized sequence of local operations, each acting on a local database.
These layers perform the functions of query decomposition, data localization, global query
optimization, and local query optimization. The first three layers are performed at a central site
using global information; the fourth is performed at the local sites.
Fig. 6.2: Generic layering scheme for distributed query processing. Query decomposition
(using the global schema) maps the calculus query on distributed relations into an algebraic
query on global relations; data localization (using the fragment schema) turns it into a
fragment query; global query optimization (using statistics on fragments) produces an
optimized fragment query with communication operations; and local optimization at each
site (using the local schemas) yields the optimized local queries.
Query Decomposition: The first layer decomposes the distributed calculus query into an
algebraic query on global relations. The information needed for this transformation is found in
the global conceptual schema describing the global relations. However, the information about
data distribution is not used here but in the next layer. Thus the techniques used by this layer are
those of a centralized DBMS.
Data Localization:
The input to the second layer is an algebraic query on distributed relations. The main role of the
second layer is to localize the query’s data using data distribution information. Relations are
fragmented and stored in disjoint subsets called fragments, each being stored at a different site.
This layer determines which fragments are involved in the query and transforms the distributed
query into a fragment query. Fragmentation is defined through fragmentations rules that can be
expressed as relational operations. A distributed relation can be reconstructed by applying the
fragmentation rules, and then deriving a program, called a localization program, of relational
algebra operations, which then act on fragments.
Generating a fragment query is done in two steps.
The distributed query is mapped into a fragment query by substituting each distributed
relation by its reconstruction program (also called a materialization program).
The fragment query is simplified and restructured to produce another "good" query.
Simplification and restructuring may be done according to the same rules used in the
decomposition layer. As in the decomposition layer, the final fragment query is generally
far from optimal because information regarding fragments is not yet utilized.
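The two localization steps can be sketched in a few lines of Python. The fragment predicates below are hypothetical, chosen only to show how a branch whose qualification contradicts the query's selection is eliminated:

```python
# A distributed relation G is the union of its horizontal fragments, so a
# selection on G is rewritten as the union of that selection on each fragment.
G1 = [("E3", "Consultant"), ("E7", "Engineer")]   # fragment predicate: RESP <> 'Manager'
G2 = [("E8", "Manager")]                          # fragment predicate: RESP =  'Manager'

def select_managers(fragment):
    return [t for t in fragment if t[1] == "Manager"]

# Naive localized query: push the selection into every fragment, then union.
naive = select_managers(G1) + select_managers(G2)

# Reduced query: G1's fragment predicate contradicts RESP = 'Manager', so
# that branch can be dropped before execution ever touches site 1.
reduced = select_managers(G2)

assert naive == reduced
```

The reduction step is where fragment qualifications pay off: entire sub-queries vanish without being executed or shipped.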
Global Query Optimization:
The input to the third layer is a fragment query, that is, an algebraic query on fragments. The
goal of query optimization is to find an execution strategy for the query, which is close to
optimal. An execution strategy for a distributed query can be described with relational algebra
operations and communication primitives (send/receive operations) for transferring data between
sites. The previous layers have already optimized the query for example, by eliminating
redundant expressions. However, this optimization is independent of fragments characteristics
such as cardinalities. In addition, communication operations are not yet specified. By permuting
the ordering of operations within one fragment query, many equivalent queries may be found.
Query optimization consists of finding the "best" ordering of operations in the fragment
query, including communication operations, so as to minimize a cost function. The cost function,
often defined in terms of time units, refers to computing resources such as disk space, disk I/Os,
buffer space, CPU cost, communication cost and so on. An important aspect of query
optimization is join ordering, since permutations of the joins within the query may lead to
improvements of orders of magnitude. One basic technique for optimizing a sequence of
distributed join operations is the semi-join operator. The main value of the semi-join in a
distributed system is that it reduces the size of the join operands and hence the communication cost.
The output of the query optimization layer is an optimized algebraic query on fragments, with
communication operations included.
Local Query Optimization:
The last layer is performed by all the sites having fragments involved in the query. Each sub-query
executing at one site, called a local query, is then optimized using the local schema of the site. At
this time, the algorithms to perform the relational operations may be chosen. Local optimization
uses the algorithms of centralized systems.
UNIT-3 & 4
Recovery: Reliability, transactions, recovery in centralized DBMS, reflecting updates, buffer
management, logging schemes, disaster recovery
1) Concept of Transaction
2) ACID properties
3) Serializability
4) Locks – implementation
What is a Transaction?
A transaction is a logical unit of work.
It may consist of a simple SELECT that generates a list of table contents, or a series of
related UPDATE commands.
A database request is the equivalent of a single SQL statement in an application program or
transaction.
UPDATE PRODUCT
SET PROD_QOH = PROD_QOH - 100
WHERE PROD_CODE = ‘X’;
UPDATE ACCT_RECEIVABLE
SET ACCT_BALANCE = ACCT_BALANCE + 500
WHERE ACCT_NUM = ‘Y’;
- If both UPDATEs are not completely executed, the transaction yields an inconsistent
database.
- The database is in a consistent state only if the transaction is fully completed.
- The DBMS doesn't guarantee that a transaction correctly represents a real-world event,
but it must be able to recover the database to a previous consistent state. (For instance,
the accountant may input a wrong amount.)
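The rollback-to-a-consistent-state behavior can be demonstrated with SQLite (a sketch, not the text's DBMS; table contents are assumed starting values). If a failure occurs between the two UPDATEs, rolling back restores the state that existed before the transaction started:

```python
import sqlite3

# Set up the two tables from the example with assumed starting values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE PRODUCT (PROD_CODE TEXT, PROD_QOH INTEGER)")
con.execute("CREATE TABLE ACCT_RECEIVABLE (ACCT_NUM TEXT, ACCT_BALANCE INTEGER)")
con.execute("INSERT INTO PRODUCT VALUES ('X', 500)")
con.execute("INSERT INTO ACCT_RECEIVABLE VALUES ('Y', 1000)")
con.commit()

try:
    con.execute("UPDATE PRODUCT SET PROD_QOH = PROD_QOH - 100 "
                "WHERE PROD_CODE = 'X'")
    raise RuntimeError("simulated failure before the second UPDATE")
    con.execute("UPDATE ACCT_RECEIVABLE SET ACCT_BALANCE = ACCT_BALANCE + 500 "
                "WHERE ACCT_NUM = 'Y'")       # never reached
    con.commit()
except RuntimeError:
    con.rollback()                            # undo the partial work

qoh = con.execute("SELECT PROD_QOH FROM PRODUCT").fetchone()[0]
assert qoh == 500                             # the first UPDATE was rolled back
```

Without the rollback, PROD_QOH would hold 400 while ACCT_BALANCE was never increased, i.e. exactly the inconsistent state described above.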
Transaction Properties
All transactions must display atomicity, durability, serializability, and isolation.
Atomicity –
All transaction operations must be completed
Incomplete transactions aborted
Durability –
Permanence of consistent database state
Serializability –
Conducts transactions in serial order
Important in multi-user and distributed databases
Isolation –
Data used by one transaction cannot be used by another transaction until the first
transaction is complete
Consistency – (To preserve integrity of data, the database system must ensure: atomicity,
consistency, isolation, and durability (ACID).)
Execution of a transaction in isolation preserves the consistency of the database.
A single-user database system automatically ensures serializability and isolation of the
database because only one transaction is executed at a time.
The atomicity and durability of transactions must be guaranteed by the single-user DBMS.
The multi-user DBMS must implement controls to ensure serializability and isolation of
transactions – in addition to atomicity and durability – in order to guard the database’s
consistency and integrity.
Transaction State
Active, the initial state; the transaction
stays in this state while it is executing.
Partially committed, after the final
statement has been executed.
Failed, after the discovery that normal
execution can no longer proceed.
Aborted, after the transaction has been
rolled back and the database restored to
its state prior to the start of the transaction. Two options after it has been aborted:
Restart the transaction – only if no internal logical error but hardware or software
failure.
Kill the transaction – once internal logical error occurs like incorrect data input.
Committed, after successful completion. The transaction is terminated once it is aborted or
committed.
UPDATE PRODUCT
UPDATE ACCT_RECEIVABLE
COMMIT;
In fact, the COMMIT statement used in this example is not necessary if the UPDATE
statement is the application's last action and the application terminates normally.
Transaction Log
The DBMS uses a transaction log to keep track of all transactions that update the database.
The log may be used by the ROLLBACK command to undo a transaction.
It may also be used to recover from a system failure such as a network error or disk crash.
While DBMS executes transactions that modify the database, it also updates the
transaction log. The log stores:
Record for beginning of transaction
Each SQL statement
- The type of operation being performed (update, delete, insert).
- The names of objects affected by the transaction (the name of the table).
- The “before” and “after” values for updated fields
- Pointers to previous and next entries for the same transaction.
Commit Statement – the ending of the transaction.
Table 1 Transaction Log Example
1. If a system failure occurs, the DBMS will examine the transaction log for all uncommitted or
incomplete transactions, and it will restore (ROLLBACK) the database to its previous
consistent state.
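A minimal Python sketch of log-driven rollback (the record layout is an assumption for illustration, not the text's exact log format): each log record stores the affected row and its "before" and "after" values, and an uncommitted transaction is undone by reapplying the "before" values in reverse order:

```python
# Current database state: T1's uncommitted update is already on disk.
db = {("PRODUCT", "X"): 135}

log = [
    {"trx": "T1", "op": "BEGIN"},
    {"trx": "T1", "op": "UPDATE", "key": ("PRODUCT", "X"),
     "before": 35, "after": 135},
    # no COMMIT record: T1 was still active at the failure
]

def rollback(trx, db, log):
    # Walk the log newest-to-oldest, restoring "before" values for trx.
    for rec in reversed(log):
        if rec["trx"] == trx and rec["op"] == "UPDATE":
            db[rec["key"]] = rec["before"]

rollback("T1", db, log)
assert db[("PRODUCT", "X")] == 35    # pre-transaction value restored
```

This is the core of what the recovery manager does for every transaction that has a BEGIN but no COMMIT in the log.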
Concurrency Control
- Coordinates simultaneous transaction execution in multiprocessing database
- Ensure serializability of transactions in multiuser database environment
- Potential problems in multiuser environments
- Three main problems: lost updates, uncommitted data, and inconsistent retrievals
Lost updates
Assume that two concurrent transactions (T1, T2) occur in a PRODUCT table which records
a product’s quantity on hand (PROD_QOH). The transactions are:
Transaction Computation
T1: Purchase 100 units PROD_QOH = PROD_QOH + 100
T2: Sell 30 units PROD_QOH = PROD_QOH - 30
Table 2 Normal Execution of Two Transactions
Note: this table shows the serial execution of these transactions under normal circumstances,
yielding the correct answer, PROD_QOH=105.
1. Suppose that a transaction is able to read a product’s PROD_QOH value from the table
before a previous transaction has been committed.
2. The first transaction (T1) has not yet been committed when the second transaction (T2) is
executed.
3. T2 still operates on the value 35, and its subtraction yields 5 in memory.
4. T1 writes the value 135 to disk, which is promptly overwritten by T2's write of 5, so
the addition of 100 units is lost.
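The problematic interleaving can be replayed deterministically. This sketch follows the steps above: both transactions read PROD_QOH = 35 before either write lands:

```python
qoh = 35                   # initial PROD_QOH
t1_read = qoh              # T1 reads 35
t2_read = qoh              # T2 reads 35 (before T1 commits)
qoh = t1_read + 100        # T1 writes 135
qoh = t2_read - 30         # T2 writes 5, overwriting T1's update
assert qoh == 5            # T1's purchase of 100 units is lost (correct: 105)
```

Under serial execution T2 would have read 135 and written the correct 105; the lost update exists only because T2's read happened before T1's write was committed.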
Uncommitted Data
When two transactions, T1 and T2, are executed concurrently and the first transaction (T1) is
rolled back after the second transaction (T2) has already accessed the uncommitted data –
thus violating the isolation property of transactions. The transactions are:
Transaction Computation
T1: Purchase 100 units PROD_QOH = PROD_QOH + 100 (Rollback)
T2: Sell 30 units PROD_QOH = PROD_QOH - 30
Note: the serial execution of these transactions yields the correct answer.
Note: the uncommitted data problem can arise when the ROLLBACK is completed after T2 has
begun its execution.
Inconsistent Retrievals
When a transaction calculates some summary (aggregate) functions over a set of data while
other transactions are updating the data.
The transaction might read some data before they are changed and other data after they are
changed, thereby yielding inconsistent results.
1. T1 calculates the total quantity on hand of the products stored in the PRODUCT table.
2. T2 updates PROD_QOH for two of the PRODUCT table’s products.
Table 6 Retrieval During Update
Note: While T1 calculates the total PROD_QOH, T2 represents the correction of a typing
error: the user added 30 units to product '345TYX''s PROD_QOH but meant to add them to
product '125TYZ''s PROD_QOH. To correct the problem, the user subtracts 30 from product
'345TYX''s PROD_QOH and adds 30 to product '125TYZ''s PROD_QOH.
Note: The initial and final PROD_QOH values while T2 makes the correction – same results but
different transaction process.
Note:
The transaction table in Table 8 demonstrates that inconsistent retrievals are possible during
the transaction execution, making the result of T1’s execution incorrect.
Unless the DBMS exercises concurrency control, a multi-user database environment can
create chaos within the information system.
Note: the table below show the possible conflict scenarios if two transactions, T1 and T2, are
executed concurrently over the same data.
Serializability – A (possibly concurrent) schedule is serializable if it is equivalent to a serial
schedule. Different forms of schedule equivalence give rise
to the notions of:
1. conflict serializability
2. view serializability
Conflict Serializability: Instructions li and lj of
transactions Ti and Tj respectively, conflict if and only if
there exists some item Q accessed by both li and lj, and at
least one of these instructions wrote Q.
1. Ii = read(Q), Ij = read(Q). Ii and Ij don’t conflict.
2. Ii = read(Q), Ij = write(Q). They conflict.
3. Ii = write(Q), Ij = read(Q). They conflict
4. Ii = write(Q), Ij = write(Q). They conflict
If a schedule S can be transformed into a schedule S’ by a series of swaps of non-
conflicting instructions, we say that S and S’ are conflict equivalent.
We say that a schedule S is conflict serializable if it is conflict equivalent to a serial
schedule.
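Conflict serializability can be tested mechanically with a precedence graph: add an edge Ti → Tj whenever an operation of Ti conflicts with a later operation of Tj; the schedule is conflict serializable iff the graph is acyclic. A sketch (the schedule encoding as (transaction, action, item) triples is an assumption for illustration):

```python
def conflict_serializable(schedule):
    # Build precedence edges: Ti -> Tj if an op of Ti conflicts with a
    # later op of Tj (same item, at least one write, different transactions).
    edges = set()
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            if ti != tj and qi == qj and "write" in (ai, aj):
                edges.add((ti, tj))
    # The schedule is conflict serializable iff the graph has no cycle.
    nodes = {t for t, _, _ in schedule}
    def cyclic(n, seen):
        return any(m in seen or cyclic(m, seen | {m})
                   for x, m in edges if x == n)
    return not any(cyclic(n, {n}) for n in nodes)

serial_like = [("T1", "read", "Q"), ("T1", "write", "Q"),
               ("T2", "read", "Q"), ("T2", "write", "Q")]
lost_update = [("T1", "read", "Q"), ("T2", "read", "Q"),
               ("T1", "write", "Q"), ("T2", "write", "Q")]
assert conflict_serializable(serial_like)
assert not conflict_serializable(lost_update)
```

Note that the second schedule is exactly the lost-update interleaving from the concurrency-control section: its precedence graph contains both T1 → T2 and T2 → T1, hence a cycle.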
View Serializability: Let S and S´ be two schedules with the same set of transactions. S
and S´ are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then
transaction Ti must, in schedule S’, also read the initial value of Q.
2. For each data item Q if transaction Ti executes read(Q) in schedule S, and that value
was produced by transaction Tj (if any), then transaction Ti must in schedule S´ also
read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that
performs the final write(Q) operation in
schedule S must perform the final write(Q)
operation in schedule S’.
As can be seen, view equivalence is also based purely on reads and writes alone.
A schedule S is view serializable if it is view equivalent to a serial schedule.
Lock Granularity
Lock granularity indicates level of lock use: database, table, page, row, or field (attribute).
Database-Level
The entire database is locked.
Transaction T2 is prevented from using any table in the database while T1 is being executed.
Good for batch processes, but unsuitable for online multi-user DBMSs.
Refer Figure 2, transactions T1 and T2 cannot access the same database concurrently, even if
they use different tables. (The access is very slow!)
Table-Level
The entire table is locked. If a transaction requires access to several tables, each table may
be locked.
Transaction T2 is prevented from using any row in the table while T1 is being executed.
Two transactions can access the same database as long as they access different tables.
It causes traffic jams when many transactions are waiting to access the same table.
Table-level locks are not suitable for multi-user DBMSs.
Refer Figure 3, transaction T1 and T2 cannot access the same table even if they try to use
different rows; T2 must wait until T1 unlocks the table.
Page-Level
The DBMS will lock an entire disk page (or page), the equivalent of a disk block, i.e., a
referenced section of a disk.
A page has a fixed size; a table can span several pages, and a page can contain several
rows of one or more tables.
Page-level locking is currently the most frequently used multi-user DBMS locking method.
Figure 4 shows that T1 and T2 can access the same table while locking different disk pages.
If T2 needs a row on a page that T1 has locked, T2 must wait until T1 releases the page.
Row-Level
Less restrictive than the previous levels, row-level locking allows concurrent transactions
to access different rows of the same table even if the rows are located on the same page.
It improves the availability of data, but incurs a high overhead cost for lock management.
Refer Figure 5 for row-level lock.
Field-Level
It allows concurrent transactions to access the same row, as long as they use different
fields (attributes) within that row.
It yields the most flexible multi-user data access, but incurs an extremely high level of
computer overhead.
Lock Types
The DBMS may use different lock types: binary or shared/exclusive locks.
A locking protocol is a set of rules followed by all transactions while requesting and
releasing locks. Locking protocols restrict the set of possible schedules.
Binary Locks
Two states: locked (1) or unlocked (0).
Locked objects are unavailable to other transactions.
Unlocked objects are open to any transaction.
Transaction unlocks object when complete.
Every transaction requires a lock and unlock operation for each data item that is accessed.
Note: the lock and unlock features eliminate the lost update problem encountered in table 3.
However, binary locks are now considered too restrictive to yield optimal concurrency
conditions.
Shared/Exclusive Locks
Shared (S Mode)
Exists when concurrent transactions granted READ access
Produces no conflict for read-only transactions
Issued when transaction wants to read and exclusive lock not held on item
Exclusive (X Mode)
Exists when access reserved for locking transaction
Used when potential for conflict exists (also refer Table 9)
Issued when transaction wants to update unlocked data
Lock-compatibility matrix
A transaction may be granted a lock on an item if the
requested lock is compatible with locks already held on
the item by other transactions
Any number of transactions can hold shared locks on an
item, but if any transaction holds an exclusive lock on the item, no other transaction may
hold any lock on it.
If a lock cannot be granted, the requesting transaction is made to wait until all incompatible
locks held by other transactions have been released. The lock is then granted.
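The compatibility matrix described above can be sketched directly: a requested mode is granted only if it is compatible with every mode already held on the item by other transactions.

```python
# Shared/exclusive lock-compatibility matrix: (held, requested) -> compatible?
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested, held_modes):
    # Grant only if the requested mode is compatible with ALL held modes.
    return all(COMPATIBLE[(held, requested)] for held in held_modes)

assert can_grant("S", ["S", "S"])      # any number of shared locks coexist
assert not can_grant("X", ["S"])       # exclusive must wait for shared holders
assert not can_grant("S", ["X"])       # shared must wait for an exclusive holder
assert can_grant("X", [])              # no holders: any mode can be granted
```

This check is the core of the lock manager's grant decision; the queueing of incompatible requests is built on top of it.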
Reasons for increased lock manager overhead:
The type of lock held must be known before a lock can be granted
Three lock operations exist: READ_LOCK (to check the type of lock), WRITE_LOCK
(to issue the lock), and UNLOCK (to release the lock).
The schema has been enhanced to allow a lock upgrade (from shared to exclusive) and a
lock downgrade (from exclusive to shared).
Problems with Locking
Transaction schedule may not be serializable
Managed through two-phase locking
Schedule may create deadlocks
Managed by using deadlock detection and prevention techniques
Two-Phase Locking
- Two-phase locking defines how transactions acquire and release locks.
1. Growing phase – the transaction acquires all the required locks without unlocking any
data. Once all locks have been acquired, the transaction is at its locked point.
2. Shrinking phase – the transaction releases all its locks and cannot obtain any new lock.
Governing rules
Two transactions cannot have conflicting locks
No unlock operation can precede a lock operation in the same transaction
No data are affected until all locks are obtained
In the example for two-phase locking protocol (Figure 6), the transaction acquires all the
locks it needs (two locks are required) until it reaches its locked point.
When the locked point is reached, the data are modified to conform to the transaction
requirements.
The transaction is completed when it has released all of the locks it acquired in the first phase.
Updates for two-phase locking protocols:
Two-phase locking does not ensure freedom from deadlocks.
Cascading roll-back is possible under two-phase locking. To avoid this, follow a
modified protocol called strict two-phase locking. Here a transaction must hold all its
exclusive locks till it commits/aborts.
Rigorous two-phase locking is even stricter: here all locks are held till commit/abort. In
this protocol transactions can be serialized in the order in which they commit.
There can be conflict serializable schedules that cannot be obtained if two-phase locking
is used.
However, in the absence of extra information (e.g., ordering of access to data), two-
phase locking is needed for conflict serializability in the following sense:
Given a transaction Ti that does not follow two-phase locking, we can find a transaction
Tj that uses two-phase locking, and a schedule for Ti and Tj that is not conflict
serializable.
Deadlocks
- Occurs when two transactions wait for each other to unlock data. For example:
T1 = access data items X and Y
T2 = access data items Y and X
Deadly embrace – if T1 has not unlocked data item Y, T2 cannot begin; if T2 has not unlocked
data item X, T1 cannot continue. (Refer Table 11.)
Starvation is also possible if concurrency control manager is badly designed.
For example, a transaction may be waiting for an X-lock (exclusive mode) on an item,
while a sequence of other transactions request and are granted an S-lock (shared mode)
on the same item.
The same transaction is repeatedly rolled back due to deadlocks.
Control techniques
Deadlock prevention – a transaction requesting a new lock is aborted if there is the
possibility that a deadlock can occur.
If the transaction is aborted, all the changes made by this transaction are rolled back,
and all locks obtained by the transaction are released.
It works because it avoids the conditions that lead to deadlocking.
Deadlock detection – the DBMS periodically tests the database for deadlocks.
If a deadlock is found, one of the transactions (the “victim”) is aborted (rolled back
and restarted), and the other transaction continues.
Deadlock avoidance – the transaction must obtain all the locks it needs before it can be
executed.
The technique avoids rollback of conflicting transactions by requiring that all locks be
obtained in succession.
However, the serial lock assignment required by deadlock avoidance increases transaction
response times.
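Deadlock detection amounts to finding a cycle in the wait-for graph: an edge Ti → Tj means Ti is waiting for a lock held by Tj. A minimal sketch:

```python
def has_deadlock(wait_for):
    # wait_for: dict mapping a transaction to the transactions it waits on.
    def reachable(start, node, seen):
        for nxt in wait_for.get(node, []):
            if nxt == start or (nxt not in seen
                                and reachable(start, nxt, seen | {nxt})):
                return True
        return False
    # A deadlock exists iff some transaction can reach itself.
    return any(reachable(t, t, {t}) for t in wait_for)

# The deadly embrace from the text: T1 waits for T2's lock on Y,
# and T2 waits for T1's lock on X.
assert has_deadlock({"T1": ["T2"], "T2": ["T1"]})
assert not has_deadlock({"T1": ["T2"], "T2": []})
```

On finding a cycle, the DBMS picks one transaction on it as the victim and rolls it back, as described above.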
Control Choices
If the probability of deadlocks is low, deadlock detection is recommended.
If the probability of deadlocks is high, deadlock prevention is recommended.
If response time is not high on the system priority list, deadlock avoidance might be
employed.
Implementation of Locking
A Lock manager can be implemented as a separate process to which transactions send lock
and unlock requests
The lock manager replies to a lock request by sending a lock grant message (or a message
asking the transaction to roll back, in case of a deadlock)
The requesting transaction waits until its request is answered
The lock manager maintains a data structure called a lock table to record granted locks and
pending requests
The lock table is usually implemented as an in-memory hash table indexed on the name of
the data item being locked
Lock Table
Black rectangles indicate granted locks, white ones
indicate waiting requests
Lock table also records the type of lock granted or
requested
New request is added to the end of the queue of
requests for the data item, and granted if it is
compatible with all earlier locks
Unlock requests result in the request being deleted,
and later requests are checked to see if they can now
be granted
If a transaction aborts, all waiting or granted requests of the transaction are deleted
To implement this efficiently, the lock manager may keep a list of the locks held by each
transaction
Concurrency Control with Time Stamping Methods
Assigns global unique time stamp to each transaction
Produces order for transaction submission
Properties
Uniqueness: ensures that no equal time stamp values can exist.
Monotonicity: ensures that time stamp values always increase.
DBMS executes conflicting operations in time stamp order to ensure serializability of the
transaction.
If two transactions conflict, one of them is often stopped, rolled back, and assigned a new
time stamp value.
Each data item requires two additional time stamp fields:
The last time the item was read
The last time the item was updated
Time stamping tends to demand a lot of system resources because there is a possibility that
many transactions may have to be stopped, rescheduled, and re-stamped.
Timestamp-Based Protocols
Each transaction is issued a timestamp when it enters the system. If an old transaction Ti has
time-stamp TS(Ti), a new transaction Tj is assigned time-stamp TS(Tj) such that TS(Ti) <
TS(Tj).
The protocol manages concurrent execution such that the time-stamps determine the
serializability order.
In order to assure such behavior, the protocol maintains for each data Q two timestamp
values:
W-timestamp(Q) is the largest time-stamp of any transaction that executed write(Q)
successfully.
R-timestamp(Q) is the largest time-stamp of any transaction that executed read(Q)
successfully.
The timestamp ordering protocol ensures that any conflicting read and write operations are
executed in timestamp order.
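The basic timestamp-ordering rules for a single data item Q can be sketched as follows: a read is rejected if a younger transaction has already written Q, and a write is rejected if a younger transaction has already read or written Q (the class layout is an assumption for illustration):

```python
class Item:
    def __init__(self):
        self.r_ts = 0   # R-timestamp(Q): largest TS of a successful read
        self.w_ts = 0   # W-timestamp(Q): largest TS of a successful write

def read(item, ts):
    if ts < item.w_ts:
        return False                    # too late: roll back the reader
    item.r_ts = max(item.r_ts, ts)
    return True

def write(item, ts):
    if ts < item.r_ts or ts < item.w_ts:
        return False                    # too late: roll back the writer
    item.w_ts = ts
    return True

q = Item()
assert read(q, 2) and write(q, 3)       # operations arrive in timestamp order
assert not write(q, 1)                  # older write after TS-3's write: rejected
```

A rejected transaction is rolled back and restarted with a new, larger timestamp, which is the restart behavior described above.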
Transaction Recovery
Four important concepts affect the recovery process:
Write-ahead-log protocol – ensures that transaction logs are always written before any
database data are actually updated.
Redundant transaction logs – ensure that a physical disk failure will not impair the
DBMS's ability to recover data.
Database buffers – temporary storage areas in primary memory used to speed up
disk operations and improve processing time.
Database checkpoints – operations in which the DBMS writes all of its updated
buffers to disk and records the event in the transaction log.
Transaction recovery procedures generally make use of deferred-write and write-through
techniques.
Deferred-write (or Deferred-update)
Changes are written to the transaction log, not physical database.
Database updated after transaction reaches commit point.
Steps:
1. Identify the last checkpoint in the transaction log. This is the last time transaction
data was physically saved to disk.
2. For a transaction that started and committed before the last checkpoint, nothing
needs to be done, because the data are already saved.
3. For a transaction that performed a commit operation after the last checkpoint, the
DBMS uses the transaction log records to redo the transaction and to update the
database, using “after” values in the transaction log. The changes are made in
ascending order, from the oldest to the newest.
4. For any transaction with a RP::BACK operation after the last checkpoint or that
was left active (with neither a COMMIT nor a ROLLBACK) before the failure
occurred, nothing needs to be done because the database was never updated.
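The deferred-write steps can be sketched as a redo-only pass over the log. The sketch below is illustrative only; the log record format (tuples) and the function name are assumptions, not part of any real DBMS.

```python
# Deferred-write recovery: redo committed work after the last checkpoint using
# the "after" values. Rolled-back or still-active transactions need no undo,
# because the physical database was never updated before commit.
def recover_deferred(log, checkpoint_pos, db):
    tail = log[checkpoint_pos:]                  # step 1: start at last checkpoint
    committed = {rec[1] for rec in tail if rec[0] == "COMMIT"}
    for rec in tail:                             # step 3: redo, oldest to newest
        if rec[0] == "WRITE" and rec[1] in committed:
            _, txn, item, after_value = rec
            db[item] = after_value
    return db                                    # steps 2 and 4: nothing to do
```

Log records here are `("WRITE", txn, item, after_value)` or `("COMMIT", txn)`.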
Write-through (or immediate update)
The database is immediately updated by transaction operations during the transaction's
execution, even before the transaction reaches its commit point. The transaction log is also
updated; if a transaction fails, the DBMS uses the log information to ROLLBACK the database.
Steps:
1. Identify the last checkpoint in the transaction log. This is the last time transaction
data was physically saved to disk.
2. For a transaction that started and committed before the last checkpoint, nothing
needs to be done, because the data are already saved.
3. For a transaction that committed after the last checkpoint, the DBMS redoes the
transaction, using “after” values in the transaction log. Changes are applied in
ascending order, from the oldest to the newest.
4. For any transaction with a ROLLBACK operation after the last checkpoint or that
was left active (with neither a COMMIT nor a ROLLBACK) before the failure
occurred, the DBMS uses the transaction log records to ROLLBACK or undo the
operations, using the “before” values in the transaction log. Changes are applied
in reverse order, from the newest to the oldest.
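The write-through steps add an undo phase to the redo phase. As before, this is only a sketch under an assumed log record format, now carrying both "before" and "after" values.

```python
# Write-through recovery: redo committed transactions with "after" values
# (oldest to newest), then undo uncommitted or rolled-back transactions with
# "before" values (newest to oldest).
# Log records: ("WRITE", txn, item, before_value, after_value) or ("COMMIT", txn).
def recover_write_through(log, checkpoint_pos, db):
    tail = log[checkpoint_pos:]
    committed = {r[1] for r in tail if r[0] == "COMMIT"}
    for r in tail:                               # redo phase, ascending order
        if r[0] == "WRITE" and r[1] in committed:
            db[r[2]] = r[4]                      # apply "after" value
    for r in reversed(tail):                     # undo phase, descending order
        if r[0] == "WRITE" and r[1] not in committed:
            db[r[2]] = r[3]                      # restore "before" value
    return db
```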
Important Questions:
The determinant of a functional dependency is the attribute or group of attributes on the left-
hand side of the arrow in the functional dependency. The consequent of a functional dependency
is the attribute or group of attributes on the right-hand side of the arrow.
4. What is Normalization
The process of decomposing unsatisfactory "bad" relations by breaking up their
attributes into smaller relations.
Normalization is a process of analyzing relation schemas so that the following
can be achieved
1. Minimizing redundancy
2. Minimizing insertion, updating, deletion anomalies
5. What is revoke command
Revoke is a DCL (Data Control Language) command used to withdraw privileges that were
granted by the DBA using the GRANT command.
6. What is Transaction
Def 1: Logical unit of database processing that includes one or more access operations (read
-retrieval, write - insert or update, delete).
Transaction boundaries: Begin and End transaction statements.
7. What is Trigger
Triggers are simply stored procedures that are run automatically by the database whenever
some event happens.
Two schedules are said to be view equivalent if the following three conditions hold:
1. The same set of transactions participates in S and S’, and S and S’ include the
same operations of those transactions.
2. For any operation Ri(X) of Ti in S, if the value of X read by the operation has
been written by an operation Wj(X) of Tj (or if it is the original value of X before
the schedule started), the same condition must hold for the value of X read by
operation Ri(X) of Ti in S’.
3. If the operation Wk(Y) of Tk is the last operation to write item Y in S, then
Wk(Y) of Tk must also be the last operation to write item Y in S’.
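The three conditions can be checked mechanically on small schedules. The sketch below is illustrative and simplified (condition 2 is compared as a multiset of reads-from facts); schedules are lists of `(transaction, action, item)` tuples such as `("T1", "R", "X")`.

```python
from collections import Counter

def reads_from(schedule):
    """For each read, record which write it reads from (None = initial value)."""
    last_writer, result = {}, []
    for (txn, action, item) in schedule:
        if action == "R":
            result.append((txn, item, last_writer.get(item)))
        elif action == "W":
            last_writer[item] = txn
    return result

def final_writers(schedule):
    """Last transaction to write each item."""
    return {item: txn for (txn, action, item) in schedule if action == "W"}

def view_equivalent(s1, s2):
    return (sorted(s1) == sorted(s2) and                          # condition 1
            Counter(reads_from(s1)) == Counter(reads_from(s2)) and  # condition 2
            final_writers(s1) == final_writers(s2))               # condition 3
```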
View serializability: A schedule S is view serializable if it is view equivalent to some serial
schedule.
Define Latches
Locks held for a short duration are called latches. Latches are not governed by the concurrency-
control protocol; rather, they are used to guarantee the physical integrity of a page while that
page is being written from the buffer to disk.
Define Granularity
Granularity means size of data item which may be one of the following
1. A database record
2. A field value of a database record
3. A disk block
4. A whole file
5. The whole database
UNIT-5
Object Oriented Data Base Development: Introduction, Object Definition language, creating
object instances, Object query language.
An object-oriented database system must satisfy two criteria: it should be a DBMS, and it should
be an object-oriented system, i.e., to the extent possible, it should be consistent with the current
crop of object-oriented programming languages. The first criterion translates into five features:
persistence, secondary storage management, concurrency, recovery and an ad hoc query facility.
The second one translates into eight features: complex objects, object identity, encapsulation,
types or classes, inheritance, overriding combined with late binding, extensibility and
computational completeness.
Relational databases store data in tables that are two dimensional. The tables have rows and
columns. Relational database tables are "normalized" so data is not repeated more often than
necessary. All table columns depend on a primary key (a unique value in the column) that
identifies each row. Once a specific row is identified, data from one or more of its columns may
be obtained or changed. Breaking complex information out into simple data takes time and is
labor intensive. Code must be written to accomplish this task.
Object oriented databases should be used when there is complex data and/or complex data
relationships. This includes a many to many object relationship. Object databases should not be
used when there would be few join tables and there are large volumes of simple transactional
data.
Advantages of OODBMS:
1. Objects don't require assembly and disassembly, saving coding time and execution time.
2. Multimedia applications
3. Complex data
4. Reduced paging
5. Easier navigation
Disadvantages of OODBMS:
1. OODBMS is based on the object model, which lacks the solid theoretical foundation of
the relational model on which the RDBMS is built.
2. OODBMSs do not provide a standard ad hoc query language, as relational systems do.
3. The lack of compatibility between different OODBMSs makes switching from one piece
of software to another very difficult. With RDBMSs, different products are very similar,
and switching from one to another is relatively easy.
FEATURES OF OODBMS
Mandatory Features of an OODBMS:
The OODBMS features, which include 13 mandatory features and optional characteristics of
OODBMS, are defined as follows:
Rule 1: Complex objects
It must be possible to construct complex objects from existing objects. The simplest objects are
objects such as integers, characters, byte strings of any length, Booleans and floats (one might
add other atomic types).
Examples include sets, lists, and tuples that allow the user to define aggregation of objects as
attributes.
Sets are critical because they are a natural way of representing collections from the real world.
Tuples are critical because they are a natural way of representing properties of an entity.
Lists or arrays are important because they capture order, which occurs in the real world, and they
also arise in many scientific applications, where people need matrices or time series data.
Rule 2: Object identity
The OID must be independent of the object’s state. This feature allows the system to compare
objects at two different levels: comparing the OID (identical objects) and comparing the object’s
state.
An object has an existence which is independent of its value. Thus two notions of object
equivalence exist: two objects can be identical (they are the same object) or they can be equal
(they have the same value). This has two implications: one is object sharing and the other one is
object updates.
Object updates: Object identity is also a powerful data manipulation primitive that can be the
basis of set, tuple and recursive complex object manipulation.
Rule 3: Encapsulation
Objects have a public interface, but private implementation of data and methods. The
encapsulation feature ensures that only the public aspect of the object is seen, while the
implementation details are hidden. Need for modularity. Modularity is necessary to structure
complex applications designed and implemented by a team of programmers.
Rule 4: Types and classes
This rule allows the designer to choose whether the system supports types or classes. Types are
used mainly at compile time to check type errors in attribute value assignments.
A type, corresponds to the notion of an abstract data type. It has two parts: the interface and the
implementation (or implementations). The interface part is visible to the users of the type, the
implementation of the object is seen only by the type designer.
A class, is the same as that of a type, but it is more of a run-time notion. It contains two aspects:
an object factory and an object warehouse. The object factory can be used to create new objects,
by performing the operation new on the class, or by cloning some prototype object representative
of the class. The object warehouse means that attached to the class is its extension, i.e., the set of
objects that are instances of the class.
Rule 5: Class or type hierarchies (inheritance)
An object must inherit the properties of its super classes in the class hierarchy. Ensures code
reusability.
Rule 6: Overriding, overloading and late binding
This feature allows us to use the same method’s name in different classes. The OO system
decides which implementation to access at run time, on the basis of the class to which the object
belongs. Also known as late binding or dynamic binding.
Rule 7: Computational completeness
The basic notions of programming languages are augmented by features common to the database
data manipulation language (DML), thereby allowing us to express any type of operation in the
language.
Rule 8: Extensibility
This feature concerns the ability to define new types. There is no management distinction
between user-defined types and system-defined types.
Rule 9: Persistence
The conventional DBMS stores data permanently on disk; that is, the DBMS displays data
persistence. OO systems usually keep the entire object space in memory, so once the system is
shut down, the entire object space is lost.
Rule 10: The system must be able to manage very large databases.
Typical OO systems limit the object space to the amount of primary memory available. For
example: Smalltalk cannot handle objects larger than 64K. Therefore, a critical OODBMS
feature is to optimize the management of secondary storage devices by using buffers, indexes,
data clustering, and access path selection techniques.
Rule 11: The system must support concurrent users.
Conventional DBMSs are especially capable in this area. OODBMS must support the same level
of concurrency as conventional systems.
Rule 12: The system must be able to recover from hardware and software failures.
OODBMS must offer the same level of protection from hardware and software failures that the
traditional DBMS provides. It must provide support for automated backup and recovery tools.
Rule 13: The system must support an ad hoc query facility.
Relational DBMSs have provided a standard database query method through SQL (Structured
Query Language). OODBMSs provide an object query language (OQL) with similar capability.
OODBMS CONCEPTS
Object Oriented Database Concepts:
Basic Object-Oriented DB concepts are:
1. Objects
2. Object Identity
3. Attributes
4. Object State
5. Messages and Methods
6. Encapsulation
7. Classes
8. Inheritance
9. Method Overloading
10. Polymorphism
11. Object Classification
Objects:
An Object is an abstract representation of a real-world entity that has a unique identity,
embedded properties, and the ability to interact with other objects and with itself.
Note: The difference between an object and an entity is that an entity has data components and
relationships but lacks manipulative ability.
Fig: Objects
Object Identity:
Every object has a unique, system-generated object identity (OID) that does not change during
the object’s lifetime.
Attributes:
Objects are described by their attributes, known as instance variables. Each attribute has a unique
name and a data type associated with it. It has a domain, which logically groups and describes
the set of all possible values that an attribute can have.
For Ex: GPA domain, “any positive number between 0.00 and 4.00, with only two decimal
places”.
An object's attributes can be single-valued or multivalued, as in the E-R model. Object attributes
may reference one or more other objects.
For Ex: the attribute MAJOR refers to a Department object, the attributes COURSE_TAKEN
refers to a list of course objects.
Object State:
Object State is a set of values that the object’s attributes have at a given time. It can be changed
by changing the values of object’s attributes. To change the object’s attribute values, send a
message to the object which will invoke a method.
Messages and Methods:
Methods are the code that performs a specific operation on the object’s data. They protect data
from direct and unauthorized access by other objects and represent object behaviour. Methods
change the object’s attribute values or return the values of selected object attributes.
Every method is identified by a name and has a body. The body is composed of instructions
written in programming language to represent a real-world action.
For example, a method body to recompute a student’s cumulative GPA:
GPA = (SEMESTER_GPA*SEMESTER_HOURS + PREVIOUS_GPA*PREVIOUS_HOURS) /
(SEMESTER_HOURS + PREVIOUS_HOURS)
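The formula above can be wrapped in a method so that sending a message changes the object's state. This is a hedged sketch; the class and method names are our own, chosen to match the attribute names in the text.

```python
# Sending the message "update_gpa" invokes a method that changes the object's
# state (its gpa and accumulated hours).
class Student:
    def __init__(self, previous_gpa, previous_hours):
        self.gpa = previous_gpa        # part of the object's state
        self.hours = previous_hours

    def update_gpa(self, semester_gpa, semester_hours):
        """Recompute the cumulative GPA from this semester's results."""
        total_hours = self.hours + semester_hours
        self.gpa = (semester_gpa * semester_hours +
                    self.gpa * self.hours) / total_hours
        self.hours = total_hours
        return self.gpa
```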
Ability to hide the objects internal details (attributes and methods) from the message sender is
known as encapsulation.
Classes:
A class is a blueprint used to create objects; it includes the shared structure (attributes) and
behavior (methods) that all objects created from it share. It contains the description of the data
structure and the method implementation details for objects in that class.
An object must belong to only one class as an instance of that class (instance-of relationship). A
class is similar to an abstract data type. A class may also be primitive (no attributes), e.g. integer,
string, Boolean. Each object in a class is known as a class instance or object instance. It
encapsulates state through data placeholders called member variables and behavior through
reusable code called methods.
OO concepts (Class, Object, Method, State, Identity, Protocol, and Messages) together are shown
in a pictorial diagram:
Fig : OO CHARACTERISTICS
Inheritance:
Inheritance derives a new class (subclass) from an existing class (superclass). The subclass
inherits all the attributes and methods of the existing class and may have additional attributes and
methods. An important benefit of inheritance in OO systems is the notion of substitutability.
Inheritance is a way to form new classes, known as derived classes, that take over (or inherit) the
attributes and behavior of pre-existing classes, called base classes (or ancestor classes).
Inheritance is of two types:
1. Single
2. Multiple
Single Inheritance:
Single inheritance exists when a class has only one immediate (parent) superclass above it. When
a message is sent to an object instance, the system searches for a matching method through the
hierarchy, using the following sequence:
1. Scan the class to which the object belongs.
2. If the method is not found, scan its immediate superclass.
3. Repeat step 2 up the hierarchy until the top of the hierarchy is reached or
4. the method is found.
Multiple Inheritance:
Multiple inheritance exists when a class has more than one immediate (parent) superclass above
it. Such a class can inherit behaviours and features from more than one superclass, whereas in
single inheritance a class may inherit from at most one superclass.
Ex: Motorcycle subclass inherits characteristics from both the Motor Vehicle and Bicycle
superclasses. From Motor Vehicle superclass, the motorcycle subclass inherits: Characteristics,
such as fuel requirements, engine pistons, and horsepower. Behavior, such as start motor, fills
gas, and depress clutch.
From Bicycle superclass, the Motorcycle subclass inherits: Characteristics, such as two wheels
and handlebars. Behavior, such as straddle the seat and move the handlebar to turn.
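The Motorcycle example can be sketched directly in code. The attribute values below are illustrative placeholders, not from the text.

```python
# Multiple inheritance: Motorcycle inherits characteristics and behavior from
# both the MotorVehicle and Bicycle superclasses.
class MotorVehicle:
    def __init__(self):
        self.fuel_requirement = "petrol"   # characteristic
        self.horsepower = 50               # characteristic

    def start_motor(self):                 # behavior
        return "motor started"

class Bicycle:
    def __init__(self):
        self.wheels = 2                    # characteristic
        self.has_handlebars = True         # characteristic

    def turn(self):                        # behavior
        return "move the handlebar"

class Motorcycle(MotorVehicle, Bicycle):
    def __init__(self):
        MotorVehicle.__init__(self)        # inherit motor-vehicle characteristics
        Bicycle.__init__(self)             # inherit bicycle characteristics
```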
Encapsulation:
Encapsulation is the ability of an object to be a container for related properties (i.e. data
variables) and methods (i.e. functions). Data hiding is the ability of objects to protect variables
from external access. Variables marked as private can only be seen or modified through the use
of public accessor and mutator methods.
Encapsulation is used to implement abstraction. We combine the data and the methods that
operate on the data and put them in a single unit. Every module of the system can then change
independently, with no impact on the other modules.
For Ex: Think of a person driving a car. He doesn’t need to know how the engine works or the
gear changes work, to be able to drive the car (Encapsulation). Instead, he needs to know things
how much turning the steering wheel needs, etc (Abstraction).
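Data hiding with accessor and mutator methods can be sketched as below. This is a hedged illustration; the Account class and its names are our own.

```python
# Encapsulation / data hiding: the balance is private and can only be read or
# changed through public accessor and mutator methods.
class Account:
    def __init__(self, balance):
        self.__balance = balance        # name-mangled: hidden from outside code

    def get_balance(self):              # public accessor
        return self.__balance

    def deposit(self, amount):          # public mutator, with validation
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.__balance += amount
```

In Python the double underscore gives name mangling rather than strict privacy, but the principle is the same: callers use the public interface, not the internal variable.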
Method Overriding:
The ability of a subclass to override a method in its superclass allows a class to inherit from a
superclass whose behavior is "close enough" and then override methods as needed. A subclass
cannot override methods that are declared final in the superclass (by definition, final methods
cannot be overridden). A subclass must override methods that are declared abstract in the
superclass, or the subclass itself must be abstract. It is the process of defining a function in the
child class with the same name and signature. The child class method hides the parent class
method. Method Overriding is used to provide different implementations of a function so that a
more specific behavior can be realized.
For Ex: Define a Bonus method to compute a Christmas bonus for all employees. The bonus
computation depends on the type of the employee. In this case, with the exception of pilots, an
employee receives a Christmas bonus equal to 5 percent of his salary. Pilots receive a Christmas
bonus based on accumulated flight pay rather than on annual salary. By redefining the Bonus
method in the Pilot subclass, we are overriding the Employee Bonus method for all objects that
belong to the Pilot subclass.
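The Bonus example can be sketched as follows. The 5 percent rate for pilots' flight pay is an assumption for illustration; the text only says pilots' bonuses are based on flight pay.

```python
# Method overriding: Pilot overrides the Bonus method inherited from Employee.
class Employee:
    def __init__(self, salary):
        self.salary = salary

    def bonus(self):
        return 0.05 * self.salary        # 5 percent of annual salary

class Pilot(Employee):
    def __init__(self, salary, flight_pay):
        Employee.__init__(self, salary)
        self.flight_pay = flight_pay

    def bonus(self):                     # overrides Employee.bonus
        return 0.05 * self.flight_pay    # based on accumulated flight pay (assumed rate)
```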
Polymorphism:
Polymorphism is the capability of an action or method to do different things based on the object
that it is acting upon. Overloading and Overriding are two types of polymorphism.
We may use the same name for a method defined in different classes in the class hierarchy. The
user may send the same message to different objects that belong to different classes and yet
generate the correct response.
The Pilot monthPay method definition overrides and expands the Employee monthPay method
defined in the Employee superclass.
The monthPay method that was defined in the Employee superclass is reused by the Pilot and
Mechanic subclasses.
OODBMS should support complex objects representation. It should be extensible; i.e. it should
be capable of defining new data types and operations to be carried out on them. It should support
encapsulation; i.e. data representation and method’s implementation must be hidden from
external entities. It should exhibit inheritance; i.e. an object must be able to inherit the properties
(data and methods) of other objects.
The first deficiency in RDBMS is with the SQL-92 relational language, which is limited. It
supports a restricted set of built-in data types that accommodate only numbers and strings,
whereas many database applications deal with complex objects such as geographic points, text
and digital data. The problem is how this data is used. The second deficiency in RDBMS is that
it suffers from certain structural shortcomings: relational tables are flat and do not provide good
support for nested structures, such as sets and arrays. The third deficiency is that RDBMS did
not take advantage of Object-Oriented approaches, which have gained widespread acceptance. OO
techniques reduce costs and improve information system quality by adopting an object-centric
view of software development.
The first deficiency in OODBMS is that vendors rediscovered the difficulties of tying database
design too closely to application design. The second is that they relearned that declarative
languages such as SQL-92 bring such tremendous productivity gains that organizations will pay
for the additional computational resources they require. The third is that they rediscovered that
the lack of a standard data model leads to design errors and inconsistencies.
The main drawback of OODBMSs has been poor performance. Unlike for RDBMSs, query
optimization for OODBMSs is highly complex. OODBMSs also suffer from problems of
scalability, and are unable to support large-scale systems.
Thus, ORDBMS emerged as a way to enhance the capabilities of RDBMS with some of the
features that appeared in OODBMS.
An object-oriented database consists of classes as schema and objects as data; each object has an
OID as a unique identifier, and data operations are encapsulated. A subclass object and a
superclass object are physically one object seen in different views, such that the subclass inherits
the data and operations of the superclass. Objects are associated with each other through stored
OIDs in bi-directional references, that is, an association and an inverse association. A stored OID
is a reference by use of an OID, which is generated by the system. Polymorphism means
overloading in an object-oriented database, such that the same function can produce different
output depending on the values of its runtime input parameters.
In mapping a relational schema into an object-oriented schema, each relation is mapped into a
class, and each foreign key is mapped into an association attribute, which is the data structure of
a stored OID. Superclass and subclass relations are likewise mapped into a superclass and
subclass in the object-oriented database, with the subclass inheriting the data and operations of
the superclass.
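The mapping can be sketched as follows. The relations (Department, Employee) and their attributes are invented for illustration; in Python an object reference plays the role of the stored OID.

```python
# Mapping a relational schema to an OO schema: the foreign key dept_no in
# Employee becomes an association attribute (a stored OID, here an object
# reference), with an inverse association maintained in Department.
class Department:
    def __init__(self, dept_no, name):
        self.dept_no, self.name = dept_no, name
        self.employees = []              # inverse association: set of stored OIDs

class Employee:
    def __init__(self, emp_no, name, dept):
        self.emp_no, self.name = emp_no, name
        self.dept = dept                 # association attribute (stored OID)
        dept.employees.append(self)      # maintain the bi-directional reference
```

Navigation then replaces the relational join: from an employee we follow `dept` directly, and from a department we follow the `employees` set.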
A method is an application program with a set of operations accessing a class inside object-
oriented database. In other words, a method describes object operations limited to a class only.
A class definition consists of two parts: {Attributes} and {Methods}.
A complex data type is an object nested inside another object. A primitive data type is a data type that
cannot be decomposed further.
An OID is an object identity which is system generated with a unique address. A Stored OID is
an OID stored in another object used as a pointer for reference.
Inheritance means the reuse (inheritance) of superclass data and operations in a subclass. The
benefits are eliminating data redundancy in data storage and providing different logical views of
the superclass and subclass of the same object. The same object appears in the superclass view
and the subclass view, but is stored as ONE object inside the OODB.
A class is a set of similar objects grouped together; it is the data structure of the OODB schema.
An object is object data, including the object title, attributes and methods.
A superclass is a class that includes subclass(es). A subclass is a class that is inside a superclass
and can inherit the data and methods of the superclass.
Objects refer to each other through bi-directional pointers (pointers and inverse pointers). An
object can refer to itself through a recursive pointer.
A data model is a DDL plus a DML of a database. A database system is database storage plus
basic database functions such as transaction processing, recovery, concurrency control and
security.
A Set in an association attribute data type means referring to multiple occurrences of other
objects by using a set of stored OIDs in a one-to-many association between two objects.
A Set value is a set of multiple values, which is a valid data type in OODB.
Polymorphism is overloading with the same function name, where a call gives different results
depending on the runtime parameters.
The general rules of mapping an EER model into an OODB schema are:
Step 1: Map each entity into a class, with the entity’s attributes as class attributes.
Step 2: Map each relationship into an association attribute of stored OIDs, together with an
inverse association attribute in the referenced class.
Step 3: Map each isa (generalization) relationship into superclass/subclass inheritance.
Step 4: Map categorization into multiple inheritance of subclasses in the OODB. One inheritance
overrides another inheritance if there is a conflict.
Tutorial question:
An Extended Entity Relationship model has been designed for the database shown below. Show
the object-oriented database schema for the implementation of the EER model design. (Classes
(50%), Attributes (50%).)
[EER diagram: entity Boat_Person (Name, Birth_date, Birth_place) specializes (isa, d) into
Refugee (Name, Professional_Status) and Non-refugee (Name); Non-refugee is linked by a
Detain relationship (Detain_Date) to Detention_center (Center_name); Refugee specializes into
a resettled refugee linked by a Resettle relationship (Resettle_date) to Accepted_Country
(Country_name), and Waiting_refugee (Name) linked to Departure_center
(Departure_center_name) and Open_center (Open_center_name).]
This chapter introduces the basic concepts of object oriented databases. Its purpose is to help
you decide whether you should investigate such products further, and to understand how they
might work.
Object Oriented Databases generally provide persistent storage for objects. In addition, they
may provide one or more of the following: a query language; indexing; transaction support with
rollback and commit; the possibility of distributing objects transparently over many servers.
These features are described in the following sections. Some database vendors may charge
separately for some of these features.
Some Object Oriented Databases also come with extra tools such as visual schema designers,
Integrated Development Environments and debuggers.
Context
Unlike a relational database, which usually works with SQL, an object oriented database works
in the context of a regular programming language such as C++, C or Java. Furthermore, an
object-oriented database may be host-specific, or it may be able to read the same database from
multiple hosts, or even from multiple kinds of host, such as a SPARC server under Solaris 8 and
a PC under Linux. Some object oriented database servers can support heterogeneous clients, so
that the SPARC system and a PC and a Macintosh (for example) might all be accessing the same
database.
Persistence
With a relational database, you store information explicitly in tables, and then get it back again
later with queries. Although you can use an object oriented database in that way, it's not the only
way.
Consider a computer aided design application in which the user can save and load complex
engineering drawings into memory. With a file-based system, loading a drawing might involve
reading a large external file into memory and creating tens of thousands of objects before the
user can start working. With a relational database the software would run database queries to
create those same objects.
With an object oriented database, the software calls a database function to load the illustration,
but objects are not created in memory until they are needed: instead, they are stored in the
database, and only references are loaded into memory.
When an object is changed, the database silently writes the changes to the database, keeping the
in-database version up to date at all times. When the user presses "save", all the application does
is to commit the current transaction; since the database is already up to date, this is generally
very fast. The code no longer needs to be able to read or write the proprietary save file format,
and may well also run faster.
The most widely used DBMS is the Relational Database Management System (RDBMS). This
system is based on a table structure that stores and manages data. A table is a predefined
category of data made up of rows and columns. The columns store the fields that define the
category of data. Each row holds a complete record for the table where the data is stored. Each
table has a key field that uniquely identifies each record; the key field is used to create
relationships between tables in an effort to connect data. This type of organization allows data to
be stored in smaller increments and then connected through association. Business rules are
applied to the tables and fields to ensure the data is accessed and used properly. SQL (Structured
Query Language) is the tool/language that is used to interact with and between tables to utilize
the data in ways that are meaningful to the business rules.
The Object Oriented Database Management System (OODBMS) does not have as high a usage
rate. This type of DBMS provides high performance for companies with extensive amounts of
highly complex data. OODBMSs incorporate Object Oriented technology, where data is seen as
objects and classes (collections of like objects). The data objects utilize the concept of
inheritance, where the lower classes inherit the data definitions and methods of the upper classes.
A class defines only the data it is associated with. This helps to determine how the classes of
objects relate to each other. Data is accessed in a transparent manner through interactions of
persistent objects.
So why would one choose RDBMS or OODBMS? There is no real right or wrong answer to this
question. The choice made is based on the data to be stored/managed, the type of database
needed and the technology preferences of the company providing the service or company who is
receiving the service. Often the choice is made based on the skill set available and the DBMS
that is already available.
Regardless of the preference, each DBMS has its benefits and drawbacks. OODBMSs are
documented as being easy to maintain, as classes and objects can be developed and updated
separately from the system. Performance is also high with OODBMSs, as one can store complex
datasets in their entirety and therefore process data more quickly. Due to the class structure, the
data can be more easily distributed across networks as well as the distribution of work. A query
language is not necessary since the interaction of the data is done by transparently accessing the
objects. No keys are needed to identify the datasets or create connections between the
relationships. Many developers find the programming time to be reduced with an OODBMS
since objects inherit the characteristics of the classes. The use of classes also helps to ensure the
integrity of the data. In addition, a class is reusable for the existing database and other databases
so that it can be distributed more easily across networks.
On the other hand, Relational Database Management Systems (RDBMSs) are much easier to learn
and create. Many of the available systems have a GUI interface that makes the technology
available to people who are not highly technical. Since the database is not dependent on a
complex schema, increasing the capability and size is relatively easy. Ad-hoc queries can also be
added using Structured Query Language (SQL) once the production database has been
completed. In addition, the data can be used independently as the tables are set up as separate
entities rather than grouped in class.
Both systems have their drawbacks as well. OODBMSs can be somewhat complex and difficult
to learn due to the object oriented technology. When a change
needs to be made to the database, the entire schema must be updated. Queries are dependent
upon the system and therefore must be predetermined in the planning stages. Adding queries to
the database after the fact is a difficult task.
While RDBMSs are easier to use, they are limited to simple data types and therefore do not
support more complex types such as multimedia. In addition, if the data that needs to be
processed is complicated and extensive, performance may suffer. While there are lots of
solutions within this family of database systems, they may not be robust enough to handle larger
scale projects.
Conclusion
Both types of database technologies provide a solution for the right type of project. The
choice to use one vs. the other depends on the type of project, skills of the development group
and the technology available for the company who is looking for a DBMS.
UNIT-6
Distributed Databases: Basic concepts, options for distributing a database, distributed DBMS.
A homogenous distributed database system is a network of two or more Oracle Databases that
reside on one or more machines. Below Figure illustrates a distributed system that connects three
databases: hq, mfg, and sales. An application can simultaneously access or modify the data in
several databases in a single distributed environment. For example, a single query from a
Manufacturing client on local database mfg can retrieve joined data from the products table on
the local database and the dept table on the remote hq database.
For a client application, the location and platform of the databases are transparent. You can also
create synonyms for remote objects in the distributed system so that users can access them with
the same syntax as local objects. For example, if you are connected to database mfg but want to
access data on database hq, creating a synonym on mfg for the remote dept table enables you to
issue a query such as SELECT * FROM dept;
An Oracle Database distributed database system can incorporate Oracle Databases of different
versions. All supported releases of Oracle Database can participate in a distributed database
system. Nevertheless, the applications that work with the distributed database must understand
the functionality that is available at each node in the system. A distributed database application
cannot expect an Oracle7 database to understand the SQL extensions that are only available with
Oracle Database.
The terms distributed database and distributed processing are closely related, yet have
distinct meanings. Their definitions are as follows:
Distributed database
A set of databases in a distributed system that can appear to applications as a single data
source.
Distributed processing
The operations that occur when an application distributes its tasks among different
computers in a network. For example, a database application typically distributes front-
end presentation tasks to client computers and allows a back-end database server to
manage shared access to a database. Consequently, a distributed database processing
system is more commonly referred to as a client/server database application system.
The terms distributed database system and database replication are related, yet distinct. In
a pure (that is, not replicated) distributed database, the system manages a single copy of all data
and supporting database objects. Typically, distributed database applications use distributed
transactions to access both local and remote data and modify the global database in real-time.
The term replication refers to the operation of copying and maintaining database objects in
multiple databases belonging to a distributed system. While replication relies on distributed
database technology, database replication offers applications benefits that are not possible within
a pure distributed database environment.
Most commonly, replication is used to improve local database performance and protect the
availability of applications because alternate data access options exist. For example, an
application may normally access a local database rather than a remote server to minimize
network traffic and achieve maximum performance. Furthermore, the application can continue to
function if the local server experiences a failure, because other servers with replicated data remain
accessible.
The Oracle Database server accesses the non-Oracle Database system using Oracle
Heterogeneous Services in conjunction with an agent. If you access the non-Oracle Database
data store using an Oracle Transparent Gateway, then the agent is a system-specific application.
For example, if you include a Sybase database in an Oracle Database distributed system, then
you need to obtain a Sybase-specific transparent gateway so that the Oracle Database in the
system can communicate with it.
Alternatively, you can use generic connectivity to access non-Oracle Database data stores so
long as the non-Oracle Database system supports the ODBC or OLE DB protocols.
Heterogeneous Services
Heterogeneous Services (HS) is an integrated component within the Oracle Database server and
the enabling technology for the current suite of Oracle Transparent Gateway products. HS
provides the common architecture and administration mechanisms for Oracle Database gateway
products and other heterogeneous access facilities. Also, it provides upwardly compatible
functionality for users of most of the earlier Oracle Transparent Gateway releases.
For each non-Oracle Database system that you access, Heterogeneous Services can use a
transparent gateway agent to interface with the specified non-Oracle Database system. The agent
is specific to the non-Oracle Database system, so each type of system requires a different agent.
The transparent gateway agent facilitates communication between Oracle Database and non-
Oracle Database systems and uses the Heterogeneous Services component in the Oracle
Database server. The agent executes SQL and transactional requests at the non-Oracle Database
system on behalf of the Oracle Database server.
Generic Connectivity
Generic connectivity enables you to connect to non-Oracle Database data stores by using either a
Heterogeneous Services ODBC agent or a Heterogeneous Services OLE DB agent. Both are
included with your Oracle product as a standard feature. Any data source compatible with the
ODBC or OLE DB standards can be accessed using a generic connectivity agent.
The advantage of generic connectivity is that you may not need to purchase and
configure a separate system-specific agent. You use an ODBC or OLE DB driver that can
interface with the agent. However, some data access features are only available with transparent
gateway agents.
A database server is the Oracle software managing a database, and a client is an application that
requests information from a server. Each computer in a network is a node that can host one or
more databases. Each node in a distributed database system can act as a client, a server, or both,
depending on the situation.
A client can connect directly or indirectly to a database server. A direct connection occurs when
a client connects to a server and accesses information from a database contained on that server.
For example, if you connect to the hq database and access the dept table on this database as in
the figure below, you can issue the following:
SELECT * FROM dept;
This query is direct because you are not accessing an object on a remote database.
In contrast, an indirect connection occurs when a client connects to a server and then accesses
information contained in a database on a different server. For example, if you connect to
the hq database but access the emp table on the remote sales database as in the figure above, you
can issue the following:
SELECT * FROM emp@sales;
A database link is a pointer that defines a one-way communication path from an Oracle Database
server to another database server. The link pointer is actually defined as an entry in a data
dictionary table. To access the link, you must be connected to the local database that contains the
data dictionary entry.
A database link connection is one-way in the sense that a client connected to local database A
can use a link stored in database A to access information in remote database B, but users
connected to database B cannot use the same link to access data in database A. If local users on
database B want to access data on database A, then they must define a link that is stored in the
data dictionary of database B.
A database link connection allows local users to access data on a remote database. For this
connection to occur, each database in the distributed system must have a unique global database
name in the network domain. The global database name uniquely identifies a database server in
a distributed system.
Database Link
One principal difference among database links is the way that connections to a remote database
occur. Users access a remote database through the following types of links:
Connected user link – Users connect as themselves, which means that they must have an account
on the remote database with the same username as their account on the local database.
Fixed user link – Users connect using the username and password referenced in the link. For
example, if Jane uses a fixed user link that connects to the hq database with the username and
password scott/tiger, then she connects as scott. Jane has all the privileges in hq granted to scott
directly, and all the default roles that scott has been granted in the hq database.
Current user link – A user connects as a global user. A local user can connect as a global user in
the context of a stored procedure, without storing the global user's password in a link definition.
For example, Jane can access a procedure that Scott wrote, accessing Scott's account and Scott's
schema on the hq database. Current user links are an aspect of Oracle Advanced Security.
A shared database link is a link between a local server process and the remote database. The link
is shared because multiple client processes can use the same link simultaneously.
When a local database is connected to a remote database through a database link, either database
can run in dedicated or shared server mode. The four possible combinations (local server
mode / remote server mode) are:
(i) Dedicated / Dedicated
(ii) Dedicated / Shared server
(iii) Shared server / Dedicated
(iv) Shared server / Shared server
A shared database link can exist in any of these four configurations. Shared links differ from
standard database links in the following ways:
Different users accessing the same schema object through a database link can share a
network connection.
When a user needs to establish a connection to a remote server from a particular server
process, the process can reuse connections already established to the remote server. The
reuse of the connection can occur if the connection was established on the same server
process with the same database link, possibly in a different session. In a nonshared
database link, a connection is not shared across multiple sessions.
When you use a shared database link in a shared server configuration, a network
connection is established directly out of the shared server process in the local server.
The great advantage of database links is that they allow users to access another user's objects in a
remote database so that they are bounded by the privilege set of the object owner. In other words,
a local user can access a link to a remote database without having to be a user on the remote
database.
To understand how a database link works, you must first understand what a global database
name is. Each database in a distributed database is uniquely identified by its global database
name. The database forms a global database name by prefixing the database network domain,
specified by the DB_DOMAIN initialization parameter at database creation, with the individual
database name, specified by the DB_NAME initialization parameter.
The name of a database is formed by starting at the leaf of the tree and following a path to the
root. For example, the mfg database is in division3 of the acme_tools branch of the com domain.
The global database name for mfg is created by concatenating the nodes in the tree as follows:
mfg.division3.acme_tools.com
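The concatenation rule above can be sketched as a one-line helper. This is purely illustrative, since Oracle forms the global name internally from the DB_NAME and DB_DOMAIN initialization parameters; the function name is invented:

```python
# Sketch: forming a global database name by prefixing the network
# domain (DB_DOMAIN) with the individual database name (DB_NAME).
# Illustrative only; Oracle performs this concatenation internally.

def global_db_name(db_name: str, db_domain: str) -> str:
    """Concatenate the database name with its network domain."""
    return f"{db_name}.{db_domain}"

print(global_db_name("mfg", "division3.acme_tools.com"))
# mfg.division3.acme_tools.com
```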
While several databases can share an individual name, each database must have a unique global
database name. For example, the network domains
us.americas.acme_auto.com and uk.europe.acme_auto.com each contain a sales database. The
global database naming system distinguishes the sales database in the Americas division from
the sales database in the Europe division as follows:
sales.us.americas.acme_auto.com
sales.uk.europe.acme_auto.com
Typically, a database link has the same name as the global database name of the remote database
that it references. For example, if the global database name of a database is sales.us.oracle.com,
then the database link is also called sales.us.oracle.com.
Connected user links have no connect string associated with them. The advantage of a
connected user link is that a user referencing the link connects to the remote database as the same
user. Furthermore, because no connect string is associated with the link, no password is stored in
clear text in the data dictionary.
Connected user links have some disadvantages. Because these links require users to have
accounts and privileges on the remote databases to which they are attempting to connect, they
require more privilege administration for administrators. Also, giving users more privileges than
they need violates the fundamental security concept of least privilege: users should only be given
the privileges they need to perform their jobs.
A benefit of a fixed user link is that it connects a user in a primary database to a remote database
with the security context of the user specified in the connect string. For example, local
user Joe can create a public database link in Joe’s schema that specifies the fixed user Scott with
password tiger. If Jane uses the fixed user link in a query, then Jane is the user on the local
database, but she connects to the remote database as scott/tiger.
UNIT-7
Data warehousing: Introduction Basic concepts, data warehouse architecture, data
characteristics, reconciled data layer data transformations, derived data layer user interface.
[Figure: data warehouse architecture – operational source data passes through the load manager
into the warehouse, which holds detailed data, highly summarized data, and meta data under the
control of the warehouse manager (with archive/backup); end users access the warehouse through
the query manager.]
Fig: Each component and the tasks performed by it are explained below:
The data in a data warehouse comes from operational systems of the organization as well as from
other external sources. These are collectively referred to as source systems. The data extracted
from source systems is stored in an area called the data staging area, where it is cleaned,
transformed, combined, and deduplicated to prepare it for use in the data warehouse. The data
staging area is generally a collection of machines where simple activities such as sorting and
sequential processing take place. The data staging area does not provide any query or
presentation services. As soon as a system provides query or presentation services, it is
categorized as a presentation server. A presentation server is the target machine on which the
data is loaded from the data staging area organized and stored for direct querying by end users,
report writers and other applications. The three different kinds of systems that are required for a
data warehouse are:
1. Source Systems
2. Data Staging Area
3. Presentation servers
The data travels from source systems to presentation servers via the data staging area. The entire
process is popularly known as ETL (extract, transform, and load) or ETT (extract, transform, and
transfer). Oracle’s ETL tool is called Oracle Warehouse Builder (OWB) and MS SQL Server’s
ETL tool is called Data Transformation Services (DTS).
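The source-to-staging-to-presentation flow described above can be sketched in miniature. The row contents, field names, and cleaning rules below are invented for illustration and are not taken from any particular ETL tool:

```python
# Minimal ETL sketch: extract rows from hypothetical source systems,
# transform them in a "staging area" (trim, normalize, deduplicate),
# and load the result into a presentation-layer list.

def extract():
    # rows as they arrive from source systems (inconsistent formats)
    return [
        {"name": " Alice ", "region": "north"},
        {"name": "BOB", "region": "North"},
    ]

def transform(rows):
    # staging-area work: trim whitespace, normalize case, deduplicate
    cleaned = [{"name": r["name"].strip().title(),
                "region": r["region"].title()} for r in rows]
    seen, result = set(), []
    for r in cleaned:
        key = (r["name"], r["region"])
        if key not in seen:
            seen.add(key)
            result.append(r)
    return result

def load(rows, warehouse):
    # loading: physical movement into the warehouse store
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)
print(warehouse)
# [{'name': 'Alice', 'region': 'North'}, {'name': 'Bob', 'region': 'North'}]
```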
1. OPERATIONAL DATA
Data for the data warehouse is supplied from the following sources:
(i) The data from the mainframe systems in the traditional network and hierarchical
format.
(ii) Data can also come from the relational DBMS like Oracle, Informix.
(iii) In addition to these internal data, operational data also includes external data
obtained from commercial databases and databases associated with supplier and
customers.
2. LOAD MANAGER
The load manager performs all the operations associated with extraction and loading data into the
data warehouse. These operations include simple transformations of the data to prepare the data
for entry into the warehouse. The size and complexity of this component will vary between data
warehouses and may be constructed using a combination of vendor data loading tools and
custom built programs.
3. WAREHOUSE MANAGER
The warehouse manager performs all the operations associated with the management of data in
the warehouse. This component is built using vendor data management tools and custom built
programs. The operations performed by warehouse manager include:
(i) Analysis of data to ensure consistency
(ii) Transformation and merging of source data from temporary storage into data
warehouse tables
(iii) Creation of indexes and views on the base tables
(iv) Denormalization
(v) Generation of aggregations
(vi) Backing up and archiving of data
In certain situations, the warehouse manager also generates query profiles to determine which
indexes and aggregations are appropriate.
[Figure: data warehousing ETL flow – operational sources (many tables, few columns per table)
feed the staging area, where data is transformed and loaded into the data warehouse (few tables,
many columns per table; several TB), which is accessed by ad hoc query tools and data mining
applications.]
4. Data tends to exist at multiple levels of granularity. Most important, the data tends to be of a
historical nature, with potentially high time variance.
6. The DW should have a capability for rewriting history, that is, allowing for “what-if” analysis.
Logical Design
[Figure: star-schema logical design – a central Fact Table carries foreign keys CustID,
ShipDateID, and BindID plus degenerate dimension JobID, and is linked to the dimension tables
Customer (CustId, Name, Cust Type, City, State Province, Country), Ship Calendar (ShipDateID,
Ship Date, Ship Month, Ship Year), and Bind Style (BindId).]
4. QUERY MANAGER
The query manager performs all operations associated with management of user queries. This
component is usually constructed using vendor end-user access tools, data warehousing
monitoring tools, database facilities and custom built programs. The complexity of a query
manager is determined by facilities provided by the end-user access tools and database.
5. DETAILED DATA
This area of the warehouse stores all the detailed data in the database schema. In most cases
detailed data is not stored online but aggregated to the next level of detail. However, the detailed
data is added regularly to the warehouse to supplement the aggregated data.
8. META DATA
The data warehouse also stores all the Meta data (data about data) definitions used by all
processes in the warehouse. It is used for a variety of purposes, including:
(i) The extraction and loading process – Meta data is used to map data sources to a
common view of information within the warehouse.
(ii) The warehouse management process – Meta data is used to automate the
production of summary tables.
(iii) As part of Query Management process Meta data is used to direct a query to the
most appropriate data source.
The structure of Meta data will differ in each process, because the purpose is different.
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful
in decision making, but others are of less value for that purpose. For this reason, it is
necessary to extract the relevant data from the operational database before bringing into the data
warehouse. Many commercial tools are available to help with the extraction process. Data
Junction is one of the commercial products. The user of one of these tools typically has an easy-
to-use windowed interface by which to specify the following:
(i) Which files and tables are to be accessed in the source database?
(ii) Which fields are to be extracted from them? This is often done internally by an
SQL SELECT statement.
(iii) What are these fields to be called in the resulting database?
(iv) What is the target machine and database format of the output?
(v) On what schedule should the extraction process be repeated?
TRANSFORM
The operational databases were developed according to sets of priorities that keep changing
with the requirements. Those who develop a data warehouse based on these databases are
therefore typically faced with inconsistencies among their data sources. The transformation
process deals with rectifying any such inconsistencies.
CLEANSING
Information quality is the key consideration in determining the value of the information. The
developer of the data warehouse is not usually in a position to change the quality of its
underlying historic data, though a data warehousing project can put a spotlight on the data quality
issues and lead to improvements for the future. It is, therefore, usually necessary to go through
the data entered into the data warehouse and make it as error free as possible. This process is
known as Data Cleansing.
Data Cleansing must deal with many types of possible errors. These include missing data and
incorrect data at one source; inconsistent data and conflicting data when two or more source are
involved. There are several algorithms followed to clean the data, which will be discussed in the
coming lecture notes.
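As a rough illustration of the kinds of rules a cleansing pass applies, the sketch below fills a missing field with a default and discards rows that fail a validity check. The field names, default value, and validity range are hypothetical, not drawn from any particular cleansing algorithm:

```python
# Sketch of a cleansing pass over extracted rows: fill missing data
# with a default and drop rows whose values are clearly incorrect.

def cleanse(rows):
    cleaned = []
    for row in rows:
        row = dict(row)                          # don't mutate the source
        row.setdefault("country", "UNKNOWN")     # missing data: default it
        if row.get("age") is not None and not (0 <= row["age"] <= 120):
            continue                             # incorrect data: discard
        cleaned.append(row)
    return cleaned

rows = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": -5},                 # incorrect age -> dropped
    {"id": 3, "age": 51},                 # missing country -> defaulted
]
print(cleanse(rows))
# [{'id': 1, 'age': 34, 'country': 'IN'}, {'id': 3, 'age': 51, 'country': 'UNKNOWN'}]
```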
LOADING
Loading often implies physical movement of the data from the computer(s) storing the source
database(s) to that which will store the data warehouse database, assuming it is different. This
takes place immediately after the extraction phase. The most common channel for data
movement is a high-speed communication link. For example, Oracle Warehouse Builder is the
Oracle tool that provides the features to perform the ETL task on an Oracle data warehouse.
End-user query processor – A program or utility that allows end users to retrieve data and
generate reports without writing application programs.
Data access and control logic – The system software that controls access to the physical
database and maintains various internal data structures (for example, indices and
pointers).
Database – The physical data store (or stores) combined with the schema.
Schema – A store of data that describes various aspects of the “real” data, including data
types, relationships, indices, content restrictions, and access controls.
Physical data store – The “real” data as stored on a physical storage medium (for
example, a magnetic disk).
A database schema is a store of data that describes the content and structure of the physical data
store (sometimes called metadata—data about data). It contains a variety of information about
data types, relationships, indices, content restrictions, and access controls.
3. Why have databases become the preferred method of storing data used by an information
system?
Databases are a common point of access, management, and control. They allow data to be
managed as an enterprise-wide resource while providing simultaneous access to many different
users and application programs. They solve many of the problems associated with separately
maintained data stores, including redundancy, inconsistent security, and inconsistent data access
methods.
4. List four different types of database models and DBMSs. Which are in common use
today?
The four database models are hierarchical, network (CODASYL), relational, and object-oriented.
Hierarchical and network models are technologies of the 1960s and 1970s and are rarely found
today. The relational model was developed in the 1970s and widely deployed in the 1980s and
1990s. It is currently the predominant database model. The object-oriented database model was
first developed in the 1990s and is still being developed today. It is expected to slowly replace
the relational model over the next decade.
5. With respect to relational databases, briefly define the terms row and field.
Row – The portion of a table containing data that describes one entity, relationship, or object.
Field – The portion of a table (a column) containing data that describes the same fact about all
entities, relationships, or objects in the table.
6. What is a primary key? Are duplicate primary keys allowed? Why or why not?
A primary key is a field or set of fields, the values of which uniquely identify a row of a table.
Because primary keys must uniquely identify a row, duplicate key values aren’t allowed.
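The reason duplicates are rejected can be shown with a toy in-memory table. This is an illustrative sketch only, not how any real DBMS implements key enforcement:

```python
# Sketch: a tiny "table" keyed on the primary key refuses a second row
# with the same key value, because the key would then no longer
# identify exactly one row.

class Table:
    def __init__(self):
        self.rows = {}          # primary key -> row

    def insert(self, pk, row):
        if pk in self.rows:
            raise ValueError(f"duplicate primary key: {pk}")
        self.rows[pk] = row

t = Table()
t.insert(101, {"name": "Alice"})
try:
    t.insert(101, {"name": "Bob"})   # same key, different row
except ValueError as e:
    print(e)   # duplicate primary key: 101
```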
7. What is the difference between a natural key and an invented key? Which type is most
commonly used in business information processing?
A natural key is a key that already exists as an attribute of the thing being described. An invented
key is one that is assigned by a system (for example, a social security or credit card number). Most
keys used in business information processing are invented.
8. What is a foreign key? Why are foreign keys used or required in a relational database?
Are duplicate foreign key values allowed? Why or why not?
A foreign key is a field value (or set of values) stored in one table that also exists as a primary
key value in another table. Foreign keys are used to represent relationships among entities that
are represented as tables. Duplicate foreign keys are not allowed within the same table because
they would redundantly represent the same relationship. Duplicate foreign keys may exist in
different tables because they would represent different relationships.
9. Describe the steps used to transform an ERD into a relational database schema.
7. Choose appropriate data types and value restrictions for each field.
A one-to-many relationship is represented by adding the primary key field(s) of the table that
represents the entity participating in the “one” side of the relationship to the table that represents
the entity participating in the “many” side of the relationship.
13. What is referential integrity? Describe how it is enforced when a new foreign key value is
created, when a row containing a primary key is deleted, and when a primary key value is
changed.
Referential integrity is a content constraint between the values of a foreign key and the values of
the corresponding primary key in another table. The constraint is that values of the foreign key
field(s) must either exist as values of a primary key or must be NULL. A valid value must exist
in the foreign key field(s) before the row can be added. When a row containing the primary key
is deleted, the row with the foreign key must also be deleted for the data to maintain referential
integrity. A primary key should never be changed; but in the event that it is, the value of the
foreign key must also be changed.
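The enforcement cases above can be sketched with two toy tables. The dept/emp names echo the Oracle examples earlier in the notes, and the cascade-on-delete policy shown is just one possible response (setting the foreign key to NULL is another):

```python
# Sketch of referential integrity between a "dept" table (primary key
# deptno) and an "emp" table whose deptno column is a foreign key.

dept = {10: "ACCOUNTING", 20: "RESEARCH"}          # pk -> row
emp = []                                           # rows with fk deptno

def add_emp(name, deptno):
    # a new foreign key value must exist as a primary key, or be NULL
    if deptno is not None and deptno not in dept:
        raise ValueError(f"no such dept: {deptno}")
    emp.append({"name": name, "deptno": deptno})

def delete_dept(deptno):
    # deleting a primary-key row must not strand foreign keys;
    # here the delete is cascaded to the referencing rows
    dept.pop(deptno)
    emp[:] = [e for e in emp if e["deptno"] != deptno]

add_emp("SMITH", 20)
add_emp("JONES", None)       # NULL foreign key is allowed
delete_dept(20)              # SMITH's row is cascaded away
print(emp)                   # [{'name': 'JONES', 'deptno': None}]
```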
14. What types of data (or fields) should never be stored more than once in a relational
database? What types of data (or fields) usually must be stored more than once in a relational
database?
If a table represents an entity, the primary key values of each entity represented in the table are
redundantly stored (as foreign keys) for every relationship in which the entity participates.
15. What is relational database normalization? Why is a database schema in third normal
form considered to be of higher quality than an unnormalized database schema?
Relational database normalization is a process that increases schema quality by minimizing data
redundancy. A schema with tables in third normal form has less non-key data redundancy than a
schema with unnormalized tables. Less redundancy makes the schema and database contents
easier to maintain over the long term.
16. Describe the process of relational database normalization. Which normal forms rely on
the definition of functional dependency?
The process of normalization modifies the schema and table definitions by successively applying
higher order rules of table construction. The rules each define a normal form, and the normal
forms are numbered one through three. First normal form eliminates repeating groups that are
embedded in tables.
Second and third normal forms are based on a concept called functional dependency—a one-to-
one correspondence between two field values. Second normal form ensures that every field in a
table is functionally dependent on the primary key. Third normal form ensures that no non-key
field is functionally dependent on any other non-key field.
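A functional dependency A -> B can be tested mechanically over a table's rows, which is exactly the one-to-one correspondence the definition describes. The helper and sample rows below are illustrative:

```python
# Sketch: A -> B holds when every value of column A corresponds to
# exactly one value of column B, the property that second and third
# normal form rely on.

def holds(rows, a, b):
    seen = {}
    for row in rows:
        # setdefault records the first B seen for this A;
        # a later, different B breaks the dependency
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True

rows = [
    {"empno": 1, "deptno": 10, "dname": "SALES"},
    {"empno": 2, "deptno": 10, "dname": "SALES"},
    {"empno": 3, "deptno": 20, "dname": "RESEARCH"},
]
print(holds(rows, "deptno", "dname"))   # True: deptno -> dname
print(holds(rows, "dname", "empno"))    # False: dname does not determine empno
```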
17. Describe the steps used to transform a class diagram into an object database schema.
18. What is the difference between a persistent class and a transient class? Provide at least
one example of each class type.
An object of a transient class exists only for the duration of a program execution (for example,
the user interface of the program). An object of a persistent class (for example, a customer object
in a billing system) retains its identity and data content between program executions.
19. What is an object identifier? Why are object identifiers required in an object database?
An object identifier is a key or storage address that uniquely identifies an object within an object-
oriented database. Object identifiers are needed to represent relationships among objects. A
relationship is represented by embedding the object identifier of a participating object in the
other participating object.
A class on a class diagram is represented “as is” in an object database. That is, each object of the
class type is stored in the database along with its data content and methods.
The object identifier of each participating object is embedded in the other participating object.
The object on the “one” side of the relationship might have multiple embedded object identifiers
to represent multiple participants on the “many” side of the relationship.
The object identifier of each participating object is embedded in the other participating object.
The objects on both sides of the relationship might have multiple embedded object identifiers to
represent multiple participants on the other side of the relationship.
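The embedding of object identifiers on both sides can be sketched as follows. The counter-based identifier and the class names are simplifications of what a real ODBMS would generate:

```python
# Sketch: representing a many-to-many relationship in an object
# database by embedding object identifiers (OIDs) on both sides.

import itertools

_oid = itertools.count(1)

class Obj:
    def __init__(self, name):
        self.oid = next(_oid)       # simulated object identifier
        self.name = name
        self.links = []             # embedded OIDs of related objects

def relate(a, b):
    # many-to-many: each participating object stores the other's OID
    a.links.append(b.oid)
    b.links.append(a.oid)

student = Obj("Alice")
course1 = Obj("Databases")
course2 = Obj("Networks")
relate(student, course1)
relate(student, course2)
print(student.links)    # the OIDs of both courses
```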
23. What is an association class? How are association classes used to represent many-to-
many relationships in an object database?
24. Describe the two ways in which a generalization relationship can be represented in an
object database.
Generalization relationships can be represented directly (for example, using the ODL keyword
extends) or indirectly as a set of one-to-one relationships.
25. Does an object database require key fields or attributes? Why or why not?
Key fields aren’t required because they aren’t needed to represent relationships. However, they
are usually included because they are useful for a number of reasons, including guaranteeing
unique object content and searching or sorting database content.
26. Describe the similarities and differences between an ERD and a class diagram that
models the same underlying reality.
Each entity on an ERD corresponds to one class on a class diagram. The one-to-one, one-to-
many, and many-to-many relationships among those classes are the same as those on the ERD.
27. How are classes and relationships on a class diagram represented in a relational database?
28. What is the difference between a primitive data type and a complex data type?
A primitive data type (for example, integer, real, or character) is directly supported (represented)
by the CPU or a programming language. A complex data type (for example, record, linked list,
or object) contains one or more data elements constructed using the primitive data types as
building blocks.
What are the advantages of having an RDBMS provide complex data types?
Providing complex data types in the RDBMS allows a wider range of data to be represented. It
also minimizes compatibility problems that might result from using different programming
languages or hardware.
29. Does an ODBMS need to provide predefined complex data types? Why or why not?
30. Why might all or part of a database need to be replicated in multiple locations?
Database accesses between distant servers and clients must traverse one or more network links.
This can slow the accesses due to propagation delay or network congestion. Access speed can be
increased by placing a database replica close to clients.
31. Briefly describe the following distributed database architectures: replicated database
servers, partitioned database servers, and federated database servers. What are the comparative
advantages of each?
Replicated database servers – An entire database is replicated on multiple servers, and each
server is located near a group of clients. Best performance and fault tolerance for clients because
all data is available from a “nearby” server.
Partitioned database servers – A database is partitioned so that each partition is a database subset
used by a single group of clients. Each partition is located on a separate server, and each server is
located close to the clients that access it. Better performance and less replication traffic than
replicated servers if similar collocated clients use only a subset of database content.
Federated database servers – Data from multiple servers with different data models and/or
DBMSs is pooled by implementing a separate (federated) server that presents a unified view of
the data stored on all the other servers. The federated server constructs answers to client queries
by forwarding requests to other servers and combining their responses for the client. Simplest
and most manageable way to combine data from disparate DBMSs into a single unified data
store.
32. What additional database management complexities are introduced when database
contents are replicated in multiple locations?
Replicated copies are redundant data stores. Thus, any changes to data content must be
redundantly implemented on each copy. Implementing redundant maintenance of data content
requires all servers to periodically exchange database updates.
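The periodic exchange of updates can be sketched with a toy replica class. This illustrates only the idea of redundant maintenance; it ignores conflict resolution and update ordering, which real replication must handle:

```python
# Sketch: each change is applied locally and queued; replicas
# periodically exchange queued updates so every copy converges.

class Replica:
    def __init__(self):
        self.data = {}
        self.pending = []           # updates not yet propagated

    def update(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def exchange(self, other):
        # push queued updates to the other replica, then clear the queue
        for key, value in self.pending:
            other.data[key] = value
        self.pending.clear()

a, b = Replica(), Replica()
a.update("price", 10)
a.exchange(b)
print(b.data)        # {'price': 10}
```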
UNIT-8
Object Relational Databases: Basic concepts enhanced SQL, advantages of object relational
approach.
Several major software companies including IBM, Informix, Microsoft, Oracle, and Sybase have
all released object-relational versions of their products. These companies are promoting a new,
extended version of relational database technology called object-relational database
management systems, also known as ORDBMSs.
A certain group thinks that future applications can only be implemented with pure object-
oriented systems. Initially these systems looked promising, but they have been unable to
live up to expectations. A new technology has therefore evolved in which relational and
object-oriented concepts are combined or merged. These systems are called
object-relational database systems. The main advantages of ORDBMSs are massive
scalability and support for object-oriented features such as:
• Multimedia applications
• Incorporation of business rules
• Reusability (inheritance)
• Nested complex types
• Relationships
• Options
• Object-oriented databases or object-relational databases?
Advantages
Reuse comes from the ability to extend the database server so that core functionality is
performed centrally, rather than coded in each application.
An example is a complex type (or extended base type) which is defined within the database,
but is used by many applications. Previously, each application that used such a type had to
define it itself and develop the interface between the software ‘type’ and its
representation in the database. Sharing is a consequence of this reuse.
From a practical point of view, end-users are happier to make the smaller ‘leap’ from
relational to object-relational than to deal with a completely different paradigm
(object-oriented).
Disadvantages
Relational purists believe that the simplicity of the original model was its strength.
There is also a large semantic gap between the o-o and o-r database worlds:
ORDBMS engineers are data-focused, while OODB engineers build models that attempt
to mirror the real world (data & behaviour).
The Third-Generation Database System Manifesto was devised by Stonebraker’s group of
proposers and defines the principles that ORDBMS designers should follow.
• Functions (including database procedures and methods) and encapsulation are a good
idea.
• Unique identifiers for tuples should be assigned by the DBMS only if a user-defined
primary key is unavailable.
• Rules (triggers or constraints) will become a major feature in future database systems.
They should not be associated with a specific function or collection.
• There should be more than one way to specify collections: one using enumeration of
members, and a second using the query language to specify membership.
• Queries and results should be the lowest level of communication between client and
server.
Rules
• Rules are valuable in that they protect the integrity of data in a database.
• The general form of a rule is “on the occurrence of event x do action y”.
• There are four variations in the proposed standard for ORDBs: update-update, query-
update, update-query, and query-query rules.
Update-Update Rules
• In this case, the event is an update, and the action is an update.
• This is useful in cases where it is necessary to implement an audit trail, e.g. create a new
tuple in the Audit relation with username, date, and description each time a change is
made to the Salary relation.
ON UPDATE TO Salary
DO APPEND TO Audit (username = current_user(), date = current_date(), lname = Salary.lname)
• In the above example, the current username, date and the lname of the updated employee
(in Salary) are recorded. Note that if we were only interested in one or some group of
employees we could use a where clause (see next example).
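The POSTQUEL-style rule above can be reproduced as a runnable trigger in SQLite; the table layout and the fixed username 'dba' are illustrative assumptions (SQLite has no current-user function):

```python
import sqlite3

# An update-update rule as a SQLite trigger: every change to Salary
# appends an audit tuple recording who changed what and when.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Salary (lname TEXT, amount INTEGER);
    CREATE TABLE Audit (username TEXT, day TEXT, description TEXT);
    CREATE TRIGGER salary_audit
    AFTER UPDATE ON Salary
    BEGIN
        INSERT INTO Audit
        VALUES ('dba', date('now'),
                'salary of ' || OLD.lname || ' changed to ' || NEW.amount);
    END;
""")
conn.execute("INSERT INTO Salary VALUES ('Sharma', 40000)")
conn.execute("UPDATE Salary SET amount = 45000 WHERE lname = 'Sharma'")
row = conn.execute("SELECT description FROM Audit").fetchone()
print(row[0])   # salary of Sharma changed to 45000
```

The UPDATE statement fires the trigger automatically; the application code never writes to Audit itself, which is exactly the point of the rule.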
Query-Update Rules
• Similar to the previous example: a user is accessing the Salary relation (for a specific
employee), and the system automatically records it. In this case, only for employee A515
(the empid attribute below is illustrative):
ON SELECT TO Salary WHERE Salary.empid = "A515"
DO APPEND TO Audit (username = current_user(), date = current_date(), empid = "A515")
Update-Query Rules
• In this case, the event is an update (here a delete) and the action is a read-only query.
• Suppose that the deletion of tuples from the Author table is not recommended since new
titles may come into stock.
ON DELETE TO Author
DO
ShowMessage "Deleting " + Author.name + " prevents new titles being entered into the database"
• The query in this case is select Author.name which is used in the message.
Query-Query Rules
• In this case, both the event and the action are read-only queries.
• An example is where one retrieval operation requires an attribute from some other
relation.
• For example, when viewing details for a customer (from the Customer relation), their
credit may be listed as “A2”, where the actual value for “A2” is inside a Credit relation.
(Note we could do the same using a join query)
ON SELECT TO Customer X
DO
Select C.value
From Credit C
Where C.code = X.credit
• (The attribute names code and credit above are illustrative; the rule matches the
customer's credit code against the Credit relation.)
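As noted above, the same result is obtainable with a join query. A runnable SQLite sketch of that join, where the table contents and the credit value 'excellent' are invented for illustration:

```python
import sqlite3

# The query-query rule expands a customer's credit code (e.g. "A2")
# into its value from the Credit relation. One join gives the same effect:
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (name TEXT, credit TEXT);
    CREATE TABLE Credit   (code TEXT, value TEXT);
    INSERT INTO Customer VALUES ('Mehta', 'A2');
    INSERT INTO Credit   VALUES ('A2', 'excellent');
""")
row = conn.execute("""
    SELECT Cu.name, Cr.value
    FROM Customer Cu JOIN Credit Cr ON Cr.code = Cu.credit
""").fetchone()
print(row)   # ('Mehta', 'excellent')
```

The difference is one of responsibility: the rule attaches the lookup to every SELECT on Customer automatically, whereas the join must be written by each query's author.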