DATABASE MANAGEMENT SYSTEMS

2 MARKS - LECTURE NOTES – Q BANK


DEPARTMENT OF INFORMATION TECHNOLOGY

CODE : UIT 6087

SEMESTER : VI

YEAR : 2010

Prepared by

R. Saravanan
Sr. Lect., IT


UNIT I
Introduction to Database Systems: Overview – Data Models – Database System
Architecture – History of Database Systems. Entity-Relationship Model: Basic
Concepts – Constraints – Keys – Design Issues – Entity Relationship Diagram –
Weak Entity Sets – Extended E-R Features – Design of an E-R Database Schema –
Reduction of E-R Schema to Tables – The Unified Modeling Language UML.

UNIT II
Relational Model: Structure of Relational Databases – Relational Algebra –
Extended Relational Algebra Operations – Modification of Database – Views –
Tuple Relational Calculus – Domain Relational Calculus. SQL: Background – Basic
Structure – Set Operations – Aggregate Functions – Null Values – Nested
Subqueries – Views – Complex Queries – Modification of the database – Joined
Relations – Data-Definition Language – Embedded SQL –Dynamic SQL – Other
SQL Features. Other Relational Languages: Query-by-Example – Datalog – User
Interfaces and Tools

UNIT III
Integrity and Security: Domain Constraints – Referential Integrity – Assertions –
Triggers – Security and Authorization – Authorization in SQL – Encryption and
Authentication. Relational-Database Design: First Normal Form – Pitfalls in
Relational-Database Design – Functional Dependencies – Decomposition –
Desirable Properties of Decomposition – Boyce-Codd Normal Form – Third Normal
Form – Fourth Normal Form – More Normal Forms – Overall Database Design
Process.

UNIT IV
Storage and File Structures: Overview of Physical Storage Media – Magnetic Disks
– RAID – Tertiary Storage – Storage Access – File Organization – Organization of
Records in Files – Data-Dictionary Storage. Indexing and Hashing: Basic Concepts
– Ordered Indices – B+-Tree Index Files – B-Tree Index Files – Static Hashing –
Dynamic Hashing – Comparison of Ordered Indexing and Hashing – Index
Definition in SQL – Multiple-Key Access


UNIT V
Transactions: Transaction concept – Transaction State – Implementation of
Atomicity and Durability – Concurrent Executions – Serializability – Recoverability
– Implementation of Isolation – Transaction Definition in SQL – Testing for
Serializability

Concurrency Control: Lock-Based Protocols – Timestamp-Based Protocols –
Validation-Based Protocols – Multiple Granularity – Multiversion Schemes –
Deadlock Handling – Insert and Delete Operations – Weak Levels of Consistency –
Concurrency in Index Structures. Recovery System: Failure Classification –
Storage Structure – Recovery and Atomicity – Log-Based Recovery – Shadow
Paging – Recovery with Concurrent Transactions – Buffer Management – Failure
with Loss of Nonvolatile Storage – Advanced Recovery Techniques – Remote
Backup Systems

TEXT BOOK

 Silberschatz, Korth, Sudarshan, “Database System Concepts”, 4th Edition –
   McGraw-Hill Higher Education, International Edition 2002. Chapters: 1 to 7,
   11, 12, 15 to 17.

REFERENCE BOOKS

 Fred R. McFadden, Jeffrey A. Hoffer, Mary B. Prescott, “Modern Database
   Management”, Fifth Edition, Addison Wesley, 2000.

 Elmasri, Navathe, “Fundamentals of Database Systems”, Third Edition,
   Addison Wesley, 2000.

 Jeffrey D. Ullman, Jennifer Widom, “A First Course in Database Systems”,
   Pearson Education Asia, 2001.

 Bipin C. Desai, “An Introduction to Database Systems”, Galgotia
   Publications Pvt Limited, 2001.


UNIT I

2 mark Q & A:

1. What is DBMS and what’s the goal of it?


DBMS is a collection of interrelated data and a set of programs to access
that data. The goal is to provide an environment that is both convenient and
efficient to use for retrieving and storing database information.

2. What are the advantages of DBMS?


• Centralized Control
• Data Independence allows dynamic change and growth potential.
• Data duplication is eliminated with controlled redundancy.
• Data Quality is enhanced.
• Controlling redundancy
• Restricting unauthorized access
• Providing multiple user interfaces
• Enforcing integrity constraints.
• Providing back up and recovery

3. What are the Disadvantages of DBMS?


• Cost of software, hardware and migration is high.
• Complexity of Backup
• Problem associated with centralization.

4. List any eight applications of DBMS.


• Banking
• Airlines
• Universities
• Credit card transactions
• Tele communication
• Finance
• Sales
• Manufacturing
• Human resources

5. What are the disadvantages of File Systems?


• Data Redundancy and Inconsistency.
• Difficulty in accessing data.
• Data Isolation
• Integrity problems
• Security Problems
• Concurrent access anomalies

6. Give the levels of data abstraction?


a. Physical level
b. Logical level
c. View level


7. Define the terms


a. Physical schema
b. Logical schema.
Physical schema: The physical schema describes the database design at
the physical level, which is the lowest level of abstraction, describing
how the data are actually stored.

Logical schema: The logical schema describes the database design at


the logical level, which describes what data are stored in the
database and what relationship exists among the data.

8. What is conceptual schema?


The schema at the logical level is called the conceptual schema; it
describes the overall logical structure of the database. (The schemas at
the view level are called subschemas, which describe different views of
the database.)

9. Define data model?


A data model is a collection of conceptual tools for describing data,
data relationships, data semantics and consistency constraints.

10. What is storage manager?


A storage manager is a program module that provides the interface
between the low level data stored in a database and the application
programs and queries submitted to the system.

11. What are the components of storage manager?


The storage manager components include
a) Authorization and integrity manager
b) Transaction manager
c) File manager
d) Buffer manager

12. What is the purpose of storage manager?


The storage manager is responsible for the following
a) Interaction with the file manager
b) Translation of DML commands in to low level file system
commands
c) Storing, retrieving and updating data in the database

13. List the data structures implemented by the storage manager


The storage manager implements the following data structure
a) Data files
b) Data dictionary
c) Indices

14. What is a data dictionary?


A data dictionary is a data structure which stores meta data about the
structure of the database ie. the schema of the database.

15. What is an entity relationship model?


The entity relationship model is a collection of basic objects called
entities and relationships among those objects. An entity is a thing or
object in the real world that is distinguishable from other objects.

16. What are attributes? Give examples.


An entity is represented by a set of attributes. Attributes are descriptive
properties possessed by each member of an entity set.
Example: possible attributes of customer entity are customer name,
customer id, customer street, customer city.

17. What is relationship? Give examples


A relationship is an association among several entities.
Example: A depositor relationship associates a customer with each
account that he/she has.

18. Define Entity, Entity Set, and extensions of entity set. Give one
example for each.
• Entity – object or thing in the real world. Egs., each person, book, etc.
• Entity set – set of entities of the same type that share the same
properties or attributes. E.g., customer, loan, account, etc.
• Extensions of entity set – individual entities that constitute a set. Eg.,
individual bank customers

19.Define and give examples to illustrate the four types of attributes in


database.
• Simple and Composite attribute
o Simple (address)
o Composite (street, city, district)
• Single-valued and multi-valued attribute
o Single-valued (roll no)
o Multi-valued (colors = {R,B,G})
• Null attribute
o No values for attribute
• Derived attribute
o Age derived from Birth date

20. Define the terms


i) Entity type & ii) Entity set
Entity type: An entity type defines a collection of entities that
have the same attributes.
Entity set: The set of all entities of the same type is termed as an entity
set.

21. What is meant by the degree of relationship set?


The degree of relationship type is the number of participating entity types.
22. Define the terms
i) Key attribute
ii) Value set
Key attribute: An entity type usually has an attribute whose values
are distinct for each individual entity in the collection. Such an attribute
is called a key attribute.

Value set: Each simple attribute of an entity type is associated with


a value set that specifies the set of values that may be assigned to
that attribute for each individual entity.

23. Define relationship and participation.


o Relationship is an association among several entities.
o Association between entity sets is referred as participation.

24. Define mapping cardinality or cardinality ratio.


Mapping cardinality or cardinality ratio is the way to express the
number of entities to which another entity can be associated through a
relationship set. These are most useful in describing binary relationship
sets.

25. Explain the four types of mapping cardinality with example.


For a binary relationship set R between entity sets A and B, the
mapping cardinality must be one of the following: (Draw the diagrams
also)
• One-to-one
An entity in A is associated with at most one entity in B and an entity in
B is associated with at most one entity in A. Eg., Roll no entity in
Student info entity set and marks details entity set.
• One-to-many
An entity in A is associated with any number of entities in B. But an
entity in B is associated with at most one entity in A. Eg., one customer
with many loans.
• Many-to-one
An entity in A is associated with at most one entity in B. But an entity
in B can be associated with any number of entities in A. Eg., street and
city associated to a single person.
• Many-to-many
An entity in A is associated with any number of entities in B. But an
entity in B can be associated with any number of entities in A. Eg.,
same loan by several business partners.

26. Differentiate total participation and partial participation. (i.e., write


Definition and Example with illustration)

• Total – participation of an entity set, E in relationship, R is total if every


entity in E participates in at least one relationship in R. Eg., loan entity
set.

• Partial – if only some entities in E participate in R. Eg., payment weak


entity set.


27. Define E-R diagram.


Overall structure of a database can be expressed graphically by E-R
diagram for simplicity and clarity.

28. Define weak Entity set. Give an example and explain why it is weak
entity set.
An entity set that does not have sufficient attributes to form a primary key.
• The payment entity set is weak because its payment numbers alone may be
duplicated across different loans.

29. Define discriminator or partial key of a weak entity set. Give


example.
Set of attributes that allow distinction to be made among all those
entities in the entity set that depend on one particular strong entity. Eg.,
payment_no in payment entity set. It is also called as partial key.

30. Explain Referential Integrity.


Referential integrity is a constraint on the relationships between tables, and
foreign keys are used to enforce it. A foreign key is a column whose values are
derived from the primary key of the same or some other table.
The format for creating a foreign key is given below.

Syntax:
create table <table name> (columnname datatype(size) constraint
constraint_name references parent_table_name);
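
Example (a minimal sketch; the customer and account tables and their columns
are assumed here purely for illustration):

    -- parent table: its primary key supplies the referenced values
    create table customer (
        cust_id   int primary key,
        cust_name varchar(30)
    );

    -- child table: acc_cust_id may hold only values present in customer.cust_id
    create table account (
        acc_no      int primary key,
        acc_cust_id int constraint fk_acc_cust references customer(cust_id)
    );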

31. Define Instances and schemas.


Instances:
The collection of information stored in the database at a particular moment
is called the instance of the database.
Schemas:
The overall design of the database is called the schema of the database.

32. Define and explain the two types of Data Independence.


Two types of data independence are
Physical data independence:
The ability to modify the physical schema without causing
application programs to be rewritten in the higher levels. Modifications in
physical level occur occasionally whenever there is a need to improve
performance.
Logical data independence:
The ability to modify the logical schema without causing application
programs to be rewritten at the higher level (external level). Modifications
at the logical level occur more frequently than those at the physical level,
whenever there is an alteration in the logical structure of the database.

33. Define transaction.


A transaction is a collection of operations that performs a single
logical function in a database application. Each transaction is a unit of both
atomicity and consistency. Properties of transaction are atomicity,
consistency, isolation, and durability.

34. Define the two types of DML.



Two types:
Procedural DML:
It requires a user to specify what data are needed and how to get
those data.

Non-Procedural DML:
It requires a user to specify what data are needed without specifying
how to get those data.

35. List out the functions of DBA.


• Schema definition
• Storage structure and access-method definition
• Schema and physical modification
• Granting of authorization for data access
• Integrity constraint specification

36. What is the need for DBA?


The need of DBA is to have central control of both the data and the
programs that access those data. The person who has such central control
over the system is called the DataBase Administrator.

37. Define weak and strong entity sets?


Weak entity set
Entity set that do not have key attribute of their own are
called weak entity sets.
Strong entity set
Entity set that has a primary key is termed a strong entity set.

38. What does the cardinality ratio specify?


Mapping cardinalities or cardinality ratios express the number of
entities to which another entity can be associated. Mapping
cardinalities must be one of the following:
• One to one
• One to many
• Many to one
• Many to many

39. Explain the two types of participation constraint.


Total
The participation of an entity set E in a relationship set R is said
to be total if every entity in E participates in at least one
relationship in R.

Partial
if only some entities in E participate in relationships in R, the
participation of entity set E in relationship R is said to be partial.

40. Explain DML pre-compiler.


DML precompiler converts DML statements embedded in an
application program to normal procedure calls in the host language. The
precompiler must interact with the DML compiler to generate the
appropriate code


41. Define file manager and buffer manager.

File manager:
File manager manages the allocation of space on disk storage and
the data structures used to represent information stored on disk.

Buffer manager:
Buffer manager is responsible for fetching data from the disk
storage into the main memory, and deciding what data to cache in
memory.

42. Define Data Dictionary.


DDL statements are compiled into a set of tables that is stored in a
special file called data dictionary. Data dictionary contains the meta-data,
which in turn is data about the data.
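
As an illustration (a hedged sketch: the INFORMATION_SCHEMA catalog views used
below follow the SQL standard, but their availability and exact contents vary
between database products), the metadata held in the data dictionary can itself
be queried like ordinary data:

    -- list the columns and data types recorded for a table named 'account'
    select table_name, column_name, data_type
    from   information_schema.columns
    where  table_name = 'account';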

Lecture notes:

DATABASE MANAGEMENT SYSTEMS

DBMS contains information about a particular enterprise


• Collection of interrelated data.
• Set of programs to access the data.
• An environment that is both convenient and efficient to use.

In other words

 A very large, integrated collection of data.


 Models real-world enterprise.
 Entities (e.g., students, courses)
 Relationships (e.g., Madan is taking Computer)
Goal
 A Database Management System (DBMS) is a software package that is
designed to store and manage databases.

Database Applications

 Banking: all transactions


 Airlines: reservations, schedules
 Universities: registration, grades
 Sales: customers, products, purchases
 Online retailers: order tracking, customized recommendations
 Manufacturing: production, inventory, orders, supply chain
 Human resources: employee records, salaries, tax deductions
 Databases touch all aspects of our lives

Purpose of Database Systems

In the early days, database applications were built directly on top of file
systems

Drawbacks of using file systems

 Data redundancy and inconsistency


Multiple file formats, duplication of information in different files.

 Difficulty in accessing data


Need to write a new program to carry out each new task

 Data isolation - multiple files and formats

 Integrity problems
• Integrity constraints (e.g. account balance > 0) become
“buried” in program code rather than being stated explicitly
• Hard to add new constraints or change existing ones

 Atomicity of updates
• Failures may leave database in an inconsistent state with
partial updates carried out
• Example: Transfer of funds from one account to another
should either complete or not happen at all
• Example: a program transfers Rs. 200 from account A to account B. If a
system failure occurs after Rs. 200 is removed from A but before it is
credited to B, the database is left inconsistent.

 Concurrent access by multiple users


• Concurrent access is needed for performance
• Uncontrolled concurrent accesses can lead to inconsistencies
• Example: Two people reading a balance and updating it at
the same time

 Security problems
o Hard to provide user access to some, but not all, data

Database systems offer solutions to all the above


problems

Files vs DBMS

 Application must stage large datasets between main memory and


secondary storage
o (e.g., buffering, page-oriented access, 32-bit addressing, etc.)
 Special code for different queries
 Must protect data from inconsistency due to multiple concurrent users
 Crash recovery
 Security and access control

Why Use a DBMS?

 Data independence and efficient access.


 Reduced application development time.
 Data integrity and security.
 Uniform data administration.
 Concurrent access, recovery from crashes

Why Study Databases?

 Shift from computation to information


o at the “low end”: scramble to webspace
o at the “high end”: scientific applications
 Datasets increasing in diversity and volume.

o Digital libraries, interactive video, Human


 DBMS encompasses most of CS
o OS, languages, theory, AI, multimedia, logic

VIEW OF DATA

A major purpose of a database system is to provide users with an abstract


view of the data. That is, the system hides certain details of how the data are
stored and maintained.

Data Abstraction

The need for efficiency has led designers to use complex data structures to
represent data in the database. Developers hide the complexity from users
through several levels of abstraction.

Physical level

 Lowest level of abstraction


 Describes how the data are Actually stored.
 Describes complex low-level data structures in detail.

Logical level

 The next-higher level of abstraction describes what data are stored
in the database, and what relationships exist among those data.
 The logical level thus describes the entire database in terms of a
small number of relatively simple structures.

View level

The highest level of abstraction describes only part of the entire database.
Even though the logical level uses simpler structures, complexity remains
because of the variety of information stored in a large database.


Fig 1.1 The Three levels of Data Abstraction

Instances and Schema

The collection of information stored in the database at a particular moment
is called an instance of the database.

The overall design of the database is called the database schema.

Database systems have several schemas, partitioned according to the


levels of abstraction.

The physical schema describes the database design at the physical level,

The logical schema describes the database design at the logical level.

A database may also have several schemas at the view level, sometimes
called subschema, that describe different views of the database.

The logical schema is by far the most important, in terms of its effect on
application programs, since programmers construct applications by using the
logical schema.

The physical schema is hidden beneath the logical schema, and can
usually be changed easily without affecting application programs.


Applications programs are said to exhibit physical data independence if


they do not depend on the physical schema, and thus need not be rewritten if the
physical schema changes.

DATA MODELS

• A data model is a collection of concepts for describing data.


• A schema is a description of a particular collection of data, using the given
data model.
• The relational model of data is the most widely used model today.
• Main concept: relation, basically a table with rows and columns.
• Every relation has a schema, which describes the columns, or fields.

Underlying the structure of a database is the data model: a collection of
conceptual tools for describing data, data relationships, data semantics,
and consistency constraints. It can be categorized into two models:

 The entity-relationship model and


 The relational model

Both provide a way to describe the design of a database at the logical level.

Entity-Relationship Model

The entity-relationship (E-R) data model is based on a perception of a real


world that consists of a collection of basic objects, called entities, and of
relationships among these objects.

For eg., each person is an entity; and bank accounts can be considered as
entities: Entities are described in a database by a set of attributes.
For eg., the attributes acc_no and bal may describe one particular account
in a bank and they form attributes of the account entity set. Similarly, attributes
cust_name, cust_street address and cust_id may describe a customer entity.

An extra attribute cust_id is used to uniquely identify customers (since it


may be possible to have two customers with the same name, street address and
city).

A relationship is an association among several entities. For eg., a depositor


relationship associates a customer with each account that she has. The set of all
entities of the same type and the set of all relationships of the same type are
termed an entity set and relationship set, respectively.

The overall logical structure (schema) of a database can be expressed


graphically by an E-R diagram, which is built up from the following components:


• Rectangles - which represent entity sets


• Ellipses - which represent attributes
• Diamonds - which represent relationships among entity sets
• Lines, which link attribute to entity sets and entity sets to
relationships.

Fig 1.2 An example E-R diagram

Relational Model

The relational model uses a collection of tables to represent both data and
the relationships among those data. Each table has multiple columns, and each
column has a unique name.

The relational data model is the most widely used data model, and a vast
majority of current database systems are based on the relational model.

The relational model is at a lower level of abstraction than the E-R model.

Database designs are often carried out in the E-R model, and then
translated to the relational model.

For eg., it is easy to see customer and account correspond to the entity
sets of the same name, while the table depositor corresponds to the relationship
set depositor.


Other Data Models

The object-oriented data model is another data model that has seen
increasing attention. The object-oriented model can be seen as extending the E-R
model with notions of encapsulation, methods and object identity.

The object-relational data model combines features of the object-


oriented data model and relational data model. Semistructured data models
permit the specification of data where individual data items of the same type
may have different sets of attributes. The extensible markup language (XML) is
widely used to represent semistructured data.

Two other data models are

 Network model and


 Hierarchical data model,

which preceded the relational data model. These models were tied closely to the
underlying implementation, and complicated the task of modeling data.

Database Languages
Database System provides a
DDL (Data Definition Language) – to specify the database schema and


DML (Data Manipulation Language) – to express the database queries and
updates.

Data dictionary / Data directory – contains metadata, i.e. data about data.
Eg: schema of a table

The system consults the data dictionary before reading or modifying the
actual data.

DML operations (see the example below):
 Retrieval
 Insertion
 Deletion
 Modification
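
Example (a minimal sketch; the account table and its columns are assumed here
purely for illustration):

    -- DDL: define the schema
    create table account (
        acc_no  int primary key,
        balance decimal(10,2)
    );

    -- DML: manipulate the data
    insert into account values (101, 5000.00);                       -- insertion
    select balance from account where acc_no = 101;                  -- retrieval
    update account set balance = balance - 200 where acc_no = 101;   -- modification
    delete from account where acc_no = 101;                          -- deletion
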
Database Users

There are different types of user that play different roles in a database
environment. Following is a brief description of these users:

Application Programmers

Application programmer is the person who is responsible for implementing


the required functionality of database for the end user. Application programmer
works according to the specification provided by the system analyst.

End Users

End users are those persons who interact with the application directly.
They are responsible to insert, delete and update data in the database. They get
information from the system as and when required. Different types of end users
are as follows:

Naive Users: Naive users are those users who do not have any technical
knowledge about the DBMS. They use the database through application
programs by using simple user interface. They perform all operations by
using simple commands provided in the user interface.

Example:
The data entry operator in an office is responsible for entering
records in the database. He performs this task by using menus and buttons
etc. He does not know anything about database or DBMS. He interacts with
the database through the application program.

Sophisticated Users: Sophisticated users are the users who are familiar
with the structure of database and facilities of DBMS. Such users can use a
query language such as SQL to perform the required operations on
databases. Some sophisticated users can also write application programs.

Database Administrator


The database administrator is responsible for managing the whole database
system: he or she designs, creates and maintains the database, manages the
users who can access it, controls integrity issues, and monitors the
performance of the system, making changes as and when required.


DATABASE SYSTEM STRUCTURE


A database system is partitioned into modules that deal with each of the
responsibilities of the overall system.
The functional components of a database system can be broadly divided
into
 The storage manager and
 The query processor components.

The storage manager is important because databases typically require a


large amount of storage space. For eg.,Corporate databases are too large.

Since the main memory of the computer cannot store that much information,
the information is stored on disks. Data are moved between disk storage and
main memory as needed, and the speed of retrieval from disk is relatively slow.

The database systems structure the data as to minimize the need to move
data between disk and main memory.

The query processor is important because it helps the database system


simplify and facilitate access to data. High-level views help to achieve this goal.

The job of the database system is to translate updates and queries written
in a nonprocedural language, at the logical level, into an efficient sequence of
operations at the physical level.

Storage Manager

 Storage manager is a program module that provides the interface


between the low-level data stored in the database and the application
programs and queries submitted to the system.
 The storage manager is responsible for the following tasks:

 interaction with the file manager


 efficient storing, retrieving and updating of data
The storage manager components include:

 Authorization and integrity manager, which tests for the


satisfaction of integrity constraints and checks authority to access the
data.
 Transaction manager, which ensures that the database remains
in a consistent state despite system failures and that concurrent
transaction execution proceed without conflicting.
 File Manager, which manages the allocation of space on disk
storage and the data structures used to represent information stored
on disk.
 Buffer manager, which is responsible for fetching data from disk
storage into main memory, and deciding what data to cache in main
memory. It is used to handle data sizes that are much larger than the
size of main memory.


The storage manager implements several data structures as part of the


physical system implementation.

 Data files, which store the database itself.


 Data Dictionary, which stores metadata about the structure of the
database, in particular the schema of the database.
 Indices, which provide fast access to data items that hold
particular values.

Query Processor

The query processor includes:

• DDL interpreter, which interprets DDL statements and


records the definitions in the data dictionary.
• DML compiler, which translates DML statements in a query
language into low-level instructions that the query evaluation engine
understands.

A query can usually be translated into any of a number of alternative
evaluation plans that all give the same result. The DML compiler also performs
query optimization, that is, it picks the lowest cost evaluation plan from among
the alternatives.

• Query evaluation engine, which executes low-level


instructions generated by the DML compiler.

HISTORY OF DATABASE SYSTEMS

Punched cards, invented by Hollerith, came into use at the end of the
nineteenth century and were later widely used for entering data into
computers.

 1950s and early 1960s

Magnetic tapes were developed for data storage. Data processing tasks
such as payroll were automated, with data stored on tapes.

Tapes could be read only sequentially, and data sizes were much larger
than main memory, thus data processing programs were forced to process
data in a particular order, by reading and merging data from tapes and card
decks.

 Late 1960s and 1970s

Widespread use of hard disks allowed direct access to data. With disks,
network and hierarchical databases could be created that allowed data
structures such as lists and trees to be stored on disk.

In 1970 the relational model and non-procedural ways of querying data in
the relational model were defined, and relational databases were born. The
relational model hides the implementation details completely from the
programmer.

 1980s

Relational databases could not match the performance of existing network


and hierarchical databases.

That changed with System R: the fully functional System R prototype led to
IBM’s first relational database product, SQL/DS.

Initial commercial relational database systems, such as IBM DB2, Oracle,
Ingres and DEC Rdb, played a major role in advancing techniques for efficient
processing of declarative queries.

The 1980s also saw much research on parallel and distributed


databases, as well as on object-oriented databases.

Early 1990s:

The SQL language was used extensively for decision support applications,
which are query-intensive. Decision support and querying re-emerged as a major
application area for databases. Tools for analyzing large amounts of data saw
large growths in usage.
Many databases vendors introduced parallel database products in this
period.

 Late 1990s:

The major event was the explosive growth of the World Wide Web.
Databases were deployed much more extensively than ever before.
Database Systems had to support very high transaction processing rates
and high reliability. Database system also had to support Web interfaces to
data.

Entity Relationship Model

 It is a high level conceptual data model that describes the structure of a
database in terms of entities, relationships among entities & constraints on
them.

 Basic Concepts of E-R Model:


- Entity
- Entity Set
- Attributes
- Relationship
- Relationship set
- Identifying Relationship


Entity
-It is an object that exists in the real world.

Example:
- Person, Employee, Car, Home etc..

Object with conceptual Existence


- Account, loan, job etc…

Entity Set
- A set of entities of the same type.

Attributes
- A set of properties that describe an entity.

Types of Attributes

Simple (or) atomic vs. Composite


- An attribute which can’t be sub divided. (Eg.Age)
- An attribute which can be divided into sub parts is called as composite
attribute.
e.g. Address- Apartment no.
- Street
- Place
- City
- District

Single Valued vs. Multivalued


 An attribute having only one value (e.g.. Age,eid,sno)
 An attribute having multiple values (e.g.. Deptlocat- A dept can be located
in several places)

Stored Vs Derived
 A stored attribute is one whose value is stored directly,
 whereas a derived attribute is one whose value is derived from a stored
attribute.
E.g. Stored attribute – DOB
Derived attribute – Age, derived from DOB

Key Attribute
 An attribute which is used to uniquely identify records.
E.g.. eid, sno, dno


Relationship
 It is an association among several entities. It specifies what type of
relationship exists between entities.

Relationship set:

 It is a set of relationships of the same type.

Weak Entity Set:


 No key attributes.

Constraints

Two of the most important constraints are


 a. Mapping Constraints
 b. Participation constraints


a. Mapping Cardinalities:

Express the no.of entities to which another entity can be associated via a
relationship set

Types
 One-to-One
An entity in set A is associated with at most one entity in set B and vice
versa.

 One-to-many

An entity in set A is associated with zero or more no. of entities in set B


and an entity in B is associated with at most one entity in A.


 Many-to-One

One or more no. of entities in set A is associated with at most one


entity in B. An entity in B can be associated with any no. of entities in A.

 Many-to-Many
One or more no. of entities in set A is associated with one or more
no. of entities in set B.


b. Participation Constraints:

Total Participation
 The participation of an entity set E in a relationship set R is said to be total
if every entity in E participates in at least one relationship in R.

Partial Participation:
 The participation of an entity set E in a relationship set R is said to be
partial if only some of the entities in E participate in relationships in R.

Keys

 It is used to uniquely identify entities within a given entity set or a
relationship set.
 Keys in Entity set:

Primary Key:
– It is a key used to uniquely identify an entity in the entity set.
– E,g eno,rno,dno etc…

Super Key:
 It is a set of one or more attributes that allows us to uniquely identify an
entity in the entity set; it contains a candidate key (such as the primary
key) plus possibly additional attributes.

E.g. eid (primary key) and ename together can identify an entity in the
entity set.


Candidate key:

 They are minimal super keys for which no proper subset is a superkey.
E.g.. Ename and eaddr can be sufficient to identify an employee in
employee set.
{eid} and {ename,eaddr} – Candidate keys

FOREIGN Keys

 An attribute which makes a reference to an attribute of another entity type


is called foreign key

Domain

 A range of values can be defined for an attribute and is called as Domain


of that attribute.

E.g.. Age – attribute A

Domain (A)= {1,2,….100}
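
In SQL the domain of an attribute is captured by its declared data type and can
be narrowed further with a check constraint. A minimal sketch (the table and
column names are assumed for illustration):

    create table person (
        pid int primary key,
        age int check (age between 1 and 100)   -- Domain(age) = {1, 2, ..., 100}
    );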


KEYS:

 A key allows us to identify a set of attributes that suffice to distinguish
entities from each other.
 Keys also help uniquely identify relationships and thus distinguish
relationships from each other.

Keys for Entity sets:

 A super key of an entity set is a set of one or more attributes whose
values uniquely determine each entity.
 A candidate key of an entity set is a minimal super key.
Example: Customer-id is a candidate key of customer.

 Account-number is a candidate key of account.


 Although several candidate keys may exist, one of the candidate keys is
selected to be the primary key.

Keys for Relationship sets:

 The combination of the primary keys of the participating entity sets forms a
super key of a relationship set.
 (customer-id, account-number) is the super key of depositor.
 If we wish to track all access dates to each account by each customer, we
cannot assume a relationship for each access. We can use a multivalued
attribute though.
 Must consider the mapping cardinality of the relationship set when
deciding what the candidate keys are.
 Need to consider semantics of relationship set in selecting the primary key
in case of more than one candidate key.
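
For example, when the depositor relationship set is reduced to a table, the
super key (customer-id, account-number) becomes its composite primary key. A
hedged sketch, assuming customer and account tables already exist with these
keys and that the column types shown are appropriate:

    create table depositor (
        customer_id    varchar(10) references customer(customer_id),
        account_number varchar(10) references account(account_number),
        access_date    date,   -- descriptive attribute of the relationship
        primary key (customer_id, account_number)
    );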

Participation Constraints:

 The participation of an entity set E in a relationship set R is said to be total


if every entity in E participates in at least one relationship in R.
 If only some entities in E participate in relationships in R, the participation
of entity set E in relationship R is said to be partial.

ENTITY-RELATIONSHIP DIAGRAMS (E-R DIAGRAMS)

 An E-R Diagram can express the overall logical structure of a database


graphically.


Components of E-R Diagrams and their notations:


Alternative E-R notations

E-R diagrams corresponding to customers and loans

Consider the above E-R diagram, which consists of two entity sets,
customer and loan, related through a binary relationship set borrower.

The attributes associated with customer are customer_id, customer_name,


customer_street, customer_city. The attributes associated with loan are
loan_number and amount. In the above figure, attributes of an entity set that are
members of the primary key are underlined.

From the above diagram, we see that the relationship set borrower is
many-to-many.

MAPPING CARDINALITIES:


 We express cardinality constraints by drawing either a directed
line (→), signifying “one”, or an undirected line (—), signifying “many”,
between the relationship set and the entity set.

One-to-one relationship:

 A customer is associated with at most one loan via the relationship


borrower.
 A loan is associated with at most one customer via borrower.
 If the relationship set borrower is one-to-one, then both lines from
borrower would have arrows: one pointing to the loan entity set and
one pointing to the customer entity set.

One-to-Many relationship:

 In the one-to-many relationship a loan is associated with at most one
customer via borrower; a customer is associated with several (possibly 0)
loans via borrower.
 If the relationship set borrower were one-to-many, from customer to loan,
then the line from borrower to customer would be directed, with an arrow
pointing to the customer entity set.

Many-to-One relationships:

 In a many-to-one relationship a loan is associated with several customers


via borrower; a customer is associated with at most one loan via borrower.


 If the relationship set borrower were many-to-one from customer to loan,


then the line from borrower to loan would have an arrow pointing to the
loan entity set.

Many-to-many Relationship:

 A customer is associated with several (possibly 0) loans via borrower


 A loan is associated with several (possibly 0) customers.

E-R Diagram with an attribute attached to a relationship set.

 If a relationship set has also some attributes associated with it, then we
link these attributes to that relationship set.
 For example, in the below diagram, we have the access_date descriptive
attribute attached to the relationship set depositor to specify the most
recent date on which a customer accessed that account.


E-R Diagram with composite, multi-valued, and derived attributes:

 The below diagram shows that how composite attributes can be


represented in the E-R notation. A composite attribute name, with
component attributes first_name, middle_initial and last_name replaces
the simple attribute customer_name of customer.
 Also, the component attribute address, whose component attributes are
street, city, state, zip code replaces the attributes customer_street,
customer_city of customer.
 The attribute street is itself a composite attribute whose component
attributes are street_number, street name, and apartment_number.
 It also shows a multivalued attribute phone_number, depicted by a
double ellipse, and a derived attribute age, depicted by a dashed
ellipse.

Roles:

 Entity sets of a relationship need not be distinct.


 The labels “manager” and “worker” are called roles; they specify how
employee entities interact via the works-for relationship set.
 Roles are indicated in E-R diagrams by labeling the lines that connect
diamonds to rectangles.
 Role labels are optional, and are used to clarify semantics of the
relationship.


E-R Diagram with role indicators

PARTICIPATION CONSTRAINTS:

Participation of an entity set in a relationship set:


o Total participation
o Partial participation

Total participation (indicated by a double line):


 Every entity in the entity set participates in at least one relationship in
the relationship set.
Example: participation of loan in borrower is total. Every loan
must have a customer associated to it via borrower.

Partial participation:

 Some entities may not participate in any relationship in the relationship


set.
Example: participation of customer in borrower is partial.

E-R Diagram With A Ternary Relationship:

 Ternary relationship consists of the three entity sets employee, job, and
branch, related through the relationship set works_on.
 We can specify some types of many-to-one relationships in the case of
nonbinary relationship sets.

Cardinality Limits On Relationship Sets:


 The edge between loan and borrower has a cardinality constraint of 1..1,
meaning the minimum and the maximum cardinality are both 1.
 The limit 0...* on the edge from customer to borrower indicates that a
customer can have zero or more loans.

WEAK ENTITY SETS:

 An entity set that does not have sufficient attributes to form a primary key
is referred to as a weak entity set.
 The existence of a weak entity set depends on the existence of an
identifying entity set.
 It must relate to the identifying entity set via a total, one-to-many
relationship set from the identifying to the weak entity set.
 The identifying relationship is depicted using a double diamond.
 The discriminator (or partial key) of a weak entity set is the set of
attributes that distinguishes among all the entities of a weak entity set.
 The primary key of a weak entity set is formed by the primary key of the
strong entity set on which the weak entity set is existence dependent, plus
the weak entity set’s discriminator.
 We depict a weak entity set by double rectangles.
 We underline the discriminator of a weak entity set with a dashed line.
Example: payment-number – discriminator of the payment entity set

Primary key for payment – (loan-number, payment-number)


E-R Diagram with a weak entity sets

 The primary key of the strong entity set is not explicitly stored with the
weak entity set, since it is implicit in the identifying relationship.
 If loan-number were explicitly stored, payment could be made a strong
entity, but then the relationship between payment and loan would be
duplicated by an implicit relationship defined by the attribute loan-number
common to payment and loan.

ENTITY-RELATIONSHIP DESIGN ISSUES:

 Use of entity sets vs. attributes:


Choice mainly depends on the structure of the enterprise being modeled,
and on the semantics associated with the attribute in question.

 Use of entity sets vs. relationship sets:


Possible guideline is to designate a relationship set to describe an
action that occurs between entities

 Binary versus n-ary relationship sets:


 Although it is possible to replace any non binary (n-ary, for n>2)
relationship set by a number of distinct binary relationship sets,
a n-ary relationship set shows more clearly that several entities
participate in a single relationship.

 Some relationships that appear to be non-binary may be better


represented using binary relationships
 E.g. A ternary relationship parents, relating a child to his/her father
and mother, is best replaced by two binary relationships, father and
mother.
Using two binary relationships allows partial information
(e.g. only the mother being known).

 But there are some relationships that are naturally non-binary


E.g. works-on

 In general, any non-binary relationship can be represented using


binary relationships by creating an artificial entity set.
o Replace R between entity sets A, B and C by an entity set E, and
three relationship sets:
1. RA, relating E and A
2. RB, relating E and B
3. RC, relating E and C
o Create a special identifying attribute for E
o Add any attributes of R to E
o For each relationship (ai , bi , ci) in R, create
1. a new entity ei in the entity set E
2. add (ei , ai ) to RA
3. add (ei , bi ) to RB
4. add (ei , ci ) to RC


Ternary relationship vs. three binary relationships.

 Also need to translate constraints


o Translating all constraints may not be possible
o There may be instances in the translated schema that cannot
correspond to any instance of R
o Exercise: add constraints to the relationships RA, RB and RC to
ensure that a newly created entity corresponds to exactly one
entity in each of entity sets A, B and C
o We can avoid creating an identifying attribute by making E a
weak entity set (described shortly) identified by the three
relationship sets
 Placement of relationship attributes:
 The cardinality ratio of a relationship can affect the placement of
relationship attributes. Thus, attributes of one-to-one or one-to-many
relationship set can be associated with one of the participating entity
sets, rather than with the relationship set.

 Example: Can make access-date an attribute of account, instead of a


relationship attribute, if each account can have only one customer i.e.,
the relationship from account to customer is many to one, or
equivalently, customer to account is one to many

Access-date as attribute of the account entity set.


EXTENDED E-R FEATURES:

Certain extensions of E-R Model are,

 Specialization
 Generalization
 Attribute Inheritance
 Constraints on Generalizations
 Aggregation

Specialization:

 Top-down design process; we designate sub groupings within an entity set


that are distinctive from other entities in the set.
 These sub groupings become lower-level entity sets that have attributes or
participate in relationships that do not apply to the higher-level entity set.
 Example: A person may be classified as,
o Customer
o Employee
 It is depicted by a triangle component labeled ISA (e.g. customer “is a”
person).
 We can apply specialization repeatedly to refine a design scheme.
Employees may be further classified as,

o Officer
o Teller
o Secretary

 Each of these employee types is described by a set of attributes that


includes all the attributes of entity set employee plus additional attributes.
Example: officer entities -- office_number

teller entity – station_number,hours_worked

secretary --hours_worked(it may be participate in a


relationship secretary_for)

 The ISA relationship may also be referred to as a super class-subclass


relationship. Higher and lower level entity sets are depicted as regular
entity sets- i.e. as rectangles containing the name of the entity sets.

Generalization:

 A bottom-up design process – combine a number of entity sets that share


the same features into a higher-level entity set.
 Specialization and generalization are simple inversions of each other; they
are represented in an E-R diagram in the same way.
 The terms specialization and generalization are used interchangeably.


 Can have multiple specializations of an entity set based on different


features.
 E.g. permanent-employee vs. temporary-employee, in addition to officer
vs. secretary vs. teller. Each particular employee would be a member of
one of permanent-employee or temporary-employee, and also a member
of one of officer, secretary, or teller.
 The ISA relationship also referred to as super class – subclass relationship

Specialization and Generalization

Attribute inheritance:

 The crucial property of the higher and lower level entities created by
specialization and generalization is attribute inheritance.
 A lower-level entity set inherits all the attributes and relationship
participation of the higher-level entity set to which it is linked.
Example: The officer, teller, and secretary entity sets can participate in
the works_for relationship set, since the super class employee participates in
the works_for relationship.

 The attributes of the higher-level entity sets are said to be inherited by the
lower-level entity sets.
Example: customer and employee entity inherit the attributes of person.


 If an entity set is a lower-level entity set in one ISA relationship, then it is


called as single inheritance.
 If an entity set is a lower-level entity set in more than one ISA relationship,
then the entity set has multiple inheritance and the resulting structure is
said to be a lattice.

Constraints on Generalizations:

 Constraint on which entities can be members of a given lower-level entity


set.
 Such membership may be one of the following:

Condition-defined:

o In condition-defined lower level entity sets, membership is


evaluated on the basis of whether or not an entity satisfies an
explicit condition or predicate.
o E.g. all customers over 65 years are members of the senior-citizen entity
set; senior-citizen ISA person.
User-defined:

o User defined lower level entity sets are not constrained by a


membership condition; rather, the database user assigns entities to
a given entity set.

 Constraint on whether or not entities may belong to more than one lower-
level entity set within a single generalization.

• The lower-level entity set may be one of the following:


Disjoint

o An entity can belong to only one lower-level entity set


o Noted in E-R diagram by writing disjoint next to the ISA triangle.
Overlapping

o An entity can belong to more than one lower-level entity set


Completeness constraints

o It specifies whether or not an entity in the higher-level entity set


must belong to at least one of the lower-level entity sets within a
generalization.
o Total generalization or specialization: an entity must belong to one
of the lower-level entity sets.
o Partial generalization or specialization: an entity need not belong to
one of the lower-level entity sets.

Aggregation

o Consider the ternary relationship works-on, which we saw earlier


o Suppose we want to record managers for tasks performed by an employee


at a branch.

E-R diagram with aggregation

 Relationship sets works-on and manages represent overlapping


information
 Every manages relationship corresponds to a works-on
relationship
 However, some works-on relationships may not correspond to
any manages relationships
 So we can’t discard the works-on relationship

 Eliminate this redundancy via aggregation


 Treat relationship as an abstract entity
 Allows relationships between relationships
 Abstraction of relationship into new entity
 Without introducing redundancy, the following diagram represents:
 An employee works on a particular job at a particular branch
 An employee, branch, job combination may have an
associated manager

REDUCTION OF AN E-R SCHEMA TO TABLES

 Primary keys allow entity sets and relationship sets to be expressed


uniformly as tables which represent the contents of the database.


 A database which conforms to an E-R diagram can be represented by a


collection of tables.

 For each entity set and relationship set there is a unique table which is
assigned the name of the corresponding entity set or relationship set.

 Each table has a number of columns (generally corresponding to


attributes), which have unique names.

 Converting an E-R diagram to a table format is the basis for deriving a


relational database design from an E-R diagram.

Representing Entity Sets as Tables

 A strong entity set reduces to a table with the same attributes.
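
For instance, the strong entity set customer from the earlier diagrams reduces
to a table with one column per attribute (a sketch; the column types are
assumed for illustration):

    create table customer (
        customer_id     varchar(10) primary key,
        customer_name   varchar(30),
        customer_street varchar(30),
        customer_city   varchar(20)
    );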

Composite and Multivalued Attributes

 Composite attributes are flattened out by creating a separate attribute


for each component attribute
 E.g. given entity set customer with composite attribute name with
component attributes first-name and last-name, the table corresponding
to the entity set has two attributes: name.first-name and name.last-name
 A multivalued attribute M of an entity E is represented by a separate
table EM
 Table EM has attributes corresponding to the primary key of E and an
attribute corresponding to multivalued attribute M
 E.g. Multivalued attribute dependent-names of employee is represented
by a table
 employee-dependent-names( employee-id, dname)
 Each value of the multivalued attribute maps to a separate row of the
table EM
 E.g., an employee entity with primary key John and dependents Johnson
and Johndotir maps to two rows:

 (John, Johnson) and (John, Johndotir)
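
A sketch of such a table EM for dependent-names (the column types are assumed
for illustration, with employee-id as the primary key of employee):

    -- one row per (employee, dependent name) pair
    create table employee_dependent_names (
        employee_id varchar(10) references employee(employee_id),
        dname       varchar(30),
        primary key (employee_id, dname)
    );

    insert into employee_dependent_names values ('John', 'Johnson');
    insert into employee_dependent_names values ('John', 'Johndotir');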


Representing Weak Entity Sets

 A weak entity set becomes a table that includes a column for the
primary key of the identifying strong entity set.
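
For example, the weak entity set payment becomes a table whose primary key
combines the identifying loan’s primary key with the discriminator (a sketch;
the column types and the extra attributes are assumed for illustration):

    create table payment (
        loan_number    varchar(10) references loan(loan_number),  -- key of identifying entity set
        payment_number int,                                       -- discriminator (partial key)
        payment_date   date,
        payment_amount decimal(10,2),
        primary key (loan_number, payment_number)
    );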

Representing Relationship Sets as Tables

 A many-to-many relationship set is represented as a table with columns


for the primary keys of the two participating entity sets, and any
descriptive attributes of the relationship set.
 E.g.: table for relationship set borrower


Redundancy of Tables

 Many-to-one and one-to-many relationship sets that are total on the many-side can be represented by adding an extra attribute to the many side, containing the primary key of the one side.
 E.g.: Instead of creating a table for relationship account-branch, add an
attribute branch to the entity set account
 For one-to-one relationship sets, either side can be chosen to act as the
“many” side i.e., extra attribute can be added to either of the tables
corresponding to the two entity sets.
 If participation is partial on the many side, replacing a table by an extra
attribute in the relation corresponding to the “many” side could result
in null values.
 The table corresponding to a relationship set linking a weak entity set
to its identifying strong entity set is redundant.
 E.g. The payment table already contains the information that would
appear in the loan-payment table (i.e., the columns loan-number and
payment-number).
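As a sketch of the account-branch case described above (column types assumed for illustration), the extra attribute on the "many" side replaces the separate relationship table:

create table account (
    account_number   varchar(15),
    balance          number(12,2),
    branch_name      varchar(20),   -- primary key of the "one" side stored on the "many" side
    primary key (account_number),
    foreign key (branch_name) references branch
);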

Representing Specialization as Tables

Method 1:

o Form a table for the higher-level entity set
o Form a table for each lower-level entity set; include the primary key of the higher-level entity set and the local attributes

table        table attributes
person       name, street, city
customer     name, credit-rating
employee     name, salary
 Drawback: getting information about, e.g., employee requires
accessing two tables

Method 2:

Form a table for each entity set with all local and inherited attributes

table        table attributes
person       name, street, city
customer     name, street, city, credit-rating
employee     name, street, city, salary
 If specialization is total, no need to create table for generalized entity
(person)
 Drawback: street and city may be stored redundantly for persons who
are both customers and employees
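A sketch of the two methods in SQL (column types assumed; the two methods are alternatives and are not meant to be run together, and Method 2 is shown only for customer):

-- Method 1: higher-level table plus lower-level tables with the inherited key and local attributes
create table person   (name varchar(30) primary key, street varchar(30), city varchar(30));
create table customer (name varchar(30) primary key, credit_rating number(4),
                       foreign key (name) references person);
create table employee (name varchar(30) primary key, salary number(10,2),
                       foreign key (name) references person);

-- Method 2: one table per entity set carrying all local and inherited attributes
create table customer (name varchar(30) primary key, street varchar(30),
                       city varchar(30), credit_rating number(4));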

Relations Corresponding to Aggregation

 To represent aggregation, create a table containing the primary key of the aggregated relationship, the primary key of the associated entity set
 Any descriptive attributes
 E.g. to represent aggregation manages between relationship works-on
and entity set manager,
o create a table manages(employee-id, branch-name, title,
manager-name)
 Table works-on is redundant provided we are willing to store null values
for attribute manager-name in table manages
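A sketch of the manages table described above (column types are assumed for illustration):

create table manages (
    employee_id    varchar(10),
    branch_name    varchar(20),
    title          varchar(20),
    manager_name   varchar(30),
    primary key (employee_id, branch_name, title)
);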

DESIGN OF AN E-R DATABASE SCHEMA:

E-R Design Alternatives:

 The use of an attribute or entity set to represent an object.


 Whether a real-world concept is best expressed by an entity set or a
relationship set.
 The use of a ternary relationship versus a pair of binary relationships.
 The use of a strong or weak entity set.
 The use of specialization/generalization – contributes to modularity in the
design.
 The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.

For example, we consider the database design for banking enterprise.

We apply the two initial database-design phases, namely

 Gathering of data requirements


 Design of the conceptual schema
Ultimately, the output of the E-R design process is a relational database schema.

Data requirements for the bank database:

The initial specification of user requirements may be based on interviews with the database users and on the designer's own analysis of the enterprise.

The major characteristics of the banking enterprise:

 The bank is organized into branches. Each branch is located in a particular city and is identified by a unique name.
 Bank customers are identified by their customer_id values.
 Bank employees are identified by their employee_id values.
 The bank offers two types of accounts-savings and checking
accounts.
 A loan originates at a particular branch and can be held by one or more customers. A loan is identified by a unique loan number.
Entity sets for the Bank Enterprise:

 Branch={branch_name, branch_city, assets}


 Customer={customer_id, customer_name, customer_street,
customer_city}
o Additional attribute:banker_name.
 Employee={employee_id, employee_name, telephone_number,
salary, manager}
Additional attributes:

o Dependent_name – multivalued attribute.
o Start_date – base attribute.
o Employment_length – derived attribute.
 Two account entities: account={account_number, balance}
o Saving _account={interest_rate}
o Checking_account={overdraft_amount}
 Loan={loan_number, amount, originating_branch}
 Loan_payment={payment_number, payment_date, payment_amount}

Relationship sets for the Bank Database:

 Borrowermany-to-many relationship between customer and loan.

49 | P a g e Department of Information Technology


Database Management Systems

 Loan_branchmany-to-one relationship between loan and branch.


It replaces by the attribute originating_branch of the entity set loan.
 Loan_paymentone-to-many relationship between loan and
payment.
 Depositormany-to-may relationship between customer and
account with an attribute access_date.
 Cust_bankermany-to-one relationship between customer and the
bank employee with an attribute type. It replaces by the attribute
banket_name of the entity set customer.
 Works_forrelationship set between employee entities with role
indicators manager and worker. It replaces by the manager
attribute of employee.
E-R diagram for Bank Enterprise:


 The above diagram includes the entity sets, relationship sets, and mapping cardinalities arrived at through the design process.
 The attributes of each entity set are shown in one occurrence of the entity; all other occurrences of the entity are shown without any attributes.


Unified Modeling Language (UML) Overview

What is UML?

The Unified Modeling Language (UML) is the successor of the wave of object-oriented analysis and design methods that appeared in the late 80's and early 90's.

It most directly unifies the methods of Booch, Rumbaugh (OMT) and Jacobson ("the three amigos"). The UML is called a modelling language, not a method.

A method consists of a modelling language (mainly graphical notation) used to express designs and a process (i.e. inception, elaboration, construction and transition) which defines the steps in doing a design.

• Use Cases and Use Case Diagrams


• Class Diagrams
• Interaction Diagrams
o Sequence Diagrams
o Collaboration Diagrams
• State Diagram

Use Cases and Use Case Diagrams

Use Cases

A Use Case is a narrative document that describes the sequence of events of a user (actor) using a system to complete a process.

It is built to represent understanding of the user's requirements regarding a certain process/function that the system will be used for.

Components of a Use Case

The essential components of a Use Case are:

• NAME - starts with a verb to emphasise that it is a process


• ACTOR - is a role that a user plays in respect to the system. Actors are
generally shown when they are the ones who need the use case.

• TYPE
o Primary - represents major common processes
o Secondary - represents minor or rare processes
o Optional - processes that may not be implemented in the system
depending upon resources
• DESCRIPTION - short and essential


Use Case Level of Detail


A use case narrative can have different levels of detail from abstract to concrete:

• High-level - terse description of the activities without going into details of how activities are performed. Only use the essential components in the narrative.
• Expanded - shows more detail than a high-level one, details the sequence
of events in the process
o generally done in a conversational style
o shows alternative courses
o shows the initiator of the use case
• Real - details concretely the process in terms of its real current design,
input and output technologies

Use Case Example

The following is a possible expanded Use Case narrative:

Borrow books

Actors: student (initiator), librarian

Purpose: This use case is to enable a student to borrow a book

Overview: A student goes to the library to borrow books. The student brings a few books to the counter. The librarian asks for the student id. The student shows the student id and the books that he/she wishes to borrow. The librarian scans the student id and the books through the bar code scanner.

Type: Primary and Abstract

Cross-References: None

Typical Course of Events:

Actor Actions                                      System Responses

1. A student brings a few books to the counter.

2. The librarian scans the student id.             3. The system checks whether the fine
                                                       status exceeds $20.

4. The librarian scans the books' bar codes.       5. The system checks whether the
                                                       borrowing limit of books has been
                                                       reached.

                                                    6. The system approves the loan for
                                                       the borrower.

                                                    7. The system prints out the statement
                                                       of due dates for the books.

Alternative Courses:

Line 3: The librarian tells the student to pay the fine. Use case cancelled.

Line 5: The system notifies the librarian that the borrowing limit of books has been reached. Use case cancelled.

Identifying Use Cases

There are two ways to identify Use Cases:

• Using the actors


o identify the actors related to a system or organisation
o for each actor, identify the processes it initiates or participates in
• Using events
o identify the external events that a system must respond to
o relate the events to actors and use cases

Relationship between Use Cases


Association relationship
A Use Case diagram illustrates a set of use cases for a system, i.e. the actors and the relationships between the actors and use cases. Each use case diagram is for a particular subject area.

Include relationship
An include relationship connects a base use case (i.e. borrow books) to an
inclusion use case (i.e. check Fine). An include relationship specifies how
behaviour in the inclusion use case is used by the base use case.


The include relationship adds additional functionality not specified in the base use case.

Extend relationship
An extend relationship specifies how the functionality of one use case can
be inserted into the functionality of another use case. The base use case
implicitly incorporates the behaviour of another use case at a location specified
indirectly by the extending use case. The extend relationships are important
because they show optional functionality or system behavior.

Notice the extend relationship between Request a book and Search. The
extend relationship is significant because it shows optional functionality. If the
student desires, he/she can search the book through the system. However, the
student may only Request a book through the system without searching the book
if the student knows the call number.

Generalisation relationship


A generalisation relationship means that a child use case inherits the behaviour and meaning of the parent use case. The child may add or override the behaviour of the parent.

Class Diagrams

A Class diagram (Business Class Diagram) describes the attributes and behaviours of the objects and the relationships between them. The initial use is to provide a conceptual model of the systems. Class diagrams are used from the analysis stage to the design stage of a system, hence there are three perspectives of class diagrams:

• Conceptual
o used in the initial analysis (not a model of software design)
o shows real world concepts
o better to overspecify than underspecify it
• Specification
o does describe software
o describes the interface of a software, not the implementation
o types rather than classes
• Implementation
o describes the actual software in the system
o classes

Components of a Class Diagram

There are two kinds of relationships between classes:

1. Generalisations (subtyping)
2. Associations


Generalisation at the implementation perspective is associated with inheritance. The subclass inherits all the non-private methods and attributes of the superclass and may override inherited methods.

Each association has two association ends. Each end is attached to one of the
classes in the association. An end can be explicitly named with a label. This label
is called a role name. An association end also has multiplicity, which is an
indication of how many objects may participate in the given relationship.

A class has attributes and operations/methods. The format of an attribute is:

accessSpecifier attributeName: DataTypeName [=defaultValue] [{choice1, choice2, choice3}]

The format of a method is:


accessSpecifier methodName([parameterName: parameterDataType [=defaultValue], ....]): returnDataType

Note: The notation [] represents optional arguments
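As an illustration of the two formats above, a hypothetical Account class (the class name, members and types are assumed, not taken from any diagram in this text) might list:

- balance: Double = 0.0
+ deposit(amount: Double): Boolean
# applyInterest(rate: Double = 0.05): Double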

The following are the additional model elements covered in a class diagram:

Model Element                                        UML Notation

Abstract classes and methods                         {abstract}

Interfaces                                           <<interface>>

Access specifiers: public, private, protected        +, -, # (shown as icons in Rational Rose)

Initial values                                       = sign, e.g. gstRate = 0.1

Static attributes or methods                         underlined

Limited number of valid options for a variable       {option1, option2, option3}


Example of Class Diagram

The following is a typical class diagram.

Q-BANK


1.a)Sketch the architecture of DBMS. (7)- nov 2008

b)With an example explain the entity relationship model. (8)-nov 2008

2.a)What are the features of a database administrator? Explain. (7)-nov 2008

b)Explain E-R diagram with a suitable example. (8)-nov 2008

3)Outline architecture of a database system. (15)-nov2007

4)Define concept of specialization and generalization. And explain the constraints


on Generalizations. (15)-nov2007

5)Explain the basic concepts of the Entity-Relationship model. (15)-april 2008

6)Explain the design of an E-R database schema. (15)-april 2008

7.a)Define and Explain the term database and database management system
with example. (7)-may 2007

b)Compare and contrast the conventional file system with DBMS. (8)-april 2008

8)What is a data model? Compare and contrast three popular data models with
examples. (15)-april 2008

9.a)What are the various disadvantages of storing organizational information in a


file-processing system? (8)-april 2009

b)Describe the various data models of database. (7)-april 2009

10)Discuss some of the extended features of Entity-Relationship diagram. (15)-


april 2009

11)Explain the various possible associations between entities with suitable


examples. What is E-R diagram?Explain with an example. (15)-may 2007

12)Draw an E-R diagram for a hospital with a set of patients and a set of
doctors.Associate with each patient a log of the various tests and examinations
conducted. (15)-may 2007

13) Explain the distinctions among the super key, candidate key and primary key
(7)-May2007


UNIT - II
2Mark Q & A

1. Give the syntax for creating the table with composite primary key.
Multicolumn primary key is called composite primary key.

Syntax: create table <table name> (columnname1 data type (size), columnname2 data type (size), constraint name primary key (columnname1, columnname2));
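A concrete instance of this syntax (the table and column names are assumed purely for illustration):

create table enrollment (
    student_id   number(6),
    course_id    varchar(8),
    grade        varchar(2),
    constraint pk_enroll primary key (student_id, course_id)
);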

2. Write a query to display loan number, branch name where loan


amount is between 500 and 1000 using comparison operators.

Query: select loan no, branch name from loan where amount>=500 and
amount<=1000;

3. Find the names of all branches with customers who have an account
in the bank and who live in the Harrison city using Equi-join.

Query: select branch_name from customer,account,depositor where


cust_city='harrison' and customer.cust_name = depositor.cust_name
and depositor.acc_no=account.acc_no;

4. Find the names of all branches with customers who have an account
in the bank and who live in the Harrison city using Sub-Queries.

Query: select branch_name from account where acc_no in(select


acc_no from depositor where cust_name in(select cust_name from
customer where cust_city='harrison'));

5. Select the rows from borrower such that the loan numbers are
lesser than any loan number where the branch name is Downtown.

Query: select * from borrower where loan_no < any (Select loan_no
from loan where branch_name='downtown');

6. Define self-join
Joining of a table to itself is called self-join. i.e., it joins one row in a
table to another row.

7. What is a view and give an example query to create a view from an


existing table.
Any relation that is not a part of the logical model but which is made
visible to the user as a virtual (imaginary) relation is called a view.

Query: create view custall (name, city) as (Select cust_name, cust_city


from customer);

8. Write short notes on relational model



The relational model uses a collection of tables to represent both data and the relationships among those data. The relational model is an example of a record-based model.

9. Define tuple and attribute


• Attributes: column headers
• Tuple : Row

10. Define the term relation .


Relation is a subset of a Cartesian product of list domains.

11. Define tuple variable


Tuple variable is a variable whose domain is the set of all tuples.

12. Define the term Domain.


For each attribute there is a set of permitted values called the
domain of that attribute.

13. What is a candidate key?


Minimal super keys are called candidate keys.

14. What is a primary key?


Primary key is chosen by the database designer as the principal means of identifying an entity in the entity set.

15. What is a super key?


A super key is a set of one or more attributes that
collectively allows us to identify uniquely an entity in the entity set.

16. Define- relational algebra.


The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as output.

17. What is a SELECT operation?


The select operation selects tuples that satisfy a given predicate. We use the lowercase Greek letter sigma (σ) to denote selection.

18. Define Degree and Domain of a relation.

Degree:
Number of attributes ‘n’ of its relation schema is called a degree of a
relation. eg. Account table degree is 3, since three attributes are there in
that relation.

Domain:
Set of permitted values for each attribute (or) data type describing
the types of values that can appear in each column is called a domain of a
relation. eg. Set of all account numbers of the account table.

19. Define how a relation is defined mathematically.


A relation is defined mathematically as a subset of a Cartesian
product of a list of domains.

For eg., in account table,


D1 -> set of all acc_nos
D2 -> set of all branch_names
D3 -> set of all balances
And the relation account is a subset of D1 X D2 X D3

20. Define super key and give example to illustrate the super key.
Set of one or more attributes taken collectively, allowing to identify
uniquely an entity in the entity set.

Eg1. {SSN} and {SSN, Cust_name} of customer table are super


keys.
Eg2. {Branch_name} and {Branch_name, Branch_city} of Branch
table are super keys.

21. Define candidate key and give example to illustrate the candidate
key.
Super keys of which no proper subset is itself a super key are called candidate keys; in other words, a candidate key is a minimal super key. The candidate key chosen by the designer serves as the primary key in SQL.
Eg1. {SSN} is the candidate key for the super keys {SSN} and
{SSN, Cust_name} of customer table.
Eg2. {Branch_name} is the candidate key for the super keys
{Branch_name} and {Branch_name, Branch_city} of Branch table.

22. List out the six fundamental operators and 4 additional operators in
relational algebra.
Six Fundamental operators:
 Selection ( )
 Projection ()
 Union ()
 Set Difference (-)
 Cartesian Product (X)
 Rename ()

Four Additional operators:


 Set Intersection ()
 Natural Join (*)
 Division (%)
 Assignment ()

23. Which operators are called as unary operators and explain why they
are called so.
Unary operators:
• Selection ()
• Projection ()
• Rename ()
These operators are called as unary operators because they operate
on only one relation.

24. Which operators are called as binary operators and explain why
they are called so.

Binary operators:
• Union ()
• Set Difference (-)
• Cartesian Product (X)
These operators are called as binary operators because they
operate on pairs of relations.

25. Write a relational algebra expression to find those tuples pertaining


to loans of more than 1200 made by the Perryridge branch.
Relational algebra expression:
σ branch_name = “Perryridge” ∧ amount > 1200 (loan)

26. Explain the use of set difference operator and give an example to
illustrate the same.
Use of set difference operator:
Allows finding tuples that are in one relation but are not in another
relation.
Example: Find all customers of the bank who have an account but
not a loan.

Relational Algebra Expression:
Π cust_name (depositor) − Π cust_name (borrower)

27. Explain the two conditions needed for the set difference operation
(union operation) to be valid.

Two conditions are


• Relations, r and s must be of the same arity ie., they must have
same number of attributes.
• Domains of the ith attribute of r and the ith attribute of s must be
same for all i.

28. Explain with one example why the additional operators are
separated from the fundamental operators?
Additional operators are used instead of fundamental operators to
reduce the complexity of long relational algebra expressions.
Eg. r s
 =r–(r–s)
Intersection can be used for repeated set difference operations.

29. Define and give the general format used for generalized projection.
Give one example expression to illustrate the same.
Generalized projection extends the projection operation by allowing
arithmetic functions to be used in the projection list.
General format used for Generalized projection is
Π F1, F2, …, Fn (E) where
E is the relational algebra expression.
F1, F2, …, Fn are the arithmetic expressions involving constants and attributes in the schema of E. As a special case these can be simply an attribute or a constant.
Example expression for illustration:
Π acc_no, branch_name, balance + 100 (account)


30. What is the use of outer join and list out the three types of outer
join with the notations used in relational algebra?
Natural join combines only the common columns. So some information
will be lost, if it has no common column. So outer join is used. It avoids this
loss of information.
Three types of outer join:
Left outer join
Right outer join
Full outer join

31. Write a relational algebraic expression to delete all accounts at


branches located in Brooklyn.
r1 branch_city
 = “Brooklyn” (account branch)
r2 branch_name,
 acc_no, balance (r1)
account account
 – r2
where r1 and r2 are temporary relations.

32. Write a relational algebraic expression to insert the information


about Smith with his new account number ‘A-157’ taken at the
Perryridge branch with Rs.1200.

Relational algebraic expression for insertion:

account ← account ∪ {(“A-157”, “Perryridge”, 1200)}
depositor ← depositor ∪ {(“Smith”, “A-157”)}

33. Define materialized views and explain the use of such views.
Definitions:
Certain database systems allow view relations to be stored, but they
make sure that if the actual relations used in the view definition change
then the view is kept up to date. Such views are called materialized
views.The process of keeping views up to date is called view maintenance.

Use of materialized views:


If the views are used frequently then materialized views are used.
But benefits of materialization must be weighed against the storage cost
and the added overhead of updates.

34. Differentiate assertions and triggers. (i.e., write definitions)


An assertion is a predicate expressing a condition that we wish the
database to satisfy always.
A Trigger is a statement that is executed automatically by the
system as a side effect of a modification to the database.

35. List out the two requirements of triggers in database.


Specify the conditions under which the trigger is to be executed.
Specify the actions to be taken when the trigger executes.
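As an illustration of these two requirements, a sketch in SQL:1999-style trigger syntax (the account and overdraft_log tables, column names and trigger name are assumed for illustration, and exact trigger syntax varies between database systems):

create trigger overdraft_trigger
after update of balance on account
referencing new row as nrow
for each row
when (nrow.balance < 0)              -- condition under which the trigger executes
begin atomic
    insert into overdraft_log        -- action taken when the trigger executes
    values (nrow.account_number, nrow.balance);
end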

36. What is a PROJECT operation?


The project operation is a unary operation that returns its argument relation with certain attributes left out. Projection is denoted by the uppercase Greek letter pi (Π).


37. Write short notes on tuple relational calculus.


The tuple relational calculus is a nonprocedural query language. It describes the desired information without giving a specific procedure for obtaining that information.
A query or expression can be expressed in tuple relational calculus as
{t | P (t)}
which means the set of all tuples ‘t’ such that predicate P is true for ‘t’.
38. Write short notes on domain relational calculus
The domain relational calculus uses domain variables that take on
values from an attribute domain rather than values for entire tuple.

39. Define query language?


A query is a statement requesting the retrieval of information. The portion of a DML that involves information retrieval is called a query language.

40. Write short notes on Schema diagram.


A database schema along with primary key and foreign key dependencies can be depicted pictorially by a schema diagram. Each relation appears as a box with attributes listed inside it and the relation name above it.

41. What is foreign key?


A relation schema r1 derived from an ER schema may include among its attributes the primary key of another relation schema r2. This attribute is called a foreign key from r1 referencing r2.

Lecture notes:

Relational Model


Why Study the Relational Model?

 Most widely used model.


o Vendors: IBM, Informix, Microsoft, Oracle, Sybase, etc.
 “Legacy systems” in older models
o E.G., IBM’s IMS
 Recent competitor: object-oriented model
o ObjectStore, Versant, Ontos
o A synthesis emerging: object-relational model
 Informix Universal Server, UniSQL, O2, Oracle, DB2

Definitions

 Relational database: a set of relations


 Relation: made up of 2 parts
o Instance: a table, with rows and columns.
#Rows = cardinality, #fields = degree
o Scheme: specifies name of relation, plus name and type of each
column.

Example:

Students(sid: string, name: string, login: string, age: integer, gpa: real)

Example Instance of Students Relation


Sid name login age gpa
53666 Jones jones@cs 18 3.4
53688 Smith smith@eecs 18 3.2
53650 Smith smith@math 19 3.8

 Cardinality = 3, degree = 5, all rows distinct

Relational database:

The relational database consists of collection of tables, each of which is


assigned a unique name. A row in a table represents a relationship among a set
of values. Informally, a table is an entity set, and a row is an entity.

Basic structure and terminology in it:

Example of relation:


Attribute:

Each attribute of a relation has a name.

The set of allowed values for each attribute is called the domain of the attribute.

Attribute values are (normally) required to be atomic; that is, indivisible

E.g. the value of an attribute can be an account number, but cannot be a set of
account numbers. Domain is said to be atomic if all its members are atomic. The
special value null is a member of every domain. The null value causes
complications in the definition of many operations.

Database schema:

Relation schema:

Formally, given domains D1, D2, …. Dn a relation r is a subset of


D1 x D2 x … x Dn
Thus, a relation is a set of n-tuples (a1, a2, …, an) where each ai ∈ Di

Schema of a relation consists of attribute definitions (name and type/domain) and integrity constraints.

Relation Instance:

The current values (relation instance) of a relation are specified by a table

An element t of r is a tuple, represented by a row in a table

Order of tuples is irrelevant (tuples may be stored in an arbitrary order)


customer_name     customer_street     customer_city      <- attributes (or columns)
Jones             Main                Harrison           <- tuples (or rows)
Smith             North               Rye
Curry             North               Rye
Lindsay           Park                Pittsfield

Database:

A database consists of multiple relations. Information about an enterprise is broken up into parts, with each relation storing one part of the information.

E.g.

account: information about accounts


depositor: which customer owns which account
customer: information about customers

The Customer Relation:

The Depositor Relation:


Why Split Information Across Relations?

Storing all information as a single relation such as bank(account_number, balance, customer_name, ..) results in repetition of information (e.g., if two customers own an account, what gets repeated?) and the need for null values (e.g., to represent a customer without an account).

Keys:

K⊆ R

K is a super key of R if values for K are sufficient to identify a unique tuple of each possible relation r(R).

By “possible r” we mean a relation r that could exist in the enterprise we are modeling.

Example: {customer_name, customer_street} and {customer_name} are both super keys of Customer, if no two customers can possibly have the same name. In real life, an attribute such as customer_id would be used instead of customer_name to uniquely identify customers, but we omit it to keep our examples small, and instead assume customer names are unique.

Types of keys:

1) K is a candidate key if K is minimal



Example: {customer_name} is a candidate key for Customer, since it is a super key and no subset of it is a super key.

2) Primary key: a candidate key chosen as the principal means of identifying tuples within a relation.

Should choose an attribute whose value never, or very rarely, changes.

E.g. email address is unique, but may change

3) Foreign Keys:

A relation schema may have an attribute that corresponds to the primary key of another relation. The attribute is called a foreign key.

E.g. customer_name and account_number attributes of depositor are foreign keys to customer and account respectively.

Only values occurring in the primary key attribute of the referenced relation may occur in the foreign key attribute of the referencing relation.

Relation Algebra

The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result.

 The operations can be divided into,


 Basic operations: Select, Project, Union, rename, set difference
and Cartesian product
 Additional operations: Set intersections, natural join, division and
assignment.
 Extended operations: Aggregate operations and outer join

The fundamental operations:

1. Select
2. Project
3. Union
4. Set difference
5. Cartesian product and
6. Rename.


Example

Consider simple Relation r

A    B    C    D
a    a    1    7
a    b    5    7
b    b    12   3
b    b    23   10

σ A=B ∧ D>5 (r)

A    B    C    D
a    a    1    7
b    b    23   10

Selection Operation

 Notation: σp(r)
 p is called the selection predicate
 Defined as: σp(r) = {t | t ∈ r and p(t)}
 where p is a formula in propositional calculus consisting of terms connected by: ∧ (and), ∨ (or), ¬ (not)
 Each term is one of: <attribute> op <attribute> or <constant>, where op is one of: =, ≠, >, ≥, <, ≤

 Select

 It selects tuples that satisfy a given predicate. To denote selection, σ (sigma) is used.

Syntax
σ condition (tablename)

Example
σ sal>1000 (emp)

It selects tuples whose employee sal is > 1000
Project Operation


 It selects attributes from the relation.
 Symbol for project: ∏ (pi)

Syntax
∏ <attribute list> (table name)

Example
∏ eid, sal (employee)

 Combining Select & Project Operation

∏ eid, sal (σ sal>1000 (employee))

 Selects tuples where sal>1000 & from them only eid and sal attributes are selected.
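For comparison, the same request written in SQL (assuming an employee table with columns eid and sal):

SELECT eid, sal
FROM employee
WHERE sal > 1000;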

Union Operation

Consider a query to find the names of all bank customers who have either an
account or a loan or both. Note that the customer relation does not contain the
information, since a customer does not need to have either an account or a loan
at the bank.

1. We know how to find the names of all customers with a loan in the
bank:

Πcustomer-name (borrower)

2. We also know how to find the names of all customers with an


account in the bank:


Πcustomer-name (depositor)

The binary operation union is denoted, as in set theory, by ∪. So the expression needed is

Πcustomer-name (borrower) ∪ Πcustomer-name (depositor)

Therefore, for a union operation r ∪ s to be valid, we require that two conditions hold:

1. The relations r and s must be of the same arity. That is, they must have the
same number of attributes.

2. The domains of the ith attribute of r and the ith attribute of s must be the
same, for all i.
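The same union can be written in SQL as follows; this is a sketch assuming borrower and depositor tables that both have a customer_name column:

SELECT customer_name FROM borrower
UNION
SELECT customer_name FROM depositor;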

The Set Difference Operation:

1. The set-difference operation, denoted by −, allows us to find tuples


that are in one relation but are not in another.
2. The expression(r − s) produces a relation containing those tuples in
r but not in s.
3. We can find all customers of the bank who have an account but not
a loan by writing

Πcustomer-name (depositor) − Πcustomer-name (borrower)
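In SQL the corresponding operator is EXCEPT (called MINUS in Oracle); a sketch with the same assumed table and column names:

SELECT customer_name FROM depositor
EXCEPT
SELECT customer_name FROM borrower;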

The Cartesian-Product Operation

1. The Cartesian-product operation, denoted by a cross (×), allows us to combine information from any two relations.
2. We write the Cartesian product of relations r1 and r2 as (r1 × r2).

3.Suppose that we want to find the names of all customers who have a
loan at the Perryridge branch. We need the information in both the loan
relation and the borrower relation to do so.


4. If we write

σbranch-name =“Perryridge” (borrower × loan)

However, this expression pairs every borrower with every Perryridge loan, so the customer-name column may contain customers who do not actually hold the loan in that row; we must also require borrower.loan-number = loan.loan-number.

5. Finally, since we want only customer-name, we do a projection

Πcustomer-name (σborrower .loan-number =loan.loan-number


(σbranch-name =“Perryridge” (borrower × loan)))

The Rename Operation:

1. Unlike relations in the database, the results of relational-algebra


expressions do not have a name that we can use to refer to them.

2. It is useful to be able to give them names; the rename operator,


denoted by the lowercase Greek letter rho (ρ).

Consider the query:

Find the names of all customers who live on the same street and in the same city
as Smith.” We can obtain Smith’s street and city by writing

Πcustomer-street, customer-city (σcustomer-name = “Smith” (customer))

However, in order to find other customers with this street and city, we must
reference the customer relation a second time.In the following query, we use the
rename operation on the preceding expression to give its result the name smith-
addr, and to rename its attributes to street and city, instead of customer-street
and customer-city:

Πcustomer.customer-name
  (σcustomer.customer-street = smith-addr.street ∧ customer.customer-city = smith-addr.city
    (customer × ρsmith-addr(street, city)
      (Πcustomer-street, customer-city (σcustomer-name = “Smith” (customer)))))

Additional Operations:

The fundamental operations of the relational algebra are sufficient to express any relational-algebra query.

However, if we restrict ourselves to just the fundamental operations, certain common queries are lengthy to express.

We define additional operations that do not add any power to the algebra, but simplify common queries.

The various additional operations are:

i) The Set-Intersection Operation


ii) The Natural-Join Operation
iii) The Division Operation
iv) The Assignment Operation

The Set-Intersection Operation

1. The first additional relational-algebra operation that we shall define is set intersection (∩).
2. Suppose that we wish to find all customers who have both a loan
and an account. Using set intersection, we can write:

Πcustomer-name (borrower) ∩ Πcustomer-name (depositor)

The Natural-Join Operation

1. It is often desirable to simplify certain queries that require a Cartesian product.
2. Usually, a query that involves a Cartesian product includes a
selection operation on the result of the Cartesian product.


Consider the query “Find the names of all customers who have a loan at the
bank, along with the loan number and the loan amount”

Πcustomer-name, loan.loan-number, amount (σborrower.loan-number = loan.loan-number (borrower × loan))

The natural join is a binary operation that allows us to combine certain selections
and Cartesian product into one operation.
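For comparison, the same query can be sketched in SQL using a natural join, assuming borrower and loan tables whose only common column is loan_number:

SELECT customer_name, loan_number, amount
FROM borrower NATURAL JOIN loan;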

The Division Operation:


The division operation, denoted by ÷, is suited to queries that include the
phrase” for all.”

Suppose that we wish to find all customers who have an account at all the
branches located in Brooklyn.

Step 1:
We can obtain all branches in Brooklyn by the expression

r1 = Πbranch-name (σbranch-city =“Brooklyn” (branch))

Step 2:
We can find all (customer-name, branch-name) pairs for which the customer has
an account at a branch by writing

r2 = Πcustomer-name, branch-name (depositor account)

Step 3:

Now, we need to find a customer who appears in r2 with every branch name in
r1.operation that provides exactly those customers is divide operation. We
formulate the query by writing

Πcustomer-name, branch-name (depositor  account)

÷ Πbranch-name (σbranch-city =“Brooklyn” (branch))

The Assignment Operation:


1. It is convenient at times to write a relational-algebra expression by assigning parts of it to temporary relation variables.
2. The assignment operation, denoted by ←, works like assignment
in a programming language.
To illustrate this operation, consider the definition of division

temp1 ← ΠR−S (r)

temp2 ← ΠR−S ((temp1 × s) − ΠR−S,S(r))

Result = temp1 − temp2

The evaluation of an assignment does not result in any relation being displayed
to the user. Rather, the result of the expression to the right of the ← is assigned
to the relation variable on the left of the←. This relation variable may be used in
subsequent expressions.

Thus various relational algebra operations are explained.

TUPLE RELATIONAL CALCULUS

The tuple calculus is a calculus that was introduced by Edgar F. Codd as part of the relational model in order to give a declarative database query language for this data model.

It formed the inspiration for the database query languages QUEL and SQL
of which the latter, although far less faithful to the original relational model and
calculus, is now used in almost all relational database management systems as
the ad-hoc query language.

Along with the tuple calculus Codd also introduced the domain calculus
which is closer to first-order logic and showed that these two calculi (and the
relational algebra) are equivalent in expressive power. Subsequently query
languages for the relational model were called relationally complete if they could
express at least all these queries.

Definition of the calculus

Relational database

Since the calculus is a query language for relational databases we first have to define a relational database. The basic relational building block is the domain, or data type.

A tuple is an ordered multiset of attributes, which are ordered pairs of domain and value. A relvar (relation variable) is a set of ordered pairs of domain and name, which serves as the header for a relation. A relation is a set of tuples.

Although these relational concepts are mathematically defined, those definitions map loosely to traditional database concepts. A table is an accepted visual representation of a relation; a tuple is similar to the concept of row.

We first assume the existence of a set C of column names, examples of which are "name", "author", "address" et cetera. We define headers as finite subsets of C.

A relational database schema is defined as a tuple S = (D, R, h) where D is the domain of atomic values (see relational model for more on the notions of domain and atomic value), R is a finite set of relation names, and

h : R → 2^C

a function that associates a header with each relation name in R. (Note that this
is a simplification from the full relational model where there is more than one
domain and a header is not just a set of column names but also maps these
column names to a domain.)

Given a domain D we define a tuple over D as a partial function

t:C→D

that maps some column names to an atomic value in D. An example would be


(name : "Harry", age : 25).

The set of all tuples over D is denoted as TD. The subset of C for which a tuple t is
defined is called the domain of t (not to be confused with the domain in the
schema) and denoted as dom(t).

Finally we define a relational database given a schema S = (D, R, h) as a function

db : R → 2TD

that maps the relation names in R to finite subsets of TD, such that for every
relation name r in R and tuple t in db(r) it holds that

dom(t) = h(r).

The latter requirement simply says that all the tuples in a relation should contain
the same column names, namely those defined for it in the schema.

Atoms

For the construction of the formulas we will assume an infinite set V of tuple
variables. The formulas are defined given a database schema S = (D, R, h) and a
partial function type : V -> 2C that defines a type assignment that assigns
headers to some tuple variables. We then define the set of atomic fomulas
A[S,type] with the following rules:


1. if v and w in V, a in type(v) and b in type(w) then the formula " v.a = w.b "
is in A[S,type],
2. if v in V, a in type(v) and k denotes a value in D then the formula " v.a = k
" is in A[S,type], and
3. if v in V, r in R and type(v) = h(r) then the formula " r(v) " is in A[S,type].

Examples of atoms are

• (t.name = "Codd") -- tuple t has a name attribute and its value is "Codd"
• (t.age = s.age) -- t has an age attribute and s has an age attribute with the
same value
• Book(t) -- tuple t is present in relation Book.

The formal semantics of such atoms is defined given a database db over S and a
tuple variable binding val : V -> TD that maps tuple variables to tuples over the
domain in S:

1. " v.a = w.b " is true if and only if val(v)(a) = val(w)(b)


2. " v.a = k " is true if and only if val(v)(a) = k
3. " r(v) " is true if and only if val(v) is in db(r)

Formulas

The atoms can be combined into formulas, as is usual in first-order logic, with the
logical operators ∧ (and), ∨ (or) and ¬ (not), and we can use the existential
quantifier (∃) and the universal quantifier (∀) to bind the variables. We define the
set of formulas F[S,type] inductively with the following rules:

1. every atom in A[S,type] is also in F[S,type]


2. if f1 and f2 are in F[S,type] then the formula " f1 ∧ f2 " is also in F[S,type]
3. if f1 and f2 are in F[S,type] then the formula " f1 ∨ f2 " is also in F[S,type]
4. if f is in F[S,type] then the formula " ¬ f " is also in F[S,type]
5. if v in V, H a header and f a formula in F[S,type[v->H]] then the formula " ∃
v : H ( f ) " is also in F[S,type], where type[v->H] denotes the function that is
equal to type except that it maps v to H,
6. if v in V, H a header and f a formula in F[S,type[v->H]] then the formula " ∀
v : H ( f ) " is also in F[S,type]

Examples of formulas:

• t.name = "C. J. Date" ∨ t.name = "H. Darwen"


• Book(t) ∨ Magazine(t)
• ∀ t : {author, title, subject} ( ¬ ( Book(t) ∧ t.author = "C. J. Date" ∧ ¬ (
t.subject = "relational model")))

Note that the last formula states that all books that are written by C. J. Date have
as their subject the relational model. As usual we omit brackets if this causes no
ambiguity about the semantics of the formula.


We will assume that the quantifiers quantify over the universe of all tuples over
the domain in the schema. This leads to the following formal semantics for
formulas given a database db over S and a tuple variable binding val : V -> TD:

1. " f1 ∧ f2 " is true if and only if " f1 " is true and " f2 " is true,
2. " f1 ∨ f2 " is true if and only if " f1 " is true or " f2 " is true or both are true,
3. " ¬ f " is true if and only if " f " is not true,
4. " ∃ v : H ( f ) " is true if and only if there is a tuple t over D such that dom(t)
= H and the formula " f " is true for val[v->t], and
5. " ∀ v : H ( f ) " is true if and only if for all tuples t over D such that dom(t) =
H the formula " f " is true for val[v->t].

Queries

Finally we define what a query expression looks like given a schema S = (D, R,
h):

{ v : H | f(v) }

where v is a tuple variable, H a header and f(v) a formula in F[S,type] where type
= { (v, H) } and with v as its only free variable. The result of such a query for a
given database db over S is the set of all tuples t over D with dom(t) = H such
that f is true for db and val = { (v, t) }.

Examples of query expressions are:

• { t : {name} | ∃ s : {name, wage} ( Employee(s) ∧ s.wage = 50.000 ∧ t.name = s.name ) }
• { t : {supplier, article} | ∃ s : {s#, sname} ( Supplier(s) ∧ s.sname =
t.supplier ∧ ∃ p : {p#, pname} ( Product(p) ∧ p.pname = t.article ∧ ∃ a :
{s#, p#} ( Supplies(a) ∧ s.s# = a.s# ∧ a.p# = p.p# ) }
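For comparison only, the first expression above corresponds roughly to the following SQL, assuming an Employee table with columns name and wage (and reading 50.000 as the number 50000):

SELECT name
FROM Employee
WHERE wage = 50000;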

Semantic and syntactic restriction of the calculus

Domain-independent queries

Because the semantics of the quantifiers is such that they quantify over all the
tuples over the domain in the schema it can be that a query may return a
different result for a certain database if another schema is presumed. For
example, consider the two schemas S1 = ( D1, R, h ) and S2 = ( D2, R, h ) with
domains D1 = { 1 }, D2 = { 1, 2 }, relation names R = { "r1" } and headers h =
{ ("r1", {"a"}) }. Both schemas have a common instance:

db = { ( "r1", { ("a", 1) } ) }

If we consider the following query expression

{ t : {a} | t.a = t.a }

then its result on db is either { (a : 1) } under S1 or { (a : 1), (a : 2) } under S2. It


will also be clear that if we take the domain to be an infinite set, then the result

of the query will also be infinite. To solve these problems we will restrict our
attention to those queries that are domain independent, i.e., the queries that
return the same result for a database under all of its schemas.

An interesting property of these queries is that if we assume that the tuple


variables range over tuples over the so-called active domain of the database,
which is the subset of the domain that occurs in at least one tuple in the
database or in the query expression, then the semantics of the query expressions
does not change. In fact, in many definitions of the tuple calculus this is how the
semantics of the quantifiers is defined, which makes all queries by definition
domain independent.

Safe queries

In order to limit the query expressions such that they express only domain-
independent queries a syntactical notion of safe query is usually introduced. To
determine whether a query expression is safe we will derive two types of
information from a query. The first is whether a variable-column pair t.a is bound
to the column of a relation or a constant, and the second is whether two variable-
column pairs are directly or indirectly equated (denoted t.v == s.w).

For deriving boundedness we introduce the following reasoning rules:

1. in " v.a = w.b " no variable-column pair is bound,


2. in " v.a = k " the variable-column pair v.a is bound,
3. in " r(v) " all pairs v.a are bound for a in type(v),
4. in " f1 ∧ f2 " all pairs are bound that are bound either in f1 or in f2,
5. in " f1 ∨ f2 " all pairs are bound that are bound both in f1 and in f2,
6. in " ¬ f " no pairs are bound,
7. in " ∃ v : H ( f ) " a pair w.a is bound if it is bound in f and w <> v, and
8. in " ∀ v : H ( f ) " a pair w.a is bound if it is bound in f and w <> v.

For deriving equatedness we introduce the following reasoning rules (next to the
usual reasoning rules for equivalence relations: reflexivity, symmetry and
transitivity):

1. in " v.a = w.b " it holds that v.a == w.b,


2. in " v.a = k " no pairs are equated,
3. in " r(v) " no pairs are equated,
4. in " f1 ∧ f2 " it holds that v.a == w.b if it holds either in f1 or in f2,
5. in " f1 ∨ f2 " it holds that v.a == w.b if it holds both in f1 and in f2,
6. in " ¬ f " no pairs are equated,
7. in " ∃ v : H ( f ) " it holds that w.a == x.b if it holds in f and w<>v and
x<>v, and
8. in " ∀ v : H ( f ) " it holds that w.a == x.b if it holds in f and w<>v and
x<>v.

We then say that a query expression { v : H | f(v) } is safe if

• for every column name a in H we can derive that v.a is equated with a
bound pair in f,


• for every subexpression of f of the form " ∀ w : G ( g ) " we can derive that
for every column name a in G we can derive that w.a is equated with a
bound pair in g, and
• for every subexpression of f of the form " ∃ w : G ( g ) " we can derive that
for every column name a in G we can derive that w.a is equated with a
bound pair in g.

The restriction to safe query expressions does not limit the expressiveness since
all domain-independent queries that could be expressed can also be expressed
by a safe query expression. This can be proven by showing that for a schema S =
(D, R, h), a given set K of constants in the query expression, a tuple variable v
and a header H we can construct a safe formula for every pair v.a with a in H that
states that its value is in the active domain. For example, assume that K={1,2},
R={"r"} and h = { ("r", {"a, "b"}) } then the corresponding safe formula for v.b
is:

v.b = 1 ∨ v.b = 2 ∨ ∃ w ( r(w) ∧ ( v.b = w.a ∨ v.b = w.b ) )

This formula, then, can be used to rewrite any unsafe query expression to an
equivalent safe query expression by adding such a formula for every variable v
and column name a in its type where it is used in the expression. Effectively this
means that we let all variables range over the active domain, which, as was
already explained, does not change the semantics if the expressed query is
domain independent.

2.1.6 DOMAIN RELATIONAL CALCULUS

In computer science, domain relational calculus (DRC) is a calculus that was introduced by Michel Lacroix and Alain Pirotte as a declarative database query language for the relational data model [1].

In DRC, queries have the form:

{ <X1, X2, ...., Xn> | p(<X1, X2, ...., Xn>) }

where each Xi is either a domain variable or constant, and p(<X1, X2, ...., Xn>)
denotes a DRC formula. The result of the query is the set of tuples Xi to Xn which
makes the DRC formula true.

This language uses the same operators as tuple calculus; Logicial operators ∧
(and), ∨ (or) and ¬ (not). The existential quantifier (∃) and the universal
quantifier (∀) can be used to bind the variables.

Its computational expressivity is equivalent to that of relational algebra [2].

Examples

Let A, B, C mean Rank, Name, ID and D, E, F to mean Name, DeptName, ID

Find all captains of the starship USS Enterprise:


• {<A, B, C> | <A, B, C> in Enterprise ∧ A = "Captain" }

In this example, A, B, C denotes both the result set and a set in the table
Enterprise.

Find Names of Enterprise crewmembers who are in Stellar Cartography:

• {<B> | ∃ A, C ( <A, B, C> in Enterprise ∧ ∃ D, E, F(<D, E, F> in Departments ∧ F = C ∧ E = "Stellar Cartography" ))}

In this example, we're only looking for the name, so <B> denotes the column
Name. F = C is a requirement, because we need to find Enterprise crew members
AND they are in the Stellar Cartography Department.

An alternate representation of the previous example would be:

• {<B> | ∃ A, C (<A, B, C> in Enterprise ∧ ∃ D (<D, "Stellar Cartography", C> in Departments))}

In this example, the value of the requested F domain is directly placed in the
formula and the C domain variable is re-used in the query for the existence of a
department, since it already holds a crew member's id.
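For comparison only, the same request could be sketched in SQL, assuming tables Enterprise(Rank, Name, ID) and Departments(Name, DeptName, ID) as described above:

SELECT e.Name
FROM Enterprise e, Departments d
WHERE d.ID = e.ID
  AND d.DeptName = 'Stellar Cartography';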

SQL

History

 IBM Sequel language developed as part of System R project at the IBM San
Jose Research Laboratory
 Renamed Structured Query Language (SQL)
 ANSI and ISO standard SQL:
o SQL86
o SQL89
o SQL92
o SQL:1999 (language name became Y2K compliant!)
o SQL:2003
 Commercial systems offer most, if not all, SQL92 features, plus varying
feature sets from later standards and special proprietary features.
o Not all examples here may work on your particular system.

SQL is nothing but a set of commands/statements, categorized into the following five groups viz., DQL: Data Query Language, DML: Data Manipulation Language, DDL: Data Definition Language, TCL: Transaction Control Language, DCL: Data Control Language.

DQL: SELECT
DML: DELETE, INSERT, UPDATE
DDL: CREATE, DROP, TRUNCATE, ALTER


TCL: COMMIT, ROLLBACK, SAVEPOINT


DCL: GRANT, REVOKE

Data Definition Language

Allows the specification of not only a set of relations but also information about
each relation, including
 The schema for each relation.
 The domain of values associated with each attribute.
 Integrity constraints
 The set of indices to be maintained for each relations.
 Security and authorization information for each relation.
 The physical storage structure of each relation on disk.
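As a sketch of how several of these (the relation schema, the domains and the integrity constraints) are specified together in one statement; the emp/dept columns and types are assumed for illustration, in the style of the examples later in this chapter:

CREATE TABLE emp (
    empno    NUMBER(4)      PRIMARY KEY,
    ename    VARCHAR2(20)   NOT NULL,
    sal      NUMBER(8,2)    CHECK (sal >= 0),
    deptno   NUMBER(2)      REFERENCES dept(deptno)
);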


Select

The SELECT statement as the name says is used to extract data from Oracle
Database. The syntax for the simplest SELECT statement is as follows.

SELECT column_name1, column_name2, …
FROM table_name1;

Example:

SELECT *
FROM emp;

This command will display all the fields of the table emp and all of the records.

Example:

SELECT ename, sal
FROM emp
WHERE sal > 2000;

The result of this statement will be only two columns of emp table and only those
records where salary is greater than 2000.

Example:

SELECT ename, sal
FROM emp
WHERE sal > 2000
ORDER BY ename;

The output of this statement will be exactly the same as the one above except
that the output will be sorted based on ename column.

SQL Operators

Figure - SQL Operators: Comparison, Arithmetic, Logical & Other.

Example:

SELECT ename, sal, sal + sal*5/100 “Next Year Sal”
FROM emp;


Logical Operators and the operators in the Other category can be best
understood by looking at their respective real world examples.

Example:

SELECT sal
FROM emp
WHERE deptno = 30 AND sal > 2000;

The output of the query will be only one column, i.e. sal, and only those records will be displayed where the department number is 30 and the salary is greater than 2000. So when you use the AND operator, both conditions need to be satisfied for a record to appear in the output; in the case of OR, either the first condition or the second needs to be true, e.g.

SELECT sal
FROM emp
WHERE deptno = 30 OR sal > 2000;

Example:

SELECT *
FROM emp
WHERE job IN ('CLERK','ANALYST');

The output of the query will be all the columns of the emp table, but only those records where the job column contains either "CLERK" or "ANALYST". You can also use the IN operator with a sub-query, as follows.

SELECT *
FROM emp
WHERE sal IN (SELECT sal
FROM emp
WHERE deptno = 30);

Placing NOT before IN completely inverts the result, as in the following example. Such queries fall under the category called "Sub-Queries", which we will discuss later in this chapter; there is a special technique for interpreting them.

SELECT *
FROM emp
WHERE sal NOT IN (SELECT sal
FROM emp
WHERE deptno = 30);

Example:

SELECT *
FROM emp
WHERE sal BETWEEN 2000 AND 3000;

Only those records will be displayed where the salary is between 2000 and 3000
including both 2000 and 3000.

Example:

SELECT dname, deptno


FROM dept
WHERE EXISTS (SELECT *
FROM emp
WHERE dept.deptno = emp.deptno);
EXISTS is TRUE if the sub-query returns at least one row. In other words, the output will be the two columns from the dept table, for all departments for which the sub-query after EXISTS returns at least one record.

Example:

SELECT sal
FROM emp
WHERE ename LIKE 'SM%';

The output will be only those salaries from emp (employee) table where ename
(employee name) begins with “SM”. Another variation of above query is as
follows.

ename LIKE 'SMITH_'

The output will be only those records where ename begins with "SMITH" followed by exactly one more character (the underscore matches a single character).
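Written out as a full statement (a sketch against the same emp table), this would be:

SELECT ename, sal
FROM emp
WHERE ename LIKE 'SMITH_';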

Example:

SELECT ename, deptno


FROM emp
WHERE comm IS NULL;

The output will be ename and deptno, but only those records where the comm field has a NULL value. NULL is a special marker; keep in mind that it is not zero. It can be thought of as an empty field.


Functions

Single Row Functions:

Single row functions are further subdivided into five categories: Character, Date, Numeric, Conversion and Other functions. First we will start with Character Functions, or more precisely "Single Row Character Functions".

Character Functions

Following are the functions that fall under this category.

CHR
LTRIM
INSTR
ASCII
RTRIM
INSTRB
CONCAT
TRIM
LENGTH
INITCAP
REPLACE
LENGTHB


LOWER
SOUNDEX

UPPER
SUBSTR

LPAD
SUBSTRB

RPAD

Example:

SELECT CHR(67)||CHR(65)||CHR(84) "Pet"


FROM DUAL;

Output:

Pet
---
CAT

Example:

SELECT ASCII('Q')
FROM DUAL;

Output:

ASCII('Q')
----------
81

Example:

SELECT CONCAT(ename, ' is a good boy') "Result"


FROM emp
WHERE empno = 7900;

Output:

Result
-----------------
JAMES is a good boy

Example:

SELECT INITCAP('the king') "Capitals"


FROM DUAL;


Output:

Capitals
---------
The King
Example:

SELECT LOWER('THE KING') "Lowercase"


FROM DUAL;

Output:

Lowercase
-------------
the king

Similarly we can use the UPPER function.

Example:

SELECT LPAD('Page 1',15,'*+') "LPAD example"


FROM DUAL;

Output:

LPAD example
---------------
*+*+*+*+*Page 1

Similarly we can use RPAD.
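For instance, a sketch of the corresponding RPAD call (padding on the right instead of the left) would be:

SELECT RPAD('Page 1',15,'*+') "RPAD example"
FROM DUAL;

Output:

RPAD example
---------------
Page 1*+*+*+*+*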

Example:

SELECT LTRIM('121SPIDERMAN','12') "Result"


FROM DUAL;

Output:

Result
--------
SPIDERMAN

Similarly we can use RTRIM.

Example:

SELECT TRIM (0 FROM 001234567000) "Result"


FROM DUAL;

Output:


Result
--------
1234567

SELECT TRIM (LEADING 0


FROM 001234567000) Result"
FROM DUAL;

Output:

Result
--------
1234567000

Similarly we can replace LEADING with TRAILING to omit trailing zeros in


001234567000 and the output will then be 001234567.

Example:

SELECT REPLACE('KING KONG','KI','HO') "Result"


FROM DUAL;

Output:

Result
--------------
HONG KONG

Example:

SELECT ename
FROM emp
WHERE SOUNDEX(ename) = SOUNDEX('SMYTHE');

Output:

ENAME
----------
SMITH

This function allows you to compare words that are spelled differently but sound alike in English. You must have noticed that if you do a search in Google (www.google.com) using a wrong spelling, e.g. the misspelled word "Neus", it comes up with "Did you mean News?". That is basically the idea behind this kind of function.


Figure 7: Google’s implementation of SOUNDEX function.


________________________________________

Example:

SELECT SUBSTR('SPIDERMAN',7,3) "Result"


FROM DUAL;
Output:

Result
---------
MAN

Similarly we can use SUBSTRB; for a single-byte database character set,


SUBSTRB is equivalent to SUBSTR. Floating-point numbers passed as arguments
to SUBSTRB are automatically converted to integers. Assume a double-byte
database character set:

SELECT SUBSTRB('SPIDERMAN',7,4.3) "Result"


FROM DUAL;

Output:

Result
--------
DE

Example:

SELECT INSTR('CORPORATE FLOOR','OR', 3, 2) "Instring"


FROM DUAL;

Output:

Result
----------
14

Similarly we can use INSTRB; for a single-byte database character set, INSTRB is equivalent to INSTR. Let's assume a double-byte database character set.


SELECT INSTRB('CORPORATE FLOOR','OR',5,2)


"Result"
FROM DUAL;

Output:

Result
--------
27

Example:

SELECT LENGTH('SPIDERMAN') "Result"


FROM DUAL;

Output:

Result
--------
9

Similarly we can use LENGTHB; for a single-byte database character set, LENGTHB is equivalent to LENGTH. Let's assume a double-byte database character set.

SELECT LENGTHB ('SPIDERMAN') "Result"


FROM DUAL;

Output:

Result
--------
18

Date Functions:

Following functions fall under this category.

ADD_MONTHS
MONTHS_BETWEEN
LAST_DAY
ROUND
SYSDATE
TRUNC

Example:

SELECT ADD_MONTHS(hiredate,1)
FROM emp
WHERE ename = 'SMITH';

Example:


SELECT SYSDATE,
LAST_DAY(SYSDATE) "Last",
LAST_DAY(SYSDATE) - SYSDATE "Days Left"
FROM DUAL;

Output:

SYSDATE Last Days Left


--------- --------- ----------
23-OCT-97 31-OCT-97 8

Example:

SELECT MONTHS_BETWEEN(SYSDATE, hiredate) "Months of Service"
FROM emp;

Example:

SELECT ROUND (TO_DATE ('27-OCT-92'),'YEAR')


"New Year" FROM DUAL;

Output:

New Year
---------
01-JAN-93
Example:

SELECT TRUNC(TO_DATE('27-OCT-92','DD-MON-YY'), 'YEAR')


"New Year" FROM DUAL;

Output:

New Year
---------
01-JAN-92


Numeric Functions:

The following functions fall under this category.

ABS
ROUND
SIGN
TRUNC
CEIL
SQRT
FLOOR
MOD

Example:

SELECT ABS(-25) "Result"


FROM DUAL;

Output:

Result
----------
25

Example:

SELECT SIGN(-15) "Result"


FROM DUAL;

Output:

Result
----------
-1

Example:

SELECT CEIL(25.7) "Result"


FROM DUAL;

Output:

Result
----------
26

Example:

SELECT FLOOR(25.7) "Result"


FROM DUAL;

Output:

Result
----------
25

Example:

SELECT ROUND(25.29,1) "Round"


FROM DUAL;

Output:

Round
----------
25.3

SELECT ROUND(25.29,-1) "Round"


FROM DUAL;

Output:

Round
----------
30

Example:

SELECT TRUNC(25.29,1) "Truncate"


FROM DUAL;

Output:

Truncate
----------
25.2

Example:

SELECT TRUNC(25.29,-1) "Truncate"


FROM DUAL;

Output:

Truncate
----------
20

-1 truncates (sets to zero) the first digit to the left of the decimal point of 25.29.

Example:

SELECT MOD(11,4) "Modulus"


FROM DUAL;

Output:

Modulus
----------
3

Example:

SELECT SQRT(25) "Square root"


FROM DUAL;

Output:

Square root
-----------
5

Conversion Functions:

The following functions fall under this category

TO_CHAR
TO_DATE
TO_NUMBER

Example:

SELECT TO_CHAR(HIREDATE, 'Month DD, YYYY') "Result"


FROM emp
WHERE ename = 'BLAKE';

Output:

Result
------------------
May 01, 1981

Example:


SELECT TO_CHAR(-10000,'L99G999D99MI') "Result"


FROM DUAL;

Output:

Result
--------------
$10,000.00-

Example:

SELECT TO_DATE('January 15, 1989, 11:00 A.M.',


'Month dd, YYYY, HH:MI A.M.',
'NLS_DATE_LANGUAGE = American')
FROM DUAL;

Example:

SELECT TO_NUMBER('$10,000.00-', 'L99G999D99MI') "Result"


FROM DUAL;

Output:

Result
---------
-10000

Other Single Row Functions:

The following functions fall under this category.

NVL
VSIZE

Example:

SELECT ename, NVL(TO_CHAR(COMM), 'NOT APPLICABLE') "COMMISSION"


FROM emp
WHERE deptno = 30;

Output:

ENAME COMMISSION
---------- -------------------------
ALLEN 300
WARD 500
MARTIN 1400


BLAKE NOT APPLICABLE


TURNER 0
JAMES NOT APPLICABLE

Example:

SELECT ename, VSIZE (ename) "BYTES"


FROM emp
WHERE deptno = 10;

Output:

ENAME BYTES
---------- ----------
CLARK 5
KING 4
MILLER 6

Group Functions:

A group function as the name says, gets implemented on more than one record
within a column. They can be better understood by looking at their real world
examples. There are five very important group functions

AVG
COUNT
MAX
MIN
SUM
Example:

SELECT AVG(sal) "Average"


FROM emp;

Output:

Average
----------
2077.21429

Example:

SELECT COUNT(*) "Total"


FROM emp;

Output:

Total
----------
18


The total number of records in the table will be returned.

Example:

SELECT MAX(sal) "Maximum"


FROM emp;

Output:

Maximum
----------
5000

On the same line we can find out minimum value using the MIN group function.
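A sketch of the corresponding MIN query (the output depends on the data in emp):

SELECT MIN(sal) "Minimum"
FROM emp;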

Example:

SELECT SUM(sal) "Total"


FROM emp;

Output:

Total
----------
29081

BASIC STRUCTURE

SQL keywords

SQL keywords fall into several groups.

Data retrieval

The most frequently used operation in transactional databases is the data


retrieval operation. When restricted to data retrieval commands, SQL acts as a
declarative language.


• SELECT is used to retrieve zero or more rows from one or more tables in a
database. In most applications, SELECT is the most commonly used Data
Manipulation Language command. In specifying a SELECT query, the user
specifies a description of the desired result set, but they do not specify
what physical operations must be executed to produce that result set.
Translating the query into an efficient query plan is left to the database
system, more specifically to the query optimizer.
o Commonly available keywords related to SELECT include:
 FROM is used to indicate from which tables the data is to be
taken, as well as how the tables JOIN to each other.
 WHERE is used to identify which rows to be retrieved, or
applied to GROUP BY. WHERE is evaluated before the GROUP
BY.
 GROUP BY is used to combine rows with related values into
elements of a smaller set of rows.
 HAVING is used to identify which of the "combined rows"
(combined rows are produced when the query has a GROUP
BY keyword or when the SELECT part contains aggregates),
are to be retrieved. HAVING acts much like a WHERE, but it
operates on the results of the GROUP BY and hence can use
aggregate functions.
 ORDER BY is used to identify which columns are used to sort
the resulting data.

Example 1:

SELECT * FROM books WHERE price > 100.00 and price < 150.00 ORDER BY
title

This is an example that could be used to get a list of expensive books. It retrieves the records from the books table that have a price field greater than 100.00 and less than 150.00. The result is sorted alphabetically by book title. The asterisk (*) means to show all columns of the books table. Alternatively, specific columns could be named.

Example 2:

SELECT books.title, count(*) AS Authors FROM books JOIN book_authors ON


books.book_number = book_authors.book_number GROUP BY books.title

Example 2 shows both the use of multiple tables in a join, and aggregation
(grouping). This example shows how many authors there are per book. Example
output may resemble:

Title Authors
---------------------- -------
SQL Examples and Guide 3
The Joy of SQL 1
How to use Wikipedia 2
Pitfalls of SQL 1
How SQL Saved my Dog 1
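To illustrate the HAVING keyword described above, Example 2 could be restricted to books with more than one author. This is a sketch only, reusing the assumed books/book_authors tables:

SELECT books.title, count(*) AS Authors
FROM books JOIN book_authors ON books.book_number = book_authors.book_number
GROUP BY books.title
HAVING count(*) > 1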


Data manipulation

First there are the standard Data Manipulation Language (DML) elements. DML is
the subset of the language used to add, update and delete data.

• INSERT is used to add zero or more rows (formally tuples) to an existing


table.
• UPDATE is used to modify the values of a set of existing table rows.
• MERGE is used to combine the data of multiple tables. It is something of a
combination of the INSERT and UPDATE elements (see the sketch after the
examples below). It is defined in the SQL:2003 standard; prior to that, some
databases provided similar functionality via different syntax, sometimes
called an "upsert".
• TRUNCATE deletes all data from a table (non-standard, but common SQL
command).
• DELETE removes zero or more existing rows from a table.

Example:
INSERT INTO my_table (field1, field2, field3) VALUES ('test', 'N', NULL);
UPDATE my_table SET field1 = 'updated value' WHERE field2 = 'N';
DELETE FROM my_table WHERE field2 = 'N';
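A minimal MERGE sketch in the SQL:2003 style, assuming a hypothetical staging table new_rows with the same columns as my_table:

MERGE INTO my_table t
USING new_rows n
ON (t.field1 = n.field1)
WHEN MATCHED THEN
  UPDATE SET t.field2 = n.field2
WHEN NOT MATCHED THEN
  INSERT (field1, field2, field3) VALUES (n.field1, n.field2, n.field3);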

Data transaction

Transaction, if available, can be used to wrap around the DML operations.

• BEGIN WORK (or START TRANSACTION, depending on SQL dialect) can be


used to mark the start of a database transaction, which either completes
completely or not at all.
• COMMIT causes all data changes in a transaction to be made permanent.
• ROLLBACK causes all data changes since the last COMMIT or ROLLBACK to
be discarded, so that the state of the data is "rolled back" to the way it
was prior to those changes being requested.

COMMIT and ROLLBACK interact with areas such as transaction control and
locking. Strictly, both terminate any open transaction and release any locks held
on data. In the absence of a BEGIN WORK or similar statement, the semantics of
SQL are implementation-dependent.

Example:
BEGIN WORK;
UPDATE inventory SET quantity = quantity - 3 WHERE item = 'pants';
COMMIT;

Data definition

The second group of keywords is the Data Definition Language (DDL). DDL allows
the user to define new tables and associated elements. Most commercial SQL

databases have proprietary extensions in their DDL, which allow control over
nonstandard features of the database system.

The most basic items of DDL are the CREATE and DROP commands.

• CREATE causes an object (a table, for example) to be created within the


database.
• DROP causes an existing object within the database to be deleted, usually
irretrievably.

Some database systems also have an ALTER command, which permits the user
to modify an existing object in various ways -- for example, adding a column to
an existing table.

Example:
CREATE TABLE my_table (
my_field1 INT,
my_field2 VARCHAR (50),
my_field3 DATE NOT NULL,
PRIMARY KEY (my_field1, my_field2)
)

All DDL statements are auto-commit, so before dropping a table one needs to take a close look at its future needs.
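A sketch of the ALTER command mentioned above, adding a column to the table created in the example (the exact syntax varies slightly between vendors; some require the keyword COLUMN):

ALTER TABLE my_table ADD my_field4 VARCHAR(20);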

Data control

The third group of SQL keywords is the Data Control Language (DCL). DCL
handles the authorization aspects of data and permits the user to control who
has access to see or manipulate data within the database.

Its two main keywords are:

• GRANT — authorizes one or more users to perform an operation or a set of


operations on an object.
• REVOKE — removes or restricts the capability of a user to perform an
operation or a set of operations.

Example:
GRANT SELECT, UPDATE ON my_table TO some_user, another_user

Other

• ANSI-standard SQL supports -- as a single line comment identifier (some


extensions also support curly brackets or C-style /* comments */ for multi-
line comments).

Example:
SELECT * FROM inventory -- Retrieve everything from inventory table

• Some SQL servers allow User Defined Functions

2.2.3 SET OPERATIONS

Early on in school students were encouraged to draw "Venn diagrams" showing


intersecting sets, and then colour in the union or intersection of sets, or even the
complement of a set (everything outside the set).

The set operators in SQL are based on the same principles, except they don't
have a complement, and can determine the 'difference' between two sets. Here
are the operators which we apply to combine two queries:

• union - all elements in both queries are returned;


• intersect - elements common to both queries are returned;
• except - elements in the first query are returned excluding any that were
returned by the second query.

These are powerful ways of manipulating information, but take note: you can
only apply them if the results of the two queries (that are going to be combined)
have the same format - that is, the same number of columns, and identical
column types! (Although many SQLs try to be helpful by, for example, coercing
one data type into another, an idea which is superficially helpful and fraught with
potential for errors). The general format of such queries is illustrated by:

select columns1 from table1 union select columns2 from table2
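A small sketch against the emp table used earlier (intersect or except could be substituted for union, subject to the restrictions above):

SELECT ename FROM emp WHERE deptno = 30
UNION
SELECT ename FROM emp WHERE sal > 2000;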

Different strokes..

Different vendor implementations of SQL have abused the SQL-92 standard in


different ways. For example, Oracle uses minus where SQL-92 uses except.
{give other examples}.

An outer join could be used (with modification for NULLs if these little
monstrosities are present) to achieve the same result as except.

Similarly, an inner join (with select distinct) can do what intersect does.

Set operators can be combined (as you would expect when playing around with
sets) to achieve results that simply cannot be obtained using a single set
operator.

Note that there are some restrictions on using order by with set operators -
order by may only be used once, no matter how big the compound statement,
and the select list must contain the columns being used for the sort.

Pseudoset operators

Not content with implementing set operators, SQL database creators have also
introduced what are called "pseudoset operators". These operators don't fit
conveniently into set theory, because they allow multiple rows (redundancies)
which are forbidden in true sets.

We use the pseudoset operator union all to combine the outputs of two queries (all that is done is that the results of the second query are appended to the results of the first, duplicates included). Union all is also one of the building blocks used to simulate the FULL OUTER JOIN which, as we've already mentioned, is not implemented in many nominally "SQL-92 compliant" databases!

2.2.4 NULL VALUES

This special mark can appear instead of a value wherever a value can appear in SQL, in particular in place of a column value in some row. The deviation from the relational model arises from the fact that the implementation of this ad hoc concept in SQL involves the use of three-valued logic, under which the comparison of NULL with itself does not yield true but instead yields the third truth value, unknown; similarly the comparison of NULL with something other than itself does not yield false but instead yields unknown. It is because of this behaviour in comparisons that NULL is described as a mark rather than a value.

The relational model depends on the law of excluded middle, under which anything that is not true is false and anything that is not false is true; it also requires every tuple in a relation body to have a value for every attribute of that relation. This particular deviation is disputed by some, if only because E.F. Codd himself eventually advocated the use of special marks and a 4-valued logic; this was based on his observation that there are two distinct reasons why one might want to use a special mark in place of a value, which led opponents of such logics to identify further distinct reasons, of which at least 19 have been noted, which would require a 21-valued logic.

SQL itself uses NULL for several purposes other than to represent "value unknown". For example, the sum of the empty set is NULL, meaning zero, the average of the empty set is NULL, meaning undefined, and NULL appearing in the result of a LEFT JOIN can mean "no value because there is no matching row in the right-hand operand".

2.2.5 NESTED SUB-QUERIES, VIEWS AND COMPLEX QUERIES

NESTED SUBQUERIES

Subqueries

Wouldn't it be nice if you could perform one query, temporarily store the
result(s), and then use this result as part of another query? You can, and the


trickery used is called a subquery. The basic idea is that instead of a 'static'
condition, you can insert a query as part of a where clause! An example is:

Select * from tablename where value > (insert select statement here);

Note that in the above query, the inner select statement must return just one
value (for example, an average). There are other restrictions - the subquery must
be in parenthesis, and it must be on the right side of the conditional operator
(here, a greater than sign). You can use such subqueries with =, >, <, >=, <=
and <>, but not {to the best of my knowledge?} with between .. and.

Multiple select subqueries can be combined (using logical operators) in the same
statement, but avoid complex queries if you possibly can!

Subqueries that return multiple values


In the above, we made sure that our subquery only returned a single value. Can
you think of a way you might use a subquery that returns a list of values? (Such a
thing is possible)! Yes, you need an operator that works on lists. An example of
such an operator is in, which we've encountered before. The query should look
something like:

select * from tablename where value in (insert select statement here);

The assumption is that the nested select statement returns a list. The outer shell
of the statement can then use in to get cracking on the list, looking for a match
of value within the list! There is a surprisingly long list of operators that resemble
in, and can be used in a similar fashion. Here it is:

Operator: What it does

not in : There is no match with any value retrieved by the nested select statement.

in : We know how this works. Note that = any is a synonym for in that you'll sometimes encounter!

> any : The value is greater than any value in the list produced by the inner select statement. This is a clumsy way of saying "Give me the value if it's bigger than the smallest number retrieved"!

>= any, < any, <= any : Similar to > any. Usage should be obvious.

> all : Compare this with > any - it should be clear that the condition will only succeed if the value is bigger than the largest value in the list returned by the inner select statement!

>= all, < all, <= all : If you understand > all, these should present no problem!

= all : You're not likely to use this one much. It implies that (to succeed) all the values returned by the inner subquery are equal to one another and the value being tested!

(It is even possible in some SQL implementations to retrieve a 'table' (multiple


columns) using the inner select statement, and then use in to simultaneously
compare multiple values with the rows of the table produced by this inner select.
You'll probably never need to use something like this. Several other tricks are
possible, including the creation of virtual views by using a subquery [See Ladányi
p 409 if you're interested]. Views are discussed in the next section.)
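A sketch of that multi-column form of in (supported by Oracle and other systems that implement row-value constructors), again using the emp table:

SELECT *
FROM emp
WHERE (deptno, job) IN (SELECT deptno, job
                        FROM emp
                        WHERE ename = 'SMITH');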

Correlated Subqueries

Whew, we're nearly finished with the subqueries, but there is one more distinct
flavour! The correlated subquery is a nested select statement that can (using
trickery) refer to the outer select statement containing it. By so doing, we can
successively apply the inner select statement to each line generated by the outer
statement! The trick that we use is to create an alias for the outer select
statement, and then refer to this alias in the inner select statement, thus
constraining the inner select to dealing with the relevant row. For an example, we
return to our tedious drug table:

DrugDosing

Dosing   DoseMg   Frequency
D1       30       OD
D2       10       OD
D3       200      TDS

Let's say we wanted (for some obscure reason) all doses that are greater than
average dose, for each dosing frequency. [Meaningless, but it serves the
purposes of illustration].

Select DoseMg, Frequency from DrugDosing fred where DoseMg > (select avg(DoseMg) from DrugDosing where Frequency = fred.Frequency);

The sense of this statement should be clear - we use the outer select to choose
a row (into which we put DoseMg, and Frequency). We then check whether this
row is a candidate (or not) using the where statement. What does the where
statement check? Well, it makes sure that DoseMg is greater than a magic
number. The magic number is the average dose for a particular Frequency, the
frequency associated with the current row. The only real trickery is how we use
the label fred to refer to the current line from within the inner select statement.
This label fred is called an alias. We'll learn a lot more about aliases later (Some
will argue that aliases are so important you should have encountered them long
before, but we disagree).

Correlated subqueries are not the only way of doing the above. Using a
temporary view is often more efficient, but it's worthwhile knowing both
techniques. We discuss views next.
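For comparison, a sketch of the same query written with an inline (temporary) view in the from clause instead of a correlated subquery:

Select d.DoseMg, d.Frequency
from DrugDosing d,
     (select Frequency, avg(DoseMg) as AvgDose
      from DrugDosing
      group by Frequency) a
where d.Frequency = a.Frequency
  and d.DoseMg > a.AvgDose;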

VIEWS

As the name suggests, a view gives a particular user access to selected portions
of a table. A view is however more than this - it can limit the ability of a user to
update parts of a table, and can even amalgamate rows, or throw in additional
columns derived from other columns. Even more complex applications of views
allow several tables to be combined into a single view!

How do we make a view?


Interestingly enough, you use a select statement to specify the view you wish to
create. The syntax is:

Create view nameofview as select ... (details of the select statement here)

A variant that you will probably use rather often is:

Create or replace view nameofview as select ... (details of the select statement here)

(Otherwise you have to explicitly destroy a view - SQL won't simply overwrite a
view without the or replace instruction, but will instead give you an irritating
error).
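As a concrete sketch, a view over the DrugDosing table that exposes only two columns and only the once-daily rows might look like this (the view name is made up for illustration):

Create or replace view OnceDailyDoses as
  select Dosing, DoseMg
  from DrugDosing
  where Frequency = 'OD';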

Remember that if one alters the view, you alter the underlying table at
the same time!

One cannot use an order by statement (or something else called a for update
clause) within a view. There is a whole lot of other convenient things you can do
to views. Where you include summary statistics (eg count, sum, etc) in a view it

110 | P a g e Department of Information Technology


Database Management Systems

is termed an aggregate view. Likewise, using distinct, you can have a view on
the possible values of a column or grouping of columns.

You can even create a view that is derived from several tables (A multi-table
view). This is extremely sneaky, as you can largely avoid complex join
statements in code which pulls data out of several tables! Ladányi puts things
rather well:

"Pre-joined and tested views reduce errors .. that subtly or obviously undermine
the accuracy of reports, and thus the credibility and subsequent professional
well-being of the people creating them".

How do we limit access to a view?

This is implicit in the way we sneakily use a select statement - to limit access to
certain columns, for example, we just select the column-names we want access
to, and ignore the rest! There is of course a catch (Isn't there always?) - if you
insert a row, then SQL doesn't know what to put into the column entry that's not
represented in the view, so it will insert either NULL, or the default value for that
column. (Likewise, with delete, the entire row will be deleted, even the column
entry that is invisible).

It is also obvious how we limit access to certain rows - we use a where clause
that only includes the rows we want in the view. Note that (depending on your
selection criterion) it is possible to insert a row into a view (and thus the
underlying database) and then not be able to see this row in the view! With
(in)appropriate selection criteria for the view, one can also alter the properties of
rows visible in the view so that they now become hidden!

More draconic constraints are possible. The option with read only prevents any
modifications to the view (or the underlying database); while with check option
prevents you from creating rows in the view that cannot be selected (seen) in the
view itself. If you use the "with check option", then you should follow this with a
name, otherwise SQL will create an arbitrary and quite meaningless name for the
constraint that will only confuse you when an error occurs!
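A sketch of these options applied to the view defined earlier (Oracle-style syntax; the constraint name is made up):

Create or replace view OnceDailyDoses as
  select Dosing, DoseMg, Frequency
  from DrugDosing
  where Frequency = 'OD'
  with check option constraint OnceDailyDoses_chk;

Replacing the last line with "with read only" would instead make the view non-updatable.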

How do we remove a view?


This works in essentially the same way as dropping a table: drop view nameofview;

More things we can do with views


One trick is to create a "virtual view" and then use this as a "table" which you can
update. Instead of specifying a table name, you specify (in parenthesis) the
select statement that defines the "virtual view", and all actions are performed on
this temporary table! This is particularly useful where you don't have the system
authority to create a view, yet need the power of a view.

{Does SQL92/98 support the "comment on" facility for views??}


An under-utilised but rather attractive use of views is to make them based on set
(or pseudo-set) operators, for example the union of two sets.

COMPLEX QUERIES

In descriptive complexity, a query is a mapping from structures of one


vocabulary to structures of another vocabulary. Neil Immerman, in his book
"Descriptive Complexity", "use[s] the concept of query as the fundamental
paradigm of computation"

Given vocabularies σ and τ, we define the sets of structures on each vocabulary, STRUC[σ] and STRUC[τ]. A query is then any mapping I : STRUC[σ] → STRUC[τ].

Computational complexity theory can then be phrased in terms of the power of


the mathematical logic necessary to express a given query.

Order-independent queries

A query is order-independent if the ordering of objects in the structure does not affect the results of the query. In databases, these queries correspond to generic queries (Immerman 1999, p. 18). A query I is order-independent iff I(A) ≅ I(B) for any isomorphic structures A and B.

2.2.6 JOINED RELATIONS

Joins (multiple tables)

Remember the tables we discussed when we were talking about foreign keys,
and even earlier when we talked about common sense in normalising data, (not
that we are displaying conspicuous amounts of this with the following tables
which are, after all, only for demonstration purposes)! How does one amalgamate
the data in the tables? (In dataspeak, we call the relationship between the tables
master-detail, or sometimes, parent-child). Well, let's look at the tables..

DrugRegimen                          DrugDosing

Regimen  Drug           Dosing       Dosing  DoseMg  Frequency
R1       Carbimazole    D1           D1      30      OD
R2       Carbimazole    D2           D2      10      OD
R3       Carbamazepine  D3           D3      200     TDS

You might think that a natural extension of the good old select statement is the
following:

select * from DrugRegimen, DrugDosing;

and you would be perfectly correct, but what does the above statement give us
when we actually use it? Here we go..

REGIMEN  DRUG           DOSING  DOSING  DOSEMG  FREQUENCY

R1       CARBIMAZOLE    D1      D1      30      OD
R2       CARBIMAZOLE    D2      D1      30      OD
R3       CARBAMAZEPINE  D3      D1      30      OD
R1       CARBIMAZOLE    D1      D2      10      OD
R2       CARBIMAZOLE    D2      D2      10      OD
R3       CARBAMAZEPINE  D3      D2      10      OD
R1       CARBIMAZOLE    D1      D3      200     TDS
R2       CARBIMAZOLE    D2      D3      200     TDS
R3       CARBAMAZEPINE  D3      D3      200     TDS

Every single row of the first table has been joined with each and every row of the
second table, not just the rows that we think should correspond! (This is called a
Cartesian join or cross join, and can rapidly generate enormous tables - as
Ladányi points out, if you perform a cross join on three tables, each with a
thousand rows, then - voila - you have 1000 * 1000 * 1000 = one billion rows,
enough to bring most databases to their knees).

For our purposes, most of the rows in the above cross join are meaningless, but
we can easily reduce the rows to only those we are interested in. We simply use
a where statement to join the tables on the Dosing column, thus:

Select * from DrugRegimen,DrugDosing where DrugRegimen.Dosing =


DrugDosing.Dosing;


REGIMEN  DRUG           DOSING  DOSING  DOSEMG  FREQUENCY

R1       CARBIMAZOLE    D1      D1      30      OD
R2       CARBIMAZOLE    D2      D2      10      OD
R3       CARBAMAZEPINE  D3      D3      200     TDS

It's so important to always have a where condition with your Cartesian joins, let's
make it into a rule:

If the from clause in a select statement has a comma in it, check the where
clause.
Then check the where clause again. And again! ... Rule #4.

Also note that the two tables each had a column with the same name - "Dosing".
We easily sidestepped this one by simply talking about DrugRegimen.Dosing =
DrugDosing.Dosing, rather than, say, Dosing = Dosing, which would have forced
an error! Needless to say, you can select individual columns from the cross join,
rather than having to say select *.

Inner versus Outer Join

The above is an example of an inner join. What this means is that if, for every
value in the DrugRegimen.Dosing column, there's a corresponding value in the
DrugDosing.Dosing column, and vice versa, then everything's fine. However, if
(due to some silly person not enforcing relational integrity) there is no matching
value in the corresponding column, the whole row with its unmatched value will
disappear from the final report - it will softly and suddenly vanish away! Apart
from being a goad to ensure relational integrity in all of your databases, this
should alert you to the possibility that you might trustingly run a query on a
database, and get complete garbage out, because you used an inner join! The
solution is an outer join.

Needless to say, few vendors have stuck to the SQL-92 standard as regards outer
joins. For example, Oracle sneaks three tiny characters into the where statement
thus:

Select * from DrugRegimen, DrugDosing where DrugRegimen.Dosing (+) =


DrugDosing.Dosing;

The (+) tells SQL to "join in a NULL row if you can't find anything that matches a
problem row in DrugDosing.Dosing " - all very convenient, but not standard SQL-
92. Also note that the (+) is on the same side of the equals sign as the table that


is 'augmented', not the one that's causing the problem. It should be clear why
this is called a left outer join, and

Select * from DrugRegimen, DrugDosing where DrugRegimen.Dosing =


DrugDosing.Dosing (+);

.. is a right outer join.

Many vendor SQLs also do not implement the SQL standard for a full outer join,
that lists all rows (whether matched or not) from all tables. The SQL-92 syntax for
the from clause is:

from table1 FULL OUTER JOIN table2

There are other ways of achieving a full outer join in most SQLs. Remember
that the way to avoid all this is to meticulously enforce constraints on
integrity! Also note that an outer join will NOT help you if there are duplicate
entries in one of the tables you are using for the join (which can only occur in a
'relationally challenged' database).
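One such workaround, sketched here in SQL-92 join syntax (the union removes the duplicated matched rows), is to combine a left and a right outer join:

Select * from DrugRegimen LEFT OUTER JOIN DrugDosing
    ON DrugRegimen.Dosing = DrugDosing.Dosing
UNION
Select * from DrugRegimen RIGHT OUTER JOIN DrugDosing
    ON DrugRegimen.Dosing = DrugDosing.Dosing;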

2.2.7 DATA DEFINITION LANGUAGE

A Data Definition Language (DDL) is a computer language for defining data.


XML Schema is an example of a pure DDL (although only relevant in the context
of XML). A subset of SQL's instructions form another DDL.

For example in Oracle the DDL statements refer to CREATE, DROP, ALTER, etc..

These SQL statements define the structure of a database, including rows,


columns, tables, indexes, and database specifics such as file locations. DDL SQL
statements are more part of the DBMS and have large differences between the
SQL variations. DDL SQL commands include the following:

• Create - To make a new database, table, index, or stored query.

• Drop - To destroy an existing database, table, index, or view.

• Alter - To modify an existing database object.

DBCC (Database Console Commands) - Statements check the physical and logical
consistency of a database.

2.2.8 EMBEDDED SQL

Embedded SQL statements are SQL statements written within application


programming languages such as C and preprocessed by an SQL preprocessor


before the application program is compiled. There are two types of embedded
SQL: static and dynamic.

Embedded SQL is a method of combining the computing power of a


programming language (like C/C++, Pascal, etc.) and the database manipulation
capabilities of SQL.


The SQL standard defines embedding of SQL as embedded SQL, and the language in which SQL queries are embedded is referred to as the host language.

• SQL provides a powerful declarative query language. However, access to a


database from a general-purpose programming language is required
because,
o SQL is not as powerful as a general-purpose programming language.
There are queries that cannot be expressed in SQL, but can be
programmed in C, Fortran, Pascal, Cobol, etc.
o Nondeclarative actions -- such as printing a report, interacting with a
user, or sending the result to a GUI -- cannot be done from within
SQL.
• The result of the query is made available to the program one tuple (record)
at a time.
• To identify embedded SQL requests to the preprocessor, we use EXEC SQL
statement:

EXEC SQL embedded SQL statement END-EXEC

Note: A semi-colon is used instead of END-EXEC when SQL is embedded in C or


Pascal.

• Embedded SQL statements: declare cursor, open, and fetch statements.

EXEC SQL

declare c cursor for select cname, ccity from deposit, customer where
deposit.cname = customer.cname and deposit.balance > :amount

END-EXEC

where amount is a host-language variable.

EXEC SQL open c END-EXEC


This statement causes the DB system to execute the query and to save the
results within a temporary relation. A series of fetch statement are executed to
make tuples of the results available to the program.

EXEC SQL fetch c into :cn, :cc END-EXEC

The program can then manipulate the variable cn and cc using the features of
the host programming language. A single fetch request returns only one tuple.
We need to use a while loop (or equivalent) to process each tuple of the result
until no further tuples (when a variable in the SQLCA is set). We need to use
close statement to tell the DB system to delete the temporary relation that held
the result of the query.

EXEC SQL close c END-EXEC

• Embedded SQL can execute any valid update, insert, or delete statements.
• Dynamic SQL component allows programs to construct and submit SQL
queries at run time
• SQL-92 also contains a module language, which allows procedures to be
defined in SQL.


2.2.9 DYNAMIC SQL AND OTHER SQL FEATURES

Dynamic SQL allows you to write SQL that will then write and execute more SQL
for you. This can be a great time saver because you can:

• Automate repetitive tasks.


• Write code that will work in any database or server.
• Write code that dynamically adjusts itself to changing conditions

Most dynamic SQL is driven off the system tables, and all the examples I use here
will use system tables, but if you have suitable data in user tables, you can write
dynamic SQL from the contents of these tables, too.

These two statements generate Update Stats and DBCC commands for each user
table in the database. You can cut and paste the results into a SQL window and
run them, or save the output to a file for later use.
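The two generator statements themselves did not survive in these notes; one plausible form, using the SQL Server 2000 system table sysobjects (the specific DBCC command chosen here is illustrative), is:

SELECT 'UPDATE STATISTICS ' + name
FROM sysobjects
WHERE type = 'U'

SELECT 'DBCC CHECKTABLE (' + name + ')'
FROM sysobjects
WHERE type = 'U'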

For the first example there is an even quicker way of achieving this--you can use
the built-in SP sp_msforeachtable like this:

sp_msforeachtable 'update statistics ?'


The system stored procedure loops through the tables in the current database
and executes the command "Update statistics" with each table name substituted
where you see the '?' character.

Executing dynamic SQL automatically

You can use cursors and the EXEC or sp_executesql commands to execute your
dynamic SQL at the same time as you generate it. These commands take a char
or varchar parameter that contains a SQL statement to execute, and runs the
command in the current database. This script will execute the DBCC REINDEX
command for all the user tables in the current database.
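The script itself is missing from these notes; a minimal Transact-SQL sketch of the idea (names are illustrative; the actual command is DBCC DBREINDEX) could be:

-- build and run a DBCC DBREINDEX command for every user table
DECLARE @tbl sysname
DECLARE tbl_cursor CURSOR FOR
    SELECT name FROM sysobjects WHERE type = 'U'   -- 'U' = user table
OPEN tbl_cursor
FETCH NEXT FROM tbl_cursor INTO @tbl
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC ('DBCC DBREINDEX (' + @tbl + ')')
    FETCH NEXT FROM tbl_cursor INTO @tbl
END
CLOSE tbl_cursor
DEALLOCATE tbl_cursor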

Cross-database dynamic SQL

This short query interrogates the master database table sysdatabases, and
executes a DBCC command against every database. This sort of thing is
especially useful for carrying out database maintenance and backups with earlier
versions of SQL Server where the maintenance wizards are not very well
developed, or where you want to do some kind of non-standard task across all
databases.

select 'dbcc newalloc (' + name + ')'


from master..sysdatabases
where name not in ('tempdb', 'pubs')

If you were to expand this routine using the cursor method above you could write
a code block that would, for example, dynamically back up all databases, with
new and deleted databases catered for automatically.

2.2.10 QUERY BY EXAMPLE

Query by Example (QBE) is a database query language for relational


databases. It was devised by Moshé M. Zloof at IBM Research during the mid
1970s, in parallel to the development of SQL. It is the first graphical query
language, using visual tables where the user would enter commands, example
elements and conditions. Many graphical front-ends for databases use the ideas
from QBE today.

QBE is based on the notion of Domain relational calculus.

Query by Example (QBE) is a powerful search tool that allows anyone to search a
system for document(s) by entering an element such as a text string or
document name, quickly searching through documents to match your entered
criteria. It’s commonly believed that QBE is far easier to learn than other, more
formal query languages (e.g. SQL), while still providing people with the
opportunity to perform powerful searches.

Searching for documents based on matching text is easy with QBE; the user
simply enters (or copy and paste) the target text into the search form field. When

the user clicks search (or hits enter) the input is passed to the QBE parser for
processing. The query is created and then the search begins, using key words
from the input the user provided. It auto-eliminates mundane words such as and,
is, or, the, etc… to make the search more efficient and not to barrage the user
with results. However, when compared with a formal query, the results in the
QBE system will be more variable.

The user can also search for similar documents based on the text of a full document that he or she may have. This is accomplished by submitting one or more documents to the QBE results template. The QBE parser analyses the submitted document(s), generates the required query and submits it to the search engine, which then searches for relevant and similar material.

Example

A simple example using the Suppliers and Parts database is given here, just to
give you a feel for how QBE works.

This "query" selects all supplier numbers (S#) where the owner of the supplier
company is "J. Doe" and the supplier is located in "Rome".

Other commands like the "P." (print) command are: "U." (update), "I." (insert) and
"D." (delete).

The result of this query depends on the current contents of the Suppliers and Parts database.

2.2.11 DATALOG

Datalog is a query and rule language for deductive databases that syntactically
is a subset of Prolog. Its origins date back to around 1978 when Hervé Gallaire
and Jack Minker organized a workshop on logic and databases. The term Datalog
was coined in the mid 1980's by a group of researchers interested in database
theory.

Features, limitations and extensions

Query evaluation with Datalog is sound and complete and can be done efficiently
even for large databases. Query evaluation is usually done using bottom up
strategies. For restricted forms of datalog that don't allow any function symbols,
safety of query evaluation is guaranteed.


In contrast to Prolog, it

1. disallows complex terms as arguments of predicates, e.g. P(1, 2) is


admissible but not P(f1(1), 2),
2. imposes certain stratification restrictions on the use of negation and
recursion, and
3. only allows range restricted variables, i.e. each variable in the conclusion
of a rule must also appear in a not negated clause in the premise of this
rule.

Datalog was popular in academic database research but never succeeded in


becoming part of a commercial database system. Advantages of Datalog over
SQL such as the clean semantics or recursive queries were not sufficient. Modern
database systems, however, include ideas and algorithms developed for Datalog:
the SQL99 standard includes recursive queries and the Magic Sets algorithm
initially developed for the faster evaluation of Datalog queries is implemented in
IBM's DB2.

Extensions to Datalog were made to make it object-oriented, or to allow


disjunctions as heads of clauses. Both extensions have major impacts on the
definition of the semantics and the implementation of a corresponding Datalog
interpreter.

Example

Example Datalog program:

parent(bill,mary).
parent(mary,john).
ancestor(X,Y) :- ancestor(X,Z),ancestor(Z,Y).
ancestor(X,Y) :- parent(X,Y).

The ordering of the clauses is irrelevant in Datalog in contrast to Prolog which


depends on the ordering of clauses for computing the result of the query call.

Systems implementing Datalog

Most implementations of Datalog stem from university projects. Here is a short


list of a few systems that are either based on Datalog or are providing a Datalog
interpreter:

• bddbddb, an implementation of Datalog done at Stanford University. It


is mainly used to query Java bytecode including points-to analysis on large
Java programs.

• ConceptBase, a deductive and object-oriented database system based on


a Datalog query evaluator. It is mainly used for conceptual modeling and
meta-modeling.

• DES, an open-source implementation of Datalog to be used for teaching


Datalog in courses.


• DLV is a Logic Programming and Deductive Database system which implements Disjunctive Datalog, using a disjunctive version of Answer set programming, and adds many useful features also available in commercial database systems and query languages, such as aggregates, ODBC binding, etc.

• XSB is a Logic Programming and Deductive Database system for Unix and Windows.

2.2.12 USER INTERFACES AND TOOLS

The SmallSQL Database is a 100% pure Java DBMS for desktop applications. It
has a JDBC 3.0 driver and supports SQL-92 and SQL-99 standards. It has a very
small footprint of approx. 200KB for driver and database together. This is very
small for a JDBC 3.0 interface.

The difference to other 100% pure Java databases is that it has no network
interface and user management. The target applications are Java desktop
applications. There is no installation required.

Enterprise Manager is the primary administrative tool for Microsoft SQL Server
2000 and provides an MMC-compliant user interface that allows users to:

• Define groups of servers running SQL Server.


• Register individual servers in a group.
• Configure all SQL Server options for each registered server.
• Create and administer all SQL Server databases, objects, logins,
users, and permissions in each registered server.
• Define and execute all SQL Server administrative tasks on each
registered server.
• Design and test SQL statements, batches, and scripts interactively
by invoking SQL Query Analyzer.
• Invoke the various wizards defined for Microsoft SQL Server.

MMC is a tool that presents a common interface for managing different server
applications in a Microsoft Windows network. Server applications provide a
component called an MMC snap-in that presents MMC users with a user interface
for managing the server application. SQL Server Enterprise Manager is the
Microsoft SQL Server 2000 MMC snap-in.

To launch Enterprise Manager, select the Enterprise Manager icon in the


Microsoft SQL Server program group. On computers running Microsoft Windows
2000, you can also launch Enterprise Manager from Computer Management in
Control Panel. MMC snap-ins launched from Computer Management do not have
the ability to open child windows enabled by default. You may have to enable this
option to use all the Enterprise Manager features.


In SQL Server 2005, Enterprise Manager is replaced by Management Studio,


which provides a single interface to functionality provided by Enterprise Manager,
Query Analyzer and Profiler in SQL Server 2000

Knowledge Xpert for SQL Server is a comprehensive Windows-based


technical and process-oriented knowledge base covering the lifecycle of Microsoft
Transact-SQL development and administration. Hundreds of knowledge-
engineered topics provide best practices and explanations, along with examples
for writing optimized code for Microsoft SQL Server 2000 and 2005. Knowledge
Xpert for SQL Server gives developers and database administrators tools for high-
quality development and administration of SQL Server.

Knowledge Xpert for SQL Server is produced by Quest Software.

Capacity Manager for SQL Server is a graphical database administration tool


for managing disk-space utilization, managing table partitioning, and measuring
growth trends of instances, databases, and objects. A consolidated view lets the
user manage all file groups and data files from one place.

Capacity Manager for SQL Server is produced by Quest Software

Q-BANK

1.a)Explain about relational data base design. (7)-nov 2008

b)Illustrate how deletion is done in QBE with an example. (8)nov 2008

2.a)Write notes on the various set operations in relational algebra. (8)-nov 2008

b) Write short notes on embedded SQL. (7)-nov 2008

3)Explain the various relation algebra operations in detail (15)-nov 2007

4) Give a detailed account on Domain Relational Calculus (15)-nov 2007

5a)Explain structure of relational databases. (7)-april 2008

b)Explain the basic structure of SQL. (8)-april 2008

6)Discuss the features of query-by-example. (15)-april 2008

7)Describe the various concepts underlying the relational data model of a


banking enterprise. (15)-april 2009

8)Explain the following relational languages:

a)QBE (8)


b)Data log (7)-april 2009

9. Explain about View

10. Discuss about Tuple Relational Calculus

11. Consider the following employee database-May 2007

Employee database

Employee(emp_name,street,city,phone,pin)

Works(emp_name,company_name,salary)

Company(company_name,city)

Manages(emp_name,manager_name)

Write an SQL expression for the following queries

(a) Find the names, streets, cities, phones and pins of residence of all the employees who work for the ICICI insurance company and earn more than Rs.20,000. (4)

(b) Find all the employees names who do not work for ICICI insurance company
(3)

(c) Find all the employees who live in the same cities and on the same streets as do their managers. (4)

(d) Assume that the ICICI insurance company may be located in several cities. Find all companies located in every city in which the UTI insurance company is located. (4)

(12) Define relational algebra. Explain the different operations of relational algebra. (8) May 2007


UNIT - III

2Marks Q & A:

1. Define Functional Dependencies.


Let X ⊆ R and Y ⊆ R; then the functional dependency (FD) X → Y holds
on R if, in any relation r(R), for all pairs of tuples t1 and t2 in r such that
t1[X] = t2[X], it is also the case that t1[Y] = t2[Y].

2. Define Closure of Functional Dependency.


Let F be a set of FDs, then the closure of F is the set of all FDs logically
implied by F. It is denoted by F+

3. Define the three Armstrong’s Axioms or rules of inference.


• Reflexivity rule:
o If X is a set of attributes and Y ⊆ X, then X → Y holds.
• Augmentation rule:
o If X → Y holds and Z is a set of attributes, then ZX → ZY holds.
• Transitivity rule:
o If X → Y holds and Y → Z holds, then X → Z holds.

4. Define union, decomposition, and Pseudo-transitivity rules.


Union rule:
o If X → Y holds and X → Z holds, then X → YZ holds.
Decomposition rule:
o If X → YZ holds, then X → Y holds and X → Z holds.
Pseudo-transitivity rule:
o If X → Y holds and ZY → W holds, then XZ → W holds.

5. Define normalization of data and denormalization.


Normalization:
Process of decomposing an unsatisfactory relation schema into
satisfactory relation schemas by breaking their attributes, so as to satisfy
the desirable properties.

Denormalization:
Process of storing the join of higher normal form relations as base
relation, which is in the lower normal form.

6. Explain shortly the four properties or objectives of normalization.


• To minimize redundancy
• To minimize insertion, deletion, and updating anomalies.
• Lossless-join or non-additive join decomposition
• Dependency preservation

7. Define Partial Functional Dependency.


A FD X → Y is a partial dependency if some attribute A ∈ X can be
removed from X and the dependency still holds; i.e., for some A ∈ X,
(X − {A}) → Y.

8. What are the parts of SQL language?


The SQL language has several parts:
• data – definition language
• Data manipulation language
• View definition
• Transaction control
• Embedded SQL
• Integrity
• Authorization

9. What are the categories of SQL command?


SQL commands are divided in to the following categories:
1. data - definition language
2. data manipulation language
3. Data Query language
4. data control language
5. data administration statements
6. transaction control statements

10. What are the three classes of SQL expression?


SQL expression consists of three clauses:
Select
From
where

11. Give the general form of SQL query?


Select A1, A2, ..., An
From R1, R2, ..., Rm
Where P

12. What is the use of rename operation?


Rename operation is used to rename both relations and attributes.
It uses the as clause, taking the form:
Old-name as new-name
13. Define tuple variable?
Tuple variables are used for comparing two tuples in the same
relation. The tuple variables are defined in the from clause by way
of the as clause.

14. List the string operations supported by SQL?


1) Pattern matching Operation
2) Concatenation
3) Extracting character strings
4) Converting between uppercase and lower case letters.

15. List the set operations of SQL?


1) Union
2) Intersect operation
3) The except operation

16. What is the use of Union and intersection operation?


Union: The result of this operation includes all tuples that are either
in r1 or in r2 or in both r1 and r2. Duplicate tuples are automatically
eliminated.
Intersection: The result of this operation includes all tuples that are
in both r1 and r2.
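
For example, assuming depositor and borrower relations that each have a customer_name attribute:

(select customer_name from depositor)
union
(select customer_name from borrower)

(select customer_name from depositor)
intersect
(select customer_name from borrower)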

17. What are aggregate functions? And list the aggregate functions
supported by SQL?
Aggregate functions are functions that take a collection of
values as input and return a single value.

Aggregate functions supported by SQL are


Average: avg
Minimum: min
Maximum: max
Total: sum

18. What is the use of group by clause?


The group by clause is used to apply aggregate functions to a set
of tuples. The attributes given in the group by clause are used
to form groups. Tuples with the same value on all attributes in the
group by clause are placed in one group.
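
For example, to find the average balance of each branch, assuming an account relation with branch_name and balance attributes:

select branch_name, avg(balance)
from account
group by branch_name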

19. What is the use of sub queries?


A subquery is a select-from-where expression that is nested
within another query. A common use of subqueries is to perform tests
for set membership, make set comparisons, and determine set cardinality.
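
A small sketch of a set-membership test with a subquery, assuming depositor and borrower relations with a customer_name attribute:

select customer_name
from borrower
where customer_name in (select customer_name
                        from depositor)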


20. What is a view in SQL? How is it defined?


Any relation that is not part of the logical model, but is made visible
to a user as a virtual relation, is called a view. We define a view in SQL by
using the create view command. The form of the create view
command is
create view v as <query expression>
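
For example, a view over an assumed account relation:

create view adayar_accounts as
select acc_no, balance
from account
where branch_name = 'Adayar'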

21. What is the use of with clause in SQL?


The with clause provides a way of defining a temporary view
whose definition is available only to the query in which the with
clause occurs.

22. List the table modification commands in SQL?


Deletion
Insertion
Updates
Update of a view
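
Minimal sketches of each kind of modification, assuming an account relation with attributes acc_no, branch_name and balance:

delete from account where balance < 0
insert into account values ('A-101', 'Adayar', 1000)
update account set balance = balance * 1.05 where balance >= 1000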

23. List out the statements associated with a database transaction?


Commit work
Rollback work

24. What is transaction?


A transaction is a unit of program execution that accesses and possibly
updates various data items.

25. List the SQL domain Types?


SQL supports the following domain types.
1) Char(n) 2) varchar(n) 3) int 4) numeric(p,d)
5) float(n) 6) date.

26. What is the use of integrity constraints?


Integrity constraints ensure that changes made to the database
by authorized users do not result in a loss of data consistency. Thus
integrity constraints guard against accidental damage to the database.

27. Mention the 2 forms of integrity constraints in ER model?


Key declarations
Form of a relationship

28. What is trigger?


Triggers are statements that are executed automatically by the
system as the side effect of a modification to the database.

29. What are domain constraints?


A domain is a set of values that may be assigned to an
attribute. All values that appear in a column of a relation must be
taken from the same domain.

30. What are referential integrity constraints?


A value that appears in one relation for a given set of attributes
also appears for a certain set of attributes in another relation.


31. What is assertion? Mention the forms available.


An assertion is a predicate expressing a condition that we wish
the database always to satisfy. Forms available:
Domain integrity constraints
Referential integrity constraints

32. Give the syntax of assertion?


Create assertion <assertion name> check <predicate>

33. What is the need for triggers?


Triggers are useful mechanisms for alerting humans or for
starting certain tasks automatically when certain conditions are met.

34. List the requirements needed to design a trigger.


The requirements are
Specifying when a trigger is to be executed.
Specify the actions to be taken when the trigger executes.

35. Give the forms of triggers?


The triggering event can be insert or delete.
For updates, the trigger can specify columns.
The referencing old row as clause
The referencing new row as clause
The triggers can be initiated before the event or after the event.
36. What does database security refer to?
Database security refers to the protection from unauthorized access
and malicious destruction or alteration.

37. List some security violations (or) name any forms of malicious access.
Unauthorized reading of data
Unauthorized modification of data
Unauthorized destruction of data.

38. List the types of authorization.


Read authorization
Write authorization
Update authorization
Drop authorization

39. What is authorization graph?


Passing of authorization from one user to another can be
represented by an
authorization graph.

40. List out various user authorization to modify the database schema.
Index authorization
Resource authorization
Alteration authorization
Drop authorization

41. What are audit trails?


An audit trail is a log of all changes to the database along with
information such as which user performed the change and when the
change was performed.

42. Mention the various levels in security measures.


Database system
Operating system
Network
Physical
human

43. Name the various privileges in SQL?


Delete
Select
Insert
update
44. Mention the various user privileges.
All privileges directly granted to the user or role.
All privileges granted to roles that have been granted to the user or
role.

45. Give the limitations of SQL authorization.


The code for checking authorization becomes intermixed with
the rest of the application code.
Implementing authorization through application code rather
than specifying it declaratively in SQL makes it hard to ensure the
absence of loopholes.

46. Give some encryption techniques?


DES
AES
Public key encryption

47. What does authentication refer?


Authentication refers to the task of verifying the identity of a person.

48. List some authentication techniques.


Challenge response scheme
Digital signatures
Nonrepudiation

49. Define Boyce codd normal form


A relation schema R is in BCNF with respect to a set F of
functional dependencies if, for all functional dependencies in F+
of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:
• α → β is a trivial functional dependency (i.e., β ⊆ α)
• α is a superkey for schema R

50. List the disadvantages of relational database system


Repetition of data
Inability to represent certain information.

51. What is first normal form?


The domain of attribute must include only atomic (simple, indivisible)
values.

52. What is meant by functional dependencies?



Consider a relation schema R, with α ⊆ R and β ⊆ R. The
functional dependency α → β holds on relation schema R if, in
any legal relation r(R), for all pairs of tuples t1 and t2 in r such
that t1[α] = t2[α], it is also the case that t1[β] = t2[β].
53. What are the uses of functional dependencies?
To test relations to see whether they are legal under a given
set of functional dependencies. To specify constraints on the set of legal
relations.

54. Explain trivial dependency?


A functional dependency of the form α → β is trivial if β ⊆ α. Trivial
functional dependencies are satisfied by all the relations.

55. What are axioms?


Axioms or rules of inference provide a simpler technique for
reasoning about functional dependencies.

56. What is meant by computing the closure of a set of functional dependency?


The closure of F, denoted by F+, is the set of functional
dependencies logically implied by F.

57. What is meant by normalization of data?


It is a process of analyzing the given relation schemas based
on their Functional Dependencies (FDs) and primary keys to achieve
the desirable properties of
• Minimizing redundancy
• Minimizing insertion, deletion and updating anomalies

58. Explain the desirable properties of decomposition.


Lossless-join decomposition
Dependency preservation
Repetition of information

59. What is 2NF?


A relation schema R is in 2NF if it is in 1NF and every non-prime
attribute A in R is fully functionally dependent on primary key.


Lecture Notes:

Integrity and Security

Domain Constraints – Referential Integrity – Assertions – Triggers – Security


and Authorization – Authorization in SQL – Encryption and Authentication.

Relational-Database Design

First Normal Form – Pitfalls in Relational-Database Design – Functional


Dependencies – Decomposition – Desirable Properties of Decomposition – Boyce-
Codd Normal Form – Third Normal Form – Fourth Normal Form – More Normal
Forms – Overall Database Design Process.

Integrity and Security

• Integrity constraints ensure that changes made to the database by


authorized users do not result in a loss of data consistency.
• Thus, integrity constraints guard against accidental damage to the
database.

Example
 An account balance cannot be null.
 No two accounts can have the same acc_no.
 Every acc_no in the depositor relation must have a matching acc_no in the account relation.

Integrity and Security involves the following,

1. Domain Constraints
2. Referential Integrity
3. Assertions
4. Triggers
5. Security and Authorization
6. Authorization in SQL
7. Encryption and Authentication

1. Domain Constraints:
 Domain constraints are the most elementary form of integrity
constraint.

 They are tested easily by the system whenever a new data item is
entered into the database.

 It is possible for several attributes to have the same domain.

• For example, the attributes customer-name and employee-name might


have the same domain: The set of all person names.
• However, the domains of balance and branch-name certainly ought to
be distinct.


• It is perhaps less clear whether customer-name and branch-name should


have the same domain.
• At the implementation level, both customer names and branch names are
character strings.
• However, we would normally not consider the query “Find all customers
who have the same name as a branch” to be a meaningful query.
• Thus, if we view the database at the conceptual, rather than the physical
level, customer-name and branch-name should have distinct domains.

The create domain clause can be used to define new domains.

For example, the statements:

create domain Dollars numeric (12,2)


create domain Pounds numeric (12,2)

An attempt to assign a value of type Dollars to a value of type Pounds will result in a syntax error, since the two domains are distinct.

Consider another example,

create domain HourlyWage numeric (5,2)


constraint wage-value-test check (value >= 4.00)

The domain HourlyWage has a constraint that ensures that the hourly
wage is greater than or equal to 4.00. The clause constraint wage-value-test is optional,
and is used to give the name wage-value-test to the constraint. The name is used
to indicate which constraint an update violated.

As another example, the domain can be restricted to contain only a specified set
of values by using the in clause:
create domain AccountType char (10)
constraint account-type-test
check (value in (’Checking ’,’Saving ’))

The preceding check conditions can be tested quite easily, when a tuple is
inserted or modified.

2. Referential Integrity:
It is used to ensure that a value that appears in one relation for a given set
of attributes also appears for a certain set of attributes in another relation.

Example:

create table department (

dept_no number(5) primary key,

dept_name varchar2(10),

location varchar2(10)

);

create table employee (

employee_name varchar2(10),

salary number(5),

doj date,

dept_no number references department(dept_no)

);

An employee database stores the department in which each employee works.

The field "Dept_no" in the employee table is declared a foreign key, and it
refers to the field "Dept_no" in the Department table which is declared a primary
key.

Referential integrity would be broken by deleting a department from the
Department table if employees in the Employee table are listed as working
for that department.
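
The foreign-key declaration can also state what should happen instead of rejecting such a deletion. A sketch, assuming the same two tables as above, in which deleting a department automatically deletes the employees that reference it:

create table employee (

employee_name varchar2(10),

salary number(5),

doj date,

dept_no number references department(dept_no) on delete cascade

);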

Referential Integrity in E-R Model:

 Consider the relationship set R between entity sets E1 and E2. The
relational schema for R includes the primary keys K1 of E1 and K2 of E2.

 Then K1 and K2 form foreign keys on the relational schemas for E1 and E2
respectively.

3.Assertions

 An assertion is a predicate expressing a condition that we wish the


database always to satisfy.
 Domain constraints and referential-integrity constraints are special forms
of assertions.
 Assertions can be easily tested and can apply to a wide range of database
applications.
 Examples are


• The sum of all loan amounts for each branch must be less than the
sum of all account balances at the branch.
• Every loan has at least one customer who maintains an account
with a minimum balance of $1000.00.

 An assertion in SQL takes the form (Syntax)

create assertion <assertion-name> check


<predicate>

 Since SQL does not provide a “for all X, P(X)” construct (where P is a
predicate), we are forced to implement the construct by the equivalent
“not exists X such that not P(X)” construct, which can be written in SQL.

create assertion sum-constraint check


(not exists (select * from branch
where (select sum(amount) from loan
where loan.branch-name = branch.branch-name)
>= (select sum(balance) from account
where account.branch-name = branch.branch-name)))

 When an assertion is created, the system tests it for validity. If the


assertion is valid, then any future modification to the database is allowed
only if it does not cause that assertion to be violated.
 The high overhead of testing and maintaining assertions has led some
system developers to omit support for general assertions, or to provide
specialized forms of assertions that are easier to test.
4.Triggers

 A trigger is a statement that the system executes automatically as a side


effect of a modification to the database.
 To design a trigger mechanism, we must meet two requirements:

o Specify when a trigger is to be executed. This is broken up into an


event that causes the trigger to be checked and a condition that
must be satisfied for trigger execution to
proceed.
o Specify the actions to be taken when the trigger executes. The
above model of triggers is referred to as the event-condition-
action model for triggers.

 The database stores triggers just as if they were regular data, so that they
are persistent and are accessible to all database operations.
 Once we enter a trigger into the database, the database system takes on
the responsibility of executing it whenever the specified event occurs and
the corresponding condition is satisfied.

The triggering event and actions can take many forms:



 The triggering event can be insert or delete , instead of update .


 For updates, the trigger can specify columns whose update causes the
trigger to execute.
 The referencing old row as clause can be used to create a variable
storing the old value of an updated or deleted row.
 The referencing new row as clause can be used with inserts in addition
to updates.
 Triggers can be activated before the event (insert/delete/update) instead
of after the event.
 The clauses referencing old table as or referencing new table as can
then be used to refer to temporary tables (called transition tables )
containing all the affected rows.

Example of trigger for reordering an item,

create trigger reorder-trigger after update of amount on inventory
referencing old row as orow, new row as nrow
for each row
when nrow.level <= (select level
                    from minlevel
                    where minlevel.item = orow.item)
and orow.level > (select level
                  from minlevel
                  where minlevel.item = orow.item)
begin
  insert into orders
    (select item, amount
     from reorder
     where reorder.item = orow.item)
end

 minlevel(item, level), which notes the minimum amount of the item to be maintained.
 reorder(item, amount), which notes the amount of the item to be ordered when its level falls below the minimum.
 orders(item, amount), which notes the amount of the item to be ordered.

Another Example

create trigger overdraft_trigger after update on account
referencing new row as nrow
for each row
when nrow.balance < 0
begin atomic
  insert into borrower
    (select customer_name, acc_no
     from depositor
     where nrow.acc_no = depositor.acc_no);
  insert into loan values
    (nrow.acc_no, nrow.branch_name, nrow.balance);
  update account set balance = 0
    where account.acc_no = nrow.acc_no
end


5. Security and Authorization

 Security of data is an important concept in DBMS because it is essential to


safeguard the data against any unwanted users.

 5 different levels of security


 DB System Level
 Operating System Level
 Network Level
 Physical Level
 Human Level

1. DB system level

 Authentication (verification) and authorization mechanism to allow


specific users access only to required data.

2. Operating System Level:

 Protection from invalid logins


 File-level access protection
 Protection from improper use of “superuser” authority,
 Protection from improper use of privileged machine instructions.

3. Network level:

 Each site must ensure that it communicates only with trusted sites.


 Links must be protected from theft or modification of messages.
Mechanisms Used:
* Identification protocol (password based).
* Cryptography.

4. Physical Level:

* Protection of equipment from floods, power failure, etc.
* Protection of disks from theft, etc.
* Protection of network and terminal cables from wiretaps, etc.
Solution:
* Physical security by locks, etc.
* Software techniques to detect physical security breaches.


5. Human Level:
Protection from stolen passwords etc…
Solution:
* Frequent change of passwords.
* Data audits

 The data stored in the database need protection from unauthorized access
and malicious destruction or alteration, in addition to the protection
against accidental introduction of inconsistency that integrity constraints
provide.

Security Violations
Among the forms of malicious access are:
• Unauthorized reading of data (theft of information)
• Unauthorized modification of data
• Unauthorized destruction of data

Database security refers to protection from malicious access. Absolute


protection of the database from malicious abuse is not possible, but the cost to
the perpetrator can be made high enough to deter most if not all attempts to
access the database without proper authority.

To protect the database, we must take security measures at several levels:

• Database system. Some database-system users may be authorized to
access only a limited portion of the database. Other users may be allowed
to issue queries, but may be forbidden to modify the data. It is the
responsibility of the database system to ensure that these authorization
restrictions are not violated.
• Operating system. No matter how secure the database system
is, weakness in operating-system security may serve as a means of
unauthorized access to the database.
• Network. Since almost all database systems allow remote access through
terminals or networks, software-level security within the network software
is as important as physical security, both on the Internet and in private
networks.
• Physical. Sites with computer systems must be physically secured against
armed or surreptitious entry by intruders.
• Human. Users must be authorized carefully to reduce the chance of any
user giving access to an intruder in exchange for a bribe or other favors.

Authorization:
We may assign a user several forms of authorization on parts of the
database. For
example,
o Read authorization allows reading, but not modification, of data.
o Insert authorization allows insertion of new data, but not modification of
existing data.
o Update authorization allows modification, but not deletion, of data.
o Delete authorization allows deletion of data.


We may assign the user all, none, or a combination of these types of
authorization. In addition to these forms of authorization for access to data, we
may grant a user authorization to modify the database schema:
o Index authorization allows the creation and deletion of indices.
o Resource authorization allows the creation of new relations.
o Alteration authorization allows the addition or deletion of attributes in a
relation.
o Drop authorization allows the deletion of relations.

Authorization in SQL
• The SQL language offers a fairly powerful mechanism for defining
authorizations.

Privileges in SQL:
• The SQL standard includes the privileges delete ,insert ,select ,and
update .
• The select privilege corresponds to the read privilege.
• SQL also includes a references privilege that permits a user/role to
declare foreign keys when creating relations.
• If the relation to be created includes a foreign key that references
attributes of another relation, the user/role must have been granted
references privilege on those attributes.

The SQL data-definition language includes commands to grant and revoke


privileges. The grant statement is used to confer authorization. The basic form
of this statement is:

grant <privilege list> on <relation name or view name> to <user/role list>

The privilege list allows the granting of several privileges in one command.

• The following grant statement grants users U1, U2, and U3 select
authorization on the account relation:

grant select on account to U1, U2, U3

• In the example below, the grant statement gives users U1, U2, and U3
update authorization on the amount attribute of the loan relation:

grant update (amount) on loan to U1, U2, U3
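
The revoke statement takes back previously granted privileges; its form mirrors that of grant. For example, using the same users and relation as above:

revoke select on account from U1, U2, U3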

Encryption and Authentication:


The various provisions that a database system may make for authorization
may still not provide sufficient protection for highly sensitive data. In such cases,
data may be stored in encrypted form. It is not possible for encrypted data to be
read unless the reader knows how to decipher (decrypt ) them. Encryption also
forms the basis of good schemes for authenticating users to a database.
Encryption Techniques:
• There are a vast number of techniques for the encryption of data.
• Simple encryption techniques may not provide adequate security, since it
may be easy for an unauthorized user to break the code.


• As an example of a weak encryption technique, consider the substitution of


each character with the next character in the alphabet. Thus,
Perryridge
becomes
Qfsszsjehf
A good encryption technique has the following properties:

 It is relatively simple for authorized users to encrypt and decrypt data.
 It depends not on the secrecy of the algorithm, but rather on a parameter
of the algorithm called the encryption key.
 Its encryption key is extremely difficult for an intruder to determine.

Data Encryption Standard (DES):

The Data Encryption Standard (DES), does both a substitution of characters


and a rearrangement of their order on the basis of an encryption key. For this
scheme to work, the authorized users must be provided with the encryption
key via a secure mechanism. This requirement is a major weakness, since the
scheme is no more secure than the security of the mechanism by which the
encryption key is transmitted.

Public-key encryption:

Public-key encryption is an alternative scheme that avoids some of the


problems that we face with the DES. It is based on two keys: a public key and a
private key. Each user Ui has a public key Ei and a private key Di. All public
keys are published: they can be seen by anyone. Each private key is known to
only the one user to whom the key belongs. If user U1 wants to store encrypted
data, U1 encrypts them using public key E1. Decryption requires the private key D1.
For public-key encryption to work there must be a scheme for encryption that
can be made public without making it easy for people to figure out the scheme
for decryption. In other words, it must be hard to deduce the private key, given
the public key. Such a scheme does exist and is based on these conditions:
• There is an efficient algorithm for testing whether or not a number is
prime.
• No efficient algorithm is known for finding the prime factors of a number.

Authentication:
• Authentication refers to the task of verifying the identity of a
person/software connecting to a database. The simplest form of
authentication consists of a secret password which must be presented
when a connection is opened to a database.
• Password-based authentication is used widely by operating systems as
well as databases.
• A more secure scheme involves a challenge-response system. The
database system sends a challenge string to the user. The user encrypts
the challenge string using a secret password as encryption key, and then
returns the result. The database system can verify the authenticity of the
user by decrypting the string with the same secret password, and checking
the result with the original challenge string. This scheme ensures that no
passwords travel across the network.

• Another interesting application of public-key encryption is in digital


signatures to verify authenticity of data; digital signatures play the
electronic role of physical signatures on documents. The private key is
used to sign data, and the signed data can be made public. Anyone can
verify them by the public key, but no one could have generated the signed
data without having the private key. Furthermore, digital signatures also
serve to ensure nonrepudiation .


Relational Database Design



 It requires that we find a “good” collection of relation schemas.

Pit-falls in Relational Database design:


 A bad design may lead to
 a. Repetition of information, which leads to insertion, deletion and updation problems.
 b. Inability to represent certain information.

Design Goals:
a. Avoid redundant data.
b. Ensure that relationships among attributes are represented.
c. Facilitate the checking of updates for violation of db integrity
constraints.

 Example: Consider the relation schema:


 Lending-
schema=(branch_name,branch_city,assets,customer_name,loan_no,amoun
t)

Branch_name   Branch_city   Assets      Customer_name   Loan_no   Amount
Adayar        Chennai       90,00,000   Anu             L-01      1000
Bbbb          XXXX          20,00,000   aaa             L-02      2000
Cccc          YYYY          30,00,000   bbb             L-03      3000
Adayar        Chennai       90,00,000   Barathi         L-04      3500

Here the details of branch Adayar are represented two times. This leads to a redundancy
problem.

Redundancy:
 Data for branch_name, branch_city and assets are repeated for each loan
that a branch makes. This
a. wastes space
b. complicates updating, introducing the possibility of inconsistency of the assets value.

Decomposition:
* Decompose the relation-schema, lending schema into,

 Branch-schema=(branch_name,branch_city,assets)
 Loan-schema = (customer_name,loan_no,branch_name,amount)
 All attributes of original schema R must appear in decomposition(R1,R2)
 R= R1 U R2
 Lossless join decomposition.


 For all possible relations r on schema R,
r = ∏R1(r) ⋈ ∏R2(r)
FUNCTIONAL DEPENDENCIES:
• Functional dependencies play a key role in differentiating good database
designs from bad database designs.
• A functional dependency is a type of constraint that is a generalization
of the notion of key.
• Functional dependencies are constraints on the set of legal relations.
• They allow us to express facts about the enterprise that we are modeling
with our database.
• There is no fool-proof algorithmic method of identifying dependency. We
have to use our commonsense and judgment to specify dependencies.


 It requires that the value for a certain set of attributes determines uniquely
the value for another set of attributes.

 In a given relation R, X and Y are attributes. Attributes Y is functionally


dependent on attribute X if each value of X determines exactly one value
of Y, which is represented as

X → Y

i.e., "X determines Y" or "Y is functionally dependent on X"

E.G…

Marks → Grade

Types:

A. Full Dependencies:

 In relation R, X and Y are attributes. X functionally determines Y, and no proper subset of X functionally determines Y.

 In the above example, Marks is fully functionally dependent on student_no and course_no together and not on any subset of {student_no, course_no}.

B. Partial Dependencies:

 Attribute Y is partially dependent on attribute X only if it is dependent on a proper subset of X.

 For example, Course_name and Instructor_name are partially dependent on the composite attribute {student_no, course_no} because course_no alone determines Course_name and Instructor_name.


C. Transitive Dependencies:

X, Y and Z are 3 attributes in the relation R.

X → Y

Y → Z

Therefore X → Z

For example, Grade depends on Marks, and in turn Marks depends on {student_no, course_no}; hence Grade depends fully transitively on {student_no, course_no}.

Definition[single attribute]:

Let X and Y be two attributes of a relation. Given the value of X, if there is
only one value of Y corresponding to it, then Y is said to be functionally
dependent on X. This is indicated by the notation:

X → Y

For example, given the value of item code, there is only one value of item
name for it. Thus item name is functionally dependent on item code. This is
shown as:

Item code → Item name

Similarly in table 1, given an order number, the date of the order is known.

Definition[composite attribute]:

Functional dependency may also be based on a composite attribute. For


example, if we write

X, Z → Y

It means that there is only one value of Y corresponding to given values of X, Z.
In other words, Y is functionally dependent on the composite X, Z. In table 1,
mentioned below, for example, Order no and Item code together determine Qty
and Price. Thus:

Order no, Item code → Qty, Price

Order no   Order date   Item code   Quantity   Price/unit
1456       260289       3687        52         50.40
1456       260289       4627        38         60.20
1456       260289       3214        20         17.50
1886       040389       4629        45         20.25
1886       040389       4627        30         60.80
1788       040489       4627        40         60.20

Table 1: Normalized form of relation

As another example, consider the relation,

Student (Roll no, Name, Address, Dept, Year of study)

Observations:

 In this relation, Name is functionally dependent on Roll no.


 In fact, given the value of roll no, the values of all the other attributes can
be uniquely determined.
 Name and Department are not functionally dependent because, given the
name of a student, one cannot find the department uniquely. This is due
to the fact that there may be more than one student with the same name.
 Name in this case is not a key.
 Department and Year of study are not functionally dependent as year of
Study pertains to a student whereas Department is an independent
attribute.

The functional dependency of this relation is shown in figure 1


Figure 1: Dependency diagram for the relation "Student" (Roll no → Name, Address, Department, Year of study)

Relation key:

Given relation, if the value of an attribute X uniquely determines the


values of all other attributes in a row, then X is said to be the key of that relation.

For example, using Vendor code, the Vendor name and address are uniquely
determined.

 Thus vendor code is the relation key.

Composite relation key:

Sometimes more than one attribute is needed to uniquely determine


other attributes in a relation row.

Example:

'Supplies' (Vendor code, Item code, Qty supplied, Date of supply, Price/unit)

 In the Orders relation of table 1, Order no and Item code together form the key.

In Supplies, Vendor code and Item code together form the key. This dependency is
shown in the following diagram (figure 2).


Figure 2: Dependency diagram for the relation "Supplies" (Vendor code, Item code → Quantity supplied, Date of supply, Price/unit)

JOIN DEPENDENCIES:

• Join dependencies constrain the set of legal relations over a schema R to


those relations for which a given decomposition is a lossless-join
decomposition.

• Let R be a relation schema and R1 , R2 ,..., Rn be a decomposition of R.


If R = R1 ∪ R2 ∪ …. ∪ Rn, we say that a relation r(R) satisfies the join
dependency *(R1 , R2 ,..., Rn) if:
r = ∏R1(r) ⋈ ∏R2(r) ⋈ … ⋈ ∏Rn(r)
A join dependency is trivial if one of the Ri is R itself.

• A join dependency *(R1, R2) is equivalent to the multivalued dependency R1 ∩ R2 ↠ R2.
Conversely, α ↠ β is equivalent to *(α ∪ (R − β), α ∪ β).

• However, there are join dependencies that are not equivalent to any
multivalued dependency.

Project-Join Normal Form (PJNF):

• A relation schema R is in PJNF with respect to a set D of functional,


multivalued, and join dependencies if for all join dependencies in D+ of the
form
*(R1 , R2 ,..., Rn ) where each Ri ⊆ R
and R =R1∪ R2 ∪ ... ∪ Rn
at least one of the following holds:
 *(R1 , R2 ,..., Rn ) is a trivial join dependency.
 Every Ri is a superkey for R.

• Since every multivalued dependency is also a join dependency,


every PJNF schema is also in 4NF.

Example:
o Consider Loan-info-schema = (branch-name, customer-name, loan-
number, amount).

o Each loan has one or more customers, is in one or more branches and has
a loan amount; these relationships are independent, hence we have the
join dependency
o *((loan-number, branch-name), (loan-number, customer-name), (loan-number, amount))

o Loan-info-schema is not in PJNF with respect to the set of dependencies


containing the above join dependency.

o To put Loan-info-schema into PJNF, we must decompose it into the three


schemas specified by the join dependency:

 (loan-number, branch-name)
 (loan-number, customer-name)
 (loan-number, amount)

Normal Forms

Normalization

 Db designed based on E-R model may have some amount of inconsistency


(variation), uncertainty (in security) and redundancy (duplication).

 To eliminate these drawbacks some refinement has to be done on the db.

 Refinement process is called normalization.

 Normalisation is defined as a step-by-step process of decomposing a
complex relation into simple and stable relations.

 A DB schema design tool

 A process of replacing associations among attributes in a relation


schema

 An approximation of the relation schemas that should be created

 Objectives: accomplish the goals of relational DB design

 2 approaches: decomposition and synthesis



Purpose of Normalization

 Minimize redundancy in data

 Remove insert, delete and update anomaly (irregularity) during db


activities.

 Reduce the need to reorganize the data when it is modified or enhanced.

 Because of duplicate data elimination, we will be able to reduce the overall


size of the db.

 The different stages of normalization is called as normal forms. They are,

 1NF
 2NF
 3NF
 BCNF

Restrictions on the DB schema that preclude certain undesirable properties


(data redundancy, update anomaly, loss of information, etc.) from the DB.

The normal forms form a nested hierarchy: a relation schema R in PJNF is also in 4NF;
R in 4NF is also in BCNF; R in BCNF is also in 3NF; R in 3NF is also in 2NF;
and R in 2NF is also in 1NF.


Decomposition

 A process to split or decompose a relation until the resultant


relations no longer exhibit the undesirable problems, e.g., data
redundancy, data inconsistency, anomaly, etc.

 Decomposing a relation schema R means breaking R into a pair


of schemas, possibly intersecting

• this process is repeated until all the decomposed relation


schemas are in the desired (normal) form.

First Normal Form(1 NF)

 A relation schema R is in 1NF if


* all the attributes of the relation R are atomic in nature.

E.g., DEPT

Suppose we extend it by including a DLOCATIONS attribute as shown below. We
assume that each dept may have a number of locations.

This is not in 1NF because DLOCATIONS is not an atomic attribute.

DNAME           DNO   DHEAD    DLOCATIONS
Research        3     John     (Mianus, Rye, Stratford)
Administrator   2     Prince   Mianus
Headquarter     1     Peter    Rye
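
One way to bring DEPT into 1NF is to move the multivalued DLOCATIONS attribute into a separate relation with one row per location. A sketch, with column types assumed:

create table dept (
    dno    number(2) primary key,
    dname  varchar2(20),
    dhead  varchar2(20)
);

create table dept_locations (
    dno        number(2) references dept(dno),
    dlocation  varchar2(20),
    primary key (dno, dlocation)
);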

Second Normal Form (2 NF)

A relation R is in 2NF if and only if,


• It is in the 1NF and


• No partial dependency exists between non-key attributes and key


attributes.

• The test for 2NF involves testing for functional dependencies whose left
hand side attributes are part of primary key. If the primary key contains a
single attribute, the test need not be applied at all.
• A relation schema R is in 2NF if every non-prime attribute A in R is
fully functionally dependent on the primary key of R.

• E.g., consider the EMP_PROJ relation; it is in 1NF but not in 2NF.

The non-prime attribute ENAME violates 2NF because of FD2, as do the non-prime
attributes PNAME and PLOCATION because of FD3; FD2 and FD3 make ENAME,
PNAME and PLOCATION partially dependent on the primary key {SSN, PNO}, thus
violating the 2NF test.

 The functional dependencies FD1, FD2 and FD3 lead to the decomposition
of EMP_PROJ into the 3 relation schemas EP1, EP2 and EP3, each of which is
in 2NF.

EP1: (SSN, PNO, HOURS)         FD1: SSN, PNO → HOURS
EP2: (SSN, ENAME)              FD2: SSN → ENAME
EP3: (PNO, PNAME, PLOCATION)   FD3: PNO → PNAME, PLOCATION


THIRD NORMAL FORM

RULES:

1. The relation should be in second normal form.

2. All the non-key attributes should be functionally dependent only on the
key attribute.

3. If there is functional dependency between two non-key attributes


then there will be duplication of data.

Consider a relation,

Student ( rollno, name, dept, year, hostel_name);

A 2NF Form Relation

Rollno   Name       Dept        Year   Hostel_name
1784     Raman      Physics     1      Ganga
1648     Krishnan   Chemistry   1      Ganga
1768     Gopalan    Maths       2      Kaveri
1848     Raju       Botany      2      Kaveri
1682     Maya       Geology     3      Krishna
1485     Singh      Zoology     4      Godavari

Dependency diagram for the relation: Rollno → Name, Department, Year; Year → Hostel_name

If it is decided to ask all first year students to move to Kaveri hostel, and
all second year students to Ganga hostel, this change should be made in many
places in the first table. Also, when a student's year of study changes, his hostel
change should also be noted in the first table. This is undesirable.


The first table is thus in 2NF but not in 3NF. To transform it to 3NF, we should
introduce another relation which includes the functionally related non-key
attributes. This is shown below.

Conversion of first table into two 3NF relations

Rollno   Name       Dept        Year
1784     Raman      Physics     1
1648     Krishnan   Chemistry   1
1768     Gopalan    Maths       2
1848     Raju       Botany      2
1682     Maya       Geology     3
1485     Singh      Zoology     4

Year   Hostel_name
1      Ganga
2      Kaveri
3      Krishna
4      Godavari
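
The same 3NF decomposition can be expressed as table definitions; a sketch, with column types assumed:

create table year_hostel (
    year         number(1) primary key,
    hostel_name  varchar2(20)
);

create table student (
    rollno  number(5) primary key,
    name    varchar2(20),
    dept    varchar2(20),
    year    number(1) references year_hostel(year)
);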

BOYCE-CODD NORMAL FORM (BCNF):

Assumptions:

1. Assume that a relation has more than one possible key.

2. Assume further that the composite keys have a common attribute in that
relation.

If an attribute of a composite key is dependent on an attribute of the other


composite key, a normalization called BCNF is needed. Consider, as an example,
the relation Professor:

Example:

Professor (Professor code, Dept, Head of Dept, Percent time);

Constraints in that relation:

It is assumed that

1. A professor can work in more than one department

2. The percentage of the time he spends in each department is given.

3. Each department has only one Head of Department.


Keys:

1. Professor Code

2. Department

Composite keys: (Possible cases)

Case 1: Professor code + Department

By combining these two keys we can find out what percentage of his time each professor has spent in each department.

Case 2: Professor code + Head of Dept

Using Professor code and Head of Dept we can find out what percentage of his time a professor has spent in a particular department.

Case 3: Professor code + Percent of time

We cannot find out in which department he spent that percentage of time.

Case 4: Department + Head of Dept cannot be a composite key.

Case 5: Department + Percent of time cannot be a composite key.

The dependency diagram for the above relation is given in the figure; the table gives
the relation attributes. The two possible composite keys are (Professor code + Department)
and (Professor code + Head of Dept). Observe that Department as well as Head of Dept are
not non-key attributes; they are part of a composite key.


Dependency diagram of Professor relation:
Professor code + Department → Percent time
Department → Head of Dept
Head of Dept → Department


Normalisation of relation “Professor”

Professor code   Department    Head of Dept   Percent time
P1               Physics       Ghosh          50
P1               Mathematics   Krishnan       50
P2               Chemistry     Rao            25
P2               Physics       Ghosh          75
P3               Mathematics   Krishnan       100

The relation given in the table is in 3NF. Observe, however, that the names of
Dept and Head of Dept are duplicated. Further, if professor P2 resigns, rows 3 and
4 are deleted, and we lose the information that Rao is the Head of Dept of Chemistry.

The normalization of the relation is done by creating a new relation for


Dept and Head of Dept and deleting Head of Dept from professor relation. The
normalized relations are shown in the following table.

Professor code   Department    Percent time
P1               Physics       50
P1               Mathematics   50
P2               Chemistry     25
P2               Physics       75
P3               Mathematics   100

Department    Head of Dept
Physics       Ghosh
Mathematics   Krishnan
Chemistry     Rao

The dependency diagrams for these new relations are shown in the figure; the
dependency diagram gives the important clue to this normalization step, as is
clear from the figures.

Dependency diagram for the Professor relation:
Professor code + Department → Percent time
Department → Head of Dept

Comparison of BCNF and 3NF

1. We have seen BCNF and 3NF.


o It is always possible to obtain a 3NF design without sacrificing
lossless-join or dependency-preservation.
o If we do not eliminate all transitive dependencies, we may need to
use null values to represent some of the meaningful relationships.
o Repetition of information occurs.

2. These problems can be illustrated with Banker-schema.


o As banker-name → bname, we may want to express relationships
between a banker and his or her branch.

ENAME BANKER-NAME BNAME


Bill Jhon SFU

Tom Jhon SFU

Mary Jhon SFU

null Tim Austin

Figure: An instance of Banker-schema.

o Figure shows how we must either have a corresponding value for


customer name, or include a null.
o Repetition of information also occurs.
o Every occurrence of the banker's name must be accompanied by
the branch name.
3. If we must choose between BCNF and dependency preservation, it is
generally better to opt for 3NF.
o If we cannot check for dependency preservation efficiently, we
either pay a high price in system performance or risk the integrity of
the data.
o The limited amount of redundancy in 3NF is then a lesser evil.
4. To summarize, our goal for a relational database design is


o BCNF.
o Lossless-join.
o Dependency-preservation.
5. If we cannot achieve this, we accept
o 3NF
o Lossless-join.
o Dependency-preservation.
6. A final point: there is a price to pay for decomposition. When we
decompose a relation, we have to use natural joins or Cartesian products
to put the pieces back together. This takes computational time.

Overall database design process

Database design is the process of producing a detailed data model of a


database. This model contains all the needed logical and physical design choices
and physical storage parameters needed to generate a design in a Data
Definition Language, which can then be used to create a database. A fully
attributed data model contains detailed attributes for each entity.

This process shows how normalization fits into the overall database design. Let us
assume that a relation schema R is given, and we proceed to normalize it. There are
several ways in which we could have come up with the schema R:

1. R could have been generated when converting an E-R diagram to a set of tables.
2. R could have been a single relation containing all attributes that are of
interest. The normalization process then breaks up R into smaller relations.
3. R could have been the result of some ad hoc design of relations, which we
then test to verify that it satisfies a desired normal form.

E-R Model and Normalization

When we carefully define an E-R diagram, identifying all entities correctly, the
tables generated from the E-R diagram should not need further normalization.
However, there can be functional dependencies between attributes of an entity.
For instance, suppose an employee entity had attributes department-number and
department-address, and there is a functional dependency department-number
→ department-address. We would then need to normalize the relation generated
from employee.
Most examples of such dependencies arise out of poor E-R diagram design.
Functional dependencies can help us detect poor E-R design. If the generated
relations are not in desired normal form, the problem can be fixed in the E-R
diagram. That is, normalization can be done formally as part of data modeling.
Alternatively, normalization can be left to the designer’s intuition during E-R
modeling, and can be done formally on the relations generated from the E-R
model.

The Universal Relation Approach

The second approach to database design is to start with a single relation


schema containing all attributes of interest, and decompose it. One of our goals
in choosing a decomposition was that it be a lossless-join decomposition. To
consider losslessness, we assumed that it is valid to talk about the join of all the

relations of the decomposed database. Tuples that disappear when we compute


the join are dangling tuples. Dangling tuples may occur in practical database
applications. They represent incomplete information, as they do in our example,
where we wish to store data about a loan that is still in the process of being
negotiated. This relation is called a universal relation, since it involves all the
attributes in the universe defined by R1 U R2 U · · · U Rn. Thus, a particular
decomposition defines a restricted form of incomplete information that is
acceptable in our database. The normal forms that we have defined generate
good database designs from the point of view of representation of incomplete
information.

Another consequence of the universal relation approach to database design is
that attribute names must be unique in the universal relation. We cannot use
name to refer to both customer-name and to branch-name. It is generally
preferable to use unique names, as we have done. Nevertheless, if we defined
our relation schemas directly, rather than in terms of a universal relation, we
could obtain relations on schemas such as the following for our banking example:

branch-loan (name, number)


loan-customer (number, name)
amt (number, amount)
We believe that using the unique-role assumption (that each attribute
name has a unique meaning in the database) is generally preferable to reusing
the same name in multiple roles. When the unique-role assumption is not
made, the database designer must be especially careful when constructing a
normalized relational-database design.
Denormalization for Performance

• Denormalization is the process of storing the join of higher normal form
relations as a base relation, which is in a lower normal form.
• It is sometimes used to speed up frequently asked queries, at the cost of
extra storage and extra work to keep the redundant copies consistent when
the database is updated.

Other design issues:

• Some aspects of database design are not caught by normalization.
• Examples of bad database design, to be avoided:
Instead of earnings (company_id, year, amount), use
o earnings_2004, earnings_2005, earnings_2006, etc., all on the
schema (company_id, earnings).
 The above are in BCNF, but make querying across years
difficult and need a new table each year.
o company_year (company_id, earnings_2004, earnings_2005, earnings_2006)
 Also in BCNF, but this too makes querying across years
difficult and requires a new attribute each year.
 It is an example of a crosstab, where values for one
attribute become column names.
 Crosstabs are used in spreadsheets and in data analysis tools.

Q - BANK

1. Discuss in detail about assertations

2. Write short notes on:-May 2007

(a) Database security and integrity (7)


(b) Query Processing (8)

3.Write short notes on the following:-April/May 2009

(a)Referential Integrity(7)

(b) Functional Dependencies(8)

4.Explain the two normal forms for Relational database schemes and also
compare it (15)-April/May 2009

5.(a) Explain security and authorization (7) April/may2008

(b) Explain the basic structure of SQL (8)

6.(a) Explain Boyce-Codd normal form (8) April/may2008

(b) Explain overall database design process(7)

7. (a)What is meant by authorization ?

Explain the different forms of authorization(7) Nov/Dec 2008

(b) Discuss in detail about Normalization using functional and join dependencies (8)

8.(a) Write short note on encryption (7)-Nov/dec2008

(b) Explain third normal form and BCNF(8)

9. Give an algorithm for testing lossless join and describe the pipeline in detail
(15)- Nov2007

10. Explain good and bad decomposition of normalization with examples (15)-
Nov2007


UNIT - IV
2Marks Q & A:

1. Give the measures of quality of a disk.


Capacity
Access time
Seek time
Data transfer rate
Reliability
Rotational latency time.

2. Compare sequential access devices versus random access devices with an


example

Sequential access devices                Random access devices
Must be accessed from the beginning      It is possible to read data from any location
Eg: Tape storage                         Eg: Disk storage
Access to data is much slower            Access to data is faster
Cheaper than disk                        Expensive when compared with disk

3. What are the types of storage devices?


Primary storage
Secondary storage
Tertiary storage

4. What are called jukebox systems?


Jukebox systems contain a few drives and numerous disks that can be
loaded into one of the drives automatically.

5. What is called remapping of bad sectors?


If the controller detects that a sector is damaged when the disk is initially
formatted, or when an attempt is made to write the sector, it can
logically map the sector to a different physical location.

6. Define access time.


Access time is the time from when a read or write request is issued to
when data transfer begins.

7. Define seek time.


The time for repositioning the arm is called the seek time, and it
increases with the distance that the arm must move.
8. Define average seek time.
The average seek time is the average of the seek times, measured over a
sequence of random requests

9. Define rotational latency time.


The time spent waiting for the sector to be accessed to appear under the
head is called the rotational latency time.

10. Define average latency time.


The average latency time of the disk is one-half the time for a full
rotation of the disk.

11. What is meant by data-transfer rate?


The data-transfer rate is the rate at which data can be retrieved from or
stored to the disk.

12. What is meant by mean time to failure?


The mean time to failure is the amount of time that the system could run
continuously without failure.

13. What are a block and a block number?


A block is a contiguous sequence of sectors from a single track of one
platter.
Each request specifies the address on the disk to be referenced. That
address is in the form of a block number.

14. What are called journaling file systems?


File systems that support log disks are called journaling file systems.

15. What is the use of RAID?


A variety of disk-organization techniques, collectively called redundant
arrays of independent disks are used to improve the performance and
reliability.

16. Explain how reliability can be improved through redundancy?


The simplest approach to introducing redundancy is to duplicate every
disk. This technique is called mirroring or shadowing. A logical disk then
consists of two physical disks, and write is carried out on both the disk. If
one of the disks fails the data can be read from the other. Data will be
lost if the second disk fails before the first failed disk is repaired.
17. What is called mirroring?
The simplest approach to introducing redundancy is to duplicate every
disk. This technique is called mirroring or shadowing.


18. What is called mean time to repair?


The mean time to repair is the time it takes to replace a failed disk and
to restore the data on it.

19. What is called bit-level striping?


Data striping consists of splitting the bits of each byte across multiple
disks. This is called bit-level striping.

20. What is called block-level striping?


Block level striping stripes blocks across multiple disks. It treats the array
of disks as a large disk, and gives blocks logical numbers.

21. What are the two main goals of parallelism?


Load-balance multiple small accesses, so that the throughput of such
accesses increases.
Parallelize large accesses so that the response time of large accesses is reduced.

22. What are the factors to be taken into account when choosing a RAID level?
• Monetary cost of extra disk storage requirements.
• Performance requirements in terms of number of I/O operations
• Performance when a disk has failed.
• Performances during rebuild.

23. What is meant by software and hardware RAID systems?


RAID can be implemented with no change at the hardware level, using
only software modification. Such RAID implementations are called
software RAID systems and the systems with special hardware support
are called hardware RAID systems.

24. Define hot swapping?


Hot swapping permits the removal of faulty disks and replaces it by new
ones without turning power off. Hot swapping reduces the mean time to
repair.
25. Which level of RAID is best? Why?
RAID level 1 is the RAID level of choice for many applications with
moderate storage requirements and high I/O requirements. RAID 1
follows mirroring and provides best write performance.

26. Distinguish between fixed length records and variable length records?
Fixed length records
Every record has the same fields and field lengths are fixed.
Variable length records
File records are of same typ e but one or more of the fields are of
varying size.

27. What are the ways in which the variable-length records arise in database
systems?
o Storage of multiple record types in a file.
o Record types that allow variable lengths for one or more fields.
o Record types that allow repeating fields.

28. Explain the use of variable length records.


o They are used for storing multiple record types in a file.
o They are used for storing records that have varying lengths for one or
more fields.
o They are used for storing records that allow repeating fields.

29. What is the use of a slotted-page structure and what is the information
present in the header?
The slotted-page structure is used for organizing records within a
single block. The header contains the following information.
o The number of record entries.
o The end of free space in the block.
o An array whose entries contain the location and size of each record.

30. What are the two types of blocks in the fixed-length representation?
Define them.
Anchor block: Contains the first record of a chain.
Overflow block: Contains the records other than those that are the
first record of a chain.

31. What is known as heap file organization?


In the heap file organization, any record can be placed anywhere in the
file where there is space for the record. There is no ordering of records.
There is a single file for each relation.

32. What is known as sequential file organization?


In the sequential file organization, the records are stored in sequential
order, according to the value of a “search key” of each record.

33. What is hashing file organization?


In the hashing file organization, a hash function is computed on some
attribute of each record. The result of the hash function specifies in which
block of the file the record should be placed.

34. What is known as clustering file organization?


In the clustering file organization, records of several different relations
are stored in the same file.

35. What is an index?


An index is a structure that helps to locate desired records of a relation
quickly, without examining all records.

36. What are the two types of ordered indices?


Primary index
Secondary index

37. What are the types of indices?


Ordered indices
Hash indices

38. What are the techniques to be evaluated for both ordered indexing and
hashing?
Access types


Access time
Insertion time
Deletion time
Space overhead

39. What is known as a search key?


An attribute or set of attributes used to look up records in a file is called
a search key.

40. What is a primary index?


A primary index is an index whose search key also defines the
sequential order of the file.
41. What are called index-sequential files?
The files that are ordered sequentially with a primary index on the
search key are called index-sequential files.

42. What are the two types of indices?


Dense index
Sparse index

43. What are called multilevel indices?


Indices with two or more levels are called multilevel indices.

44. What are called secondary indices?


Indices whose search key specifies an order different from sequential
order of the file are called secondary indices. The pointers in secondary
index do not point directly to the file. Instead each points to a bucket that
contains pointers to the file.

45. What are the disadvantages of index sequential files?


The main disadvantage of the index sequential file organization is
that performance degrades as the file grows. This degradation is
remedied by reorganization of the file.

46. What is a B+-Tree index?


A B+-tree index takes the form of a balanced tree in which every path
from the root of the tree to a leaf of the tree is of the same length.

47. What is B-Tree?


A B-tree eliminates the redundant storage of search-key values. It allows
search key values to appear only once.

48. What is hashing?


Hashing allows us to find the address of a data item directly by
computing a hash function on the search key value of the desired record.

49. How do you create index in SQL?


create index <index-name> on <relation-name> (<attribute-list>)

50. Distinguish between static hashing and dynamic hashing?


Static hashing
Static hashing uses a hash function in which the set of bucket
addresses is fixed. Such hash functions cannot easily accommodate
databases that grow larger over time.
Dynamic hashing
Dynamic hashing allows us to modify the hash function dynamically.
Dynamic hashing copes with changes in database size by splitting
and coalescing buckets as the database grows and shrinks.

51. What is a hash index?


A hash index organizes the search keys, with their associated pointers,
into a hash file structure.

52. What can be done to reduce the occurrences of bucket overflows in a hash
file organization?
To reduce bucket overflow, the number of buckets is chosen to be
(nr/fr)*(1+d), where nr is the number of records, fr is the number of
records that fit in a bucket, and d is a fudge factor (typically around 0.2).
We handle bucket overflow by using
• Overflow chaining (closed hashing)
• Open hashing

53. Differentiate open hashing and closed hashing (overflow chaining)

Closed hashing (overflow chaining)


If a record must be inserted into a bucket b, and b is already
full, the system provides an overflow bucket for b, and inserts the
record into the overflow bucket. If the overflow bucket is also full,
the system provides another overflow bucket, and so on. All the
overflow buckets of a given bucket are chained together in a
linked list; overflow handling using such linked lists is known as closed
hashing.
Open hashing
The set of buckets is fixed, and there are no overflow chains.
Instead, if a bucket is full, the system inserts records in some other
bucket in the initial set of buckets.

54. What is linear probing?


Linear probing is a form of open hashing. If a bucket is full, the system
inserts the record into the next bucket (in cyclic order) that has space. This
policy is known as linear probing.
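
A tiny Python sketch of linear probing (illustrative only; the bucket-array size
and function names are assumptions, and each bucket holds a single record
here for simplicity):

# A minimal sketch of linear probing over a fixed array of buckets.
NUM_BUCKETS = 8

def hash_fn(search_key):
    # Toy hash function on the search-key value.
    return hash(search_key) % NUM_BUCKETS

def insert(buckets, search_key, record):
    start = hash_fn(search_key)
    for probe in range(NUM_BUCKETS):
        slot = (start + probe) % NUM_BUCKETS   # move to the next bucket cyclically
        if buckets[slot] is None:              # found a bucket with space
            buckets[slot] = (search_key, record)
            return slot
    raise RuntimeError("hash table full")      # every bucket already occupied

buckets = [None] * NUM_BUCKETS
insert(buckets, "A-102", "Perryridge, 400")
insert(buckets, "A-305", "Round Hill, 350")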

55. What is called query processing?


Query processing refers to the range of activities involved in extracting
data from a database.

56. What are the steps involved in query processing?


The basic steps are:
parsing and translation
optimization
evaluation

57. What is called an evaluation primitive?


A relational-algebra operation annotated with instructions on how to
evaluate it is called an evaluation primitive.

58. What is called a query –execution engine?


The query execution engine takes a query evaluation plan, executes that
plan, and returns the answers to the query.

59. How do you measure the cost of query evaluation?


The cost of query evaluation is measured in terms of a number of
different resources, including disk accesses, CPU time to execute the query,
and, in a distributed database system, the cost of communication.

60. List out the operations involved in query processing


Selection operation
Join operations.
Sorting.
Projection
Set operations
Aggregation

61. What are called as index scans?


Search algorithms that use an index are referred to as index scans.

62. What is called as external sorting?


Sorting of relations that do not fit into memory is called as external
sorting.

63. What is meant by block nested loop join?


Block nested-loop join is the variant of the nested-loop join where
every block of the inner relation is paired with every block of the
outer relation. Within each pair of blocks, every tuple in one block
is paired with every tuple in the other block, to generate all pairs of
tuples.
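
A small Python sketch of the block nested-loop idea (illustrative only;
relations are modelled as lists of blocks, and the join condition is an
assumption for the example):

# Block nested-loop join: pair every block of one relation with every block
# of the other, and compare tuples only within each pair of blocks.
def block_nested_loop_join(outer_blocks, inner_blocks, match):
    result = []
    for b_outer in outer_blocks:          # one pass over the outer relation
        for b_inner in inner_blocks:      # inner relation scanned once per outer block
            for t_outer in b_outer:
                for t_inner in b_inner:
                    if match(t_outer, t_inner):
                        result.append(t_outer + t_inner)
    return result

# Tiny example: join on equality of the first attribute.
r = [[("A-102", 400), ("A-305", 350)], [("A-215", 700)]]      # blocks of r
s = [[("A-102", "Johnson")], [("A-215", "Smith")]]            # blocks of s
print(block_nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0]))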
64. What is meant by hash join?
In the hash join algorithm, a hash function h is used to partition the
tuples of both relations.

65. What is called as recursive partitioning?


The system repeats the splitting of the input until each partition of the
build input fits in the memory. Such partitioning is called recursive
partitioning

66. What is called as an N-way merge?


The merge operation is a generalization of the two-way merge used by
the standard in-memory sort-merge algorithm. It merges N runs, so it is
called an N-way merge.

67. What is known as fudge factor?


The number of partitions is increased by a small value called the fudge
factor, which is usually 20 percent of the number of hash partitions
computed.

68. Define query optimization.


Query optimization refers to the process of finding the lowest-cost
method of evaluating a given query.

Lecture notes:

Storage and File Structures

Overview of Physical Storage Media – Magnetic Disks – RAID – Tertiary


Storage – Storage Access – File Organization – Organization of Records in Files –
Data-Dictionary Storage.

Indexing and Hashing

Basic Concepts – Ordered Indices – B+-Tree Index Files – B-Tree Index Files
– Static Hashing – Dynamic Hashing – Comparison of Ordered Indexing and
Hashing – Index Definition in SQL – Multiple-Key Access.

Storage and File Structure

• Overview of Physical Storage Media


• Magnetic Disks
• RAID
• Tertiary Storage
• Storage Access
• File Organization
• Organization of Records in Files
• Data-Dictionary Storage

Classification of Physical Storage Media


• Speed with which data can be accessed


• Cost per unit of data
• Reliability
o data loss on power failure or system crash
o physical failure of the storage device
• Can differentiate storage into:
o volatile storage: loses contents when power is switched off
o nonvolatile storage:
 Contents persist even when power is switched off.
 Includes secondary and tertiary storage, as well as
Batterybacked up mainmemory

Physical Storage Media

Cache
fastest and most costly form of storage; volatile; managed by the computer
system hardware
(Note: “Cache” is pronounced as “cash”)

Main memory
• Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
• Generally too small (or too expensive) to store the entire database;
capacities of up to a few gigabytes are widely used currently.
Capacities have gone up and per-byte costs have decreased steadily
and rapidly (roughly a factor of 2 every 2 to 3 years).
• Volatile: contents of main memory are usually lost if a power
failure or system crash occurs.

Flash memory
 Data survives power failure
 Data can be written at a location only once, but location can be erased
and written to again
 Can support only a limited number (10K – 1M) of write/erase
cycles.
 Erasing of memory has to be done to an entire bank of
memory
 Reads are roughly as fast as main memory
 But writes are slow (few microseconds), erase is slower
 NOR Flash
 Fast reads, very slow erase, lower capacity
 Used to store program code in many embedded devices
 NAND Flash
 Page-at-a-time read/write, multi-page erase
 High capacity (several GB)
 Widely used as data storage mechanism in portable devices

Magnetic-disk
 Data is stored on spinning disk, and read/written magnetically


 Primary medium for the long-term storage of data; typically stores entire
database.
 Data must be moved from disk to main memory for access, and written
back for storage
 direct-access – possible to read data on disk in any order, unlike
magnetic tape
 Survives power failures and system crashes
• disk failure can destroy data: is rare but does happen

• Optical storage
o non-volatile, data is read optically from a spinning disk using a laser
o CD-ROM (640 MB) and DVD (4.7 to 17 GB) most popular forms
o Write-once, read-many (WORM) optical disks used for archival
storage (CD-R, DVD-R, DVD+R)
o Multiple-write versions also available (CD-RW, DVD-RW, DVD+RW,
and DVD-RAM)
o Reads and writes are slower than with magnetic disk
o Juke-box systems, with large numbers of removable disks, a few
drives, and a mechanism for automatic loading/unloading of disks,
available for storing large volumes of data

• Tape storage
o non-volatile, used primarily for backup (to recover from disk failure),
and for archival data
o sequential-access – much slower than disk
o very high capacity (40 to 300 GB tapes available)
o tape can be removed from drive ⇒ storage costs much cheaper than
disk, but drives are expensive
o Tape jukeboxes available for storing massive amounts of data:
hundreds of terabytes (1 terabyte = 10^12 bytes) to even a
petabyte (1 petabyte = 10^15 bytes)

Storage Hierarchy


• primary storage: Fastest media but volatile (cache, main memory).

• secondary storage: next level in hierarchy, non-volatile, moderately fast
access time
o also called on-line storage
o E.g. flash memory, magnetic disks
• tertiary storage: lowest level in hierarchy, non-volatile, slow access time
o also called off-line storage
o E.g. magnetic tape, optical storage

Magnetic Disk

• Read-write head
o Positioned very close to the platter surface (almost touching it)
o Reads or writes magnetically encoded information.
• Surface of platter divided into circular tracks
o Over 50K-100K tracks per platter on typical hard disks
• Each track is divided into sectors.
o Sector size typically 512 bytes
o Typical sectors per track: 500 (on inner tracks) to 1000 (on outer
tracks)
• To read/write a sector
o disk arm swings to position head on right track
o platter spins continually; data is read/written as sector passes under
head
• Head-disk assemblies
o multiple disk platters on a single spindle (1 to 5 usually)
o one head per platter, mounted on a common arm.
• Cylinder i consists of the ith track of all the platters
• Earlier generation disks were susceptible to “head-crashes” leading to loss
of all data on disk
o Current generation disks are less susceptible to such disastrous
failures, but individual sectors may get corrupted

Disk controller – interfaces between the computer system and the disk-drive
hardware.
o accepts high-level commands to read or write a sector
o initiates actions such as moving the disk arm to the right track and
actually reading or writing the data
o Computes and attaches checksums to each sector to verify that
data is read back correctly
 If data is corrupted, with very high probability the stored
checksum won’t match the recomputed checksum
o Ensures successful writing by reading back the sector after writing it
o Performs remapping of bad sectors

Disk Subsystem


• Disk interface standards families
o ATA (AT adaptor) range of standards
o SATA (Serial ATA)
o SCSI (Small Computer System Interconnect) range of standards
o Several variants of each standard (different speeds and capabilities)

• Access time – the time it takes from when a read or write request is
issued to when data transfer begins. Consists of:
o Seek time – time it takes to reposition the arm over the correct
track.
 Average seek time is 1/2 the worst-case seek time.
– Would be 1/3 if all tracks had the same number of
sectors, and we ignore the time to start and stop arm
movement
 4 to 10 milliseconds on typical disks
o Rotational latency – time it takes for the sector to be accessed to
appear under the head.
 Average latency is 1/2 of the worst-case latency.
 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
• Data-transfer rate – the rate at which data can be retrieved from or
stored to the disk.
o 25 to 100 MB per second max rate, lower for inner tracks
o Multiple disks may share a controller, so the rate that the controller can
handle is also important
 E.g. ATA-5: 66 MB/sec, SATA: 150 MB/sec, Ultra 320 SCSI: 320
MB/s
 Fibre Channel (FC 2 Gb): 256 MB/s
• Mean time to failure (MTTF) – the average time the disk is expected to
run continuously without any failure.
o Typically 3 to 5 years
o Probability of failure of new disks is quite low, corresponding to a
theoretical MTTF of 500,000 to 1,200,000 hours for a new disk
 E.g., an MTTF of 1,200,000 hours for a new disk means that,
given 1000 relatively new disks, on average one will fail
every 1200 hours
o MTTF decreases as the disk ages

Optimization of Disk Block Access

Block – a contiguous sequence of sectors from a single track

data is transferred between disk and main memory in blocks

sizes range from 512 bytes to several kilobytes

 Smaller blocks: more transfers from disk


 Larger blocks: more space wasted due to partially filled blocks
 Typical block sizes today range from 4 to 16 kilobytes

Disk arm scheduling – algorithms order pending accesses to tracks so that
disk-arm movement is minimized.

Elevator algorithm – move the disk arm in one direction (from outer to inner
tracks or vice versa), processing the next request in that direction, till there are
no more requests in that direction, then reverse direction and repeat.
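
A short Python sketch of the elevator policy (illustrative only; track numbers
and helper names are invented, and timing is ignored):

# Elevator (SCAN) disk-arm scheduling: service requests in the current
# direction of arm movement, then reverse when none remain in that direction.
def elevator_order(current_track, pending_tracks):
    order = []
    remaining = sorted(pending_tracks)
    direction_up = True
    while remaining:
        if direction_up:
            batch = [t for t in remaining if t >= current_track]
        else:
            batch = [t for t in reversed(remaining) if t <= current_track]
        if not batch:
            direction_up = not direction_up    # nothing ahead: reverse direction
            continue
        order.extend(batch)
        current_track = batch[-1]
        remaining = [t for t in remaining if t not in batch]
        direction_up = not direction_up
    return order

print(elevator_order(50, [10, 95, 60, 20, 75]))   # [60, 75, 95, 20, 10]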

File organization – optimize block-access time by organizing the blocks to
correspond to how data will be accessed, e.g. store related information
on the same or nearby blocks/cylinders.

 File systems attempt to allocate contiguous chunks of blocks (e.g. 8
or 16 blocks) to a file
 Files may get fragmented over time, e.g. if data is inserted
into/deleted from the file, or free blocks on disk are scattered and a
newly created file has its blocks scattered over the disk
 Sequential access to a fragmented file results in increased disk-arm
movement
 Some systems have utilities to defragment the file system, in order
to speed up file access

Nonvolatile write buffers- speed up disk writes by writing blocks to a


nonvolatile RAM buffer immediately

 Nonvolatile RAM: battery backed up RAM or flash memory


 Even if power fails, the data is safe and will be written to disk when
power returns
 Controller then writes to disk whenever the disk has no other
requests or request has been pending for some time
 Database operations that require data to be safely stored before
continuing can continue without waiting for data to be written to
disk
 Writes can be reordered to minimize disk arm movement

Log disk – a disk devoted to writing a sequential log of block updates

 Used exactly like nonvolatile RAM


 Write to log disk is very fast since no seeks are required
 No need for special hardware (NVRAM)
 File systems typically reorder writes to disk to improve performance

 Journaling file systems write data in safe order to NVRAM or log


disk
 Reordering without journaling: risk of corruption of file system data

RAID: Redundant Arrays of Independent Disks

Disk-organization techniques that manage a large number of disks, providing
a view of a single disk of high capacity and high speed by using multiple disks
in parallel, and high reliability by storing data redundantly, so that data can
be recovered even if a disk fails. The chance that some disk out of a set of N
disks will fail is much higher than the chance that a specific single disk will
fail. E.g., a system with 100 disks, each with an MTTF of 100,000 hours (approx.
11 years), will have a system MTTF of 1000 hours (approx. 41 days). RAID was
originally a cost-effective alternative to large, expensive disks; the “I” in RAID
originally stood for “inexpensive”. Today RAIDs are used for their higher
reliability and bandwidth, and the “I” is interpreted as “independent”.

Improvement of Reliability via Redundancy:

Redundancy – store extra information that can be used to rebuild

information lost in a disk failure.

• The simplest approach to redundancy is to duplicate every disk. This


technique is called mirroring.
• Logical disk consists of two physical disks.
• Every write is carried out on both disks
• Reads can take place from either disk
• If one disk in a pair fails, data still available in the other
• Data loss would occur only if a disk fails, and its mirror disk also fails
before the system is repaired
• Probability of the combined event is very small, except for dependent failure
modes such as fire, building collapse, or electrical power surges
• Mean time to data loss depends on the mean time to failure and the mean
time to repair
• E.g. an MTTF of 100,000 hours and a mean time to repair of 10 hours give a
mean time to data loss of 500*10^6 hours (or 57,000 years) for a mirrored
pair of disks (ignoring dependent failure modes)
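
As a back-of-the-envelope check of these figures (illustrative only, using the
standard approximation for an independent mirrored pair, mean time to data
loss roughly MTTF^2 / (2 * MTTR)):

# Mean time to data loss for a mirrored pair, assuming independent failures:
# roughly MTTF^2 / (2 * MTTR).
mttf_hours = 100_000      # mean time to failure of one disk
mttr_hours = 10           # mean time to repair/replace a failed disk

mean_time_to_data_loss = mttf_hours ** 2 / (2 * mttr_hours)
print(mean_time_to_data_loss)                    # 500,000,000 hours
print(mean_time_to_data_loss / (24 * 365))       # roughly 57,000 years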

Improvement in Performance via Parallelism:

Two main goals of parallelism in a disk system:

1. Load balance multiple small accesses to increase throughput

2. Parallelize large accesses to reduce response time.


Improve transfer rate by striping data across multiple disks.

Bit level striping – split the bits of each byte across multiple disks

In an array of eight disks, write bit i of each byte to disk i. Each access can read
data at eight times the rate of a single disk. But seek/access time worse than for
a single disk. Bit level striping is not used much any more

Block-level striping – stripes blocks across multiple disks. With n disks, block i of
a file goes to disk (i mod n) + 1. Requests for different blocks can run in parallel
if the blocks reside on different disks. A request for a long sequence of blocks
can utilize all disks in parallel.
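
A one-line Python illustration of this block-to-disk mapping (disk numbers
counted from 1, as in the text; the function name is an assumption):

# With n disks, block i of a file is placed on disk (i mod n) + 1.
def disk_for_block(i, n):
    return (i % n) + 1

n_disks = 4
for block in range(8):
    print("block", block, "-> disk", disk_for_block(block, n_disks))
# blocks 0..7 map to disks 1, 2, 3, 4, 1, 2, 3, 4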

RAID Levels:

Mirroring provides high reliability, but it is expensive. Striping provides high data-
transfer rates, but doesn’t improve reliability. Various schemes provide
redundancy at lower cost by combining disk striping with “parity” bits. These
schemes are classified into RAID levels. They are as follows:

RAID Level 1:

• Mirrored disks with block striping


• Offers best write performance.
• Popular for applications such as storing log files in a database system.
RAID Level 0:

• Block striping; non-redundant.


• Used in high performance
• applications where data lost is not critical.


RAID Level 2:

• Memory Style Error-Correcting-Codes (ECC) with bit striping.


• Memory system have long used parity bits for error detection and
correction.
• Each byte in a memory system may have a parity bit associated with it
that records whether the numbers of bits in the byte that are set to 1 is
even(parity=0) or odd(parity=1).
• In the figure for level 2, the disks labeled P store the error-
correction bits. If one of the disks fails, the remaining bits of the byte and
the associated error-correction bits can be read from the other disks, and
can be used to reconstruct the data.

RAID Level 3:

• Bit Interleaved Parity


• a single parity bit is enough for error correction, not just detection, since
we know which disk has failed
• When writing data, corresponding parity bits must also be computed and
written to a parity bit disk
• To recover data in a damaged disk, compute XOR of bits from other disks
(including parity bit disk).
• RAID level 3 is as good as level 2,but is less expensive in the number of
extra disks, so level 2 is not used in practice.
• Level 3 has two benefits over level 1. It needs only one parity disk for
several regular disks, whereas level 1 needs one mirror disk for every disk,
and thus reduces the storage overhead.
• Faster data transfer than with a single disk, but fewer I/Os per second
since every disk has to participate in every I/O.


RAID Level 4:

• Block-Interleaved Parity;
• uses block level striping, and keeps a parity block on a separate disk for
corresponding blocks from N other disks.
• When writing data block, corresponding block of parity bits must also be
computed and written to parity disk
• To find the value of a damaged block, compute the XOR of bits from the
corresponding blocks (including the parity block) on the other disks (a small
sketch of this appears after this list).
• Provides higher I/O rates for independent block reads than Level 3
• block read goes to a single disk, so blocks stored on different disks can be
read in parallel
• Provides higher transfer rates for reads of multiple blocks than no striping
• Before writing a block, parity data must be computed. This can be done by
using the old parity block, the old value of the current block and the new
value of the current block (2 block reads + 2 block writes)
• Or by re-computing the parity value using the new values of blocks
corresponding to the parity block
• More efficient for writing large amounts of data sequentially
• Parity block becomes a bottleneck for independent block writes since
every block write also writes to parity disk
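
The parity computation and reconstruction mentioned above can be sketched
in a few lines of Python (byte strings stand in for disk blocks; the helper name
is an assumption for the example):

# Parity for striped data: the parity block is the bitwise XOR of the data
# blocks. Any single lost block can be rebuilt by XOR-ing the survivors.
def xor_blocks(blocks):
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

data_blocks = [b"\x10\x20", b"\x0f\x01", b"\x33\x44"]   # blocks on 3 data disks
parity = xor_blocks(data_blocks)                        # stored on the parity disk

# Suppose the second data disk fails; rebuild its block from the others + parity.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
assert rebuilt == data_blocks[1]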

RAID Level 5:

• Block-Interleaved Distributed Parity;


• partitions data and parity among all N + 1 disks, rather than storing data
in N disks and parity in 1 disk.
• E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk
(n mod 5) + 1, with the data blocks stored on the other 4 disks.
• Higher I/O rates than Level 4.
• Block writes occur in parallel if the blocks and their parity blocks are on
different disks.
• Subsumes Level 4: provides same benefits, but avoids bottleneck of parity
disk.


RAID Level 6:

• P+Q Redundancy scheme; similar to Level 5, but stores extra redundant


information to guard against multiple disk failures.
• Better reliability than Level 5 at a higher cost; not used as widely.

Choice of RAID Level:


• Factors in choosing RAID level


• Monetary cost
• Performance: Number of I/O operations per second, and bandwidth during
normal operation
• Performance during failure
• Performance during rebuild of failed disk
• Including time taken to rebuild failed disk
• RAID 0 is used only when data safety is not important
• E.g. data can be recovered quickly from other sources
• Level 2 and 4 never used since they are subsumed by 3 and 5
• Level 3 is not used since bit-striping forces single-block reads to access all
disks, wasting disk-arm movement, which block striping (level 5) avoids
• Level 6 is rarely used since levels 1 and 5 offer adequate safety for almost
all applications. So competition is between 1 and 5 only
• Level 1 provides much better write performance than level 5
• Level 5 requires at least 2 block reads and 2 block writes to write a single
block, whereas Level 1 only requires 2 block writes
• Level 1 preferred for high update environments such as log disks
• Level 1 has a higher storage cost than level 5
• disk drive capacities increasing rapidly (50%/year) whereas disk access
times have decreased much less (x 3 in 10 years)
• I/O requirements have increased greatly, e.g. for Web servers
• When enough disks have been bought to satisfy required rate of I/O, they
often have spare storage capacity. so there is often no extra monetary
cost for Level 1.
• Level 5 is preferred for applications with low update rate, and large
amounts of data
• Level 1 is preferred for all other applications

Hardware Issues:

• Software RAID: RAID implementations done entirely in software, with no


special hardware support
• Hardware RAID: RAID implementations with special hardware
• Use nonvolatile RAM to record writes that are being executed
• Beware: power failure during write can result in corrupted disk
• E.g. failure after writing one block but before writing the second in a
mirrored system
• Such corrupted data must be detected when power is restored
– Recovery from corruption is similar to recovery from failed disk
– NVRAM helps to efficiently detect potentially corrupted blocks;
otherwise, all blocks of the disk must be read and compared with the
mirror/parity block

• Hot swapping: replacement of a disk while the system is running, without
powering down. Supported by some hardware RAID systems;
• reduces time to recovery, and improves availability greatly


• Many systems maintain spare disks which are kept online, and used as
replacements for failed disks immediately on detection of failure
• Reduces time to recovery greatly
• Many hardware RAID systems ensure that a single point of failure will not
stop the functioning of the system by using Redundant power supplies with
battery backup
• Multiple controllers and multiple interconnections to guard against
controller/interconnection failures

Tertiary storage:

In large database system some of the data may have to reside on tertiary
storage. The two main tertiary storage media are:

• Optical disks
• Magnetic tapes

Optical Disks

• Compact disk read-only memory (CD-ROM): removable disks, 640 MB per
disk. Seek time about 100 msec (the optical read head is heavier and
slower). Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s)
compared to magnetic disks.
• Digital Video Disk (DVD): DVD-5 holds 4.7 GB, and DVD-9 holds 8.5 GB.
DVD-10 and DVD-18 are double-sided formats with capacities of 9.4 GB and
17 GB. Slow seek time, for the same reasons as CD-ROM.
• Record-once versions (CD-R and DVD-R) are popular: data can only be
written once, and cannot be erased; high capacity and long lifetime; used for
archival storage.
• Multi-write versions (CD-RW, DVD-RW, DVD+RW and DVD-RAM) also
available.

Magnetic Tapes

• Hold large volumes of data and provide high transfer rates: a few GB for DAT
(Digital Audio Tape) format, 10-40 GB with DLT (Digital Linear Tape) format,
100-400 GB+ with Ultrium format, and 330 GB with Ampex helical scan
format. Transfer rates from a few to 10s of MB/s.
• Currently the cheapest storage medium; tapes are cheap, but the cost of
drives is very high.
• Very slow access time in comparison to magnetic disks and optical disks;
limited to sequential access. Some formats (Accelis) provide faster seek (10s
of seconds) at the cost of lower capacity.
• Used mainly for backup, for storage of infrequently used information, and as
an offline medium for transferring information from one system to another.
• Tape jukeboxes used for very large capacity storage (terabyte (10^12 bytes)
to petabyte (10^15 bytes)).

Storage Access

A database file is partitioned into fixed-length storage units called blocks.
Blocks are units of both storage allocation and data transfer.

Database system seeks to minimize the number of block transfers between


the disk and memory. We can reduce the number of disk accesses by keeping as
many blocks as possible in main memory.

Buffer – portion of main memory available to store copies of disk blocks.

Buffer manager – subsystem responsible for allocating buffer space in


main memory.

Buffer Manager

Programs call on the buffer manager when they need a block from disk.
Buffer manager does the following:

• If the block is already in the buffer, return the address of the block in
main memory.
• If the block is not in the buffer:
o Allocate space in the buffer for the block,
o replacing (throwing out) some other block, if required, to make space
for the new block.
o The replaced block is written back to disk only if it was modified since
the most recent time that it was written to/fetched from the disk.
o Read the block from the disk into the buffer, and return the address of
the block in main memory to the requester.
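
A toy Python sketch of these steps with an LRU replacement policy (the class,
method names and simulated disk are assumptions made for illustration, not
any particular system's API):

from collections import OrderedDict

# A toy buffer manager: at most `capacity` blocks are kept in memory,
# and the least recently used block is evicted when space is needed.
class BufferManager:
    def __init__(self, disk, capacity):
        self.disk = disk                  # simulated disk: block_id -> contents
        self.capacity = capacity
        self.buffer = OrderedDict()       # block_id -> (contents, dirty flag)

    def get_block(self, block_id):
        if block_id in self.buffer:                       # already buffered
            self.buffer.move_to_end(block_id)             # mark as recently used
            return self.buffer[block_id][0]
        if len(self.buffer) >= self.capacity:             # need to evict a block
            victim_id, (contents, dirty) = self.buffer.popitem(last=False)
            if dirty:                                     # write back only if modified
                self.disk[victim_id] = contents
        contents = self.disk[block_id]                    # read block from "disk"
        self.buffer[block_id] = (contents, False)
        return contents

    def mark_dirty(self, block_id):
        contents, _ = self.buffer[block_id]
        self.buffer[block_id] = (contents, True)

disk = {i: "block-%d" % i for i in range(10)}
bm = BufferManager(disk, capacity=3)
bm.get_block(1); bm.get_block(2); bm.get_block(3)
bm.get_block(4)    # buffer full, so the least recently used block (1) is evicted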

Buffer-Replacement Policies

• Most operating systems replace the block least recently used (LRU strategy).
• Idea behind LRU: use the past pattern of block references as a predictor of
future references.
• Queries have well-defined access patterns (such as sequential scans), and a
database system can use the information in a user’s query to predict future
references.
• LRU can be a bad strategy for certain access patterns involving repeated
scans of data, e.g. when computing the join of two relations r and s by a
nested loop:

for each tuple tr of r do
for each tuple ts of s do
if the tuples tr and ts match ...

• A mixed strategy, with hints on the replacement strategy provided by the
query optimizer, is preferable.

Buffer-Replacement Policies (Cont.)

• Pinned block – memory block that is not allowed to be written back to disk.
• Toss-immediate strategy – frees the space occupied by a block as soon as
the final tuple of that block has been processed.
• Most recently used (MRU) strategy – the system must pin the block currently
being processed. After the final tuple of that block has been processed, the
block is unpinned, and it becomes the most recently used block.
• The buffer manager can use statistical information regarding the probability
that a request will reference a particular relation.
E.g., the data dictionary is frequently accessed. Heuristic: keep
data-dictionary blocks in the main-memory buffer.
• Buffer managers also support forced output of blocks for the purpose of
recovery (more in Chapter 17).

FILE ORGANISATION:

• A file is organized logically as a sequence of records.


• These records are mapped on to disk blocks.
• The blocks are of a fixed size determined by the physical properties of the
disk and by the operating system; the record size may vary.
• We have to consider ways of representing logical data models in terms of
files.
• The following is the example of a file containing the account records.

Fig-1.1 FILE CONTAINING ACCOUNT RECORDS

record 0   A-102   Perryridge   400
record 1   A-305   Round Hill   350
record 2   A-215   Mianus       700
record 3   A-101   Downtown     500
record 4   A-222   Redwood      700
record 5   A-201   Perryridge   900
record 6   A-217   Brighton     750
record 7   A-110   Downtown     600
record 8   A-218   Perryridge   700

• In a relational database, tuples of distinct relations are generally of different


sizes.
• One approach to mapping the database to files is to use several files, and to
store records of only one fixed length in any given file. This approach is called
FIXED-LENGTH RECORDS.
• An alternative is to structure our files so that they can accommodate multiple
lengths for records. This approach is called VARIABLE-LENGTH RECORDS.

*FIXED-LENGTH RECORDS:

For an example, let us consider a file of account records for our bank
database. Each record of this file is defined as:

type deposit=record

account-number:char(10);

branch-name:char(22);

balance:real;

end

• If we assume that each character occupies 1 byte and that a real occupies 8
bytes, our account record is 40 bytes long. A simple approach is to use the
first 40 bytes for the first record, the next 40 bytes for the second record, and
so on.

• Though, we have two problems with this simple approach:

1. It is difficult to delete a record from this structure. The space


occupied by the record to be deleted must be filled with some other record of
this file, or we must have a way of marking deleted records so that they can be
ignored.
2. Unless the block size happens to be a multiple of 40 (which is
unlikely), some records will cross block boundaries. That is, part of the record
will be stored in one block and part in another. It would thus require two block
accesses to read or write such a record.

3. When a record is deleted, we could move the record that came


after it into the space formerly occupied by the deleted record and so on, until
every record following the deleted record has been moved ahead.(fig-1.2)

FIG 1.2, WITH RECORD 2 DELETED AND ALL RECORDS MOVED

record 0   A-102   Perryridge   400
record 1   A-305   Round Hill   350
record 3   A-101   Downtown     500
record 4   A-222   Redwood      700
record 5   A-201   Perryridge   900
record 6   A-217   Brighton     750
record 7   A-110   Downtown     600
record 8   A-218   Perryridge   700

• It might be easier simply to move the final record of the file into the space
occupied by the deleted record.(fig-1.3)

Fig-1.3 WITH RECORD 2 DELETED AND FINAL RECORD MOVED


record 0   A-102   Perryridge   400
record 1   A-305   Round Hill   350
record 8   A-218   Perryridge   700
record 3   A-101   Downtown     500
record 4   A-222   Redwood      700
record 5   A-201   Perryridge   900
record 6   A-217   Brighton     750
record 7   A-110   Downtown     600

• A simple marker on a deleted record is not sufficient, since it is hard to find


this available space when an insertion is being done. Thus, we need to
introduce an additional structure.
• At the beginning of the file, we allocate a certain number of bytes as a FILE
HEADER.
• The header will contain a variety of information about the file.
• For now, all we need to store there is the address of the first record whose
contents are deleted. We use this first record to store the address of the
second available record and so on.
• These deleted records form a linked list, which is often referred to as a FREE
LIST.(shown in the fig 1.4).

Fig 1.4, WITH FREE LIST AFTER DELETION OF RECORDS 1, 4, AND 6

header     (points to record 1, the first free record)
record 0   A-102   Perryridge   400
record 1   (free; points to record 4)
record 2   A-215   Mianus       700
record 3   A-101   Downtown     500
record 4   (free; points to record 6)
record 5   A-201   Perryridge   900
record 6   (free; end of the free list)
record 7   A-110   Downtown     600
record 8   A-218   Perryridge   700

• On insertion of a new record, we use the record pointed to by the header.


• We change the header pointer to point to the next available record. If no
space is available, we add the new record to the end of the file.
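
A short Python sketch of the free-list idea (illustrative only: the file is modelled
as a list of fixed-length slots, slot numbers stand in for record addresses, and
the class name is an assumption):

# Fixed-length records with a free list: the header stores the index of the
# first free slot, and each free slot stores the index of the next free slot.
class FixedLengthFile:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots
        self.free_head = None            # "header": first deleted slot, if any
        self.used = 0                    # number of slots allocated at the end so far

    def insert(self, record):
        if self.free_head is not None:   # reuse a slot from the free list
            slot = self.free_head
            self.free_head = self.slots[slot]      # next free slot (or None)
        else:                            # no free slot: append at the end
            slot = self.used
            self.used += 1
        self.slots[slot] = record
        return slot

    def delete(self, slot):
        self.slots[slot] = self.free_head          # chain onto the free list
        self.free_head = slot

f = FixedLengthFile(10)
a = f.insert(("A-102", "Perryridge", 400))
b = f.insert(("A-305", "Round Hill", 350))
f.delete(a)                               # slot a becomes head of the free list
c = f.insert(("A-215", "Mianus", 700))    # reuses slot a
assert c == a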

♦ NOTE: Insertion and deletion for files of fixed-length records are simple to
implement, because the space made available by a deleted record is exactly
the space needed to insert a record. (With variable-length records, in contrast,
an inserted record may not fit in the space left free by a deleted record, or it
may fill only part of that space.)

*VARIABLE-LENGTH RECORDS:

variable-length records arise in database systems in several ways:

 Storage of multiple record types in a file.


 Record types that allow variable lengths for one or more fields.
 Record types that allow repeating fields.


Different techniques for implementing variable-length records exist.

They are:

*BYTE-STRING REPRESENTATION.

*FIXED-LENGTH REPRESENTATION.

♦ We shall consider a different representation of the account information stored


in the file of fig-1.1 to demonstrate the implementation techniques.
♦ In this, we use one variable-length record for each branch name and for all
the account information for that branch.
♦ The format of the record is:
type account-list=record

branch-name:char(22);

account-info:array[1…. ]of

record

account-number:char(10);

balance:real;

end

end

Here, we define account-info as an array with an arbitrary number of


elements.

NOTE: There is no limit on how large a record can be.

*BYTE-STRING REPRESENTATION:

 A simple method for implementing variable-length records is to
attach a special end-of-record (⊥) symbol to the end of each record.
 We can store each record as a string of consecutive bytes.
 The fig-1.5 shows such an organisation to represent the file of fixed-
length records of fig 1.1 as variable length records.


FIG 1.5 BYTE-STRING REPRESENTATION OF VARIABLE-LENGTH RECORDS

0   Perryridge   A-102  400   A-201  900   A-218  700   ⊥
1   Round Hill   A-305  350   ⊥
2   Mianus       A-215  700   ⊥
3   Downtown     A-101  500   A-110  600   ⊥
4   Redwood      A-222  700   ⊥
5   Brighton     A-217  750   ⊥

The byte-string representation described in the fig 1.5 has some


disadvantages:

It is not easy to reuse space occupied formerly by a deleted record. Although
techniques exist to manage insertion and deletion, they lead to a large number
of small fragments of disk storage that are wasted.

There is no space, in general, for records to grow longer. If a variable-
length record becomes longer, it must be moved; movement is costly if
pointers to the record are stored elsewhere in the database (e.g., in indices, or
in other records), since the pointers must be located and updated.

• The basic byte-string representation described here is not usually


used for implementation of variable-length records.
• The modified form of the byte-string representation is called the
slotted-page structure.
• It is usually used for organizing records within a single block.


• The slotted-page structure (fig 1.6) has a header at the beginning of each
block.
• The header contains the following information:
1. The number of record entries.
2. The end of free space in the block.
3. An array whose entries contain the location and size of each record.

Fig 1.6 SLOTTED-PAGE STRUCTURE
(The block begins with a header holding the number of entries, the end of free
space, and an array of entry locations and sizes; free space follows the header,
and the records themselves are allocated from the end of the block.)

• The actual records are allocated contiguously in the block, starting from the
end of the block.
• The free space in the block is contiguous, between the final entry in the
header array and the first record.
• If a record is inserted, space is allocated for it at the end of free space, and
an entry containing its size and location is added to the header.
• If a record is deleted, the space that it occupies is freed, and its entry is set
to deleted (its size is set to -1, for example).
• Records can be grown or shrunk by similar techniques, as long as there is
space in the block.
• The cost of moving the records is not too high, since the size of a block is
limited: a typical value is 4 kilobytes.
• The slotted-page structure requires that there be no pointers that point
directly to records.
NOTE:

Instead, pointers must point to the entry in the header that
contains the actual location of the record. This level of indirection
allows records to be moved to prevent fragmentation of space inside
a block, while supporting indirect pointers to the record.
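
To make the layout concrete, here is a small Python sketch of a slotted page
(the class and method names are assumptions for illustration; a real
implementation would also pack the header itself into the block's bytes):

# A toy slotted page. Records are byte strings packed at the end of the block;
# the header array records each one's (offset, size); size -1 marks a deleted entry.
class SlottedPage:
    def __init__(self, block_size=4096):
        self.block = bytearray(block_size)
        self.entries = []                      # header array: (offset, size)
        self.free_end = block_size             # end of free space

    def insert(self, record_bytes):
        start = self.free_end - len(record_bytes)
        if start < 0:
            raise RuntimeError("no space in block")
        self.block[start:self.free_end] = record_bytes   # allocate from the end
        self.free_end = start
        self.entries.append((start, len(record_bytes)))  # add header entry
        return len(self.entries) - 1                     # slot number

    def delete(self, slot):
        offset, _ = self.entries[slot]
        self.entries[slot] = (offset, -1)                # mark entry as deleted

    def read(self, slot):
        offset, size = self.entries[slot]
        if size == -1:
            return None
        return bytes(self.block[offset:offset + size])

page = SlottedPage()
s = page.insert(b"A-102 Perryridge 400")
print(page.read(s))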

*FIXED-LENGTH REPRESENTATION:


Another way to implement variable-length records efficiently in a file

system is to use one or more fixed-length records to represent one

variable-length record.

There are two ways of doing this:

1. Reserved space: If there is a maximum record length that is never exceeded,
we can use fixed-length records of that length. Unused space (for records
shorter than the maximum) is filled with a special null, or end-of-record,
symbol.

2. List representation: We can represent variable-length records by lists of
fixed-length records, chained together by pointers.

FIG 1.7 FILE OF FIG 1.5, USING THE RESERVED-SPACE METHOD

0 Perryridge A-102 400 A-201 900 A-218 700

1 Round Hill A-305 350 ⊥ ⊥ ⊥ ⊥

2 Mianus A-215 700 ⊥ ⊥ ⊥ ⊥

3 Downtown A-101 500 A-110 600 ⊥ ⊥

4 Redwood A-222 700 ⊥ ⊥ ⊥ ⊥

5 Brighton A-217 750 ⊥ ⊥ ⊥ ⊥

• A record in this file is of the account-list type, but with the array containing
exactly three elements. Those branches with fewer than three accounts (for
example, Round Hill) have records with null fields.
• We use the symbol ⊥ to represent this situation in fig 1.7.
• The reserved-space method is useful when most records have a length close
to the maximum.


• In our bank example, some branches may have many more accounts than
others. This situation leads us to consider the linked list method.
• To represent the file by the linked list method, we add a pointer field as we
did in fig 1.4. The resulting structure appears in fig 1.8.

FIG 1.8 FILE FIG 1.5 USING LINKED LISTS

0 Perryridge A-102 400

1 Round Hill A-305 350

2 Mianus A-215 700

3 Downtown A-101 500

4 Redwood A-222 700

5 A-201 900

6 Brighton A-217 750

7 A-110 600

8 A-218 700

The file structures of fig 1.4 and fig 1.8 both use pointers; the difference is that,
in fig 1.4, we use pointers to chain together only deleted records, whereas in
fig 1.8, we chain together all records pertaining to the same branch.

A disadvantage of the structure of fig 1.8 is that we waste space in all records
except the first in a chain.

This wasted space is significant, since we expect, in practice, that each branch
has a large number of accounts.

To deal with this problem, we allow two kinds of blocks in our file:

1. Anchor block, which contains the first record of a chain.

2. Overflow block, which contains records other than those that are the first
record of a chain.


3. Thus, all records within a block have the same length, even though not all
records in the file have the same length. Fig 1.8 shows this file structure.

ORGANISATION OF RECORDS IN FILES:

We have several possible ways of organizing records in files:

• Heap file organization: Any record can be placed anywhere in the file where
there is space for the record. There is no ordering of records. Typically, there
is a single file for each relation.
• Sequential file organization: Records are stored in sequential order, according
to the value of a “search key” of each record.
• Hashing file organization: A hash function is computed on some attribute of
each record. The result of the hash function specifies in which block of the file
the record should be placed.
• Generally, a separate file is used to store the records of each relation.
• However, in a clustering file organization, records of several different relations
are stored in the same file; further, related records of the different relations
are stored in the same block, so that one I/O operation fetches related records
from all the relations.

SEQUENTIAL FILE ORGANIZATION:

• A sequential file is designed for efficient processing of records in sorted order


based on some search-key.
• A search key is any attribute or set of attributes; it need not be the primary
key, or even a superkey.

• To permit fast retrieval of records in search-key order, we chain together
records by pointers. The pointer in each record points to the next record in
search-key order.


• Fig 1.11 shows a sequential file of account records taken from our banking
example. In that example, the records are stored in search-key order, using
branch name as the search key.
• It is difficult, however, to maintain physical sequential order as records are
inserted and deleted, since it is costly to move many records as a result of a
single insertion or deletion.
• We can manage deletion by using pointer chains, as we saw previously.
• For insertion, we apply the following rules (a sketch appears after this list):
1. Locate the record in the file that comes before the record to be inserted in
search-key order.
2. If there is a free record (i.e., space left after a deletion) within the same block
as this record, insert the new record there. Otherwise, insert the new record in
an overflow block. In either case, adjust the pointers so as to chain together
the records in search-key order.
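
A minimal Python sketch of these insertion rules (the list-of-records
representation and function name are assumptions made for the example;
appending at the end of the list stands in for placing the record in an
overflow block):

# Sequential file kept in search-key order via a pointer chain.
# Each entry is [search_key, data, next_index]; `head` is the first record.
def insert_sequential(records, head, search_key, data):
    new_index = len(records)
    records.append([search_key, data, None])   # physically placed at the end
    # Locate the record that comes before the new one in search-key order.
    prev, cur = None, head
    while cur is not None and records[cur][0] <= search_key:
        prev, cur = cur, records[cur][2]
    # Adjust pointers so the chain stays in search-key order.
    records[new_index][2] = cur
    if prev is None:
        return new_index                       # new record becomes the head
    records[prev][2] = new_index
    return head

records = [["Brighton", 750, 1], ["Downtown", 500, 2], ["Redwood", 700, None]]
head = 0
head = insert_sequential(records, head, "Mianus", 700)   # chained between Downtown and Redwood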

Sequential file after an insertion.

• Fig 1.12 shows the file of fig 1.11 after the insertion of the record
(North Town, A-888, 800). The structure in fig 1.12 allows fast insertion of new
records, but forces sequential file-processing applications to process records
in an order that does not match the physical order of the records.
• If relatively few records need to be stored in overflow blocks, this approach
works well.
• At some point, the file should be reorganized so that it is once again physically
in sequential order. Such reorganizations are costly, and must be done during
times when the system load is low.


• In the extreme case in which insertions rarely occur, it is possible always to
keep the file in physically sorted order. In such a case, the pointer field in
fig 1.11 is not needed.

CLUSTERING FILE ORGANIZATION:

• Many relational-database systems store each relation in a separate file, so
that they can take full advantage of the file system that the operating
system provides.
• The tuples of a relation can be represented as fixed-length records, and can
be mapped to a simple file structure.
• This simple implementation of a relational database system is well suited to
low-cost database implementations as in, for example, embedded systems or
portable devices.
• A simple file structure reduces the amount of code needed to implement the
system.
• This simple approach to relational-database implementation becomes less
satisfactory as the size of the database increases.
• However, many large-scale database systems do not rely directly on the
underlying operating system for file management.

Instead, one large operating-system file is allocated to the database system.

The database system stores all relations in this one file, and manages the file
itself.

To see the advantage of storing many relations in this one file, consider the
following SQL query for the bank database:

select account-number, customer-name, customer-street, customer-city
from depositor, customer
where depositor.customer-name = customer.customer-name

• This query computes a join of the depositor and customer relations.
• Thus, for each tuple of depositor, the system must locate the customer tuples
with the same value for customer-name.
• Ideally, these records will be located with the help of indices.
• In the worst case, each record will reside on a different block, forcing us to
do one block read for each record required by the query.
• As a concrete example, consider the depositor and customer relations of
fig 1.13 and fig 1.14 respectively.


• Fig 1.15 shows a file structure designed for efficient execution of queries
involving depositor and customer.
• The depositor tuples for each customer-name are stored near the customer
tuple for the corresponding customer-name.
• This structure mixes together tuples of two relations, but allows for efficient
processing of the join.
• When a tuple of the customer relation is read, the entire block containing that
tuple is copied from disk into main memory.
• Since the depositor tuples are stored on the disk near the customer tuple, the
block containing the customer tuple contains tuples of the depositor relation
needed to process the query. If a customer has so many accounts that the
depositor records do not fit in one block, the remaining records appear on
nearby blocks.

• A clustering file organization is a file organization, such as that illustrated in
fig 1.15, that stores related records of two or more relations in each block. Our
use of clustering has enhanced processing of a particular join (depositor and
customer), but it results in slower processing of other types of query.
• For example, a query involving only the customer relation, such as

select * from customer

requires more block accesses than it did in the scheme under which we stored
each relation in a separate file.

• When clustering is to be used depends on the types of query that the
database designer believes to be most frequent. Careful use of clustering can
produce significant performance gains in query processing.

DATA-DICTIONARY STORAGE:

A relational-database system needs to maintain data about the relations, such
as the schema of the relations. This information is called the data dictionary, or
system catalog.

Among the types of information that the system must store are these:

 Names of the relations
 Names of the attributes of each relation
 Domains and lengths of attributes
 Names of views defined on the database, and definitions of those views
 Integrity constraints (for example, key constraints)


In addition,many systems keep the following data on users of the system:

 Names of authorized users


 Accounting information about users.
 Passwords or other information used to authenticate users

Further,the database may store statistical and descriptive data about the
relations such as:

 Number of tuples in each relation


 Method of storage of each relation(for example, clustered or non-clustered)

The data dictionary may also note the storage organization (sequential, hash,
or heap) of relations, and the location where each relation is stored:

 If relations are stored in operating-system files, the dictionary would note the
names of the file (or files) containing each relation.
 If the database stores all relations in a single file, the dictionary may note the
blocks containing records of each relation, in a data structure such as a linked
list.

For indices, we shall need to store information about each index on each of the
relations:

 Name of the index


 Name of the relation being indexed
 Attributes on which the index is defined
 Type of index formed.

All this information constitutes, in effect, a miniature database. Some database
systems store this information by using special-purpose data structures and
code. It is generally preferable to store the data about the database in the
database itself.

The exact choice of how to represent system data by relations must be made by
the system designers. One possible representation, with primary keys
underlined, is:


Relation-metadata(relation-name, number-of-attributes, storage-organization,
location)

Attribute-metadata(attribute-name, relation-name, domain-type, position, length)

User-metadata(user-name, encrypted-password, group)

Index-metadata(index-name, relation-name, index-type, index-attribute)

View-metadata(view-name, definition)

Indexing and Hashing

 Basic Concepts
 Ordered Indices
 B+-Tree Index Files
 B-Tree Index Files
 Static Hashing
 Dynamic Hashing
 Comparison of Ordered Indexing and Hashing
 Index Definition in SQL
 Multiple-Key Access

Basic Concepts

 Indexing mechanisms used to speed up access to desired data.


E.g., author catalog in library
 Search Key – an attribute or set of attributes used to look up records in a file.
 An index file consists of records (called index entries) of the form
(search-key value, pointer).
 Index files are typically much smaller than the original file
 Two basic kinds of indices:
Ordered indices: search keys are stored in sorted order
Hash indices: search keys are distributed uniformly across “buckets”
using a “hash function”.

1. Notice how we would find records for Perryridge branch using both
methods. (Do it!)


2. Dense indices are faster in general, but sparse indices require less space
and impose less maintenance for insertions and deletions. (Why?)
3. A good compromise: to have a sparse index with one entry per block.

Why is this good?

o Biggest cost is in bringing a block into main memory.


o We are guaranteed to have the correct block with this method,
unless record is on an overflow block (actually could be several
blocks).
o Index size still small.

Multi-Level Indices

1. Even with a sparse index, index size may still grow too large. For 100,000
records, 10 per block, at one index record per block, that's 10,000 index
records! Even if we can fit 100 index records per block, this is 100 blocks.
2. If index is too large to be kept in main memory, a search results in several
disk reads.
o If there are no overflow blocks in the index, we can use binary
search.
o This will read as many as ⌈log2(b)⌉ blocks (as many as 7 for our 100
blocks).
o If index has overflow blocks, then sequential search typically used,
reading all b index blocks.
3. Solution: Construct a sparse index on the index (Figure 12.4).


Use binary search on outer index. Scan index block found until correct
index record found. Use index record as before - scan block pointed to for
desired record.

4. For very large files, additional levels of indexing may be required.


5. Indices must be updated at all levels when insertions or deletions require
it.
6. Frequently, each level of index corresponds to a unit of physical storage
(e.g. indices at the level of track, cylinder and disk).
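
The two-level lookup described above (binary search on the outer index, then
a scan of the data block it points to) can be sketched in a few lines of Python;
the branch names and block contents are invented for the example.

import bisect

# Outer sparse index: one (first search-key, block number) entry per data block.
outer_index = [("Brighton", 0), ("Mianus", 1), ("Redwood", 2)]
data_blocks = [
    [("Brighton", 750), ("Downtown", 500)],     # block 0
    [("Mianus", 700), ("Perryridge", 400)],     # block 1
    [("Redwood", 700), ("Round Hill", 350)],    # block 2
]

def lookup(search_key):
    keys = [k for k, _ in outer_index]
    # Binary search: last index entry whose key is <= the search key.
    pos = bisect.bisect_right(keys, search_key) - 1
    if pos < 0:
        return None
    _, block_no = outer_index[pos]
    for key, value in data_blocks[block_no]:    # scan the block for the record
        if key == search_key:
            return value
    return None

print(lookup("Perryridge"))    # 400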

Index Update
Regardless of what form of index is used, every index must be updated whenever
a record is either inserted into or deleted from the file.

1. Deletion:
o Find (look up) the record
o If the last record with a particular search key value, delete that
search key value from index.
o For dense indices, this is like deleting a record in a file.
o For sparse indices, delete a key value by replacing key value's entry
in index by next search key value. If that value already has an index
entry, delete the entry.
2. Insertion:
o Find place to insert.
o Dense index: insert search key value if not present.


o Sparse index: no change unless new block is created. (In this case,
the first search key value appearing in the new block is inserted into
the index).

Secondary Indices

1. If the search key of a secondary index is not a candidate key, it is not


enough to point to just the first record with each search-key value because
the remaining records with the same search-key value could be anywhere
in the file. Therefore, a secondary index must contain pointers to all the
records.
2.

We can use an extra level of indirection to implement secondary indices on


search keys that are not candidate keys. A pointer does not point directly
to the file but to a bucket that contains pointers to the file.

o See Figure 12.5 on secondary key account file.

3. Secondary indices must be dense, with an index entry for every search-
key value, and a pointer to every record in the file.
4. Secondary indices improve the performance of queries on non-primary
keys.
5. They also impose serious overhead on database modification: whenever a
file is updated, every index must be updated.
6. Designer must decide whether to use secondary indices or not.
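
A minimal sketch of the extra level of indirection just described, assuming a Python dictionary stands in for the secondary index and record numbers stand in for file pointers (all values here are illustrative):

from collections import defaultdict

# account records stored in arbitrary (primary-key) order: (record#, branch, balance)
records = [(0, "Perryridge", 400), (1, "Downtown", 500),
           (2, "Perryridge", 900), (3, "Mianus", 700)]

# secondary index on branch: every search-key value maps to a bucket
# holding pointers to *all* records with that value
branch_index = defaultdict(list)
for rec_no, branch, balance in records:
    branch_index[branch].append(rec_no)

print(branch_index["Perryridge"])     # [0, 2] -- pointers to both matching records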

B+-TREE INDEX FILES

1. Primary disadvantage of index-sequential file organization is that
performance degrades as the file grows. This can be remedied by costly
re-organizations.

2. B+-tree file structure maintains its efficiency despite frequent insertions
and deletions. It imposes some acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a
leaf is of the same length.

4. Each nonleaf node in the tree must have between ⌈n/2⌉ and n children,
where n is fixed for a particular tree.

Structure of a B+-Tree

1. A B+-tree index is a multilevel index but is structured differently from that
of multi-level index-sequential files.

A typical node (Figure 12.6) contains up to n−1 search key values
K1, K2, . . ., Kn−1, and n pointers P1, P2, . . ., Pn. The search key values in a node
are kept in sorted order.

2. For leaf nodes, pointer Pi (for i = 1, 2, . . ., n−1) points to either a file record with search
key value Ki, or a bucket of pointers to records with that search key value.
The bucket structure is used if the search key is not a primary key, and the file is not
sorted in search-key order.

Pointer Pn (the nth pointer in the leaf node) is used to chain leaf nodes together
in linear order (search key order). This allows efficient sequential
processing of the file.

The ranges of values in the leaves do not overlap.

3. Non-leaf nodes form a multilevel index on the leaf nodes.

A non-leaf node may hold up to n pointers and must hold at least ⌈n/2⌉ pointers.
The number of pointers in a node is called the fan-out of the node.

Consider a node containing m pointers. Pointer Pi (1 < i < m) points to a
subtree containing search key values less than Ki and greater than or equal to Ki−1.
Pointer Pm points to a subtree containing search key values greater than or equal to Km−1,
and pointer P1 points to a subtree containing search key values less than K1.

4. Figures 11.7 (textbook Fig. 11.8) and textbook Fig. 11.9 show B+-trees for
the deposit file with n=3 and n=5.


Queries on B+-Trees

1. Suppose we want to find all records with a search key value of k.


o Examine the root node and find the smallest search key value Ki > k.
o Follow pointer Pi to another node; if no search key value greater than k
exists, follow the last pointer Pm instead.
o Continue down through the non-leaf nodes in the same way, looking for the
smallest search key value > k and following the corresponding pointer.
o Eventually we arrive at a leaf node, where a pointer will point to the
desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If
there are K search key values in the file, this path is no longer than
⌈log⌈n/2⌉(K)⌉.

This means that the path is not long, even in large files. For a 4 KB disk
block with a search-key size of 12 bytes and a disk pointer of 8 bytes, n is
around 200. Even with n = 100, a look-up of 1 million search-key values requires only
⌈log50(1,000,000)⌉ = 4 nodes to be accessed. Since the root is usually in the
buffer, typically only 3 or fewer disk reads are needed.
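
The search procedure above can be sketched in a few lines of Python. The node layout below (dictionaries holding keys and either child pointers or record pointers) is only a simplified in-memory stand-in for the disk-based nodes described in the notes; the branch names and record numbers are made up:

import bisect

# a node is a dict; non-leaf nodes have "children", leaf nodes have record "pointers"
leaf1 = {"leaf": True,  "keys": ["Brighton", "Downtown"],  "pointers": [10, 20]}
leaf2 = {"leaf": True,  "keys": ["Mianus", "Perryridge"],  "pointers": [30, 40]}
leaf3 = {"leaf": True,  "keys": ["Redwood", "Round Hill"], "pointers": [50, 60]}
root  = {"leaf": False, "keys": ["Mianus", "Redwood"],
         "children": [leaf1, leaf2, leaf3]}

def bplus_find(node, k):
    # walk down from the root to the leaf that could hold k
    while not node["leaf"]:
        i = bisect.bisect_right(node["keys"], k)   # smallest child whose range can contain k
        node = node["children"][i]
    # in the leaf, look for k itself
    i = bisect.bisect_left(node["keys"], k)
    if i < len(node["keys"]) and node["keys"][i] == k:
        return node["pointers"][i]                 # pointer to the record (or bucket)
    return None

print(bplus_find(root, "Perryridge"))   # 40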

Updates on B+-Trees

1. Insertions and Deletions:

Insertion and deletion are more complicated, as they may require


splitting or combining nodes to keep the tree balanced. If splitting or
combining are not required, insertion works as follows:

o Find leaf node where search key value should appear.


o If value is present, add new record to the bucket.
o If value is not present, insert value in leaf node (so that search keys
are still in order).

o Create a new bucket and insert the new record.

If splitting or combining are not required, deletion works as follows:

o Deletion: Find record to be deleted, and remove it from the bucket.


o If bucket is now empty, remove search key value from leaf node.
2. Insertions Causing Splitting:

When insertion causes a leaf node to be too large, we split that node. In
Figure 11.8, assume we wish to insert a record with a bname value of
``Clearview''.

o There is no room for it in the leaf node where it should appear.


o We now have n values (the n-1 search key values plus the new one
we wish to insert).

o We put the first ⌈n/2⌉ values in the existing node, and the remainder
into a new node (a small split sketch follows this list).
o Figure 11.10 shows the result.
o The new node must be inserted into the B+-tree.
o We also need to update search key values for the parent (or higher)
nodes of the split leaf node. (Except if the new node is the leftmost
one)
o Order must be preserved among the search key values in each
node.
o If the parent was already full, it will have to be split.
o When a non-leaf node is split, the children are divided among the
two new nodes.
o In the worst case, splits may be required all the way up to the root.
(If the root is split, the tree becomes one level deeper.)
o Note: when we start a B+-tree, we begin with a single node that is
both the root and a single leaf. When it gets full and another
insertion occurs, we split it into two leaf nodes, requiring a new root.
3. Deletions Causing Combining:

Deleting records may cause tree nodes to contain too few pointers. Then
we must combine nodes.

o If we wish to delete ``Downtown'' from the B+-tree of Figure 11.11, this occurs.
o In this case, the leaf node is empty and must be deleted.
o If we wish to delete ``Perryridge'' from the B+-tree of Figure 11.11,
the parent is left with only one pointer, and must be coalesced with
a sibling node.
o Sometimes higher-level nodes must also be coalesced.
o If the root becomes empty as a result, the tree is one level less deep
(Figure 11.13).
o Sometimes the pointers must be redistributed to keep the tree
balanced.
o Deleting ``Perryridge'' from Figure 11.11 produces Figure 11.14.


4. To summarize:
o Insertion and deletion are complicated, but require relatively few
operations.
o Number of operations required for insertion and deletion is
proportional to logarithm of number of search keys.
o B+-trees are fast index structures for databases.
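
The key-redistribution step of a leaf split can be illustrated as follows. This is only a sketch of the rule quoted above (the first ⌈n/2⌉ values stay in the existing node, the rest move to a new node, and the new node's smallest key is pushed up to the parent); it does not build the surrounding tree:

import math

def split_leaf(keys, new_key, n):
    # insert new_key into a full leaf that already holds n-1 keys and split it
    keys = sorted(keys + [new_key])          # the n values mentioned above
    cut = math.ceil(n / 2)
    old_leaf, new_leaf = keys[:cut], keys[cut:]
    # the smallest key of the new leaf is the value inserted into the parent
    return old_leaf, new_leaf, new_leaf[0]

# leaf of a tree with n = 3 that already holds "Brighton" and "Downtown"
print(split_leaf(["Brighton", "Downtown"], "Clearview", 3))
# (['Brighton', 'Clearview'], ['Downtown'], 'Downtown')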

B+-Tree File Organization

1. The B+-tree structure is used not only as an index but also as an organizer
for records in a file.
2. In a B+-tree file organization, the leaf nodes of the tree store records
instead of storing pointers to records, as shown in Fig. 11.17.
3. Since records are usually larger than pointers, the maximum number of
records that can be stored in a leaf node is less than the maximum
number of pointers in a nonleaf node.
4. However, the leaf nodes are still required to be at least half full.
5. Insertion and deletion from a B+-tree file organization are handled in the
same way as insertion and deletion from a B+-tree index.
6. When a B+-tree is used for file organization, space utilization is particularly
important. We can improve the space utilization by involving more sibling
nodes in redistribution during splits and merges.
7. In general, if m nodes are involved in redistribution, each node can be
guaranteed to contain at least ⌊(m−1)n/m⌋ entries. However, the cost of
update becomes higher as more siblings are involved in redistribution.

B-TREE INDEX FILES

1. B-tree indices are similar to B+-tree indices.

o The difference is that a B-tree eliminates the redundant storage of search
key values.
o In the B+-tree of Figure 11.11, some search key values appear twice.
o A corresponding B-tree, shown in Figure 11.18, allows each search key value to
appear only once.
o Thus we can store the index in less space.

Figure 11.8: Leaf and nonleaf node of a B-tree.

2. Advantages:

• Generally, the structural simplicity of the B+-tree is preferred.


• Lack of redundant storage (but only marginally different).
• Some searches are faster (key may be in non-leaf node).


3. Disadvantages:
o Leaf and non-leaf nodes are of different size (complicates storage)
o Deletion may occur in a non-leaf node (more complicated)

STATIC HASHING

1. Index schemes force us to traverse an index structure. Hashing avoids this.


2. hashing also provides a way of constructing indices.

Hash File Organization

1. Hashing involves computing the address of a data item by computing a


function on the search key value.
2. A hash function h is a function from the set of all search key values K to
the set of all bucket addresses B.
o We choose a number of buckets to correspond to the number of
search key values we will have stored in the database.
o To perform a lookup on a search key value Ki, we compute h(Ki),
and search the bucket with that address.
o If two search keys Ki and Kj map to the same address, because
h(Ki) = h(Kj), then the bucket at the address obtained will contain
records with both search key values.
o In this case we will have to check the search key value of every
record in the bucket to get the ones we want.
o Insertion and deletion are simple.

Hash Functions

1. A good hash function gives an average-case lookup that is a small


constant, independent of the number of search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with ith letter of
the alphabet to the ith bucket.
o Problem: this does not give uniform distribution.
o Many more names will be mapped to ``A'' than to ``X''.
o Typical hash functions perform some operation on the internal
binary machine representations of characters in a key.
o For example, compute the sum, modulo the number of buckets, of the binary
representations of the characters of the search key.
o See Figure 11.18, using this method for 10 buckets (assuming the
ith character in the alphabet is represented by integer i).
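
The character-sum hash function just described can be written directly; the bucket count of 10 matches the example above, while the branch names used here are only illustrative:

NBUCKETS = 10

def h(key):
    # sum of the binary (integer) representations of the characters,
    # modulo the number of buckets
    return sum(ord(c) for c in key) % NBUCKETS

for name in ["Perryridge", "Round Hill", "Brighton"]:
    print(name, "-> bucket", h(name))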


Handling of bucket overflows

1. We handle bucket overflow by using overflow buckets.

2. All the overflow buckets of a given bucket are chained together in a linked
list, as in Fig. 12.22.


3. Open hashing occurs where records are stored in different buckets.


Compute the hash function and search the corresponding bucket to find a
record.
4. Closed hashing occurs where all records are stored in one bucket. Hash
function computes addresses within that bucket. (Deletions are difficult.)
Not used much in database applications.
5. Drawback to our approach: Hash function must be chosen at
implementation time.
o Number of buckets is fixed, but the database may grow.
o If number is too large, we waste space.
o If number is too small, we get too many ``collisions'', resulting in
records of many search key values being in the same bucket.
o Choosing the number to be twice the number of search key values
in the file gives a good space/performance tradeoff.

Hash Indices

1. A hash index organizes the search keys with their associated pointers into
a hash file structure.
2. We apply a hash function on a search key to identify a bucket, and store
the key and its associated pointers in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if
a file itself is organized using hashing, there is no need for a separate hash
index structure on it.


DYNAMIC HASHING

1. As the database grows over time, we have three options:


o Choose hash function based on current file size. Get performance
degradation as file grows.
o Choose hash function based on anticipated file size. Space is wasted
initially.
o Periodically re-organize hash structure as file grows. Requires
selecting new hash function, recomputing all addresses and
generating new bucket assignments. Costly, and shuts down
database.
2. Some hashing techniques allow the hash function to be modified
dynamically to accommodate the growth or shrinking of the database.
These are called dynamic hash functions.
o Extendable hashing is one form of dynamic hashing.
o Extendable hashing splits and coalesces buckets as database size
changes.
o This imposes some performance overhead, but space efficiency is
maintained.
o As reorganization is on one bucket at a time, overhead is acceptably
low.
3. How does it work?


Figure 12.4 General extendable hash structure.

o We choose a hash function that is uniform and random, and that
generates values over a relatively large range.
o The range is b-bit binary integers (typically b = 32).
o 2^32 is over 4 billion, so we do not generate that many buckets!
o Instead we create buckets on demand, and do not use all b bits of
the hash initially.

o The first i high-order bits of the hash value are used as an offset into a table of bucket addresses.


o Value of i grows and shrinks with the database.
o Figure 12.4 shows an extendable hash structure.
o Note that the i appearing over the bucket address table tells how
many bits are required to determine the correct bucket.
o It may be the case that several entries point to the same bucket.
o All such entries will have a common hash prefix, but the length of
this prefix may be less than i.
o So we give each bucket an integer giving the length of the common
hash prefix.
o The integer associated with bucket j is shown as ij.
o The number of bucket address table entries pointing to bucket j is then 2^(i − ij).
4. To find the bucket containing search key value Kl:
o Compute h(Kl).
o Take the first i high-order bits of h(Kl).
o Look at the corresponding table entry for this i-bit string.
o Follow the bucket pointer in the table entry.
(A small working sketch of this scheme is given at the end of this section.)
5. We now look at insertions in an extendable hashing scheme.
o Follow the same procedure for lookup, ending up in some bucket j.
o If there is room in the bucket, insert information and insert record in
the file.


o If the bucket is full, we must split the bucket, and redistribute the
records.
o If bucket is split we may need to increase the number of bits we use
in the hash.
6. Two cases exist:

1. If i = ij, then only one entry in the bucket address table points to bucket
j.

o Then we need to increase the size of the bucket address table so


that we can include pointers to the two buckets that result from
splitting bucket j.
o We increment i by one, thus considering more of the hash, and
doubling the size of the bucket address table.
o Each entry is replaced by two entries, each containing original
value.
o Now two entries in bucket address table point to bucket j.
o We allocate a new bucket z, and set the second pointer to point to z.
o Set ij and iz to i.
o Rehash all records in bucket j which are put in either j or z.
o Now insert new record.
o It is remotely possible, but unlikely, that the new hash will still put
all of the records in one bucket.
o If so, split again and increment i again.

2. If i > ij, then more than one entry in the bucket address table points to
bucket j.

o Then we can split bucket j without increasing the size of the bucket
address table
o Note that all entries that point to bucket j correspond to hash
prefixes that have the same value on the leftmost ij bits.
o We allocate a new bucket z, and set ij and iz to the original ij value
plus 1.
o Now adjust entries in the bucket address table that previously
pointed to bucket j.
o Leave the first half pointing to bucket j, and make the rest point to
bucket z.
o Rehash each record in bucket j as before.
o Reattempt new insert.
7. Note that in both cases we only need to rehash records in bucket j.
8. Deletion of records is similar. Buckets may have to be coalesced, and
bucket address table may have to be halved.
9. Insertion is illustrated for the example deposit file of Figure 12.25


o 32-bit hash values on bname are shown in Figure 12.26

. An initial empty hash structure is shown in Figure 12.27.

o We insert records one by one.


o We (unrealistically) assume that a bucket can only hold 2 records, in
order to illustrate both situations described.
o As we insert the Perryridge and Round Hill records, this first bucket
becomes full.
o When we insert the next record (Downtown), we must split the
bucket.


o Since i = ij, we need to increase the number of bits we use from the
hash.
o We now use 1 bit, allowing us 2^1 = 2 buckets.
o This makes us double the size of the bucket address table to two
entries.
o We split the bucket, placing the records whose search key hash
begins with 1 in the new bucket, and those with a 0 in the old
bucket (Figure 11.23).
o Next we attempt to insert the Redwood record, and find it hashes to
1.
o That bucket is full, and i = ij.
o So we must split that bucket, increasing the number of bits we must
use to 2.
o This necessitates doubling the bucket address table again to four
entries (Figure 12.28).

o We rehash the entries in the old bucket.


o We continue on for the deposit records of Figure 12.25, obtaining
the extendable hash structure of Figure 12.31.

10.Advantages:
o Extendable hashing provides performance that does not degrade as
the file grows.
o Minimal space overhead - no buckets need be reserved for future
use. Bucket address table only contains one pointer for each hash
value of current prefix length.

11.Disadvantages:
o Extra level of indirection in the bucket address table
o Added complexity
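
The following Python sketch puts the pieces of extendable hashing above together: a bucket address table indexed by the first i high-order bits of a 32-bit hash, buckets carrying a local depth, table doubling when i = ij, and a plain bucket split otherwise. It is an illustration only; the hash function, the bucket capacity of 2 records (as in the example) and the branch names are assumptions of this sketch, not the textbook's code.

B = 32                                        # bits produced by the hash function
CAP = 2                                       # records per bucket (unrealistically small)

def h(key):
    return sum(ord(c) for c in key) * 2654435761 % (1 << B)   # a 32-bit hash value

class Bucket:
    def __init__(self, depth):
        self.depth, self.keys = depth, []     # local depth i_j and stored keys

i = 0                                         # global depth: bits of the hash in use
directory = [Bucket(0)]                       # bucket address table with 2**i entries

def slot(key):
    return h(key) >> (B - i) if i else 0      # first i high-order bits as table offset

def insert(key):
    global i, directory
    b = directory[slot(key)]
    if len(b.keys) < CAP:
        b.keys.append(key)
        return
    # bucket full: split it, doubling the table first if its local depth equals i
    if b.depth == i:
        directory = [directory[j // 2] for j in range(2 * len(directory))]
        i += 1
    b.depth += 1
    new = Bucket(b.depth)
    for j in range(len(directory)):           # redirect half of the entries pointing to b
        if directory[j] is b and (j >> (i - b.depth)) & 1:
            directory[j] = new
    old_keys, b.keys = b.keys, []
    for k in old_keys + [key]:                # rehash everything that was in the bucket
        insert(k)

for name in ["Perryridge", "Round Hill", "Downtown", "Redwood", "Brighton"]:
    insert(name)
for b in dict.fromkeys(directory):            # each distinct bucket once
    print("local depth", b.depth, b.keys)
print("global depth i =", i)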


Index Definition in SQL:

Some SQL implementations include data-definition commands to create and
drop indices. The IBM SAA-SQL commands are:

An index is created by

create index <index-name>

on r (<attribute-list>)

The attribute list is the list of attributes in relation r that form the search key for
the index.

To create an index on bname for the branch relation:

create index b-index

on branch (bname)

If the search key is a candidate key, we add the word unique to the definition:

create unique index b-index

on branch (bname)

If bname is not a candidate key, an error message will appear.

If the index creation succeeds, any attempt to insert a tuple violating this
requirement will fail.

The unique keyword is redundant if primary keys have been defined with
integrity constraints already.

To remove an index, the command is

drop index <index-name>


Multiple-Key Access:

Until now, we have assumed implicitly that only one index (or hash table) is used
to process a query on a relation. However, for certain types of queries, it is
advantageous to use multiple indices if they exist.

Using Multiple Single-Key Indices

Assume that the account file has two indices: one for branch-name and one for
balance.

Consider the following query: “Find all account numbers at the Perryridge branch

with balances equal to $1000.” We write

select account-number

from account

where branch-name = “Perryridge” and balance = 1000

There are three strategies possible for processing this query:

1. Use the index on branch-name to find all records pertaining to the Perryridge
branch. Examine each such record to see whether balance = 1000.

2. Use the index on balance to find all records pertaining to accounts with balances
of $1000. Examine each such record to see whether branch-name = "Perryridge".

3. Use the index on branch-name to find pointers to all records pertaining to the
Perryridge branch. Also, use the index on balance to find pointers to all records
pertaining to accounts with a balance of $1000. Take the intersection of these
two sets of pointers. Those pointers that are in the intersection point to records
pertaining to both Perryridge and accounts with a balance of $1000.

The third strategy is the only one of the three that takes advantage of the existence
of multiple indices. However, even this strategy may be a poor choice if all of the
following hold:

• There are many records pertaining to the Perryridge branch.

• There are many records pertaining to accounts with a balance of $1000.

• There are only a few records pertaining to both the Perryridge branch and
accounts with a balance of $1000.

If these conditions hold, we must scan a large number of pointers to produce a small
result. An index structure called a "bitmap index" greatly speeds up the intersection
operation used in the third strategy.
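
The third strategy amounts to a set intersection over record pointers; a tiny Python sketch of it (with made-up record numbers standing in for the pointers returned by the two indices) is:

# pointers (record numbers) returned by two single-key indices -- values are illustrative
perryridge_ptrs  = {1, 4, 7, 9}       # from the index on branch-name
balance1000_ptrs = {2, 4, 9, 12}      # from the index on balance

# third strategy: intersect the two pointer sets, then fetch only those records
matching = perryridge_ptrs & balance1000_ptrs
print(matching)                        # {4, 9} -- records satisfying both conditions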


Q – BANK

1.Give a detailed note on various file organization techniques(15)-Nov2007

2.Give a detailed note on different indexing techniques (15)-Nov2007

3.(a) Explain about file and system structure of DBMS (7)-Nov/Dec2008

(b) What is meant by Indexing explain (8)

4.Explain the magnetic disk storage device in detail-April/May2009

5.Describe in detail about B+ - index files(15) –April/May2009

7.Explain DBTG data structure and architecture .Also discuss the DBTG data
retrieval facility (15)-May2007

8.Discuss about RAID and Tertiary storage(15)April/May2008

9. Discuss static hashing and dynamic hashing (15) April/May2008


UNIT - V

Transactions

Transaction concept – Transaction State – Implementation of Atomicity and


Durability – Concurrent Executions – Serializability – Recoverability –
Implementation of Isolation – Transaction Definition in SQL – Testing for
Serializability

Concurrency Control

Lock-Based Protocols – Timestamp-Based Protocols – Validation-Based


Protocols – Multiple Granularity – Multiversion Schemes – Deadlock Handling –
Insert and Delete Operations – Weak Levels of Consistency – Concurrency of
Index Structures.

Recovery System

Failure Classification – Storage Structure – Recovery and Atomicity – Log-


Based Recovery – Shadow Paging – Recovery with Concurrent Transactions –
Buffer Management – Failure with Loss of Nonvolatile Storage – Advance
Recovery Techniques – Remote Backup Systems


2Marks Q & A:

1. What is transaction?
Collections of operations that form a single logical unit of work are
called transactions.

2. What are the two statements regarding transaction?


The two statements regarding a transaction are of the form:
Begin transaction
End transaction

3. What are the properties of transaction?


The properties of transactions are:
Atomicity
Consistency
Isolation
Durability

4. What is recovery management component?


Ensuring durability is the responsibility of a software component of
the base system called the recovery management component.

5. When is a transaction rolled back?


Any changes that the aborted transaction made to the database
must be undone.
Once the changes caused by an aborted transaction have been
undone, then the transaction has been rolled back.

6. What are the states of transaction?


The states of transaction are
Active
Partially committed
Failed
Aborted
Committed
Terminated

7. What is a shadow copy scheme?


It is a simple, but extremely inefficient, scheme called the shadow copy scheme.
It is based on making copies of the database, called shadow copies, and assumes
that only one transaction is active at a time. The scheme also assumes that the
database is simply a file on disk.
8. Give the reasons for allowing concurrency?
The reasons for allowing concurrency is if the transactions run
serially, a short transaction may have to wait for a preceding long
transaction to complete, which can lead to unpredictable delays in
running a transaction. So concurrent execution reduces the
unpredictable delays in running transactions.

9. What is average response time?


The average response time is that the average time for a
transaction to be completed after it has been submitted.

10. What are the two types of serializability?


The two types of serializability is
Conflict serializability
View serializability

11. Define lock?


A lock is the mechanism most commonly used to implement the requirement
that a transaction may access a data item only if it is currently holding
a lock on that item.

12. What are the different modes of lock?


The modes of lock are:
Shared
Exclusive

13. Define deadlock?


Neither of the transaction can ever proceed with its normal
execution. This situation is called deadlock.

14. Define the phases of two phase locking protocol

Growing phase: a transaction may obtain locks but not release any lock
Shrinking phase: a transaction may release locks but may not obtain
any new locks.

15. Define upgrade and downgrade?


It provides a mechanism for conversion from shared lock to
exclusive lock is known as upgrade.
It provides a mechanism for conversion from exclusive lock to
shared lock is known as downgrade.

16. What is a database graph?


The partial ordering implies that the set D may now be viewed as
a directed acyclic graph, called a database graph.

17. What are the two methods for dealing deadlock problem?
The two methods for dealing deadlock problem is deadlock
detection and deadlock recovery.

18. What is a recovery scheme?


An integral part of a database system is a recovery scheme that
can restore the database to the consistent state that existed before the
failure.

19. What are the two types of errors?


The two types of errors are:
Logical error
System error

20. What are the storage types?


The storage types are:
Volatile storage

Nonvolatile storage

21. Define blocks?


The database system resides permanently on nonvolatile storage,
and is partitioned into fixed-length storage units called blocks.

22. What is meant by Physical blocks?


The input and output operations are done in block units. The
blocks residing on the disk are referred to as physical blocks.

23. What is meant by buffer blocks?


The blocks residing temporarily in main memory are referred to as buffer
blocks.

24. What is meant by disk buffer?


The area of memory where blocks reside temporarily is called the disk
buffer.

25. What is meant by log-based recovery?


The most widely used structures for recording database
modifications is the log.
The log is a sequence of log records, recording all the update
activities in the database.
There are several types of log records.

26. What are uncommitted modifications?


The immediate-modification technique allows database modifications to
be output to the database while the transaction is still in the active
state. Data modifications written by active transactions are called
uncommitted modifications.

27. Define shadow paging.


An alternative to log-based crash recovery technique is shadow
paging. This technique needs fewer disk accesses than do the log-based
methods.

28. Define page.


The database is partitioned into some number of fixed-length
blocks, which are referred to as pages.

29. Explain current page table and shadow page table.


The key idea behind the shadow paging technique is to maintain two
page tables during the life of the transaction: the current page table
and the shadow page table. Both the page tables are identical
when the transaction starts. The current page table may be
changed when a transaction performs a write operation.

30. What are the drawbacks of shadow-paging technique?


• Commit Overhead
• Data fragmentation
• Garbage collection

31. Define garbage collection.



Garbage may be created also as a side effect of crashes.


Periodically, it is necessary to find all the garbage pages and to
add them to the list of free pages. This process is called garbage
collection.
32. Differentiate strict two phase locking protocol and rigorous two phase
locking protocol.
In strict two phase locking protocol all exclusive mode locks taken by a
transaction is held until that transaction commits. Rigorous two phase
locking protocol requires that all locks be held until the transaction
commits.
33. How are the time stamps implemented?
• Use the value of the system clock as the time stamp. That is a
transaction’s time stamp is equal to the value of the clock when the
transaction enters the system.
• Use a logical counter that is incremented after a new timestamp
has been assigned; that is the time stamp is equal to the value of the
counter.

34. What are the time stamps associated with each data item?
• W-timestamp (Q) denotes the largest time stamp of any
transaction that executed WRITE (Q) successfully.
• R-timestamp (Q) denotes the largest time stamp of any
transaction that executed READ (Q) successfully.

35. Define transaction-processing systems.


The concept of a transaction provides a mechanism for describing logical
units of database processing. Transaction-processing systems are systems
with large databases and many concurrent users working with those systems.

36. Define read only transaction.


The database operations in a transaction do not update the database but
only retrieve data and that transaction is called a read-only transaction.

37. When does the transaction go into an active state and partially committed
state?
A transaction goes into an active state immediately after it starts
execution, where it can issue read and write operations. When the
transaction ends, it moves into the partially committed state.

38. What is called as committed state?


Once a transaction is committed, it has concluded its execution
successfully and all its changes must be recorded permanently in the database.

39. Define ACID property.


A–Atomicity, maintained by transaction manager component.
C–Consistency, maintained by application programmer by coding with
integrity constraints.
I – Isolation, maintained by concurrency control component.
D–Durability, maintained by recovery management component.

40. What is isolation of ACID properties?


The execution of a transaction should not be interfered with any other


transactions executing concurrently.

41. Define cascading rollback.


An uncommitted transaction that has read a data item written by a failed
transaction must itself be rolled back. This phenomenon, in which one failure
forces a series of rollbacks and wastes a considerable amount of work, is called
cascading rollback.

42. What is blind write?


If a transaction writes a data item without reading the data is called blind
write. This sometimes causes inconsistency.

43. Define serial schedule.


A schedule, S is serial if for every transaction T participating in the
schedule and all the operations of T are executed consecutively in the schedule;
otherwise the schedule is called Non-serial schedule.

44. What is the use of locking?


It is used to prevent concurrent transactions from interfering with one
another and enforcing an additional condition that guarantees serializability.

45. What is called as a time stamp?


A time stamp is a unique identifier for each transaction generated by the
system. Concurrency control protocols use this time stamp to ensure
serializability.

46. What is shared lock and Exclusive lock?


Shared lock allows other transactions to read the data item and write is not
allowed. Exclusive lock allows both read and write of data item and a single
transaction exclusively holds the lock

47. When does a deadlock occur?


Deadlock occurs when one transaction T in a set of two or more
transactions is waiting for some item that is locked by some other
transaction in the set.

48. What is meant by transaction rollback?


If a transaction fails for reasons like power failure, hardware failure
or logical error in the transaction after updating the database, it is rolled
back to restore the previous value.

49. Write the reasons for using concurrent execution.


High throughput
Number of transactions that are executed in a given amount
of time is high.
Low delay
Average response time is reduced. Average response time
is the average time for a transaction to be completed after it has
been submitted.

50. Define recoverable schedule.


Recoverable schedule is the one where for each pair of transactions


Ti and Tj such that Tj reads a data item previously written by Ti, the
commit operation of Ti appears before the commit operation of Tj.

51. Define Query optimization.


The DBMS must devise an execution strategy for retrieving the
result of the query from the database files. Process of choosing a suitable
execution strategy for processing a query is known as Query optimization.

52. Define Distributed database systems.


A distributed system consists of a collection of sites, connected
together via some kind of communication network, in which
o Each site is a full database system site in its own right.
o But the sites have agreed to work together so that a user at any
site can access data anywhere in the network exactly as if the data
were all stored at the user’s own site.

53. What are the advantages of distributed system?


Reliability:
 the probability that the system is up and running at any
given moment.
Availability:
 the probability that the system is up and running
continuously throughout a specific period.

54. What is query processor?


The objective of minimizing network utilization implies the query
optimization process itself needs to be distributed as well as the query
execution process.

55. Explain Client/Server.


Client/Server refers primarily to architecture, or logical division of
responsibility, the client is the application called the front end. The server
is the DBMS, which is the back end.

56.What are the rules that have to be followed during fragmentation?


The three correctness rules are:
o Completeness: ensure no loss of data
o Reconstruction: ensure Functional dependency preservation
o Disjoint ness: ensure minimal data redundancy.

57.What are the objectives of concurrency control?


o To be resistant to site and communication failure.
o To permit parallelism to satisfy performance requirements.
o To place few constraints on the structure of atomic actions.

58. What is multiple-copy consistency problem?


It is a problem that occurs when there is more than one copy of data
item in different locations and when the changes are made only in some
copies not in all copies.


59. What are the types of Locking protocols?


o Centralized 2 phase locking
o Primary 2 phase locking
o Distributed 2 phase locking
o Majority locking

60. What are the failures in distributed DBMS?


o The loss of message
o The failure of a communication link
o The failure of a site
o Network partitioning

61. What is replication?


The process of generating and reproducing multiple copies of data
at one or more sites is called replication.
62. What are the types of replication?
o Read-only snapshots
o Updateable snapshots
o Multimaster replication
o procedural replication

63. What are the applications of Data warehouse?


OLAP (Online Analytical Processing)
DSS (Decision Support System)
OLTP (Online Transaction Processing)

64. What is parallel database?


In some of the architectures multiple CPUs are working in parallel
and are physically located in a close environment in the same building and
communicating at a very high speed. The databases operating in such
environment is known as parallel databases.

65. Define Data warehousing.


A data warehouse is a "subject-oriented, integrated, non-volatile, time-variant
collection of data in support of management's decisions".

66. Define data mining.


The data mining refers to the mining or discovery of new
information in terms of patterns or rules from vast amount of data.


Lecture Notes:
TRANSACTIONS

• Collections of operations that form a single logical unit of work are called
transactions.

• A database system must ensure proper execution of transactions despite


failures—either the entire transaction executes, or none of it does.

• A transaction is a unit of program execution that accesses and possibly


updates various data items.

• Usually, a transaction is initiated by a user program written in a High-level


data-manipulation language or programming language, where it is
delimited by statements (or function calls) of the form begin transaction
and end transaction.

• To ensure integrity of the data, we require that the database system


maintain the following properties of the transactions:
 Atomicity. Either all operations of the transaction are reflected
properly in the database, or none are.

 Consistency. Execution of a transaction in isolation (that is, with


no other transaction executing concurrently) preserves the
consistency of the database.

 Isolation. Even though multiple transactions may execute


concurrently, the system guarantees that, for every pair of
transactions Ti and Tj, it appears to Ti that either Tj finished
execution before Ti started, or Tj started execution after Ti
finished. Thus, each transaction is unaware of other transactions
executing concurrently in the system.

 Durability. After a transaction completes successfully, the


changes it has made to the database persist, even if there are
system failures.

• These properties are often called the ACID properties; the acronym is
derived from the first letter of each of the four properties.

• Transactions access data using two operations:


1. read(X), which transfers the data item X from the database to a
local buffer belonging to the transaction that executed the read
operation.
2. write(X), which transfers the data item X from the local buffer of
the transaction that executed the write back to the database.


• Let Ti be a transaction that transfers $50 from account A to account B. This


transaction can be defined as

Ti: read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
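
The read and write operations of Ti can be made explicit with a small Python sketch; the dictionary standing in for the database and the transaction-local buffer are assumptions of this illustration, not part of the textbook definition:

database = {"A": 1000, "B": 2000}      # the stored data items
buffer = {}                            # transaction-local buffer

def read(x):
    buffer[x] = database[x]            # read(X): database -> local buffer

def write(x):
    database[x] = buffer[x]            # write(X): local buffer -> database

# transaction Ti: transfer $50 from account A to account B
read("A")
buffer["A"] -= 50
write("A")
read("B")
buffer["B"] += 50
write("B")

print(database)                        # {'A': 950, 'B': 2050}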

• Let us now consider each of the ACID requirements.

• Consistency: The consistency requirement here is that the sum of A


and B be unchanged by the execution of the transaction. If the
database is consistent before an execution of the transaction, the
database remains consistent after the execution of the transaction.

• Atomicity: Suppose that, just before the execution of transaction Ti


the values of accounts A and B are $1000 and $2000, respectively.
Suppose that the failure happened after the write (A) operation but
before the write (B) operation. In this case, the values of accounts A and
B reflected in the database are $950 and $2000. The system destroyed
$50 as a result of this failure. We term such a state an inconsistent
state. If the atomicity property is present, all actions of the transaction
are reflected in the database, or none are. Ensuring atomicity is the
responsibility of the database system itself; specifically, it is handled by
a component called the transaction-management component.

• Durability: The durability property guarantees that, once a transaction


completes successfully, all the updates that it carried out on the
database persist, even if there is a system failure after the transaction
completes execution. We can guarantee durability by ensuring that
either
1. The updates carried out by the transaction have been written to
disk before the transaction completes.
2. Information about the updates carried out by the transaction and
written to disk is sufficient to enable the database to reconstruct the
update when the database system is restarted after the failure.
Ensuring durability is the responsibility of a component of the database
system called the recovery-management component.

• Isolation: Even if the consistency and atomicity properties are ensured


for each transaction, if several transactions are executed concurrently,
their operations may interleave in some undesirable way, resulting in an
inconsistent state. A way to avoid the problem of concurrently executing
transactions is to execute transactions serially—that is, one after the
other. Ensuring the isolation property is the responsibility of a component
of the database system called the concurrency-control component
executes transactions serially—that is, one after the other.

Transaction State:


A transaction may not always complete its execution successfully. Such a


transaction is termed aborted. Any changes that the aborted transaction made to
the database must be undone. Once the changes caused by an aborted
transaction have been undone, we say that the transaction has been rolled back.
A transaction that completes its execution successfully is said to be committed.
A committed transaction that has performed updates transforms the database
into a new consistent state, which must persist even if there is a system failure.
The only way to undo the effects of a committed transaction is to execute a
compensating transaction. A transaction must be in one of the following states:

1. Active, the initial state; the transaction stays in this state while it is
executing.
2. Partially committed, after the final statement has been execute.
3. Failed, after the discovery that normal execution can no longer Proceed.
4. Aborted, after the transaction has been rolled back and the database has
been restored to its state prior to the start of the transaction.
5. Committed, after successful completion.

 A transaction starts in the active state.


 When it finishes its final statement, it enters the partially committed state.
 The database system then writes out enough information to disk.
 If a transaction enters the failed state such a transaction must be rolled
back.
 Then, it enters the aborted state. At this point, the system has two options:
• It can restart the transaction.

• It can kill the transaction.

 The state diagram corresponding to a transaction appears as follows:


Implementation of Atomicity and Durability:

 The recovery-management component of a database system can support


atomicity and durability by a variety of schemes.
 We first consider a simple, but extremely inefficient, scheme called the
shadow copy scheme.
 This scheme, which is based on making copies of the database, called
shadow copies, assumes that only one transaction is active at a time.
 A pointer called db-pointer is maintained on disk; it points to the current
copy of the database.
 In the shadow-copy scheme, a transaction that wants to update the
database first creates a complete copy of the database.
 All updates are done on the new database copy, leaving the original copy,
the shadow copy, untouched.
 If at any point the transaction has to be aborted, the system merely
deletes the new copy.
 The old copy of the database has not been affected.


Shadow-copy technique for atomicity and durability.


 The transaction is said to have been committed at the point where the
updated db-pointer is written to disk.
 If the transaction fails at any time before db-pointer is updated, the old
contents of the database are not affected.
 Suppose that the system fails at any time before the updated db-pointer is
written to disk.
 Then, when the system restarts, it will read db-pointer and will thus see
the original contents of the database, and none of the effects of the
transaction will be visible on the database.
 Next, suppose that the system fails after db-pointer has been updated on
disk.
 Before the pointer is updated, all updated pages of the new copy of the
database were written to disk.
 Thus, the atomicity and durability properties of transactions are ensured
by the shadow-copy implementation of the recovery-management
component.
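
A file-based sketch of the shadow-copy idea is given below. It assumes, as the scheme itself does, that the database is a single file on disk; the file names, the uuid-based naming of copies, and the use of os.replace as the atomic db-pointer update are choices of this illustration, not part of the textbook description:

import os, shutil, uuid

def run_transaction(update):
    # db-pointer.txt names the current copy of the (single-file) database
    with open("db-pointer.txt") as f:
        current = f.read().strip()
    new_copy = "db_" + uuid.uuid4().hex + ".dat"
    shutil.copyfile(current, new_copy)          # complete new copy; the old one is the shadow copy
    try:
        update(new_copy)                        # all updates go to the new copy only
    except Exception:
        os.remove(new_copy)                     # abort: delete the new copy, shadow copy unaffected
        raise
    with open("pointer.tmp", "w") as f:
        f.write(new_copy)
    os.replace("pointer.tmp", "db-pointer.txt") # commit point: db-pointer now names the new copy

# set up an initial database copy and pointer, then run one transaction
with open("db_0.dat", "w") as f:
    f.write("A=1000 B=2000")
with open("db-pointer.txt", "w") as f:
    f.write("db_0.dat")

def update(path):
    with open(path, "a") as f:
        f.write(" updated")

run_transaction(update)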

Concurrent Executions:

 Transaction-processing systems usually allow multiple transactions to run


concurrently.
 Allowing multiple transactions to update data concurrently causes several
complications with consistency of the data. There are two good reasons
for allowing concurrency:

• Improved throughput and resource utilization. A transaction


consists of many steps. Some involve I/O activity; others involve
CPU activity. Therefore, I/O activity can be done in parallel with
processing at the CPU. All of this increases the throughput of the
system correspondingly; the processor and disk utilization also
increases.
• Reduced waiting time. There may be a mix of transactions running
on a system, some short and some long. Concurrent execution
reduces the unpredictable delays in running transactions. Moreover,
it also reduces the average response time: the average time for a
transaction to be completed after it has been submitted.


 The motivation for using concurrent execution in a database is essentially


the same as the motivation for using multiprogramming in an operating
system.
 Let T1 andT2 be two transactions that transfer funds from one account to
another. Transaction T1 transfers $50 from account A to account B. It is
defined as:
T1: read(A);
A := A − 50;
write(A);
read(B);
B := B + 50;
write(B).
 Transaction T2 transfers 10 percent of the balance from account A to
account B. It is defined as:
T2: read(A);
temp := A * 0.1;
A := A − temp;
write(A);
read(B);
B := B + temp;
write(B).

 Suppose the current values of accounts A and B are $1000 and $2000,
respectively.
 Suppose also that the two transactions are executed one at a time in the
order T1 followed by T2. This execution sequence appears as follows:

Schedule 1—a serial schedule in which T1 is followed by


T2.

 The final values of accounts A and B, after the execution in this figure
takes place, are $855 and $2145, respectively.


 Similarly, if the transactions are executed one at a time in the order T2


followed by T1, then the corresponding execution sequence is as follows:

Schedule 2—a serial schedule in which T2 is followed by T1.

 Again, as expected, the sum A + B is preserved, and the final values of


accounts A and B are $850 and $2150, respectively.
 The execution sequences just described are called schedules.
 suppose that the two transactions are executed concurrently as follows:

Schedule 3—a concurrent schedule equivalent to schedule 1.

 After this execution takes place, we arrive at the same state as the one in
which the transactions are executed serially in the order T1 followed by T2.
The sum A + B is indeed preserved.

 Not all concurrent executions result in a correct state. To illustrate,


consider the schedule as follows:

Schedule 4—a concurrent schedule.


 After the execution of this schedule, we arrive at a state where the final
values of accounts A and B are $950 and $2100, respectively.
 This final state is an inconsistent state.
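
The final balances quoted for schedules 1, 2 and 4 can be checked by replaying the operations in Python. The three functions below encode the interleavings described above; the starting balances of $1000 and $2000 are the ones used in the notes:

def t1_then_t2():                      # schedule 1: serial, T1 followed by T2
    db = {"A": 1000, "B": 2000}
    a = db["A"]; db["A"] = a - 50              # T1
    b = db["B"]; db["B"] = b + 50
    a = db["A"]; t = a * 0.1; db["A"] = a - t  # T2
    b = db["B"]; db["B"] = b + t
    return db

def t2_then_t1():                      # schedule 2: serial, T2 followed by T1
    db = {"A": 1000, "B": 2000}
    a = db["A"]; t = a * 0.1; db["A"] = a - t  # T2
    b = db["B"]; db["B"] = b + t
    a = db["A"]; db["A"] = a - 50              # T1
    b = db["B"]; db["B"] = b + 50
    return db

def schedule4():                       # the bad interleaving of schedule 4
    db = {"A": 1000, "B": 2000}
    a1 = db["A"]; a1 -= 50                          # T1: read(A), A := A - 50
    a2 = db["A"]; t = a2 * 0.1; db["A"] = a2 - t    # T2: read(A), write(A)
    b2 = db["B"]                                    # T2: read(B)
    db["A"] = a1                                    # T1: write(A) -- overwrites T2's update
    b1 = db["B"]; db["B"] = b1 + 50                 # T1: read(B), write(B)
    db["B"] = b2 + t                                # T2: write(B) -- overwrites T1's update
    return db

print(t1_then_t2())    # {'A': 855.0, 'B': 2145.0}
print(t2_then_t1())    # {'A': 850.0, 'B': 2150.0}
print(schedule4())     # {'A': 950,   'B': 2100.0} -- the sum A + B is no longer 3000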

Serializability

The database system must control concurrent execution of transactions, to


ensure that the database state remains consistent.

Since transactions are programs, it is computationally difficult to


determine exactly what operations a transaction performs and how operations of
various transactions interact.

For this reason, we shall not interpret the type of operations that a
transaction can perform on a data item.

Instead, we consider only two operations:

read and write. We thus assume that, between a read(Q) instruction and a
write(Q) instruction on a data item Q, a transaction may perform an arbitrary
sequence of operations on the copy of Q that is residing in the local buffer of the
transaction.

Thus, the only significant operations of a transaction, from a scheduling point of


view, are its read and write instructions.


T1                      T2
read(A)
write(A)
                        read(A)
                        write(A)
read(B)
write(B)
                        read(B)
                        write(B)

Schedule 3—showing only the read and write instructions.

Conflict Serializability

Let us consider a schedule S in which there are two consecutive instructions Ii


and Ij, of transactions Ti and Tj, respectively (i ≠ j). If Ii and Ij refer to different
data items, then we can swap Ii and Ij without affecting the results of any
instruction in the schedule.

However, if Ii and Ij refer to the same data item Q, then the order of the two
steps may matter.

Since we are dealing with only read and write instructions, there are four cases
that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the
same value of Q is read by Ti and Tj , regardless of the order.

2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the
value of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti
reads the value of Q that is written by Tj. Thus, the order of Ii and Ij
matters.

3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar


to those of the previous case.

4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the


order of these instructions does not affect either Ti or Tj . However, the
value obtained by the next read(Q) instruction of S is affected, since the
result of only the latter of the two write instructions is preserved in the
database. If there is no other write(Q) instruction after Ii and Ij in S, then


the order of Ii and Ij directly affects the final value of Q in the database
state that results from schedule S.
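
The four cases reduce to a simple rule: two instructions of different transactions conflict exactly when they access the same data item and at least one of them is a write. A small Python sketch of that rule (the tuple encoding of an instruction is an assumption of this illustration):

def conflicts(op1, op2):
    # two instructions conflict if they belong to different transactions,
    # access the same data item, and at least one of them is a write
    (t1, act1, item1), (t2, act2, item2) = op1, op2
    return t1 != t2 and item1 == item2 and ("write" in (act1, act2))

print(conflicts(("T1", "write", "A"), ("T2", "read", "A")))   # True  (case 3)
print(conflicts(("T2", "write", "A"), ("T1", "read", "B")))   # False (different data items)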

The write(A) instruction of T1 conflicts with the read(A) instruction of T2.

However, the write(A) instruction of T2 does not conflict with the read(B)
instruction of T1, because the two instructions access different data items.

Since the write(A) instruction of T2 in schedule 3 does not conflict with the
read(B) instruction of T1, we can swap these instructions to generate an
equivalent schedule.

We continue to swap nonconflicting instructions:

• Swap the read(B) instruction of T1 with the read(A) instruction of T2.

• Swap the write(B) instruction of T1 with the write(A) instruction of T2.

• Swap the write(B) instruction of T1 with the read(A) instruction of T2.

T1                      T2
read(A)
write(A)
                        read(A)
read(B)
                        write(A)
write(B)
                        read(B)
                        write(B)

If a schedule S can be transformed into a schedule S′ by a series of swaps of
nonconflicting instructions, we say that S and S′ are conflict equivalent.

We say that a schedule S is conflict serializable if it is conflict equivalent to a


serial schedule. Thus, schedule 3 is conflict serializable, since it is conflict
equivalent to the serial schedule 1.


T1                      T2
read(A)
write(A)
read(B)
write(B)
                        read(A)
                        write(A)
                        read(B)
                        write(B)

Schedule 6—a serial schedule that is equivalent to schedule 3.

T3                      T2
read(Q)
                        write(Q)
write(Q)

Schedule 7.

View Serializability

Consider two schedules S and S′, where the same set of transactions
participates in both schedules. The schedules S and S′ are said to be view
equivalent if three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S,
then transaction Ti must, in schedule S′, also read the initial value of Q.

2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if
that value was produced by a write(Q) operation executed by transaction Tj,
then the read(Q) operation of transaction Ti must, in schedule S′, also read the
value of Q that was produced by the same write(Q) operation of transaction Tj.

3. For each data item Q, the transaction (if any) that performs the final write(Q)
operation in schedule S must perform the final write(Q) operation in schedule S′.
T1                      T2
read(A)
A := A − 50
write(A)
                        read(B)
                        B := B − 10
                        write(B)
read(B)
B := B + 50
                        read(A)
write(B)
                        A := A + 10
                        write(A)

Schedule 8.

The concept of view equivalence leads to the concept of view serializability. We
say that a schedule S is view serializable if it is view equivalent to a serial
schedule.

T3                      T4                      T6
read(Q)
                        write(Q)
write(Q)
                                                write(Q)

Schedule 9—a view-serializable schedule.

Recoverability

We now address the effect of transaction failures during concurrent


execution. If a transaction Ti fails, for whatever reason, we need to undo the
effect of this transaction to ensure the atomicity property of the transaction.

In a system that allows concurrent execution, it is necessary also to ensure
that any transaction Tj that is dependent on Ti (that is, Tj has read data written by
Ti) is also aborted. To achieve this surety, we need to place restrictions on the
type of schedules permitted in the system.


Recoverable Schedules

Most database systems require that all schedules be recoverable. A


recoverable schedule is one where, for each pair of transactions Ti and Tj such
that Tj reads a data item previously written by Ti, the commit operation of Ti
appears before the commit operation of Tj .

Cascadeless Schedules

Even if a schedule is recoverable, to recover correctly from the failure of a


transaction Ti, we may have to roll back several transactions. Such situations
occur if transactions have read data written by Ti. As an illustration, consider the
partial schedule

T8                      T9
read(A)
write(A)
                        read(A)
read(B)

Schedule 10

T10                     T11                     T12
read(A)
read(B)
write(A)
                        read(A)
                        write(A)
                                                read(A)

Schedule 11

Transaction T10 writes a value of A that is read by transaction T11.

Transaction T11 writes a value of A that is read by transaction T12. Suppose that,
at this point, T10 fails. T10 must be rolled back. Since T11 is dependent on T10,
T11 must be rolled back. Since T12 is dependent on T11, T12 must be rolled
back.

This phenomenon, in which a single transaction failure leads to a series of
transaction rollbacks, is called cascading rollback.

Cascading rollback is undesirable, since it leads to the undoing of a significant


amount of work. It is desirable to restrict the schedules to those where cascading
rollbacks cannot occur. Such schedules are called cascadeless schedules.

Formally, a cascadeless schedule is one where, for each pair of transactions Ti
and Tj such that Tj reads a data item previously written by Ti, the commit
operation of Ti appears before the read operation of Tj. It is easy to verify that
every cascadeless schedule is also recoverable.

Implementation of Isolation

There are various concurrency-control schemes that we can use to ensure that,
even when multiple transactions are executed concurrently, only acceptable
schedules are generated, regardless of how the operating-system time-shares
resources (such as CPU time) among the transactions.

As a trivial example of a concurrency-control scheme, consider this scheme: A


transaction acquires a lock on the entire database before it starts and releases
the lock after it has committed. While a transaction holds a lock, no other
transaction is allowed to acquire the lock, and all must therefore wait for the lock
to be released.

As a result of the locking policy, only one transaction can execute at a time.
Therefore, only serial schedules are generated. These are trivially serializable,
and it is easy to verify that they are cascadeless as well.

The goal of concurrency-control schemes is to provide a high degree of


concurrency, while ensuring that all schedules that can be generated are conflict
or view serializable, and are cascadeless.

Transaction Definition in SQL

A data-manipulation language must include a construct for specifying the


set of actions that constitute a transaction.

The SQL standard specifies that a transaction begins implicitly. Transactions are
ended by one of these SQL statements:

• Commit work commits the current transaction and begins a new one.


• Rollback work causes the current transaction to abort.

The keyword work is optional in both statements. If a program
terminates without either of these commands, the updates are either committed
or rolled back; which of the two happens is not specified by the standard and
depends on the implementation.

The standard also specifies that the system must ensure both
serializability and freedom from cascading rollback.

The definition of serializability used by the standard is that a schedule must have
the same effect as would some serial schedule. Thus, conflict and view
serializability are both acceptable.

The SQL-92 standard also allows a transaction to specify that it may be


executed in a manner that causes it to become nonserializable with respect to
other transactions.
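As a minimal illustration of these transaction boundaries, the sketch below uses Python's sqlite3 module to run two updates as one unit of work and either commit or roll back. The database file, table and account numbers are assumptions for the example, not part of the standard.

import sqlite3

conn = sqlite3.connect("bank.db")          # hypothetical database file
cur = conn.cursor()
try:
    # both updates belong to one (implicitly started) transaction
    cur.execute("UPDATE account SET balance = balance - 50 WHERE account_no = 'A-101'")
    cur.execute("UPDATE account SET balance = balance + 50 WHERE account_no = 'A-102'")
    conn.commit()      # corresponds to COMMIT WORK
except Exception:
    conn.rollback()    # corresponds to ROLLBACK WORK
finally:
    conn.close()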

Testing for Serializability

When designing concurrency control schemes, we must show that


schedules generated by the scheme are serializable. To do that, we must first
understand how to determine, given a particular schedule S, whether the
schedule is serializable.

We now present a simple and efficient method for determining conflict


serializability of a schedule.

Consider a schedule S. We construct a directed graph, called a precedence


graph, from S.

This graph consists of a pair G = (V, E), where V is a set of vertices and E is
a set of edges. The set of vertices consists of all the transactions participating in
the schedule.

The set of edges consists of all edges Ti →Tj for which one of three
conditions holds:

1. Ti executes write(Q) before Tj executes read(Q).

2. Ti executes read(Q) before Tj executes write(Q).

3. Ti executes write(Q) before Tj executes write(Q).

(a) T1 → T2                (b) T2 → T1

Precedence graph for (a) schedule 1 and (b) schedule 2.

If an edge Ti → Tj exists in the precedence graph, then, in any serial schedule
S′ equivalent to S, Ti must appear before Tj.

For example, the precedence graph for schedule 4 contains the edge T1 → T2,
because T1 executes read(A) before T2 executes write(A). It also
contains the edge T2 → T1, because T2 executes read(B) before T1 executes
write(B).

If the precedence graph for S has a cycle, then schedule S is not conflict
serializable.

If the graph contains no cycles, then the schedule S is conflict serializable.

A serializability order of the transactions can be obtained through


topological sorting, which determines a linear order consistent with the partial
order of the precedence graph. There are, in general, several possible linear
orders that can be obtained through a topological sorting.

Thus, to test for conflict serializability, we need to construct the


precedence graph and to invoke a cycle-detection algorithm.

Cycle-detection algorithms can be found in standard textbooks on
algorithms. Cycle-detection algorithms, such as those based on depth-first
search, require on the order of n^2 operations, where n is the number of vertices
in the graph (that is, the number of transactions). Thus, we have a practical
scheme for determining conflict serializability.
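A small sketch of this test in Python is given below. A schedule is assumed to be a list of (transaction, operation, data item) triples; the representation and names are illustrative, not from the textbook.

from collections import defaultdict

def precedence_graph(schedule):
    # schedule: list of (txn, op, item) triples with op in {"read", "write"}
    edges = defaultdict(set)
    for i, (ti, op_i, q_i) in enumerate(schedule):
        for (tj, op_j, q_j) in schedule[i + 1:]:
            if ti != tj and q_i == q_j and "write" in (op_i, op_j):
                edges[ti].add(tj)          # edge Ti -> Tj for each conflicting pair
    return edges

def has_cycle(edges):
    WHITE, GREY, BLACK = 0, 1, 2           # depth-first search colours
    colour = defaultdict(int)
    def visit(u):
        colour[u] = GREY
        for v in edges[u]:
            if colour[v] == GREY or (colour[v] == WHITE and visit(v)):
                return True
        colour[u] = BLACK
        return False
    return any(colour[u] == WHITE and visit(u) for u in list(edges))

s = [("T1", "read", "A"), ("T2", "write", "A"), ("T2", "read", "B"), ("T1", "write", "B")]
print(has_cycle(precedence_graph(s)))      # True: the graph has a cycle, so not conflict serializable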

T1 → T2 and T2 → T1 (a cycle)

Precedence graph for schedule 4.


(a) A precedence graph with edges Ti → Tj, Ti → Tk, Tj → Tm, and Tk → Tm;
(b) the linear order Ti, Tj, Tk, Tm; (c) the linear order Ti, Tk, Tj, Tm.

Illustration of topological sorting.

Testing for view serializability is rather complicated. In fact, it has been


shown that the problem of testing for view serializability is itself NP-complete.
Thus, almost certainly there exists no efficient algorithm to test for view
serializability.

Concurrency Control

 Lock-Based Protocols
 Graph-Based Protocols
 Timestamp-Based Protocols
 Multiple Granularity
 Multiversion Protocols
 Deadlock Handling

Lock-Based Protocols:

• One way to ensure serializability is to require that data items be
accessed in a mutually exclusive manner; that is, while one transaction is
accessing a data item, no other transaction can modify it.
• A lock is the most common mechanism used to implement this
requirement.
Lock:

• A lock is a mechanism to control concurrent access to a data item.
• Lock requests are made to the concurrency-control manager.
• A transaction can proceed only after its request is granted. Data items can be
locked in two modes:

1. exclusive mode (X): the data item can be both read and
written. An X-lock is requested using the lock-X(A) instruction.

2. shared mode (S): the data item can only be read. An S-lock is requested
using the lock-S(A) instruction. Locks can be released using the U-lock(A) instruction.

Locking protocol:

A set of rules followed by all transactions while requesting and releasing
locks. Locking protocols restrict the set of possible schedules: they ensure
serializable schedules by delaying transactions that might violate serializability.

A lock-compatibility matrix tells whether two locks are compatible or not. Any
number of transactions can hold shared locks on a data item. If any transaction
holds an exclusive lock on a data item, no other transaction may hold any lock on
that item.

Locking Rules/Protocol:

• A transaction may be granted a lock on an item if the requested


lock is compatible with locks already held on the item by other transactions.

• If a lock cannot be granted, the requesting transaction is made to


wait till all incompatible locks held by other transactions have been released. The
lock is then granted.

                    Lock 2 requested: S      Lock 2 requested: X
Lock 1 held: S      compatible               not compatible
Lock 1 held: X      not compatible           not compatible

Lock-compatibility matrix.
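A minimal sketch of the compatibility check and the grant rule above, in Python (the representation is purely illustrative):

# True means the requested mode is compatible with a mode already held
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

def can_grant(requested_mode, held_modes):
    # a lock is granted only if it is compatible with every lock
    # currently held on the item by other transactions
    return all(COMPATIBLE[(requested_mode, held)] for held in held_modes)

print(can_grant("S", ["S", "S"]))   # True: any number of shared locks may coexist
print(can_grant("X", ["S"]))        # False: must wait until the S-lock is released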

248 | P a g e Department of Information Technology


Database Management Systems

Pitfalls of Lock-Based Protocols:

• Unlocking too early can lead to non-serializable schedules; unlocking too late
can lead to deadlocks.

Example

Transaction T1 transfers $50 from account B to account A. Transaction T2 displays
the total amount of money in accounts A and B, that is, the sum A + B.

Early unlocking can cause incorrect results and non-serializable schedules: if A and B
get updated between the reads of A and B, the displayed sum is wrong.

e.g., A = $100, B = $200; display A + B shows $250 instead of the correct total $300.

T1                         T2
1. X-lock(B)
2. read B
3. B := B - 50
4. write B
5. U-lock(B)
                           6. S-lock(A)
                           7. read A
                           8. U-lock(A)
                           9. S-lock(B)
                           10. read B
                           11. U-lock(B)
                           12. display A + B
13. X-lock(A)
14. read A
15. A := A + 50
16. write A
17. U-lock(A)


Late unlocking causes deadlocks. Neither T1 nor T2 can make progress: executing
lock-S(B) causes T2 to wait for T1 to release its lock on B, and executing lock-X(A)
causes T1 to wait for T2 to release its lock on A. To handle such a deadlock,
one of T1 or T2 must be rolled back and its locks released.

T1                         T2
1. X-lock(B)
2. read(B)
3. B := B - 50
4. write(B)
                           5. S-lock(A)
                           6. read(A)
                           7. S-lock(B)
8. X-lock(A)

Two-Phase Locking Protocol:

A locking protocol that ensures conflict-serializable schedules. It works in two


phases:

o Phase 1 (Growing Phase): the transaction may obtain locks, but
may not release locks.

o Phase 2 (Shrinking Phase): the transaction may release locks, but
may not obtain locks.

• When the first lock is released, the transaction moves from phase 1 to
phase 2.
• Properties of the two-phase locking protocol: it ensures serializability. It can
be shown that the transactions can be serialized in the order of their lock
points (i.e., the point where a transaction acquired its final lock).
• It does not ensure freedom from deadlocks, and cascading roll-back is possible.
Modifications of the two-phase locking protocol:

Strict two-phase locking

* A transaction must hold all its exclusive locks till it commits/aborts

* Avoids cascading roll-back

Rigorous two-phase locking


250 | P a g e Department of Information Technology
Database Management Systems

All locks are held till commit/abort. Transactions can be serialized in the
order in which they commit.
 Refine the two-phase locking protocol with lock conversions:
Phase 1: can acquire a lock-S on an item, can acquire a lock-X on an item, can
convert a lock-S to a lock-X (upgrade)

Phase 2: can release a lock-S, can release a lock-X, can convert a lock-X to a
lock-S (downgrade)

* Ensures serializability; but still relies on the programmer to insert the various
locking instructions.

* Strict and rigorous two-phase locking (with lock conversions) are used
extensively in DBMSs.
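The sketch below checks whether one transaction's sequence of lock and unlock actions obeys the two-phase rule (all acquisitions before the first release). The event representation is an assumption made for illustration.

def is_two_phase(actions):
    # actions: list of ("lock", item) / ("unlock", item) in execution order
    releasing = False
    for kind, _item in actions:
        if kind == "unlock":
            releasing = True        # the transaction enters its shrinking phase
        elif releasing:             # acquiring a lock after an unlock violates 2PL
            return False
    return True

print(is_two_phase([("lock", "A"), ("lock", "B"), ("unlock", "A"), ("unlock", "B")]))  # True
print(is_two_phase([("lock", "A"), ("unlock", "A"), ("lock", "B")]))                   # False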

Automatic Acquisition of Locks:

 A transaction Ti issues the standard read/write instruction without explicit


locking calls (by the programmer).
 The operation read(D) is processed as:

if Ti has a lock on D then
      read(D)
else
      if necessary wait until no other transaction has a lock-X on D;
      grant Ti a lock-S on D;
      read(D);
end

 The operation write(D) is processed as:

if Ti has a lock-X on D then
      write(D)
else
      if necessary wait until no other transaction has any lock on D;
      if Ti has a lock-S on D then
            upgrade lock on D to lock-X
      else
            grant Ti a lock-X on D;
      end
      write(D);
end

All locks are released after commit or abort

Implementation of Locking:

 A lock manager can be implemented as a separate process to which


transactions send lock and unlock requests.
 The lock manager replies to a lock request by sending a lock grant
message (or a message asking the transaction to roll back, in case of a
deadlock).
 The requesting transaction waits until its request is answered.
 The lock manager maintains a data structure called a lock table to record
granted locks and pending requests.

Lock table:

 Implemented as in-memory hash table indexed on the data item being


locked. Black rectangles indicate granted locks.
 White rectangles indicate waiting requests. Records also the type of lock
granted/requested.
 Processing of requests:
 New request is added to the end of the queue of requests for the
data item, and granted if it is compatible with all earlier locks.
 Unlock requests result in the request being deleted, and later
requests are checked to see if they can now be granted.
 If transaction aborts, all waiting or grantedrequests of the
transaction are deleted. Index on transaction to implement this
efficiently.
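A much-simplified in-memory lock table along these lines might look as follows. This is only a sketch under assumed names; a real lock manager also manages waiting queues, upgrades and deadlock handling.

from collections import defaultdict

class LockTable:
    def __init__(self):
        # data item -> list of (transaction, mode, granted?) in request order
        self.table = defaultdict(list)

    def request(self, txn, item, mode):
        held = [m for (_t, m, granted) in self.table[item] if granted]
        compatible = (not held) or (mode == "S" and all(m == "S" for m in held))
        self.table[item].append((txn, mode, compatible))
        return compatible                 # False means the transaction must wait

    def release(self, txn, item):
        self.table[item] = [e for e in self.table[item] if e[0] != txn]
        # a real implementation would now re-check waiting requests on this item

lt = LockTable()
print(lt.request("T1", "A", "X"))   # True  (granted)
print(lt.request("T2", "A", "S"))   # False (must wait for T1's X-lock)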

Graph-Based Protocols:

 Impose a partial ordering → on the set D = {d1, d2, ..., dh} of all data items.
 If di → dj, then any transaction accessing both di and dj must access di
before accessing dj.
 This implies that the set D may now be viewed as a directed acyclic graph,
called a database graph. Graph-based protocols are an alternative to two-phase
locking and ensure conflict serializability.

Tree protocol: A simple kind of graph-based protocol, which works as follows:

 Only exclusive locks lock-X are allowed.


 The first lock by Ti may be on any data item.
 Subsequently, a data item Q can be locked by Ti only if the parent of
Q is currently locked by Ti.
 Data items may be unlocked at any time. A data item that has been
locked and unlocked by Ti cannot subsequently be relocked by Ti.
252 | P a g e Department of Information Technology
Database Management Systems

Example: The following 4 transactions follow the tree protocol on the database
graph below.

 T10: lock-X(B); lock-X(E); lock-X(D); unlock(B); unlock(E); lock-


X(G); unlock(D); unlock(G);

 T11: lock-X(D); lock-X(H); unlock(D); unlock(H);


 T12: lock-X(B); lock-X(E); unlock(E); unlock(B);
 T13: lock-X(D); lock-X(H); unlock(D); unlock(H);

 The tree protocol ensures conflict serializability and ensures freedom from
deadlock. However, it does not ensure recoverability or cascadelessness, so the
abort of a transaction might lead to cascading rollbacks. Unlocking may occur
earlier in the tree-locking protocol than in the two-phase locking protocol.
 This gives shorter waiting times and an increase in concurrency. However, in the
tree protocol a transaction may have to lock data items that it does not
access, giving increased locking overhead, additional waiting time, and a potential
decrease in concurrency.
 Schedules not possible under two-phase locking are possible under the tree
protocol, and vice versa.

TIMESTAMP BASED PROTOCOL:

 The locking protocols described so far determine the order between every
pair of conflicting transactions at execution time, through the first lock that
both members of the pair request that involves incompatible modes.
 Another method for determining the serializability order is to select an
ordering among transactions in advance. The most common method for
doing so is to use a timestamp-ordering scheme.

TIMESTAMPS:

 With each transaction Ti we associate a unique fixed timestamp, denoted TS(Ti).
This timestamp is assigned by the database system before the
transaction Ti starts execution.
 If a transaction Ti has been assigned timestamp TS(Ti) and a new
transaction Tj enters the system, then TS(Ti) < TS(Tj). There are two
simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a
transaction's timestamp is equal to the value of the clock when the
transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp
has been assigned; that is, a transaction's timestamp is equal to the
value of the counter when the transaction enters the system.

To implement this scheme we associate with each data item Q two timestamp
values.


 W-timestamp(Q) denotes the largest timestamp of any transaction that


executed write(Q) successfully.
 R-timestamp(Q) denotes the largest timestamp of any transaction that
executed read(Q) successfully.

THE TIMESTAMP-ORDERING PROTOCOL:

The timestamp-ordering protocol ensures that any conflicting read and write
operations are executed in timestamp order. This protocol operates as follows:

1. Suppose that transaction Ti issues read(Q).

• If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was
already overwritten. Hence, the read operation is rejected and Ti is rolled
back.

• If TS(Ti) >= W-timestamp(Q), then the read operation is executed and R-
timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).

2. Suppose that transaction Ti issues write(Q).

• If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was
needed previously, and the system assumed that the value would never
be produced. Hence, the system rejects the write operation and rolls Ti back.

• If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete
value of Q. Hence, the system rejects this write operation and rolls Ti
back.

• Otherwise, the system executes the write operation and sets W-
timestamp(Q) to TS(Ti).
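A small sketch of these two rules in Python. The names ts_ti, R_ts and W_ts are illustrative stand-ins for TS(Ti), R-timestamp(Q) and W-timestamp(Q); the dictionary representation of a data item is an assumption for the example.

def tso_read(ts_ti, item):
    if ts_ti < item["W_ts"]:
        return "reject: roll back Ti"       # Ti would read an already-overwritten value
    item["R_ts"] = max(item["R_ts"], ts_ti)
    return "read executed"

def tso_write(ts_ti, item):
    if ts_ti < item["R_ts"] or ts_ti < item["W_ts"]:
        return "reject: roll back Ti"       # the write arrives too late
    item["W_ts"] = ts_ti
    return "write executed"

q = {"R_ts": 0, "W_ts": 0}
print(tso_read(5, q))     # read executed, R-timestamp(Q) becomes 5
print(tso_write(3, q))    # rejected: TS(Ti) = 3 < R-timestamp(Q) = 5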

The protocol can generate schedules that are not recoverable. However, it
can be extended to make the schedules recoverable, in one of several ways:

• Recoverability and cascadelessness can be ensured by performing all
writes together at the end of the transaction.
• Recoverability and cascadelessness can also be guaranteed by using a
limited form of locking, whereby reads of uncommitted items are
postponed until the transaction that updated the item commits.
• Recoverability alone can be ensured by tracking uncommitted writes and
allowing a transaction Ti to commit only after the commit of any
transaction that wrote a value that Ti read.

Example: The following schedule is possible under the timestamp ordering


protocol. Since TS(T14) < TS(T15), the schedule must be conflict equivalent to
schedule <T14,T15>

T14                        T15
read(B)
                           read(B)
                           B := B - 50
                           write(B)
read(A)
                           read(A)
display(A + B)
                           A := A + 50
                           write(A)
                           display(A + B)

Thomas’ Write rule:

 Let us consider schedule 4 and apply the timestamp-ordering
protocol. Since T16 starts before T17, we shall assume that
TS(T16) < TS(T17).
 The read(Q) operation of T16 succeeds, as does the write(Q)
operation of T17. When T16 attempts its write(Q) operation, we find that
TS(T16) < W-timestamp(Q), since W-timestamp(Q) = TS(T17).
 Thus the write(Q) by T16 is rejected and transaction T16 must be rolled
back.
 Although the rollback of T16 is required by the timestamp-ordering
protocol, it is unnecessary. Since T17 has already written Q, the value that
T16 is attempting to write is one that will never need to be read.
 Any transaction Ti with TS(Ti) < TS(T17) that attempts a read(Q) will be
rolled back, since TS(Ti) < W-timestamp(Q).

The modification to the timestamp-ordering protocol, called Thomas'
write rule, is this. Suppose that transaction Ti issues write(Q):

1. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti
is producing was previously needed, and it had been assumed that the
value would never be produced. Hence, the system rejects the write
operation and rolls Ti back.

2. If TS(Ti) < W-timestamp(Q), then Ti is attempting to
write an obsolete value of Q. Hence, this write operation can be ignored.

3. Otherwise, the system executes the write operation
and sets W-timestamp(Q) to TS(Ti).


Under Thomas' write rule, the write(Q) operation of T16 would be
ignored. The result is a schedule that is view equivalent to the serial
schedule <T16, T17>.
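The only change relative to the plain timestamp-ordering write test is the middle branch, which ignores the obsolete write instead of rolling the transaction back. A sketch, using the same illustrative names as before:

def thomas_write(ts_ti, item):
    if ts_ti < item["R_ts"]:
        return "reject: roll back Ti"   # the value was already needed by a later reader
    if ts_ti < item["W_ts"]:
        return "write ignored"          # obsolete write: skipped, Ti continues
    item["W_ts"] = ts_ti
    return "write executed"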

VALIDATION-BASED PROTOCOLS:

 In cases where a majority of transactions are read-only
transactions, the rate of conflicts among transactions may be low.
 Thus, many of these transactions, if executed without the supervision of a
concurrency-control scheme, would nevertheless leave the system in a
consistent state.
 A concurrency-control scheme imposes overhead of code execution and
possible delay of transactions. In the validation-based protocol we assume that
each transaction Ti executes in two or three different phases, in the following order.

1. Read phase: During this phase the system executes transaction Ti. It
reads the values of the various data items and stores them in variables local to Ti.
It performs all write operations on temporary local variables, without
updating the actual database.

2. Validation phase: Transaction Ti performs a validation test
to determine whether it can copy to the database the temporary local variables
that hold the results of its write operations, without causing a violation of
serializability.

3. Write phase: If transaction Ti succeeds in validation (step 2), then the
system applies the actual updates to the database. Otherwise, the
system rolls back Ti.

To perform the validation test, we need to know when the various phases of
transaction Ti took place. We therefore associate three different timestamps with
transaction Ti:

1. Start(Ti), the time when Ti started its execution.

2. Validation(Ti), the time when Ti finished its read phase and started
its validation phase.

3. Finish(Ti), the time when Ti finished its write phase.

We determine the serializability order using the value TS(Ti) = Validation(Ti):
if TS(Tj) < TS(Tk), then any
produced schedule must be equivalent to a serial schedule in which transaction
Tj appears before transaction Tk. The validation test for transaction Tj requires
that, for all transactions Ti with TS(Ti) < TS(Tj), one of the following two conditions
must hold:


1. Finish(Ti) < Start(Tj). Since Ti completes its execution before Tj starts,
the serializability order is indeed maintained.

2. The set of data items written by Ti does not intersect with the set of data
items read by Tj, and Ti completes its write phase before Tj starts its validation
phase (Finish(Ti) < Validation(Tj)). This condition ensures that the writes of Ti and Tj
do not overlap.

The validation scheme is called the optimistic concurrency-control scheme,
since transactions execute optimistically, assuming they will be able to finish
execution and validate at the end. In contrast, locking and timestamp ordering
are pessimistic in that they force a wait or a rollback whenever a conflict is
detected, even though there is a chance that the schedule may be conflict
serializable.
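A sketch of the validation test for a transaction Tj against every earlier-validating Ti. Each transaction is assumed, for illustration only, to be a dictionary recording its start, validation and finish times and its read and write sets (Python sets).

def validate(tj, earlier):
    # tj and each ti carry: "start", "validation", "finish", "read_set", "write_set"
    for ti in earlier:
        if ti["finish"] < tj["start"]:
            continue                                # condition 1: Ti finished before Tj started
        overlap_free = not (ti["write_set"] & tj["read_set"])
        if overlap_free and ti["finish"] < tj["validation"]:
            continue                                # condition 2 holds
        return False                                # validation fails: roll back Tj
    return True                                     # Tj may enter its write phase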

Multiple Granularity

 Instead of locks on individual data items, sometimes it is advantageous to


group several data items and to treat them as one individual
synchronization unit (e.g. if a transaction accesses the entire DB).
 Define a hierarchy of data granularities of different size, where the small
granularities are nested within larger ones.
 Can be represented graphically as a tree. When a transaction locks a node
in the tree explicitly, it implicitly locks all the node's descendants in the
same mode.

Example: graphical representation of a hierarchy of granularities. The highest
level is the entire database. The levels below are of type area, file and record, in
that order.

Granularity of locking (the level in the tree where locking is done): fine granularity
(lower in the tree) gives high concurrency but high locking overhead; coarse granularity
(higher in the tree) gives low locking overhead but low concurrency.

Multiversion Protocols:

Concurrency control protocols studied thus far ensure serializability by either


delaying an operation or aborting the transaction.

Multiversion schemes keep old versions of data items to increase concurrency.

Each successful write(Q) creates a new version of Q. Timestamps are used to
label versions. When a read(Q) operation is issued, an appropriate version
of Q is selected based on the timestamp of the transaction, so reads never have to
wait: an appropriate version is always available. There are two types of multiversion protocols:


• Multiversion timestamp ordering


• Multiversion two-phase locking

Multiversion Timestamp Ordering:

Each data item Q has a sequence of versions <Q1, Q2, ..., Qm>. Each version
Qk contains three data fields:

 Content – the value of version Qk.
 W-timestamp(Qk) – timestamp of the transaction that created (wrote)
version Qk.
 R-timestamp(Qk) – largest timestamp of any transaction that successfully read
version Qk.
 When a transaction Ti creates a new version Qk of Q, the W-timestamp and
R-timestamp of Qk are initialized to TS(Ti).
 The R-timestamp of Qk is updated whenever a transaction Tj reads Qk and
TS(Tj) > R-timestamp(Qk).
The following multiversion timestamp-ordering protocol ensures serializability.

1. If transaction Ti issues a read(Q), then the value returned is the content of


version Qk, which is the version of Q with the largest write timestamp less than
or equal to TS(Ti)

2. If transaction Ti issues a write(Q):

– If TS(Ti) < R-timestamp(Qk), then transaction Ti is rolled back.

– Otherwise, if TS(Ti) = W-timestamp(Qk), the contents of Qk are overwritten.

– Otherwise a new version of Q is created.
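A sketch of the read rule: among all versions of Q, return the one with the largest write timestamp not exceeding TS(Ti). The list-of-dictionaries representation of the versions is assumed for illustration.

def mv_read(ts_ti, versions):
    # versions: list of dicts {"value": ..., "W_ts": ..., "R_ts": ...}
    qk = max((v for v in versions if v["W_ts"] <= ts_ti), key=lambda v: v["W_ts"])
    qk["R_ts"] = max(qk["R_ts"], ts_ti)   # reads never wait, they only update the R-timestamp
    return qk["value"]

versions_of_q = [{"value": 10, "W_ts": 0, "R_ts": 0}, {"value": 42, "W_ts": 7, "R_ts": 7}]
print(mv_read(5, versions_of_q))          # 10: the version written at time 0 is the latest one <= 5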

Properties of the multiversion timestamp-ordering protocol: reads always
succeed and never have to wait. A transaction reads the most recent version that
comes before it in time. In a typical DBMS, reading is a more frequent operation
than writing, hence this advantage may be significant.

Writes: a transaction is aborted if it is "too late" in doing a
write; a write by Ti is rejected if another transaction Tj that should read Ti's
write has already read a version created by a transaction older than Ti.

Disadvantages

 Reading of a data item also requires the updating of the R-timestamp,
resulting in two disk accesses rather than one.

 The conflicts between transactions are resolved through rollbacks rather


than through waits.


DEADLOCK HANDLING:

 A system is in a deadlock state if there exists a set of transactions such
that every transaction in the set is waiting for another transaction in the
set.
 More precisely, there exists a set of waiting transactions {T0, T1, ..., Tn}
such that T0 is waiting for a data item that T1 holds, T1 is waiting for
a data item that T2 holds, ..., Tn-1 is waiting for a data item that Tn holds,
and Tn is waiting for a data item that T0 holds.
 There are two principal methods for dealing with the deadlock problem:
1. Deadlock Prevention

2. Deadlock detection and deadlock recovery.

Consider the following two transactions:

T1: write(X)               T2: write(Y)
    write(Y)                   write(X)

 Schedule with deadlock

T1                         T2
lock-X on X
write(X)
                           lock-X on Y
                           write(Y)
                           wait for lock-X on X
wait for lock-X on Y

DEADLOCK PREVENTION:


 There are two approaches to deadlock prevention. One approach ensures
that no cyclic waits can occur, by ordering the requests for locks or by
requiring all locks to be acquired together.
 The other approach is closer to deadlock recovery, and performs
transaction rollback instead of waiting for a lock, whenever the wait could
potentially result in a deadlock.
 The first approach requires that each transaction lock all its data items
before it begins execution. There are two main disadvantages to this
protocol:
1. It is often hard to predict, before the transaction begins, what
data items need to be locked.

2. Data-item utilization may be very low, since many of the data
items may be locked but unused for a long time.

 The second approach for preventing deadlocks is to use preemption and
transaction rollbacks.
 In preemption, when a transaction T2 requests a lock that transaction T1
holds, the lock granted to T1 may be preempted by rolling back T1 and
granting the lock to T2.
 Two different deadlock-prevention schemes using timestamps have been
proposed:
1. The wait-die scheme is a nonpreemptive technique. When transaction
Ti requests a data item currently held by Tj, Ti is allowed to wait only if it has a
timestamp smaller than that of Tj; otherwise Ti is rolled back (dies).

2. The wound-wait scheme is a preemptive technique. It is a counterpart
to the wait-die scheme. When transaction Ti requests a data item currently held by
Tj, Ti is allowed to wait only if it has a timestamp larger than that of Tj; otherwise
Tj is rolled back (wounded by Ti).

Whenever the system rolls back transactions, it is important to ensure
that there is no starvation; that is, no transaction gets rolled back
repeatedly and is never allowed to make progress.
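A compact sketch of the two decisions, with smaller timestamps meaning older transactions. The function and parameter names are purely illustrative.

def wait_die(ts_requester, ts_holder):
    # non-preemptive: only an older requester is allowed to wait
    return "wait" if ts_requester < ts_holder else "roll back requester"

def wound_wait(ts_requester, ts_holder):
    # preemptive: an older requester wounds (rolls back) the younger holder
    return "roll back holder" if ts_requester < ts_holder else "wait"

print(wait_die(1, 5))     # wait             (older transaction waits for the younger)
print(wound_wait(1, 5))   # roll back holder (older transaction preempts the younger)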

TIMEOUT-BASED SCHEMES:

 Another simple approach to deadlock handling is based on lock
timeouts. In this approach, a transaction that has requested a lock waits
for at most a specified amount of time.
 The timeout scheme is particularly easy to implement, and works well if
transactions are short and if long waits are likely to be due to deadlocks.
 Too long a wait results in unnecessary delays once a deadlock has
occurred.
 Too short a wait results in transaction rollback even when there is no
deadlock, leading to wasted resources. Starvation is also a possibility
with this scheme. Hence, the timeout-based scheme has limited applicability.

DEADLOCK DETECTION AND RECOVERY:


When a deadlock is detected, the system must recover from the deadlock.
The most common solution is to roll back one or more transactions to break the
deadlock. Three actions are required:


1. Selection of a victim: Select the transaction(s) to roll back that will incur
minimum cost.

2. Rollback: Determine how far to roll back the transaction. Total rollback: abort the
transaction and then restart it. It is more effective to roll back the transaction only
as far as necessary to break the deadlock.

3. Check for starvation: starvation happens if the same transaction is always chosen
as victim. Include the number of rollbacks in the cost factor to avoid starvation.

DEADLOCK DETECTION:

Deadlocks can be described precisely in terms of a directed graph called a
wait-for graph. This graph consists of a pair G = (V, E), where V is a set of vertices
and E is a set of edges. The set of vertices consists of all the transactions in the
system.

Each element in the set E of edges is an ordered pair Ti → Tj. If Ti → Tj is in
E, then there is a directed edge from transaction Ti to Tj, implying that transaction Ti
is waiting for transaction Tj to release a data item that it needs. To illustrate these
concepts, consider the wait-for graph in which:

• Transaction T25 is waiting for transaction T26 and T27.


• Transaction T27 is waiting for transaction T26.
• Transaction T26 is waiting for transaction T28.

RECOVERY FROM DEADLOCK:

When a detection algorithm determines that a deadlock exists, the system
must recover from the deadlock. The most common solution is to roll back one or
more transactions to break the deadlock.

1. Selection of a victim: Given a set of deadlocked transactions, we must
determine which transaction to roll back to break the deadlock. We should roll
back those transactions that will incur the minimum cost.

2. Rollback: Once we have decided that a particular transaction must be
rolled back, we must determine how far this transaction should be rolled back. The
simplest solution is a total rollback. However, it is more effective to roll back the
transaction only as far as necessary to break the deadlock. Such partial rollback
requires the system to maintain additional information about the state of all the
running transactions.

3. Starvation: In a system where the selection of victims is based primarily on cost
factors, it may happen that the same transaction is always picked as a victim. As a
result, this transaction never completes its designated task, and thus there is
starvation. The most common solution is to include the number of rollbacks in
the cost factor.

INSERT AND DELETE OPERATIONS:

 Until now we have restricted attention to transactions that access data items
already in the database. Some transactions require not only access to existing
data items, but also the ability to create new data items.

 Others require the ability to delete data items. To examine how such
transactions affect concurrency control, we introduce these additional
operations:

 delete(Q) deletes data item Q from the database.

 insert(Q) inserts a new data item Q into the database and assigns Q
an initial value.

 An attempt by a transaction Ti to perform a read(Q) operation after Q
has been deleted results in a logical error in Ti.

 Likewise, an attempt by a transaction Ti to perform a read(Q)
operation before Q has been inserted results in a logical error in Ti. It
is also a logical error to attempt to delete a nonexistent data item.

DELETION:

The presence of delete instructions affects concurrency
control, and we must decide when a delete instruction conflicts with
another instruction. Let Ii and Ij be instructions of Ti and Tj, respectively,
that appear in schedule S in consecutive order, and let Ii = delete(Q).

• Ij = read(Q): Ii and Ij conflict. If Ii comes before Ij, Tj will have a
logical error. If Ij comes before Ii, Tj can execute the read
operation successfully.
• Ij = write(Q): Ii and Ij conflict. If Ii comes before Ij, Tj will have a
logical error. If Ij comes before Ii, Tj can execute the write
operation successfully.


• Ij = delete(Q): Ii and Ij conflict. If Ii comes before Ij, Tj will have a
logical error. If Ij comes before Ii, Ti will have a logical error.
• Ij = insert(Q): Ii and Ij conflict. Suppose that data item Q did not
exist prior to the execution of Ii and Ij. Then, if Ii comes before Ij, a
logical error results for Ti. If Ij comes before Ii, then no logical error
results.

INSERTION:

An insert(Q) operation conflicts with a delete(Q)
operation. Similarly, insert(Q) conflicts with a read(Q) operation or a write(Q)
operation: no read or write can be performed on a data item before it exists. Since
an insert(Q) assigns a value to data item Q, an insert is treated similarly to a write
for concurrency-control purposes:

• Under the two-phase locking protocol, if Ti performs an insert(Q) operation,
Ti is given an exclusive lock on the newly created data item Q.

• Under the timestamp-ordering protocol, if Ti performs an insert(Q)
operation, the values R-timestamp(Q) and W-timestamp(Q) are set to
TS(Ti).

THE PHANTOM PHENOMENON:

Consider the transaction T29 that executes the following SQL query on the
bank database:
Select sum(balance)

From account

Where branch_name=’perryridge’

Transaction T29 requires access to all tuples of the account relation
pertaining to the Perryridge branch.

The major disadvantage of locking a data item corresponding to the whole
relation is the low degree of concurrency: two transactions that insert different
tuples into the relation are prevented from executing concurrently. A better
solution is the index-locking technique.

The index-locking protocol takes advantage of the availability of indices
on a relation, by turning instances of the phantom phenomenon into conflicts on
locks on index leaf nodes. The protocol operates as follows. Every relation must
have at least one index.


• A transaction Ti can access tuples of a relation only after first finding them
through one or more of the indices on the relation.
• A transaction Ti that performs a lookup must acquire a shared lock on all
the index leaf nodes that it accesses.
• A transaction Ti may not insert, delete, or update a tuple ti in a relation r
without updating all indices on r.
• The rules of the two-phase locking protocol must be observed.


RECOVERY SYSTEM

Recovery system:

Recovering the database from a failure or crash is called crash
recovery.

Failure Classification

There are various types of failure that may occur in a system. They are:

 Transaction failure :

1. Logical errors:

The Transaction cannot complete due to some internal error condition

2. System errors:

The database system must terminate an active transaction due to an


error condition.

(e.g., deadlock)

3. System crash:

A power failure or other hardware or software failure causes the system to


crash.

Fail-stop assumption:

Non-volatile storage contents are assumed not to be corrupted by a system
crash. Database systems have numerous integrity checks to prevent corruption
of disk data.

Disk failure:

A head crash or similar disk failure destroys all or part of disk storage. Destruction
is assumed to be detectable: disk drives use checksums to detect failures.

Recovery Algorithms


Recovery algorithms are techniques to ensure database consistency and


transaction atomicity and durability despite failures.

Recovery algorithms have two parts

1. Actions taken during normal transaction processing to ensure enough


information exists to recover from failures

2. Actions taken after a failure to recover the database contents to a state that
ensures atomicity, consistency and durability.

Storage Structure :

The various data items in the database may be stored and accessed in a
number of different storage media.

Storage types

The types are :

 Volatile : It does not survive system crashes.


eg: main memory, cache memory

 Non volatile storage: It survives system crashes


eg: disk, tape, flash memory, non-volatile (battery backed

up) RAM.

 Stable storage: A mythical form of storage that survives all failures


approximated by maintaining multiple copies on distinct nonvolatile
media.

Stable Storage Implementation

• Maintain multiple copies of each block on separate disks;
copies can be kept at remote sites to protect against disasters such as fire or flooding.

• Failure during data transfer can still result in inconsistent


copies:

Block transfer can result in

 Successful completion :
Information arrived safely at its destination

 Partial failure:
Destination block has incorrect information

 Total failure:
• Destination block was never updated .
• Protecting storage media from failure during data transfer (one
solution):

Execute the output operation as follows (assuming two copies of each block):

• Write the information onto the first physical block.

• When the first write successfully completes, write the same information
onto the second physical block.
• The output is completed only after the second write successfully
completes.

Copies of a block may differ due to a failure that occurred during an output operation.

To recover from failure:

First find inconsistent blocks:

1. Expensive solution: Compare the two copies of every disk block.

2. Better solution: Record in-progress disk writes on non- volatile


storage (Nonvolatile RAM or special area of disk).

 Use this information during recovery to find blocks that may be


inconsistent, and only compare copies of these.

 This approach is used in hardware RAID systems. If either copy of an inconsistent
block is detected to have an error (bad checksum), overwrite it by the other
copy. If both have no error, but are different, overwrite the second
block by the first block.
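A toy sketch of the two-copy output rule (write the first copy, and only then the second), using two in-memory dictionaries in place of two disks; entirely illustrative.

disk_1, disk_2 = {}, {}        # stand-ins for two physical disks

def stable_output(block_no, data):
    disk_1[block_no] = data    # step 1: write the first physical copy
    # only after the first write completes successfully ...
    disk_2[block_no] = data    # step 2: write the same data to the second copy
    # the output is considered complete only after the second write succeeds

stable_output(7, "payload")
assert disk_1[7] == disk_2[7]  # after a clean run both copies agree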

Data Access

 Physical blocks are those blocks residing on the disk.


 Buffer blocks are the blocks residing temporarily in main memory.

Block movements between disk and main memory are initiated through the
following two operations:


• input (B) transfers the physical block B to main memory.

• output(B) transfers the buffer block B to the disk, and replaces the
appropriate physical block there.

• Each transaction Ti has its private work-area in which local copies


of all data items accessed and updated by it are kept.

• Ti's local copy of a data item X is called xi.


We assume, for simplicity, that each data item fits in, and is stored inside, a
single block.

• Transaction transfers data items between system buffer


blocks and its private work-area using the following operations :

• read(X) assigns the value of data item X to the local


variable xi.

• write(X) assigns the value of local variable xi to data item
X in the buffer block.

• Both these commands may necessitate the issue of an input(BX)


instruction before the assignment, if the block BX in which X resides is
not already in memory.

Transactions:

• Perform read(X) while accessing X for the first time;
• All subsequent accesses are to the local copy. After the last access, the
transaction executes write(X).
• output(BX) need not immediately follow write(X). The system
can perform the output operation when it deems fit.


Recovery and Atomicity

• Modifying the database without ensuring that the transaction will


commit may leave the database in an inconsistent state.

• Consider a transaction Ti that transfers $50 from account A to
account B; the goal is either to perform all database modifications
made by Ti or none at all.

• Several output operations may be required for Ti (to output A and B). A
failure may occur after one of these modifications have been made but
before all of them are made.


To ensure atomicity despite failures, we first output information describing


the modifications to stable storage without modifying the database itself.

two approaches:

1. log based recovery

2. shadow paging

Assume that transactions run serially, that is, one after the other.

Log Based Recovery

• A log is kept on stable storage.


• The log is a sequence of log records, and maintains a record
of update activities on the database.

• When transaction Ti starts, it registers itself by writing a <Ti
start> log record.

• Before Ti executes write(X), a log record <Ti, X, V1, V2> is
written, where V1 is the value of X before the write, and V2 is the
value to be written to X.

• The log record notes that Ti has performed a write on data item
X; X had value V1 before the write, and will have value V2 after the write.

• When Ti finishes its last statement, the log record <Ti
commit> is written to the log.

We assume for now that log records are written directly to stable storage (that
is, they are not buffered)

 Two approaches using logs


 Deferred database modification
 Immediate database modification

Deferred Database Modification

• The deferred database modification scheme records all


modifications to the log, but defers all the writes to after partial

commit.

• Assume that transactions execute serially


• Transaction starts by writing <Ti start> record to log.
• A write(X) operation results in a log record <Ti, X, V> being


written, where V is the new value for X.

 Note: old value is not needed for this scheme.


• The write is not performed on X at this time, but is
deferred. When Ti partially commits, <Ti commit> is

written to the log .

• Finally, the log records are read and used to actually


execute the previously deferred writes.

• During recovery after a crash, a transaction needs to be


redone if and only if both <Ti start> and<Ti commit> are there in the log.

• Redoing a transaction Ti ( redoTi) sets the value of all


data items updated by the transaction to the new values.

• Crashes can occur while the transaction is executing


the original updates, or while recovery action is being taken

Example: transactions T0 and T1 (T0 executes before T1):

T0: read(A)                T1: read(C)
    A := A - 50                C := C - 100
    write(A)                   write(C)
    read(B)
    B := B + 50
    write(B)

Below we show the log as it appears at three instances of time.


• If the log on stable storage at the time of the crash is as in case:

(a) No redo actions need to be taken.

(b) redo(T0) must be performed, since <T0 commit> is
present.

(c) redo(T0) must be performed, followed by redo(T1), since
<T0 commit> and <T1 commit> are present.
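A sketch of the recovery rule for deferred modification: redo a transaction only when both its start and commit records are in the log. The tuple layout of the log records is an assumption made for the example.

def recover_deferred(log, db):
    # log: list of records such as ("start", "T0"), ("write", "T0", "A", 950), ("commit", "T0")
    committed = {r[1] for r in log if r[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _kind, _txn, item, new_value = rec
            db[item] = new_value                 # redo: reapply the new value
    return db

log = [("start", "T0"), ("write", "T0", "A", 950), ("write", "T0", "B", 2050), ("commit", "T0"),
       ("start", "T1"), ("write", "T1", "C", 600)]          # T1 never committed
print(recover_deferred(log, {"A": 1000, "B": 2000, "C": 700}))   # T0 is redone, T1 is ignored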

Immediate Database Modification

• The immediate database modification scheme allows database
updates of an uncommitted transaction to be made as the writes are
issued.
• Since undoing may be needed, update log records must have both the old
value and the new value.
• The update log record must be written before the database item is written.
• We assume that the log record is output directly to stable
storage.
• This can be extended to postpone log record output, so long as,
prior to execution of an output(B) operation for a data
block B, all log records corresponding to items in B are
flushed to stable storage.
• Output of updated blocks can take place at any time before or after
transaction commit.
• The order in which blocks are output can be different from the
order in which they are written.


Example:

Log                        Write                 Output

<T0 start>
<T0, A, 1000, 950>
<T0, B, 2000, 2050>
                           A = 950
                           B = 2050
<T0 commit>
<T1 start>
<T1, C, 700, 600>
                           C = 600
                                                 BB, BC
<T1 commit>
                                                 BA

Note: BX denotes the block containing X.

Recovery procedure has two operations instead of one:

• undo(Ti) restores the value of all data items updated by Ti


to their old values, going backwards from the last log record for Ti

• redo(Ti) sets the value of all data items updated by Ti to the


new values, going forward from the first log record for Ti.

• Both operations must be idempotent. That is, even if the
operation is executed multiple times, the effect is the same as if it were
executed once.
• This is needed since operations may get re-executed during recovery.

When recovering after failure:

• Transaction Ti needs to be undone if the log contains the record


• <Ti start>, but does not contain the record <Ti commit>.
• Transaction Ti needs to be redone if the log contains both the record
• <Ti start> and the record <Ti commit>.

Undo operations are performed first, then redo operations.

Recovery actions in each case above are:

(a) undo(T0): B is restored to 2000 and A to 1000.

(b) undo (T1) and redo (T0): C is restored to 700, and then A and B are set to
950 and 2050 respectively.

(c) redo (T0) and redo (T1): A and B are set to 950 and 2050

respectively. Then C is set to 600
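A sketch of this recovery procedure, assuming update records carry both the old and the new value as <Ti, X, V1, V2>; the record layout is illustrative.

def recover_immediate(log, db):
    committed = {r[1] for r in log if r[0] == "commit"}
    started = {r[1] for r in log if r[0] == "start"}
    # undo first: walk the log backwards for transactions without a commit record
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] in started - committed:
            _k, _txn, item, old, _new = rec
            db[item] = old
    # then redo: walk the log forwards for committed transactions
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _k, _txn, item, _old, new = rec
            db[item] = new
    return db

log = [("start", "T0"), ("update", "T0", "A", 1000, 950), ("update", "T0", "B", 2000, 2050),
       ("commit", "T0"), ("start", "T1"), ("update", "T1", "C", 700, 600)]
print(recover_immediate(log, {"A": 1000, "B": 2000, "C": 700}))   # case (b): undo T1, redo T0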

Checkpoints:

When a system failure occurs, we must consult the log to determine those
transactions that need to be redone and those that need to be undone. We need
to search the entire log to determine this information.

There are two major difficulties with this approach:

1. The search process is time consuming.



2. Most of the transaction that, according to our algorithm, need to be redone


have already written their updates into the database.

To reduce these types of difficulties, we introduce checkpoints.

In addition to maintaining the log during execution, the system
periodically performs checkpoints, which require the following
sequence of actions to take place:
1. Output onto stable storage all log records currently residing in main
memory.
2. Output to the disk all modified buffer blocks.
3. Output onto stable storage a log record <checkpoint>.

Transactions are not allowed to perform any update actions, such as writing
to a buffer block or writing a log record, while a checkpoint is in progress.
Consider a transaction Ti that committed prior to the checkpoint.
For such a transaction, the <Ti commit> record appears in the log before
the <checkpoint> record. Any database modifications made by Ti must have been
written to the database either prior to the checkpoint or as part of the checkpoint
itself.
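A sketch of the checkpoint action itself, phrased as the three steps above over illustrative in-memory structures (the names are assumptions for the example):

def take_checkpoint(log_buffer, stable_log, dirty_blocks, disk):
    stable_log.extend(log_buffer)         # 1. force all in-memory log records to stable storage
    log_buffer.clear()
    for block_no, data in dirty_blocks.items():
        disk[block_no] = data             # 2. force all modified buffer blocks to disk
    dirty_blocks.clear()
    stable_log.append("<checkpoint>")     # 3. record the checkpoint in the log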
Shadow Paging

Shadow paging is an alternative to log-based recovery; this scheme is


useful if transactions execute serially

• It maintains two page tables during the lifetime of a transaction: the
current page table and the shadow page table.
• Store the shadow page table in nonvolatile storage, so that the state of the
database prior to transaction execution may be recovered.
• The shadow page table is never modified during execution. To start with, both
page tables are identical. Only the current page table is used for data item
accesses during execution of the transaction.

Whenever any page is about to be written for the first time.

• A copy of this page is made onto an unused page.


• The current page table is then made to point to the copy.
• The update is performed on the copy.



To commit a transaction

1. Flush all modified pages in main memory to disk

2. output current page table to disk

3. Make the current page table the new shadow page table, as follows:

• keep a pointer to the shadow page table at a fixed (known) location on


disk.
• To make the current page table the new shadow page table, simply update
the pointer to point to current page table on disk
• Once pointer to shadow page table has been written, transaction is
committed.
• No recovery is needed after a crash — new transactions can start right
away, using the shadow page table.
• Pages not pointed to from current/shadow page table should be freed
(garbage collected).
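A sketch of the commit step: after flushing the data pages and the current page table, the commit itself is the single update of the fixed pointer (the names are illustrative assumptions):

disk = {"shadow_ptr": "page_table_0"}     # fixed, known location of the shadow page table pointer

def commit_shadow(current_table_location):
    # steps 1 and 2 (flushing modified pages and the current page table) are assumed done;
    # the atomic commit action is overwriting the one pointer on disk
    disk["shadow_ptr"] = current_table_location
    # pages reachable only from the old shadow page table can now be garbage collected

commit_shadow("page_table_1")
print(disk["shadow_ptr"])                 # page_table_1 is now the shadow (committed) page table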


Advantages of shadow-paging over log-based schemes

• no overhead of writing log records


• recovery is trivial

Disadvantages :

• Copying the entire page table is very expensive


• Can be reduced by using a page table structured like a B+-tree
• No need to copy entire tree, only need to copy paths in the tree that lead
to updated leaf nodes
• Commit overhead is high even with above extension
• Need to flush every updated page, and page table
• Data gets fragmented (related pages get separated on disk)
• After every transaction completion, the database pages
containing old versions of modified data need to be garbage
collected
• Easier to extend log based schemes


Q-BANK:

1.Discuss about concurrency control (15) April/May2008

2.Write short notes on: April/May2008

(a)Log-Based recovery(7)

(b)Advanced recovery techniques(8)

3.(a) What are the advantages and disadvantages of distributed database


system?(6)May2007

(b) Briefly explain about deadlock(9)

4. (a) What are the ACID properties? Explain. (6) May 2007

(b) List different security concerns for a bank and state whether these concerns
relate to physical security, human security, operating system security, or
database security. (9)

5.(a) Discuss the use of two-phase locking protocol(8)Nov/Dec2008

(b)Explain about the remote backup system(7)

6.(a)Discuss about transaction states(7) Nov/Dec2008

(b) write notes on log-based recovery(8)

7.List the ACID properties and explain in detail on necessities of each of ACID
properties(15)

8. (a) Explain the two principal methods for dealing with the deadlock
problem. (9) April/May 2009

(b) List and explain the properties of the transaction that the database system
maintains. (6)


9.Discuss the following : April/May2009

(a)Shadow Paging

(b) Graph-based Protocol
